---
layout: global
title: Third-Party Projects
type: "page singular"
navigation:
  weight: 5
  show: true
---

This page tracks external software projects that supplement Apache Spark and add to its ecosystem.

<h2>spark-packages.org</h2>

<a href="https://spark-packages.org/">spark-packages.org</a> is an external, 
community-managed list of third-party libraries, add-ons, and applications that work with 
Apache Spark. You can add a package as long as you have a GitHub repository.

<h2>Infrastructure Projects</h2>

- <a href="https://github.com/spark-jobserver/spark-jobserver">Spark Job Server</a> - 
REST interface for managing and submitting Spark jobs on the same cluster 
(see <a href="http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server">blog post</a> 
for details)
- <a href="https://github.com/amplab-extras/SparkR-pkg">SparkR</a> - R frontend for Spark
- <a href="http://mlbase.org/">MLbase</a> - Machine Learning research project on top of Spark
- <a href="http://mesos.apache.org/">Apache Mesos</a> - Cluster management system that supports 
running Spark
- <a href="http://alluxio.org/">Alluxio</a> (née Tachyon) - Memory-speed virtual distributed 
storage system that supports running Spark
- <a href="https://github.com/datastax/spark-cassandra-connector">Spark Cassandra Connector</a> - 
Easily load your Cassandra data into Spark and Spark SQL; from Datastax
- <a href="http://github.com/tuplejump/FiloDB">FiloDB</a> - a Spark-integrated analytical/columnar 
database with an in-memory option, capable of sub-second concurrent queries
- <a href="http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html#spark-sql">Elasticsearch</a> - 
Spark SQL integration
- <a href="https://github.com/tresata/spark-scalding">Spark-Scalding</a> - Easily transition 
Cascading/Scalding code to Spark
- <a href="http://zeppelin-project.org/">Zeppelin</a> - an IPython-like notebook for Spark. There 
are also <a href="https://github.com/tribbloid/ISpark">ISpark</a> and the 
<a href="https://github.com/andypetrella/spark-notebook/">Spark Notebook</a>.
- <a href="http://www.ibm.com/developerworks/servicemanagement/tc/pcs/index.html">IBM Spectrum Conductor with Spark</a> - 
cluster management software that integrates with Spark
- <a href="https://github.com/EclairJS/eclairjs-node">EclairJS</a> - enables Node.js developers to code
against Spark, and data scientists to use JavaScript in Jupyter notebooks.
- <a href="https://github.com/SnappyDataInc/snappydata">SnappyData</a> - an open source 
OLTP + OLAP database integrated with Spark on the same JVMs.
- <a href="https://github.com/DataSystemsLab/GeoSpark">GeoSpark</a> - Geospatial RDDs and joins
- <a href="https://github.com/ispras/spark-openstack">Spark Cluster Deploy Tools for OpenStack</a>

<h2>Applications Using Spark</h2>

- <a href="http://mahout.apache.org/">Apache Mahout</a> - Previously built on Hadoop MapReduce, 
Mahout has switched to Spark as its backend
- <a href="https://wiki.apache.org/mrql/">Apache MRQL</a> - A query processing and optimization 
system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark
- <a href="http://blinkdb.org/">BlinkDB</a> - a massively parallel, approximate query engine built 
on top of Shark and Spark
- <a href="https://github.com/adobe-research/spindle">Spindle</a> - Spark/Parquet-based web 
analytics query engine
- <a href="http://simin.me/projects/spatialspark/">Spark Spatial</a> - Spatial joins and 
processing for Spark
- <a href="https://github.com/thunderain-project/thunderain">Thunderain</a> - a framework 
for combining stream processing with historical data (think Lambda architecture)
- <a href="https://github.com/AyasdiOpenSource/df">DF</a> from Ayasdi - a Pandas-like data frame 
implementation for Spark
- <a href="https://github.com/OryxProject/oryx">Oryx</a> - Lambda architecture on Apache Spark and 
Apache Kafka for real-time, large-scale machine learning
- <a href="https://github.com/bigdatagenomics/adam">ADAM</a> - A framework and CLI for loading, 
transforming, and analyzing genomic data using Apache Spark

<h2>Additional Language Bindings</h2>

<h3>C# / .NET</h3>

- <a href="https://github.com/Microsoft/SparkCLR">CLR for Spark</a>

<h3>Clojure</h3>

- <a href="https://github.com/TheClimateCorporation/clj-spark">clj-spark</a>
- <a href="http://spark-packages.org/package/21">Sparkling</a>

<h3>Groovy</h3>

- <a href="https://github.com/bunions1/groovy-spark-example">groovy-spark-example</a>