Third-Party Projects | Apache Spark

spark-packages.org

spark-packages.org is an external, community-managed list of third-party libraries, add-ons, and applications that work with Apache Spark. You can add a package as long as you have a GitHub repository.

Infrastructure Projects

Spark Job Server - REST interface for managing and submitting Spark jobs on the same cluster (see blog post for details)
SparkR - R frontend for Spark
MLbase - Machine Learning research project on top of Spark
Apache Mesos - Cluster management system that supports running Spark
Alluxio (née Tachyon) - Memory-speed virtual distributed storage system that supports running Spark
Spark Cassandra Connector - Easily load your Cassandra data into Spark and Spark SQL; from DataStax
FiloDB - a Spark-integrated analytical/columnar database, with an in-memory option capable of sub-second concurrent queries
Elasticsearch - Spark SQL integration
Spark-Scalding - Easily transition Cascading/Scalding code to Spark
Zeppelin - an IPython-like notebook for Spark. There is also ISpark, and the Spark Notebook.
IBM Spectrum Conductor with Spark - cluster management software that integrates with Spark
EclairJS - enables Node.js developers to code against Spark, and data scientists to use JavaScript in Jupyter notebooks.
SnappyData - an open source OLTP + OLAP database integrated with Spark on the same JVMs.
GeoSpark - Geospatial RDDs and joins
Spark Cluster Deploy Tools for OpenStack

Applications Using Spark

Apache Mahout - Previously on Hadoop MapReduce, Mahout has switched to using Spark as the backend
Apache MRQL - A query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark
BlinkDB - a massively parallel, approximate query engine built on top of Shark and Spark
Spindle - Spark/Parquet-based web analytics query engine
Spark Spatial - Spatial joins and processing for Spark
Thunderain - a framework for combining stream processing with historical data, in the style of the Lambda architecture
DF from Ayasdi - a Pandas-like data frame implementation for Spark
Oryx - Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning
ADAM - A framework and CLI for loading, transforming, and analyzing genomic data using Apache Spark

Additional Language Bindings

C# / .NET

Clojure

Groovy