MLlib | Apache Spark

Ease of Use

Usable in Java, Scala and Python.

MLlib fits into Spark's APIs and interoperates with NumPy in Python (starting in Spark 0.9). You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.

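from pyspark.mllib.clustering import KMeans

# Assumes spark is an existing SparkContext and parsePoint is a user-defined parser.
points = spark.textFile("hdfs://...") \
    .map(parsePoint)

model = KMeans.train(points, k=10)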

Calling MLlib in Python
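
The snippet above leaves parsePoint undefined. As a minimal sketch, assuming each input line holds whitespace-separated numeric features (the page does not specify a format, so this is an assumption), it could look like the following. It returns a plain Python list, which MLlib's Python API should accept as a dense vector alongside NumPy arrays.

```python
def parsePoint(line):
    """Parse one line of whitespace-separated numbers into a feature vector.

    Hypothetical helper: the input format is an assumption made for
    illustration, not something the MLlib page defines.
    """
    return [float(token) for token in line.split()]


# Local stand-in for the lines that spark.textFile(...) would produce.
sample_lines = ["0.0 0.0 1.5", "9.0 8.5 7.25"]
points = [parsePoint(line) for line in sample_lines]
print(points)
```

In the Spark version, the same function is passed to .map so that it runs on each line of the distributed dataset.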

Performance

High-quality algorithms, up to 100x faster than MapReduce.

Spark excels at iterative computation, enabling MLlib to run fast. At the same time, we care about algorithmic performance: MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce.
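
To make "iterative computation" concrete, here is a minimal pure-Python sketch of batch gradient descent for logistic regression. It is not MLlib's implementation: the function names, step size, iteration count, and toy data are all assumptions chosen for illustration. What it shows is the access pattern: every iteration scans the full dataset again, which is why keeping the data in memory between passes pays off.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(points, labels, iterations=100, step=0.5):
    """Batch gradient descent; each iteration makes one full pass over the data."""
    weights = [0.0] * len(points[0])
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(points, labels):  # full scan of the dataset
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xj in enumerate(x):
                gradient[j] += error * xj
        # Average the gradient and take one descent step.
        weights = [w - step * g / len(points) for w, g in zip(weights, gradient)]
    return weights


# Toy data (assumed): first feature is a constant bias term, label is 1 when x > 0.
points = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]

weights = train_logistic(points, labels)
predictions = [
    int(sigmoid(sum(w * xi for w, xi in zip(weights, x))) > 0.5) for x in points
]
print(predictions)
```

A one-pass approximation would stop after a single scan; the loop above refines the weights on every pass.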

Logistic regression in Hadoop and Spark

Easy to Deploy

Runs on existing Hadoop clusters and data.

If you have a Hadoop 2 cluster, you can run Spark and MLlib without any pre-installation. Otherwise, Spark is easy to run standalone or on EC2 or Mesos. You can read from HDFS, HBase, or any Hadoop data source.

Algorithms

MLlib 1.2 contains the following algorithms:

linear SVM and logistic regression
classification and regression trees
random forests and gradient-boosted trees
k-means clustering and streaming k-means
recommendation via alternating least squares
singular value decomposition
linear regression with L₁- and L₂-regularization
multinomial naive Bayes
basic statistics
feature transformations

Refer to the MLlib guide for usage examples.
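
For readers unfamiliar with the regularization terms in the list above, regularized linear regression is commonly written as the objective below. This is the standard textbook formulation, offered as a sketch: the symbols (weights w, features x_i, labels y_i, regularization strength λ, sample count n) are notation chosen here, and the scaling constants MLlib uses internally may differ.

```latex
\min_{w \in \mathbb{R}^d} \; \frac{1}{2n} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2 + \lambda R(w),
\qquad
R(w) =
\begin{cases}
\lVert w \rVert_1 & \text{($L_1$, lasso)} \\
\tfrac{1}{2} \lVert w \rVert_2^2 & \text{($L_2$, ridge)}
\end{cases}
```

The L₁ penalty tends to drive some weights exactly to zero, while the L₂ penalty shrinks all weights smoothly.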

Community

MLlib is developed as part of the Apache Spark project, so it is tested and updated with each Spark release.

If you have questions about the library, ask on the Spark mailing lists.

MLlib is still a young project and welcomes contributions. If you'd like to submit an algorithm to MLlib, read how to contribute to Spark and send us a patch!

Getting Started

To get started with MLlib:

Download Spark. MLlib is included as a module.
Read the MLlib guide, which includes various usage examples.
Learn how to deploy Spark on a cluster if you'd like to run in distributed mode. You can also run locally on a multicore machine without any setup.
