<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>
MLlib | Apache Spark
</title>
<meta name="description" content="MLlib is Apache Spark's scalable machine learning library, with APIs in Java, Scala, Python, and R.">
<!-- Bootstrap core CSS -->
<link href="/css/cerulean.min.css" rel="stylesheet">
<link href="/css/custom.css" rel="stylesheet">
<script type="text/javascript">
<!-- Google Analytics initialization -->
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-32518208-2']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
<!-- Adds slight delay to links to allow async reporting -->
function trackOutboundLink(link, category, action) {
try {
_gaq.push(['_trackEvent', category , action]);
} catch(err){}
setTimeout(function() {
document.location.href = link.href;
}, 100);
}
</script>
<!-- HTML5 shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.3.0/respond.min.js"></script>
<![endif]-->
</head>
<body>
<script src="https://code.jquery.com/jquery.js"></script>
<script src="//netdna.bootstrapcdn.com/bootstrap/3.0.3/js/bootstrap.min.js"></script>
<script src="/js/lang-tabs.js"></script>
<script src="/js/downloads.js"></script>
<div class="container" style="max-width: 1200px;">
<div class="masthead">
<p class="lead">
<a href="/">
<img src="/images/spark-logo-trademark.png"
style="height:100px; width:auto; vertical-align: bottom; margin-top: 20px;"></a>
<a href="#"><span class="subproject">
MLlib
</span></a>
</p>
</div>
<nav class="navbar navbar-default" role="navigation">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse"
data-target="#navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="navbar-collapse-1">
<ul class="nav navbar-nav">
<li><a href="/downloads.html">Download</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">
Libraries <b class="caret"></b>
</a>
<ul class="dropdown-menu">
<li><a href="/sql/">SQL and DataFrames</a></li>
<li><a href="/streaming/">Spark Streaming</a></li>
<li><a href="/mllib/">MLlib (machine learning)</a></li>
<li><a href="/graphx/">GraphX (graph)</a></li>
<li class="divider"></li>
<li><a href="http://spark-packages.org">Third-Party Packages</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">
Documentation <b class="caret"></b>
</a>
<ul class="dropdown-menu">
<li><a href="/docs/latest/">Latest Release (Spark 1.6.0)</a></li>
<li><a href="/documentation.html">Other Resources</a></li>
</ul>
</li>
<li><a href="/examples.html">Examples</a></li>
<li class="dropdown">
<a href="/community.html" class="dropdown-toggle" data-toggle="dropdown">
Community <b class="caret"></b>
</a>
<ul class="dropdown-menu">
<li><a href="/community.html">Mailing Lists</a></li>
<li><a href="/community.html#events">Events and Meetups</a></li>
<li><a href="/community.html#history">Project History</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark">Powered By</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Committers">Project Committers</a></li>
<li><a href="https://issues.apache.org/jira/browse/SPARK">Issue Tracker</a></li>
</ul>
</li>
<li><a href="/faq.html">FAQ</a></li>
</ul>
</div>
<!-- /.navbar-collapse -->
</nav>
<div class="row">
<div class="col-md-3 col-md-push-9">
<div class="news" style="margin-bottom: 20px;">
<h5>Latest News</h5>
<ul class="list-unstyled">
<li><a href="/news/spark-summit-east-agenda-posted.html">Spark Summit East (Feb 16, 2016, New York) agenda posted</a>
<span class="small">(Jan 14, 2016)</span></li>
<li><a href="/news/spark-1-6-0-released.html">Spark 1.6.0 released</a>
<span class="small">(Jan 04, 2016)</span></li>
<li><a href="/news/spark-summit-east-2016-cfp-closing.html">CFP for Spark Summit East 2016 is closing soon!</a>
<span class="small">(Nov 19, 2015)</span></li>
<li><a href="/news/spark-1-5-2-released.html">Spark 1.5.2 released</a>
<span class="small">(Nov 09, 2015)</span></li>
</ul>
<p class="small" style="text-align: right;"><a href="/news/index.html">Archive</a></p>
</div>
<div class="hidden-xs hidden-sm">
<a href="/downloads.html" class="btn btn-success btn-lg btn-block" style="margin-bottom: 30px;">
Download Spark
</a>
<p style="font-size: 16px; font-weight: 500; color: #555;">
Built-in Libraries:
</p>
<ul class="list-none">
<li><a href="/sql/">SQL and DataFrames</a></li>
<li><a href="/streaming/">Spark Streaming</a></li>
<li><a href="/mllib/">MLlib (machine learning)</a></li>
<li><a href="/graphx/">GraphX (graph)</a></li>
</ul>
<a href="http://spark-packages.org">Third-Party Packages</a>
</div>
</div>
<div class="col-md-9 col-md-pull-3">
<div class="jumbotron">
<b>MLlib</b> is Apache Spark's scalable machine learning library.
</div>
<div class="row row-padded">
<div class="col-md-7 col-sm-7">
<h2>Ease of Use</h2>
<p class="lead">
Usable in Java, Scala, Python, and R.
</p>
<p>
MLlib fits into <a href="/">Spark</a>'s
APIs and, as of Spark 0.9, interoperates with <a href="http://www.numpy.org">NumPy</a> in Python.
You can use any Hadoop data source (e.g., HDFS, HBase, or local files), which makes it
easy to plug MLlib into Hadoop workflows.
</p>
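<p>
The Python snippet on this page passes a <code>parsePoint</code> helper to <code>map</code> but does not define it.
The sketch below shows one way such a helper might look. The name <code>parse_point</code> and the input format
(whitespace-separated numbers, one point per line) are assumptions made for illustration, and the example runs on a
plain Python list rather than on a Spark dataset.
</p>

```python
# Hypothetical stand-in for the parsePoint helper; the input format
# (whitespace-separated numbers, one point per line) is an assumption.
def parse_point(line):
    """Convert one text line such as '1.0 2.5 3.0' into a list of floats."""
    return [float(token) for token in line.split()]


# Local stand-in for the lines a text data source would produce.
lines = ["0.0 0.0", "1.0 1.5", "9.0 8.0"]
points = [parse_point(line) for line in lines]
print(points)
```

<p>
In a Spark program, the same function could be passed to <code>map</code> on the dataset of text lines, and it could
return a NumPy array instead of a list if the rest of the pipeline works with NumPy.
</p>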
</div>
<div class="col-md-5 col-sm-5 col-padded-top col-center">
<div style="margin-top: 15px; text-align: left; display: inline-block;">
<div class="code">
points = spark.textFile(<span class="string">"hdfs://..."</span>)<br />
.<span class="sparkop">map</span>(<span class="closure">parsePoint</span>)<br />
<br />
model = KMeans.<span class="sparkop">train</span>(points, k=10)
</div>
<div class="caption">Calling MLlib in Python</div>
</div>
</div>
</div>
<div class="row row-padded">
<div class="col-md-7 col-sm-7">
<h2>Performance</h2>
<p class="lead">
High-quality algorithms, up to 100x faster than MapReduce.
</p>
<p>
Spark excels at iterative computation, enabling MLlib to run fast.
At the same time, we care about algorithmic performance:
MLlib contains high-quality algorithms that leverage iteration, and
can yield better results than the one-pass approximations sometimes used on MapReduce.
</p>
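<p>
To illustrate why repeated passes over the data matter, here is a minimal sketch of logistic regression trained by
batch gradient descent in plain Python. It is not MLlib's implementation: the toy data set, the step size, and the
function names are assumptions, and a single gradient step is used only as a crude stand-in for a one-pass method.
</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    # Probability of label 1 under the current weights.
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def log_loss(weights, data):
    # Average negative log-likelihood over the data set.
    total = 0.0
    for features, label in data:
        p = predict(weights, features)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(data)


def train(data, iterations, step=0.5):
    # Batch gradient descent: every iteration is one full pass over the data.
    weights = [0.0] * len(data[0][0])
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for features, label in data:
            error = predict(weights, features) - label
            for j, x in enumerate(features):
                gradient[j] += error * x
        weights = [w - step * g / len(data) for w, g in zip(weights, gradient)]
    return weights


# Toy data (an assumption): each point is ([bias, x], label).
data = [
    ([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, -0.5], 1),
    ([1.0, 0.5], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1),
]

print("loss after 1 pass:    ", log_loss(train(data, 1), data))
print("loss after 100 passes:", log_loss(train(data, 100), data))
```

<p>
On this toy data the loss after 100 passes is lower than after a single pass. Each pass re-reads the full data set,
which is the access pattern that benefits from keeping the data in memory between iterations.
</p>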
</div>
<div class="col-md-5 col-sm-5 col-padded-top col-center">
<div style="width: 100%; max-width: 272px; display: inline-block; text-align: center;">
<img src="/images/logistic-regression.png" style="width: 100%; max-width: 250px;" />
<div class="caption" style="min-width: 272px;">Logistic regression in Hadoop and Spark</div>
</div>
</div>
</div>
<div class="row row-padded" style="margin-bottom: 15px;">
<div class="col-md-7 col-sm-7">
<h2>Easy to Deploy</h2>
<p class="lead">
Runs on existing Hadoop clusters and data.
</p>
<p>
If you have a Hadoop 2 cluster, you can run Spark and MLlib without any pre-installation.
Otherwise, Spark is easy to run <a href="/docs/latest/spark-standalone.html">standalone</a>
or on <a href="/docs/latest/ec2-scripts.html">EC2</a> or <a href="http://mesos.apache.org">Mesos</a>.
You can read from <a href="http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">HDFS</a>, <a href="http://hbase.apache.org">HBase</a>, or any Hadoop data source.
</p>
</div>
<div class="col-md-5 col-sm-5 col-padded-top col-center">
<img src="/images/hadoop.jpg" style="width: 100%; max-width: 280px;" />
</div>
</div>
</div>
</div>
<div class="row">
<div class="col-md-4 col-padded">
<h3>Algorithms</h3>
<p>
MLlib contains the following algorithms and utilities:
</p>
<ul class="list-narrow">
<li>logistic regression and linear support vector machine (SVM)</li>
<li>classification and regression trees</li>
<li>random forest and gradient-boosted trees</li>
<li>recommendation via alternating least squares (ALS)</li>
<li>clustering via k-means, bisecting k-means, Gaussian mixtures (GMM), and power iteration clustering</li>
<li>topic modeling via latent Dirichlet allocation (LDA)</li>
<li>survival analysis via accelerated failure time model</li>
<li>singular value decomposition (SVD) and QR decomposition</li>
<li>principal component analysis (PCA)</li>
<li>linear regression with L<sub>1</sub>, L<sub>2</sub>, and elastic-net regularization</li>
<li>isotonic regression</li>
<li>multinomial/binomial naive Bayes</li>
<li>frequent itemset mining via FP-growth and association rules</li>
<li>sequential pattern mining via PrefixSpan</li>
<li>summary statistics and hypothesis testing</li>
<li>feature transformations</li>
<li>model evaluation and hyper-parameter tuning</li>
</ul>
<p>Refer to the <a href="/docs/latest/mllib-guide.html">MLlib guide</a> for usage examples.</p>
</div>
<div class="col-md-4 col-padded">
<h3>Community</h3>
<p>
MLlib is developed as part of the Apache Spark project, so it is
tested and updated with each Spark release.
</p>
<p>
If you have questions about the library, ask on the
<a href="/community.html#mailing-lists">Spark mailing lists</a>.
</p>
<p>
MLlib is still a young project and welcomes contributions. If you'd like to submit an algorithm to MLlib,
read <a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">how to
contribute to Spark</a> and send us a patch!
</p>
</div>
<div class="col-md-4 col-padded">
<h3>Getting Started</h3>
<p>
To get started with MLlib:
</p>
<ul class="list-narrow">
<li><a href="/downloads.html">Download Spark</a>. MLlib is included as a module.</li>
<li>Read the <a href="/docs/latest/mllib-guide.html">MLlib guide</a>, which includes
various usage examples.</li>
<li>Learn how to <a href="/docs/latest/#launching-on-a-cluster">deploy</a> Spark on a cluster
if you'd like to run in distributed mode. You can also run locally on a multicore machine
without any setup.
</li>
</ul>
</div>
</div>
<div class="row">
<div class="col-sm-12 col-center">
<a href="/downloads.html" class="btn btn-success btn-lg btn-multiline">
Download Spark<br /><span class="small">Includes MLlib</span>
</a>
</div>
</div>
<footer class="small">
<hr>
Apache Spark, Spark, Apache, and the Spark logo are trademarks of
<a href="http://www.apache.org">The Apache Software Foundation</a>.
</footer>
</div>
</body>
</html>