<!DOCTYPE html>
<html lang="en">
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">

  <title>MLlib | Apache Spark</title>


    <meta name="description" content="MLlib is Apache Spark's scalable machine learning library, with APIs in Java, Scala, Python, and R.">

  <!-- Bootstrap core CSS -->
  <link href="/css/cerulean.min.css" rel="stylesheet">
  <link href="/css/custom.css" rel="stylesheet">

  <!-- Code highlighter CSS -->
  <link href="/css/pygments-default.css" rel="stylesheet">

  <script type="text/javascript">
  <!-- Google Analytics initialization -->
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-32518208-2']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();

  <!-- Adds slight delay to links to allow async reporting -->
  function trackOutboundLink(link, category, action) {
    try {
      _gaq.push(['_trackEvent', category , action]);
    } catch(err){}

    setTimeout(function() {
      document.location.href = link.href;
    }, 100);
  }
  </script>

  <!-- HTML5 shim and Respond.js IE8 support of HTML5 elements and media queries -->
  <!--[if lt IE 9]>
  <script src=""></script>
  <script src=""></script>
  <![endif]-->


<script src=""></script>
<script src="//"></script>
<script src="/js/lang-tabs.js"></script>
<script src="/js/downloads.js"></script>

<div class="container" style="max-width: 1200px;">

<div class="masthead">
    <p class="lead">
      <a href="/">
      <img src="/images/spark-logo-trademark.png"
      style="height:100px; width:auto; vertical-align: bottom; margin-top: 20px;"></a>
      <a href="#"><span class="subproject"></span></a>
    </p>
</div>

<nav class="navbar navbar-default" role="navigation">
  <!-- Brand and toggle get grouped for better mobile display -->
  <div class="navbar-header">
    <button type="button" class="navbar-toggle" data-toggle="collapse"
            data-target="#navbar-collapse-1">
      <span class="sr-only">Toggle navigation</span>
      <span class="icon-bar"></span>
      <span class="icon-bar"></span>
      <span class="icon-bar"></span>
    </button>
  </div>

  <!-- Collect the nav links, forms, and other content for toggling -->
  <div class="collapse navbar-collapse" id="navbar-collapse-1">
    <ul class="nav navbar-nav">
      <li><a href="/downloads.html">Download</a></li>
      <li class="dropdown">
        <a href="#" class="dropdown-toggle" data-toggle="dropdown">
          Libraries <b class="caret"></b>
        </a>
        <ul class="dropdown-menu">
          <li><a href="/sql/">SQL and DataFrames</a></li>
          <li><a href="/streaming/">Spark Streaming</a></li>
          <li><a href="/mllib/">MLlib (machine learning)</a></li>
          <li><a href="/graphx/">GraphX (graph)</a></li>
          <li class="divider"></li>
          <li><a href="">Third-Party Packages</a></li>
        </ul>
      </li>
      <li class="dropdown">
        <a href="#" class="dropdown-toggle" data-toggle="dropdown">
          Documentation <b class="caret"></b>
        </a>
        <ul class="dropdown-menu">
          <li><a href="/docs/latest/">Latest Release (Spark 2.0.0)</a></li>
          <li><a href="/documentation.html">Older Versions and Other Resources</a></li>
        </ul>
      </li>
      <li><a href="/examples.html">Examples</a></li>
      <li class="dropdown">
        <a href="/community.html" class="dropdown-toggle" data-toggle="dropdown">
          Community <b class="caret"></b>
        </a>
        <ul class="dropdown-menu">
          <li><a href="/community.html">Mailing Lists</a></li>
          <li><a href="/community.html#events">Events and Meetups</a></li>
          <li><a href="/community.html#history">Project History</a></li>
          <li><a href="">Powered By</a></li>
          <li><a href="">Project Committers</a></li>
          <li><a href="">Issue Tracker</a></li>
        </ul>
      </li>
      <li><a href="/faq.html">FAQ</a></li>
    </ul>
    <ul class="nav navbar-nav navbar-right">
      <li class="dropdown">
        <a href="" class="dropdown-toggle" data-toggle="dropdown">
          Apache Software Foundation <b class="caret"></b></a>
        <ul class="dropdown-menu">
          <li><a href="">Apache Homepage</a></li>
          <li><a href="">License</a></li>
          <li><a href="">Sponsorship</a></li>
          <li><a href="">Thanks</a></li>
          <li><a href="">Security</a></li>
        </ul>
      </li>
    </ul>
  </div>
  <!-- /.navbar-collapse -->
</nav>

<div class="row">
  <div class="col-md-3 col-md-push-9">
    <div class="news" style="margin-bottom: 20px;">
      <h5>Latest News</h5>
      <ul class="list-unstyled">
          <li><a href="/news/spark-2-0-0-released.html">Spark 2.0.0 released</a>
          <span class="small">(Jul 26, 2016)</span></li>
          <li><a href="/news/spark-1-6-2-released.html">Spark 1.6.2 released</a>
          <span class="small">(Jun 25, 2016)</span></li>
          <li><a href="/news/submit-talks-to-spark-summit-eu-2016.html">Call for Presentations for Spark Summit EU is Open</a>
          <span class="small">(Jun 16, 2016)</span></li>
          <li><a href="/news/spark-2.0.0-preview.html">Preview release of Spark 2.0</a>
          <span class="small">(May 26, 2016)</span></li>
      </ul>
      <p class="small" style="text-align: right;"><a href="/news/index.html">Archive</a></p>
    </div>
    <div class="hidden-xs hidden-sm">
      <a href="/downloads.html" class="btn btn-success btn-lg btn-block" style="margin-bottom: 30px;">
        Download Spark
      </a>
      <p style="font-size: 16px; font-weight: 500; color: #555;">
        Built-in Libraries:
      </p>
      <ul class="list-none">
        <li><a href="/sql/">SQL and DataFrames</a></li>
        <li><a href="/streaming/">Spark Streaming</a></li>
        <li><a href="/mllib/">MLlib (machine learning)</a></li>
        <li><a href="/graphx/">GraphX (graph)</a></li>
      </ul>
      <a href="">Third-Party Packages</a>
    </div>
  </div>

  <div class="col-md-9 col-md-pull-3">
    <div class="jumbotron">
  <b>MLlib</b> is Apache Spark's scalable machine learning library.
    </div>

<div class="row row-padded">
  <div class="col-md-7 col-sm-7">
    <h2>Ease of Use</h2>
    <p class="lead">
      Usable in Java, Scala, Python, and R.
    </p>
    <p>
      MLlib fits into <a href="/">Spark</a>'s
      APIs and interoperates with <a href="">NumPy</a>
      in Python (as of Spark 0.9) and R libraries (as of Spark 1.5).
      You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it
      easy to plug into Hadoop workflows.
    </p>
  </div>
  <div class="col-md-5 col-sm-5 col-padded-top col-center">

    <div style="margin-top: 15px; text-align: left; display: inline-block;">
      <div class="code">
        data = spark.read.format(<span class="string">"libsvm"</span>)\<br />
        &nbsp;&nbsp;.load(<span class="string">"hdfs://..."</span>)<br />
        <br />
        model = <span class="sparkop">KMeans</span>(k=10).fit(data)
      </div>
      <div class="caption">Calling MLlib in Python</div>
    </div>
  </div>
</div>

<div class="row row-padded">
  <div class="col-md-7 col-sm-7">
    <h2>Performance</h2>
    <p class="lead">
      High-quality algorithms, up to 100x faster than MapReduce.
    </p>
    <p>
      Spark excels at iterative computation, enabling MLlib to run fast.
      At the same time, we care about algorithmic performance:
      MLlib contains high-quality algorithms that leverage iteration, and
      can yield better results than the one-pass approximations sometimes used on MapReduce.
    </p>
  </div>
  <div class="col-md-5 col-sm-5 col-padded-top col-center">
    <div style="width: 100%; max-width: 272px; display: inline-block; text-align: center;">
      <img src="/images/logistic-regression.png" style="width: 100%; max-width: 250px;" />
      <div class="caption" style="min-width: 272px;">Logistic regression in Hadoop and Spark</div>
    </div>
  </div>
</div>

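<p>
  The point about iteration can be made concrete with a minimal pure-Python sketch of
  batch gradient descent for logistic regression, the workload named in the figure caption.
  This is not MLlib code and does not use Spark: the function names, the toy data set, and
  the step size are all assumptions chosen for illustration. Each iteration makes one full
  pass over the data, which is the step that benefits from keeping the data in memory.
</p>

```python
import math

# Toy data set (an assumption for illustration): each point is
# ([bias, x], label), with label 0 for small x and 1 for large x.
POINTS = [
    ([1.0, 0.5], 0),
    ([1.0, 1.0], 0),
    ([1.0, 3.0], 1),
    ([1.0, 3.5], 1),
]


def dot(weights, features):
    """Inner product of two equal-length lists."""
    return sum(w * x for w, x in zip(weights, features))


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_loss(weights, points):
    """Mean logistic loss of `weights` over `points`."""
    total = 0.0
    for features, label in points:
        z = dot(weights, features)
        margin = z if label == 1 else -z
        # log(1 + exp(-margin)), computed in a numerically stable form.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(points)


def train(points, iterations, step=0.5):
    """Batch gradient descent; every iteration scans all points once."""
    dim = len(points[0][0])
    weights = [0.0] * dim
    for _ in range(iterations):
        gradient = [0.0] * dim
        for features, label in points:
            error = sigmoid(dot(weights, features)) - label
            for j, x in enumerate(features):
                gradient[j] += error * x
        weights = [w - step * g / len(points) for w, g in zip(weights, gradient)]
    return weights


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, "passes: loss =", round(log_loss(train(POINTS, n), POINTS), 4))
```

<p>
  Run as a script, the sketch prints the mean loss after 1, 10, and 1000 passes over the
  toy data. The loss shrinks as the number of passes grows, which is why a model trained
  with many passes can beat a one-pass approximation.
</p>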
<div class="row row-padded" style="margin-bottom: 15px;">
  <div class="col-md-7 col-sm-7">
    <h2>Easy to Deploy</h2>
    <p class="lead">
      Runs on existing Hadoop clusters and data.
    </p>
    <p>
      If you have a Hadoop 2 cluster, you can run Spark and MLlib without any pre-installation.
      Otherwise, Spark is easy to run <a href="/docs/latest/spark-standalone.html">standalone</a>
      or on <a href="/docs/latest/ec2-scripts.html">EC2</a> or <a href="">Mesos</a>.
      You can read from <a href="">HDFS</a>, <a href="">HBase</a>, or any Hadoop data source.
    </p>
  </div>
  <div class="col-md-5 col-sm-5 col-padded-top col-center">
    <img src="/images/hadoop.jpg" style="width: 100%; max-width: 280px;" />
  </div>
</div>

<div class="row">
  <div class="col-md-4 col-padded">
    <h3>Algorithms</h3>
    <p>
      MLlib contains many algorithms and utilities, including:
    </p>
    <ul class="list-narrow">
      <li>Classification: logistic regression, naive Bayes,...</li>
      <li>Regression: generalized linear regression, isotonic regression,...</li>
      <li>Decision trees, random forests, and gradient-boosted trees</li>
      <li>Recommendation: alternating least squares (ALS)</li>
      <li>Clustering: K-means, Gaussian mixtures (GMMs),...</li>
      <li>Topic modeling: latent Dirichlet allocation (LDA)</li>
      <li>Feature transformations: standardization, normalization, hashing,...</li>
      <li>Model evaluation and hyper-parameter tuning</li>
      <li>ML Pipeline construction</li>
      <li>ML persistence: saving and loading models and Pipelines</li>
      <li>Survival analysis: accelerated failure time model</li>
      <li>Frequent itemset and sequential pattern mining: FP-growth, association rules, PrefixSpan</li>
      <li>Distributed linear algebra: singular value decomposition (SVD), principal component analysis (PCA),...</li>
      <li>Statistics: summary statistics, hypothesis testing,...</li>
    </ul>
    <p>Refer to the <a href="/docs/latest/mllib-guide.html">MLlib guide</a> for usage examples.</p>
  </div>

  <div class="col-md-4 col-padded">
    <h3>Community</h3>
    <p>
      MLlib is developed as part of the Apache Spark project. It thus gets
      tested and updated with each Spark release.
    </p>
    <p>
      If you have questions about the library, ask on the
      <a href="/community.html#mailing-lists">Spark mailing lists</a>.
    </p>
    <p>
      MLlib is still a rapidly growing project and welcomes contributions. If you'd like to submit an algorithm to MLlib,
      read <a href="">how to
      contribute to Spark</a> and send us a patch!
    </p>
  </div>

  <div class="col-md-4 col-padded">
    <h3>Getting Started</h3>
    <p>
      To get started with MLlib:
    </p>
    <ul class="list-narrow">
      <li><a href="/downloads.html">Download Spark</a>. MLlib is included as a module.</li>
      <li>Read the <a href="/docs/latest/mllib-guide.html">MLlib guide</a>, which includes
      various usage examples.</li>
      <li>Learn how to <a href="/docs/latest/#launching-on-a-cluster">deploy</a> Spark on a cluster
        if you'd like to run in distributed mode. You can also run locally on a multicore machine
        without any setup.</li>
    </ul>
  </div>
</div>

<div class="row">
  <div class="col-sm-12 col-center">
    <a href="/downloads.html" class="btn btn-success btn-lg btn-multiline">
      Download Apache Spark<br /><span class="small">Includes MLlib</span>
    </a>
  </div>
</div>

  </div>
</div>


<footer class="small">
  Apache Spark, Spark, Apache, and the Spark logo are <a href="">trademarks</a> of
  <a href="">The Apache Software Foundation</a>.
</footer>

</div>

</body>
</html>

