diff options
Diffstat (limited to 'site/examples.html')
-rw-r--r-- | site/examples.html | 535 |
1 files changed, 346 insertions, 189 deletions
diff --git a/site/examples.html b/site/examples.html index a10741676..f76965748 100644 --- a/site/examples.html +++ b/site/examples.html @@ -18,6 +18,9 @@ <link href="/css/cerulean.min.css" rel="stylesheet"> <link href="/css/custom.css" rel="stylesheet"> + <!-- Code highlighter CSS --> + <link href="/css/pygments-default.css" rel="stylesheet"> + <script type="text/javascript"> <!-- Google Analytics initialization --> var _gaq = _gaq || []; @@ -167,254 +170,409 @@ </div> <div class="col-md-9 col-md-pull-3"> - <h2>Spark Examples</h2> + <h1>Spark Examples</h1> <p>These examples give a quick overview of the Spark API. Spark is built on the concept of <em>distributed datasets</em>, which contain arbitrary Java or Python objects. You create a dataset from external data, then apply parallel operations -to it. There are two types of operations: <em>transformations</em>, which define a new dataset based on -previous ones, and <em>actions</em>, which kick off a job to execute on a cluster.</p> - -<h3>Text Search</h3> +to it. The building block of the Spark API is its <a href="http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds">RDD API</a>. +In the RDD API, +there are two types of operations: <em>transformations</em>, which define a new dataset based on previous ones, +and <em>actions</em>, which kick off a job to execute on a cluster. +On top of Spark’s RDD API, high level APIs are provided, e.g. +<a href="http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes">DataFrame API</a> and +<a href="http://spark.apache.org/docs/latest/mllib-guide.html">Machine Learning API</a>. +These high level APIs provide a concise way to conduct certain data operations. +In this page, we will show examples using RDD API as well as examples using high level APIs.</p> + +<h2>RDD API Examples</h2> -<p>In this example, we search through the error messages in a log file:</p> +<h3>Word Count</h3> +<p>In this example, we use a few transformations to build a dataset of (String, Int) pairs called <code>counts</code> and then save it to a file.</p> <ul class="nav nav-tabs"> <li class="lang-tab lang-tab-python active"><a href="#">Python</a></li> <li class="lang-tab lang-tab-scala"><a href="#">Scala</a></li> <li class="lang-tab lang-tab-java"><a href="#">Java</a></li> </ul> + <div class="tab-content"> - <div class="tab-pane tab-pane-python active"> - <div class="code code-tab"> - text_file = spark.textFile(<span class="string">"hdfs://..."</span>)<br /> - errors = text_file.<span class="sparkop">filter</span>(<span class="closure">lambda line: "ERROR" in line</span>)<br /> - <span class="comment"># Count all the errors</span><br /> - errors.<span class="sparkop">count</span>()<br /> - <span class="comment"># Count errors mentioning MySQL</span><br /> - errors.<span class="sparkop">filter</span>(<span class="closure">lambda line: "MySQL" in line</span>).<span class="sparkop">count</span>()<br /> - <span class="comment"># Fetch the MySQL errors as an array of strings</span><br /> - errors.<span class="sparkop">filter</span>(<span class="closure">lambda line: "MySQL" in line</span>).<span class="sparkop">collect</span>()<br /> - </div> - </div> - <div class="tab-pane tab-pane-scala"> - <div class="code code-tab"> - <span class="keyword">val</span> textFile = spark.textFile(<span class="string">"hdfs://..."</span>)<br /> - <span class="keyword">val</span> errors = textFile.<span class="sparkop">filter</span>(<span class="closure">line => line.contains("ERROR")</span>)<br /> - <span class="comment">// Count all the errors</span><br /> - errors.<span class="sparkop">count</span>()<br /> - <span class="comment">// Count errors mentioning MySQL</span><br /> - errors.<span class="sparkop">filter</span>(<span class="closure">line => line.contains("MySQL")</span>).<span class="sparkop">count</span>()<br /> - <span class="comment">// Fetch the MySQL errors as an array of strings</span><br /> - errors.<span class="sparkop">filter</span>(<span class="closure">line => line.contains("MySQL")</span>).<span class="sparkop">collect</span>()<br /> - </div> - </div> - <div class="tab-pane tab-pane-java"> - <div class="code code-tab"> - JavaRDD<String> textFile = spark.textFile(<span class="string">"hdfs://..."</span>);<br /> - JavaRDD<String> errors = textFile.<span class="sparkop">filter</span>(<span class="closure">new Function<String, Boolean>() {<br /> - public Boolean call(String s) { return s.contains("ERROR"); }<br /> - }</span>);<br /> - <span class="comment">// Count all the errors</span><br /> - errors.<span class="sparkop">count</span>();<br /> - <span class="comment">// Count errors mentioning MySQL</span><br /> - errors.<span class="sparkop">filter</span>(<span class="closure">new Function<String, Boolean>() {<br /> - public Boolean call(String s) { return s.contains("MySQL"); }<br /> - }</span>).<span class="sparkop">count</span>();<br /> - <span class="comment">// Fetch the MySQL errors as an array of strings</span><br /> - errors.<span class="sparkop">filter</span>(<span class="closure">new Function<String, Boolean>() {<br /> - public Boolean call(String s) { return s.contains("MySQL"); }<br /> - }</span>).<span class="sparkop">collect</span>();<br /> - </div> - </div> +<div class="tab-pane tab-pane-python active"> +<div class="code code-tab"> + +<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">text_file</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s">"hdfs://..."</span><span class="p">)</span> +<span class="n">counts</span> <span class="o">=</span> <span class="n">text_file</span><span class="o">.</span><span class="n">flatMap</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">))</span> \ + <span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">word</span><span class="p">:</span> <span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> \ + <span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">)</span> +<span class="n">counts</span><span class="o">.</span><span class="n">saveAsTextFile</span><span class="p">(</span><span class="s">"hdfs://..."</span><span class="p">)</span></code></pre></div> + </div> +</div> + +<div class="tab-pane tab-pane-scala"> +<div class="code code-tab"> -<p>The red code fragments are function literals (closures) that get passed automatically to the cluster. The blue ones are Spark operations.</p> +<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="n">textFile</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"hdfs://..."</span><span class="o">)</span> +<span class="k">val</span> <span class="n">counts</span> <span class="k">=</span> <span class="n">textFile</span><span class="o">.</span><span class="n">flatMap</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">" "</span><span class="o">))</span> + <span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">word</span> <span class="k">=></span> <span class="o">(</span><span class="n">word</span><span class="o">,</span> <span class="mi">1</span><span class="o">))</span> + <span class="o">.</span><span class="n">reduceByKey</span><span class="o">(</span><span class="k">_</span> <span class="o">+</span> <span class="k">_</span><span class="o">)</span> +<span class="n">counts</span><span class="o">.</span><span class="n">saveAsTextFile</span><span class="o">(</span><span class="s">"hdfs://..."</span><span class="o">)</span></code></pre></div> -<h3>In-Memory Text Search</h3> +</div> +</div> -<p>Spark can <em>cache</em> datasets in memory to speed up reuse. In the example above, we can load just the error messages in RAM using:</p> +<div class="tab-pane tab-pane-java"> +<div class="code code-tab"> + +<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">JavaRDD</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">textFile</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="na">textFile</span><span class="o">(</span><span class="s">"hdfs://..."</span><span class="o">);</span> +<span class="n">JavaRDD</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">words</span> <span class="o">=</span> <span class="n">textFile</span><span class="o">.</span><span class="na">flatMap</span><span class="o">(</span><span class="k">new</span> <span class="n">FlatMapFunction</span><span class="o"><</span><span class="n">String</span><span class="o">,</span> <span class="n">String</span><span class="o">>()</span> <span class="o">{</span> + <span class="kd">public</span> <span class="n">Iterable</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="nf">call</span><span class="o">(</span><span class="n">String</span> <span class="n">s</span><span class="o">)</span> <span class="o">{</span> <span class="k">return</span> <span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span><span class="n">s</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">" "</span><span class="o">));</span> <span class="o">}</span> +<span class="o">});</span> +<span class="n">JavaPairRDD</span><span class="o"><</span><span class="n">String</span><span class="o">,</span> <span class="n">Integer</span><span class="o">></span> <span class="n">pairs</span> <span class="o">=</span> <span class="n">words</span><span class="o">.</span><span class="na">mapToPair</span><span class="o">(</span><span class="k">new</span> <span class="n">PairFunction</span><span class="o"><</span><span class="n">String</span><span class="o">,</span> <span class="n">String</span><span class="o">,</span> <span class="n">Integer</span><span class="o">>()</span> <span class="o">{</span> + <span class="kd">public</span> <span class="n">Tuple2</span><span class="o"><</span><span class="n">String</span><span class="o">,</span> <span class="n">Integer</span><span class="o">></span> <span class="nf">call</span><span class="o">(</span><span class="n">String</span> <span class="n">s</span><span class="o">)</span> <span class="o">{</span> <span class="k">return</span> <span class="k">new</span> <span class="n">Tuple2</span><span class="o"><</span><span class="n">String</span><span class="o">,</span> <span class="n">Integer</span><span class="o">>(</span><span class="n">s</span><span class="o">,</span> <span class="mi">1</span><span class="o">);</span> <span class="o">}</span> +<span class="o">});</span> +<span class="n">JavaPairRDD</span><span class="o"><</span><span class="n">String</span><span class="o">,</span> <span class="n">Integer</span><span class="o">></span> <span class="n">counts</span> <span class="o">=</span> <span class="n">pairs</span><span class="o">.</span><span class="na">reduceByKey</span><span class="o">(</span><span class="k">new</span> <span class="n">Function2</span><span class="o"><</span><span class="n">Integer</span><span class="o">,</span> <span class="n">Integer</span><span class="o">,</span> <span class="n">Integer</span><span class="o">>()</span> <span class="o">{</span> + <span class="kd">public</span> <span class="n">Integer</span> <span class="nf">call</span><span class="o">(</span><span class="n">Integer</span> <span class="n">a</span><span class="o">,</span> <span class="n">Integer</span> <span class="n">b</span><span class="o">)</span> <span class="o">{</span> <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="o">;</span> <span class="o">}</span> +<span class="o">});</span> +<span class="n">counts</span><span class="o">.</span><span class="na">saveAsTextFile</span><span class="o">(</span><span class="s">"hdfs://..."</span><span class="o">);</span></code></pre></div> + +</div> +</div> +</div> + +<h3>Pi Estimation</h3> +<p>Spark can also be used for compute-intensive tasks. This code estimates <span style="font-family: serif; font-size: 120%;">π</span> by "throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1,1)) and see how many fall in the unit circle. The fraction should be <span style="font-family: serif; font-size: 120%;">π / 4</span>, so we use this to get our estimate.</p> <ul class="nav nav-tabs"> <li class="lang-tab lang-tab-python active"><a href="#">Python</a></li> <li class="lang-tab lang-tab-scala"><a href="#">Scala</a></li> <li class="lang-tab lang-tab-java"><a href="#">Java</a></li> </ul> + <div class="tab-content"> - <div class="tab-pane tab-pane-python active"> - <div class="code code-tab"> - errors.<span class="sparkop">cache</span>() - </div> - </div> - <div class="tab-pane tab-pane-scala"> - <div class="code code-tab"> - errors.<span class="sparkop">cache</span>() - </div> - </div> - <div class="tab-pane tab-pane-java"> - <div class="code code-tab"> - errors.<span class="sparkop">cache</span>(); - </div> - </div> +<div class="tab-pane tab-pane-python active"> +<div class="code code-tab"> + +<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">sample</span><span class="p">(</span><span class="n">p</span><span class="p">):</span> + <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">random</span><span class="p">(),</span> <span class="n">random</span><span class="p">()</span> + <span class="k">return</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span> <span class="o"><</span> <span class="mi">1</span> <span class="k">else</span> <span class="mi">0</span> + +<span class="n">count</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span><span class="nb">xrange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">NUM_SAMPLES</span><span class="p">))</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span> \ + <span class="o">.</span><span class="n">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">)</span> +<span class="k">print</span> <span class="s">"Pi is roughly </span><span class="si">%f</span><span class="s">"</span> <span class="o">%</span> <span class="p">(</span><span class="mf">4.0</span> <span class="o">*</span> <span class="n">count</span> <span class="o">/</span> <span class="n">NUM_SAMPLES</span><span class="p">)</span></code></pre></div> + +</div> </div> -<p>After the first action that uses <code>errors</code>, later ones will be much faster.</p> +<div class="tab-pane tab-pane-scala"> +<div class="code code-tab"> -<h3>Word Count</h3> +<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="n">count</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span><span class="mi">1</span> <span class="n">to</span> <span class="nc">NUM_SAMPLES</span><span class="o">).</span><span class="n">map</span><span class="o">{</span><span class="n">i</span> <span class="k">=></span> + <span class="k">val</span> <span class="n">x</span> <span class="k">=</span> <span class="nc">Math</span><span class="o">.</span><span class="n">random</span><span class="o">()</span> + <span class="k">val</span> <span class="n">y</span> <span class="k">=</span> <span class="nc">Math</span><span class="o">.</span><span class="n">random</span><span class="o">()</span> + <span class="k">if</span> <span class="o">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span> <span class="o"><</span> <span class="mi">1</span><span class="o">)</span> <span class="mi">1</span> <span class="k">else</span> <span class="mi">0</span> +<span class="o">}.</span><span class="n">reduce</span><span class="o">(</span><span class="k">_</span> <span class="o">+</span> <span class="k">_</span><span class="o">)</span> +<span class="n">println</span><span class="o">(</span><span class="s">"Pi is roughly "</span> <span class="o">+</span> <span class="mf">4.0</span> <span class="o">*</span> <span class="n">count</span> <span class="o">/</span> <span class="nc">NUM_SAMPLES</span><span class="o">)</span></code></pre></div> + +</div> +</div> + +<div class="tab-pane tab-pane-java"> +<div class="code code-tab"> + +<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">List</span><span class="o"><</span><span class="n">Integer</span><span class="o">></span> <span class="n">l</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o"><</span><span class="n">Integer</span><span class="o">>(</span><span class="n">NUM_SAMPLES</span><span class="o">);</span> +<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">NUM_SAMPLES</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span> + <span class="n">l</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">i</span><span class="o">);</span> +<span class="o">}</span> + +<span class="kt">long</span> <span class="n">count</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="na">parallelize</span><span class="o">(</span><span class="n">l</span><span class="o">).</span><span class="na">filter</span><span class="o">(</span><span class="k">new</span> <span class="n">Function</span><span class="o"><</span><span class="n">Integer</span><span class="o">,</span> <span class="n">Boolean</span><span class="o">>()</span> <span class="o">{</span> + <span class="kd">public</span> <span class="n">Boolean</span> <span class="nf">call</span><span class="o">(</span><span class="n">Integer</span> <span class="n">i</span><span class="o">)</span> <span class="o">{</span> + <span class="kt">double</span> <span class="n">x</span> <span class="o">=</span> <span class="n">Math</span><span class="o">.</span><span class="na">random</span><span class="o">();</span> + <span class="kt">double</span> <span class="n">y</span> <span class="o">=</span> <span class="n">Math</span><span class="o">.</span><span class="na">random</span><span class="o">();</span> + <span class="k">return</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span> <span class="o"><</span> <span class="mi">1</span><span class="o">;</span> + <span class="o">}</span> +<span class="o">}).</span><span class="na">count</span><span class="o">();</span> +<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Pi is roughly "</span> <span class="o">+</span> <span class="mf">4.0</span> <span class="o">*</span> <span class="n">count</span> <span class="o">/</span> <span class="n">NUM_SAMPLES</span><span class="o">);</span></code></pre></div> -<p>In this example, we use a few more transformations to build a dataset of (String, Int) pairs called <code>counts</code> and then save it to a file.</p> +</div> +</div> +</div> + +<h2>DataFrame API Examples</h2> +<p> +In Spark, a <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes">DataFrame</a> +is a distributed collection of data organized into named columns. +Users can use DataFrame API to perform various relational operations on both external +data sources and Spark’s built-in distributed collections without providing specific procedures for processing data. +Also, programs based on DataFrame API will be automatically optimized by Spark’s built-in optimizer, Catalyst. +</p> + +<h3>Text Search</h3> +<p>In this example, we search through the error messages in a log file.</p> <ul class="nav nav-tabs"> <li class="lang-tab lang-tab-python active"><a href="#">Python</a></li> <li class="lang-tab lang-tab-scala"><a href="#">Scala</a></li> <li class="lang-tab lang-tab-java"><a href="#">Java</a></li> </ul> + <div class="tab-content"> - <div class="tab-pane tab-pane-python active"> - <div class="code code-tab"> - text_file = spark.textFile(<span class="string">"hdfs://..."</span>)<br /> - counts = text_file.<span class="sparkop">flatMap</span>(<span class="closure">lambda line: line.split(" ")</span>) \<br /> - .<span class="sparkop">map</span>(<span class="closure">lambda word: (word, 1)</span>) \<br /> - .<span class="sparkop">reduceByKey</span>(<span class="closure">lambda a, b: a + b</span>)<br /> - counts.<span class="sparkop">saveAsTextFile</span>(<span class="string">"hdfs://..."</span>) - </div> - </div> - <div class="tab-pane tab-pane-scala"> - <div class="code code-tab"> - <span class="keyword">val</span> textFile = spark.textFile(<span class="string">"hdfs://..."</span>)<br /> - <span class="keyword">val</span> counts = textFile.<span class="sparkop">flatMap</span>(<span class="closure">line => line.split(" ")</span>)<br /> - .<span class="sparkop">map</span>(<span class="closure">word => (word, 1)</span>)<br /> - .<span class="sparkop">reduceByKey</span>(<span class="closure">_ + _</span>)<br /> - counts.<span class="sparkop">saveAsTextFile</span>(<span class="string">"hdfs://..."</span>) - </div> - </div> - <div class="tab-pane tab-pane-java"> - <div class="code code-tab"> - JavaRDD<String> textFile = spark.textFile(<span class="string">"hdfs://..."</span>);<br /> - JavaRDD<String> words = textFile.<span class="sparkop">flatMap</span>(<span class="closure">new FlatMapFunction<String, String>() {<br /> - public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }<br /> - }</span>);<br /> - JavaPairRDD<String, Integer> pairs = words.<span class="sparkop">mapToPair</span>(<span class="closure">new PairFunction<String, String, Integer>() {<br /> - public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }<br /> - }</span>);<br /> - JavaPairRDD<String, Integer> counts = pairs.<span class="sparkop">reduceByKey</span>(<span class="closure">new Function2<Integer, Integer, Integer>() {<br /> - public Integer call(Integer a, Integer b) { return a + b; }<br /> - }</span>);<br /> - counts.<span class="sparkop">saveAsTextFile</span>(<span class="string">"hdfs://..."</span>); - </div> - </div> +<div class="tab-pane tab-pane-python active"> +<div class="code code-tab"> + +<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">textFile</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s">"hdfs://..."</span><span class="p">)</span> + +<span class="c"># Creates a DataFrame having a single column named "line"</span> +<span class="n">df</span> <span class="o">=</span> <span class="n">textFile</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">r</span><span class="p">:</span> <span class="n">Row</span><span class="p">(</span><span class="n">r</span><span class="p">))</span><span class="o">.</span><span class="n">toDF</span><span class="p">([</span><span class="s">"line"</span><span class="p">])</span> +<span class="n">errors</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s">"line"</span><span class="p">)</span><span class="o">.</span><span class="n">like</span><span class="p">(</span><span class="s">"</span><span class="si">%E</span><span class="s">RROR%"</span><span class="p">))</span> +<span class="c"># Counts all the errors</span> +<span class="n">errors</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> +<span class="c"># Counts errors mentioning MySQL</span> +<span class="n">errors</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s">"line"</span><span class="p">)</span><span class="o">.</span><span class="n">like</span><span class="p">(</span><span class="s">"%MySQL%"</span><span class="p">))</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> +<span class="c"># Fetches the MySQL errors as an array of strings</span> +<span class="n">errors</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s">"line"</span><span class="p">)</span><span class="o">.</span><span class="n">like</span><span class="p">(</span><span class="s">"%MySQL%"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span></code></pre></div> + +</div> </div> -<h3>Estimating Pi</h3> +<div class="tab-pane tab-pane-scala"> +<div class="code code-tab"> -<p>Spark can also be used for compute-intensive tasks. This code estimates <span style="font-family: serif; font-size: 120%;">π</span> by "throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1,1)) and see how many fall in the unit circle. The fraction should be <span style="font-family: serif; font-size: 120%;">π / 4</span>, so we use this to get our estimate.</p> +<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="n">textFile</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"hdfs://..."</span><span class="o">)</span> + +<span class="c1">// Creates a DataFrame having a single column named "line"</span> +<span class="k">val</span> <span class="n">df</span> <span class="k">=</span> <span class="n">textFile</span><span class="o">.</span><span class="n">toDF</span><span class="o">(</span><span class="s">"line"</span><span class="o">)</span> +<span class="k">val</span> <span class="n">errors</span> <span class="k">=</span> <span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="n">col</span><span class="o">(</span><span class="s">"line"</span><span class="o">).</span><span class="n">like</span><span class="o">(</span><span class="s">"%ERROR%"</span><span class="o">))</span> +<span class="c1">// Counts all the errors</span> +<span class="n">errors</span><span class="o">.</span><span class="n">count</span><span class="o">()</span> +<span class="c1">// Counts errors mentioning MySQL</span> +<span class="n">errors</span><span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="n">col</span><span class="o">(</span><span class="s">"line"</span><span class="o">).</span><span class="n">like</span><span class="o">(</span><span class="s">"%MySQL%"</span><span class="o">)).</span><span class="n">count</span><span class="o">()</span> +<span class="c1">// Fetches the MySQL errors as an array of strings</span> +<span class="n">errors</span><span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="n">col</span><span class="o">(</span><span class="s">"line"</span><span class="o">).</span><span class="n">like</span><span class="o">(</span><span class="s">"%MySQL%"</span><span class="o">)).</span><span class="n">collect</span><span class="o">()</span></code></pre></div> + +</div> +</div> + +<div class="tab-pane tab-pane-java"> +<div class="code code-tab"> + +<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// Creates a DataFrame having a single column named "line"</span> +<span class="n">JavaRDD</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">textFile</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="na">textFile</span><span class="o">(</span><span class="s">"hdfs://..."</span><span class="o">);</span> +<span class="n">JavaRDD</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">rowRDD</span> <span class="o">=</span> <span class="n">textFile</span><span class="o">.</span><span class="na">map</span><span class="o">(</span> + <span class="k">new</span> <span class="n">Function</span><span class="o"><</span><span class="n">String</span><span class="o">,</span> <span class="n">Row</span><span class="o">>()</span> <span class="o">{</span> + <span class="kd">public</span> <span class="n">Row</span> <span class="nf">call</span><span class="o">(</span><span class="n">String</span> <span class="n">line</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">Exception</span> <span class="o">{</span> + <span class="k">return</span> <span class="n">RowFactory</span><span class="o">.</span><span class="na">create</span><span class="o">(</span><span class="n">line</span><span class="o">);</span> + <span class="o">}</span> + <span class="o">});</span> +<span class="n">List</span><span class="o"><</span><span class="n">StructField</span><span class="o">></span> <span class="n">fields</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o"><</span><span class="n">StructField</span><span class="o">>();</span> +<span class="n">fields</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">DataTypes</span><span class="o">.</span><span class="na">createStructField</span><span class="o">(</span><span class="s">"line"</span><span class="o">,</span> <span class="n">DataTypes</span><span class="o">.</span><span class="na">StringType</span><span class="o">,</span> <span class="kc">true</span><span class="o">));</span> +<span class="n">StructType</span> <span class="n">schema</span> <span class="o">=</span> <span class="n">DataTypes</span><span class="o">.</span><span class="na">createStructType</span><span class="o">(</span><span class="n">fields</span><span class="o">);</span> +<span class="n">DataFrame</span> <span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="na">createDataFrame</span><span class="o">(</span><span class="n">rowRDD</span><span class="o">,</span> <span class="n">schema</span><span class="o">);</span> + +<span class="n">DataFrame</span> <span class="n">errors</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="na">filter</span><span class="o">(</span><span class="n">col</span><span class="o">(</span><span class="s">"line"</span><span class="o">).</span><span class="na">like</span><span class="o">(</span><span class="s">"%ERROR%"</span><span class="o">));</span> +<span class="c1">// Counts all the errors</span> +<span class="n">errors</span><span class="o">.</span><span class="na">count</span><span class="o">();</span> +<span class="c1">// Counts errors mentioning MySQL</span> +<span class="n">errors</span><span class="o">.</span><span class="na">filter</span><span class="o">(</span><span class="n">col</span><span class="o">(</span><span class="s">"line"</span><span class="o">).</span><span class="na">like</span><span class="o">(</span><span class="s">"%MySQL%"</span><span class="o">)).</span><span class="na">count</span><span class="o">();</span> +<span class="c1">// Fetches the MySQL errors as an array of strings</span> +<span class="n">errors</span><span class="o">.</span><span class="na">filter</span><span class="o">(</span><span class="n">col</span><span class="o">(</span><span class="s">"line"</span><span class="o">).</span><span class="na">like</span><span class="o">(</span><span class="s">"%MySQL%"</span><span class="o">)).</span><span class="na">collect</span><span class="o">();</span></code></pre></div> + +</div> +</div> +</div> + +<h3>Simple Data Operations</h3> +<p> +In this example, we read a table stored in a database and calculate the number of people for every age. +Finally, we save the calculated result to S3 in the format of JSON. +A simple MySQL table "people" is used in the example and this table has two columns, +"name" and "age". +</p> <ul class="nav nav-tabs"> <li class="lang-tab lang-tab-python active"><a href="#">Python</a></li> <li class="lang-tab lang-tab-scala"><a href="#">Scala</a></li> <li class="lang-tab lang-tab-java"><a href="#">Java</a></li> </ul> + <div class="tab-content"> - <div class="tab-pane tab-pane-python active"> - <div class="code code-tab"> - <span class="keyword">def</span> sample(p):<br /> - x, y = random(), random()<br /> - <span class="keyword">return</span> 1 <span class="keyword">if</span> x*x + y*y < 1 <span class="keyword">else</span> 0<br /><br /> - count = spark.parallelize(xrange(0, NUM_SAMPLES)).<span class="sparkop">map</span>(<span class="closure">sample</span>) \<br /> - .<span class="sparkop">reduce</span>(<span class="closure">lambda a, b: a + b</span>)<br /> - print <span class="string">"Pi is roughly %f"</span> % (4.0 * count / NUM_SAMPLES)<br /> - </div> - </div> - <div class="tab-pane tab-pane-scala"> - <div class="code code-tab"> - <span class="keyword">val</span> count = spark.parallelize(1 to NUM_SAMPLES).<span class="sparkop">map</span>{<span class="closure">i =><br /> - val x = Math.random()<br /> - val y = Math.random()<br /> - if (x*x + y*y < 1) 1 else 0<br /> - </span>}.<span class="sparkop">reduce</span>(<span class="closure">_ + _</span>)<br /> - println(<span class="string">"Pi is roughly "</span> + 4.0 * count / NUM_SAMPLES)<br /> - </div> - </div> - <div class="tab-pane tab-pane-java"> - <div class="code code-tab"> - <span class="keyword">int</span> count = spark.parallelize(makeRange(1, NUM_SAMPLES)).<span class="sparkop">filter</span>(<span class="closure">new Function<Integer, Boolean>() {<br /> - public Boolean call(Integer i) {<br /> - double x = Math.random();<br /> - double y = Math.random();<br /> - return x*x + y*y < 1;<br /> - }<br /> - }</span>).<span class="sparkop">count</span>();<br /> - System.out.println(<span class="string">"Pi is roughly "</span> + 4 * count / NUM_SAMPLES);<br /> - </div> - </div> +<div class="tab-pane tab-pane-python active"> +<div class="code code-tab"> + +<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Creates a DataFrame based on a table named "people"</span> +<span class="c"># stored in a MySQL database.</span> +<span class="n">url</span> <span class="o">=</span> \ + <span class="s">"jdbc:mysql://yourIP:yourPort/test?user=yourUsername;password=yourPassword"</span> +<span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span> \ + <span class="o">.</span><span class="n">read</span> \ + <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">"jdbc"</span><span class="p">)</span> \ + <span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s">"url"</span><span class="p">,</span> <span class="n">url</span><span class="p">)</span> \ + <span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s">"dbtable"</span><span class="p">,</span> <span class="s">"people"</span><span class="p">)</span> \ + <span class="o">.</span><span class="n">load</span><span class="p">()</span> + +<span class="c"># Looks the schema of this DataFrame.</span> +<span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span> + +<span class="c"># Counts people by age</span> +<span class="n">countsByAge</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">"age"</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> +<span class="n">countsByAge</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> + +<span class="c"># Saves countsByAge to S3 in the JSON format.</span> +<span class="n">countsByAge</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">"json"</span><span class="p">)</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">"s3a://..."</span><span class="p">)</span></code></pre></div> + +</div> </div> -<h3>Logistic Regression</h3> +<div class="tab-pane tab-pane-scala"> +<div class="code code-tab"> + +<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="c1">// Creates a DataFrame based on a table named "people"</span> +<span class="c1">// stored in a MySQL database.</span> +<span class="k">val</span> <span class="n">url</span> <span class="k">=</span> + <span class="s">"jdbc:mysql://yourIP:yourPort/test?user=yourUsername;password=yourPassword"</span> +<span class="k">val</span> <span class="n">df</span> <span class="k">=</span> <span class="n">sqlContext</span> + <span class="o">.</span><span class="n">read</span> + <span class="o">.</span><span class="n">format</span><span class="o">(</span><span class="s">"jdbc"</span><span class="o">)</span> + <span class="o">.</span><span class="n">option</span><span class="o">(</span><span class="s">"url"</span><span class="o">,</span> <span class="n">url</span><span class="o">)</span> + <span class="o">.</span><span class="n">option</span><span class="o">(</span><span class="s">"dbtable"</span><span class="o">,</span> <span class="s">"people"</span><span class="o">)</span> + <span class="o">.</span><span class="n">load</span><span class="o">()</span> + +<span class="c1">// Looks the schema of this DataFrame.</span> +<span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="o">()</span> -<p>This is an iterative machine learning algorithm that seeks to find the best hyperplane that separates two sets of points in a multi-dimensional feature space. It can be used to classify messages into spam vs non-spam, for example. Because the algorithm applies the same MapReduce operation repeatedly to the same dataset, it benefits greatly from caching the input in RAM across iterations.</p> +<span class="c1">// Counts people by age</span> +<span class="k">val</span> <span class="n">countsByAge</span> <span class="k">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="o">(</span><span class="s">"age"</span><span class="o">).</span><span class="n">count</span><span class="o">()</span> +<span class="n">countsByAge</span><span class="o">.</span><span class="n">show</span><span class="o">()</span> + +<span class="c1">// Saves countsByAge to S3 in the JSON format.</span> +<span class="n">countsByAge</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">format</span><span class="o">(</span><span class="s">"json"</span><span class="o">).</span><span class="n">save</span><span class="o">(</span><span class="s">"s3a://..."</span><span class="o">)</span></code></pre></div> + +</div> +</div> + +<div class="tab-pane tab-pane-java"> +<div class="code code-tab"> + +<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// Creates a DataFrame based on a table named "people"</span> +<span class="c1">// stored in a MySQL database.</span> +<span class="n">String</span> <span class="n">url</span> <span class="o">=</span> + <span class="s">"jdbc:mysql://yourIP:yourPort/test?user=yourUsername;password=yourPassword"</span><span class="o">;</span> +<span class="n">DataFrame</span> <span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span> + <span class="o">.</span><span class="na">read</span><span class="o">()</span> + <span class="o">.</span><span class="na">format</span><span class="o">(</span><span class="s">"jdbc"</span><span class="o">)</span> + <span class="o">.</span><span class="na">option</span><span class="o">(</span><span class="s">"url"</span><span class="o">,</span> <span class="n">url</span><span class="o">)</span> + <span class="o">.</span><span class="na">option</span><span class="o">(</span><span class="s">"dbtable"</span><span class="o">,</span> <span class="s">"people"</span><span class="o">)</span> + <span class="o">.</span><span class="na">load</span><span class="o">();</span> + +<span class="c1">// Looks the schema of this DataFrame.</span> +<span class="n">df</span><span class="o">.</span><span class="na">printSchema</span><span class="o">();</span> + +<span class="c1">// Counts people by age</span> +<span class="n">DataFrame</span> <span class="n">countsByAge</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="na">groupBy</span><span class="o">(</span><span class="s">"age"</span><span class="o">).</span><span class="na">count</span><span class="o">();</span> +<span class="n">countsByAge</span><span class="o">.</span><span class="na">show</span><span class="o">();</span> + +<span class="c1">// Saves countsByAge to S3 in the JSON format.</span> +<span class="n">countsByAge</span><span class="o">.</span><span class="na">write</span><span class="o">().</span><span class="na">format</span><span class="o">(</span><span class="s">"json"</span><span class="o">).</span><span class="na">save</span><span class="o">(</span><span class="s">"s3a://..."</span><span class="o">);</span></code></pre></div> + +</div> +</div> +</div> + +<h2>Machine Learning Example</h2> +<p> +<a href="http://spark.apache.org/docs/latest/mllib-guide.html">MLlib</a>, Spark’s Machine Learning (ML) library, provides many distributed ML algorithms. +These algorithms cover tasks such as feature extraction, classification, regression, clustering, +recommendation, and more. +MLlib also provides tools such as ML Pipelines for building workflows, CrossValidator for tuning parameters, +and model persistence for saving and loading models. +</p> + +<h3>Prediction with Logistic Regression</h3> +<p> +In this example, we take a dataset of labels and feature vectors. +We learn to predict the labels from feature vectors using the Logistic Regression algorithm. +</p> <ul class="nav nav-tabs"> <li class="lang-tab lang-tab-python active"><a href="#">Python</a></li> <li class="lang-tab lang-tab-scala"><a href="#">Scala</a></li> <li class="lang-tab lang-tab-java"><a href="#">Java</a></li> </ul> + <div class="tab-content"> - <div class="tab-pane tab-pane-python active"> - <div class="code code-tab"> - points = spark.textFile(...).<span class="sparkop">map</span>(parsePoint).<span class="sparkop">cache</span>()<br /> - w = numpy.random.ranf(size = D) <span class="comment"># current separating plane</span><br /> - <span class="keyword">for</span> i <span class="keyword">in</span> range(ITERATIONS):<br /> - gradient = points.<span class="sparkop">map</span>(<span class="closure"><br /> - lambda p: (1 / (1 + exp(-p.y*(w.dot(p.x)))) - 1) * p.y * p.x<br /> - </span>).<span class="sparkop">reduce</span>(<span class="closure">lambda a, b: a + b</span>)<br /> - w -= gradient<br /> - print <span class="string">"Final separating plane: %s"</span> % w<br /> - </div> - </div> - <div class="tab-pane tab-pane-scala"> - <div class="code code-tab"> - <span class="keyword">val</span> points = spark.textFile(...).<span class="sparkop">map</span>(parsePoint).<span class="sparkop">cache</span>()<br /> - <span class="keyword">var</span> w = Vector.random(D) <span class="comment">// current separating plane</span><br /> - <span class="keyword">for</span> (i <- 1 to ITERATIONS) {<br /> - <span class="keyword">val</span> gradient = points.<span class="sparkop">map</span>(<span class="closure">p =><br /> - (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x<br /> - </span>).<span class="sparkop">reduce</span>(<span class="closure">_ + _</span>)<br /> - w -= gradient<br /> - }<br /> - println(<span class="string">"Final separating plane: "</span> + w)<br /> - </div> - </div> - <div class="tab-pane tab-pane-java"> - <div class="code code-tab"> - <span class="keyword">class</span> ComputeGradient <span class="keyword">extends</span> Function<DataPoint, Vector> {<br /> - <span class="keyword">private</span> Vector w;<br /> - ComputeGradient(Vector w) { <span class="keyword">this</span>.w = w; }<br /> - <span class="keyword">public</span> Vector call(DataPoint p) {<br /> - <span class="keyword">return</span> p.x.times(p.y * (1 / (1 + Math.exp(w.dot(p.x))) - 1));<br /> - }<br /> - }<br /> - <br /> - JavaRDD<DataPoint> points = spark.textFile(...).<span class="sparkop">map</span>(<span class="closure">new ParsePoint()</span>).<span class="sparkop">cache</span>();<br /> - Vector w = Vector.random(D); <span class="comment">// current separating plane</span><br /> - <span class="keyword">for</span> (<span class="keyword">int</span> i = 0; i < ITERATIONS; i++) {<br /> - Vector gradient = points.<span class="sparkop">map</span>(<span class="closure">new ComputeGradient(w)</span>).<span class="sparkop">reduce</span>(<span class="closure">new AddVectors()</span>);<br /> - w = w.subtract(gradient);<br /> - }<br /> - System.out.println(<span class="string">"Final separating plane: "</span> + w);<br /> - </div> - </div> +<div class="tab-pane tab-pane-python active"> +<div class="code code-tab"> + +<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Every record of this DataFrame contains the label and</span> +<span class="c"># features represented by a vector.</span> +<span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="p">[</span><span class="s">"label"</span><span class="p">,</span> <span class="s">"features"</span><span class="p">])</span> + +<span class="c"># Set parameters for the algorithm.</span> +<span class="c"># Here, we limit the number of iterations to 10.</span> +<span class="n">lr</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">(</span><span class="n">maxIter</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> + +<span class="c"># Fit the model to the data.</span> +<span class="n">model</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df</span><span class="p">)</span> + +<span class="c"># Given a dataset, predict each point's label, and show the results.</span> +<span class="n">model</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span></code></pre></div> + +</div> </div> -<p>Note that the current separating plane, <code>w</code>, gets shipped automatically to the cluster with every <code>map</code> call.</p> +<div class="tab-pane tab-pane-scala"> +<div class="code code-tab"> -<p>The graph below compares the running time per iteration of this Spark program against a Hadoop implementation on 100 GB of data on a 100-node cluster, showing the benefit of in-memory caching:</p> +<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="c1">// Every record of this DataFrame contains the label and</span> +<span class="c1">// features represented by a vector.</span> +<span class="k">val</span> <span class="n">df</span> <span class="k">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="o">(</span><span class="n">data</span><span class="o">).</span><span class="n">toDF</span><span class="o">(</span><span class="s">"label"</span><span class="o">,</span> <span class="s">"features"</span><span class="o">)</span> -<p style="margin-top: 20px; margin-bottom: 30px;"> -<img src="/images/logistic-regression.png" alt="Logistic regression performance in Spark vs Hadoop" /> -</p> +<span class="c1">// Set parameters for the algorithm.</span> +<span class="c1">// Here, we limit the number of iterations to 10.</span> +<span class="k">val</span> <span class="n">lr</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">LogisticRegression</span><span class="o">().</span><span class="n">setMaxIter</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span> + +<span class="c1">// Fit the model to the data.</span> +<span class="k">val</span> <span class="n">model</span> <span class="k">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="o">(</span><span class="n">df</span><span class="o">)</span> + +<span class="c1">// Inspect the model: get the feature weights.</span> +<span class="k">val</span> <span class="n">weights</span> <span class="k">=</span> <span class="n">model</span><span class="o">.</span><span class="n">weights</span> + +<span class="c1">// Given a dataset, predict each point's label, and show the results.</span> +<span class="n">model</span><span class="o">.</span><span class="n">transform</span><span class="o">(</span><span class="n">df</span><span class="o">).</span><span class="n">show</span><span class="o">()</span></code></pre></div> + +</div> +</div> + +<div class="tab-pane tab-pane-java"> +<div class="code code-tab"> + +<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// Every record of this DataFrame contains the label and</span> +<span class="c1">// features represented by a vector.</span> +<span class="n">StructType</span> <span class="n">schema</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">StructType</span><span class="o">(</span><span class="k">new</span> <span class="n">StructField</span><span class="o">[]{</span> + <span class="k">new</span> <span class="nf">StructField</span><span class="o">(</span><span class="s">"label"</span><span class="o">,</span> <span class="n">DataTypes</span><span class="o">.</span><span class="na">DoubleType</span><span class="o">,</span> <span class="kc">false</span><span class="o">,</span> <span class="n">Metadata</span><span class="o">.</span><span class="na">empty</span><span class="o">()),</span> + <span class="k">new</span> <span class="nf">StructField</span><span class="o">(</span><span class="s">"features"</span><span class="o">,</span> <span class="k">new</span> <span class="nf">VectorUDT</span><span class="o">(),</span> <span class="kc">false</span><span class="o">,</span> <span class="n">Metadata</span><span class="o">.</span><span class="na">empty</span><span class="o">()),</span> +<span class="o">});</span> +<span class="n">DataFrame</span> <span class="n">df</span> <span class="o">=</span> <span class="n">jsql</span><span class="o">.</span><span class="na">createDataFrame</span><span class="o">(</span><span class="n">data</span><span class="o">,</span> <span class="n">schema</span><span class="o">);</span> + +<span class="c1">// Set parameters for the algorithm.</span> +<span class="c1">// Here, we limit the number of iterations to 10.</span> +<span class="n">LogisticRegression</span> <span class="n">lr</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">LogisticRegression</span><span class="o">().</span><span class="na">setMaxIter</span><span class="o">(</span><span class="mi">10</span><span class="o">);</span> + +<span class="c1">// Fit the model to the data.</span> +<span class="n">LogisticRegressionModel</span> <span class="n">model</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="na">fit</span><span class="o">(</span><span class="n">df</span><span class="o">);</span> + +<span class="c1">// Inspect the model: get the feature weights.</span> +<span class="n">Vector</span> <span class="n">weights</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="na">weights</span><span class="o">();</span> + +<span class="c1">// Given a dataset, predict each point's label, and show the results.</span> +<span class="n">model</span><span class="o">.</span><span class="na">transform</span><span class="o">(</span><span class="n">df</span><span class="o">).</span><span class="na">show</span><span class="o">();</span></code></pre></div> + +</div> +</div> +</div> <p><a name="additional"></a></p> -<h2>Additional Examples</h2> +<h1>Additional Examples</h1> <p>Many additional examples are distributed with Spark:</p> @@ -423,7 +581,6 @@ previous ones, and <em>actions</em>, which kick off a job to execute on a cluste <li>Spark Streaming: <a href="https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming">Scala examples</a>, <a href="https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples/streaming">Java examples</a></li> </ul> - </div> </div> |