author     Matei Zaharia <matei@databricks.com>      2014-05-30 00:34:33 -0700
committer  Patrick Wendell <pwendell@gmail.com>      2014-05-30 00:34:33 -0700
commit     c8bf4131bc2a2e147e977159fc90e94b85738830
tree       a2f885df8fb6654bd7750bb344b97a6cb6889bf3 /docs/quick-start.md
parent     eeee978a348ec2a35cc27865cea6357f9db75b74
[SPARK-1566] consolidate programming guide, and general doc updates
This is a fairly large PR to clean up and update the docs for 1.0. The major changes are:
* A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark versions
You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.
Author: Matei Zaharia <matei@databricks.com>
Closes #896 from mateiz/1.0-docs and squashes the following commits:
03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
Diffstat (limited to 'docs/quick-start.md')
-rw-r--r--  docs/quick-start.md  39
1 file changed, 26 insertions(+), 13 deletions(-)
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 20e17ebf70..6402399477 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -9,7 +9,7 @@ title: Quick Start
 This tutorial provides a quick introduction to using Spark. We will first introduce the API
 through Spark's interactive shell (in Python or Scala), then show how to write standalone
 applications in Java, Scala, and Python.
-See the [programming guide](scala-programming-guide.html) for a more complete reference.
+See the [programming guide](programming-guide.html) for a more complete reference.
 
 To follow along with this guide, first download a packaged release of Spark from the
 [Spark website](http://spark.apache.org/downloads.html). Since we won't be using HDFS,
@@ -35,7 +35,7 @@ scala> val textFile = sc.textFile("README.md")
 textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
 {% endhighlight %}
 
-RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
+RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
 
 {% highlight scala %}
 scala> textFile.count() // Number of items in this RDD
@@ -45,7 +45,7 @@ scala> textFile.first() // First item in this RDD
 res1: String = # Apache Spark
 {% endhighlight %}
 
-Now let's use a transformation. We will use the [`filter`](scala-programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
+Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
 
 {% highlight scala %}
 scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
@@ -70,7 +70,7 @@ Spark's primary abstraction is a distributed collection of items called a Resili
 >>> textFile = sc.textFile("README.md")
 {% endhighlight %}
 
-RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
+RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
 
 {% highlight python %}
 >>> textFile.count() # Number of items in this RDD
@@ -80,7 +80,7 @@ RDDs have _[actions](scala-programming-guide.html#actions)_, which return values
 u'# Apache Spark'
 {% endhighlight %}
 
-Now let's use a transformation. We will use the [`filter`](scala-programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
+Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
 
 {% highlight python %}
 >>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
@@ -125,7 +125,7 @@ scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (w
 wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
 {% endhighlight %}
 
-Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
+Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map`](programming-guide.html#transformations) and [`reduceByKey`](programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](programming-guide.html#actions) action:
 
 {% highlight scala %}
 scala> wordCounts.collect()
@@ -162,7 +162,7 @@ One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can i
 >>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
 {% endhighlight %}
 
-Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
+Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map`](programming-guide.html#transformations) and [`reduceByKey`](programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the [`collect`](programming-guide.html#actions) action:
 
 {% highlight python %}
 >>> wordCounts.collect()
@@ -192,7 +192,7 @@ res9: Long = 15
 It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
 that these same functions can be used on very large data sets, even when they are striped across
 tens or hundreds of nodes. You can also do this interactively by connecting `bin/spark-shell` to
-a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
+a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).
 
 </div>
 <div data-lang="python" markdown="1">
@@ -210,7 +210,7 @@ a cluster, as described in the [programming guide](scala-programming-guide.html#
 It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
 that these same functions can be used on very large data sets, even when they are striped across
 tens or hundreds of nodes. You can also do this interactively by connecting `bin/pyspark` to
-a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
+a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).
 
 </div>
 </div>
@@ -336,7 +336,7 @@ As with the Scala example, we initialize a SparkContext, though we use the speci
 `JavaSparkContext` class to get a Java-friendly one.
 
 We also create RDDs (represented by `JavaRDD`) and run transformations on them.
 Finally, we pass functions to Spark by creating classes that extend `spark.api.java.function.Function`. The
-[Java programming guide](java-programming-guide.html) describes these differences in more detail.
+[Spark programming guide](programming-guide.html) describes these differences in more detail.
 
 To build the program, we also write a Maven `pom.xml` file that lists Spark as a dependency.
 Note that Spark artifacts are tagged with a Scala version.
@@ -442,6 +442,19 @@ Lines with a: 46, Lines with b: 23
 # Where to Go from Here
 Congratulations on running your first Spark application!
-* For an in-depth overview of the API see "Programming Guides" menu section.
-* For running applications on a cluster head to the [deployment overview](cluster-overview.html).
-* For configuration options available to Spark applications see the [configuration page](configuration.html).
+* For an in-depth overview of the API, start with the [Spark programming guide](programming-guide.html),
+  or see "Programming Guides" menu for other components.
+* For running applications on a cluster, head to the [deployment overview](cluster-overview.html).
+* Finally, Spark includes several samples in the `examples` directory
+([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
+ [Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
+ [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python)).
+You can run them as follows:
+
+{% highlight bash %}
+# For Scala and Java, use run-example:
+./bin/run-example SparkPi
+
+# For Python examples, use spark-submit directly:
+./bin/spark-submit examples/src/main/python/pi.py
+{% endhighlight %}
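For context on the flatMap / map / reduceByKey pipeline the updated quick start walks through, the same word count can be packaged as a standalone application. The sketch below is illustrative only and is not part of this commit: it assumes the Spark 1.0-era RDD API, a local master, and a README.md in the working directory, and the object name WordCount is made up for the example.

{% highlight scala %}
// Minimal standalone sketch of the quick start's word-count pipeline.
// Assumes a local Spark installation and a README.md in the working directory.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders; in practice these are
    // usually supplied via spark-submit.
    val conf = new SparkConf().setAppName("Quick Start Word Count").setMaster("local")
    val sc = new SparkContext(conf)

    val textFile = sc.textFile("README.md")

    // flatMap/map/reduceByKey are transformations that build the per-word
    // counts as an RDD of (String, Int) pairs; collect() is the action that
    // brings the results back to the driver.
    val wordCounts = textFile
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)

    wordCounts.collect().foreach(println)

    sc.stop()
  }
}
{% endhighlight %}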