author    | Matei Zaharia <matei@eecs.berkeley.edu> | 2013-01-12 23:49:36 -0800
committer | Matei Zaharia <matei@eecs.berkeley.edu> | 2013-01-12 23:49:36 -0800
commit    | fbb3fc41436db475f9aba7e94bc52e6e76b62894 (patch)
tree      | 933ca6c0666bd09232220f6435f3a338dcf4d060 /docs/quick-start.md
parent    | 44b3e41f2ede20c30bc540439d705e9ff8075ee1 (diff)
parent    | 49c74ba2af2ab6fe5eda16dbcd35b30b46072a3a (diff)
Merge pull request #346 from JoshRosen/python-api
Python API (PySpark)
Diffstat (limited to 'docs/quick-start.md')
-rw-r--r-- | docs/quick-start.md | 40 |
1 file changed, 39 insertions(+), 1 deletion(-)
diff --git a/docs/quick-start.md b/docs/quick-start.md
index d46dc2da3f..a4c4c9a8fb 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -6,7 +6,8 @@ title: Quick Start
 * This will become a table of contents (this text will be scraped).
 {:toc}
 
-This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's interactive Scala shell (don't worry if you don't know Scala -- you will not need much for this), then show how to write standalone jobs in Scala and Java. See the [programming guide](scala-programming-guide.html) for a more complete reference.
+This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's interactive Scala shell (don't worry if you don't know Scala -- you will not need much for this), then show how to write standalone jobs in Scala, Java, and Python.
+See the [programming guide](scala-programming-guide.html) for a more complete reference.
 
 To follow along with this guide, you only need to have successfully built Spark on one machine. Simply go into your Spark directory and run:
 
@@ -240,3 +241,40 @@ Lines with a: 8422, Lines with b: 1836
 {% endhighlight %}
 
 This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
+
+# A Standalone Job In Python
+Now we will show how to write a standalone job using the Python API (PySpark).
+
+As an example, we'll create a simple Spark job, `SimpleJob.py`:
+
+{% highlight python %}
+"""SimpleJob.py"""
+from pyspark import SparkContext
+
+logFile = "/var/log/syslog"  # Should be some file on your system
+sc = SparkContext("local", "Simple job")
+logData = sc.textFile(logFile).cache()
+
+numAs = logData.filter(lambda s: 'a' in s).count()
+numBs = logData.filter(lambda s: 'b' in s).count()
+
+print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
+{% endhighlight %}
+
+
+This job simply counts the number of lines containing 'a' and the number containing 'b' in a system log file.
+As in the Scala and Java examples, we use a SparkContext to create RDDs.
+We can pass Python functions to Spark, which are automatically serialized along with any variables that they reference.
+For jobs that use custom classes or third-party libraries, we can add those code dependencies to the SparkContext to ensure that they will be available on remote machines; this is described in more detail in the [Python programming guide](python-programming-guide.html).
+`SimpleJob` is simple enough that we do not need to specify any code dependencies.
+
+We can run this job using the `pyspark` script:
+
+{% highlight bash %}
+$ cd $SPARK_HOME
+$ ./pyspark SimpleJob.py
+...
+Lines with a: 8422, lines with b: 1836
+{% endhighlight %}
+
+This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
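The patch's note that Python functions are "automatically serialized along with any variables that they reference" is easy to demonstrate. Here is a minimal sketch, not part of the patch itself; the `threshold` variable and `is_long` function are invented for illustration:

{% highlight python %}
from pyspark import SparkContext

sc = SparkContext("local", "Closure example")

threshold = 10  # a driver-side variable captured by the functions below

def is_long(line):
    # A named function that refers to `threshold`; PySpark pickles the
    # function together with the variables it references before shipping
    # it to the workers.
    return len(line) > threshold

lines = sc.parallelize(["short", "a considerably longer line of text"])
print lines.filter(is_long).count()                       # named function
print lines.filter(lambda l: len(l) > threshold).count()  # lambda closure
{% endhighlight %}

Both counts should be 1: only the second element exceeds the captured threshold.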
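For the dependency case mentioned in the patch, the Python programming guide describes handing extra source files to the SparkContext. A sketch under that assumption follows; the `mylib` module and its `transform` function are hypothetical, and the `pyFiles` argument should be checked against the guide:

{% highlight python %}
from pyspark import SparkContext
from mylib import transform  # hypothetical helper module in mylib.py

# Assumes SparkContext accepts a pyFiles list naming .py files to ship
# to worker nodes; see the Python programming guide for the exact
# constructor signature.
sc = SparkContext("local", "Job with dependencies", pyFiles=["mylib.py"])

logData = sc.textFile("/var/log/syslog")
print logData.map(transform).count()
{% endhighlight %}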
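Finally, the closing line of the patch points to the Standalone Mode documentation for multi-machine runs. In principle the only required change to `SimpleJob.py` is the master URL; the host name and HDFS path below are placeholders, not values from the patch:

{% highlight python %}
from pyspark import SparkContext

# "local" runs the job in-process; a spark://host:port URL submits it to
# a standalone cluster master instead. Both values below are hypothetical.
sc = SparkContext("spark://master-host:7077", "Simple job")

logData = sc.textFile("hdfs://master-host:9000/logs/syslog").cache()
print logData.filter(lambda s: 'a' in s).count()
{% endhighlight %}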