author    Tathagata Das <tathagata.das1565@gmail.com>    2013-02-18 13:26:12 -0800
committer Tathagata Das <tathagata.das1565@gmail.com>    2013-02-18 13:26:12 -0800
commit    6a6e6bda5713ccc6da9ca977321a1fcc6d38a1c1 (patch)
tree      3848e9e09a2c8b7537f4a0635ea0a32daee1f9a8 /docs/quick-start.md
parent    56b9bd197c522f33e354c2e9ad7e76440cf817e9 (diff)
parent    8ad561dc7d6475d7b217ec3f57bac3b584fed31a (diff)
Merge branch 'streaming' into ScrapCode-streaming
Conflicts:
	streaming/src/main/scala/spark/streaming/dstream/KafkaInputDStream.scala
	streaming/src/main/scala/spark/streaming/dstream/NetworkInputDStream.scala
Diffstat (limited to 'docs/quick-start.md')
-rw-r--r--  docs/quick-start.md  50
1 file changed, 49 insertions, 1 deletion
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 177cb14551..a4c4c9a8fb 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -6,7 +6,8 @@ title: Quick Start
* This will become a table of contents (this text will be scraped).
{:toc}
-This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's interactive Scala shell (don't worry if you don't know Scala -- you will not need much for this), then show how to write standalone jobs in Scala and Java. See the [programming guide](scala-programming-guide.html) for a more complete reference.
+This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's interactive Scala shell (don't worry if you don't know Scala -- you will not need much for this), then show how to write standalone jobs in Scala, Java, and Python.
+See the [programming guide](scala-programming-guide.html) for a more complete reference.
To follow along with this guide, you only need to have successfully built Spark on one machine. Simply go into your Spark directory and run:
@@ -200,6 +201,16 @@ To build the job, we also write a Maven `pom.xml` file that lists Spark as a dep
<name>Simple Project</name>
<packaging>jar</packaging>
<version>1.0</version>
+ <repositories>
+ <repository>
+ <id>Spray.cc repository</id>
+ <url>http://repo.spray.cc</url>
+ </repository>
+ <repository>
+ <id>Typesafe repository</id>
+ <url>http://repo.typesafe.com/typesafe/releases</url>
+ </repository>
+ </repositories>
<dependencies>
<dependency> <!-- Spark dependency -->
<groupId>org.spark-project</groupId>
@@ -230,3 +241,40 @@ Lines with a: 8422, Lines with b: 1836
{% endhighlight %}
This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
+
+# A Standalone Job In Python
+Now we will show how to write a standalone job using the Python API (PySpark).
+
+As an example, we'll create a simple Spark job, `SimpleJob.py`:
+
+{% highlight python %}
+"""SimpleJob.py"""
+from pyspark import SparkContext
+
+logFile = "/var/log/syslog" # Should be some file on your system
+sc = SparkContext("local", "Simple job")
+logData = sc.textFile(logFile).cache()
+
+numAs = logData.filter(lambda s: 'a' in s).count()
+numBs = logData.filter(lambda s: 'b' in s).count()
+
+print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
+{% endhighlight %}
+
+This job simply counts the number of lines containing 'a' and the number containing 'b' in a system log file.
+As in the Scala and Java examples, we use a SparkContext to create RDDs.
+We can pass Python functions to Spark, which are automatically serialized along with any variables that they reference.
+For jobs that use custom classes or third-party libraries, we can add those code dependencies to SparkContext to ensure that they will be available on remote machines; this is described in more detail in the [Python programming guide](python-programming-guide.html).
+`SimpleJob` is simple enough that we do not need to specify any code dependencies.
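+
+As a rough sketch of how such a dependency might be declared, the `pyFiles` argument of `SparkContext` lists local Python files to ship to worker machines. The helper file `mymodule.py` and its `is_error` function below are hypothetical and used only for illustration:
+
+{% highlight python %}
+"""Hypothetical sketch: shipping a helper module with a PySpark job."""
+from pyspark import SparkContext
+
+# pyFiles copies the listed local files to remote machines so that
+# functions imported from them can be used inside RDD operations.
+sc = SparkContext("local", "Job with dependencies", pyFiles=["mymodule.py"])
+
+from mymodule import is_error  # is_error(line) is assumed to return a bool
+numErrors = sc.textFile("/var/log/syslog").filter(is_error).count()
+print "Lines flagged as errors: %i" % numErrors
+{% endhighlight %}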
+
+We can run this job using the `pyspark` script:
+
+{% highlight bash %}
+$ cd $SPARK_HOME
+$ ./pyspark SimpleJob.py
+...
+Lines with a: 8422, Lines with b: 1836
+{% endhighlight %}
+
+This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
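+
+For instance, a minimal sketch of pointing the same job at a cluster master and an HDFS input might look like the following; the master URL and namenode address are placeholders, not values from this guide:
+
+{% highlight python %}
+"""Hypothetical sketch: the same job against a distributed input source."""
+from pyspark import SparkContext
+
+# Replace the master URL and HDFS path with values for your own cluster.
+sc = SparkContext("spark://master-host:7077", "Simple job")
+logData = sc.textFile("hdfs://namenode-host:9000/logs/syslog").cache()
+
+print "Lines with a: %i, lines with b: %i" % (
+    logData.filter(lambda s: 'a' in s).count(),
+    logData.filter(lambda s: 'b' in s).count())
+{% endhighlight %}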