From c8bf4131bc2a2e147e977159fc90e94b85738830 Mon Sep 17 00:00:00 2001
From: Matei Zaharia
Date: Fri, 30 May 2014 00:34:33 -0700
Subject: [SPARK-1566] consolidate programming guide, and general doc updates

This is a fairly large PR to clean up and update the docs for 1.0. The major
changes are:

* A unified programming guide for all languages replaces language-specific
  ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input
  formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark
  versions

You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/
and in particular
http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.

Author: Matei Zaharia

Closes #896 from mateiz/1.0-docs and squashes the following commits:

03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
---
 docs/index.md | 79 +++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
 1 file changed, 41 insertions(+), 38 deletions(-)

(limited to 'docs/index.md')

diff --git a/docs/index.md b/docs/index.md
index c9b10376cc..1a4ff3dbf5 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,23 +4,23 @@ title: Spark Overview
 ---
 
 Apache Spark is a fast and general-purpose cluster computing system.
-It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.
-It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
+It provides high-level APIs in Java, Scala and Python,
+and an optimized engine that supports general execution graphs.
+It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [Spark SQL](sql-programming-guide.html) for structured data, [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
 
 # Downloading
 
-Get Spark by visiting the [downloads page](http://spark.apache.org/downloads.html) of the Apache Spark site. This documentation is for Spark version {{site.SPARK_VERSION}}. The downloads page
+Get Spark from the [downloads page](http://spark.apache.org/downloads.html) of the project website. This documentation is for Spark version {{site.SPARK_VERSION}}. The downloads page
 contains Spark packages for many popular HDFS versions. If you'd like to build Spark from
-scratch, visit the [building with Maven](building-with-maven.html) page.
+scratch, visit [building Spark with Maven](building-with-maven.html).
 
-Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). All you need to run it is
-to have `java` to installed on your system `PATH`, or the `JAVA_HOME` environment variable
-pointing to a Java installation.
+Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy to run
+locally on one machine --- all you need is to have `java` installed on your system `PATH`,
+or the `JAVA_HOME` environment variable pointing to a Java installation.
 
-For its Scala API, Spark {{site.SPARK_VERSION}} depends on Scala {{site.SCALA_BINARY_VERSION}}.
-If you write applications in Scala, you will need to use a compatible Scala version
-(e.g. {{site.SCALA_BINARY_VERSION}}.X) -- newer major versions may not work. You can get the
-right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/).
+Spark runs on Java 6+ and Python 2.6+. For the Scala API, Spark {{site.SPARK_VERSION}} uses
+Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a compatible Scala version
+({{site.SCALA_BINARY_VERSION}}.x).
 
 # Running the Examples and Shell
 
@@ -28,24 +28,23 @@ Spark comes with several sample programs. Scala, Java and Python examples are i
 `examples/src/main` directory. To run one of the Java or Scala sample programs, use
 `bin/run-example <class> [params]` in the top-level Spark directory. (Behind the scenes, this
 invokes the more general
-[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) for
+[`spark-submit` script](submitting-applications.html) for
 launching applications). For example,
 
     ./bin/run-example SparkPi 10
 
-You can also run Spark interactively through modified versions of the Scala shell. This is a
+You can also run Spark interactively through a modified version of the Scala shell. This is a
 great way to learn the framework.
 
     ./bin/spark-shell --master local[2]
 
 The `--master` option specifies the
-[master URL for a distributed cluster](scala-programming-guide.html#master-urls), or `local` to run
+[master URL for a distributed cluster](submitting-applications.html#master-urls), or `local` to run
 locally with one thread, or `local[N]` to run locally with N threads. You should start by using
 `local` for testing. For a full list of options, run Spark shell with the `--help` option.
 
-Spark also provides a Python interface. To run Spark interactively in a Python interpreter, use
-`bin/pyspark`. As in Spark shell, you can also pass in the `--master` option to configure your
-master URL.
+Spark also provides a Python API. To run Spark interactively in a Python interpreter, use
+`bin/pyspark`:
 
     ./bin/pyspark --master local[2]
 
@@ -66,17 +65,17 @@ options for deployment:
 
 # Where to Go from Here
 
-**Programming guides:**
+**Programming Guides:**
 
 * [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
-* [Spark Programming Guide](scala-programming-guide.html): an overview of Spark concepts, and details on the Scala API
-  * [Java Programming Guide](java-programming-guide.html): using Spark from Java
-  * [Python Programming Guide](python-programming-guide.html): using Spark from Python
-* [Spark Streaming](streaming-programming-guide.html): Spark's API for processing data streams
-* [Spark SQL](sql-programming-guide.html): Support for running relational queries on Spark
-* [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
-* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model
-* [GraphX (Graphs on Spark)](graphx-programming-guide.html): Spark's new API for graphs
+* [Spark Programming Guide](programming-guide.html): detailed overview of Spark
+  in all supported languages (Scala, Java, Python)
+* Modules built on Spark:
+  * [Spark Streaming](streaming-programming-guide.html): processing real-time data streams
+  * [Spark SQL](sql-programming-guide.html): support for structured data and relational queries
+  * [MLlib](mllib-guide.html): built-in machine learning library
+  * [GraphX](graphx-programming-guide.html): Spark's new API for graph processing
+  * [Bagel (Pregel on Spark)](bagel-programming-guide.html): older, simple graph processing model
 
 **API Docs:**
 
@@ -84,26 +83,30 @@ options for deployment:
 * [Spark Java API (Javadoc)](api/java/index.html)
 * [Spark Python API (Epydoc)](api/python/index.html)
 
-**Deployment guides:**
+**Deployment Guides:**
 
 * [Cluster Overview](cluster-overview.html): overview of concepts and components when running on a cluster
-* [Amazon EC2](ec2-scripts.html): scripts that let you launch a cluster on EC2 in about 5 minutes
-* [Standalone Deploy Mode](spark-standalone.html): launch a standalone cluster quickly without a third-party cluster manager
-* [Mesos](running-on-mesos.html): deploy a private cluster using
-  [Apache Mesos](http://mesos.apache.org)
-* [YARN](running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)
+* [Submitting Applications](submitting-applications.html): packaging and deploying applications
+* Deployment modes:
+  * [Amazon EC2](ec2-scripts.html): scripts that let you launch a cluster on EC2 in about 5 minutes
+  * [Standalone Deploy Mode](spark-standalone.html): launch a standalone cluster quickly without a third-party cluster manager
+  * [Mesos](running-on-mesos.html): deploy a private cluster using
+    [Apache Mesos](http://mesos.apache.org)
+  * [YARN](running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)
 
-**Other documents:**
+**Other Documents:**
 
 * [Configuration](configuration.html): customize Spark via its configuration system
+* [Monitoring](monitoring.html): track the behavior of your applications
 * [Tuning Guide](tuning.html): best practices to optimize performance and memory use
+* [Job Scheduling](job-scheduling.html): scheduling resources across and within Spark applications
 * [Security](security.html): Spark security support
 * [Hardware Provisioning](hardware-provisioning.html): recommendations for cluster hardware
-* [Job Scheduling](job-scheduling.html): scheduling resources across and within Spark applications
+* [3rd Party Hadoop Distributions](hadoop-third-party-distributions.html): using common Hadoop distributions
 * [Building Spark with Maven](building-with-maven.html): build Spark using the Maven system
 * [Contributing to Spark](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark)
 
-**External resources:**
+**External Resources:**
 
 * [Spark Homepage](http://spark.apache.org)
 * [Shark](http://shark.cs.berkeley.edu): Apache Hive over Spark
@@ -112,9 +115,9 @@ options for deployment:
   exercises about Spark, Shark, Spark Streaming, Mesos, and more. [Videos](http://ampcamp.berkeley.edu/3/),
   [slides](http://ampcamp.berkeley.edu/3/) and [exercises](http://ampcamp.berkeley.edu/3/exercises/) are
   available online for free.
-* [Code Examples](http://spark.apache.org/examples.html): more are also available in the [examples subfolder](https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/) of Spark
-* [Paper Describing Spark](http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf)
-* [Paper Describing Spark Streaming](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf)
+* [Code Examples](http://spark.apache.org/examples.html): more are also available in the `examples` subfolder of Spark ([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
+  [Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
+  [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python))
 
 # Community
 
--
cgit v1.2.3
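
As a companion to the `run-example` and `spark-submit` commands in the new "Running the Examples and Shell" section, here is a minimal self-contained application against the Spark 1.0 Scala API. This is an illustrative sketch, not part of the patch: the `SimpleApp` object, the `README.md` input path, and the jar name below are made-up placeholders.

    // Illustrative sketch, not from the patch: a minimal Spark 1.0 application.
    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleApp {
      def main(args: Array[String]) {
        // The master is normally supplied via spark-submit's --master flag
        // rather than hard-coded in the SparkConf.
        val conf = new SparkConf().setAppName("Simple App")
        val sc = new SparkContext(conf)
        val lines = sc.textFile("README.md")                     // RDD of text lines
        val numSpark = lines.filter(_.contains("Spark")).count() // lines mentioning Spark
        println("Lines mentioning Spark: " + numSpark)
        sc.stop()
      }
    }

Packaged into a jar, it would be launched with the `spark-submit` script the section describes, for example (jar name is a placeholder):

    ./bin/spark-submit --class SimpleApp --master local[2] simple-app.jar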
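
Similarly, since `bin/spark-shell` starts with a `SparkContext` already bound to the variable `sc`, a quick interactive check might look like the following sketch, typed at the `scala>` prompt (the range is arbitrary):

    // After ./bin/spark-shell --master local[2]; `sc` is pre-created by the shell.
    val data = sc.parallelize(1 to 1000)   // distribute a local range as an RDD
    data.filter(_ % 2 == 0).count()        // counts the 500 even numbers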