author    Matei Zaharia <matei@databricks.com>  2014-05-30 00:34:33 -0700
committer Patrick Wendell <pwendell@gmail.com>  2014-05-30 00:34:33 -0700
commit    c8bf4131bc2a2e147e977159fc90e94b85738830 (patch)
tree      a2f885df8fb6654bd7750bb344b97a6cb6889bf3 /docs/sql-programming-guide.md
parent    eeee978a348ec2a35cc27865cea6357f9db75b74 (diff)
[SPARK-1566] consolidate programming guide, and general doc updates
This is a fairly large PR that cleans up and updates the docs for 1.0. The major changes are:
* A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark versions
You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.
Author: Matei Zaharia <matei@databricks.com>
Closes #896 from mateiz/1.0-docs and squashes the following commits:
03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
Diffstat (limited to 'docs/sql-programming-guide.md')
-rw-r--r--  docs/sql-programming-guide.md | 29
1 file changed, 15 insertions(+), 14 deletions(-)
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 8a785450ad..a506457eba 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -2,7 +2,6 @@ layout: global
 title: Spark SQL Programming Guide
 ---
-**Spark SQL is currently an Alpha component. Therefore, the APIs may be changed in future releases.**
 * This will become a table of contents (this text will be scraped).
 {:toc}
@@ -17,10 +16,10 @@ Spark.  At the core of this component is a new type of RDD,
 [SchemaRDD](api/scala/index.html#org.apache.spark.sql.SchemaRDD).  SchemaRDDs are composed
 [Row](api/scala/index.html#org.apache.spark.sql.catalyst.expressions.Row) objects along with
 a schema that describes the data types of each column in the row.  A SchemaRDD is similar to a table
-in a traditional relational database.  A SchemaRDD can be created from an existing RDD, parquet
+in a traditional relational database.  A SchemaRDD can be created from an existing RDD, [Parquet](http://parquet.io)
 file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

-**All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`.**
+All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`.

 </div>

@@ -30,7 +29,7 @@ Spark.  At the core of this component is a new type of RDD,
 [JavaSchemaRDD](api/scala/index.html#org.apache.spark.sql.api.java.JavaSchemaRDD).  JavaSchemaRDDs are composed
 [Row](api/scala/index.html#org.apache.spark.sql.api.java.Row) objects along with
 a schema that describes the data types of each column in the row.  A JavaSchemaRDD is similar to a table
-in a traditional relational database.  A JavaSchemaRDD can be created from an existing RDD, parquet
+in a traditional relational database.  A JavaSchemaRDD can be created from an existing RDD, [Parquet](http://parquet.io)
 file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

 </div>

@@ -41,13 +40,15 @@ Spark.  At the core of this component is a new type of RDD,
 [SchemaRDD](api/python/pyspark.sql.SchemaRDD-class.html).  SchemaRDDs are composed
 [Row](api/python/pyspark.sql.Row-class.html) objects along with
 a schema that describes the data types of each column in the row.  A SchemaRDD is similar to a table
-in a traditional relational database.  A SchemaRDD can be created from an existing RDD, parquet
+in a traditional relational database.  A SchemaRDD can be created from an existing RDD, [Parquet](http://parquet.io)
 file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

-**All of the examples on this page use sample data included in the Spark distribution and can be run in the `pyspark` shell.**
+All of the examples on this page use sample data included in the Spark distribution and can be run in the `pyspark` shell.

 </div>
 </div>

+**Spark SQL is currently an alpha component. While we will minimize API changes, some APIs may change in future releases.**
+
 ***************************************************************************************************

 # Getting Started
@@ -240,8 +241,8 @@ Users that want a more complete dialect of SQL should look at the HiveQL support

 ## Using Parquet

-Parquet is a columnar format that is supported by many other data processing systems.  Spark SQL
-provides support for both reading and writing parquet files that automatically preserves the schema
+[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
+Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
 of the original data.  Using the data from the above example:

 <div class="codetabs">
@@ -254,11 +255,11 @@ import sqlContext._
 val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.

-// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored using parquet.
+// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored using Parquet.
 people.saveAsParquetFile("people.parquet")

 // Read in the parquet file created above.  Parquet files are self-describing so the schema is preserved.
-// The result of loading a parquet file is also a JavaSchemaRDD.
+// The result of loading a Parquet file is also a JavaSchemaRDD.
 val parquetFile = sqlContext.parquetFile("people.parquet")

 //Parquet files can also be registered as tables and then used in SQL statements.
@@ -275,10 +276,10 @@ teenagers.collect().foreach(println)
 JavaSchemaRDD schemaPeople = ... // The JavaSchemaRDD from the previous example.

-// JavaSchemaRDDs can be saved as parquet files, maintaining the schema information.
+// JavaSchemaRDDs can be saved as Parquet files, maintaining the schema information.
 schemaPeople.saveAsParquetFile("people.parquet");

-// Read in the parquet file created above.  Parquet files are self-describing so the schema is preserved.
+// Read in the Parquet file created above.  Parquet files are self-describing so the schema is preserved.
 // The result of loading a parquet file is also a JavaSchemaRDD.
 JavaSchemaRDD parquetFile = sqlCtx.parquetFile("people.parquet");
@@ -297,10 +298,10 @@ JavaSchemaRDD teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >=
 peopleTable # The SchemaRDD from the previous example.

-# SchemaRDDs can be saved as parquet files, maintaining the schema information.
+# SchemaRDDs can be saved as Parquet files, maintaining the schema information.
 peopleTable.saveAsParquetFile("people.parquet")

-# Read in the parquet file created above.  Parquet files are self-describing so the schema is preserved.
+# Read in the Parquet file created above.  Parquet files are self-describing so the schema is preserved.
 # The result of loading a parquet file is also a SchemaRDD.
 parquetFile = sqlCtx.parquetFile("people.parquet")
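The patch above edits the guide's Parquet examples in fragments. Pieced together, the full round-trip they describe can be sketched as a standalone Scala program against the Spark 1.0-era API (the `Person` case class, local master, and file path here are illustrative assumptions, not part of the patch):

```scala
// Sketch of the Parquet round-trip described in the guide, assuming Spark 1.0 APIs.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int) // hypothetical schema from the guide's earlier example

object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "ParquetRoundTrip")
    val sqlContext = new SQLContext(sc)
    import sqlContext._ // brings the RDD-to-SchemaRDD implicit conversion into scope

    // An RDD of case class objects; implicitly converted to a SchemaRDD when saved.
    val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 13)))
    people.saveAsParquetFile("people.parquet")

    // Parquet files are self-describing, so the schema (name: String, age: Int)
    // is recovered on load without being redeclared.
    val parquetFile = sqlContext.parquetFile("people.parquet")
    parquetFile.registerAsTable("parquetFile")
    val teenagers = sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
    teenagers.collect().foreach(println)
  }
}
```

This mirrors the Scala tab of the guide; the Java and Python tabs in the patch perform the same save/load/query sequence through `JavaSchemaRDD` and `pyspark` respectively.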