diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
new file mode 100644
index 0000000000..b6f21a5dc6
--- /dev/null
+++ b/docs/sql-programming-guide.md
@@ -0,0 +1,143 @@
+---
+layout: global
+title: Spark SQL Programming Guide
+---
+**Spark SQL is currently an Alpha component. Therefore, the APIs may be changed in future releases.**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+# Overview
+Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using
+Spark. At the core of this component is a new type of RDD,
+[SchemaRDD](api/sql/core/index.html#org.apache.spark.sql.SchemaRDD). SchemaRDDs are composed of
+[Row](api/sql/catalyst/index.html#org.apache.spark.sql.catalyst.expressions.Row) objects along with
+a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table
+in a traditional relational database. A SchemaRDD can be created from an existing RDD, from a
+Parquet file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).
+
+**All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell.**
+
+***************************************************************************************************
+
+# Getting Started
+
+The entry point into all relational functionality in Spark is the
+[SQLContext](api/sql/core/index.html#org.apache.spark.sql.SQLContext) class, or one of its
+descendants. To create a basic SQLContext, all you need is a SparkContext.
+
+{% highlight scala %}
+val sc: SparkContext // An existing SparkContext.
+val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+
+// Importing the SQL context gives access to all the public SQL functions and implicit conversions.
+import sqlContext._
+{% endhighlight %}
+
+## Running SQL on RDDs
+One type of table that is supported by Spark SQL is an RDD of Scala case classes. The case class
+defines the schema of the table. The names of the arguments to the case class are read using
+reflection and become the names of the columns. Case classes can also be nested or contain complex
+types such as Sequences or Arrays. This RDD can be implicitly converted to a SchemaRDD and then be
+registered as a table. Tables can be used in subsequent SQL statements.
+
+{% highlight scala %}
+val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+import sqlContext._
+
+// Define the schema using a case class.
+case class Person(name: String, age: Int)
+
+// Create an RDD of Person objects and register it as a table.
+val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
+people.registerAsTable("people")
+
+// SQL statements can be run by using the sql methods provided by sqlContext.
+val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
+
+// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
+// The columns of a row in the result can be accessed by ordinal.
+teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
+{% endhighlight %}
+
+**Note that Spark SQL currently uses a very basic SQL parser, and the keywords are case sensitive.**
+Users that want a more complete dialect of SQL should look at the HiveQL support provided by
+`HiveContext`.
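+
+For example (a minimal sketch, assuming the `people` table registered above and that the basic
+parser expects upper-case keywords):
+
+{% highlight scala %}
+// Upper-case keywords are recognized by the basic SQL parser.
+val teenagerNames = sql("SELECT name FROM people WHERE age >= 13")
+
+// A lower-case variant such as "select name from people" may be rejected by this parser.
+{% endhighlight %}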
+
+## Using Parquet
+
+Parquet is a columnar format that is supported by many other data processing systems. Spark SQL
+provides support for both reading and writing Parquet files, automatically preserving the schema
+of the original data. Using the data from the above example:
+
+{% highlight scala %}
+val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+import sqlContext._
+
+val people: RDD[Person] // An RDD of case class objects, from the previous example.
+
+// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored using parquet.
+people.saveAsParquetFile("people.parquet")
+
+// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
+// The result of loading a parquet file is also a SchemaRDD.
+val parquetFile = sqlContext.parquetFile("people.parquet")
+
+// Parquet files can also be registered as tables and then used in SQL statements.
+parquetFile.registerAsTable("parquetFile")
+val teenagers = sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
+teenagers.collect().foreach(println)
+{% endhighlight %}
+
+## Writing Language-Integrated Relational Queries
+
+Spark SQL also supports a domain-specific language for writing queries. Once again,
+using the data from the above examples:
+
+{% highlight scala %}
+val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+import sqlContext._
+val people: RDD[Person] // An RDD of case class objects, from the first example.
+
+// The following is the same as 'SELECT name FROM people WHERE age >= 13 AND age <= 19'
+val teenagers = people.where('age >= 13).where('age <= 19).select('name)
+{% endhighlight %}
+
+The DSL uses Scala symbols, identifiers prefixed with a tick (`'`), to represent columns in the
+underlying table. Implicit conversions turn these symbols into expressions that are
+evaluated by the SQL execution engine. A full list of the functions supported can be found in the
+[ScalaDoc](api/sql/core/index.html#org.apache.spark.sql.SchemaRDD).
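+
+For example (a hedged sketch reusing the `people` RDD from above), multiple columns can be selected
+and the results processed like any other SchemaRDD:
+
+{% highlight scala %}
+// Select both the name and age columns for the matching rows.
+val namesAndAges = people.where('age >= 13).where('age <= 19).select('name, 'age)
+
+// Columns of the resulting rows are accessed by ordinal, as before.
+namesAndAges.map(t => "Name: " + t(0) + ", Age: " + t(1)).collect().foreach(println)
+{% endhighlight %}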
+
+<!-- TODO: Include the table of operations here. -->
+
+# Hive Support
+
+Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
+However, since Hive has a large number of dependencies, it is not included in the default Spark assembly.
+In order to use Hive you must first run `sbt/sbt hive/assembly`. This command builds a new assembly
+jar that includes Hive. When this jar is present, Spark will use the Hive
+assembly instead of the normal Spark assembly. Note that this Hive assembly jar must also be present
+on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries
+(SerDes) in order to access data stored in Hive.
+
+Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`.
+
+When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
+adds support for finding tables in the MetaStore and writing queries using HiveQL. Users who do
+not have an existing Hive deployment can also experiment with the `LocalHiveContext`,
+which is similar to `HiveContext`, but creates a local copy of the `metastore` and `warehouse`
+automatically.
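+
+As a minimal sketch (assuming `LocalHiveContext` lives in the same package as `HiveContext` and that
+creating local `metastore` and `warehouse` directories in the working directory is acceptable), it is
+constructed the same way:
+
+{% highlight scala %}
+val sc: SparkContext // An existing SparkContext.
+
+// LocalHiveContext sets up a local metastore and warehouse automatically,
+// so no existing Hive deployment is required.
+val localHiveContext = new org.apache.spark.sql.hive.LocalHiveContext(sc)
+{% endhighlight %}
+
+The rest of this section uses a full `HiveContext`: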
+
+{% highlight scala %}
+val sc: SparkContext // An existing SparkContext.
+val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
+
+// Importing the SQL context gives access to all the public SQL functions and implicit conversions.
+import hiveContext._
+
+sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
+sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
+
+// Queries are expressed in HiveQL
+sql("SELECT key, value FROM src").collect().foreach(println)
+{% endhighlight %}
\ No newline at end of file