path: root/docs/sql-programming-guide.md
author    Cheng Lian <lian@databricks.com>    2016-07-23 11:41:24 -0700
committer Reynold Xin <rxin@databricks.com>   2016-07-23 11:41:24 -0700
commit    53b2456d1de38b9d4f18509e7b36eb3fbe09e050 (patch)
tree      7a783f09648b4c86ec04b9fd26e9ef6871f2d352 /docs/sql-programming-guide.md
parent    86c275206605c44e1ebca2f166d62868e44bf029 (diff)
[SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding
This PR is based on PR #14098 authored by wangmiao1981.

## What changes were proposed in this pull request?

This PR replaces the original Python Spark SQL example file with the following three files:

- `sql/basic.py`: Demonstrates basic Spark SQL features.
- `sql/datasource.py`: Demonstrates various Spark SQL data sources.
- `sql/hive.py`: Demonstrates Spark SQL Hive interaction.

This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.

## How was this patch tested?

Manually tested.

Author: wm624@hotmail.com <wm624@hotmail.com>
Author: Cheng Lian <lian@databricks.com>

Closes #14317 from liancheng/py-examples-update.
Diffstat (limited to 'docs/sql-programming-guide.md')
-rw-r--r--  docs/sql-programming-guide.md  229
1 file changed, 13 insertions, 216 deletions
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 3af935a952..ad123d7cea 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -79,7 +79,7 @@ The entry point into all functionality in Spark is the [`SparkSession`](api/java
The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
-{% include_example init_session python/sql.py %}
+{% include_example init_session python/sql/basic.py %}
</div>
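For readers without the new example files at hand, a minimal sketch of what the `init_session` snippet in `python/sql/basic.py` plausibly contains (only the path and label come from the diff; the builder options shown here are illustrative):

{% highlight python %}
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; the app name and config key are placeholders
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
{% endhighlight %}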
<div data-lang="r" markdown="1">
@@ -123,14 +123,7 @@ from a Hive table, or from [Spark data sources](#data-sources).
As an example, the following creates a DataFrame based on the content of a JSON file:
-{% highlight python %}
-# spark is an existing SparkSession
-df = spark.read.json("examples/src/main/resources/people.json")
-
-# Displays the content of the DataFrame to stdout
-df.show()
-{% endhighlight %}
-
+{% include_example create_df python/sql/basic.py %}
</div>
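A sketch of the extracted `create_df` snippet, assuming it mirrors the inline code removed above; the rendered output is inferred from the `people.json` sample used throughout this guide:

{% highlight python %}
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")

# Displays the content of the DataFrame to stdout
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
{% endhighlight %}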
<div data-lang="r" markdown="1">
@@ -178,53 +171,7 @@ interactive data exploration, users are highly encouraged to use the
latter form, which is future-proof and won't break with column names that
are also attributes on the DataFrame class.
-{% highlight python %}
-# spark is an existing SparkSession
-
-# Create the DataFrame
-df = spark.read.json("examples/src/main/resources/people.json")
-
-# Show the content of the DataFrame
-df.show()
-## age name
-## null Michael
-## 30 Andy
-## 19 Justin
-
-# Print the schema in a tree format
-df.printSchema()
-## root
-## |-- age: long (nullable = true)
-## |-- name: string (nullable = true)
-
-# Select only the "name" column
-df.select("name").show()
-## name
-## Michael
-## Andy
-## Justin
-
-# Select everybody, but increment the age by 1
-df.select(df['name'], df['age'] + 1).show()
-## name (age + 1)
-## Michael null
-## Andy 31
-## Justin 20
-
-# Select people older than 21
-df.filter(df['age'] > 21).show()
-## age name
-## 30 Andy
-
-# Count people by age
-df.groupBy("age").count().show()
-## age count
-## null 1
-## 19 1
-## 30 1
-
-{% endhighlight %}
-
+{% include_example untyped_ops python/sql/basic.py %}
For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/python/pyspark.sql.html#pyspark.sql.DataFrame).
In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/pyspark.sql.html#module-pyspark.sql.functions).
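The function library mentioned above is not part of the extracted snippet; a short illustrative sketch of `pyspark.sql.functions` usage (the column names match the `people.json` example, but the expressions themselves are made up):

{% highlight python %}
from pyspark.sql import functions as F

# df is the DataFrame loaded from people.json above
df.select(
    F.upper(F.col("name")).alias("upper_name"),
    (F.coalesce(F.col("age"), F.lit(0)) + 1).alias("age_plus_one")
).show()
{% endhighlight %}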
@@ -261,10 +208,7 @@ The `sql` function on a `SparkSession` enables applications to run SQL queries p
<div data-lang="python" markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
-{% highlight python %}
-# spark is an existing SparkSession
-df = spark.sql("SELECT * FROM table")
-{% endhighlight %}
+{% include_example run_sql python/sql/basic.py %}
</div>
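A sketch of what the `run_sql` snippet likely contains; the removed inline code queried a placeholder table, so registering a temporary view first is an assumption here:

{% highlight python %}
# spark is an existing SparkSession; df is the DataFrame created earlier
df.createOrReplaceTempView("people")

sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
{% endhighlight %}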
<div data-lang="r" markdown="1">
@@ -339,29 +283,7 @@ Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the dataty
key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
-{% highlight python %}
-# spark is an existing SparkSession.
-from pyspark.sql import Row
-sc = spark.sparkContext
-
-# Load a text file and convert each line to a Row.
-lines = sc.textFile("examples/src/main/resources/people.txt")
-parts = lines.map(lambda l: l.split(","))
-people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
-
-# Infer the schema, and register the DataFrame as a table.
-schemaPeople = spark.createDataFrame(people)
-schemaPeople.createOrReplaceTempView("people")
-
-# SQL can be run over DataFrames that have been registered as a table.
-teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
-
-# The results of SQL queries are RDDs and support all the normal RDD operations.
-teenNames = teenagers.map(lambda p: "Name: " + p.name)
-for teenName in teenNames.collect():
- print(teenName)
-{% endhighlight %}
-
+{% include_example schema_inferring python/sql/basic.py %}
</div>
</div>
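The removed snippet maps directly over the result DataFrame, which Python DataFrames no longer support in Spark 2.0; a sketch of the `schema_inferring` snippet with that call routed through `.rdd` (the adjustment is an assumption, not something shown in this diff):

{% highlight python %}
from pyspark.sql import Row

# spark is an existing SparkSession
sc = spark.sparkContext

# Load a text file and convert each line to a Row
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema and register the DataFrame as a temporary view
schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView("people")

# SQL can be run over DataFrames that have been registered as a table
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

# The result is a DataFrame; use .rdd for RDD-style operations
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()
for name in teenNames:
    print(name)
{% endhighlight %}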
@@ -419,39 +341,8 @@ tuples or lists in the RDD created in the step 1.
3. Apply the schema to the RDD via the `createDataFrame` method provided by `SparkSession`.
For example:
-{% highlight python %}
-# Import SparkSession and data types
-from pyspark.sql.types import *
-
-# spark is an existing SparkSession.
-sc = spark.sparkContext
-
-# Load a text file and convert each line to a tuple.
-lines = sc.textFile("examples/src/main/resources/people.txt")
-parts = lines.map(lambda l: l.split(","))
-people = parts.map(lambda p: (p[0], p[1].strip()))
-
-# The schema is encoded in a string.
-schemaString = "name age"
-
-fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
-schema = StructType(fields)
-
-# Apply the schema to the RDD.
-schemaPeople = spark.createDataFrame(people, schema)
-
-# Creates a temporary view using the DataFrame
-schemaPeople.createOrReplaceTempView("people")
-
-# SQL can be run over DataFrames that have been registered as a table.
-results = spark.sql("SELECT name FROM people")
-
-# The results of SQL queries are RDDs and support all the normal RDD operations.
-names = results.map(lambda p: "Name: " + p.name)
-for name in names.collect():
- print(name)
-{% endhighlight %}
+{% include_example programmatic_schema python/sql/basic.py %}
</div>
</div>
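Likewise, a sketch of the `programmatic_schema` snippet, with explicit imports and the final query shown via `show()` rather than mapping over the DataFrame (again, an assumption about the extracted file):

{% highlight python %}
from pyspark.sql.types import StructField, StructType, StringType

# spark is an existing SparkSession
sc = spark.sparkContext

# Load a text file and convert each line to a tuple of strings
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD and register a temporary view
schemaPeople = spark.createDataFrame(people, schema)
schemaPeople.createOrReplaceTempView("people")

results = spark.sql("SELECT name FROM people")
results.show()
{% endhighlight %}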
@@ -481,13 +372,7 @@ In the simplest form, the default data source (`parquet` unless otherwise config
<div data-lang="python" markdown="1">
-{% highlight python %}
-
-df = spark.read.load("examples/src/main/resources/users.parquet")
-df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
-
-{% endhighlight %}
-
+{% include_example generic_load_save_functions python/sql/datasource.py %}
</div>
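The `generic_load_save_functions` snippet presumably mirrors the two lines removed above:

{% highlight python %}
df = spark.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
{% endhighlight %}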
<div data-lang="r" markdown="1">
@@ -516,13 +401,7 @@ using this syntax.
<div data-lang="python" markdown="1">
-{% highlight python %}
-
-df = spark.read.load("examples/src/main/resources/people.json", format="json")
-df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")
-
-{% endhighlight %}
-
+{% include_example manual_load_options python/sql/datasource.py %}
</div>
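Similarly, the `manual_load_options` snippet presumably keeps the explicit `format` arguments from the removed code:

{% highlight python %}
df = spark.read.load("examples/src/main/resources/people.json", format="json")
df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")
{% endhighlight %}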
<div data-lang="r" markdown="1">
@@ -547,10 +426,7 @@ file directly with SQL.
<div data-lang="python" markdown="1">
-{% highlight python %}
-df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
-{% endhighlight %}
-
+{% include_example direct_sql python/sql/datasource.py %}
</div>
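And the `direct_sql` snippet is presumably the same one-liner, querying a Parquet file directly by path:

{% highlight python %}
df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
{% endhighlight %}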
<div data-lang="r" markdown="1">
@@ -642,26 +518,7 @@ Using the data from the above example:
<div data-lang="python" markdown="1">
-{% highlight python %}
-# spark from the previous example is used in this example.
-
-schemaPeople # The DataFrame from the previous example.
-
-# DataFrames can be saved as Parquet files, maintaining the schema information.
-schemaPeople.write.parquet("people.parquet")
-
-# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
-# The result of loading a parquet file is also a DataFrame.
-parquetFile = spark.read.parquet("people.parquet")
-
-# Parquet files can also be used to create a temporary view and then used in SQL statements.
-parquetFile.createOrReplaceTempView("parquetFile");
-teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
-teenNames = teenagers.map(lambda p: "Name: " + p.name)
-for teenName in teenNames.collect():
- print(teenName)
-{% endhighlight %}
-
+{% include_example basic_parquet_example python/sql/datasource.py %}
</div>
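The removed Parquet snippet carries a stray semicolon and maps over a DataFrame; a sketch of `basic_parquet_example` with those points tidied up (the clean-up itself is an assumption):

{% highlight python %}
# schemaPeople is the DataFrame from the earlier examples
# DataFrames can be saved as Parquet files, maintaining the schema information
schemaPeople.write.parquet("people.parquet")

# Parquet files are self-describing, so the schema is preserved on read
parquetFile = spark.read.parquet("people.parquet")

# Parquet DataFrames can also be registered as temporary views for SQL
parquetFile.createOrReplaceTempView("parquetFile")
teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()
for name in teenNames:
    print(name)
{% endhighlight %}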
<div data-lang="r" markdown="1">
@@ -775,33 +632,7 @@ turned it off by default starting from 1.5.0. You may enable it by
<div data-lang="python" markdown="1">
-{% highlight python %}
-# spark from the previous example is used in this example.
-
-# Create a simple DataFrame, stored into a partition directory
-df1 = spark.createDataFrame(sc.parallelize(range(1, 6))\
- .map(lambda i: Row(single=i, double=i * 2)))
-df1.write.parquet("data/test_table/key=1")
-
-# Create another DataFrame in a new partition directory,
-# adding a new column and dropping an existing column
-df2 = spark.createDataFrame(sc.parallelize(range(6, 11))
- .map(lambda i: Row(single=i, triple=i * 3)))
-df2.write.parquet("data/test_table/key=2")
-
-# Read the partitioned table
-df3 = spark.read.option("mergeSchema", "true").parquet("data/test_table")
-df3.printSchema()
-
-# The final schema consists of all 3 columns in the Parquet files together
-# with the partitioning column appeared in the partition directory paths.
-# root
-# |-- single: int (nullable = true)
-# |-- double: int (nullable = true)
-# |-- triple: int (nullable = true)
-# |-- key : int (nullable = true)
-{% endhighlight %}
-
+{% include_example schema_merging python/sql/datasource.py %}
</div>
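A self-contained sketch of the `schema_merging` snippet; the removed code leaned on `sc` and `Row` from earlier examples, so the imports here are added for completeness:

{% highlight python %}
from pyspark.sql import Row

# spark is an existing SparkSession
sc = spark.sparkContext

# Create a simple DataFrame, stored into a partition directory
df1 = spark.createDataFrame(
    sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i * 2)))
df1.write.parquet("data/test_table/key=1")

# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing one
df2 = spark.createDataFrame(
    sc.parallelize(range(6, 11)).map(lambda i: Row(single=i, triple=i * 3)))
df2.write.parquet("data/test_table/key=2")

# Read the partitioned table with schema merging enabled
df3 = spark.read.option("mergeSchema", "true").parquet("data/test_table")
df3.printSchema()
# The merged schema contains single, double, and triple,
# plus the partitioning column key from the directory paths
{% endhighlight %}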
<div data-lang="r" markdown="1">
@@ -996,31 +827,7 @@ Note that the file that is offered as _a json file_ is not a typical JSON file.
line must contain a separate, self-contained valid JSON object. As a consequence,
a regular multi-line JSON file will most often fail.
-{% highlight python %}
-# spark is an existing SparkSession.
-
-# A JSON dataset is pointed to by path.
-# The path can be either a single text file or a directory storing text files.
-people = spark.read.json("examples/src/main/resources/people.json")
-
-# The inferred schema can be visualized using the printSchema() method.
-people.printSchema()
-# root
-# |-- age: long (nullable = true)
-# |-- name: string (nullable = true)
-
-# Creates a temporary view using the DataFrame.
-people.createOrReplaceTempView("people")
-
-# SQL statements can be run by using the sql methods provided by `spark`.
-teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
-
-# Alternatively, a DataFrame can be created for a JSON dataset represented by
-# an RDD[String] storing one JSON object per string.
-anotherPeopleRDD = sc.parallelize([
- '{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}'])
-anotherPeople = spark.jsonRDD(anotherPeopleRDD)
-{% endhighlight %}
+{% include_example json_dataset python/sql/datasource.py %}
</div>
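The removed snippet calls `spark.jsonRDD(...)`, which is not a `SparkSession` method; the extracted `json_dataset` example presumably reads the RDD through `spark.read.json` instead (that substitution is an assumption):

{% highlight python %}
# spark is an existing SparkSession
sc = spark.sparkContext

# A JSON dataset is pointed to by path; the path can be a single text file
# or a directory storing text files
people = spark.read.json("examples/src/main/resources/people.json")
people.printSchema()

# Register a temporary view and query it with SQL
people.createOrReplaceTempView("people")
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.show()

# A DataFrame can also be created from an RDD[String] storing one JSON object per string
anotherPeopleRDD = sc.parallelize(
    ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}'])
anotherPeople = spark.read.json(anotherPeopleRDD)
anotherPeople.show()
{% endhighlight %}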
<div data-lang="r" markdown="1">
@@ -1110,17 +917,7 @@ the `hive.metastore.warehouse.dir` property in `hive-site.xml` is deprecated sin
Instead, use `spark.sql.warehouse.dir` to specify the default location of databases in the warehouse.
You may need to grant write privileges to the user who starts the Spark application.
-{% highlight python %}
-# spark is an existing SparkSession
-
-spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
-spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
-
-# Queries can be expressed in HiveQL.
-results = spark.sql("FROM src SELECT key, value").collect()
-
-{% endhighlight %}
-
+{% include_example spark_hive python/sql/hive.py %}
</div>
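The removed Hive snippet assumed an already Hive-enabled session; the new `spark_hive.py` presumably builds one with `enableHiveSupport()`, roughly along these lines (the warehouse path and app name are illustrative):

{% highlight python %}
from os.path import abspath
from pyspark.sql import SparkSession

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath("spark-warehouse")

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries are expressed in HiveQL
spark.sql("SELECT key, value FROM src").show()
{% endhighlight %}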
<div data-lang="r" markdown="1">