From 6e3f0c7810a6721698b0ed51cfbd41a0cd07a4a3 Mon Sep 17 00:00:00 2001
From: Cheng Lian
Date: Sat, 30 May 2015 12:16:09 -0700
Subject: [SPARK-7849] [SQL] [Docs] Updates SQL programming guide for 1.4

Author: Cheng Lian

Closes #6520 from liancheng/spark-7849 and squashes the following commits:

705264b [Cheng Lian] Updates SQL programming guide for 1.4
---
 docs/sql-programming-guide.md | 91 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 85 insertions(+), 6 deletions(-)

(limited to 'docs/sql-programming-guide.md')

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 7cc0a87fd5..4ec3d83016 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -11,6 +11,7 @@ title: Spark SQL and DataFrames
 Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can
 also act as distributed SQL query engine.
+For how to enable Hive support, please refer to the [Hive Tables](#hive-tables) section.
 
 # DataFrames
 
@@ -906,7 +907,7 @@ new data.
     Ignore mode means that when saving a DataFrame to a data source, if data already exists,
     the save operation is expected to not save the contents of the DataFrame and to not
-    change the existing data. This is similar to a `CREATE TABLE IF NOT EXISTS` in SQL.
+    change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
 
@@ -1030,7 +1031,7 @@ teenagers <- sql(sqlContext, "SELECT name FROM parquetFile WHERE age >= 13 AND a
 teenNames <- map(teenagers, function(p) { paste("Name:", p$name)})
 for (teenName in collect(teenNames)) {
   cat(teenName, "\n")
-}
+}
 {% endhighlight %}
 
@@ -1502,7 +1503,7 @@ Row[] results = sqlContext.sql("FROM src SELECT key, value").collect();
 
 When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
-adds support for finding tables in the MetaStore and writing queries using HiveQL.
+adds support for finding tables in the MetaStore and writing queries using HiveQL.
 
 {% highlight python %}
 # sc is an existing SparkContext.
 from pyspark.sql import HiveContext
@@ -1537,6 +1538,82 @@ results = sqlContext.sql("FROM src SELECT key, value").collect()
+### Interacting with Different Versions of Hive Metastore
+
+One of the most important pieces of Spark SQL's Hive support is its interaction with the Hive
+metastore, which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.2.0,
+Spark SQL can talk to two versions of the Hive metastore, either 0.12.0 or 0.13.1, defaulting to
+the latter. However, to switch to the desired Hive metastore version, users had to rebuild the
+assembly jar with the proper profile flags (either `-Phive-0.12.0` or `-Phive-0.13.1`), which was
+quite inconvenient.
+
+Starting from 1.4.0, users no longer need to rebuild the assembly jar to switch the Hive metastore
+version. Instead, the configuration properties described in the table below can be used to specify
+the desired Hive metastore version (a brief configuration sketch follows the table). Currently,
+supported versions are still limited to 0.13.1 and 0.12.0, but we are working on a more generalized
+mechanism to support a wider range of versions.
+
+Internally, Spark SQL 1.4.0 uses two Hive clients: one for executing native Hive commands like `SET`
+and `DESCRIBE`, and the other dedicated to communicating with the Hive metastore. The former uses
+Hive jars of version 0.13.1, which are bundled with Spark 1.4.0. The latter uses Hive jars of the
+version specified by users. An isolated classloader is used here to avoid dependency conflicts.
+
+<table class="table">
+  <tr><th>Property Name</th><th>Meaning</th></tr>
+  <tr>
+    <td><code>spark.sql.hive.metastore.version</code></td>
+    <td>
+      The version of the hive client that will be used to communicate with the metastore. Available
+      options are <code>0.12.0</code> and <code>0.13.1</code>. Defaults to <code>0.13.1</code>.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.sql.hive.metastore.jars</code></td>
+    <td>
+      The location of the jars that should be used to instantiate the <code>HiveMetastoreClient</code>.
+      This property can be one of three options:
+      <ol>
+        <li><code>builtin</code><br/>
+          Use Hive 0.13.1, which is bundled with the Spark assembly jar when <code>-Phive</code> is
+          enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
+          either <code>0.13.1</code> or not defined.</li>
+        <li><code>maven</code><br/>
+          Use Hive jars of specified version downloaded from Maven repositories.</li>
+        <li>A classpath in the standard format for both Hive and Hadoop.</li>
+      </ol>
+      Defaults to <code>builtin</code>.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.sql.hive.metastore.sharedPrefixes</code></td>
+    <td>
+      <p>
+        A comma separated list of class prefixes that should be loaded using the classloader that is
+        shared between Spark SQL and a specific version of Hive. An example of classes that should
+        be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need
+        to be shared are those that interact with classes that are already shared. For example,
+        custom appenders that are used by log4j.
+      </p>
+      <p>
+        Defaults to <code>com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc</code>.
+      </p>
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.sql.hive.metastore.barrierPrefixes</code></td>
+    <td>
+      <p>
+        A comma separated list of class prefixes that should explicitly be reloaded for each version
+        of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a
+        prefix that typically would be shared (i.e. <code>org.apache.spark.*</code>).
+      </p>
+      <p>
+        Defaults to empty.
+      </p>
+    </td>
+  </tr>
+</table>
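As a rough illustration of the properties listed above (this sketch is editorial and not part of the patch itself), one plausible way to point Spark SQL at a Hive 0.12.0 metastore from PySpark is to set the properties on the `SparkConf` before the `HiveContext` is created. The application name, the choice of version 0.12.0, and the use of `maven` as the jar source are assumptions made purely for the example.

{% highlight python %}
# Hypothetical sketch: the metastore version, jar source, and app name are placeholders.
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .setAppName("metastore-version-demo")
        # Talk to a 0.12.0 metastore instead of the bundled default (0.13.1).
        .set("spark.sql.hive.metastore.version", "0.12.0")
        # Download matching Hive jars from Maven; per the table above, a
        # Hive/Hadoop classpath string would also be accepted here.
        .set("spark.sql.hive.metastore.jars", "maven"))

sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)   # the metastore settings take effect when the HiveContext starts
sqlContext.sql("SHOW TABLES").show()
{% endhighlight %}

The same properties could equally be supplied in `spark-defaults.conf` or via `--conf` on `spark-submit`; the essential point is that they are in place before the `HiveContext` initializes its metastore client.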
+
 
 ## JDBC To Other Databases
 
 Spark SQL also includes a data source that can read data from other databases using JDBC. This
@@ -1570,7 +1647,7 @@ the Data Sources API. The following options are supported:
 
     dbtable
 
-      The JDBC table that should be read. Note that anything that is valid in a `FROM` clause of
+      The JDBC table that should be read. Note that anything that is valid in a FROM clause of
       a SQL query can be used. For example, instead of a full table you could also use a
       subquery in parentheses.
 
@@ -1714,7 +1791,7 @@ that these options will be deprecated in future release as more optimizations ar
     Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
     performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
     statistics are only supported for Hive Metastore tables where the command
-    `ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan` has been run.
+    ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
 
@@ -1737,7 +1814,9 @@ that these options will be deprecated in future release as more optimizations ar
 
 # Distributed SQL Engine
 
-Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, without the need to write any code.
+Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface.
+In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries,
+without the need to write any code.
 
 ## Running the Thrift JDBC/ODBC server
-- cgit v1.2.3
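Stepping outside the patch text above: the dbtable option discussed in the JDBC hunk accepts anything that is valid in a FROM clause, including a parenthesized subquery. A minimal PySpark sketch of that usage follows; the connection URL, driver class, and query are placeholders, and the JDBC driver jar is assumed to already be on the classpath.

{% highlight python %}
# Hypothetical sketch of the JDBC data source; all connection details are placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="jdbc-read-demo")
sqlContext = SQLContext(sc)

jdbc_df = (sqlContext.read.format("jdbc")
           .options(url="jdbc:postgresql://dbserver:5432/sales",
                    # A subquery in parentheses is valid wherever a table name is,
                    # as the guide notes for the dbtable option.
                    dbtable="(SELECT id, name FROM people WHERE age >= 13) AS teens",
                    driver="org.postgresql.Driver")
           .load())

jdbc_df.printSchema()
{% endhighlight %}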