author    Cheng Lian <lian@databricks.com>  2015-05-30 12:16:09 -0700
committer Reynold Xin <rxin@databricks.com> 2015-05-30 12:16:09 -0700
commit    6e3f0c7810a6721698b0ed51cfbd41a0cd07a4a3 (patch)
tree      f5ce8b62156e30630ebb0a5b30c2977bf202cb28 /docs
parent    d34b43bd5964e1feb03a17937de87a3f718806a5 (diff)
[SPARK-7849] [SQL] [Docs] Updates SQL programming guide for 1.4
Author: Cheng Lian <lian@databricks.com>

Closes #6520 from liancheng/spark-7849 and squashes the following commits:

705264b [Cheng Lian] Updates SQL programming guide for 1.4
Diffstat (limited to 'docs')
-rw-r--r-- docs/sql-programming-guide.md | 91
1 file changed, 85 insertions(+), 6 deletions(-)
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 7cc0a87fd5..4ec3d83016 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -11,6 +11,7 @@ title: Spark SQL and DataFrames
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.
+For information on how to enable Hive support, please refer to the [Hive Tables](#hive-tables) section.
# DataFrames
@@ -906,7 +907,7 @@ new data.
<td>
Ignore mode means that when saving a DataFrame to a data source, if data already exists,
the save operation is expected to not save the contents of the DataFrame and to not
- change the existing data. This is similar to a `CREATE TABLE IF NOT EXISTS` in SQL.
+ change the existing data. This is similar to a <code>CREATE TABLE IF NOT EXISTS</code> in SQL.
</td>
</tr>
</table>
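
For example, a minimal sketch of `Ignore` mode using the `DataFrameWriter` API introduced in 1.4
(the DataFrame `df` and the output path are illustrative placeholders):

{% highlight python %}
# df is an existing DataFrame; the path below is a placeholder.
# If data already exists at the path, the write is skipped and the
# existing data is left unchanged.
df.write.mode("ignore").parquet("path/to/output")
{% endhighlight %}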
@@ -1030,7 +1031,7 @@ teenagers <- sql(sqlContext, "SELECT name FROM parquetFile WHERE age >= 13 AND a
teenNames <- map(teenagers, function(p) { paste("Name:", p$name)})
for (teenName in collect(teenNames)) {
cat(teenName, "\n")
-}
+}
{% endhighlight %}
</div>
@@ -1502,7 +1503,7 @@ Row[] results = sqlContext.sql("FROM src SELECT key, value").collect();
<div data-lang="python" markdown="1">
When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
-adds support for finding tables in the MetaStore and writing queries using HiveQL.
+adds support for finding tables in the MetaStore and writing queries using HiveQL.
{% highlight python %}
# sc is an existing SparkContext.
from pyspark.sql import HiveContext
@@ -1537,6 +1538,82 @@ results = sqlContext.sql("FROM src SELECT key, value").collect()
</div>
</div>
+### Interacting with Different Versions of Hive Metastore
+
+One of the most important pieces of Spark SQL's Hive support is interaction with the Hive
+metastore, which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.2.0,
+Spark SQL can talk to two versions of the Hive metastore, either 0.12.0 or 0.13.1, defaulting to
+the latter. However, to switch to the desired Hive metastore version, users have to rebuild the
+assembly jar with the proper profile flag (either `-Phive-0.12.0` or `-Phive-0.13.1`), which is
+quite inconvenient.
+
+Starting from 1.4.0, users no longer need to rebuild the assembly jar to switch the Hive metastore
+version. Instead, the configuration properties described in the table below can be used to specify
+the desired Hive metastore version. Currently, supported versions are still limited to 0.13.1 and
+0.12.0, but we are working on a more generalized mechanism to support a wider range of versions.
+
+Internally, Spark SQL 1.4.0 uses two Hive clients: one for executing native Hive commands like
+`SET` and `DESCRIBE`, and the other dedicated to communicating with the Hive metastore. The former
+uses Hive jars of version 0.13.1, which are bundled with Spark 1.4.0. The latter uses Hive jars of
+the version specified by users. An isolated classloader is used here to avoid dependency conflicts.
+
+<table class="table">
+ <tr><th>Property Name</th><th>Meaning</th></tr>
+ <tr>
+ <td><code>spark.sql.hive.metastore.version</code></td>
+ <td>
+ The version of the Hive client that will be used to communicate with the metastore. Available
+ options are <code>0.12.0</code> and <code>0.13.1</code>. Defaults to <code>0.13.1</code>.
+ </td>
+ </tr>
+
+ <tr>
+ <td><code>spark.sql.hive.metastore.jars</code></td>
+ <td>
+ The location of the jars that should be used to instantiate the HiveMetastoreClient. This
+ property can be one of three options:
+ <ol>
+   <li>
+     <code>builtin</code>: Use Hive 0.13.1, which is bundled with the Spark assembly jar when
+     <code>-Phive</code> is enabled. When this option is chosen,
+     <code>spark.sql.hive.metastore.version</code> must be either <code>0.13.1</code> or not
+     defined.
+   </li>
+   <li>
+     <code>maven</code>: Use Hive jars of the specified version downloaded from Maven
+     repositories.
+   </li>
+   <li>A classpath in the standard format for both Hive and Hadoop.</li>
+ </ol>
+ Defaults to <code>builtin</code>.
+ </td>
+ </tr>
+
+ <tr>
+ <td><code>spark.sql.hive.metastore.sharedPrefixes</code></td>
+ <td>
+ <p>
+ A comma-separated list of class prefixes that should be loaded using the classloader that is
+ shared between Spark SQL and a specific version of Hive. An example of classes that should
+ be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need
+ to be shared are those that interact with classes that are already shared, for example
+ custom appenders that are used by log4j.
+ </p>
+ <p>
+ Defaults to <code>com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc</code>.
+ </p>
+ </td>
+ </tr>
+
+ <tr>
+ <td><code>spark.sql.hive.metastore.barrierPrefixes</code></td>
+ <td>
+ <p>
+ A comma-separated list of class prefixes that should explicitly be reloaded for each version
+ of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a
+ prefix that typically would be shared (i.e. <code>org.apache.spark.*</code>).
+ </p>
+ <p>Defaults to empty.</p>
+ </td>
+ </tr>
+</table>
+
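+As a non-authoritative sketch, these properties can be set on the `SparkConf` before the
+`HiveContext` is constructed, since the metastore client is instantiated at context creation:
+
+{% highlight python %}
+from pyspark import SparkConf, SparkContext
+from pyspark.sql import HiveContext
+
+# Pin the metastore client to Hive 0.12.0 and resolve its jars from
+# Maven; property names are those described in the table above.
+conf = (SparkConf()
+        .set("spark.sql.hive.metastore.version", "0.12.0")
+        .set("spark.sql.hive.metastore.jars", "maven"))
+sc = SparkContext(conf=conf)
+sqlContext = HiveContext(sc)
+{% endhighlight %}
+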
## JDBC To Other Databases
Spark SQL also includes a data source that can read data from other databases using JDBC. This
@@ -1570,7 +1647,7 @@ the Data Sources API. The following options are supported:
<tr>
<td><code>dbtable</code></td>
<td>
- The JDBC table that should be read. Note that anything that is valid in a `FROM` clause of
+ The JDBC table that should be read. Note that anything that is valid in a <code>FROM</code> clause of
a SQL query can be used. For example, instead of a full table you could also use a
subquery in parentheses.
</td>
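
For example, a hedged sketch of passing a parenthesized subquery as `dbtable` with the 1.4
`DataFrameReader` API (the connection URL and query below are placeholders):

{% highlight python %}
# Anything valid in a FROM clause works as dbtable, including a
# parenthesized subquery with an alias; URL and query are illustrative.
df = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://localhost/test",
    dbtable="(SELECT id, name FROM people WHERE age > 20) AS t"
).load()
{% endhighlight %}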
@@ -1714,7 +1791,7 @@ that these options will be deprecated in future release as more optimizations ar
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
statistics are only supported for Hive Metastore tables where the command
- `ANALYZE TABLE &lt;tableName&gt; COMPUTE STATISTICS noscan` has been run.
+ <code>ANALYZE TABLE &lt;tableName&gt; COMPUTE STATISTICS noscan</code> has been run.
</td>
</tr>
<tr>
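
As a non-authoritative sketch, collecting statistics and adjusting the threshold might look like
this from a `HiveContext` (the table name is a placeholder):

{% highlight python %}
# Collect size statistics so the planner can consider broadcasting
# the table; in this release this works only for Hive metastore tables.
sqlContext.sql("ANALYZE TABLE src COMPUTE STATISTICS noscan")
# The threshold is in bytes; this raises it to roughly 50 MB.
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=52428800")
{% endhighlight %}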
@@ -1737,7 +1814,9 @@ that these options will be deprecated in future release as more optimizations ar
# Distributed SQL Engine
-Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, without the need to write any code.
+Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface.
+In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries,
+without the need to write any code.
## Running the Thrift JDBC/ODBC server