path: root/docs/sql-programming-guide.md
author    hyukjinkwon <gurwls223@gmail.com>  2016-10-10 22:22:41 -0700
committer gatorsmile <gatorsmile@gmail.com>  2016-10-10 22:22:41 -0700
commit    0c0ad436ad909364915b910867d08262c62bc95d (patch)
tree      b8afcc1aa41d83596258fc541ab99e01231b851e /docs/sql-programming-guide.md
parent    19a5bae47f69929d00d9de43387c7df37a05ee25 (diff)
download  spark-0c0ad436ad909364915b910867d08262c62bc95d.tar.gz
          spark-0c0ad436ad909364915b910867d08262c62bc95d.tar.bz2
          spark-0c0ad436ad909364915b910867d08262c62bc95d.zip
[SPARK-17719][SPARK-17776][SQL] Unify and tie up options in a single place in JDBC datasource package
## What changes were proposed in this pull request?

This PR proposes to fix the arbitrary mixture of `Map[String, String]`, `Properties` and `JDBCOptions` instances used for options in the `execution/jdbc` package, and to make the connection properties exclude Spark-only options. This PR includes the changes below:

- Unify `Map[String, String]`, `Properties` and `JDBCOptions` in the `execution/jdbc` package into `JDBCOptions`.
- Move the `batchsize`, `fetchsize`, `driver` and `isolationlevel` options into the `JDBCOptions` instance.
- Document `batchsize` and `isolationlevel`, marking both read-only and write-only options. This also includes minor typo fixes and more detailed explanations for some statements such as `url`.
- Throw exceptions early by checking arguments first rather than at execution time (e.g., for `fetchsize`).
- Exclude Spark-only options from the connection properties.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15292 from HyukjinKwon/SPARK-17719.
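As a user-facing sketch of where these options end up (illustrative only, not part of this patch; the URL, table name and credentials below are placeholders), the snippet shows a JDBC read in which `fetchsize` is one of the Spark-handled options while `user` and `password` are forwarded to the driver as connection properties:

```scala
import org.apache.spark.sql.SparkSession

object JdbcReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-read-sketch").getOrCreate()

    // `fetchsize` is interpreted by Spark's JDBC source; `user` and `password`
    // are passed through to the JDBC driver as connection properties.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost/test")  // placeholder URL
      .option("dbtable", "schema.tablename")              // placeholder table
      .option("user", "username")
      .option("password", "password")
      .option("fetchsize", "100")
      .load()

    df.show()
    spark.stop()
  }
}
```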
Diffstat (limited to 'docs/sql-programming-guide.md')
-rw-r--r--  docs/sql-programming-guide.md  36
1 file changed, 27 insertions, 9 deletions
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 835cb6981f..d0f43ab0a9 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1049,16 +1049,20 @@ bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.
{% endhighlight %}
Tables from the remote database can be loaded as a DataFrame or Spark SQL Temporary table using
-the Data Sources API. The following options are supported:
+the Data Sources API. Users can specify the JDBC connection properties in the data source options.
+<code>user</code> and <code>password</code> are normally provided as connection properties for
+logging into the data sources. In addition to the connection properties, Spark also supports
+the following case-sensitive options:
<table class="table">
<tr><th>Property Name</th><th>Meaning</th></tr>
<tr>
<td><code>url</code></td>
<td>
- The JDBC URL to connect to.
+ The JDBC URL to connect to. The source-specific connection properties may be specified in the URL. e.g., <code>jdbc:postgresql://localhost/test?user=fred&password=secret</code>
</td>
</tr>
+
<tr>
<td><code>dbtable</code></td>
<td>
@@ -1083,28 +1087,42 @@ the Data Sources API. The following options are supported:
<code>partitionColumn</code> must be a numeric column from the table in question. Notice
that <code>lowerBound</code> and <code>upperBound</code> are just used to decide the
partition stride, not for filtering the rows in table. So all rows in the table will be
- partitioned and returned.
+ partitioned and returned. This option applies only to reading.
</td>
</tr>
<tr>
<td><code>fetchsize</code></td>
<td>
- The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to low fetch size (eg. Oracle with 10 rows).
+ The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to low fetch size (eg. Oracle with 10 rows). This option applies only to reading.
</td>
</tr>
<tr>
+ <td><code>batchsize</code></td>
+ <td>
+ The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. It defaults to <code>1000</code>.
+ </td>
+ </tr>
+
+ <tr>
+ <td><code>isolationLevel</code></td>
+ <td>
+ The transaction isolation level, which applies to the current connection. It can be one of <code>NONE</code>, <code>READ_COMMITTED</code>, <code>READ_UNCOMMITTED</code>, <code>REPEATABLE_READ</code>, or <code>SERIALIZABLE</code>, corresponding to the standard transaction isolation levels defined by JDBC's Connection object, with a default of <code>READ_UNCOMMITTED</code>. This option applies only to writing. Please refer to the documentation in <code>java.sql.Connection</code>.
+ </td>
+ </tr>
+
+ <tr>
<td><code>truncate</code></td>
<td>
- This is a JDBC writer related option. When <code>SaveMode.Overwrite</code> is enabled, this option causes Spark to truncate an existing table instead of dropping and recreating it. This can be more efficient, and prevents the table metadata (e.g. indices) from being removed. However, it will not work in some cases, such as when the new data has a different schema. It defaults to <code>false</code>.
+ This is a JDBC writer related option. When <code>SaveMode.Overwrite</code> is enabled, this option causes Spark to truncate an existing table instead of dropping and recreating it. This can be more efficient, and prevents the table metadata (e.g., indices) from being removed. However, it will not work in some cases, such as when the new data has a different schema. It defaults to <code>false</code>. This option applies only to writing.
</td>
</tr>
<tr>
<td><code>createTableOptions</code></td>
<td>
- This is a JDBC writer related option. If specified, this option allows setting of database-specific table and partition options when creating a table. For example: <code>CREATE TABLE t (name string) ENGINE=InnoDB.</code>
+ This is a JDBC writer related option. If specified, this option allows setting of database-specific table and partition options when creating a table (e.g., <code>CREATE TABLE t (name string) ENGINE=InnoDB</code>). This option applies only to writing.
</td>
</tr>
</table>
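For context (illustrative only, not part of this patch), a minimal write-side sketch using the newly documented write-only options, assuming an existing DataFrame `df` and placeholder URL, table name and credentials:

{% highlight scala %}
import org.apache.spark.sql.SaveMode

// Illustrative placeholders for URL, table and credentials.
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/test")
  .option("dbtable", "schema.target_table")
  .option("user", "username")
  .option("password", "password")
  .option("batchsize", "1000")                // rows inserted per round trip
  .option("isolationLevel", "READ_COMMITTED") // transaction isolation for the write
  .option("truncate", "true")                 // truncate instead of drop on Overwrite
  .mode(SaveMode.Overwrite)
  .save()
{% endhighlight %}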
@@ -1328,7 +1346,7 @@ options.
- Dataset API and DataFrame API are unified. In Scala, `DataFrame` becomes a type alias for
`Dataset[Row]`, while Java API users must replace `DataFrame` with `Dataset<Row>`. Both the typed
- transformations (e.g. `map`, `filter`, and `groupByKey`) and untyped transformations (e.g.
+ transformations (e.g., `map`, `filter`, and `groupByKey`) and untyped transformations (e.g.,
`select` and `groupBy`) are available on the Dataset class. Since compile-time type-safety in
Python and R is not a language feature, the concept of Dataset does not apply to these languages’
APIs. Instead, `DataFrame` remains the primary programming abstraction, which is analogous to the
@@ -1377,7 +1395,7 @@ options.
- Timestamps are now stored at a precision of 1us, rather than 1ns
- In the `sql` dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains
unchanged.
- - The canonical name of SQL/DataFrame functions are now lower case (e.g. sum vs SUM).
+ - The canonical name of SQL/DataFrame functions are now lower case (e.g., sum vs SUM).
- JSON data source will not automatically load new files that are created by other applications
(i.e. files that are not inserted to the dataset through Spark SQL).
For a JSON persistent table (i.e. the metadata of the table is stored in Hive Metastore),
@@ -1392,7 +1410,7 @@ options.
Based on user feedback, we created a new, more fluid API for reading data in (`SQLContext.read`)
and writing data out (`DataFrame.write`),
-and deprecated the old APIs (e.g. `SQLContext.parquetFile`, `SQLContext.jsonFile`).
+and deprecated the old APIs (e.g., `SQLContext.parquetFile`, `SQLContext.jsonFile`).
See the API docs for `SQLContext.read` (
<a href="api/scala/index.html#org.apache.spark.sql.SQLContext@read:DataFrameReader">Scala</a>,