path: root/docs/sql-programming-guide.md
Commit message | Author | Age | Files | Lines
* Fixed typos in docs | ymahajan | 2017-04-19 | 1 | -1/+1
## What changes were proposed in this pull request?
Typos at a couple of places in the docs.
## How was this patch tested?
Build including docs.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: ymahajan <ymahajan@snappydata.io>
Closes #17690 from ymahajan/master.
* [MINOR][DOCS] JSON APIs related documentation fixes | hyukjinkwon | 2017-04-12 | 1 | -2/+2
## What changes were proposed in this pull request?
This PR proposes corrections related to the JSON APIs as below:
- Rendering links in the Python documentation
- Replacing `RDD` with `Dataset` in the programming guide
- Adding the missing description about JSON Lines consistently in `DataFrameReader.json` in the Python API
- De-duplicating a little bit of `DataFrameReader.json` in the Scala/Java API
## How was this patch tested?
Manually build the documentation via `jekyll build`. Corresponding snapshots will be left on the code.
Note that currently there are Javadoc 8 breaks in several places. These are proposed to be handled in https://github.com/apache/spark/pull/17477, so this PR does not fix those.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #17602 from HyukjinKwon/minor-json-documentation.
* [MINOR][DOCS] Update supported versions for Hive Metastore | Dongjoon Hyun | 2017-04-11 | 1 | -1/+1
## What changes were proposed in this pull request?
Since SPARK-18112 and SPARK-13446, Apache Spark supports reading Hive metastore 2.0 ~ 2.1.1. This updates the docs.
## How was this patch tested?
N/A
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #17612 from dongjoon-hyun/metastore.
* [SPARK-10849][SQL] Adds option to the JDBC data source write for user to specify database column type for the create table | sureshthalamati | 2017-03-23 | 1 | -0/+7
## What changes were proposed in this pull request?
Currently the JDBC data source creates tables in the target database using the default type mapping and the JDBC dialect mechanism. If users want to specify a different database data type for only some of the columns, there is no option available. In scenarios where the default mapping does not work, users are forced to create tables on the target database before writing. This workaround is probably not acceptable from a usability point of view. This PR provides a user-defined type mapping for specific columns.
The solution is to allow users to specify the database column data type for the create table as a JDBC data source option (`createTableColumnTypes`) on write. Data type information can be specified in the same format as table schema DDL (e.g. `name CHAR(64), comments VARCHAR(1024)`).
Not every supported target database type can be specified; the data types also have to be valid Spark SQL data types. For example, users cannot specify the target database CLOB data type. This will be supported in a follow-up PR.
Example:
```Scala
df.write
  .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
  .jdbc(url, "TEST.DBCOLTYPETEST", properties)
```
## How was this patch tested?
Added new test cases to the JDBCWriteSuite.
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes #16209 from sureshthalamati/jdbc_custom_dbtype_option_json-spark-10849.
* [SPARK-18352][DOCS] wholeFile JSON update doc and programming guide | Felix Cheung | 2017-03-02 | 1 | -11/+15
## What changes were proposed in this pull request?
Update the doc for R and the programming guide. Clarify the default behavior for all languages.
## How was this patch tested?
Manually.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #17128 from felixcheung/jsonwholefiledoc.
* [MINOR][DOCS] Fixes two problems in the SQL programming guide page | Boaz Mohar | 2017-02-25 | 1 | -1/+1
## What changes were proposed in this pull request?
Removed duplicated lines in the SQL Python example and found a typo.
## How was this patch tested?
Searched for other typos in the page to minimize PRs.
Author: Boaz Mohar <boazmohar@gmail.com>
Closes #17066 from boazmohar/doc-fix.
* [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTable api call in the doc | Sunitha Kambhampati | 2017-02-13 | 1 | -2/+2
## What changes were proposed in this pull request?
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
In the doc, the calls spark.cacheTable("tableName") and spark.uncacheTable("tableName") actually need to be spark.catalog.cacheTable and spark.catalog.uncacheTable.
## How was this patch tested?
Built the docs and verified the change shows up fine.
Author: Sunitha Kambhampati <skambha@us.ibm.com>
Closes #16919 from skambha/docChange.
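For reference, a minimal sketch of the corrected calls described above, assuming an existing SparkSession named `spark` and a hypothetical table named `myTable`:

```scala
// Cache and uncache a table through the Catalog API (Spark 2.x).
spark.catalog.cacheTable("myTable")    // mark the table for in-memory caching
spark.table("myTable").count()         // an action materializes the cache
spark.catalog.uncacheTable("myTable")  // release the cached data
```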
* [SPARK-19396][DOC] JDBC Options are Case Insensitive | gatorsmile | 2017-01-30 | 1 | -1/+1
### What changes were proposed in this pull request?
JDBC option names are not case sensitive, after the PR https://github.com/apache/spark/pull/15884 was merged into Spark 2.1.
### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #16734 from gatorsmile/fixDocCaseInsensitive.
* [SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide | aokolnychyi | 2017-01-24 | 1 | -0/+46
## What changes were proposed in this pull request?
- A separate subsection for Aggregations under "Getting Started" in the Spark SQL programming guide. It mentions which aggregate functions are predefined and how users can create their own.
- Examples of using the `UserDefinedAggregateFunction` abstract class for untyped aggregations in Java and Scala.
- Examples of using the `Aggregator` abstract class for type-safe aggregations in Java and Scala.
- Python is not covered.
- The PR might not resolve the ticket since I do not know what exactly was planned by the author.
In total, there are four new standalone examples that can be executed via `spark-submit` or `run-example`. The updated Spark SQL programming guide refers to these examples and does not contain hard-coded snippets.
## How was this patch tested?
The patch was tested locally by building the docs. The examples were run as well.
![image](https://cloud.githubusercontent.com/assets/6235869/21292915/04d9d084-c515-11e6-811a-999d598dffba.png)
Author: aokolnychyi <okolnychyyanton@gmail.com>
Closes #16329 from aokolnychyi/SPARK-16046.
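As a rough illustration of the type-safe `Aggregator` pattern the new subsection covers (not the exact example added by the PR), here is a sketch assuming a hypothetical `Employee` case class:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)

// Buffer is a (sum, count) pair; the result is the average salary.
object MyAverage extends Aggregator[Employee, (Long, Long), Double] {
  def zero: (Long, Long) = (0L, 0L)
  def reduce(b: (Long, Long), e: Employee): (Long, Long) = (b._1 + e.salary, b._2 + 1)
  def merge(b1: (Long, Long), b2: (Long, Long)): (Long, Long) = (b1._1 + b2._1, b1._2 + b2._2)
  def finish(r: (Long, Long)): Double = r._1.toDouble / r._2
  def bufferEncoder: Encoder[(Long, Long)] = Encoders.tuple(Encoders.scalaLong, Encoders.scalaLong)
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage, assuming `spark` is an existing SparkSession:
// import spark.implicits._
// val ds = Seq(Employee("a", 100L), Employee("b", 200L)).toDS()
// ds.select(MyAverage.toColumn.name("average_salary")).show()
```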
* [SPARK-18941][SQL][DOC] Add a new behavior document on `CREATE/DROP TABLE` with `LOCATION` | Dongjoon Hyun | 2017-01-07 | 1 | -0/+8
## What changes were proposed in this pull request?
This PR adds a new behavior change description on `CREATE TABLE ... LOCATION` to `sql-programming-guide.md`, clearly under `Upgrading From Spark SQL 1.6 to 2.0`. This change was introduced in Apache Spark 2.0.0 as [SPARK-15276](https://issues.apache.org/jira/browse/SPARK-15276).
## How was this patch tested?
```
SKIP_API=1 jekyll build
```
**Newly Added Description**
<img width="913" alt="new" src="https://cloud.githubusercontent.com/assets/9700541/21743606/7efe2b12-d4ba-11e6-8a0d-551222718ea2.png">
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #16400 from dongjoon-hyun/SPARK-18941.
* [SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables | Wenchen Fan | 2017-01-05 | 1 | -8/+52
## What changes were proposed in this pull request?
Today we have different syntax to create data source or hive serde tables; we should unify them to not confuse users and step forward to make hive a data source. Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details.
TODO (for follow-up PRs):
1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later.
2. `SHOW CREATE TABLE` should be updated to use the new syntax.
3. we should decide if we wanna change the behavior of `SET LOCATION`.
## How was this patch tested?
new tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes #16296 from cloud-fan/create-table.
* [SPARK-19016][SQL][DOC] Document scalable partition handling | Cheng Lian | 2016-12-30 | 1 | -4/+11
## What changes were proposed in this pull request?
This PR documents the scalable partition handling feature in the body of the programming guide. Before this PR, we only mentioned it in the migration guide. It was not very clear that, since 2.1, external datasource tables require an extra `MSCK REPAIR TABLE` command to have per-partition information persisted.
## How was this patch tested?
N/A.
Author: Cheng Lian <lian@databricks.com>
Closes #16424 from liancheng/scalable-partition-handling-doc.
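A hedged sketch of the repair step mentioned above, assuming a hypothetical partitioned external table named `logs` whose partition directories were written directly to the file system:

```scala
// Persist per-partition information in the metastore so Spark 2.1+ can plan
// queries against the table; run once for pre-existing partition data.
spark.sql("MSCK REPAIR TABLE logs")
```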
* Update Spark documentation to provide information on how to create External Table | c-sahuja | 2016-12-06 | 1 | -0/+5
## What changes were proposed in this pull request?
Although `saveAsTable` does not currently provide an API to save a DataFrame as an external table, this can be achieved by setting an option on `DataFrameWriter` whose key is the String "path" and whose value is the location of the external table itself, before the call to `saveAsTable` is performed.
## How was this patch tested?
Documentation was reviewed for formatting and content after the push was performed on the branch.
![updated documentation](https://cloud.githubusercontent.com/assets/15376052/20953147/4cfcf308-bc57-11e6-807c-e21fb774a760.PNG)
Author: c-sahuja <sahuja@cloudera.com>
Closes #16185 from c-sahuja/createExternalTable.
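A minimal sketch of the approach described above; the path and table name are placeholders, and `df` is assumed to be an existing DataFrame:

```scala
// Supplying a "path" option before saveAsTable makes the saved table external:
// the data lives at the given location rather than in the managed warehouse.
df.write
  .option("path", "/warehouse/ext_table")
  .saveAsTable("my_external_table")
```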
* [MINOR][DOC] Use SparkR `TRUE` value and add default values for `StructField` in SQL Guide. | Dongjoon Hyun | 2016-12-05 | 1 | -5/+8
## What changes were proposed in this pull request?
In the `SQL Programming Guide`, this PR uses `TRUE` instead of `True` in SparkR and adds the default value of `nullable` for `StructField` in Scala/Python/R (i.e., "Note: The default value of nullable is true."). In the Java API, `nullable` is not optional.
**BEFORE**
* SPARK 2.1.0 RC1
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/sql-programming-guide.html#data-types
**AFTER**
* R
<img width="916" alt="screen shot 2016-12-04 at 11 58 19 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877443/abba19a6-ba7d-11e6-8984-afbe00333fb0.png">
* Scala
<img width="914" alt="screen shot 2016-12-04 at 11 57 37 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877433/99ce734a-ba7d-11e6-8bb5-e8619041b09b.png">
* Python
<img width="914" alt="screen shot 2016-12-04 at 11 58 04 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877440/a5c89338-ba7d-11e6-8f92-6c0ae9388d7e.png">
## How was this patch tested?
Manual.
```
cd docs
SKIP_API=1 jekyll build
open _site/index.html
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #16141 from dongjoon-hyun/SPARK-SQL-GUIDE.
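For illustration, a Scala snippet showing the default described above; the field and type names are arbitrary:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType),                   // nullable defaults to true
  StructField("age", IntegerType, nullable = false)  // must be set explicitly to forbid nulls
))
```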
* [SPARK-18145] Update documentation for hive partition management in 2.1 | Eric Liang | 2016-11-29 | 1 | -0/+9
## What changes were proposed in this pull request?
This documents the partition handling changes for Spark 2.1 and how to migrate existing tables.
## How was this patch tested?
Built docs locally. rxin
Author: Eric Liang <ekl@databricks.com>
Closes #16074 from ericl/spark-18145.
* [WIP][SQL][DOC] Fix incorrect `code` tag | Weiqing Yang | 2016-11-26 | 1 | -1/+1
## What changes were proposed in this pull request?
This PR fixes an incorrect `code` tag in `sql-programming-guide.md`.
## How was this patch tested?
Manually.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes #15941 from weiqingy/fixtag.
* [SPARK-18413][SQL][FOLLOW-UP] Use `numPartitions` instead of `maxConnections` | Dongjoon Hyun | 2016-11-25 | 1 | -10/+14
## What changes were proposed in this pull request?
This is a follow-up PR of #15868 to merge the `maxConnections` option into the `numPartitions` option.
## How was this patch tested?
Pass the existing tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #15966 from dongjoon-hyun/SPARK-18413-2.
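A hedged sketch of the merged option in use on the write path; the connection details are placeholders and `df` is an assumed DataFrame:

```scala
// numPartitions caps the number of simultaneous JDBC connections on write;
// Spark coalesces the DataFrame down to this many partitions if needed.
df.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/t")
  .option("dbtable", "t1")
  .option("user", "root")
  .option("password", "")
  .option("numPartitions", "2")
  .save()
```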
* [SPARK-18413][SQL] Add `maxConnections` JDBCOption | Dongjoon Hyun | 2016-11-21 | 1 | -0/+7
## What changes were proposed in this pull request?
This PR adds a new JDBC option, `maxConnections`, which means the maximum number of simultaneous JDBC connections allowed. This option applies only to writing, with a coalesce operation if needed. It defaults to the number of partitions of the RDD. Previously, SQL users could not control this, while Scala/Java/Python users could use the `coalesce` (or `repartition`) API.
**Reported Scenario**
For the following case, the number of connections becomes 200 and the database cannot handle all of them.
```sql
CREATE OR REPLACE TEMPORARY VIEW resultview
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:oracle:thin:10.129.10.111:1521:BKDB",
  dbtable "result",
  user "HIVE",
  password "HIVE"
);
-- set spark.sql.shuffle.partitions=200
INSERT OVERWRITE TABLE resultview SELECT g, count(1) AS COUNT FROM tnet.DT_LIVE_INFO GROUP BY g
```
## How was this patch tested?
Manual. Do the following and see the Spark UI.
**Step 1 (MySQL)**
```
CREATE TABLE t1 (a INT);
CREATE TABLE data (a INT);
INSERT INTO data VALUES (1);
INSERT INTO data VALUES (2);
INSERT INTO data VALUES (3);
```
**Step 2 (Spark)**
```scala
SPARK_HOME=$PWD bin/spark-shell --driver-memory 4G --driver-class-path mysql-connector-java-5.1.40-bin.jar
scala> sql("SET spark.sql.shuffle.partitions=3")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW data USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 'data', user 'root', password '')")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '1')")
scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '2')")
scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '3')")
scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '4')")
scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
```
![maxconnections](https://cloud.githubusercontent.com/assets/9700541/20287987/ed8409c2-aa84-11e6-8aab-ae28e63fe54d.png)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #15868 from dongjoon-hyun/SPARK-18413.
* [MINOR][DOC] Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation | Weiqing Yang | 2016-11-16 | 1 | -3/+3
## What changes were proposed in this pull request?
Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation.
## How was this patch tested?
Manually.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes #15886 from weiqingy/fixTypo.
* [SQL][DOC] updating doc for JSON source to link to jsonlines.org | Felix Cheung | 2016-10-26 | 1 | -9/+13
## What changes were proposed in this pull request?
API and programming guide doc changes for Scala, Python and R.
## How was this patch tested?
manual test
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #15629 from felixcheung/jsondoc.
* [SPARK-17810][SQL] Default spark.sql.warehouse.dir is relative to local FS but can resolve as HDFS path | Sean Owen | 2016-10-24 | 1 | -28/+5
## What changes were proposed in this pull request?
Always resolve spark.sql.warehouse.dir as a local path, and as relative to the working dir, not the home dir.
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes #15382 from srowen/SPARK-17810.
* [SPARK-18001][DOCUMENT] fix broken link to SparkDataFrame | Tommy YU | 2016-10-18 | 1 | -1/+1
## What changes were proposed in this pull request?
In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section "Untyped Dataset Operations (aka DataFrame Operations)", the link to the R DataFrame doc doesn't work: it returns "The requested URL /docs/latest/api/R/DataFrame.html was not found on this server." The correct link is SparkDataFrame.html for Spark 2.0.
## How was this patch tested?
Manually checked.
Author: Tommy YU <tummyyu@163.com>
Closes #15543 from Wenpei/spark-18001.
* [MINOR][DOC] Add more built-in sources in sql-programming-guide.md | Weiqing Yang | 2016-10-18 | 1 | -2/+2
## What changes were proposed in this pull request?
Add more built-in sources in sql-programming-guide.md.
## How was this patch tested?
Manually.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes #15522 from weiqingy/dsDoc.
* [DOC] Fix typo in sql hive doc | Dhruve Ashar | 2016-10-14 | 1 | -1/+1
Change is too trivial to file a JIRA.
Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes #15485 from dhruve/master.
* [SPARK-17719][SPARK-17776][SQL] Unify and tie up options in a single place in JDBC datasource package | hyukjinkwon | 2016-10-10 | 1 | -9/+27
## What changes were proposed in this pull request?
This PR proposes to fix arbitrary usages among `Map[String, String]`, `Properties` and `JDBCOptions` instances for options in the `execution/jdbc` package and make the connection properties exclude Spark-only options.
This PR includes some changes as below:
- Unify `Map[String, String]`, `Properties` and `JDBCOptions` in the `execution/jdbc` package into `JDBCOptions`.
- Move the `batchsize`, `fetchsize`, `driver` and `isolationlevel` options into the `JDBCOptions` instance.
- Document `batchSize` and `isolationlevel`, marking both read-only options and write-only options. This also includes minor typo fixes and a detailed explanation for some statements such as url.
- Throw exceptions fast by checking arguments first rather than at execution time (e.g. for `fetchsize`).
- Exclude Spark-only options from connection properties.
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #15292 from HyukjinKwon/SPARK-17719.
* [SPARK-17338][SQL] add global temp view | Wenchen Fan | 2016-10-10 | 1 | -5/+40
## What changes were proposed in this pull request?
A global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system-preserved database `global_temp` (configurable via SparkConf), and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1.
Changes for `SessionCatalog`:
1. add a new field `globalTempViews: GlobalTempViewManager`, to access the shared global temp views, and the global temp db name.
2. `createDatabase` will fail if users wanna create `global_temp`, which is system preserved.
3. `setCurrentDatabase` will fail if users wanna set `global_temp`, which is system preserved.
4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp views.
6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.
Changes for SQL commands:
1. `CreateViewCommand`/`AlterViewAsCommand` are updated to support global temp views.
2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accept global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.
Changes for other public API:
1. add a new method `dropGlobalTempView` in `Catalog`.
2. `Catalog.findTable` can find global temp views.
3. add a new method `createGlobalTempView` in `Dataset`.
## How was this patch tested?
new tests in `SQLViewSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes #14897 from cloud-fan/global-temp-view.
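A short sketch of the new API from the user's point of view, assuming an existing SparkSession `spark` and DataFrame `df`:

```scala
// Register a global temp view; it lives in the system database `global_temp`.
df.createGlobalTempView("people")

// It must be referenced with the qualified name.
spark.sql("SELECT * FROM global_temp.people").show()

// Visible from other sessions of the same Spark application.
spark.newSession().sql("SELECT COUNT(*) FROM global_temp.people").show()
```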
* [SPARK-14525][SQL] Make DataFrameWrite.save work for jdbc | Justin Pihony | 2016-09-26 | 1 | -1/+5
## What changes were proposed in this pull request?
This change modifies the implementation of DataFrameWriter.save such that it works with jdbc, and the call to jdbc merely delegates to save.
## How was this patch tested?
This was tested via unit tests in the JDBCWriteSuite, to which I added one new test to cover this scenario.
## Additional details
rxin This seems to have been most recently touched by you and was also commented on in the JIRA.
This contribution is my original work and I license the work to the project under the project's open source license.
Author: Justin Pihony <justin.pihony@gmail.com>
Author: Justin Pihony <justin.pihony@typesafe.com>
Closes #12601 from JustinPihony/jdbc_reconciliation.
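A minimal sketch of the now-supported generic save path for JDBC; the URL, table name and credentials are placeholders:

```scala
// format("jdbc") + save() goes through the same code path as .jdbc(...).
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/testdb")
  .option("dbtable", "schema.table_name")
  .option("user", "username")
  .option("password", "password")
  .save()
```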
* Correct fetchsize property name in docs | Daniel Darabos | 2016-09-17 | 1 | -1/+1
## What changes were proposed in this pull request?
Replace `fetchSize` with `fetchsize` in the docs.
## How was this patch tested?
I manually tested `fetchSize` and `fetchsize`. The latter has an effect. See also [`JdbcUtils.scala#L38`](https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L38) for the definition of the property.
Author: Daniel Darabos <darabos.daniel@gmail.com>
Closes #14975 from darabos/patch-3.
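For reference, a sketch showing the corrected, all-lowercase property name on a JDBC read; connection details are placeholders:

```scala
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/testdb")
  .option("dbtable", "schema.table_name")
  .option("fetchsize", "1000")  // rows fetched per database round trip
  .load()
```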
* [SPARK-16968] Document additional options in jdbc Writer | GraceH | 2016-08-22 | 1 | -0/+14
## What changes were proposed in this pull request?
This is the documentation for the previously added JDBC writer options.
## How was this patch tested?
A unit test was added in the previous PR.
Author: GraceH <jhuang1@paypal.com>
Closes #14683 from GraceH/jdbc_options.
* [SPARK-16870][DOCS] Summary: add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md | keliang | 2016-08-07 | 1 | -0/+9
## What changes were proposed in this pull request?
The default value for spark.sql.broadcastTimeout is 300s, and this property does not show up in any of the Spark docs. So add "spark.sql.broadcastTimeout" to docs/sql-programming-guide.md to help people fix this timeout error when it happens.
## How was this patch tested?
Not needed.
Author: keliang <keliang@cmss.chinamobile.com>
Closes #14477 from biglobster/keliang.
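A small sketch of how the documented timeout could be raised, assuming a SparkSession named `spark`; the value 600 is arbitrary:

```scala
// Timeout in seconds for the broadcast wait time in broadcast joins (default 300s per this commit).
spark.conf.set("spark.sql.broadcastTimeout", 600)
// Equivalent SQL form:
spark.sql("SET spark.sql.broadcastTimeout=600")
```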
* [SPARK-16734][EXAMPLES][SQL] Revise examples of all language bindings | Cheng Lian | 2016-08-02 | 1 | -42/+14
## What changes were proposed in this pull request?
This PR makes various minor updates to examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (JDBC example in Scala/Java/Python) are also fixed.
## How was this patch tested?
Manually tested.
Author: Cheng Lian <lian@databricks.com>
Closes #14368 from liancheng/revise-examples.
* [SQL][DOC] Fix a default name for parquet compression | Takeshi YAMAMURO | 2016-07-25 | 1 | -1/+1
## What changes were proposed in this pull request?
This PR fixes a wrong description of the default Parquet compression codec.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes #14351 from maropu/FixParquetDoc.
* [SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding | Cheng Lian | 2016-07-23 | 1 | -216/+13
This PR is based on PR #14098 authored by wangmiao1981.
## What changes were proposed in this pull request?
This PR replaces the original Python Spark SQL example file with the following three files:
- `sql/basic.py`: Demonstrates basic Spark SQL features.
- `sql/datasource.py`: Demonstrates various Spark SQL data sources.
- `sql/hive.py`: Demonstrates Spark SQL Hive interaction.
This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.
## How was this patch tested?
Manually tested.
Author: wm624@hotmail.com <wm624@hotmail.com>
Author: Cheng Lian <lian@databricks.com>
Closes #14317 from liancheng/py-examples-update.
* [SPARK-16568][SQL][DOCUMENTATION] update sql programming guide refreshTable API in python code | WeichenXu | 2016-07-19 | 1 | -2/+2
## What changes were proposed in this pull request?
Update the `refreshTable` API in the Python code of the sql-programming-guide. This API was added in SPARK-15820.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #14220 from WeichenXu123/update_sql_doc_catalog.
* [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update | Cheng Lian | 2016-07-18 | 1 | -29/+28
## What changes were proposed in this pull request?
This PR moves the last remaining hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that all "Sql" in the file names are updated to "SQL".
## How was this patch tested?
Manually verified the generated HTML page.
Author: Cheng Lian <lian@databricks.com>
Closes #14245 from liancheng/minor-scala-example-update.
* [SPARK-16553][DOCS] Fix SQL example file name in docs | Shivaram Venkataraman | 2016-07-14 | 1 | -1/+1
## What changes were proposed in this pull request?
Fixes a typo in the sql programming guide.
## How was this patch tested?
Building docs locally.
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #14208 from shivaram/spark-sql-doc-fix.
* [SPARK-16303][DOCS][EXAMPLES] Updated SQL programming guide and examples | aokolnychyi | 2016-07-13 | 1 | -537/+35
- Hard-coded Spark SQL sample snippets were moved into source files under the examples sub-project.
- Removed the inconsistency between Scala and Java Spark SQL examples.
- Scala and Java Spark SQL examples were updated.
The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review.
![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png)
Author: aokolnychyi <okolnychyyanton@gmail.com>
Closes #14119 from aokolnychyi/spark_16303.
* [SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators. | Lianhui Wang | 2016-07-12 | 1 | -0/+12
## What changes were proposed in this pull request?
When a query only uses metadata (for example, a partition key), it can return results based on metadata alone without scanning files. Hive did this in HIVE-1003.
## How was this patch tested?
Added unit tests.
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>
Closes #13494 from lianhuiwang/metadata-only.
* [SPARK-16381][SQL][SPARKR] Update SQL examples and programming guide for R language binding | Xin Ren | 2016-07-11 | 1 | -142/+13
https://issues.apache.org/jira/browse/SPARK-16381
## What changes were proposed in this pull request?
Update SQL examples and the programming guide for the R language binding. Here I just follow the example https://github.com/apache/spark/compare/master...liancheng:example-snippet-extraction and created a separate R file to store all the example code.
## How was this patch tested?
Manual test on my local machine. Screenshot as below:
![screen shot 2016-07-06 at 4 52 25 pm](https://cloud.githubusercontent.com/assets/3925641/16638180/13925a58-439a-11e6-8d57-8451a63dcae9.png)
Author: Xin Ren <iamshrek@126.com>
Closes #14082 from keypointt/SPARK-16381.
* [SPARK-16294][SQL] Labelling support for the include_example Jekyll plugin | Cheng Lian | 2016-06-29 | 1 | -35/+6
## What changes were proposed in this pull request?
This PR adds labelling support for the `include_example` Jekyll plugin, so that we may split a single source file into multiple line blocks with different labels, and include them in multiple code snippets in the generated HTML page.
## How was this patch tested?
Manually tested.
<img width="923" alt="screenshot at jun 29 19-53-21" src="https://cloud.githubusercontent.com/assets/230655/16451099/66a76db2-3e33-11e6-84fb-63104c2f0688.png">
Author: Cheng Lian <lian@databricks.com>
Closes #13972 from liancheng/include-example-with-labels.
* [SPARK-15863][SQL][DOC][FOLLOW-UP] Update SQL programming guide. | Yin Huai | 2016-06-27 | 1 | -18/+16
## What changes were proposed in this pull request?
This PR makes several updates to the SQL programming guide.
Author: Yin Huai <yhuai@databricks.com>
Closes #13938 from yhuai/doc.
* [SQL][DOC] SQL programming guide add deprecated methods in 2.0.0 | Felix Cheung | 2016-06-22 | 1 | -1/+5
## What changes were proposed in this pull request?
Doc changes
## How was this patch tested?
manual
liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13827 from felixcheung/sqldocdeprecate.
* [SPARK-15894][SQL][DOC] Update docs for controlling #partitions | Takeshi YAMAMURO | 2016-06-21 | 1 | -0/+17
## What changes were proposed in this pull request?
Update docs for the two parameters `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes` in Other Configuration Options.
## How was this patch tested?
N/A
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes #13797 from maropu/SPARK-15894-2.
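A hedged sketch of setting the two documented parameters at runtime, assuming a SparkSession named `spark`; the values shown here (128 MB and 4 MB) are illustrative:

```scala
// Maximum number of bytes to pack into a single partition when reading files.
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728)
// Estimated cost (in bytes) of opening a file, used when packing files into partitions.
spark.conf.set("spark.sql.files.openCostInBytes", 4194304)
```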
* [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R | Felix Cheung | 2016-06-21 | 1 | -18/+16
## What changes were proposed in this pull request?
Update the doc as per the discussion in PR #13592.
## How was this patch tested?
manual
shivaram liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13799 from felixcheung/rsqlprogrammingguide.
* [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0 | Cheng Lian | 2016-06-20 | 1 | -288/+317
## What changes were proposed in this pull request?
Initial SQL programming guide update for Spark 2.0. Contents like the 1.6 to 2.0 migration guide are still incomplete. We may also want to add more examples for Scala/Java Dataset typed transformations.
## How was this patch tested?
N/A
Author: Cheng Lian <lian@databricks.com>
Closes #13592 from liancheng/sql-programming-guide-2.0.
* [DOCUMENTATION] fixed groupby aggregation example for pyspark | Mortada Mehyar | 2016-06-10 | 1 | -1/+1
## What changes were proposed in this pull request?
Fixing documentation for the groupby/agg example in Python.
## How was this patch tested?
The existing example in the documentation does not contain valid syntax (missing parenthesis) and does not use `Column` in the expression for `agg()`. After the fix, here's how I tested it:
```
In [1]: from pyspark.sql import Row
In [2]: import pyspark.sql.functions as func
In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:records = [{'age': 19, 'department': 1, 'expense': 100},
:           {'age': 20, 'department': 1, 'expense': 200},
:           {'age': 21, 'department': 2, 'expense': 300},
:           {'age': 22, 'department': 2, 'expense': 300},
:           {'age': 23, 'department': 3, 'expense': 300}]
:--
In [4]: df = sqlContext.createDataFrame([Row(**d) for d in records])
In [5]: df.groupBy("department").agg(df["department"], func.max("age"), func.sum("expense")).show()
+----------+----------+--------+------------+
|department|department|max(age)|sum(expense)|
+----------+----------+--------+------------+
|         1|         1|      20|         300|
|         2|         2|      22|         600|
|         3|         3|      23|         300|
+----------+----------+--------+------------+
```
Author: Mortada Mehyar <mortada.mehyar@gmail.com>
Closes #13587 from mortada/groupby_agg_doc_fix.
* [SPARK-15396][SQL][DOC] It can't connect hive metastore database | gatorsmile | 2016-05-21 | 1 | -29/+43
#### What changes were proposed in this pull request?
The `hive.metastore.warehouse.dir` property in hive-site.xml is deprecated since Spark 2.0.0. Users might not be able to connect to the existing metastore if they do not use the new conf parameter `spark.sql.warehouse.dir`.
This PR updates the document and example to explain the latest changes in the configuration of the default location of the database.
Below are screenshots of the latest generated docs:
<img width="681" alt="screenshot 2016-05-20 08 38 10" src="https://cloud.githubusercontent.com/assets/11567269/15433296/a05c4ace-1e66-11e6-8d2b-73682b32e9c2.png">
<img width="789" alt="screenshot 2016-05-20 08 53 26" src="https://cloud.githubusercontent.com/assets/11567269/15433734/645dc42e-1e68-11e6-9476-effc9f8721bb.png">
<img width="789" alt="screenshot 2016-05-20 08 53 37" src="https://cloud.githubusercontent.com/assets/11567269/15433738/68569f92-1e68-11e6-83d3-ef5bb221a8d8.png">
No change is made in the R example.
<img width="860" alt="screenshot 2016-05-20 08 54 38" src="https://cloud.githubusercontent.com/assets/11567269/15433779/965b8312-1e68-11e6-8bc4-53c88ceacde2.png">
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #13225 from gatorsmile/document.
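A minimal sketch of the new configuration approach, assuming a hypothetical warehouse location:

```scala
import org.apache.spark.sql.SparkSession

// Configure spark.sql.warehouse.dir when building the session instead of
// relying on the deprecated hive.metastore.warehouse.dir in hive-site.xml.
val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()
```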
* [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable | Sean Zhong | 2016-05-18 | 1 | -24/+24
## What changes were proposed in this pull request?
Update the unit test code, examples, and documents to remove calls to the deprecated method `dataset.registerTempTable`.
## How was this patch tested?
This PR only changes the unit test code, examples, and comments, so it should be safe. This is a follow-up of PR https://github.com/apache/spark/pull/12945, which was merged.
Author: Sean Zhong <seanzhong@databricks.com>
Closes #13098 from clockfly/spark-15171-remove-deprecation.
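For reference, a sketch of the replacement call, assuming a DataFrame `df` with hypothetical `name` and `age` columns and a SparkSession `spark`:

```scala
// createOrReplaceTempView replaces the deprecated registerTempTable.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 21").show()
```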
* [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR. | Sun Rui | 2016-04-29 | 1 | -0/+5
## What changes were proposed in this pull request?
dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame. The function signature is:
dapply(df, function(localDF) {}, schema = NULL)
R function input: local data.frame from the partition on the local node.
R function output: local data.frame.
Schema specifies the Row format of the resulting DataFrame. It must match the R function's output. If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such a resulting DataFrame can be processed by successive calls to dapply().
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>
Closes #12493 from sun-rui/SPARK-12919.
* [SPARK-14883][DOCS] Fix wrong R examples and make them up-to-date | Dongjoon Hyun | 2016-04-24 | 1 | -17/+13
## What changes were proposed in this pull request?
This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.
- Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
- Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency.
- Fix datatypes in `sparkr.md`.
- Update a data result in `sparkr.md`.
- Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet
- Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
- Other minor syntax fixes and a typo.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12649 from dongjoon-hyun/SPARK-14883.