author    Dongjoon Hyun <dongjoon@apache.org>  2016-03-16 15:50:24 -0700
committer Reynold Xin <rxin@databricks.com>   2016-03-16 15:50:24 -0700
commit    4ce2d24e2a03966fde9a5be2d11395200f5dc4f6
tree      87cca48ad942f2d0408b151eb7fe62c3106fa56e /docs/sql-programming-guide.md
parent    27e1f38851a8f28a28544b2021b3c5641d0ff3ab
[SPARK-13942][CORE][DOCS] Remove Shark-related docs for 2.x
## What changes were proposed in this pull request?
`Shark` was merged into `Spark SQL` in [July 2014](https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html). The following sections appear to be the only remaining legacy. For Spark 2.x, we should clean up these docs.
**Migration Guide**
```
- ## Migration Guide for Shark Users
- ...
- ### Scheduling
- ...
- ### Reducer number
- ...
- ### Caching
```
## How was this patch tested?
Passes the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #11770 from dongjoon-hyun/SPARK-13942.
Diffstat (limited to 'docs/sql-programming-guide.md')
-rw-r--r--  docs/sql-programming-guide.md | 45 -
1 file changed, 0 insertions(+), 45 deletions(-)
```diff
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 89fe873851..3138fd5fb4 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -2356,51 +2356,6 @@
 Python UDF registration is unchanged.
 
 When using DataTypes in Python you will need to construct them (i.e. `StringType()`) instead of
 referencing a singleton.
 
-## Migration Guide for Shark Users
-
-### Scheduling
-To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session,
-users can set the `spark.sql.thriftserver.scheduler.pool` variable:
-
-    SET spark.sql.thriftserver.scheduler.pool=accounting;
-
-### Reducer number
-
-In Shark, default reducer number is 1 and is controlled by the property `mapred.reduce.tasks`. Spark
-SQL deprecates this property in favor of `spark.sql.shuffle.partitions`, whose default value
-is 200. Users may customize this property via `SET`:
-
-    SET spark.sql.shuffle.partitions=10;
-    SELECT page, count(*) c
-    FROM logs_last_month_cached
-    GROUP BY page ORDER BY c DESC LIMIT 10;
-
-You may also put this property in `hive-site.xml` to override the default value.
-
-For now, the `mapred.reduce.tasks` property is still recognized, and is converted to
-`spark.sql.shuffle.partitions` automatically.
-
-### Caching
-
-The `shark.cache` table property no longer exists, and tables whose name end with `_cached` are no
-longer automatically cached. Instead, we provide `CACHE TABLE` and `UNCACHE TABLE` statements to
-let user control table caching explicitly:
-
-    CACHE TABLE logs_last_month;
-    UNCACHE TABLE logs_last_month;
-
-**NOTE:** `CACHE TABLE tbl` is now __eager__ by default not __lazy__. Don't need to trigger cache
-materialization manually anymore.
-
-Spark SQL newly introduced a statement to let user control table caching whether or not lazy
-since Spark 1.2.0:
-
-    CACHE [LAZY] TABLE [AS SELECT] ...
-
-Several caching related features are not supported yet:
-
-* User defined partition level cache eviction policy
-* RDD reloading
-* In-memory cache write through policy
-
 ## Compatibility with Apache Hive
 
 Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs.
```
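For context: this change deletes only the Shark-specific migration prose; the settings and statements quoted in the removed sections remain valid Spark SQL. A minimal sketch of the surviving equivalents, using the example names (`accounting`, `logs_last_month`) from the removed docs:

```sql
-- Set the number of shuffle partitions for the session (default is 200)
SET spark.sql.shuffle.partitions=10;

-- Assign a Fair Scheduler pool for a JDBC/Thrift Server client session
SET spark.sql.thriftserver.scheduler.pool=accounting;

-- Explicit table caching: CACHE TABLE is eager by default, UNCACHE releases it
CACHE TABLE logs_last_month;
UNCACHE TABLE logs_last_month;

-- Lazy caching, available since Spark 1.2.0
CACHE LAZY TABLE logs_last_month;
```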