From 4ce2d24e2a03966fde9a5be2d11395200f5dc4f6 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Wed, 16 Mar 2016 15:50:24 -0700
Subject: [SPARK-13942][CORE][DOCS] Remove Shark-related docs for 2.x

## What changes were proposed in this pull request?

`Shark` was merged into `Spark SQL` in [July 2014](https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html). The following sections seem to be the only remaining legacy. For Spark 2.x, we should clean up these docs.

**Migration Guide**

```
- ## Migration Guide for Shark Users
- ...
- ### Scheduling
- ...
- ### Reducer number
- ...
- ### Caching
```

## How was this patch tested?

Pass the Jenkins tests.

Author: Dongjoon Hyun

Closes #11770 from dongjoon-hyun/SPARK-13942.
---
 docs/sql-programming-guide.md | 45 ---------------------------------------------
 1 file changed, 45 deletions(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 89fe873851..3138fd5fb4 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -2356,51 +2356,6 @@ Python UDF registration is unchanged.
 When using DataTypes in Python you will need to construct them (i.e. `StringType()`) instead of
 referencing a singleton.
 
-## Migration Guide for Shark Users
-
-### Scheduling
-To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session,
-users can set the `spark.sql.thriftserver.scheduler.pool` variable:
-
-    SET spark.sql.thriftserver.scheduler.pool=accounting;
-
-### Reducer number
-
-In Shark, default reducer number is 1 and is controlled by the property `mapred.reduce.tasks`. Spark
-SQL deprecates this property in favor of `spark.sql.shuffle.partitions`, whose default value
-is 200. Users may customize this property via `SET`:
-
-    SET spark.sql.shuffle.partitions=10;
-    SELECT page, count(*) c
-    FROM logs_last_month_cached
-    GROUP BY page ORDER BY c DESC LIMIT 10;
-
-You may also put this property in `hive-site.xml` to override the default value.
-
-For now, the `mapred.reduce.tasks` property is still recognized, and is converted to
-`spark.sql.shuffle.partitions` automatically.
-
-### Caching
-
-The `shark.cache` table property no longer exists, and tables whose name end with `_cached` are no
-longer automatically cached. Instead, we provide `CACHE TABLE` and `UNCACHE TABLE` statements to
-let user control table caching explicitly:
-
-    CACHE TABLE logs_last_month;
-    UNCACHE TABLE logs_last_month;
-
-**NOTE:** `CACHE TABLE tbl` is now __eager__ by default not __lazy__. Don't need to trigger cache materialization manually anymore.
-
-Spark SQL newly introduced a statement to let user control table caching whether or not lazy since Spark 1.2.0:
-
-    CACHE [LAZY] TABLE [AS SELECT] ...
-
-Several caching related features are not supported yet:
-
-* User defined partition level cache eviction policy
-* RDD reloading
-* In-memory cache write through policy
-
 ## Compatibility with Apache Hive
 
 Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs.
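
For reference, the replacements described in the removed section are the statements the docs themselves point users to. A minimal sketch of the post-Shark workflow, reusing the illustrative table name from the removed docs (`logs_last_month` is an example name, not a real table):

```sql
-- Shark's mapred.reduce.tasks is superseded by spark.sql.shuffle.partitions
SET spark.sql.shuffle.partitions=10;

-- Caching is now explicit (and eager by default) rather than driven by
-- the old shark.cache property or the `_cached` table-name convention
CACHE TABLE logs_last_month;

-- Opt into lazy materialization where desired (available since Spark 1.2.0)
CACHE LAZY TABLE logs_last_month;

UNCACHE TABLE logs_last_month;
```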