[SPARK-3212][SQL] Use logical plan matching instead of temporary tables for table caching - spark


author	Michael Armbrust <michael@databricks.com>	2014-10-03 12:34:27 -0700
committer	Michael Armbrust <michael@databricks.com>	2014-10-03 12:34:27 -0700
commit	6a1d48f4f02c4498b64439c3dd5f671286a90e30 (patch)
tree	0b22a278418d9f1d8a6decf3f15aafab0de3dd84 /docs/configuration.md
parent	bec0d0eaa33811fde72b84f7d53a6f6031e7b5d3 (diff)

[SPARK-3212][SQL] Use logical plan matching instead of temporary tables for table caching

_Also addresses: SPARK-1671, SPARK-1379 and SPARK-3641_

This PR introduces a new trait, `CacheManager`, which replaces the previous temporary-table-based caching system. Instead of creating a temporary table that shadows an existing table with an equivalent cached representation, the cache manager maintains a separate list of logical plans and their cached data. After optimization, this list is searched for any matching plan fragments. When a matching plan fragment is found, it is replaced with the cached data.

There are several advantages to this approach:
 - Calling `.cache()` on a SchemaRDD now works as you would expect, and uses the more efficient columnar representation.
 - It's now possible to provide a list of temporary tables, without having to decide if a given table is actually just a cached persistent table. (To be done in a follow-up PR)
 - In some cases it is possible that cached data will be used, even if a cached table was not explicitly requested. This is because we now look at the logical structure instead of the table name.
 - We now correctly invalidate when data is inserted into a Hive table.

Author: Michael Armbrust <michael@databricks.com>

Closes #2501 from marmbrus/caching and squashes the following commits:

63fbc2c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into caching.
0ea889e [Michael Armbrust] Address comments.
1e23287 [Michael Armbrust] Add support for cache invalidation for hive inserts.
65ed04a [Michael Armbrust] fix tests.
bdf9a3f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into caching
b4b77f2 [Michael Armbrust] Address comments
6923c9d [Michael Armbrust] More comments / tests
80f26ac [Michael Armbrust] First draft of improved semantics for Spark SQL caching.
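
The plan-matching idea in the commit message can be illustrated with a small, self-contained sketch. This is a toy model in Python, not the code from this PR: the node classes (`Scan`, `Filter`, `Project`, `InMemoryRelation`), the `ToyCacheManager` class, and all method names are hypothetical, plan equality is plain structural equality, and the optimization step that precedes the search is not modelled. Invalidation is assumed here to simply drop every cached entry that reads the modified table.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


# Toy logical plan nodes (hypothetical; not Spark's Catalyst classes).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    condition: str
    child: object


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object


@dataclass(frozen=True)
class InMemoryRelation:
    """Stands in for the cached (e.g. columnar) data of a plan."""
    cached_plan: object


def references(plan, table: str) -> bool:
    """Return True if the plan reads from the given table."""
    if isinstance(plan, Scan):
        return plan.table == table
    if isinstance(plan, (Filter, Project)):
        return references(plan.child, table)
    return False


class ToyCacheManager:
    """Keeps a list of (logical plan, cached data) pairs."""

    def __init__(self) -> None:
        self._cached: List[Tuple[object, InMemoryRelation]] = []

    def cache(self, plan) -> None:
        # Register the plan once; a repeated request is a no-op.
        if self.lookup(plan) is None:
            self._cached.append((plan, InMemoryRelation(plan)))

    def lookup(self, plan) -> Optional[InMemoryRelation]:
        # Structural equality replaces lookup by table name.
        for cached_plan, relation in self._cached:
            if cached_plan == plan:
                return relation
        return None

    def use_cached_data(self, plan):
        """Replace every matching plan fragment with its cached data."""
        hit = self.lookup(plan)
        if hit is not None:
            return hit
        if isinstance(plan, Filter):
            return Filter(plan.condition, self.use_cached_data(plan.child))
        if isinstance(plan, Project):
            return Project(plan.columns, self.use_cached_data(plan.child))
        return plan

    def invalidate(self, table: str) -> None:
        # Assumption: an insert drops every cached plan reading the table.
        self._cached = [
            (p, r) for p, r in self._cached if not references(p, table)
        ]


manager = ToyCacheManager()
adults = Filter("age > 21", Scan("people"))
manager.cache(adults)

# A different query that happens to contain the cached fragment.
query = Project(("name",), Filter("age > 21", Scan("people")))
rewritten = manager.use_cached_data(query)
```

In this sketch `rewritten` is the `Project` node wrapped around an `InMemoryRelation`, even though `query` never named a cached table. Calling `manager.invalidate("people")` afterwards makes `use_cached_data(query)` return the query unchanged.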

Diffstat (limited to 'docs/configuration.md')

0 files changed, 0 insertions, 0 deletions

