author: Michael Armbrust <michael@databricks.com> 2014-10-24 10:52:25 -0700
committer: Michael Armbrust <michael@databricks.com> 2014-10-24 10:52:25 -0700
commit: 0e886610eedd8ea24761cdcefa25ccedeca72dc8
tree: 94560271700541448c803e6c26ffc134b4b6ced7 /assembly
parent: d60a9d440b00beb107c1f1d7f42886c94f04a092
[SPARK-4050][SQL] Fix caching of temporary tables with projections.
Previously cached data was found by `sameResult` plan matching on optimized plans. This technique however fails to locate the cached data when a temporary table with a projection is queried with a further reduced projection. The failure is due to the fact that optimization will collapse the projections, producing a plan that no longer produces the sameResult as the cached data (though the cached data still subsumes the desired data). For example consider the following previously failing test case.

```scala
sql("CACHE TABLE tempTable AS SELECT key FROM testData")
assertCached(sql("SELECT COUNT(*) FROM tempTable"))
```

In this PR I change the matching to occur after analysis instead of optimization, so that in the case of temporary tables, the plans will always match. I think this should work generally, however, this error does raise questions about the need to do more thorough subsumption checking when locating cached data.

Another question is what sort of semantics we want to provide when uncaching data from temporary tables. For example consider the following sequence of commands:

```scala
testData.select('key).registerTempTable("tempTable1")
testData.select('key).registerTempTable("tempTable2")
cacheTable("tempTable1")

// This obviously works.
assertCached(sql("SELECT COUNT(*) FROM tempTable1"))

// It seems good that this works ...
assertCached(sql("SELECT COUNT(*) FROM tempTable2"))

// ... but is this valid?
uncacheTable("tempTable2")

// Should this still be cached?
assertCached(sql("SELECT COUNT(*) FROM tempTable1"), 0)
```

Author: Michael Armbrust <michael@databricks.com>

Closes #2912 from marmbrus/cachingBug and squashes the following commits:

9c822d4 [Michael Armbrust] remove commented out code
5c72fb7 [Michael Armbrust] Add a test case / question about uncaching semantics.
63a23e4 [Michael Armbrust] Perform caching on analyzed instead of optimized plan.
03f1cfe [Michael Armbrust] Clean-up / add tests to SameResult suite.
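The mismatch described in this message can be illustrated with a toy model. This is only a sketch under stated assumptions: `Plan`, `Relation`, `Project`, `Aggregate`, `optimize`, and `usesCache` are hypothetical stand-ins invented for illustration and are not Spark's Catalyst classes or its actual `sameResult` logic. The toy optimizer collapses stacked projections and prunes unused columns, which is enough to make the optimized query stop matching the cached plan, while the analyzed query still contains it.

```scala
object PlanCacheSketch {
  // Hypothetical miniature logical plan; NOT Spark's Catalyst API.
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan
  // `references` lists the input columns the aggregate expression reads.
  case class Aggregate(expr: String, references: Seq[String], child: Plan) extends Plan

  // Toy optimizer: collapse stacked projections and prune columns
  // that the parent aggregate never reads.
  def optimize(plan: Plan): Plan = plan match {
    case Project(outer, Project(_, child)) =>
      optimize(Project(outer, child))
    case Aggregate(expr, refs, Project(cols, child)) =>
      Aggregate(expr, refs, optimize(Project(cols.filter(c => refs.contains(c)), child)))
    case Project(cols, child) => Project(cols, optimize(child))
    case Aggregate(expr, refs, child) => Aggregate(expr, refs, optimize(child))
    case r: Relation => r
  }

  // Toy cache lookup: does any subtree of `query` equal the cached plan?
  def usesCache(query: Plan, cached: Plan): Boolean =
    query == cached || (query match {
      case Project(_, child)      => usesCache(child, cached)
      case Aggregate(_, _, child) => usesCache(child, cached)
      case _: Relation            => false
    })

  // CACHE TABLE tempTable AS SELECT key FROM testData
  val cached: Plan = Project(Seq("key"), Relation("testData"))
  // SELECT COUNT(*) FROM tempTable, with tempTable resolved by analysis.
  val analyzedQuery: Plan = Aggregate("COUNT(*)", Nil, cached)

  // Matching after analysis: the cached subtree is still present.
  val matchOnAnalyzed: Boolean = usesCache(analyzedQuery, cached)
  // Matching after optimization: the projection was pruned, so no subtree matches.
  val matchOnOptimized: Boolean = usesCache(optimize(analyzedQuery), optimize(cached))

  def main(args: Array[String]): Unit = {
    println(s"match on analyzed plan:  $matchOnAnalyzed")
    println(s"match on optimized plan: $matchOnOptimized")
  }
}
```

In this model the cached data still subsumes what the optimized query needs, but a plain equality-style match cannot see that, which is the gap the message raises when it mentions more thorough subsumption checking.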
Diffstat (limited to 'assembly')
0 files changed, 0 insertions, 0 deletions