diff options
author | Sameer Agarwal <sameer@databricks.com> | 2016-06-10 20:43:18 -0700 |
---|---|---|
committer | Davies Liu <davies.liu@gmail.com> | 2016-06-10 20:43:18 -0700 |
commit | 468da03e23a01e02718608f05d778386cbb8416b (patch) | |
tree | 716d829d8a12df9f401c5b9e61902b52de6c4d49 /sql/catalyst | |
parent | 8e7b56f3d4917692d3ff44d91aa264738a6fc2ed (diff) | |
download | spark-468da03e23a01e02718608f05d778386cbb8416b.tar.gz spark-468da03e23a01e02718608f05d778386cbb8416b.tar.bz2 spark-468da03e23a01e02718608f05d778386cbb8416b.zip |
[SPARK-15678] Add support to REFRESH data source paths
## What changes were proposed in this pull request?
Spark currently incorrectly continues to use cached data even if the underlying data is overwritten.
Current behavior:
```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset
```
This patch fixes this bug by adding support for `REFRESH path` that invalidates and refreshes all the cached data (and the associated metadata) for any dataframe that contains the given data source path.
Expected behavior:
```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
spark.catalog.refreshResource(dir)
sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset
```
## How was this patch tested?
Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`.
Author: Sameer Agarwal <sameer@databricks.com>
Closes #13566 from sameeragarwal/refresh-path-2.
Diffstat (limited to 'sql/catalyst')
-rw-r--r-- | sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 | 1 |
1 files changed, 1 insertions, 0 deletions
diff --git a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 index d10255946a..044f910388 100644 --- a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 +++ b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 @@ -113,6 +113,7 @@ statement | (DESC | DESCRIBE) option=(EXTENDED | FORMATTED)? tableIdentifier partitionSpec? describeColName? #describeTable | REFRESH TABLE tableIdentifier #refreshTable + | REFRESH .*? #refreshResource | CACHE LAZY? TABLE identifier (AS? query)? #cacheTable | UNCACHE TABLE identifier #uncacheTable | CLEAR CACHE #clearCache |