author: Reynold Xin <rxin@databricks.com> 2016-11-23 20:48:41 +0800
committer: Wenchen Fan <wenchen@databricks.com> 2016-11-23 20:48:41 +0800
commit: 70ad07a9d20586ae182c4e60ed97bdddbcbceff3
tree: 14666ca06583b5ee8fc6ee09b0434aa824c2efde /.github
parent: 9785ed40d7fe4e1fcd440e55706519c6e5f8d6b1
[SPARK-18522][SQL] Explicit contract for column stats serialization
## What changes were proposed in this pull request?

The current implementation of column stats uses the base64 encoding of the internal UnsafeRow format to persist statistics (in table properties in the Hive metastore). This is an internal format that is not stable across different versions of Spark and should NOT be used for persistence. In addition, it would be better if statistics stored in the catalog were human readable.

This pull request introduces the following changes:

1. Created a single ColumnStat class for all data types. All data types track the same set of statistics.
2. Updated the implementation for stats collection to get rid of the dependency on internal data structures (e.g. InternalRow, or storing DateType as an int32). For example, previously dates were stored as a single integer, but are now stored as java.sql.Date. When we implement the next steps of CBO, we can add code to convert those back into internal types again.
3. Documented clearly which JVM data types are used to store which data.
4. Defined a simple Map[String, String] interface for serializing and deserializing column stats into/from the catalog.
5. Rearranged the method/function structure so it is clearer what the supported data types are, and also moved how stats are generated into the ColumnStat class so they are easy to find.

## How was this patch tested?

Removed most of the original test cases created for column statistics, and added three very simple ones to cover all the cases. The three test cases validate:

1. Roundtrip serialization works.
2. Behavior when analyzing a non-existent column or a column with an unsupported data type.
3. Result of stats collection for all valid data types.

Also moved parser related tests into a parser test suite and added an explicit serialization test for the Hive external catalog.

Author: Reynold Xin <rxin@databricks.com>

Closes #15959 from rxin/SPARK-18522.
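
To illustrate the string-map contract from item 4 and the roundtrip test above, here is a minimal sketch. It is written in Python for brevity, not in Spark's Scala, and it is not the code from this patch: the `ColumnStat` fields, the key names (`version`, `distinctCount`, `nullCount`, `min`, `max`), and the helper names `to_map` / `from_map` are all assumptions chosen for illustration. It only handles date-typed min/max values, to show a date being persisted as readable text instead of an internal integer.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class ColumnStat:
    """Hypothetical stand-in for a per-column statistics record."""
    distinct_count: int
    null_count: int
    min_value: Optional[date] = None
    max_value: Optional[date] = None

    def to_map(self) -> Dict[str, str]:
        """Serialize to a flat, human-readable string-to-string map."""
        props = {
            "version": "1",  # assumed key, lets the format evolve later
            "distinctCount": str(self.distinct_count),
            "nullCount": str(self.null_count),
        }
        # Dates are written as ISO text, not as an internal day count.
        if self.min_value is not None:
            props["min"] = self.min_value.isoformat()
        if self.max_value is not None:
            props["max"] = self.max_value.isoformat()
        return props

    @staticmethod
    def from_map(props: Dict[str, str]) -> "ColumnStat":
        """Rebuild a ColumnStat from the map produced by to_map."""
        def parse_date(key: str) -> Optional[date]:
            raw = props.get(key)
            return date.fromisoformat(raw) if raw is not None else None

        return ColumnStat(
            distinct_count=int(props["distinctCount"]),
            null_count=int(props["nullCount"]),
            min_value=parse_date("min"),
            max_value=parse_date("max"),
        )


# Roundtrip: serialize, then deserialize, and compare with the original.
stat = ColumnStat(distinct_count=3, null_count=1,
                  min_value=date(2016, 1, 1), max_value=date(2016, 11, 23))
props = stat.to_map()
restored = ColumnStat.from_map(props)
```

Because every value in the map is plain text, the stored properties can be inspected directly in the catalog and do not depend on any in-memory row layout.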
Diffstat (limited to '.github')
0 files changed, 0 insertions, 0 deletions