[SPARK-18522][SQL] Explicit contract for column stats serialization - spark

author	Reynold Xin <rxin@databricks.com>	2016-11-23 20:48:41 +0800
committer	Wenchen Fan <wenchen@databricks.com>	2016-11-23 20:48:41 +0800
commit	70ad07a9d20586ae182c4e60ed97bdddbcbceff3 (patch)
tree	14666ca06583b5ee8fc6ee09b0434aa824c2efde /.github
parent	9785ed40d7fe4e1fcd440e55706519c6e5f8d6b1 (diff)
download	spark-70ad07a9d20586ae182c4e60ed97bdddbcbceff3.tar.gz spark-70ad07a9d20586ae182c4e60ed97bdddbcbceff3.tar.bz2 spark-70ad07a9d20586ae182c4e60ed97bdddbcbceff3.zip

[SPARK-18522][SQL] Explicit contract for column stats serialization

## What changes were proposed in this pull request?

The current implementation of column stats uses the base64 encoding of the internal UnsafeRow format to persist statistics (in table properties in the Hive metastore). This is an internal format that is not stable across different versions of Spark and should NOT be used for persistence. In addition, it would be better if statistics stored in the catalog were human-readable.

This pull request introduces the following changes:

1. Created a single ColumnStat class for all data types. All data types track the same set of statistics.
2. Updated the implementation for stats collection to get rid of the dependency on internal data structures (e.g. InternalRow, or storing DateType as an int32). For example, previously dates were stored as a single integer, but are now stored as java.sql.Date. When we implement the next steps of CBO, we can add code to convert those back into internal types again.
3. Documented clearly which JVM data types are used to store which data.
4. Defined a simple Map[String, String] interface for serializing and deserializing column stats into/from the catalog.
5. Rearranged the method/function structure so it is clearer what the supported data types are, and also moved how stats are generated into the ColumnStat class so they are easy to find.

## How was this patch tested?

Removed most of the original test cases created for column statistics, and added three very simple ones to cover all the cases. The three test cases validate:

1. Roundtrip serialization works.
2. Behavior when analyzing a non-existent column or a column of an unsupported data type.
3. Result of stats collection for all valid data types.

Also moved parser-related tests into a parser test suite and added an explicit serialization test for the Hive external catalog.

Author: Reynold Xin <rxin@databricks.com>

Closes #15959 from rxin/SPARK-18522.
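
The string-map contract described in the commit message can be illustrated with a small roundtrip sketch. This is not Spark's code: it is written in Python rather than Scala for brevity, and the class layout, the field names (`distinct_count`, `null_count`, `min`, `max`), the map keys, and the `version` entry are all assumptions made for illustration. It only shows the general idea of persisting statistics as human-readable strings (here, ISO dates) instead of an encoded internal row.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class ColumnStat:
    """Hypothetical stand-in for a single stats class shared by all data types."""
    distinct_count: int
    null_count: int
    min: Optional[date]  # external type (a date), not an internal int32
    max: Optional[date]

    def to_map(self) -> Dict[str, str]:
        # Every value is a plain string, so the stored properties stay
        # human-readable. The key names here are illustrative only.
        props = {
            "version": "1",
            "distinctCount": str(self.distinct_count),
            "nullCount": str(self.null_count),
        }
        if self.min is not None:
            props["min"] = self.min.isoformat()
        if self.max is not None:
            props["max"] = self.max.isoformat()
        return props

    @staticmethod
    def from_map(props: Dict[str, str]) -> "ColumnStat":
        # Absent min/max keys map back to None (e.g. for an all-null column).
        def parse(key: str) -> Optional[date]:
            return date.fromisoformat(props[key]) if key in props else None

        return ColumnStat(
            distinct_count=int(props["distinctCount"]),
            null_count=int(props["nullCount"]),
            min=parse("min"),
            max=parse("max"),
        )
```

A roundtrip check in the spirit of the first test case listed above would build a `ColumnStat`, call `to_map()`, pass the result to `from_map()`, and compare the two objects for equality.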

Diffstat (limited to '.github')

0 files changed, 0 insertions, 0 deletions

