[SPARK-17073][SQL] generate column-level statistics - spark

author	Zhenhua Wang <wzh_zju@163.com>	2016-10-03 10:12:02 -0700
committer	Reynold Xin <rxin@databricks.com>	2016-10-03 10:12:02 -0700
commit	7bf92127643570e4eb3610fa3ffd36839eba2718 (patch)
tree	14386f49f956e97b50a8d6b2bbf0f776eab4dd39 /docs/mllib-linear-methods.md
parent	a27033c0bbaae8f31db9b91693947ed71738ed11 (diff)
download	spark-7bf92127643570e4eb3610fa3ffd36839eba2718.tar.gz spark-7bf92127643570e4eb3610fa3ffd36839eba2718.tar.bz2 spark-7bf92127643570e4eb3610fa3ffd36839eba2718.zip

[SPARK-17073][SQL] generate column-level statistics

## What changes were proposed in this pull request?

Generate basic column statistics for all the atomic types:

- numeric types: max, min, num of nulls, ndv (number of distinct values)
- date/timestamp types: they are also represented as numbers internally, so they have the same stats as above.
- string: avg length, max length, num of nulls, ndv
- binary: avg length, max length, num of nulls
- boolean: num of nulls, num of trues, num of falses

Also support storing and loading these statistics.

One thing to notice: we support analyzing columns independently, e.g.:

sql1: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key;`
sql2: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS value;`

When running sql2 to collect column stats for `value`, we don't remove the stats of column `key`, which was analyzed in sql1 but not in sql2. As a result, **users need to guarantee consistency** between sql1 and sql2. If the table has been changed before sql2, users should re-analyze column `key` when they want to analyze column `value`:

`ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key, value;`

## How was this patch tested?

add unit tests

Author: Zhenhua Wang <wzh_zju@163.com>

Closes #15090 from wzhfy/colStats.
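
The per-type statistics and the "analyze columns independently" behavior described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this patch: `column_stats`, `analyze_columns`, the `kind` tags, and the dictionary layout are all hypothetical names, and ndv is counted exactly here, whereas an engine may use an approximation.

```python
from typing import Any, Dict, List, Optional, Tuple


def column_stats(values: List[Optional[Any]], kind: str) -> Dict[str, Any]:
    """Compute the statistics listed in the commit message for one column.

    `kind` is a hypothetical tag: "numeric", "string", "binary" or "boolean".
    Date/timestamp columns would be passed as their internal numbers
    with kind="numeric".
    """
    non_null = [v for v in values if v is not None]
    stats: Dict[str, Any] = {"num_nulls": len(values) - len(non_null)}
    if kind == "numeric":
        stats["max"] = max(non_null) if non_null else None
        stats["min"] = min(non_null) if non_null else None
        stats["ndv"] = len(set(non_null))  # exact count (assumption)
    elif kind in ("string", "binary"):
        lengths = [len(v) for v in non_null]
        stats["avg_len"] = sum(lengths) / len(lengths) if lengths else None
        stats["max_len"] = max(lengths) if lengths else None
        if kind == "string":  # the message lists ndv for string, not binary
            stats["ndv"] = len(set(non_null))
    elif kind == "boolean":
        stats["num_trues"] = sum(1 for v in non_null if v is True)
        stats["num_falses"] = sum(1 for v in non_null if v is False)
    else:
        raise ValueError(f"unsupported kind: {kind}")
    return stats


Table = Dict[str, Tuple[str, List[Optional[Any]]]]


def analyze_columns(stored: Dict[str, Dict[str, Any]],
                    table: Table,
                    columns: List[str]) -> Dict[str, Dict[str, Any]]:
    """Recompute stats for `columns` only; stats of other columns are kept."""
    updated = dict(stored)
    for name in columns:
        kind, values = table[name]
        updated[name] = column_stats(values, kind)
    return updated


# sql1: analyze `key` only.
table: Table = {
    "key": ("numeric", [1, 2, 2, None]),
    "value": ("string", ["a", "bcd", None, "a"]),
}
stats_after_sql1 = analyze_columns({}, table, ["key"])

# The table changes, then sql2 analyzes `value` only: `key` stats go stale.
table["key"] = ("numeric", [10, 20, 30, None])
stats_after_sql2 = analyze_columns(stats_after_sql1, table, ["value"])

# Re-analyzing both columns together restores consistency.
stats_consistent = analyze_columns(stats_after_sql2, table, ["key", "value"])
```

In this sketch, `stats_after_sql2["key"]` still describes the old data because only `value` was recomputed, which is the consistency caveat the commit message raises; analyzing `key, value` in one statement refreshes both.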

