author | Zhenhua Wang <wzh_zju@163.com> | 2016-10-03 10:12:02 -0700 |
---|---|---|
committer | Reynold Xin <rxin@databricks.com> | 2016-10-03 10:12:02 -0700 |
commit | 7bf92127643570e4eb3610fa3ffd36839eba2718 (patch) | |
tree | 14386f49f956e97b50a8d6b2bbf0f776eab4dd39 /docs/mllib-linear-methods.md | |
parent | a27033c0bbaae8f31db9b91693947ed71738ed11 (diff) | |
[SPARK-17073][SQL] generate column-level statistics
## What changes were proposed in this pull request?
Generate basic column statistics for all the atomic types:
- numeric types: max, min, num of nulls, ndv (number of distinct values)
- date/timestamp types: they are also represented as numbers internally, so they have the same stats as above.
- string: avg length, max length, num of nulls, ndv
- binary: avg length, max length, num of nulls
- boolean: num of nulls, num of trues, num of falses
Also support storing and loading these statistics.
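For illustration only, the per-type statistics listed above can be sketched in plain Python. This is not the code in this patch: the function name `column_stats`, the `kind` argument, and the dictionary keys are hypothetical, ndv is computed exactly here (an engine may well approximate it), and string length is counted in characters rather than bytes.

```python
def column_stats(values, kind):
    """Compute basic statistics for one column of values.

    `kind` is one of "numeric", "string", "binary", "boolean" (hypothetical labels).
    Date/timestamp values can be passed as "numeric" once converted to numbers.
    """
    non_null = [v for v in values if v is not None]
    stats = {"num_nulls": len(values) - len(non_null)}

    if kind == "numeric":
        stats["max"] = max(non_null) if non_null else None
        stats["min"] = min(non_null) if non_null else None
        stats["ndv"] = len(set(non_null))  # exact distinct count
    elif kind in ("string", "binary"):
        lengths = [len(v) for v in non_null]
        stats["avg_len"] = sum(lengths) / len(lengths) if lengths else None
        stats["max_len"] = max(lengths) if lengths else None
        if kind == "string":
            # ndv is listed for strings but not for binary columns
            stats["ndv"] = len(set(non_null))
    elif kind == "boolean":
        stats["num_trues"] = sum(1 for v in non_null if v is True)
        stats["num_falses"] = sum(1 for v in non_null if v is False)
    else:
        raise ValueError(f"unsupported column kind: {kind}")
    return stats
```

Calling `column_stats([1, None, 3, 3], "numeric")` returns the null count, max, min, and ndv for that column; the other kinds return the fields named in the list above.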
One thing to notice:
We support analyzing columns independently, e.g.:
sql1: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key;`
sql2: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS value;`
When sql2 runs to collect column statistics for `value`, the statistics for column `key` (analyzed in sql1 but not in sql2) are not removed. As a result, **users need to guarantee consistency** between sql1 and sql2. If the table has changed between sql1 and sql2, users should re-analyze column `key` whenever they analyze column `value`:
`ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key, value;`
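The consistency caveat can be illustrated with a small Python model. It is a sketch under stated assumptions, not Spark's catalog code: `analyze_columns`, the in-memory `table`, and the stats dictionary are all hypothetical stand-ins for the stored statistics.

```python
def analyze_columns(stored, table, columns):
    """Recompute stats only for `columns`; stats of other columns are kept untouched."""
    updated = dict(stored)  # previously stored column stats survive
    for col in columns:
        non_null = [row[col] for row in table if row[col] is not None]
        updated[col] = {
            "num_nulls": len(table) - len(non_null),
            "ndv": len(set(non_null)),
        }
    return updated


# Hypothetical table standing in for `src`.
table = [{"key": 1, "value": "a"}, {"key": 2, "value": "b"}]

stats = analyze_columns({}, table, ["key"])       # like sql1
table.append({"key": 3, "value": "c"})            # table changes in between
stats = analyze_columns(stats, table, ["value"])  # like sql2: `key` stats are now stale

# Re-analyzing both columns together restores consistency.
fresh = analyze_columns(stats, table, ["key", "value"])
```

After the second call, the stored stats for `key` still describe the two-row table while those for `value` describe three rows, which is the inconsistency the combined `ANALYZE` statement avoids.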
## How was this patch tested?
Added unit tests.
Author: Zhenhua Wang <wzh_zju@163.com>
Closes #15090 from wzhfy/colStats.