field | value | date |
---|---|---|
author | Sean Owen <sowen@cloudera.com> | 2016-10-08 11:31:12 +0100 |
committer | Sean Owen <sowen@cloudera.com> | 2016-10-08 11:31:12 +0100 |
commit | 4201ddcc07ca2e9af78bf4a74fdb3900c1783347 | |
tree | ae50667b9ae7e8e8b57ccf431ad08181c40baaac /sql/hive | |
parent | 362ba4b6f8e8fc2355368742c5adced7573fec00 | |
[SPARK-17768][CORE] Small (Sum,Count,Mean)Evaluator problems and suboptimalities
## What changes were proposed in this pull request?
Fix:
- GroupedMeanEvaluator and GroupedSumEvaluator are unused, as is the StudentTCacher support class
- CountEvaluator can return a lower bound < 0, even though counts can't be negative
- MeanEvaluator fails when given exactly 1 datum (the t-test would have 0 degrees of freedom)
- CountEvaluator uses a normal distribution, which may be an inappropriate approximation (and leads to the negative lower bound above)
- Test for SumEvaluator asserts incorrect expected sums: e.g. if the sum is 2 after observing 10% of the data, the expected total should be 20, not 38
- CountEvaluator, MeanEvaluator have no unit tests to catch these
- Duplication of distribution code across CountEvaluator, GroupedCountEvaluator
- The stats in each could use a bit of documentation as I had to guess at them
- (Code could use a few cleanups and optimizations too)
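The arithmetic behind three of the points above can be sketched as follows. This is a minimal illustration, not the Spark implementation: the function names are hypothetical, and the variance formula used for the normal approximation is an assumption chosen only to show how a symmetric interval can dip below zero for small counts.

```python
import math
from statistics import NormalDist
from typing import Tuple


def extrapolate(observed: float, fraction_seen: float) -> float:
    """Scale a partial sum or count linearly up to the full data set."""
    if not 0.0 < fraction_seen <= 1.0:
        raise ValueError("fraction_seen must be in (0, 1]")
    return observed / fraction_seen


def normal_count_interval(observed: int, fraction_seen: float,
                          confidence: float = 0.95) -> Tuple[float, float]:
    """Symmetric normal-approximation interval around the extrapolated count.

    ASSUMPTION: variance = observed * (1 - p) / p**2, used for illustration
    only; it is not taken from the Spark source.
    """
    p = fraction_seen
    mean = extrapolate(observed, p)
    stdev = math.sqrt(observed * (1.0 - p)) / p
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return mean - z * stdev, mean + z * stdev


def clamped_count_interval(observed: int, fraction_seen: float,
                           confidence: float = 0.95) -> Tuple[float, float]:
    """Same interval, with the lower bound clamped at zero."""
    low, high = normal_count_interval(observed, fraction_seen, confidence)
    return max(0.0, low), high


def t_degrees_of_freedom(n: int) -> int:
    """Degrees of freedom for a one-sample t interval; needs n >= 2."""
    if n < 2:
        raise ValueError("a t interval needs at least 2 data points")
    return n - 1
```

With a sum of 2 after 10% of the data, `extrapolate(2, 0.1)` gives 20, which matches the corrected test expectation. For the same inputs the unclamped normal interval has a lower bound below zero, and a single datum leaves no degrees of freedom for a t interval.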
## How was this patch tested?
Existing and new tests
Author: Sean Owen <sowen@cloudera.com>
Closes #15341 from srowen/SPARK-17768.