author Peng <peng.meng@intel.com> 2016-10-14 12:48:57 +0100
committer Sean Owen <sowen@cloudera.com> 2016-10-14 12:48:57 +0100
commit c8b612decba28e51789891f7881b6d4ebc50e2bb
tree 33a908908c1647bc1636d6c372cf381510be902e /python/pyspark/ml/feature.py
parent a1b136d05c6c458ae8211b0844bfc98d7693fa42
[SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference
## What changes were proposed in this pull request?

The ChiSqSelector feature selection method ranks features by ChiSquareTestResult.statistic (the chi-square value) and selects the features with the largest values. However, the degrees of freedom (df) of the chi-square values returned by Statistics.chiSqTest(RDD) can differ between features, and chi-square values with different df are not directly comparable, so they cannot be used to rank features. This patch therefore changes the selection criterion from statistic to pValue for SelectKBest and SelectPercentile.

## How was this patch tested?

Changed the existing tests.

Author: Peng <peng.meng@intel.com>

Closes #15444 from mpjlu/chisqure-bug.
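
To illustrate why the raw statistic is not comparable across different df, here is a minimal sketch in plain Python. It is not Spark's implementation: the helper `chi2_sf` and the two features "A" and "B" with their (statistic, df) pairs are hypothetical, chosen only to show that the feature with the larger chi-square value can have the larger (less significant) p-value.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable with integer df >= 1.

    Illustrative helper (an assumption of this sketch, not a Spark API).
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for the two base cases: df = 1 (a = 1/2) and df = 2 (a = 1).
    if df % 2 == 1:
        a = 0.5
        q = math.erfc(math.sqrt(half))
    else:
        a = 1.0
        q = math.exp(-half)
    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q


# Hypothetical (statistic, df) pairs for two features.
features = {"A": (6.0, 1), "B": (8.0, 4)}

# Ranking by the raw statistic prefers the larger chi-square value.
best_by_statistic = max(features, key=lambda name: features[name][0])

# Ranking by p-value prefers the smaller (more significant) p-value.
best_by_pvalue = min(features, key=lambda name: chi2_sf(*features[name]))

print(best_by_statistic, best_by_pvalue)
```

In this made-up case, feature B has the larger statistic, but once its 4 degrees of freedom are taken into account its p-value is larger than that of feature A, so the two criteria pick different features.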
Diffstat (limited to 'python/pyspark/ml/feature.py')
-rwxr-xr-x python/pyspark/ml/feature.py 4
1 files changed, 2 insertions, 2 deletions
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index a33c3e7945..7683360664 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -2592,9 +2592,9 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
>>> selector = ChiSqSelector(numTopFeatures=1, outputCol="selectedFeatures")
>>> model = selector.fit(df)
>>> model.transform(df).head().selectedFeatures
- DenseVector([1.0])
+ DenseVector([18.0])
>>> model.selectedFeatures
- [3]
+ [2]
>>> chiSqSelectorPath = temp_path + "/chi-sq-selector"
>>> selector.save(chiSqSelectorPath)
>>> loadedSelector = ChiSqSelector.load(chiSqSelectorPath)