[SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API

```
pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
    :: Experimental ::

    If `observed` is a Vector, conduct Pearson's chi-squared goodness
    of fit test of the observed data against the expected distribution,
    or against the uniform distribution (by default), with each category
    having an expected frequency of `1 / len(observed)`.
    (Note: `observed` cannot contain negative values)

    If `observed` is a matrix, conduct Pearson's independence test on the
    input contingency matrix, which cannot contain negative entries or
    columns or rows that sum up to 0.

    If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
    test for every feature against the label across the input RDD.
    For each feature, the (feature, label) pairs are converted into a
    contingency matrix for which the chi-squared statistic is computed.
    All label and feature values must be categorical.

    :param observed: a vector containing the observed categorical
                     counts/relative frequencies, or the contingency matrix
                     (containing either counts or relative frequencies),
                     or an RDD of LabeledPoint containing the labeled dataset
                     with categorical features. Real-valued features will be
                     treated as categorical for each distinct value.
    :param expected: Vector containing the expected categorical counts/relative
                     frequencies. `expected` is rescaled if the `expected` sum
                     differs from the `observed` sum.
    :return: ChiSquaredTest object containing the test statistic, degrees
             of freedom, p-value, the method used, and the null hypothesis.
```

Author: Davies Liu <davies@databricks.com>

Closes #3091 from davies/his and squashes the following commits:

145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API
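
To make the goodness-of-fit case in the docstring concrete, here is a minimal pure-Python sketch of the statistic it describes: a uniform default for `expected`, rescaling of `expected` to the `observed` sum, and rejection of negative observations. The function name `chi_sq_statistic` is hypothetical and is not part of this patch or of MLlib. The degrees of freedom are assumed to be `len(observed) - 1`, the usual choice for this test, and no p-value is computed.

```python
def chi_sq_statistic(observed, expected=None):
    """Pearson's chi-squared goodness-of-fit statistic (illustrative sketch only)."""
    observed = list(observed)
    n = len(observed)
    if n == 0:
        raise ValueError("observed must not be empty")
    if any(o < 0 for o in observed):
        raise ValueError("observed cannot contain negative values")

    if expected is None:
        # Default: uniform distribution, each category has frequency 1 / len(observed).
        expected = [1.0 / n] * n
    expected = list(expected)
    if len(expected) != n:
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        # Assumption of this sketch: zero or negative expected values are rejected.
        raise ValueError("expected values must be positive")

    # Rescale expected so that its sum matches the observed sum.
    scale = float(sum(observed)) / sum(expected)
    scaled = [e * scale for e in expected]

    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, scaled))
    degrees_of_freedom = n - 1  # assumed convention for goodness of fit
    return statistic, degrees_of_freedom


stat, dof = chi_sq_statistic([4, 6, 5])
print(stat, dof)
```

Passing `expected=[1, 1]` with `observed=[10, 20]` exercises the rescaling path: the expected counts become 15 and 15 before the statistic is computed.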
author: Davies Liu <davies@databricks.com> 2014-11-04 21:35:52 -0800
committer: Xiangrui Meng <meng@databricks.com> 2014-11-04 21:35:52 -0800
commit: c8abddc5164d8cf11cdede6ab3d5d1ea08028708 (patch)
tree: 2ba4fc42b9c1b9cc6ca8fbd648d4cc30e9a484c8 /python/pyspark/mllib/linalg.py
parent: 515abb9afa2d6b58947af6bb079a493b49d315ca (diff)
1 files changed, 12 insertions, 1 deletions
diff --git a/python/pyspark/mllib/linalg.py b/python/pyspark/mllib/linalg.py
index c0c3dff31e..e35202dca0 100644
--- a/python/pyspark/mllib/linalg.py
+++ b/python/pyspark/mllib/linalg.py
@@ -33,7 +33,7 @@ from pyspark.sql import UserDefinedType, StructField, StructType, ArrayType, Dou
     IntegerType, ByteType, Row
 
 
-__all__ = ['Vector', 'DenseVector', 'SparseVector', 'Vectors']
+__all__ = ['Vector', 'DenseVector', 'SparseVector', 'Vectors', 'DenseMatrix', 'Matrices']
 
 
 if sys.version_info[:2] == (2, 7):
@@ -578,6 +578,8 @@ class DenseMatrix(Matrix):
     def __init__(self, numRows, numCols, values):
         Matrix.__init__(self, numRows, numCols)
         assert len(values) == numRows * numCols
+        if not isinstance(values, array.array):
+            values = array.array('d', values)
         self.values = values
 
     def __reduce__(self):
@@ -596,6 +598,15 @@ class DenseMatrix(Matrix):
         return np.reshape(self.values, (self.numRows, self.numCols), order='F')
 
 
+class Matrices(object):
+    @staticmethod
+    def dense(numRows, numCols, values):
+        """
+        Create a DenseMatrix
+        """
+        return DenseMatrix(numRows, numCols, values)
+
+
 def _test():
     import doctest
     (failure_count, test_count) = doctest.testmod(optionflags=doctest.ELLIPSIS)
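
The two `DenseMatrix` changes above can be illustrated without Spark or NumPy. The sketch below mirrors the `array.array('d', ...)` conversion added to `__init__` and shows the column-major layout implied by the existing `np.reshape(..., order='F')` call. The helper names `to_double_array` and `entry` are hypothetical and are not part of the patch.

```python
import array


def to_double_array(values):
    # Same check as the patched DenseMatrix.__init__: anything that is not
    # already an array.array is copied into an array of C doubles.
    if not isinstance(values, array.array):
        values = array.array('d', values)
    return values


def entry(values, num_rows, i, j):
    # Column-major (Fortran order) lookup: column j starts at offset j * num_rows.
    return values[i + j * num_rows]


num_rows, num_cols = 2, 3
values = to_double_array(range(6))  # integers 0..5 stored as doubles

# Rebuild the row-wise view of the matrix from the flat column-major storage.
rows = [[entry(values, num_rows, i, j) for j in range(num_cols)]
        for i in range(num_rows)]
print(rows)
```

With this layout, `Matrices.dense(2, 3, range(6))` describes a matrix whose first column is `0, 1`, whose second column is `2, 3`, and whose third column is `4, 5`.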