[SPARK-5089][PYSPARK][MLLIB] Fix vector convert

This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to `DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark should upcast to float64, but this was not actually happening. As a result, non-float64 arrays would be silently misparsed during SerDe, yielding erroneous results when running, for example, KMeans.

The PR includes the fix, as well as a new test for the correct conversion behavior.

davies

Author: freeman <the.freeman.lab@gmail.com>

Closes #3902 from freeman-lab/fix-vector-convert and squashes the following commits:

764db47 [freeman] Add a test for proper conversion behavior
704f97e [freeman] Return array after changing type
author: freeman <the.freeman.lab@gmail.com> 2015-01-05 13:10:59 -0800
committer: Xiangrui Meng <meng@databricks.com> 2015-01-05 13:10:59 -0800
commit: 6c6f32574023b8e43a24f2081ff17e6e446de2f3 (patch)
tree: 01940cc05e61712eb4e3e383f6a4ae12c8209c28 /python/pyspark/mllib/linalg.py
parent: 1c0e7ce056c79e1db96f85b8c56a479b8b043970 (diff)
download: spark-6c6f32574023b8e43a24f2081ff17e6e446de2f3.tar.gz
spark-6c6f32574023b8e43a24f2081ff17e6e446de2f3.tar.bz2
spark-6c6f32574023b8e43a24f2081ff17e6e446de2f3.zip
1 files changed, 1 insertions, 1 deletions
diff --git a/python/pyspark/mllib/linalg.py b/python/pyspark/mllib/linalg.py
index f7aa2b0cb0..4f8491f43e 100644
--- a/python/pyspark/mllib/linalg.py
+++ b/python/pyspark/mllib/linalg.py
@@ -178,7 +178,7 @@ class DenseVector(Vector):
         elif not isinstance(ar, np.ndarray):
             ar = np.array(ar, dtype=np.float64)
         if ar.dtype != np.float64:
-            ar.astype(np.float64)
+            ar = ar.astype(np.float64)
         self.array = ar
 
     def __reduce__(self):
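
The one-line change above works because `ndarray.astype` returns a new array rather than converting in place, so the result has to be assigned back. A minimal standalone sketch of that behavior follows; the helper names `to_float64_buggy` and `to_float64_fixed` are hypothetical and not part of PySpark, and the `np.frombuffer` step is only an illustration of what reading non-float64 bytes as float64 can look like, not the actual SerDe code path.

```python
import numpy as np


def to_float64_buggy(ar):
    # Mirrors the pre-fix logic: astype returns a new array,
    # but the result is discarded, so ar keeps its original dtype.
    if ar.dtype != np.float64:
        ar.astype(np.float64)
    return ar


def to_float64_fixed(ar):
    # Mirrors the post-fix logic: rebind ar to the converted copy.
    if ar.dtype != np.float64:
        ar = ar.astype(np.float64)
    return ar


ints = np.array([1, 2, 3], dtype=np.int32)
buggy = to_float64_buggy(ints)
fixed = to_float64_fixed(ints)
print(buggy.dtype, fixed.dtype)

# Illustration only (an assumption about the failure mode, not Spark's SerDe):
# reinterpreting the raw bytes of two int32 values as float64 yields a single,
# unrelated value instead of [1.0, 2.0].
raw = np.array([1, 2], dtype=np.int32).tobytes()
misread = np.frombuffer(raw, dtype=np.float64)
print(misread.shape)
```

The buggy helper hands back the very same int32 array it was given, while the fixed helper returns a float64 copy with the same values.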