author Liang-Chi Hsieh <viirya@gmail.com> 2017-04-05 17:46:44 -0700
committer Joseph K. Bradley <joseph@databricks.com> 2017-04-05 17:46:44 -0700
commit 12206058e8780e202c208b92774df3773eff36ae
tree 363db4aa846ad9e7a57285fd9ba57d5921bb7039 /python/pyspark/ml/linalg
parent 9d68c67235481fa33983afb766916b791ca8212a
[SPARK-20214][ML] Make sure converted csc matrix has sorted indices
## What changes were proposed in this pull request?

`_convert_to_vector` converts a scipy sparse matrix to a csc matrix for initializing `SparseVector`. However, it doesn't guarantee that the converted csc matrix has sorted indices, so a failure happens when you do something like this:

    from scipy.sparse import lil_matrix
    lil = lil_matrix((4, 1))
    lil[1, 0] = 1
    lil[3, 0] = 2
    _convert_to_vector(lil.todok())

    File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
      return SparseVector(l.shape[0], csc.indices, csc.data)
    File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
      % (self.indices[i], self.indices[i + 1]))
    TypeError: Indices 3 and 1 are not strictly increasing

A simple test confirms that `dok_matrix.tocsc()` doesn't guarantee sorted indices:

    >>> from scipy.sparse import lil_matrix
    >>> lil = lil_matrix((4, 1))
    >>> lil[1, 0] = 1
    >>> lil[3, 0] = 2
    >>> dok = lil.todok()
    >>> csc = dok.tocsc()
    >>> csc.has_sorted_indices
    0
    >>> csc.indices
    array([3, 1], dtype=int32)

I checked the scipy source code. The only conversions that guarantee sorted indices are `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.

## How was this patch tested?

Existing tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17532 from viirya/make-sure-sorted-indices.
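
For illustration only, the failure mode described above can be sketched without Spark or scipy. The helper names `check_strictly_increasing` and `sort_indices_with_values` are hypothetical and are not part of pyspark; the error message format is taken from the traceback above, while the exact comparison used by `SparseVector` is an assumption.

```python
def check_strictly_increasing(indices):
    """Hypothetical stand-in for the index validation seen in the traceback.

    Assumption: an index pair fails when the first is >= the second.
    """
    for i in range(len(indices) - 1):
        if indices[i] >= indices[i + 1]:
            raise TypeError(
                "Indices %s and %s are not strictly increasing"
                % (indices[i], indices[i + 1]))


def sort_indices_with_values(indices, values):
    """Reorder indices and values together so that the indices ascend."""
    pairs = sorted(zip(indices, values))
    return [i for i, _ in pairs], [v for _, v in pairs]


# Same layout as the csc matrix in the commit message: indices [3, 1].
unsorted_indices, unsorted_values = [3, 1], [2.0, 1.0]

try:
    check_strictly_increasing(unsorted_indices)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Sorting the indices (and moving each value with its index) removes the error.
sorted_indices, sorted_values = sort_indices_with_values(
    unsorted_indices, unsorted_values)
check_strictly_increasing(sorted_indices)
```

The point of sorting the pairs rather than the indices alone is that each value must stay attached to its index, which is also what an in-place index sort on a compressed sparse matrix has to preserve.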
Diffstat (limited to 'python/pyspark/ml/linalg')
-rw-r--r-- python/pyspark/ml/linalg/__init__.py | 3
1 files changed, 3 insertions, 0 deletions
diff --git a/python/pyspark/ml/linalg/__init__.py b/python/pyspark/ml/linalg/__init__.py
index b765343251..ad1b487676 100644
--- a/python/pyspark/ml/linalg/__init__.py
+++ b/python/pyspark/ml/linalg/__init__.py
@@ -72,7 +72,10 @@ def _convert_to_vector(l):
         return DenseVector(l)
     elif _have_scipy and scipy.sparse.issparse(l):
         assert l.shape[1] == 1, "Expected column vector"
+        # Make sure the converted csc_matrix has sorted indices.
         csc = l.tocsc()
+        if not csc.has_sorted_indices:
+            csc.sort_indices()
         return SparseVector(l.shape[0], csc.indices, csc.data)
     else:
         raise TypeError("Cannot convert type %s into Vector" % type(l))
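
As a rough sketch, the patched branch can be exercised outside of Spark with scipy and numpy alone. The function `sparse_column_to_parts` is a hypothetical stand-in that returns the three arguments the patch passes to `SparseVector` instead of constructing one. The input is built directly in csc form with unsorted indices, so the example does not depend on how any particular scipy version orders indices in `dok_matrix.tocsc()`.

```python
import numpy as np
from scipy.sparse import csc_matrix


def sparse_column_to_parts(mat):
    """Hypothetical helper mirroring the patched branch.

    Returns (size, indices, values) for an (n, 1) scipy sparse matrix,
    with the indices guaranteed to be sorted.
    """
    assert mat.shape[1] == 1, "Expected column vector"
    csc = mat.tocsc()
    if not csc.has_sorted_indices:
        # In-place sort; the data array is reordered along with the indices.
        csc.sort_indices()
    return mat.shape[0], csc.indices, csc.data


# A 4x1 column with entries at rows 3 and 1, stored in unsorted order.
unsorted = csc_matrix(
    (np.array([2.0, 1.0]), np.array([3, 1]), np.array([0, 2])),
    shape=(4, 1))

size, indices, values = sparse_column_to_parts(unsorted)
```

Note that when the input is already a `csc_matrix`, `tocsc()` may return the same object, so the in-place sort can also reorder the caller's matrix.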