path: root/python/pyspark/mllib/linalg/__init__.py

* [SPARK-9656][MLLIB][PYTHON] Add missing methods to PySpark's Distributed Linear Algebra Classes
  Mike Dusenberry, 2016-04-27 (1 file changed, +31/-1)

  This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:

  * `RowMatrix` <sup>**[1]**</sup>
    1. `computeGramianMatrix`
    2. `computeCovariance`
    3. `computeColumnSummaryStatistics`
    4. `columnSimilarities`
    5. `tallSkinnyQR` <sup>**[2]**</sup>
  * `IndexedRowMatrix` <sup>**[3]**</sup>
    1. `computeGramianMatrix`
  * `CoordinateMatrix`
    1. `transpose`
  * `BlockMatrix`
    1. `validate`
    2. `cache`
    3. `persist`
    4. `transpose`

  **[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR #7963 for SPARK-6227.

  **[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue, likely because their related helper functions in `PythonMLlibAPI` create the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR. Therefore, I have marked this PR as WIP and am open to discussion.

  **[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for SPARK-6227.

  Author: Mike Dusenberry <mwdusenb@us.ibm.com>

  Closes #9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.

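  A small usage sketch of the newly added `RowMatrix` methods (a hedged illustration, not code from the PR): it assumes an active `SparkContext` bound to `sc`, and the matrix values are made up.

  ```python
  from pyspark.mllib.linalg import Vectors
  from pyspark.mllib.linalg.distributed import RowMatrix

  # A tall-and-skinny matrix: many rows, few columns.
  rows = sc.parallelize([Vectors.dense([1.0, 2.0]),
                         Vectors.dense([3.0, 4.0]),
                         Vectors.dense([5.0, 6.0])])
  mat = RowMatrix(rows)

  gram = mat.computeGramianMatrix()             # local Matrix: A^T * A
  cov = mat.computeCovariance()                 # local covariance Matrix
  stats = mat.computeColumnSummaryStatistics()  # per-column mean, variance, ...
  sims = mat.columnSimilarities()               # CoordinateMatrix of column similarities
  qr = mat.tallSkinnyQR(computeQ=True)          # qr.Q is a RowMatrix, qr.R a local Matrix
  ```
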
* [SPARK-14739][PYSPARK] Fix Vectors parser bugs
  Arash Parsa, 2016-04-21 (1 file changed, +3/-3)

  ## What changes were proposed in this pull request?

  The PySpark deserializer has a bug that shows up when deserializing all-zero sparse vectors. This fix filters out empty string tokens before casting, so properly stringified SparseVectors get parsed successfully.

  ## How was this patch tested?

  Standard unit tests, similar to those for other methods.

  Author: Arash Parsa <arash@ip-192-168-50-106.ec2.internal>
  Author: Arash Parsa <arashpa@gmail.com>
  Author: Vishnu Prasad <vishnu667@gmail.com>
  Author: Vishnu Prasad S <vishnu667@gmail.com>

  Closes #12516 from arashpa/SPARK-14739.

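  A minimal sketch of the round trip this fix repairs (the all-zero vector below is illustrative):

  ```python
  from pyspark.mllib.linalg import SparseVector

  # An all-zero sparse vector stringifies to "(5,[],[])". Before this fix,
  # parsing that string failed: splitting the empty "[]" yielded an empty
  # token that was then passed to float(). With the fix, the round trip works.
  v = SparseVector(5, [], [])
  s = str(v)  # '(5,[],[])'
  assert SparseVector.parse(s) == v
  ```
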
* Correct SparseVector.parse documentation
  Miles Yucht, 2016-02-16 (1 file changed, +1/-1)

  There's a small error in the SparseVector.parse docstring: it says the method returns a DenseVector when it actually returns a SparseVector.

  Author: Miles Yucht <miles@databricks.com>

  Closes #11213 from mgyucht/fix-sparsevector-docs.

* [SPARK-11531][ML] SparseVector error Msg
  Joshi, 2016-01-06 (1 file changed, +3/-1)

  PySpark SparseVector should report a "Found duplicate indices" error message when given duplicate indices.

  Author: Joshi <rekhajoshm@gmail.com>
  Author: Rekha Joshi <rekhajoshm@gmail.com>

  Closes #9525 from rekhajoshm/SPARK-11531.

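  A sketch of the behaviour this message covers; the exact exception type and wording are assumptions here, not guaranteed by the commit:

  ```python
  from pyspark.mllib.linalg import Vectors

  # Index 1 appears twice, so construction should fail with a
  # "Found duplicate indices" message rather than silently building
  # an invalid vector.
  try:
      Vectors.sparse(4, [1, 1], [3.0, 4.0])
  except Exception as e:  # exact exception type is an implementation detail
      print(e)
  ```
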
* [SPARK-11084] [ML] [PYTHON] Check if index can contain non-zero value before binary search
  zero323, 2015-10-16 (1 file changed, +2/-2)

  At the moment, `SparseVector.__getitem__` executes `np.searchsorted` first and only afterwards checks whether the result is in the expected range. It is possible to check whether the index can contain a non-zero value before executing `np.searchsorted` at all.

  Author: zero323 <matthew.szymkiewicz@gmail.com>

  Closes #9098 from zero323/sparse_vector_getitem_improved.

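  A hypothetical re-implementation of the lookup illustrating the optimization (not the actual patched code):

  ```python
  import numpy as np

  def sparse_getitem(sv, index):
      """Illustrative lookup for a SparseVector-like object exposing
      .size, .indices (sorted ndarray), and .values."""
      if index < 0:
          index += sv.size
      if index < 0 or index >= sv.size:
          raise IndexError("index %d out of range" % index)
      inds = sv.indices
      # Early exit: no stored entry can match, so skip the binary search.
      if inds.size == 0 or index > inds[-1]:
          return 0.0
      insert_index = np.searchsorted(inds, index)
      if inds[insert_index] == index:
          return float(sv.values[insert_index])
      return 0.0
  ```
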
* [SPARK-11050] [MLLIB] PySpark SparseVector can return wrong index in error message
  Bhargav Mangipudi, 2015-10-16 (1 file changed, +3/-2)

  For negative indices in the SparseVector, we update the index value before validating it. If the index is invalid at that point, the error message reports the *updated* index instead of the original one. This change fixes that.

  Author: Bhargav Mangipudi <bhargav.mangipudi@gmail.com>

  Closes #9069 from bhargav/spark-10759.

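  A sketch of the behaviour being fixed (the vector and index are illustrative):

  ```python
  from pyspark.mllib.linalg import Vectors

  sv = Vectors.sparse(5, {1: 3.0})
  # -8 is normalized internally to -8 + 5 = -3, which is still invalid.
  # Before this fix the IndexError cited -3; with it, the message
  # should cite the caller's original index, -8.
  try:
      sv[-8]
  except IndexError as e:
      print(e)
  ```
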
* [SPARK-10973] [ML] [PYTHON] __getitem__ method throws IndexError exception when we try to access index after the last non-zero entry
  zero323, 2015-10-08 (1 file changed, +3/-0)

  `__getitem__` throws an IndexError exception when we try to access an index after the last non-zero entry:

      from pyspark.mllib.linalg import Vectors
      sv = Vectors.sparse(5, {1: 3})
      sv[0]
      ## 0.0
      sv[1]
      ## 3.0
      sv[2]
      ## Traceback (most recent call last):
      ##   File "<stdin>", line 1, in <module>
      ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
      ##     row_ind = inds[insert_index]
      ## IndexError: index out of bounds

  Author: zero323 <matthew.szymkiewicz@gmail.com>

  Closes #9009 from zero323/sparse_vector_index_error.

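  With the fix in place, the same lookups degrade gracefully (a hedged sketch; truly out-of-range indices still raise):

  ```python
  from pyspark.mllib.linalg import Vectors

  sv = Vectors.sparse(5, {1: 3})
  assert sv[1] == 3.0
  assert sv[2] == 0.0  # past the last non-zero entry: 0.0, not IndexError
  assert sv[4] == 0.0
  ```
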
* [DOC] [PYSPARK] [MLLIB] Added newlines to docstrings to fix parameter formatting
  noelsmith, 2015-09-21 (1 file changed, +1/-0)

  Added newlines before `:param ...:` and `:return:` markup. Without these, parameter lists aren't formatted correctly in the API docs, i.e.:

  ![screen shot 2015-09-21 at 21 49 26](https://cloud.githubusercontent.com/assets/11915197/10004686/de3c41d4-60aa-11e5-9c50-a46dcb51243f.png)

  ... looks like this once the newline is added:

  ![screen shot 2015-09-21 at 21 50 14](https://cloud.githubusercontent.com/assets/11915197/10004706/f86bfb08-60aa-11e5-8524-ae4436713502.png)

  Author: noelsmith <mail@noelsmith.com>

  Closes #8851 from noel-smith/docstring-missing-newline-fix.

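  An illustrative docstring (not from the patch) showing the pattern: the blank line before the `:param:`/`:return:` block is what lets Sphinx render the parameter list correctly.

  ```python
  def dot(self, other):
      """
      Compute the dot product of two Vectors.

      :param other: a DenseVector, SparseVector, or iterable of floats
      :return: the dot product as a float
      """
  ```
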
* [SPARK-10631] [DOCUMENTATION, MLLIB, PYSPARK] Added documentation for a few APIs
  vinodkc, 2015-09-20 (1 file changed, +17/-5)

  There are some missing API docs in pyspark.mllib.linalg.Vector (including DenseVector and SparseVector). We should add them based on their Scala counterparts.

  Author: vinodkc <vinod.kc.in@gmail.com>

  Closes #8834 from vinodkc/fix_SPARK-10631.

* [SPARK-10516] [MLLIB] Added values property in DenseVector
  Vinod K C, 2015-09-15 (1 file changed, +4/-0)

  Author: Vinod K C <vinod.kc@huawei.com>

  Closes #8682 from vinodkc/fix_SPARK-10516.

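  A short usage sketch of the new property (the exact printed representation may differ):

  ```python
  from pyspark.mllib.linalg import Vectors

  dv = Vectors.dense([1.0, 2.0, 3.0])
  dv.values  # the underlying numpy array, e.g. array([1., 2., 3.])
  ```
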
* [SPARK-9793] [MLLIB] [PYSPARK] PySpark DenseVector, SparseVector implement __eq__ and __hash__ correctly
  Yanbo Liang, 2015-09-14 (1 file changed, +75/-15)

  The PySpark DenseVector and SparseVector `__eq__` methods should use semantic equality, and a DenseVector should compare equal to an equivalent SparseVector. This patch also implements the PySpark DenseVector and SparseVector `__hash__` methods based on the first 16 entries, so that PySpark Vector objects can be used in collections.

  Author: Yanbo Liang <ybliang8@gmail.com>

  Closes #8166 from yanboliang/spark-9793.

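  A usage sketch of the new semantics:

  ```python
  from pyspark.mllib.linalg import Vectors

  dense = Vectors.dense([1.0, 0.0, 2.0])
  sparse = Vectors.sparse(3, {0: 1.0, 2: 2.0})

  # Semantic equality: both representations describe the same vector.
  assert dense == sparse

  # Consistent hashing (based on the first 16 entries, per the message
  # above) makes Vector objects usable in sets and as dict keys.
  assert len({dense, sparse}) == 1
  ```
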
* [SPARK-9408] [PYSPARK] [MLLIB] Refactor linalg.py to /linalg
  Xiangrui Meng, 2015-07-30 (1 file changed, +1162/-0)

  This is based on MechCoder's PR https://github.com/apache/spark/pull/7731. Hopefully it can pass tests. MechCoder, I tried to make minimal changes. If this passes Jenkins, we can merge this one first and then try to move `__init__.py` to `local.py` in a separate PR.

  Closes #7731

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #7746 from mengxr/SPARK-9408 and squashes the following commits:

  0e05a3b [Xiangrui Meng] merge master
  1135551 [Xiangrui Meng] add a comment for str(...)
  c48cae0 [Xiangrui Meng] update tests
  173a805 [Xiangrui Meng] move linalg.py to linalg/__init__.py