path: root/python/pyspark/mllib
Commit message · Author · Age · Files · Lines
* [SPARK-10194] [MLLIB] [PYSPARK] SGD algorithms need convergenceTol parameter in Python (Yanbo Liang, 2015-09-14; 2 files, -16/+33)
  [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed). Author: Yanbo Liang <ybliang8@gmail.com> Closes #8457 from yanboliang/spark-10194.
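  For illustration, a minimal sketch of the new parameter from PySpark (assumes a live SparkContext `sc`; the toy data is made up):
```python
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# Tiny synthetic dataset: y = 2x.
data = sc.parallelize([LabeledPoint(2.0 * x, [x]) for x in [1.0, 2.0, 3.0]])

# convergenceTol lets SGD stop early once weight updates become small enough.
model = LinearRegressionWithSGD.train(data, iterations=200, step=0.1,
                                      convergenceTol=1e-4)
```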
* [SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _eventually for ml streaming pyspark tests (Joseph K. Bradley, 2015-08-15; 1 file, -48/+129)
  Recently, PySpark ML streaming tests have been flaky, most likely because of the batches not being processed in time. Proposal: Replace the use of _ssc_wait (which waits for a fixed amount of time) with a method which waits for a fixed amount of time but can terminate early based on a termination condition method. With this, we can extend the waiting period (to make tests less flaky) but also stop early when possible (making tests faster on average, which I verified locally). CC: mengxr tdas freeman-lab Author: Joseph K. Bradley <joseph@databricks.com> Closes #8087 from jkbradley/streaming-ml-tests.
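  A minimal sketch of the wait-with-early-exit pattern described above (assumed shape, not the exact helper merged in the PR):
```python
import time

def _eventually(condition, timeout=30.0, interval=0.01):
    """Poll `condition` until it returns True, failing after `timeout` seconds.

    Unlike a fixed sleep, this returns as soon as the condition holds,
    so tests can use a generous timeout without always paying for it.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise AssertionError("condition not met within %g seconds" % timeout)
```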
* [MINOR] [DOC] fix mllib pydoc warnings (Xiangrui Meng, 2015-08-13; 2 files, -4/+11)
  Switch to correct Sphinx syntax. MechCoder Author: Xiangrui Meng <meng@databricks.com> Closes #8169 from mengxr/mllib-pydoc-fix.
* [SPARK-6486] [MLLIB] [PYTHON] Add BlockMatrix to PySpark. (Mike Dusenberry, 2015-08-05; 1 file, -6/+322)
  mengxr This adds the `BlockMatrix` to PySpark. I have the conversions to `IndexedRowMatrix` and `CoordinateMatrix` ready as well, so once PR #7554 is completed (which relies on PR #7746), this PR can be finished. Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #7761 from dusenberrymw/SPARK-6486_Add_BlockMatrix_to_PySpark and squashes the following commits: 27195c2 [Mike Dusenberry] Adding one more check to _convert_to_matrix_block_tuple, and a few minor documentation changes. ae50883 [Mike Dusenberry] Minor update: BlockMatrix should inherit from DistributedMatrix. b8acc1c [Mike Dusenberry] Moving BlockMatrix to pyspark.mllib.linalg.distributed, updating the logic to match that of the other distributed matrices, adding conversions, and adding documentation. c014002 [Mike Dusenberry] Using properties for better documentation. 3bda6ab [Mike Dusenberry] Adding documentation. 8fb3095 [Mike Dusenberry] Small cleanup. e17af2e [Mike Dusenberry] Adding BlockMatrix to PySpark.
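  For illustration, constructing a small `BlockMatrix` with the new API (assumes a live SparkContext `sc`):
```python
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

# Two 3x2 dense blocks stacked vertically form a 6x2 matrix.
blocks = sc.parallelize([((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])),
                         ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))])
mat = BlockMatrix(blocks, rowsPerBlock=3, colsPerBlock=2)
print(mat.numRows(), mat.numCols())  # 6 2
```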
* [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark. (Mike Dusenberry, 2015-08-04; 2 files, -0/+539)
  This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark. Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object. New distributed matrices can be created using factory methods added to DistributedMatrices, which creates the Java distributed matrix and then wraps it with the corresponding PySpark class. This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code. Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity. Associated documentation and unit-tests have also been added. To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix), and does not implement the other linear algebra functions of the matrices, although this will be very simple to add now. Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits: bb039cb [Mike Dusenberry] Minor documentation update. b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner. Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that. If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly. This is only for internal usage, and publicly, we still require 'rows' to be an RDD. We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed. The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included. 7f0dcb6 [Mike Dusenberry] Updating module docstring. cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the latter doesn't guarantee that the SparkContext will be the same as for the matrix.rows data. 687e345 [Mike Dusenberry] Improving conversion performance. This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side. 3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed. 308f197 [Mike Dusenberry] Using properties for better documentation. 1633f86 [Mike Dusenberry] Minor documentation cleanup. f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix. ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner. 3fd4016 [Mike Dusenberry] Updating docstrings.
  27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix. a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly. d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction. 4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry. c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions. 329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring. 0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests. c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted. 4ad6819 [Mike Dusenberry] Documenting the and parameters. 3b854b9 [Mike Dusenberry] Minor updates to documentation. 10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods. 119018d [Mike Dusenberry] Adding static methods to each of the distributed matrix classes to consolidate conversion logic. 4d7af86 [Mike Dusenberry] Adding type checks to the constructors. Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace. 93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request. f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request. 6a3ecb7 [Mike Dusenberry] Updating pattern matching. 08f287b [Mike Dusenberry] Slight reformatting of the documentation. a245dc0 [Mike Dusenberry] Updating Python doctests for compatibility between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputted as one (ex: '4'). The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output. This is fine since the values are all small, and thus can be easily represented as ints. 4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines. 7e3ca16 [Mike Dusenberry] Fixing long lines. f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices. ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful. dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices. Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests. 0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization. 3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier. The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction.
  This way, we can call, for example, toRowMatrix() on an IndexedRowMatrix, which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object. This is analogous to the behavior of PySpark RDDs and DataFrames. We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on DistributedMatrices. 4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API. Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix. 23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs. b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix. Updating DistributedMatrices factory methods to accept numRows and numCols with default values. Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters. bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods. d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices. Added a factory method for creating a RowMatrix from an RDD of Vectors. Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method. Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.
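  For illustration, creating the three new distributed matrix types and converting between them (assumes a live SparkContext `sc`):
```python
from pyspark.mllib.linalg.distributed import (RowMatrix, IndexedRowMatrix,
                                              IndexedRow, CoordinateMatrix,
                                              MatrixEntry)

row_mat = RowMatrix(sc.parallelize([[1, 2, 3], [4, 5, 6]]))

indexed = IndexedRowMatrix(sc.parallelize([IndexedRow(0, [1, 2, 3]),
                                           IndexedRow(1, [4, 5, 6])]))

coord = CoordinateMatrix(sc.parallelize([MatrixEntry(0, 0, 1.2),
                                         MatrixEntry(1, 2, 3.4)]))

# Conversions run on the Scala/Java side; the result is wrapped in Python.
as_rows = indexed.toRowMatrix()
as_coords = indexed.toCoordinateMatrix()
```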
* [SPARK-9408] [PYSPARK] [MLLIB] Refactor linalg.py to /linalg (Xiangrui Meng, 2015-07-30; 1 file, -0/+0)
  This is based on MechCoder's PR https://github.com/apache/spark/pull/7731. Hopefully it could pass tests. MechCoder I tried to make minimal changes. If this passes Jenkins, we can merge this one first and then try to move `__init__.py` to `local.py` in a separate PR. Closes #7731 Author: Xiangrui Meng <meng@databricks.com> Closes #7746 from mengxr/SPARK-9408 and squashes the following commits: 0e05a3b [Xiangrui Meng] merge master 1135551 [Xiangrui Meng] add a comment for str(...) c48cae0 [Xiangrui Meng] update tests 173a805 [Xiangrui Meng] move linalg.py to linalg/__init__.py
* [SPARK-7105] [PYSPARK] [MLLIB] Support model save/load in GMM (MechCoder, 2015-07-28; 2 files, -22/+59)
  This PR introduces save/load for GMMs in the Python API. Also I refactored `GaussianMixtureModel` and inherited it from `JavaModelWrapper` with model being `GaussianMixtureModelWrapper`, a wrapper which provides convenience methods to `GaussianMixtureModel` (due to serialization and deserialization issues), and I moved the creation of gaussians to the Scala backend. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7617 from MechCoder/python_gmm_save_load and squashes the following commits: 9c305aa [MechCoder] [SPARK-7105] [PySpark] [MLlib] Support model save/load in GMM
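  For illustration, the new save/load round trip (assumes a live SparkContext `sc`; the path is illustrative):
```python
from pyspark.mllib.clustering import GaussianMixture, GaussianMixtureModel
from pyspark.mllib.linalg import Vectors

data = sc.parallelize([Vectors.dense([x]) for x in [-5.0, -4.9, 5.0, 5.1]])
model = GaussianMixture.train(data, k=2)

model.save(sc, "/tmp/gmm_model")
loaded = GaussianMixtureModel.load(sc, "/tmp/gmm_model")
print(loaded.k, loaded.weights)
```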
* [SPARK-9122] [MLLIB] [PySpark] spark.mllib regression support batch predict (Yanbo Liang, 2015-07-23; 1 file, -2/+10)
  spark.mllib now supports batch predict for LinearRegressionModel, RidgeRegressionModel and LassoModel. Author: Yanbo Liang <ybliang8@gmail.com> Closes #7614 from yanboliang/spark-9122 and squashes the following commits: 4e610c0 [Yanbo Liang] spark.mllib regression support batch predict
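  For illustration, `predict` now accepts an RDD of feature vectors as well as a single vector (sketch; assumes a live SparkContext `sc` and a trained `model` such as a LinearRegressionModel):
```python
one = model.predict([1.0, 2.0])                    # single feature vector
many = model.predict(sc.parallelize([[1.0, 2.0],   # RDD of feature vectors
                                     [3.0, 4.0]]))
print(one, many.collect())
```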
* [SPARK-9223] [PYSPARK] [MLLIB] Support model save/load in LDA (MechCoder, 2015-07-22; 1 file, -1/+42)
  Since save / load has been merged in LDA, it takes no time to write the wrappers in Python as well. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7587 from MechCoder/python_lda_save_load and squashes the following commits: c8e4ea7 [MechCoder] [SPARK-9223] [PySpark] Support model save/load in LDA
* [SPARK-9198] [MLLIB] [PYTHON] Fixed typo in pyspark sparsevector doc tests (Joseph K. Bradley, 2015-07-20; 1 file, -4/+4)
  Several places in the PySpark SparseVector docs have one defined as: ``` SparseVector(4, [2, 4], [1.0, 2.0]) ``` The index 4 goes out of bounds (but this is not checked). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #7541 from jkbradley/sparsevec-doc-typo-fix and squashes the following commits: c806a65 [Joseph K. Bradley] fixed doc test e2dcb23 [Joseph K. Bradley] Fixed typo in pyspark sparsevector doc tests
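  To make the fix concrete: with size 4, valid indices are 0..3, so the doc example needs in-range indices, e.g. (the exact replacement indices in the PR are an assumption):
```python
from pyspark.mllib.linalg import SparseVector

# size=4 means valid indices are 0..3; the old doc example used index 4.
v = SparseVector(4, [1, 3], [1.0, 2.0])   # in-range indices
print(v.toArray())                        # [0.0, 1.0, 0.0, 2.0]
```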
* [SPARK-8996] [MLLIB] [PYSPARK] Python API for Kolmogorov-Smirnov Test (MechCoder, 2015-07-20; 3 files, -13/+110)
  Python API for the KS test: Statistics.kolmogorovSmirnovTest(data, distName, *params). I'm not quite sure how to support the callable function since it is not serializable. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7430 from MechCoder/spark-8996 and squashes the following commits: 2dd009d [MechCoder] minor 021d233 [MechCoder] Remove one wrapper and other minor stuff 49d07ab [MechCoder] [SPARK-8996] [MLlib] Python API for Kolmogorov-Smirnov Test
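  For illustration, a one-sample KS test against a standard normal (assumes a live SparkContext `sc`):
```python
from pyspark.mllib.stat import Statistics

data = sc.parallelize([0.1, 0.15, 0.2, 0.3, 0.25])
result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)  # mean, stddev
print(result.statistic, result.pValue)
```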
* [SPARK-9138] [MLLIB] fix Vectors.dense (Davies Liu, 2015-07-17; 1 file, -3/+8)
  Vectors.dense() should accept numbers directly, like the one in Scala. We already use it in doctests; it worked by luck. cc mengxr jkbradley Author: Davies Liu <davies@databricks.com> Closes #7476 from davies/fix_vectors_dense and squashes the following commits: e0fd292 [Davies Liu] fix Vectors.dense
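  For illustration, both forms after the fix:
```python
from pyspark.mllib.linalg import Vectors

v1 = Vectors.dense([1.0, 2.0, 3.0])   # list form
v2 = Vectors.dense(1.0, 2.0, 3.0)     # numbers directly, as in Scala
assert v1 == v2
```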
* [SPARK-9005] [MLLIB] Fix RegressionMetrics computation of explainedVariance (Feynman Liang, 2015-07-15; 1 file, -1/+1)
  Fixes implementation of `explainedVariance` and `r2` to be consistent with their definitions as described in [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005). Author: Feynman Liang <fliang@databricks.com> Closes #7361 from feynmanliang/SPARK-9005-RegressionMetrics-bugs and squashes the following commits: f1112fc [Feynman Liang] Add explainedVariance formula 1a3d098 [Feynman Liang] SROwen code review comments 08a0e1b [Feynman Liang] Fix pyspark tests db8605a [Feynman Liang] Style fix bde9761 [Feynman Liang] Fix RegressionMetrics tests, relax assumption predictor is unbiased c235de0 [Feynman Liang] Fix RegressionMetrics tests 4c4e56f [Feynman Liang] Fix RegressionMetrics computation of explainedVariance and r2
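  For reference, the definition the fix aligns the implementation with (paraphrasing the JIRA and the Scala API docs; treat the exact form as an assumption): with predictions $\hat{y}_i$, labels $y_i$, and mean label $\bar{y}$ over $n$ points,

$$\mathrm{explainedVariance} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - \bar{y}\right)^2$$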
* [SPARK-6259] [MLLIB] Python API for LDA (Yu ISHIKAWA, 2015-07-14; 1 file, -1/+65)
  I implemented the Python API for LDA. But I didn't implement a method for `LDAModel.describeTopics()`, because it's a little hard to implement now. Adding documentation about that and example code would fit better in another issue. TODO: LDAModel.describeTopics() in Python must also be implemented, but it would be better handled in another issue. Implementing it is a little hard, since the return value of `describeTopics` in Scala consists of Tuple classes. Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6791 from yu-iskw/SPARK-6259 and squashes the following commits: 6855f59 [Yu ISHIKAWA] LDA inherits object 28bd165 [Yu ISHIKAWA] Change the place of testing code d7a332a [Yu ISHIKAWA] Remove the doc comment about the optimizer's default value 083e226 [Yu ISHIKAWA] Add the comment about the supported values and the default value of `optimizer` 9f8bed8 [Yu ISHIKAWA] Simplify casting faa9764 [Yu ISHIKAWA] Add some comments for the LDA parameters 98f645a [Yu ISHIKAWA] Remove the interface for `describeTopics`. Because it is not implemented. 57ac03d [Yu ISHIKAWA] Remove the unnecessary import in Python unit testing 73412c3 [Yu ISHIKAWA] Fix the typo 2278829 [Yu ISHIKAWA] Fix the indentation 39514ec [Yu ISHIKAWA] Modify how to cast the input data 8117e18 [Yu ISHIKAWA] Fix the validation problems by `lint-scala` 77fd1b7 [Yu ISHIKAWA] Not use LabeledPoint 68f0653 [Yu ISHIKAWA] Support some parameters for `ALS.train()` in Python 25ef2ac [Yu ISHIKAWA] Resolve conflicts with rebasing
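  For illustration, training an LDA model with the new API (assumes a live SparkContext `sc`; corpus entries are [document id, term-count vector] pairs):
```python
from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

corpus = sc.parallelize([[1, Vectors.dense([0.0, 1.0])],
                         [2, Vectors.dense([1.0, 0.0])]])
model = LDA.train(corpus, k=2, seed=1)
print(model.vocabSize())     # 2
print(model.topicsMatrix())  # terms x topics matrix
```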
* [SPARK-6266] [MLLIB] PySpark SparseVector missing doc for size, indices, values (lewuathe, 2015-07-09; 1 file, -2/+7)
  Write missing pydocs in `SparseVector` attributes. Author: lewuathe <lewuathe@me.com> Closes #7290 from Lewuathe/SPARK-6266 and squashes the following commits: 51d9895 [lewuathe] Update docs 0480d35 [lewuathe] Merge branch 'master' into SPARK-6266 ba42cf3 [lewuathe] [SPARK-6266] PySpark SparseVector missing doc for size, indices, values
* [SPARK-8068] [MLLIB] Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib (Yanbo Liang, 2015-07-08; 1 file, -0/+11)
  Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib Author: Yanbo Liang <ybliang8@gmail.com> Closes #7286 from yanboliang/spark-8068 and squashes the following commits: 6109fe1 [Yanbo Liang] Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
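  For illustration (assumes a live SparkContext `sc`; input pairs are (prediction, true label)):
```python
from pyspark.mllib.evaluation import MulticlassMetrics

pairs = sc.parallelize([(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0)])
metrics = MulticlassMetrics(pairs)
# Rows are true labels, columns are predictions, ordered by ascending label.
print(metrics.confusionMatrix().toArray())
```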
* [SPARK-7785] [MLLIB] [PYSPARK] Add __str__ and __repr__ to Matrices (MechCoder, 2015-07-08; 2 files, -1/+178)
  Adding __str__ and __repr__ to DenseMatrix and SparseMatrix Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6342 from MechCoder/spark-7785 and squashes the following commits: 7b9a82c [MechCoder] Add tests for greater than 16 elements b88e9dd [MechCoder] Increment limit to 16 1425a01 [MechCoder] Change tests 36bd166 [MechCoder] Change str and repr representation 97f0da9 [MechCoder] zip is same as izip in python3 94ca4b2 [MechCoder] Added doctests and iterate over values instead of colPtrs b26fa89 [MechCoder] minor 394dde9 [MechCoder] [SPARK-7785] Add __str__ and __repr__ to Matrices
* [SPARK-8823] [MLLIB] [PYSPARK] Optimizations for SparseVector dot products (MechCoder, 2015-07-07; 1 file, -12/+8)
  Follow up for https://github.com/apache/spark/pull/5946. Currently we iterate over indices and values in SparseVector; this can be vectorized. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7222 from MechCoder/sparse_optim and squashes the following commits: dcb51d3 [MechCoder] [SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot product
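  A sketch of the vectorized form of a sparse-dense dot product (assumed shape of the optimization, not the exact merged code):
```python
import numpy as np

def sparse_dense_dot(indices, values, dense):
    # Gather the dense entries at the sparse indices and take one NumPy dot,
    # instead of looping over (index, value) pairs in Python.
    return np.dot(values, dense[indices])

print(sparse_dense_dot(np.array([0, 3]), np.array([1.0, 2.0]),
                       np.array([4.0, 5.0, 6.0, 7.0])))  # 1*4 + 2*7 = 18.0
```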
* [SPARK-8765] [MLLIB] Fix PySpark PowerIterationClustering test issue (Yanbo Liang, 2015-07-06; 1 file, -2/+14)
  PySpark PowerIterationClustering test failure due to bad demo data. If the data is small, PowerIterationClustering will behave nondeterministically. Author: Yanbo Liang <ybliang8@gmail.com> Closes #7177 from yanboliang/spark-8765 and squashes the following commits: 392ae54 [Yanbo Liang] fix model.assignments output 5ec3f1e [Yanbo Liang] fix PySpark PowerIterationClustering test issue
* Revert "[SPARK-7212] [MLLIB] Add sequence learning flag" (Xiangrui Meng, 2015-07-06; 1 file, -2/+2)
  This reverts commit 25f574eb9a3cb9b93b7d9194a8ec16e00ce2c036. After speaking to some users and developers, we realized that FP-growth doesn't meet the requirement for frequent sequence mining. PrefixSpan (SPARK-6487) would be the correct algorithm for it. feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #7240 from mengxr/SPARK-7212.revert and squashes the following commits: 2b3d66b [Xiangrui Meng] Revert "[SPARK-7212] [MLLIB] Add sequence learning flag"
* [SPARK-7401] [MLLIB] [PYSPARK] Vectorize dot product and sq_dist between SparseVector and DenseVector (MechCoder, 2015-07-03; 2 files, -23/+29)
  Currently we iterate over indices, which can be vectorized. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #5946 from MechCoder/spark-7203 and squashes the following commits: 034d086 [MechCoder] Vectorize dot calculation for numpy arrays for ndim=2 bce2b07 [MechCoder] fix doctest fcad0a3 [MechCoder] Remove type checks for list, pyarray etc 0ee5dd4 [MechCoder] Add tests and other isinstance changes e5f1de0 [MechCoder] [SPARK-7401] Vectorize dot product and sq_dist
* [SPARK-7104] [MLLIB] Support model save/load in Python's Word2Vec (Yu ISHIKAWA, 2015-07-02; 1 file, -1/+20)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6821 from yu-iskw/SPARK-7104 and squashes the following commits: 975136b [Yu ISHIKAWA] Organize import 0ef58b6 [Yu ISHIKAWA] Use rmtree, instead of removedirs cb21653 [Yu ISHIKAWA] Add an explicit type for `Word2VecModelWrapper.save` 1d468ef [Yu ISHIKAWA] [SPARK-7104][MLlib] Support model save/load in Python's Word2Vec
* [SPARK-8765] [MLLIB] [PYTHON] removed flaky python PIC test (Joseph K. Bradley, 2015-07-01; 1 file, -4/+0)
  See failure: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36133/console] CC yanboliang mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #7164 from jkbradley/pic-python-test and squashes the following commits: 156d55b [Joseph K. Bradley] removed flaky python PIC test
* [SPARK-6263] [MLLIB] Python MLlib API missing items: Utils (lewuathe, 2015-07-01; 2 files, -0/+65)
  Implement missing MLUtils APIs in PySpark: appendBias and loadVectors. `kFold` is also missing; however, I am not sure `ClassTag` can be passed or restored through Python. Author: lewuathe <lewuathe@me.com> Closes #5707 from Lewuathe/SPARK-6263 and squashes the following commits: 16863ea [lewuathe] Merge master 3fc27e7 [lewuathe] Merge branch 'master' into SPARK-6263 6084e9c [lewuathe] Resolv conflict d2aa2a0 [lewuathe] Resolv conflict 9c329d8 [lewuathe] Fix efficiency 3a12a2d [lewuathe] Merge branch 'master' into SPARK-6263 1d4714b [lewuathe] Fix style b29e2bc [lewuathe] Remove scipy dependencies e32eb40 [lewuathe] Merge branch 'master' into SPARK-6263 25d3c9d [lewuathe] Remove unnecessary imports 7ec04db [lewuathe] Resolv conflict 1502d13 [lewuathe] Resolv conflict d6bd416 [lewuathe] Check existence of scipy.sparse 5d555b1 [lewuathe] Construct scipy.sparse matrix c345a44 [lewuathe] Merge branch 'master' into SPARK-6263 b8b5ef7 [lewuathe] Fix unnecessary sort method d254be7 [lewuathe] Merge branch 'master' into SPARK-6263 62a9c7e [lewuathe] Fix appendBias return type 454c73d [lewuathe] Merge branch 'master' into SPARK-6263 a353354 [lewuathe] Remove unnecessary appendBias implementation 44295c2 [lewuathe] Merge branch 'master' into SPARK-6263 64f72ad [lewuathe] Merge branch 'master' into SPARK-6263 c728046 [lewuathe] Fix style 2980569 [lewuathe] [SPARK-6263] Python MLlib API missing items: Utils
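  For illustration, the two new helpers (assumes a live SparkContext `sc`; the load path is illustrative):
```python
from pyspark.mllib.util import MLUtils
from pyspark.mllib.linalg import Vectors

v = MLUtils.appendBias(Vectors.dense([1.0, 2.0]))
print(v)  # [1.0, 2.0, 1.0] -- a bias term of 1.0 appended

# loadVectors reads back an RDD of vectors saved as text:
# vectors = MLUtils.loadVectors(sc, "/tmp/vectors")
```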
* [SPARK-4127] [MLLIB] [PYSPARK] Python bindings for StreamingLinearRegressionWithSGD (MechCoder, 2015-06-30; 3 files, -47/+217)
  Python bindings for StreamingLinearRegressionWithSGD Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6744 from MechCoder/spark-4127 and squashes the following commits: d8f6457 [MechCoder] Moved StreamingLinearAlgorithm to pyspark.mllib.regression d47cc24 [MechCoder] Inherit from StreamingLinearAlgorithm 1b4ddd6 [MechCoder] minor 4de6c68 [MechCoder] Minor refactor 5e85a3b [MechCoder] Add tests for simultaneous training and prediction fb27889 [MechCoder] Add example and docs 505380b [MechCoder] Add tests d42bdae [MechCoder] [SPARK-4127] Python bindings for StreamingLinearRegressionWithSGD
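  For illustration, the streaming train/predict loop (sketch; `training_stream` and `test_stream` are assumed DStreams of LabeledPoints and feature vectors, respectively):
```python
from pyspark.mllib.regression import StreamingLinearRegressionWithSGD

model = StreamingLinearRegressionWithSGD(stepSize=0.1, numIterations=50)
model.setInitialWeights([0.0, 0.0])

model.trainOn(training_stream)               # updates the model on each batch
predictions = model.predictOn(test_stream)   # DStream of predictions
```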
* [SPARK-7667] [MLLIB] MLlib Python API consistency check (Yanbo Liang, 2015-06-29; 1 file, -5/+10)
  MLlib Python API consistency check Author: Yanbo Liang <ybliang8@gmail.com> Closes #6856 from yanboliang/spark-7667 and squashes the following commits: 21bae35 [Yanbo Liang] remove duplicate code eb12f95 [Yanbo Liang] fix doc inherit problem 9e7ec3c [Yanbo Liang] address comments e763d32 [Yanbo Liang] MLlib Python API consistency check
* [SPARK-5962] [MLLIB] Python support for Power Iteration Clustering (Yanbo Liang, 2015-06-28; 1 file, -3/+95)
  Python support for Power Iteration Clustering https://issues.apache.org/jira/browse/SPARK-5962 Author: Yanbo Liang <ybliang8@gmail.com> Closes #6992 from yanboliang/pyspark-pic and squashes the following commits: 6b03d82 [Yanbo Liang] address comments 4be4423 [Yanbo Liang] Python support for Power Iteration Clustering
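  For illustration (assumes a live SparkContext `sc`; the similarity graph is given as (src, dst, similarity) tuples):
```python
from pyspark.mllib.clustering import PowerIterationClustering

similarities = sc.parallelize([(0, 1, 0.9), (1, 2, 0.9), (0, 2, 0.9),
                               (3, 4, 0.9), (0, 3, 0.1)])
model = PowerIterationClustering.train(similarities, k=2, maxIterations=10)
for a in model.assignments().collect():
    print(a.id, a.cluster)
```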
* [SPARK-7212] [MLLIB] Add sequence learning flag (Feynman Liang, 2015-06-28; 1 file, -2/+2)
  Support mining of ordered frequent item sequences. Author: Feynman Liang <fliang@databricks.com> Closes #6997 from feynmanliang/fp-sequence and squashes the following commits: 7c14e15 [Feynman Liang] Improve scalatests with R code and Seq 0d3e4b6 [Feynman Liang] Fix python test ce987cb [Feynman Liang] Backwards compatibility aux constructor 34ef8f2 [Feynman Liang] Fix failing test due to reverse orderering f04bd50 [Feynman Liang] Naming, add ordered to FreqItemsets, test ordering using Seq 648d4d4 [Feynman Liang] Test case for frequent item sequences 252a36a [Feynman Liang] Add sequence learning flag
* [MINOR] [MLLIB] rename some functions of PythonMLLibAPI (Yanbo Liang, 2015-06-25; 3 files, -5/+5)
  Keep the same naming conventions for PythonMLLibAPI. Only the following three functions are named differently from the others: ```scala trainNaiveBayes trainGaussianMixture trainWord2Vec ``` So change them to: ```scala trainNaiveBayesModel trainGaussianMixtureModel trainWord2VecModel ``` It does not affect any users or public APIs; it only makes the code easier to understand for developers. Author: Yanbo Liang <ybliang8@gmail.com> Closes #7011 from yanboliang/py-mllib-api-rename and squashes the following commits: 771ffec [Yanbo Liang] rename some functions of PythonMLLibAPI
* [SPARK-7633] [MLLIB] [PYSPARK] Python bindings for StreamingLogisticRegressionwithSGD (MechCoder, 2015-06-24; 2 files, -2/+229)
  Add Python bindings to StreamingLogisticRegressionwithSGD. No Java wrappers are needed as models are updated directly using train. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6849 from MechCoder/spark-3258 and squashes the following commits: b4376a5 [MechCoder] minor d7e5fc1 [MechCoder] Refactor into StreamingLinearAlgorithm Better docs 9c09d4e [MechCoder] [SPARK-7633] Python bindings for StreamingLogisticRegressionwithSGD
* [SPARK-8265] [MLLIB] [PYSPARK] Add LinearDataGenerator to pyspark.mllib.utils (MechCoder, 2015-06-23; 2 files, -2/+55)
  It is useful to generate linear data for easy testing of linear models, and in general. Scala already has it; this is just a wrapper around the Scala code. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6715 from MechCoder/generate_linear_input and squashes the following commits: 6182884 [MechCoder] Minor changes 8bda047 [MechCoder] Minor style fixes 0f1053c [MechCoder] [SPARK-8265] Add LinearDataGenerator to pyspark.mllib.utils
* [SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max bins (Holden Karau, 2015-06-22; 2 files, -8/+21)
  Author: Holden Karau <holden@pigscanfly.ca> Closes #6331 from holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and squashes the following commits: 2894695 [Holden Karau] remove extra blank line 2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the test a bit nicer too 3a09170 [Holden Karau] add maxBins to the train method as well af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and correctly mention the default of 32 in other places where it mentioned 100
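  For illustration, `maxBins` is now accepted by the regressor trainer too (assumes a live SparkContext `sc`; toy data):
```python
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import GradientBoostedTrees

data = sc.parallelize([LabeledPoint(0.0, [0.0]), LabeledPoint(1.0, [1.0])])
model = GradientBoostedTrees.trainRegressor(
    data, categoricalFeaturesInfo={}, numIterations=10, maxBins=32)
```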
* [SPARK-8511] [PYSPARK] Modify a test to remove a saved model in `regression.py` (Yu ISHIKAWA, 2015-06-22; 5 files, -11/+21)
  [[SPARK-8511] Modify a test to remove a saved model in `regression.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8511) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6926 from yu-iskw/SPARK-8511 and squashes the following commits: 7cd0948 [Yu ISHIKAWA] Use `shutil.rmtree()` to temporary directories for saving model testings, instead of `os.removedirs()` 4a01c9e [Yu ISHIKAWA] [SPARK-8511][pyspark] Modify a test to remove a saved model in `regression.py`
* [SPARK-7604] [MLLIB] Python API for PCA and PCAModel (Yanbo Liang, 2015-06-21; 1 file, -0/+35)
  Python API for PCA and PCAModel Author: Yanbo Liang <ybliang8@gmail.com> Closes #6315 from yanboliang/spark-7604 and squashes the following commits: 1d58734 [Yanbo Liang] remove transform() in PCAModel, use default behavior 4d9d121 [Yanbo Liang] Python API for PCA and PCAModel
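  For illustration (assumes a live SparkContext `sc`):
```python
from pyspark.mllib.feature import PCA
from pyspark.mllib.linalg import Vectors

data = sc.parallelize([Vectors.dense([1.0, 0.0, 7.0]),
                       Vectors.dense([2.0, 0.0, 3.0]),
                       Vectors.dense([3.0, 0.0, 1.0])])
model = PCA(2).fit(data)         # keep the top 2 principal components
reduced = model.transform(data)  # RDD of 2-dimensional vectors
```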
* [SPARK-4118] [MLLIB] [PYSPARK] Python bindings for StreamingKMeans (MechCoder, 2015-06-19; 2 files, -5/+352)
  Python bindings for StreamingKMeans Will change status to MRG once docs, tests and examples are updated. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6499 from MechCoder/spark-4118 and squashes the following commits: 7722d16 [MechCoder] minor style fixes 51052d3 [MechCoder] Doc fixes 2061a76 [MechCoder] Add tests for simultaneous training and prediction Minor style fixes 81482fd [MechCoder] minor 5d9fe61 [MechCoder] predictOn should take into account the latest model 8ab9e89 [MechCoder] Fix Python3 error a9817df [MechCoder] Better tests and minor fixes c80e451 [MechCoder] Add ignore_unicode_prefix ee8ce16 [MechCoder] Update tests, doc and examples 4b1481f [MechCoder] Some changes and tests d8b066a [MechCoder] [SPARK-4118] [MLlib] [PySpark] Python bindings for StreamingKMeans
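  For illustration (sketch; `vector_stream` is an assumed DStream of feature vectors):
```python
from pyspark.mllib.clustering import StreamingKMeans

# 3-dimensional random initial centers, each with weight 1.0, seed 0.
model = StreamingKMeans(k=2, decayFactor=1.0).setRandomCenters(3, 1.0, 0)
model.trainOn(vector_stream)               # update centers on each batch
clusters = model.predictOn(vector_stream)  # DStream of cluster indices
```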
* [SPARK-7605] [MLLIB] [PYSPARK] Python API for ElementwiseProduct (MechCoder, 2015-06-17; 2 files, -2/+48)
  Python API for org.apache.spark.mllib.feature.ElementwiseProduct Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6346 from MechCoder/spark-7605 and squashes the following commits: 79d1ef5 [MechCoder] Consistent and support list / array types 5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
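  For illustration (assumes a live SparkContext `sc`):
```python
from pyspark.mllib.feature import ElementwiseProduct
from pyspark.mllib.linalg import Vectors

scaler = ElementwiseProduct(Vectors.dense([2.0, 0.0, 3.0]))
print(scaler.transform(Vectors.dense([1.0, 2.0, 3.0])))    # [2.0, 0.0, 9.0]
rdd = scaler.transform(sc.parallelize([[1.0, 2.0, 3.0]]))  # works on RDDs too
```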
* [SPARK-6390] [SQL] [MLlib] Port MatrixUDT to PySpark (MechCoder, 2015-06-17; 2 files, -3/+90)
  MatrixUDT was recently coded in Scala. This has been ported to PySpark. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6354 from MechCoder/spark-6390 and squashes the following commits: fc4dc1e [MechCoder] Better error message c940a44 [MechCoder] Added test aa9c391 [MechCoder] Add pyUDT to MatrixUDT 62a2a7d [MechCoder] [SPARK-6390] Port MatrixUDT to PySpark
* [SPARK-7916] [MLLIB] MLlib Python doc parity check for classification and regression (Yanbo Liang, 2015-06-16; 2 files, -107/+247)
  Check and update the MLlib Python classification and regression docs to be as complete as the Scala docs. Author: Yanbo Liang <ybliang8@gmail.com> Closes #6460 from yanboliang/spark-7916 and squashes the following commits: f8deda4 [Yanbo Liang] trigger jenkins 6dc4d99 [Yanbo Liang] address comments ce2a43e [Yanbo Liang] truncate too long line and remove extra sparse 3eaf6ad [Yanbo Liang] MLlib Python doc parity check for classification and regression
* [SPARK-7639] [PYSPARK] [MLLIB] Python API for KernelDensity (MechCoder, 2015-06-06; 2 files, -1/+63)
  Python API for KernelDensity Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6387 from MechCoder/spark-7639 and squashes the following commits: 17abc62 [MechCoder] add tests 2de6540 [MechCoder] style tests bf4acc0 [MechCoder] Added doctests 84359d5 [MechCoder] [SPARK-7639] Python API for KernelDensity
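  For illustration (assumes a live SparkContext `sc`):
```python
from pyspark.mllib.stat import KernelDensity

kd = KernelDensity()
kd.setSample(sc.parallelize([1.0, 2.0, 3.0, 4.0]))
kd.setBandwidth(1.5)
print(kd.estimate([2.0, 3.0]))  # density estimates at the given points
```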
* [SPARK-8032] [PYSPARK] Make version checking for NumPy in MLlib more robust (MechCoder, 2015-06-02; 1 file, -1/+3)
  The current check compares version strings, so `1.x` is treated as less than `1.4` by lexicographic order; this fails when x has more than one digit, since `'1.10' < '1.4'` as strings even though 1.10 is the newer release. It fails on my system since I have version `1.10` :P Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6579 from MechCoder/np_ver and squashes the following commits: 15430f8 [MechCoder] fix syntax error 893fb7e [MechCoder] remove equal to e35f0d4 [MechCoder] minor e89376c [MechCoder] Better checking 22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust
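  A sketch of a robust check (assumed shape, not necessarily the exact merged code): compare parsed integer components rather than raw strings.
```python
import numpy

# '1.10' split and parsed -> [1, 10], which correctly compares > [1, 4].
ver = [int(x) for x in numpy.version.version.split('.')[:2]]
if ver < [1, 4]:
    raise ImportError("MLlib requires NumPy 1.4+")
```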
* [SPARK-7918] [MLLIB] MLlib Python doc parity check for evaluation and feature (Yanbo Liang, 2015-05-30; 2 files, -39/+36)
  Check and update the MLlib Python evaluation and feature docs to be as complete as the Scala docs. Author: Yanbo Liang <ybliang8@gmail.com> Closes #6461 from yanboliang/spark-7918 and squashes the following commits: 940e3f1 [Yanbo Liang] truncate too long line and remove extra sparse a80ae58 [Yanbo Liang] MLlib Python doc parity check for evaluation and feature
* [SPARK-7899] [PYSPARK] Fix Python 3 pyspark/sql/types module conflict (Michael Nazario, 2015-05-29; 2 files, -8/+0)
  This PR makes the types module in `pyspark/sql/types` work with pylint static analysis by removing the dynamic naming of the `pyspark/sql/_types` module to `pyspark/sql/types`. Tests are now loaded using `$PYSPARK_DRIVER_PYTHON -m module` rather than `$PYSPARK_DRIVER_PYTHON module.py`. The old method adds the location of `module.py` to `sys.path`, so this change prevents accidental use of relative paths in Python. Author: Michael Nazario <mnazario@palantir.com> Closes #6439 from mnazario/feature/SPARK-7899 and squashes the following commits: 366ef30 [Michael Nazario] Remove hack on random.py bb8b04d [Michael Nazario] Make doctests consistent with other tests 6ee4f75 [Michael Nazario] Change test scripts to use "-m" 673528f [Michael Nazario] Move _types back to types
* [SPARK-7922] [MLLIB] use DataFrames for user/item factors in ALSModel (Xiangrui Meng, 2015-05-28; 1 file, -1/+4)
  Expose user/item factors in DataFrames. This is to be more consistent with the pipeline API. It also helps maintain consistent APIs across languages. This PR also removed fitting params from `ALSModel`. coderxiang Author: Xiangrui Meng <meng@databricks.com> Closes #6468 from mengxr/SPARK-7922 and squashes the following commits: 7bfb1d5 [Xiangrui Meng] update ALSModel in PySpark 1ba5607 [Xiangrui Meng] use DataFrames for user/item factors in ALS
* [SPARK-6094] [MLLIB] Add MultilabelMetrics in PySpark/MLlib (Yanbo Liang, 2015-05-20; 1 file, -0/+117)
  Add MultilabelMetrics in PySpark/MLlib Author: Yanbo Liang <ybliang8@gmail.com> Closes #6276 from yanboliang/spark-6094 and squashes the following commits: b8e3343 [Yanbo Liang] Add MultilabelMetrics in PySpark/MLlib
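  For illustration (assumes a live SparkContext `sc`; each element pairs predicted labels with true labels for one document):
```python
from pyspark.mllib.evaluation import MultilabelMetrics

pairs = sc.parallelize([([0.0, 1.0], [0.0, 2.0]),
                        ([0.0, 2.0], [0.0, 2.0])])
metrics = MultilabelMetrics(pairs)
print(metrics.precision(), metrics.recall(), metrics.f1Measure())
print(metrics.precision(0.0))  # metrics for a single label
```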
* [SPARK-6657] [PYSPARK] Fix doc warnings (Xiangrui Meng, 2015-05-18; 2 files, -9/+8)
  Fixed the following warnings in `make clean html` under `python/docs`: ~~~ /Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:3: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:4: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:3: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:4: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/meng/src/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.replace:16: WARNING: Field list ends without a blank line; unexpected unindent. /Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:8: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:9: WARNING: Block quote ends without a blank line; unexpected unindent. ~~~ davies Author: Xiangrui Meng <meng@databricks.com> Closes #6221 from mengxr/SPARK-6657 and squashes the following commits: e3f83fe [Xiangrui Meng] fix sql and streaming doc warnings 2b4371e [Xiangrui Meng] fix mllib python doc warnings
* [SPARK-7651] [MLLIB] [PYSPARK] GMM predict, predictSoft should raise error on bad input (FlytxtRnD, 2015-05-15; 1 file, -0/+6)
  In the Python API for Gaussian Mixture Model, predict() and predictSoft() methods should raise an error when the input argument is not an RDD. Author: FlytxtRnD <meethu.mathew@flytxt.com> Closes #6180 from FlytxtRnD/GmmPredictException and squashes the following commits: 4b6aa11 [FlytxtRnD] Raise error if the input to predict()/predictSoft() is not an RDD
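  A sketch of the shape of the guard (assumed, not the exact merged code):
```python
from pyspark.rdd import RDD

def predict(self, x):
    # Raise a clear TypeError instead of failing obscurely on non-RDD input.
    if not isinstance(x, RDD):
        raise TypeError("x should be represented by an RDD, "
                        "but got %s." % type(x))
    ...  # existing RDD-based prediction path
```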
* [SPARK-6258] [MLLIB] GaussianMixture Python API parity check (Yanbo Liang, 2015-05-15; 1 file, -14/+53)
  Implement the Python API for the major disparities of the GaussianMixture clustering algorithm between Scala & Python: ```scala GaussianMixture setInitialModel GaussianMixtureModel k ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #6087 from yanboliang/spark-6258 and squashes the following commits: b3af21c [Yanbo Liang] fix typo 2b645c1 [Yanbo Liang] fix doc 638b4b7 [Yanbo Liang] address comments b5bcade [Yanbo Liang] GaussianMixture Python API parity check
* [SPARK-6092] [MLLIB] Add RankingMetrics in PySpark/MLlib (Yanbo Liang, 2015-05-11; 1 file, -2/+76)
  Author: Yanbo Liang <ybliang8@gmail.com> Closes #6044 from yanboliang/spark-6092 and squashes the following commits: 726a9b1 [Yanbo Liang] add newRankingMetrics 33f649c [Yanbo Liang] Add RankingMetrics in PySpark/MLlib
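  For illustration (assumes a live SparkContext `sc`; each element pairs a ranked prediction list with the list of relevant documents):
```python
from pyspark.mllib.evaluation import RankingMetrics

pred_and_truth = sc.parallelize([([1, 6, 2, 7, 8], [1, 2, 3, 5]),
                                 ([4, 1, 5, 6, 2], [4, 5, 6])])
metrics = RankingMetrics(pred_and_truth)
print(metrics.precisionAt(2))
print(metrics.ndcgAt(2))
print(metrics.meanAveragePrecision)
```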
* [SPARK-6091] [MLLIB] Add MulticlassMetrics in PySpark/MLlib (Yanbo Liang, 2015-05-10; 1 file, -0/+129)
  https://issues.apache.org/jira/browse/SPARK-6091 Author: Yanbo Liang <ybliang8@gmail.com> Closes #6011 from yanboliang/spark-6091 and squashes the following commits: bb3e4ba [Yanbo Liang] trigger jenkins 53c045d [Yanbo Liang] keep compatibility for python 2.6 972d5ac [Yanbo Liang] Add MulticlassMetrics in PySpark/MLlib
* [SPARK-5913] [MLLIB] Python API for ChiSqSelector (Yanbo Liang, 2015-05-08; 1 file, -2/+57)
  Add a Python API for mllib.feature.ChiSqSelector https://issues.apache.org/jira/browse/SPARK-5913 Author: Yanbo Liang <ybliang8@gmail.com> Closes #5939 from yanboliang/spark-5913 and squashes the following commits: cdaac99 [Yanbo Liang] Python API for ChiSqSelector
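  For illustration (assumes a live SparkContext `sc`; ChiSqSelector keeps the features most correlated with the label):
```python
from pyspark.mllib.feature import ChiSqSelector
from pyspark.mllib.regression import LabeledPoint

data = sc.parallelize([LabeledPoint(0.0, [0.0, 0.0, 1.0]),
                       LabeledPoint(1.0, [0.0, 1.0, 0.0]),
                       LabeledPoint(1.0, [1.0, 0.0, 0.0])])
model = ChiSqSelector(numTopFeatures=1).fit(data)
filtered = model.transform(data.map(lambda lp: lp.features))
```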