path: root/python/pyspark/mllib
Commit message | Author | Age | Files | Lines
* [SPARK-9138] [MLLIB] fix Vectors.dense | Davies Liu | 2015-07-17 | 1 | -3/+8
  Vectors.dense() should accept numbers directly, like the one in Scala. We already use it in doctests, where it worked by luck.
  cc mengxr jkbradley

  Author: Davies Liu <davies@databricks.com>
  Closes #7476 from davies/fix_vectors_dense and squashes the following commits:
  e0fd292 [Davies Liu] fix Vectors.dense
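To illustrate the behavior described in SPARK-9138, here is a minimal pure-Python sketch of a constructor that accepts either a single sequence or the numbers directly. The function `dense` below is a hypothetical stand-in that returns a plain list; it is not the PySpark implementation from this commit.

```python
def dense(*elements):
    """Hypothetical stand-in for Vectors.dense (not the PySpark code).

    Accepts dense(1.0, 2.0) as well as dense([1.0, 2.0]).
    """
    # A single list or tuple argument is treated as the full sequence of values.
    if len(elements) == 1 and isinstance(elements[0], (list, tuple)):
        elements = elements[0]
    # Store every entry as a float, as a dense vector of doubles would.
    return [float(x) for x in elements]


from_numbers = dense(1.0, 2.0)
from_list = dense([1.0, 2.0])
```

Both call styles produce the same values, which is the parity with Scala that the commit message asks for.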
* [SPARK-9005] [MLLIB] Fix RegressionMetrics computation of explainedVariance | Feynman Liang | 2015-07-15 | 1 | -1/+1
  Fixes implementation of `explainedVariance` and `r2` to be consistent with their definitions as described in [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005).

  Author: Feynman Liang <fliang@databricks.com>
  Closes #7361 from feynmanliang/SPARK-9005-RegressionMetrics-bugs and squashes the following commits:
  f1112fc [Feynman Liang] Add explainedVariance formula
  1a3d098 [Feynman Liang] SROwen code review comments
  08a0e1b [Feynman Liang] Fix pyspark tests
  db8605a [Feynman Liang] Style fix
  bde9761 [Feynman Liang] Fix RegressionMetrics tests, relax assumption predictor is unbiased
  c235de0 [Feynman Liang] Fix RegressionMetrics tests
  4c4e56f [Feynman Liang] Fix RegressionMetrics computation of explainedVariance and r2
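The log entry for SPARK-9005 does not spell out the definitions it refers to. As a rough sketch, assuming the common definitions explainedVariance = SSreg / n and r2 = 1 - SSerr / SStot (an assumption, not taken from this log), the two quantities could be computed as follows. The function name `regression_metrics` is hypothetical.

```python
def regression_metrics(predictions, labels):
    """Sketch of explainedVariance and r2 under assumed textbook definitions."""
    n = len(labels)
    mean_label = sum(labels) / n
    # Residual sum of squares: squared error between labels and predictions.
    ss_err = sum((y - p) ** 2 for p, y in zip(predictions, labels))
    # Total sum of squares: spread of the labels around their mean.
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    # Regression sum of squares: spread of the predictions around the label mean.
    ss_reg = sum((p - mean_label) ** 2 for p in predictions)
    return {
        "explainedVariance": ss_reg / n,
        "r2": 1.0 - ss_err / ss_tot,
    }


perfect = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
constant = regression_metrics([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])
```

Because r2 is computed from the residuals directly, it does not rely on the predictor being unbiased, which matches the intent stated in the squashed commit bde9761.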
* [SPARK-6259] [MLLIB] Python API for LDA | Yu ISHIKAWA | 2015-07-14 | 1 | -1/+65
  I implemented the Python API for LDA, but not a method for `LDAModel.describeTopics()`, because it is a little hard to implement now. Adding documentation and example code for it would fit better in a separate issue.
  TODO: `LDAModel.describeTopics()` must also be implemented in Python, preferably in a separate issue. Implementing it is a little hard because the return value of `describeTopics` in Scala consists of Tuple classes.

  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #6791 from yu-iskw/SPARK-6259 and squashes the following commits:
  6855f59 [Yu ISHIKAWA] LDA inherits object
  28bd165 [Yu ISHIKAWA] Change the place of testing code
  d7a332a [Yu ISHIKAWA] Remove the doc comment about the optimizer's default value
  083e226 [Yu ISHIKAWA] Add the comment about the supported values and the default value of `optimizer`
  9f8bed8 [Yu ISHIKAWA] Simplify casting
  faa9764 [Yu ISHIKAWA] Add some comments for the LDA paramters
  98f645a [Yu ISHIKAWA] Remove the interface for `describeTopics`. Because it is not implemented.
  57ac03d [Yu ISHIKAWA] Remove the unnecessary import in Python unit testing
  73412c3 [Yu ISHIKAWA] Fix the typo
  2278829 [Yu ISHIKAWA] Fix the indentation
  39514ec [Yu ISHIKAWA] Modify how to cast the input data
  8117e18 [Yu ISHIKAWA] Fix the validation problems by `lint-scala`
  77fd1b7 [Yu ISHIKAWA] Not use LabeledPoint
  68f0653 [Yu ISHIKAWA] Support some parameters for `ALS.train()` in Python
  25ef2ac [Yu ISHIKAWA] Resolve conflicts with rebasing
* [SPARK-6266] [MLLIB] PySpark SparseVector missing doc for size, indices, values | lewuathe | 2015-07-09 | 1 | -2/+7
  Write missing pydocs in `SparseVector` attributes.

  Author: lewuathe <lewuathe@me.com>
  Closes #7290 from Lewuathe/SPARK-6266 and squashes the following commits:
  51d9895 [lewuathe] Update docs
  0480d35 [lewuathe] Merge branch 'master' into SPARK-6266
  ba42cf3 [lewuathe] [SPARK-6266] PySpark SparseVector missing doc for size, indices, values
* [SPARK-8068] [MLLIB] Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib | Yanbo Liang | 2015-07-08 | 1 | -0/+11
  Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #7286 from yanboliang/spark-8068 and squashes the following commits:
  6109fe1 [Yanbo Liang] Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
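For readers unfamiliar with what a confusion matrix contains, the following is a small pure-Python sketch. It assumes rows are actual classes and columns are predicted classes, both ordered by ascending label, and that the input is a list of (prediction, label) pairs. The function `confusion_matrix` is hypothetical and is not the PySpark method added by this commit.

```python
def confusion_matrix(prediction_and_labels):
    """Count (actual, predicted) combinations; rows = actual, columns = predicted."""
    # Collect every class that appears as a prediction or a label, in ascending order.
    classes = sorted({value for pair in prediction_and_labels for value in pair})
    index = {cls: i for i, cls in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for predicted, actual in prediction_and_labels:
        matrix[index[actual]][index[predicted]] += 1
    return matrix


pairs = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
matrix = confusion_matrix(pairs)
```

Here the off-diagonal entry in the first row counts the one example of class 0.0 that was predicted as 1.0.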
* [SPARK-7785] [MLLIB] [PYSPARK] Add __str__ and __repr__ to Matrices | MechCoder | 2015-07-08 | 2 | -1/+178
  Adding __str__ and __repr__ to DenseMatrix and SparseMatrix

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #6342 from MechCoder/spark-7785 and squashes the following commits:
  7b9a82c [MechCoder] Add tests for greater than 16 elements
  b88e9dd [MechCoder] Increment limit to 16
  1425a01 [MechCoder] Change tests
  36bd166 [MechCoder] Change str and repr representation
  97f0da9 [MechCoder] zip is same as izip in python3
  94ca4b2 [MechCoder] Added doctests and iterate over values instead of colPtrs
  b26fa89 [MechCoder] minor
  394dde9 [MechCoder] [SPARK-7785] Add __str__ and __repr__ to Matrices
* [SPARK-8823] [MLLIB] [PYSPARK] Optimizations for SparseVector dot products | MechCoder | 2015-07-07 | 1 | -12/+8
  Follow-up for https://github.com/apache/spark/pull/5946
  Currently we iterate over indices and values in SparseVector; this can be vectorized.

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #7222 from MechCoder/sparse_optim and squashes the following commits:
  dcb51d3 [MechCoder] [SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot product
* [SPARK-8765] [MLLIB] Fix PySpark PowerIterationClustering test issue | Yanbo Liang | 2015-07-06 | 1 | -2/+14
  The PySpark PowerIterationClustering test fails due to bad demo data. If the data is small, PowerIterationClustering behaves nondeterministically.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #7177 from yanboliang/spark-8765 and squashes the following commits:
  392ae54 [Yanbo Liang] fix model.assignments output
  5ec3f1e [Yanbo Liang] fix PySpark PowerIterationClustering test issue
* Revert "[SPARK-7212] [MLLIB] Add sequence learning flag" | Xiangrui Meng | 2015-07-06 | 1 | -2/+2
  This reverts commit 25f574eb9a3cb9b93b7d9194a8ec16e00ce2c036. After speaking to some users and developers, we realized that FP-growth doesn't meet the requirement for frequent sequence mining. PrefixSpan (SPARK-6487) would be the correct algorithm for it.
  feynmanliang

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #7240 from mengxr/SPARK-7212.revert and squashes the following commits:
  2b3d66b [Xiangrui Meng] Revert "[SPARK-7212] [MLLIB] Add sequence learning flag"
* [SPARK-7401] [MLLIB] [PYSPARK] Vectorize dot product and sq_dist between SparseVector and DenseVector | MechCoder | 2015-07-03 | 2 | -23/+29
  Currently we iterate over indices which can be vectorized.

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #5946 from MechCoder/spark-7203 and squashes the following commits:
  034d086 [MechCoder] Vectorize dot calculation for numpy arrays for ndim=2
  bce2b07 [MechCoder] fix doctest
  fcad0a3 [MechCoder] Remove type checks for list, pyarray etc
  0ee5dd4 [MechCoder] Add tests and other isinstance changes
  e5f1de0 [MechCoder] [SPARK-7401] Vectorize dot product and sq_dist
* [SPARK-7104] [MLLIB] Support model save/load in Python's Word2Vec | Yu ISHIKAWA | 2015-07-02 | 1 | -1/+20
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #6821 from yu-iskw/SPARK-7104 and squashes the following commits:
  975136b [Yu ISHIKAWA] Organize import
  0ef58b6 [Yu ISHIKAWA] Use rmtree, instead of removedirs
  cb21653 [Yu ISHIKAWA] Add an explicit type for `Word2VecModelWrapper.save`
  1d468ef [Yu ISHIKAWA] [SPARK-7104][MLlib] Support model save/load in Python's Word2Vec
* [SPARK-8765] [MLLIB] [PYTHON] removed flaky python PIC test | Joseph K. Bradley | 2015-07-01 | 1 | -4/+0
  See failure: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36133/console]
  CC yanboliang mengxr

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #7164 from jkbradley/pic-python-test and squashes the following commits:
  156d55b [Joseph K. Bradley] removed flaky python PIC test
* [SPARK-6263] [MLLIB] Python MLlib API missing items: Utils | lewuathe | 2015-07-01 | 2 | -0/+65
  Implement missing API in pyspark.
  MLUtils
  * appendBias
  * loadVectors
  `kFold` is also missing; however, I am not sure `ClassTag` can be passed or restored through Python.

  Author: lewuathe <lewuathe@me.com>
  Closes #5707 from Lewuathe/SPARK-6263 and squashes the following commits:
  16863ea [lewuathe] Merge master
  3fc27e7 [lewuathe] Merge branch 'master' into SPARK-6263
  6084e9c [lewuathe] Resolv conflict
  d2aa2a0 [lewuathe] Resolv conflict
  9c329d8 [lewuathe] Fix efficiency
  3a12a2d [lewuathe] Merge branch 'master' into SPARK-6263
  1d4714b [lewuathe] Fix style
  b29e2bc [lewuathe] Remove scipy dependencies
  e32eb40 [lewuathe] Merge branch 'master' into SPARK-6263
  25d3c9d [lewuathe] Remove unnecessary imports
  7ec04db [lewuathe] Resolv conflict
  1502d13 [lewuathe] Resolv conflict
  d6bd416 [lewuathe] Check existence of scipy.sparse
  5d555b1 [lewuathe] Construct scipy.sparse matrix
  c345a44 [lewuathe] Merge branch 'master' into SPARK-6263
  b8b5ef7 [lewuathe] Fix unnecessary sort method
  d254be7 [lewuathe] Merge branch 'master' into SPARK-6263
  62a9c7e [lewuathe] Fix appendBias return type
  454c73d [lewuathe] Merge branch 'master' into SPARK-6263
  a353354 [lewuathe] Remove unnecessary appendBias implementation
  44295c2 [lewuathe] Merge branch 'master' into SPARK-6263
  64f72ad [lewuathe] Merge branch 'master' into SPARK-6263
  c728046 [lewuathe] Fix style
  2980569 [lewuathe] [SPARK-6263] Python MLlib API missing items: Utils
* [SPARK-4127] [MLLIB] [PYSPARK] Python bindings for StreamingLinearRegressionWithSGD | MechCoder | 2015-06-30 | 3 | -47/+217
  Python bindings for StreamingLinearRegressionWithSGD

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #6744 from MechCoder/spark-4127 and squashes the following commits:
  d8f6457 [MechCoder] Moved StreamingLinearAlgorithm to pyspark.mllib.regression
  d47cc24 [MechCoder] Inherit from StreamingLinearAlgorithm
  1b4ddd6 [MechCoder] minor
  4de6c68 [MechCoder] Minor refactor
  5e85a3b [MechCoder] Add tests for simultaneous training and prediction
  fb27889 [MechCoder] Add example and docs
  505380b [MechCoder] Add tests
  d42bdae [MechCoder] [SPARK-4127] Python bindings for StreamingLinearRegressionWithSGD
* [SPARK-7667] [MLLIB] MLlib Python API consistency check | Yanbo Liang | 2015-06-29 | 1 | -5/+10
  MLlib Python API consistency check

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #6856 from yanboliang/spark-7667 and squashes the following commits:
  21bae35 [Yanbo Liang] remove duplicate code
  eb12f95 [Yanbo Liang] fix doc inherit problem
  9e7ec3c [Yanbo Liang] address comments
  e763d32 [Yanbo Liang] MLlib Python API consistency check
* [SPARK-5962] [MLLIB] Python support for Power Iteration Clustering | Yanbo Liang | 2015-06-28 | 1 | -3/+95
  Python support for Power Iteration Clustering
  https://issues.apache.org/jira/browse/SPARK-5962

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #6992 from yanboliang/pyspark-pic and squashes the following commits:
  6b03d82 [Yanbo Liang] address comments
  4be4423 [Yanbo Liang] Python support for Power Iteration Clustering
* [SPARK-7212] [MLLIB] Add sequence learning flag | Feynman Liang | 2015-06-28 | 1 | -2/+2
  Support mining of ordered frequent item sequences.

  Author: Feynman Liang <fliang@databricks.com>
  Closes #6997 from feynmanliang/fp-sequence and squashes the following commits:
  7c14e15 [Feynman Liang] Improve scalatests with R code and Seq
  0d3e4b6 [Feynman Liang] Fix python test
  ce987cb [Feynman Liang] Backwards compatibility aux constructor
  34ef8f2 [Feynman Liang] Fix failing test due to reverse orderering
  f04bd50 [Feynman Liang] Naming, add ordered to FreqItemsets, test ordering using Seq
  648d4d4 [Feynman Liang] Test case for frequent item sequences
  252a36a [Feynman Liang] Add sequence learning flag
* [MINOR] [MLLIB] rename some functions of PythonMLLibAPI | Yanbo Liang | 2015-06-25 | 3 | -5/+5
  Keep the same naming conventions for PythonMLLibAPI. Only the following three functions differ from the others:
  ```scala
  trainNaiveBayes
  trainGaussianMixture
  trainWord2Vec
  ```
  So change them to:
  ```scala
  trainNaiveBayesModel
  trainGaussianMixtureModel
  trainWord2VecModel
  ```
  This does not affect any users or public APIs; it only makes the code easier to understand for developers and code hackers.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #7011 from yanboliang/py-mllib-api-rename and squashes the following commits:
  771ffec [Yanbo Liang] rename some functions of PythonMLLibAPI
* [SPARK-7633] [MLLIB] [PYSPARK] Python bindings for StreamingLogisticRegressionwithSGD | MechCoder | 2015-06-24 | 2 | -2/+229
  Add Python bindings to StreamingLogisticRegressionwithSGD. No Java wrappers are needed as models are updated directly using train.

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #6849 from MechCoder/spark-3258 and squashes the following commits:
  b4376a5 [MechCoder] minor
  d7e5fc1 [MechCoder] Refactor into StreamingLinearAlgorithm Better docs
  9c09d4e [MechCoder] [SPARK-7633] Python bindings for StreamingLogisticRegressionwithSGD
* [SPARK-8265] [MLLIB] [PYSPARK] Add LinearDataGenerator to pyspark.mllib.utils | MechCoder | 2015-06-23 | 2 | -2/+55
  It is useful to generate linear data for easy testing of linear models and in general. Scala already has it. This is just a wrapper around the Scala code.

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #6715 from MechCoder/generate_linear_input and squashes the following commits:
  6182884 [MechCoder] Minor changes
  8bda047 [MechCoder] Minor style fixes
  0f1053c [MechCoder] [SPARK-8265] Add LinearDataGenerator to pyspark.mllib.utils
* [SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max bins | Holden Karau | 2015-06-22 | 2 | -8/+21
  Author: Holden Karau <holden@pigscanfly.ca>
  Closes #6331 from holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and squashes the following commits:
  2894695 [Holden Karau] remove extra blank line
  2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the test a bit nicer too
  3a09170 [Holden Karau] add maxBins to to the train method as well
  af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and correctly mention the default of 32 in other places where it mentioned 100
* [SPARK-8511] [PYSPARK] Modify a test to remove a saved model in `regression.py` | Yu ISHIKAWA | 2015-06-22 | 5 | -11/+21
  [[SPARK-8511] Modify a test to remove a saved model in `regression.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8511)

  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #6926 from yu-iskw/SPARK-8511 and squashes the following commits:
  7cd0948 [Yu ISHIKAWA] Use `shutil.rmtree()` to temporary directories for saving model testings, instead of `os.removedirs()`
  4a01c9e [Yu ISHIKAWA] [SPARK-8511][pyspark] Modify a test to remove a saved model in `regression.py`
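The switch from `os.removedirs()` to `shutil.rmtree()` mentioned in commit 7cd0948 matters because `os.removedirs()` only removes empty directories, while a saved model directory contains files. The sketch below reproduces that difference with a temporary directory. The file name `part-00000` and the helper `remove_saved_model` are hypothetical, chosen only for illustration.

```python
import os
import shutil
import tempfile


def remove_saved_model(path):
    """Delete a directory tree, including any files inside it."""
    shutil.rmtree(path, ignore_errors=True)


# Simulate a saved model: a temporary directory holding one data file.
model_dir = tempfile.mkdtemp()
with open(os.path.join(model_dir, "part-00000"), "w") as handle:
    handle.write("dummy model data")

# os.removedirs() refuses to delete a directory that still contains files.
try:
    os.removedirs(model_dir)
    removedirs_failed = False
except OSError:
    removedirs_failed = True

# shutil.rmtree() removes the directory together with its contents.
remove_saved_model(model_dir)
```

After the last call the temporary directory no longer exists, so repeated test runs do not leave saved models behind.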
* [SPARK-7604] [MLLIB] Python API for PCA and PCAModel | Yanbo Liang | 2015-06-21 | 1 | -0/+35
  Python API for PCA and PCAModel

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #6315 from yanboliang/spark-7604 and squashes the following commits:
  1d58734 [Yanbo Liang] remove transform() in PCAModel, use default behavior
  4d9d121 [Yanbo Liang] Python API for PCA and PCAModel
* [SPARK-4118] [MLLIB] [PYSPARK] Python bindings for StreamingKMeans | MechCoder | 2015-06-19 | 2 | -5/+352
  Python bindings for StreamingKMeans
  Will change status to MRG once docs, tests and examples are updated.

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #6499 from MechCoder/spark-4118 and squashes the following commits:
  7722d16 [MechCoder] minor style fixes
  51052d3 [MechCoder] Doc fixes
  2061a76 [MechCoder] Add tests for simultaneous training and prediction Minor style fixes
  81482fd [MechCoder] minor
  5d9fe61 [MechCoder] predictOn should take into account the latest model
  8ab9e89 [MechCoder] Fix Python3 error
  a9817df [MechCoder] Better tests and minor fixes
  c80e451 [MechCoder] Add ignore_unicode_prefix
  ee8ce16 [MechCoder] Update tests, doc and examples
  4b1481f [MechCoder] Some changes and tests
  d8b066a [MechCoder] [SPARK-4118] [MLlib] [PySpark] Python bindings for StreamingKMeans
* [SPARK-7605] [MLLIB] [PYSPARK] Python API for ElementwiseProduct | MechCoder | 2015-06-17 | 2 | -2/+48
  Python API for org.apache.spark.mllib.feature.ElementwiseProduct

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #6346 from MechCoder/spark-7605 and squashes the following commits:
  79d1ef5 [MechCoder] Consistent and support list / array types
  5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
* [SPARK-6390] [SQL] [MLlib] Port MatrixUDT to PySpark | MechCoder | 2015-06-17 | 2 | -3/+90
  MatrixUDT was recently implemented in Scala. This PR ports it to PySpark.

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #6354 from MechCoder/spark-6390 and squashes the following commits:
  fc4dc1e [MechCoder] Better error message
  c940a44 [MechCoder] Added test
  aa9c391 [MechCoder] Add pyUDT to MatrixUDT
  62a2a7d [MechCoder] [SPARK-6390] Port MatrixUDT to PySpark
* [SPARK-7916] [MLLIB] MLlib Python doc parity check for classification and regression | Yanbo Liang | 2015-06-16 | 2 | -107/+247
  Check the MLlib Python classification and regression docs, then make them as complete as the Scala docs.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #6460 from yanboliang/spark-7916 and squashes the following commits:
  f8deda4 [Yanbo Liang] trigger jenkins
  6dc4d99 [Yanbo Liang] address comments
  ce2a43e [Yanbo Liang] truncate too long line and remove extra sparse
  3eaf6ad [Yanbo Liang] MLlib Python doc parity check for classification and regression
* [SPARK-7639] [PYSPARK] [MLLIB] Python API for KernelDensity | MechCoder | 2015-06-06 | 2 | -1/+63
  Python API for KernelDensity

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #6387 from MechCoder/spark-7639 and squashes the following commits:
  17abc62 [MechCoder] add tests
  2de6540 [MechCoder] style tests
  bf4acc0 [MechCoder] Added doctests
  84359d5 [MechCoder] [SPARK-7639] Python API for KernelDensity
* [SPARK-8032] [PYSPARK] Make version checking for NumPy in MLlib more robust | MechCoder | 2015-06-02 | 1 | -1/+3
  The current check tests whether version `1.x` is less than `1.4`. This fails if x has more than one digit: x > 4, yet `1.x` < `1.4` when the versions are compared as strings.
  It fails on my system since I have version `1.10` :P

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #6579 from MechCoder/np_ver and squashes the following commits:
  15430f8 [MechCoder] fix syntax error
  893fb7e [MechCoder] remove equal to
  e35f0d4 [MechCoder] minor
  e89376c [MechCoder] Better checking
  22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust
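The failure described in SPARK-8032 comes from comparing version strings lexicographically. One possible way to make such a check robust, shown here as a sketch and not as the code from this commit, is to compare the numeric components as integers. The helper names `version_tuple` and `numpy_too_old` are hypothetical.

```python
import re


def version_tuple(version):
    """Return the first two numeric components of a version string as ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:2])


def numpy_too_old(version, minimum="1.4"):
    """True if `version` is older than `minimum`, compared numerically."""
    return version_tuple(version) < version_tuple(minimum)


# Lexicographic comparison gets this wrong: the character "1" sorts before "4".
string_check = "1.10" < "1.4"
# Numeric comparison treats 10 as greater than 4.
numeric_check = numpy_too_old("1.10")
```

With the string comparison, NumPy `1.10` is wrongly reported as older than `1.4`; the tuple comparison reports it as new enough.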
* [SPARK-7918] [MLLIB] MLlib Python doc parity check for evaluation and feature | Yanbo Liang | 2015-05-30 | 2 | -39/+36
  Check the MLlib Python evaluation and feature docs, then make them as complete as the Scala docs.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #6461 from yanboliang/spark-7918 and squashes the following commits:
  940e3f1 [Yanbo Liang] truncate too long line and remove extra sparse
  a80ae58 [Yanbo Liang] MLlib Python doc parity check for evaluation and feature
* [SPARK-7899] [PYSPARK] Fix Python 3 pyspark/sql/types module conflict | Michael Nazario | 2015-05-29 | 2 | -8/+0
  This PR makes the types module in `pyspark/sql/types` work with pylint static analysis by removing the dynamic naming of the `pyspark/sql/_types` module to `pyspark/sql/types`.
  Tests are now loaded using `$PYSPARK_DRIVER_PYTHON -m module` rather than `$PYSPARK_DRIVER_PYTHON module.py`. The old method adds the location of `module.py` to `sys.path`, so this change prevents accidental use of relative paths in Python.

  Author: Michael Nazario <mnazario@palantir.com>
  Closes #6439 from mnazario/feature/SPARK-7899 and squashes the following commits:
  366ef30 [Michael Nazario] Remove hack on random.py
  bb8b04d [Michael Nazario] Make doctests consistent with other tests
  6ee4f75 [Michael Nazario] Change test scripts to use "-m"
  673528f [Michael Nazario] Move _types back to types
* [SPARK-7922] [MLLIB] use DataFrames for user/item factors in ALSModel | Xiangrui Meng | 2015-05-28 | 1 | -1/+4
  Expose user/item factors in DataFrames. This is to be more consistent with the pipeline API. It also helps maintain consistent APIs across languages. This PR also removed fitting params from `ALSModel`.
  coderxiang

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #6468 from mengxr/SPARK-7922 and squashes the following commits:
  7bfb1d5 [Xiangrui Meng] update ALSModel in PySpark
  1ba5607 [Xiangrui Meng] use DataFrames for user/item factors in ALS
* [SPARK-6094] [MLLIB] Add MultilabelMetrics in PySpark/MLlib | Yanbo Liang | 2015-05-20 | 1 | -0/+117
  Add MultilabelMetrics in PySpark/MLlib

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #6276 from yanboliang/spark-6094 and squashes the following commits:
  b8e3343 [Yanbo Liang] Add MultilabelMetrics in PySpark/MLlib
* [SPARK-6657] [PYSPARK] Fix doc warnings | Xiangrui Meng | 2015-05-18 | 2 | -9/+8
  Fixed the following warnings in `make clean html` under `python/docs`:
  ~~~
  /Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:3: ERROR: Unexpected indentation.
  /Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:4: WARNING: Block quote ends without a blank line; unexpected unindent.
  /Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:3: ERROR: Unexpected indentation.
  /Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:4: WARNING: Block quote ends without a blank line; unexpected unindent.
  /Users/meng/src/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.replace:16: WARNING: Field list ends without a blank line; unexpected unindent.
  /Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:8: ERROR: Unexpected indentation.
  /Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:9: WARNING: Block quote ends without a blank line; unexpected unindent.
  ~~~
  davies

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #6221 from mengxr/SPARK-6657 and squashes the following commits:
  e3f83fe [Xiangrui Meng] fix sql and streaming doc warnings
  2b4371e [Xiangrui Meng] fix mllib python doc warnings
* [SPARK-7651] [MLLIB] [PYSPARK] GMM predict, predictSoft should raise error on bad input | FlytxtRnD | 2015-05-15 | 1 | -0/+6
  In the Python API for Gaussian Mixture Model, predict() and predictSoft() methods should raise an error when the input argument is not an RDD.

  Author: FlytxtRnD <meethu.mathew@flytxt.com>
  Closes #6180 from FlytxtRnD/GmmPredictException and squashes the following commits:
  4b6aa11 [FlytxtRnD] Raise error if the input to predict()/predictSoft() is not an RDD
* [SPARK-6258] [MLLIB] GaussianMixture Python API parity check | Yanbo Liang | 2015-05-15 | 1 | -14/+53
  Implement Python API for major disparities of GaussianMixture cluster algorithm between Scala & Python
  ```scala
  GaussianMixture
      setInitialModel
  GaussianMixtureModel
      k
  ```

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #6087 from yanboliang/spark-6258 and squashes the following commits:
  b3af21c [Yanbo Liang] fix typo
  2b645c1 [Yanbo Liang] fix doc
  638b4b7 [Yanbo Liang] address comments
  b5bcade [Yanbo Liang] GaussianMixture Python API parity check
* [SPARK-6092] [MLLIB] Add RankingMetrics in PySpark/MLlib | Yanbo Liang | 2015-05-11 | 1 | -2/+76
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #6044 from yanboliang/spark-6092 and squashes the following commits:
  726a9b1 [Yanbo Liang] add newRankingMetrics
  33f649c [Yanbo Liang] Add RankingMetrics in PySpark/MLlib
* [SPARK-6091] [MLLIB] Add MulticlassMetrics in PySpark/MLlib | Yanbo Liang | 2015-05-10 | 1 | -0/+129
  https://issues.apache.org/jira/browse/SPARK-6091

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #6011 from yanboliang/spark-6091 and squashes the following commits:
  bb3e4ba [Yanbo Liang] trigger jenkins
  53c045d [Yanbo Liang] keep compatibility for python 2.6
  972d5ac [Yanbo Liang] Add MulticlassMetrics in PySpark/MLlib
* [SPARK-5913] [MLLIB] Python API for ChiSqSelector | Yanbo Liang | 2015-05-08 | 1 | -2/+57
  Add a Python API for mllib.feature.ChiSqSelector
  https://issues.apache.org/jira/browse/SPARK-5913

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #5939 from yanboliang/spark-5913 and squashes the following commits:
  cdaac99 [Yanbo Liang] Python API for ChiSqSelector
* [SPARK-7328] [MLLIB] [PYSPARK] Pyspark.mllib.linalg.Vectors: Missing items | MechCoder | 2015-05-07 | 2 | -2/+171
  Add
  1. Class methods squared_dist
  3. parse
  4. norm
  5. numNonzeros
  6. copy
  I made a few vectorizations wrt squared_dist and dot as well. I have added support for SparseMatrix serialization in a separate PR (https://github.com/apache/spark/pull/5775) and plan to complete support for Matrices in another PR.

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #5872 from MechCoder/local_linalg_api and squashes the following commits:
  a8ff1e0 [MechCoder] minor
  ce3e53e [MechCoder] Add error message for parser
  1bd3c04 [MechCoder] Robust parser and removed unnecessary methods
  f779561 [MechCoder] [SPARK-7328] Pyspark.mllib.linalg.Vectors: Missing items
* [SPARK-6093] [MLLIB] Add RegressionMetrics in PySpark/MLlib | Yanbo Liang | 2015-05-07 | 1 | -2/+76
  https://issues.apache.org/jira/browse/SPARK-6093

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #5941 from yanboliang/spark-6093 and squashes the following commits:
  6934af3 [Yanbo Liang] change to @property
  aac3bc5 [Yanbo Liang] Add RegressionMetrics in PySpark/MLlib
* [SPARK-6267] [MLLIB] Python API for IsotonicRegression | Yanbo Liang | 2015-05-05 | 1 | -2/+71
  https://issues.apache.org/jira/browse/SPARK-6267

  Author: Yanbo Liang <ybliang8@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5890 from yanboliang/spark-6267 and squashes the following commits:
  f20541d [Yanbo Liang] Merge pull request #3 from mengxr/SPARK-6267
  7f202f9 [Xiangrui Meng] use Vector to have the best Python 2&3 compatibility
  4bccfee [Yanbo Liang] fix doctest
  ec09412 [Yanbo Liang] fix typos
  8214bbb [Yanbo Liang] fix code style
  5c8ebe5 [Yanbo Liang] Python API for IsotonicRegression
* [SPARK-6612] [MLLIB] [PYSPARK] Python KMeans parity | Hrishikesh Subramonian | 2015-05-05 | 2 | -7/+31
  The following items are added to Python kmeans:
  kmeans - setEpsilon, setInitializationSteps
  KMeansModel - computeCost, k

  Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com>
  Closes #5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits:
  b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test
  5fd3ced [Hrishikesh Subramonian] doc test corrections
  20b3c68 [Hrishikesh Subramonian] python 3 fixes
  4d4e695 [Hrishikesh Subramonian] added arguments in python tests
  21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
* [SPARK-7202] [MLLIB] [PYSPARK] Add SparseMatrixPickler to SerDe | MechCoder | 2015-05-05 | 2 | -2/+5
  Utilities for pickling and unpickling SparseMatrices using SerDe

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #5775 from MechCoder/spark-7202 and squashes the following commits:
  7e689dc [MechCoder] [SPARK-7202] Add SparseMatrixPickler to SerDe
* [SPARK-6257] [PYSPARK] [MLLIB] MLlib API missing items in Recommendation | MechCoder | 2015-04-30 | 1 | -0/+39
  Adds rank, recommendUsers and recommendProducts to MatrixFactorizationModel in PySpark.

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #5807 from MechCoder/spark-6257 and squashes the following commits:
  09629c6 [MechCoder] doc
  953b326 [MechCoder] [SPARK-6257] MLlib API missing items in Recommendation
* [SPARK-7208] [ML] [PYTHON] Added Matrix, SparseMatrix to __all__ list in linalg.py | Joseph K. Bradley | 2015-04-28 | 1 | -1/+2
  Added Matrix, SparseMatrix to __all__ list in linalg.py
  CC: mengxr

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #5759 from jkbradley/SPARK-7208 and squashes the following commits:
  deb51a2 [Joseph K. Bradley] Added Matrix, SparseMatrix to __all__ list in linalg.py
* [SPARK-6827] [MLLIB] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API | Yanbo Liang | 2015-04-22 | 1 | -3/+12
  Make PySpark `FPGrowthModel.freqItemsets` consistent with the Java/Scala API, like `MatrixFactorizationModel.userFeatures`.
  It returns an RDD in which each tuple is composed of an array and a long value.
  I think it's difficult to implement namedtuples to wrap the output because the items of freqItemsets can be of any type and arbitrary length, which makes the corresponding SerDe function tedious to implement.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #5614 from yanboliang/spark-6827 and squashes the following commits:
  da8c404 [Yanbo Liang] use namedtuple
  5532e78 [Yanbo Liang] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API
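The squashed commit da8c404 indicates that a namedtuple was used in the end. As a rough illustration of wrapping (array, long) pairs this way, the sketch below uses a namedtuple whose name `FreqItemset` and fields `items` and `freq` are assumptions made for this example, applied to a plain list and not to an RDD.

```python
from collections import namedtuple

# Assumed field names: the item array and its frequency count.
FreqItemset = namedtuple("FreqItemset", ["items", "freq"])

# Raw (array, long) pairs, as described in the commit message.
raw_pairs = [(["a"], 4), (["a", "c"], 3)]

# Wrapping keeps tuple behavior and adds readable attribute access.
wrapped = [FreqItemset(items, freq) for items, freq in raw_pairs]
first_items, first_freq = wrapped[0]
```

Because a namedtuple is still a tuple, code that unpacks each element positionally keeps working, while new code can use `.items` and `.freq`.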
* [SPARK-6953] [PySpark] speed up python tests | Reynold Xin | 2015-04-21 | 4 | -58/+68
  This PR tries to speed up some Python tests:
  ```
  tests.py                  144s -> 103s   -41s
  mllib/classification.py    24s -> 17s     -7s
  mllib/regression.py        27s -> 15s    -12s
  mllib/tree.py              27s -> 13s    -14s
  mllib/tests.py             64s -> 31s    -33s
  streaming/tests.py        185s -> 84s   -101s
  ```
  Considering Python 3, the total saving will be 558s (almost 10 minutes), since core and streaming run three times and mllib runs twice.
  During testing, the time used by each test file is shown:
  ```
  Run core tests ...
  Running test: pyspark/rdd.py ... ok (22s)
  Running test: pyspark/context.py ... ok (16s)
  Running test: pyspark/conf.py ... ok (4s)
  Running test: pyspark/broadcast.py ... ok (4s)
  Running test: pyspark/accumulators.py ... ok (4s)
  Running test: pyspark/serializers.py ... ok (6s)
  Running test: pyspark/profiler.py ... ok (5s)
  Running test: pyspark/shuffle.py ... ok (1s)
  Running test: pyspark/tests.py ... ok (103s)
  144s
  ```

  Author: Reynold Xin <rxin@databricks.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5605 from rxin/python-tests-speed and squashes the following commits:
  d08542d [Reynold Xin] Merge pull request #14 from mengxr/SPARK-6953
  89321ee [Xiangrui Meng] fix seed in tests
  3ad2387 [Reynold Xin] Merge pull request #5427 from davies/python_tests
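The 558s figure in SPARK-6953 can be reproduced from the per-file savings and the stated run counts. The short calculation below assumes that `tests.py` belongs to the core suite (it appears under "Run core tests" in the sample output) and that the three `mllib/*.py` files plus `mllib/tests.py` make up the mllib suite.

```python
# Seconds saved per test file, taken from the table in the commit message.
savings = {
    "tests.py": 41,
    "mllib/classification.py": 7,
    "mllib/regression.py": 12,
    "mllib/tree.py": 14,
    "mllib/tests.py": 33,
    "streaming/tests.py": 101,
}

# Core and streaming suites run three times; mllib runs twice.
core_and_streaming = savings["tests.py"] + savings["streaming/tests.py"]
mllib = sum(seconds for name, seconds in savings.items() if name.startswith("mllib/"))
total_saving = 3 * core_and_streaming + 2 * mllib
```

This gives 3 × 142 + 2 × 66 seconds, which matches the total quoted in the message.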
* [SPARK-7036][MLLIB] ALS.train should support DataFrames in PySpark | Xiangrui Meng | 2015-04-21 | 1 | -10/+26
  SchemaRDD works with ALS.train in 1.2, so we should continue to support DataFrames for compatibility.
  coderxiang

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5619 from mengxr/SPARK-7036 and squashes the following commits:
  dfcaf5a [Xiangrui Meng] ALS.train should support DataFrames in PySpark
* [SPARK-6845] [MLlib] [PySpark] Add isTranposed flag to DenseMatrix | MechCoder | 2015-04-21 | 2 | -16/+49
  Since sparse matrices now support an isTransposed flag for row-major data, DenseMatrices should do the same.

  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #5455 from MechCoder/spark-6845 and squashes the following commits:
  525c370 [MechCoder] minor
  004a37f [MechCoder] Cast boolean to int
  151f3b6 [MechCoder] [WIP] Add isTransposed to pickle DenseMatrix
  cc0b90a [MechCoder] [SPARK-6845] Add isTranposed flag to DenseMatrix
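To show what a transposed flag changes for a dense matrix stored as a flat array, here is a minimal sketch of element lookup. It assumes that the default layout is column-major and that the flag switches to row-major, as the commit message describes. The function `dense_matrix_get` and its parameter names are hypothetical, not the PySpark API.

```python
def dense_matrix_get(values, num_rows, num_cols, i, j, is_transposed=False):
    """Return entry (i, j) of a matrix stored as a flat list of values."""
    if is_transposed:
        # Row-major layout: each row is stored contiguously.
        return values[i * num_cols + j]
    # Column-major layout: each column is stored contiguously.
    return values[i + j * num_rows]


# The same 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] in both layouts.
column_major = [1, 4, 2, 5, 3, 6]
row_major = [1, 2, 3, 4, 5, 6]

top_right_col = dense_matrix_get(column_major, 2, 3, 0, 2)
top_right_row = dense_matrix_get(row_major, 2, 3, 0, 2, is_transposed=True)
```

Both lookups return the same entry, so row-major data can be used without copying it into column-major order first.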