| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
| |
fixed python doc build warnings
CC whomever wants to review: rxin mengxr davies
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #5317 from jkbradley/python-doc-warnings and squashes the following commits:
4cd43c2 [Joseph K. Bradley] fixed python doc build warnings
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
numpy array
Users should be able to use numpy operators directly on dense vectors. davies atalwalkar
Author: Xiangrui Meng <meng@databricks.com>
Closes #5312 from mengxr/SPARK-6651 and squashes the following commits:
e665c5c [Xiangrui Meng] wrap the result in a dense vector
23dfca3 [Xiangrui Meng] delegate dense vector arithmetics to the underlying numpy array
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Python API parity check for classification and multiclass classification support, major disparities need to be added for Python:
```scala
LogisticRegressionWithLBFGS
setNumClasses
setValidateData
LogisticRegressionModel
getThreshold
numClasses
numFeatures
SVMWithSGD
setValidateData
SVMModel
getThreshold
```
For users the greatest benefit in this PR is multiclass classification was supported by Python API.
Users can train multiclass classification model and use it to predict in pyspark.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #5137 from yanboliang/spark-6255 and squashes the following commits:
0bd531e [Yanbo Liang] address comments
444d5e2 [Yanbo Liang] LogisticRegressionModel.predict() optimization
fc7990b [Yanbo Liang] address comments
b0d9c63 [Yanbo Liang] Support Mulinomial LR model predict in Python API
ded847c [Yanbo Liang] Python API parity check for classification (support multiclass classification)
|
|
|
|
|
|
|
|
|
|
|
| |
This is the sub-task of SPARK-6254.
Wrapping IDFModel `idf` member function for pyspark.
Author: lewuathe <lewuathe@me.com>
Closes #5264 from Lewuathe/SPARK-6598 and squashes the following commits:
1dc522c [lewuathe] [SPARK-6598] Python API for IDFModel
|
|
|
|
|
|
|
|
|
|
| |
This fixes `predictAll` after load. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #5243 from mengxr/SPARK-6571 and squashes the following commits:
82dcaa7 [Xiangrui Meng] use wrapper in MatrixFactorizationModel.load
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
MLlib Python API parity check for Regression, major disparities need to be added for Python list following:
```scala
LinearRegressionWithSGD
setValidateData
LassoWithSGD
setIntercept
setValidateData
RidgeRegressionWithSGD
setIntercept
setValidateData
```
setFeatureScaling is mllib private function which is not needed to expose in pyspark.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #4997 from yanboliang/spark-6256 and squashes the following commits:
102f498 [Yanbo Liang] fix intercept issue & add doc test
1fb7b4f [Yanbo Liang] change 'intercept' to 'addIntercept'
de5ecbc [Yanbo Liang] MLlib Python API parity check for regression
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
correctly
Weight parameters must be initialized correctly even when numpy array is passed as initial weights.
Author: lewuathe <lewuathe@me.com>
Closes #5101 from Lewuathe/SPARK-6421 and squashes the following commits:
7795201 [lewuathe] Fix lint-python errors
21d4fe3 [lewuathe] Fix init logic of weights
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
See [SPARK-6096](https://issues.apache.org/jira/browse/SPARK-6096).
Author: Xusen Yin <yinxusen@gmail.com>
Closes #5090 from yinxusen/SPARK-6096 and squashes the following commits:
bd0fea5 [Xusen Yin] fix style problem, etc.
3fd41f2 [Xusen Yin] use hanging indent in Python style
e83803d [Xusen Yin] fix Python style
d6dbde5 [Xusen Yin] fix python call java error
a054bb3 [Xusen Yin] add save load for NaiveBayes python
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For Python's linear models, weights and intercept are stored in Python.
This PR implements Python's linear models sava/load functions which do the same thing as scala.
It can also make model import/export cross languages.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #5016 from yanboliang/spark-6095 and squashes the following commits:
d9bb824 [Yanbo Liang] fix python style
b3813ca [Yanbo Liang] linear model save/load for Python reuse the Scala implementation
|
|
|
|
|
|
|
|
|
|
|
| |
Use `_py2java` and `_java2py` to convert Python model to/from Java model. yinxusen
Author: Xiangrui Meng <meng@databricks.com>
Closes #5049 from mengxr/SPARK-6226-mengxr and squashes the following commits:
570ba81 [Xiangrui Meng] fix python style
b10b911 [Xiangrui Meng] add save/load in PySpark's KMeansModel
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add LassoModel to __all__ in regression.py
LassoModel does not show up in Python docs
This should be merged into branch-1.3 and master.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4970 from jkbradley/SPARK-6253 and squashes the following commits:
c2cb533 [Joseph K. Bradley] Add LassoModel to __all__ in regression.py
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A simple wrapper around the Scala implementation. `DataFrame` is used for serialization/deserialization. Methods that return `RDD`s are not supported in this PR.
davies If we recognize Scala's `Product`s in Py4J, we can easily add wrappers for Scala methods that returns `RDD[(Double, Double)]`. Is it easy to register serializer for `Product` in PySpark?
Author: Xiangrui Meng <meng@databricks.com>
Closes #4863 from mengxr/SPARK-6090 and squashes the following commits:
009a3a3 [Xiangrui Meng] provide schema
dcddab5 [Xiangrui Meng] add a basic BinaryClassificationMetrics to PySpark/MLlib
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Similar to `MatrixFactorizaionModel`, we only need wrappers to support save/load for tree models in Python.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #4854 from mengxr/SPARK-6097 and squashes the following commits:
4586a4d [Xiangrui Meng] fix more typos
8ebcac2 [Xiangrui Meng] fix python style
91172d8 [Xiangrui Meng] fix typos
201b3b9 [Xiangrui Meng] update user guide
b5158e2 [Xiangrui Meng] support tree model save/load in PySpark/MLlib
|
|
|
|
|
|
|
|
|
|
|
|
| |
`df.dtypes` shows `null` for UDTs. This PR uses `udt` by default and `VectorUDT` overwrites it with `vector`.
jkbradley davies
Author: Xiangrui Meng <meng@databricks.com>
Closes #4858 from mengxr/SPARK-6121 and squashes the following commits:
34f0a77 [Xiangrui Meng] simpleString for UDT
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
for pyspark
Currently LogisticRegressionWithLBFGS in python/pyspark/mllib/classification.py will invoke callMLlibFunc with a wrong "regType" parameter.
It was assigned to "str(regType)" which translate None(Python) to "None"(Java/Scala). The right way should be translate None(Python) to null(Java/Scala) just as what we did at LogisticRegressionWithSGD.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #4831 from yanboliang/pyspark_classification and squashes the following commits:
12db65a [Yanbo Liang] correct LogisticRegressionWithLBFGS regType parameter for pyspark
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #4811 from mengxr/SPARK-5991 and squashes the following commits:
f135dac [Xiangrui Meng] update save doc
57e5200 [Xiangrui Meng] address comments
06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991
282ec8d [Xiangrui Meng] support save/load in PySpark's ALS
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
save/load, Python GBT
* Add GradientBoostedTrees Python examples to ML guide
* I ran these in the pyspark shell, and they worked.
* Add save/load to examples in ML guide
* Added note to python docs about predict,transform not working within RDD actions,transformations in some cases (See SPARK-5981)
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4750 from jkbradley/SPARK-5974 and squashes the following commits:
c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide. Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
6d81c3e [Joseph K. Bradley] completed python GBT examples
9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide. Added GBT examples to ML guide
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.
For SPARK-5892:
* Fix Python docs
* Various other cleanups
BTW, I accidentally merged this with master. If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]
CC: mengxr (ML), davies (Python docs)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:
f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example. Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- The old implicit would convert RDDs directly to DataFrames, and that added too many methods.
- toDataFrame -> toDF
- Dsl -> functions
- implicits moved into SQLContext.implicits
- addColumn -> withColumn
- renameColumn -> withColumnRenamed
Python changes:
- toDataFrame -> toDF
- Dsl -> functions package
- addColumn -> withColumn
- renameColumn -> withColumnRenamed
- add toDF functions to RDD on SQLContext init
- add flatMap to DataFrame
Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>
Closes #4556 from rxin/SPARK-5752 and squashes the following commits:
5ef9910 [Reynold Xin] More fix
61d3fca [Reynold Xin] Merge branch 'df5' of github.com:davies/spark into SPARK-5752
ff5832c [Reynold Xin] Fix python
749c675 [Reynold Xin] count(*) fixes.
5806df0 [Reynold Xin] Fix build break again.
d941f3d [Reynold Xin] Fixed explode compilation break.
fe1267a [Davies Liu] flatMap
c4afb8e [Reynold Xin] style
d9de47f [Davies Liu] add comment
b783994 [Davies Liu] add comment for toDF
e2154e5 [Davies Liu] schema() -> schema
3a1004f [Davies Liu] Dsl -> functions, toDF()
fb256af [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
0dd74eb [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
97dd47c [Davies Liu] fix mistake
6168f74 [Davies Liu] fix test
1fc0199 [Davies Liu] fix test
a075cd5 [Davies Liu] clean up, toPandas
663d314 [Davies Liu] add test for agg('*')
9e214d5 [Reynold Xin] count(*) fixes.
1ed7136 [Reynold Xin] Fix build break again.
921b2e3 [Reynold Xin] Fixed explode compilation break.
14698d4 [Davies Liu] flatMap
ba3e12d [Reynold Xin] style
d08c92d [Davies Liu] add comment
5c8b524 [Davies Liu] add comment for toDF
a4e5e66 [Davies Liu] schema() -> schema
d377fc9 [Davies Liu] Dsl -> functions, toDF()
6b3086c [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
807e8b1 [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
All the DataTypes moved into pyspark.sql.types
The changes can be tracked by `--find-copies-harder -M25`
```
davieslocalhost:~/work/spark/python$ git diff --find-copies-harder -M25 --numstat master..
2 5 python/docs/pyspark.ml.rst
0 3 python/docs/pyspark.mllib.rst
10 2 python/docs/pyspark.sql.rst
1 1 python/pyspark/mllib/linalg.py
21 14 python/pyspark/{mllib => sql}/__init__.py
14 2108 python/pyspark/{sql.py => sql/context.py}
10 1772 python/pyspark/{sql.py => sql/dataframe.py}
7 6 python/pyspark/{sql_tests.py => sql/tests.py}
8 1465 python/pyspark/{sql.py => sql/types.py}
4 2 python/run-tests
1 1 sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
```
Also `git blame -C -C python/pyspark/sql/context.py` to track the history.
Author: Davies Liu <davies@databricks.com>
Closes #4479 from davies/sql and squashes the following commits:
1b5f0a5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sql
2b2b983 [Davies Liu] restructure pyspark.sql
|
|
|
|
|
|
|
|
|
|
|
| |
Add a seed for tests.
Author: Davies Liu <davies@databricks.com>
Closes #4358 from davies/flaky_test and squashes the following commits:
02371c3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into flaky_test
ced499b [Davies Liu] add seed for test
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The only issue is that `analyzeBlock` is removed, which was marked as a developer API. I didn't change other tests in the ALSSuite under `spark.mllib` to ensure that the implementation is correct.
CC: srowen coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes #4321 from mengxr/SPARK-5536 and squashes the following commits:
5a3cee8 [Xiangrui Meng] update python tests that are too strict
e840acf [Xiangrui Meng] ignore scala style check for ALS.train
e9a721c [Xiangrui Meng] update mima excludes
9ee6a36 [Xiangrui Meng] merge master
9a8aeac [Xiangrui Meng] update tests
d8c3271 [Xiangrui Meng] remove analyzeBlocks
d68eee7 [Xiangrui Meng] add checkpoint to new ALS
22a56f8 [Xiangrui Meng] wrap old ALS
c387dff [Xiangrui Meng] support random seed
3bdf24b [Xiangrui Meng] make storage level configurable in the new ALS
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Python API for the Gaussian Mixture Model clustering algorithm in MLLib.
Author: FlytxtRnD <meethu.mathew@flytxt.com>
Closes #4059 from FlytxtRnD/PythonGmmWrapper and squashes the following commits:
c973ab3 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
339b09c [FlytxtRnD] Added MultivariateGaussian namedtuple and Arraybuffer in trainGaussianMixture
fa0a142 [FlytxtRnD] New line added
d5b36ab [FlytxtRnD] Changed argument names to lowercase
ac134f1 [FlytxtRnD] Merge branch 'PythonGmmWrapper' of https://github.com/FlytxtRnD/spark into PythonGmmWrapper
6671ea1 [FlytxtRnD] Added mllib/stat/distribution.py
3aee84b [FlytxtRnD] Fixed style issues
2e9f12a [FlytxtRnD] Added mllib/stat/distribution.py and fixed style issues
b22532c [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
2e14d82 [FlytxtRnD] Incorporate MultivariateGaussian instances in GaussianMixtureModel
05767c7 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
3464d19 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
c1d4c71 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'origin/PythonGmmWrapper' into PythonGmmWrapper
426d130 [FlytxtRnD] Added random seed parameter
332bad1 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
f82750b [FlytxtRnD] Fixed style issues
5c83825 [FlytxtRnD] Split input file with space delimiter
fda60f3 [FlytxtRnD] Python API for Gaussian Mixture Model
|
|
|
|
|
|
|
|
|
|
| |
This PR is implementing the Gradient Boosted Trees for Python API.
Author: Kazuki Taniguchi <kazuki.t.1018@gmail.com>
Closes #3951 from kazk1018/gbt_for_py and squashes the following commits:
620d247 [Kazuki Taniguchi] [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is only a single `stat.py` file for the `mllib.stat` package. We recently added `MultivariateGaussian` under `mllib.stat.distribution` in Scala/Java. It would be nice to refactor `stat.py` and make it easy to expand. Note that `ChiSqTestResult` is moved from `mllib.stat` to `mllib.stat.test`. The latter is used in Scala/Java. It is only used in the return value of `Statistics.chiSqTest`, so this should be an okay change.
davies
Author: Xiangrui Meng <meng@databricks.com>
Closes #4266 from mengxr/py-stat-refactor and squashes the following commits:
1a5e1db [Xiangrui Meng] refactor stat.py
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This implements the functionality for SPARK-4749 and provides units tests in Scala and PySpark
Author: nate.crosswhite <nate.crosswhite@stresearch.com>
Author: nxwhite-str <nxwhite-str@users.noreply.github.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #3610 from nxwhite-str/master and squashes the following commits:
a2ebbd3 [nxwhite-str] Merge pull request #1 from mengxr/SPARK-4749-kmeans-seed
7668124 [Xiangrui Meng] minor updates
f8d5928 [nate.crosswhite] Addressing PR issues
277d367 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
9156a57 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
5d087b4 [nate.crosswhite] Adding KMeans train with seed and Scala unit test
616d111 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
35c1884 [nate.crosswhite] Add kmeans initial seed to pyspark API
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Slightly different than the scala code which converts the sparsevector into a densevector and then checks the index.
I also hope I've added tests in the right place.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #4025 from MechCoder/spark-2909 and squashes the following commits:
07d0f26 [MechCoder] STY: Rename item to index
f02148b [MechCoder] [SPARK-2909] [Mlib] SparseVector in pyspark now supports indexing
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It will introduce problems if the object in dict/list/tuple can not support by py4j, such as Vector.
Also, pickle may have better performance for larger object (less RPC).
In some cases that the object in dict/list can not be pickled (such as JavaObject), we should still use MapConvert/ListConvert.
This PR should be ported into branch-1.2
Author: Davies Liu <davies@databricks.com>
Closes #4023 from davies/listconvert and squashes the following commits:
55d4ab2 [Davies Liu] fix MapConverter and ListConverter in MLlib
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
...ySpark MLlib
This is a follow up to PR3680 https://github.com/apache/spark/pull/3680 .
Author: RJ Nowling <rnowling@gmail.com>
Closes #3955 from rnowling/spark4891 and squashes the following commits:
1236a01 [RJ Nowling] Fix Python style issues
7a01a78 [RJ Nowling] Fix Python style issues
174beab [RJ Nowling] [SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to PySpark MLlib
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to `DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark should automatically upcast to float64s, but currently this wasn't actually happening. As a result, non-float64 would be silently parsed inappropriately during SerDe, yielding erroneous results when running, for example, KMeans.
The PR includes the fix, as well as a new test for the correct conversion behavior.
davies
Author: freeman <the.freeman.lab@gmail.com>
Closes #3902 from freeman-lab/fix-vector-convert and squashes the following commits:
764db47 [freeman] Add a test for proper conversion behavior
704f97e [freeman] Return array after changing type
|
|
|
|
|
|
|
|
|
|
|
|
| |
Modify python annotations for sphinx. There is no change to build process from.
https://github.com/apache/spark/blob/master/docs/README.md
Author: lewuathe <lewuathe@me.com>
Closes #3685 from Lewuathe/sphinx-tag-for-pydoc and squashes the following commits:
88a0fd9 [lewuathe] [SPARK-4822] Fix DevelopApi and WARN tags
3d7a398 [lewuathe] [SPARK-4822] Use sphinx tags for Python doc annotations
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
+ small doc edit
+ include edit to make IntelliJ happy
CC: davies mengxr
Note to davies -- this does not fix the "WARNING: Literal block expected; none found." warnings since that seems to involve spacing which IntelliJ does not like. (Those warnings occur when generating the Python docs.)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #3669 from jkbradley/python-warnings and squashes the following commits:
4587868 [Joseph K. Bradley] fixed warning
8cb073c [Joseph K. Bradley] Updated based on davies recommendation
c51eca4 [Joseph K. Bradley] Updated rst file for pyspark.mllib.rand doc. Small doc edit. Small include edit to make IntelliJ happy.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR tests the pyspark Chi-squared hypothesis test from this commit: c8abddc5164d8cf11cdede6ab3d5d1ea08028708 and moves some of the error messaging in to python.
It is a port of the Scala tests here: [HypothesisTestSuite.scala](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala)
Hopefully, SPARK-2980 can be closed.
Author: jbencook <jbenjamincook@gmail.com>
Closes #3679 from jbencook/master and squashes the following commits:
44078e0 [jbencook] checking that bad input throws the correct exceptions
f12ee10 [jbencook] removing checks for ValueError since input tests are on the Scala side
7536cf1 [jbencook] removing python checks for invalid input
a17ee84 [jbencook] [SPARK-2980][mllib] adding unit tests for the pyspark chi-squared test
3aeb0d9 [jbencook] [SPARK-2980][mllib] bringing Chi-squared error messages to the python side
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I improved `IDFModel.transform` to allow using a single vector.
[[SPARK-4494] IDFModel.transform() add support for single vector - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-4494)
Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #3603 from yu-iskw/idf and squashes the following commits:
256ff3d [Yuu ISHIKAWA] Fix typo
a3bf566 [Yuu ISHIKAWA] - Fix typo - Optimize import order - Aggregate the assertion tests - Modify `IDFModel.transform` API for pyspark
d25e49b [Yuu ISHIKAWA] Add the implementation of `IDFModel.transform` for a term frequency vector
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
DecisionTree API fix
Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)
Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
* Use train/test split, and compute test error instead of training error.
* Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)
Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming
I have run all examples and relevant unit tests.
CC: mengxr manishamde codedeft
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:
70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR change the underline array of DenseVector to numpy.ndarray to avoid the conversion, because most of the users will using numpy.array.
It also improve the serialization of DenseVector.
Before this change:
trial | trainingTime | testTime
-------|--------|--------
0 | 5.126 | 1.786
1 |2.698 |1.693
After the change:
trial | trainingTime | testTime
-------|--------|--------
0 |4.692 |0.554
1 |2.307 |0.525
This could partially fix the performance regression during test.
Author: Davies Liu <davies@databricks.com>
Closes #3420 from davies/ser2 and squashes the following commits:
0e1e6f3 [Davies Liu] fix tests
426f5db [Davies Liu] impove toArray()
44707ec [Davies Liu] add name for ISO-8859-1
fa7d791 [Davies Liu] address comments
1cfb137 [Davies Liu] handle zero sparse vector
2548ee2 [Davies Liu] fix tests
9e6389d [Davies Liu] bugfix
470f702 [Davies Liu] speed up DenseMatrix
f0d3c40 [Davies Liu] speedup SparseVector
ef6ce70 [Davies Liu] speed up dense vector
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The Pyrolite is pretty slow (comparing to the adhoc serializer in 1.1), it cause much performance regression in 1.2, because we cache the serialized Python object in JVM, deserialize them into Java object in each step.
This PR change to cache the deserialized JavaRDD instead of PythonRDD to avoid the deserialization of Pyrolite. It should have similar memory usage as before, but much faster.
Author: Davies Liu <davies@databricks.com>
Closes #3397 from davies/cache and squashes the following commits:
7f6e6ce [Davies Liu] Update -> Updater
4b52edd [Davies Liu] using named argument
63b984e [Davies Liu] fix
7da0332 [Davies Liu] add unpersist()
dff33e1 [Davies Liu] address comments
c2bdfc2 [Davies Liu] refactor
d572f00 [Davies Liu] Merge branch 'master' into cache
f1063e1 [Davies Liu] cache serialized java object
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
```
class RandomForestModel
| A model trained by RandomForest
|
| numTrees(self)
| Get number of trees in forest.
|
| predict(self, x)
| Predict values for a single data point or an RDD of points using the model trained.
|
| toDebugString(self)
| Full model
|
| totalNumNodes(self)
| Get total number of nodes, summed over all trees in the forest.
|
class RandomForest
| trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None):
| Method to train a decision tree model for binary or multiclass classification.
|
| :param data: Training dataset: RDD of LabeledPoint.
| Labels should take values {0, 1, ..., numClasses-1}.
| :param numClassesForClassification: number of classes for classification.
| :param categoricalFeaturesInfo: Map storing arity of categorical features.
| E.g., an entry (n -> k) indicates that feature n is categorical
| with k categories indexed from 0: {0, 1, ..., k-1}.
| :param numTrees: Number of trees in the random forest.
| :param featureSubsetStrategy: Number of features to consider for splits at each node.
| Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
| If "auto" is set, this parameter is set based on numTrees:
| if numTrees == 1, set to "all";
| if numTrees > 1 (forest) set to "sqrt".
| :param impurity: Criterion used for information gain calculation.
| Supported values: "gini" (recommended) or "entropy".
| :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
| 1 internal node + 2 leaf nodes. (default: 4)
| :param maxBins: maximum number of bins used for splitting features (default: 100)
| :param seed: Random seed for bootstrapping and choosing feature subsets.
| :return: RandomForestModel that can be used for prediction
|
| trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None):
| Method to train a decision tree model for regression.
|
| :param data: Training dataset: RDD of LabeledPoint.
| Labels are real numbers.
| :param categoricalFeaturesInfo: Map storing arity of categorical features.
| E.g., an entry (n -> k) indicates that feature n is categorical
| with k categories indexed from 0: {0, 1, ..., k-1}.
| :param numTrees: Number of trees in the random forest.
| :param featureSubsetStrategy: Number of features to consider for splits at each node.
| Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
| If "auto" is set, this parameter is set based on numTrees:
| if numTrees == 1, set to "all";
| if numTrees > 1 (forest) set to "onethird".
| :param impurity: Criterion used for information gain calculation.
| Supported values: "variance".
| :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
| 1 internal node + 2 leaf nodes.(default: 4)
| :param maxBins: maximum number of bins used for splitting features (default: 100)
| :param seed: Random seed for bootstrapping and choosing feature subsets.
| :return: RandomForestModel that can be used for prediction
|
```
Author: Davies Liu <davies@databricks.com>
Closes #3320 from davies/forest and squashes the following commits:
8003dfc [Davies Liu] reorder
53cf510 [Davies Liu] fix docs
4ca593d [Davies Liu] fix docs
e0df852 [Davies Liu] fix docs
0431746 [Davies Liu] rebased
2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into forest
885abee [Davies Liu] address comments
dae7fc0 [Davies Liu] address comments
89a000f [Davies Liu] fix docs
565d476 [Davies Liu] add python api for random forest
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
```
class LogisticRegressionWithLBFGS
| train(cls, data, iterations=100, initialWeights=None, corrections=10, tolerance=0.0001, regParam=0.01, intercept=False)
| Train a logistic regression model on the given data.
|
| :param data: The training data, an RDD of LabeledPoint.
| :param iterations: The number of iterations (default: 100).
| :param initialWeights: The initial weights (default: None).
| :param regParam: The regularizer parameter (default: 0.01).
| :param regType: The type of regularizer used for training
| our model.
| :Allowed values:
| - "l1" for using L1 regularization
| - "l2" for using L2 regularization
| - None for no regularization
| (default: "l2")
| :param intercept: Boolean parameter which indicates the use
| or not of the augmented representation for
| training data (i.e. whether bias features
| are activated or not).
| :param corrections: The number of corrections used in the LBFGS update (default: 10).
| :param tolerance: The convergence tolerance of iterations for L-BFGS (default: 1e-4).
|
| >>> data = [
| ... LabeledPoint(0.0, [0.0, 1.0]),
| ... LabeledPoint(1.0, [1.0, 0.0]),
| ... ]
| >>> lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(data))
| >>> lrm.predict([1.0, 0.0])
| 1
| >>> lrm.predict([0.0, 1.0])
| 0
| >>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
| [1, 0]
```
Author: Davies Liu <davies@databricks.com>
Closes #3307 from davies/lbfgs and squashes the following commits:
34bd986 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into lbfgs
5a945a6 [Davies Liu] address comments
941061b [Davies Liu] Merge branch 'master' of github.com:apache/spark into lbfgs
03e5543 [Davies Liu] add it to docs
ed2f9a8 [Davies Liu] add regType
76cd1b6 [Davies Liu] reorder arguments
4429a74 [Davies Liu] Update classification.py
9252783 [Davies Liu] python api for LogisticRegressionWithLBFGS
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In PySpark, ALS can take an RDD of (user, product, rating) tuples as input. However, model.predict outputs an RDD of Rating. So on the input side, users can use r[0], r[1], r[2], while on the output side, users have to use r.user, r.product, r.rating. We should allow lookup by index in Rating by making Rating a namedtuple.
davies
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3261)
<!-- Reviewable:end -->
Author: Xiangrui Meng <meng@databricks.com>
Closes #3261 from mengxr/SPARK-4396 and squashes the following commits:
543aef0 [Xiangrui Meng] use named tuple to implement ALS
0b61bae [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4396
d3bd7d4 [Xiangrui Meng] allow lookup by index in Python's Rating
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR add setThrehold() and clearThreshold() for LogisticRegressionModel and SVMModel, also support RDD of vector in LogisticRegressionModel.predict(), SVNModel.predict() and NaiveBayes.predict()
Author: Davies Liu <davies@databricks.com>
Closes #3305 from davies/setThreshold and squashes the following commits:
d0b835f [Davies Liu] Merge branch 'master' of github.com:apache/spark into setThreshold
e4acd76 [Davies Liu] address comments
2231a5f [Davies Liu] bugfix
7bd9009 [Davies Liu] address comments
0b0a8a7 [Davies Liu] address comments
c1e5573 [Davies Liu] improve classification
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
and Python
The current default regParam is 1.0 and regType is claimed to be none in Python (but actually it is l2), while regParam = 0.0 and regType is L2 in Scala. We should make the default values consistent. This PR sets the default regType to L2 and regParam to 0.01. Note that the default regParam value in LIBLINEAR (and hence scikit-learn) is 1.0. However, we use average loss instead of total loss in our formulation. Hence regParam=1.0 is definitely too heavy.
In LinearRegression, we set regParam=0.0 and regType=None, because we have separate classes for Lasso and Ridge, both of which use regParam=0.01 as the default.
davies atalwalkar
Author: Xiangrui Meng <meng@databricks.com>
Closes #3232 from mengxr/SPARK-4372 and squashes the following commits:
9979837 [Xiangrui Meng] update Ridge/Lasso to use default regParam 0.01 cast input arguments
d3ba096 [Xiangrui Meng] change 'none' back to None
1909a6e [Xiangrui Meng] change default regParam to 0.01 and regType to L2 in LR and SVM
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR rename random.py to rand.py to avoid the side affects of conflict with random module, but still keep the same interface as before.
```
>>> from pyspark.mllib.random import RandomRDDs
```
```
$ pydoc pyspark.mllib.random
Help on module random in pyspark.mllib:
NAME
random - Python package for random data generation.
FILE
/Users/davies/work/spark/python/pyspark/mllib/rand.py
CLASSES
__builtin__.object
pyspark.mllib.random.RandomRDDs
class RandomRDDs(__builtin__.object)
| Generator methods for creating RDDs comprised of i.i.d samples from
| some distribution.
|
| Static methods defined here:
|
| normalRDD(sc, size, numPartitions=None, seed=None)
```
cc mengxr
reference link: http://xion.org.pl/2012/05/06/hacking-python-imports/
Author: Davies Liu <davies@databricks.com>
Closes #3216 from davies/random and squashes the following commits:
7ac4e8b [Davies Liu] rename random.py to rand.py
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix TreeModel.predict() with RDD, added tests for it.
(Also checked that other models don't have this issue)
Author: Davies Liu <davies@databricks.com>
Closes #3230 from davies/predict and squashes the following commits:
81172aa [Davies Liu] fix predict
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR check all of the existing Python MLlib API to make sure that numpy.array is supported as Vector (also RDD of numpy.array).
It also improve some docstring and doctest.
cc mateiz mengxr
Author: Davies Liu <davies@databricks.com>
Closes #3189 from davies/numpy and squashes the following commits:
d5057c4 [Davies Liu] fix tests
6987611 [Davies Liu] support numpy.array for all MLlib API
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
SPARK-1553 added alternating nonnegative least squares to MLLib, however it's not possible to access it via the python API. This pull request resolves that.
Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com>
Closes #3095 from mdagost/python_nmf and squashes the following commits:
a6743ad [Michelangelo D'Agostino] Use setters instead of static methods in PythonMLLibAPI. Remove the new static methods I added. Set seed in tests. Change ratings to ratingsRDD in both train and trainImplicit for consistency.
7cffd39 [Michelangelo D'Agostino] Swapped nonnegative and seed in a few more places.
3fdc851 [Michelangelo D'Agostino] Moved seed to the end of the python parameter list.
bdcc154 [Michelangelo D'Agostino] Change seed type to java.lang.Long so that it can handle null.
cedf043 [Michelangelo D'Agostino] Added in ability to set the seed from python and made that play nice with the nonnegative changes. Also made the python ALS tests more exact.
a72fdc9 [Michelangelo D'Agostino] Expose nonnegative ALS in the python API.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
```
pyspark.mllib.stat.StatisticschiSqTest(observed, expected=None)
:: Experimental ::
If `observed` is Vector, conduct Pearson's chi-squared goodness
of fit test of the observed data against the expected distribution,
or againt the uniform distribution (by default), with each category
having an expected frequency of `1 / len(observed)`.
(Note: `observed` cannot contain negative values)
If `observed` is matrix, conduct Pearson's independence test on the
input contingency matrix, which cannot contain negative entries or
columns or rows that sum up to 0.
If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
test for every feature against the label across the input RDD.
For each feature, the (feature, label) pairs are converted into a
contingency matrix for which the chi-squared statistic is computed.
All label and feature values must be categorical.
:param observed: it could be a vector containing the observed categorical
counts/relative frequencies, or the contingency matrix
(containing either counts or relative frequencies),
or an RDD of LabeledPoint containing the labeled dataset
with categorical features. Real-valued features will be
treated as categorical for each distinct value.
:param expected: Vector containing the expected categorical counts/relative
frequencies. `expected` is rescaled if the `expected` sum
differs from the `observed` sum.
:return: ChiSquaredTest object containing the test statistic, degrees
of freedom, p-value, the method used, and the null hypothesis.
```
Author: Davies Liu <davies@databricks.com>
Closes #3091 from davies/his and squashes the following commits:
145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
default.
This PR simplify serializer, always use batched serializer (AutoBatchedSerializer as default), even batch size is 1.
Author: Davies Liu <davies@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Josh Rosen <joshrosen@databricks.com>
Closes #2920 from davies/fix_autobatch and squashes the following commits:
e544ef9 [Davies Liu] revert unrelated change
6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
1d557fc [Davies Liu] fix tests
8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
76abdce [Davies Liu] clean up
53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
b4292ce [Davies Liu] fix bug in master
d79744c [Davies Liu] recover hive tests
be37ece [Davies Liu] refactor
eb3938d [Davies Liu] refactor serializer in scala
8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.
~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~
marmbrus jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:
3a0b6e5 [Xiangrui Meng] organize imports
236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Create several helper functions to call MLlib Java API, convert the arguments to Java type and convert return value to Python object automatically, this simplify serialization in MLlib Python API very much.
After this, the MLlib Python API does not need to deal with serialization details anymore, it's easier to add new API.
cc mengxr
Author: Davies Liu <davies@databricks.com>
Closes #2995 from davies/cleanup and squashes the following commits:
8fa6ec6 [Davies Liu] address comments
16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
43743e5 [Davies Liu] bugfix
731331f [Davies Liu] simplify serialization in MLlib Python API
|