| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
This fixes `predictAll` after load. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #5243 from mengxr/SPARK-6571 and squashes the following commits:
82dcaa7 [Xiangrui Meng] use wrapper in MatrixFactorizationModel.load
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
See [SPARK-6526](https://issues.apache.org/jira/browse/SPARK-6526).
mengxr Should we add test suite for this transformer? There is no test suite for all feature transformers in ML package now.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #5181 from yinxusen/SPARK-6526 and squashes the following commits:
6faa7bf [Xusen Yin] fix style
8a462da [Xusen Yin] remove duplications
ab35ab0 [Xusen Yin] add test suite
bc8cd0f [Xusen Yin] fix comment
79774c9 [Xusen Yin] add Normalizer transformer in ML package
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are any bugs of breeze's SparseVector at 0.11.1. You know, Spark 1.3 depends on breeze 0.11.1. So I think we should upgrade it to 0.11.2.
https://issues.apache.org/jira/browse/SPARK-6341
And thanks you for your great cooperation, David Hall(dlwh)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #5222 from yu-iskw/upgrade-breeze and squashes the following commits:
ad8a688 [Yu ISHIKAWA] Upgrade breeze from 0.11.1 to 0.11.2 because of a bug of SparseVector. Thanks you for your great cooperation, David Hall(@dlwh)
|
|
|
|
|
|
|
|
|
|
| |
minor thing. Let me know if jira is required.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #5207 from hhbyyh/adjustImport and squashes the following commits:
2240121 [Yuhao Yang] remove unused import
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Should be self explanatory.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #4986 from MechCoder/spark-5987 and squashes the following commits:
7d2cd56 [MechCoder] Iterate over dataframe in a better way
e7a14cb [MechCoder] Minor
33c84f9 [MechCoder] Store as Array[Data] instead of Data[Array]
505bd57 [MechCoder] Rebased over master and used MatrixUDT
7422bb4 [MechCoder] Store sigmas as Array[Double] instead of Array[Array[Double]]
b9794e4 [MechCoder] Minor
cb77095 [MechCoder] [SPARK-5987] Save/load for GaussianMixtureModels
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
MLlib Python API parity check for Regression, major disparities need to be added for Python list following:
```scala
LinearRegressionWithSGD
setValidateData
LassoWithSGD
setIntercept
setValidateData
RidgeRegressionWithSGD
setIntercept
setValidateData
```
setFeatureScaling is mllib private function which is not needed to expose in pyspark.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #4997 from yanboliang/spark-6256 and squashes the following commits:
102f498 [Yanbo Liang] fix intercept issue & add doc test
1fb7b4f [Yanbo Liang] change 'intercept' to 'addIntercept'
de5ecbc [Yanbo Liang] MLlib Python API parity check for regression
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added a Regex based tokenizer for ml.
Currently the regex is fixed but if I could add a regex type paramater to the paramMap,
changing the tokenizer regex could be a parameter used in the crossValidation.
Also I wonder what would be the best way to add a stop word list.
Author: Augustin Borsu <augustin@sagacify.com>
Author: Augustin Borsu <a.borsu@gmail.com>
Author: Augustin Borsu <aborsu@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #4504 from aborsu985/master and squashes the following commits:
716d257 [Augustin Borsu] Merge branch 'mengxr-SPARK-5566'
cb07021 [Augustin Borsu] Merge branch 'SPARK-5566' of git://github.com/mengxr/spark into mengxr-SPARK-5566
5f09434 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
a164800 [Xiangrui Meng] remove tabs
556aa27 [Xiangrui Meng] Merge branch 'aborsu985-master' into SPARK-5566
9651aec [Xiangrui Meng] update test
f96526d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5566
2338da5 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
e88d7b8 [Xiangrui Meng] change pattern to a StringParameter; update tests
148126f [Augustin Borsu] Added return type to public functions
12dddb4 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
daf685e [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
6a85982 [Augustin Borsu] Style corrections
38b95a1 [Augustin Borsu] Added Java unit test for RegexTokenizer
b66313f [Augustin Borsu] Modified the pattern Param so it is compiled when given to the Tokenizer
e262bac [Augustin Borsu] Added unit tests in scala
cd6642e [Augustin Borsu] Changed regex to pattern
132b00b [Augustin Borsu] Changed matching to gaps and removed case folding
201a107 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
cb9c9a7 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
d3ef6d3 [Augustin Borsu] Added doc to RegexTokenizer
9082fc3 [Augustin Borsu] Removed stopwords parameters and updated doc
19f9e53 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
f6a5002 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
7f930bb [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
77ff9ca [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
2e89719 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
196cd7a [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
11ca50f [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
9f8685a [Augustin Borsu] RegexTokenizer
9e07a78 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
9547e9d [Augustin Borsu] RegEx Tokenizer
01cd26f [Augustin Borsu] RegExTokenizer
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
should initialize numFeatures
In GeneralizedLinearAlgorithm ```numFeatures``` is default to -1, we need to update it to correct value when we call run() to train a model.
```LogisticRegressionWithLBFGS.run(input)``` works well, but when we call ```LogisticRegressionWithLBFGS.run(input, initialWeights)``` to train multiclass classification model, it will throw exception due to the numFeatures is not updated.
In this PR, we just update numFeatures at the beginning of GeneralizedLinearAlgorithm.run(input, initialWeights) and add test case.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #5167 from yanboliang/spark-6496 and squashes the following commits:
8131c48 [Yanbo Liang] LogisticRegressionWithLBFGS.run(input, initialWeights) should initialize numFeatures
|
|
|
|
|
|
|
|
|
| |
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #5118 from MechCoder/spark-6308 and squashes the following commits:
6c8ffab [MechCoder] Add test for simpleString
b966242 [MechCoder] [SPARK-6308] [MLlib][Sql] VectorUDT is displayed as vecto in dtypes
|
|
|
|
|
|
|
|
| |
Author: vinodkc <vinod.kc.in@gmail.com>
Closes #5112 from vinodkc/spark_1.3_doc_fixes and squashes the following commits:
2c6aee6 [vinodkc] Spark 1.3 doc fixes
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
learning curve
Added evaluateEachIteration to allow the user to manually extract the error for each iteration of GradientBoosting. The internal optimisation can be dealt with later.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #4906 from MechCoder/spark-6025 and squashes the following commits:
67146ab [MechCoder] Minor
352001f [MechCoder] Minor
6e8aa10 [MechCoder] Made the following changes Used mapPartition instead of map Refactored computeError and unpersisted broadcast variables
bc99ac6 [MechCoder] Refactor the method and stuff
dbda033 [MechCoder] [SPARK-6025] Add helper method evaluateEachIteration to extract learning curve
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Utilities to serialize and deserialize Matrices in MLlib
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #5048 from MechCoder/spark-6309 and squashes the following commits:
05dc6f2 [MechCoder] Hashcode and organize imports
16d5d47 [MechCoder] Test some more
6e67020 [MechCoder] TST: Test using Array conversion instead of equals
7fa7a2c [MechCoder] [SPARK-6309] [SQL] [MLlib] Implement MatrixUDT
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add checkpiontInterval to ALS to prevent:
1. StackOverflow exceptions caused by long lineage,
2. large shuffle files generated during iterations,
3. slow recovery when some node fail.
srowen coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes #5076 from mengxr/SPARK-5955 and squashes the following commits:
df56791 [Xiangrui Meng] update impl to reuse code
29affcb [Xiangrui Meng] do not materialize factors in implicit
20d3f7f [Xiangrui Meng] add checkpointInterval to ALS
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
See [SPARK-6096](https://issues.apache.org/jira/browse/SPARK-6096).
Author: Xusen Yin <yinxusen@gmail.com>
Closes #5090 from yinxusen/SPARK-6096 and squashes the following commits:
bd0fea5 [Xusen Yin] fix style problem, etc.
3fd41f2 [Xusen Yin] use hanging indent in Python style
e83803d [Xusen Yin] fix Python style
d6dbde5 [Xusen Yin] fix python call java error
a054bb3 [Xusen Yin] add save load for NaiveBayes python
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR implements two functions
- `topByKey(num: Int): RDD[(K, Array[V])]` finds the top-k values for each key in a pair RDD. This can be used, e.g., in computing top recommendations.
- `takeOrderedByKey(num: Int): RDD[(K, Array[V])] ` does the opposite of `topByKey`
The `sorted` is used here as the `toArray` method of the PriorityQueue does not return a necessarily sorted array.
Author: Shuo Xiang <shuoxiangpub@gmail.com>
Closes #5075 from coderxiang/topByKey and squashes the following commits:
1611c37 [Shuo Xiang] code clean up
6f565c0 [Shuo Xiang] naming
a80e0ec [Shuo Xiang] typo and warning
82dded9 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into topByKey
d202745 [Shuo Xiang] move to MLPairRDDFunctions
901b0af [Shuo Xiang] style check
70c6e35 [Shuo Xiang] remove takeOrderedByKey, update doc and test
0895c17 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into topByKey
b10e325 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into topByKey
debccad [Shuo Xiang] topByKey
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #5056 from vanzin/SPARK-6371 and squashes the following commits:
63220df [Marcelo Vanzin] Merge branch 'master' into SPARK-6371
6506f75 [Marcelo Vanzin] Use more fine-grained exclusion.
178ba71 [Marcelo Vanzin] Oops.
75b2375 [Marcelo Vanzin] Exclude VertexRDD in MiMA.
a45a62c [Marcelo Vanzin] Work around MIMA warning.
1d8a670 [Marcelo Vanzin] Re-group jetty exclusion.
0e8e909 [Marcelo Vanzin] Ignore ml, don't ignore graphx.
cef4603 [Marcelo Vanzin] Indentation.
296cf82 [Marcelo Vanzin] [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
hashCode when equals is defined.
I want to add a checker to turn public type checking on, since future pull requests can accidentally expose a non-public type. This is the first cleanup task.
Author: Reynold Xin <rxin@databricks.com>
Closes #5102 from rxin/mllib-hashcode-publicmethodtypes and squashes the following commits:
617f19e [Reynold Xin] Fixed Scala compilation error.
52bc2d5 [Reynold Xin] [MLlib] Added explicit type for public methods and implemented hashCode when equals is defined.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GLM toString prints out intercept, numFeatures.
For LogisticRegression and SVM model, toString also prints out numClasses, threshold.
GLM toDebugString prints out the whole weights, intercept.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #5038 from yanboliang/spark-6291 and squashes the following commits:
2f578b0 [Yanbo Liang] code format
78b33f2 [Yanbo Liang] fix typos
1e8a023 [Yanbo Liang] GLM toString & toDebugString
|
|
|
|
|
|
|
|
|
|
| |
I find it's better to have getter for NumFeatures and addIntercept within GeneralizedLinearAlgorithm during actual usage, otherwise I 'll have to get the value through debug.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #5058 from hhbyyh/addGetLinear and squashes the following commits:
9dc90e8 [Yuhao Yang] add get for GeneralizedLinearAlgo
|
|
|
|
|
|
|
|
|
|
|
| |
Use `_py2java` and `_java2py` to convert Python model to/from Java model. yinxusen
Author: Xiangrui Meng <meng@databricks.com>
Closes #5049 from mengxr/SPARK-6226-mengxr and squashes the following commits:
570ba81 [Xiangrui Meng] fix python style
b10b911 [Xiangrui Meng] add save/load in PySpark's KMeansModel
|
|
|
|
|
|
|
|
|
|
|
|
| |
LBFGS uses convergence tolerance. This value should be written in document as an argument.
Author: lewuathe <lewuathe@me.com>
Closes #5033 from Lewuathe/SPARK-6336 and squashes the following commits:
e738b33 [lewuathe] Modify text to be more natural
ac03c3a [lewuathe] Modify documentations
6ccb304 [lewuathe] [SPARK-6336] LBFGS should document what convergenceTol means
|
|
|
|
|
|
|
|
|
|
|
|
| |
Note: not relevant for Python API since it only has a static train method
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4969 from jkbradley/SPARK-6252 and squashes the following commits:
a471d90 [Joseph K. Bradley] small edits from review
63eff48 [Joseph K. Bradley] Added getLambda to Scala NaiveBayes
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This continues the work in #4460 from srowen . The design doc is published on the JIRA page with some minor changes.
Short description of ML attributes: https://github.com/apache/spark/pull/4925/files?diff=unified#diff-95e7f5060429f189460b44a3f8731a35R24
More details can be found in the design doc.
srowen Could you help review this PR? There are many lines but most of them are boilerplate code.
Author: Xiangrui Meng <meng@databricks.com>
Author: Sean Owen <sowen@cloudera.com>
Closes #4925 from mengxr/SPARK-4588-new and squashes the following commits:
71d1bd0 [Xiangrui Meng] add JavaDoc for package ml.attribute
617be40 [Xiangrui Meng] remove final; rename cardinality to numValues
393ffdc [Xiangrui Meng] forgot to include Java attribute group tests
b1aceef [Xiangrui Meng] more tests
e7ab467 [Xiangrui Meng] update ML attribute impl
7c944da [Sean Owen] Add FeatureType hierarchy and categorical cardinality
2a21d6d [Sean Owen] Initial draft of FeatureAttributes class
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
jira: https://issues.apache.org/jira/browse/SPARK-6268
KMeans has many setters for parameters. It should have matching getters.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #4974 from hhbyyh/get4Kmeans and squashes the following commits:
f44d4dc [Yuhao Yang] add experimental to getRuns
f94a3d7 [Yuhao Yang] add get for KMeans
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The issue is discussed in https://issues.apache.org/jira/browse/SPARK-5669. Replacing all JBLAS usage by netlib-java gives us a simpler dependency tree and less license issues to worry about. I didn't touch the test scope in this PR. The user guide is not modified to avoid merge conflicts with branch-1.3. srowen ankurdave pwendell
Author: Xiangrui Meng <meng@databricks.com>
Closes #4699 from mengxr/SPARK-5814 and squashes the following commits:
48635c6 [Xiangrui Meng] move netlib-java version to parent pom
ca21c74 [Xiangrui Meng] remove jblas from ml-guide
5f7767a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5814
c5c4183 [Xiangrui Meng] merge master
0f20cad [Xiangrui Meng] add mima excludes
e53e9f4 [Xiangrui Meng] remove jblas from mllib runtime
ceaa14d [Xiangrui Meng] replace jblas by netlib-java in graphx
fa7c2ca [Xiangrui Meng] move jblas to test scope
|
|
|
|
|
|
|
|
|
|
|
| |
Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
Author: Sean Owen <sowen@cloudera.com>
Closes #4950 from srowen/SPARK-6225 and squashes the following commits:
3080972 [Sean Owen] Ordered imports: Java, Scala, 3rd party, Spark
c67985b [Sean Owen] Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR adds save/load for K-means as described in SPARK-5986. Python version will be added in another PR.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #4951 from yinxusen/SPARK-5986 and squashes the following commits:
6dd74a0 [Xusen Yin] rewrite some functions and classes
cd390fd [Xusen Yin] add indexed point
b144216 [Xusen Yin] remove invalid comments
dce7055 [Xusen Yin] add save/load for k-means for SPARK-5986
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A simple wrapper around the Scala implementation. `DataFrame` is used for serialization/deserialization. Methods that return `RDD`s are not supported in this PR.
davies If we recognize Scala's `Product`s in Py4J, we can easily add wrappers for Scala methods that returns `RDD[(Double, Double)]`. Is it easy to register serializer for `Product` in PySpark?
Author: Xiangrui Meng <meng@databricks.com>
Closes #4863 from mengxr/SPARK-6090 and squashes the following commits:
009a3a3 [Xiangrui Meng] provide schema
dcddab5 [Xiangrui Meng] add a basic BinaryClassificationMetrics to PySpark/MLlib
|
|
|
|
|
|
|
|
|
|
| |
Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
Author: Sean Owen <sowen@cloudera.com>
Closes #4912 from srowen/SPARK-6182.1 and squashes the following commits:
eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
LBFGS and OWLQN in Breeze 0.10 has convergence check bug.
This is fixed in 0.11, see the description in Breeze project for detail:
https://github.com/scalanlp/breeze/pull/373#issuecomment-76879760
Author: Xiangrui Meng <meng@databricks.com>
Author: DB Tsai <dbtsai@alpinenow.com>
Author: DB Tsai <dbtsai@dbtsai.com>
Closes #4879 from dbtsai/breeze and squashes the following commits:
d848f65 [DB Tsai] Merge pull request #1 from mengxr/AlpineNow-breeze
c2ca6ac [Xiangrui Meng] upgrade to breeze-0.11.1
35c2f26 [Xiangrui Meng] fix LRSuite
397a208 [DB Tsai] upgrade breeze
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Issue: When the Python DecisionTree example in the programming guide is run, it runs out of Java Heap Space when using the default memory settings for the spark shell.
This prints a warning.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4864 from jkbradley/dt-save-heap and squashes the following commits:
02e8daf [Joseph K. Bradley] fixed based on code review
7ecb1ed [Joseph K. Bradley] Added warnings about memory when calling tree and ensemble model save with too small a Java heap size
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
work when using datasource api
This PR contains the following changes:
1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it does `equalsIgnoreNullability` as well as if the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` to replace the equality check of the data types.
3. For our data source write path, when appending data, we always use the schema of existing table to write the data. This is important for parquet, since nullability direct impacts the way to encode/decode values. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings.
4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations that we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
5. Update the equality check of JSON relation. Since JSON does not really cares nullability, `equalsIgnoreNullability` seems a better choice to compare schemata from to JSON tables.
JIRA: https://issues.apache.org/jira/browse/SPARK-5950
Thanks viirya for the initial work in #4729.
cc marmbrus liancheng
Author: Yin Huai <yhuai@databricks.com>
Closes #4826 from yhuai/insertNullabilityCheck and squashes the following commits:
3b61a04 [Yin Huai] Revert change on equals.
80e487e [Yin Huai] asNullable in UDT.
587d88b [Yin Huai] Make methods private.
0cb7ea2 [Yin Huai] marmbrus's comments.
3cec464 [Yin Huai] Cheng's comments.
486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
d3747d1 [Yin Huai] Remove unnecessary change.
8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check.
0eb5578 [Yin Huai] Fix tests.
f6ed813 [Yin Huai] Update old parquet path.
e4f397c [Yin Huai] Unit tests.
b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check.
8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data.
bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data.
0a703e7 [Yin Huai] Test failed again since we cannot read correct content.
9a26611 [Yin Huai] Make InsertIntoTable happy.
8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability
4ec17fd [Yin Huai] Failed test.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #4811 from mengxr/SPARK-5991 and squashes the following commits:
f135dac [Xiangrui Meng] update save doc
57e5200 [Xiangrui Meng] address comments
06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991
282ec8d [Xiangrui Meng] support save/load in PySpark's ALS
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
LDAModel.scala
Remove unicode characters from MLlib file.
Author: Michael Griffiths <msjgriffiths@gmail.com>
Author: Griffiths, Michael (NYC-RPM) <michael.griffiths@reprisemedia.com>
Closes #4815 from msjgriffiths/SPARK-6063 and squashes the following commits:
bcd7de1 [Griffiths, Michael (NYC-RPM)] Change \u201D quote marks around 'theta' to standard single apostrophe (\x27)
38eb535 [Michael Griffiths] Merge pull request #2 from apache/master
b08e865 [Michael Griffiths] Merge pull request #1 from apache/master
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
with validation
Since the validation error does not change monotonically, in practice, it should be proper to pick the best model when training GradientBoostedTrees with validation instead of stopping it early.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #4763 from viirya/gbt_record_model and squashes the following commits:
452e049 [Liang-Chi Hsieh] Address comment.
ea2fae2 [Liang-Chi Hsieh] Pick the best model when training GradientBoostedTrees with validation.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The model trained by ALS requires partitioning information to do quick lookup of a user/item factor for making recommendation on individual requests. In the new implementation, we didn't set partitioners in the factors returned by ALS, which would cause performance regression.
srowen coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes #4748 from mengxr/SPARK-5976 and squashes the following commits:
9373a09 [Xiangrui Meng] add partitioner to factors returned by ALS
260f183 [Xiangrui Meng] add a test for partitioner
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
One can early stop if the decrease in error rate is lesser than a certain tol or if the error increases if the training data is overfit.
This introduces a new method runWithValidation which takes in a pair of RDD's , one for the training data and the other for the validation.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #4677 from MechCoder/spark-5436 and squashes the following commits:
1bb21d4 [MechCoder] Combine regression and classification tests into a single one
e4d799b [MechCoder] Addresses indentation and doc comments
b48a70f [MechCoder] COSMIT
b928a19 [MechCoder] Move validation while training section under usage tips
fad9b6e [MechCoder] Made the following changes 1. Add section to documentation 2. Return corresponding to bestValidationError 3. Allow negative tolerance.
55e5c3b [MechCoder] One liner for prevValidateError
3e74372 [MechCoder] TST: Add test for classification
77549a9 [MechCoder] [SPARK-5436] Validate GradientBoostedTrees using runWithValidation
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.
For SPARK-5892:
* Fix Python docs
* Various other cleanups
BTW, I accidentally merged this with master. If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]
CC: mengxr (ML), davies (Python docs)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:
f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example. Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the previous version, PIC stores clustering assignments as an `RDD[(Long, Int)]`. This is mapped to `RDD<Tuple2<Object, Object>>` in Java and hence Java users have to cast types manually. We should either create a new method called `javaAssignments` that returns `JavaRDD[(java.lang.Long, java.lang.Int)]` or wrap the result pair in a class. I chose the latter approach in this PR. Now assignments are stored as an `RDD[Assignment]`, where `Assignment` is a class with `id` and `cluster`.
Similarly, in FPGrowth, the frequent itemsets are stored as an `RDD[(Array[Item], Long)]`, which is mapped to `RDD<Tuple2<Object, Object>>`. Though we provide a "Java-friendly" method `javaFreqItemsets` that returns `JavaRDD[(Array[Item], java.lang.Long)]`. It doesn't really work because `Array[Item]` is mapped to `Object` in Java. So in this PR I created a class `FreqItemset` to wrap the results. It has `items` and `freq`, as well as a `javaItems` method that returns `List<Item>` in Java.
I'm not certain that the names I chose are proper: `Assignment`/`id`/`cluster` and `FreqItemset`/`items`/`freq`. Please let me know if there are better suggestions.
CC: jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #4695 from mengxr/SPARK-5900 and squashes the following commits:
865b5ca [Xiangrui Meng] make Assignment serializable
cffa96e [Xiangrui Meng] fix test
9c0e590 [Xiangrui Meng] remove unused Tuple2
1b9db3d [Xiangrui Meng] make PIC and FPGrowth Java-friendly
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Another one from JoshRosen 's wish list. The first commit is much smaller and removes 2 of the 4 Clock classes. The second is much larger, necessary for consolidating the streaming one. I put together implementations in the way that seemed simplest. Almost all the change is standardizing class and method names.
Author: Sean Owen <sowen@cloudera.com>
Closes #4514 from srowen/SPARK-4682 and squashes the following commits:
5ed3a03 [Sean Owen] Javadoc Clock classes; make ManualClock private[spark]
169dd13 [Sean Owen] Add support for legacy org.apache.spark.streaming clock class names
277785a [Sean Owen] Reduce the net change in this patch by reversing some unnecessary syntax changes along the way
b5e53df [Sean Owen] FakeClock -> ManualClock; getTime() -> getTimeMillis()
160863a [Sean Owen] Consolidate Streaming Clock class into common util Clock
7c956b2 [Sean Owen] Consolidate Clocks except for Streaming Clock
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
private to ml
For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema be public instead of private to ml. This would be nice to include in Spark 1.3
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4682 from jkbradley/SPARK-5902 and squashes the following commits:
6f02357 [Joseph K. Bradley] Made transformSchema public
0e6d0a0 [Joseph K. Bradley] made implementations of transformSchema protected as well
fdaf26a [Joseph K. Bradley] Made PipelineStage.transformSchema protected instead of private[ml]
|
|
|
|
|
|
|
|
|
|
|
|
| |
Updated PIC user guide to reflect API changes and added a simple Java example. The API is still not very Java-friendly. I created SPARK-5990 for this issue.
Author: Xiangrui Meng <meng@databricks.com>
Closes #4680 from mengxr/SPARK-5897 and squashes the following commits:
847d216 [Xiangrui Meng] apache header
87719a2 [Xiangrui Meng] remove PIC image
2dd921f [Xiangrui Meng] update PIC user guide and add a Java example
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Although we've migrated to the DataFrame API, lots of code still uses `rdd` or `srdd` as local variable names. This PR tries to address these naming inconsistencies and some other minor DataFrame related style issues.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4670)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #4670 from liancheng/df-cleanup and squashes the following commits:
3e14448 [Cheng Lian] Cleans up DataFrame variable names and toDF() calls
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For unordered features, it is sufficient to use splits since the threshold of the split corresponds the threshold of the HighSplit of the bin and there is no use of the LowSplit.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #4231 from MechCoder/spark-3381 and squashes the following commits:
58c19a5 [MechCoder] COSMIT
c274b74 [MechCoder] Remove unordered feature calculation in labeledPointToTreePoint
b2b9b89 [MechCoder] COSMIT
d3ee042 [MechCoder] [SPARK-3381] [MLlib] Eliminate bins for unordered features
|
|
|
|
|
|
|
|
|
|
|
| |
`numFeatures` is only used by multinomial logistic regression. Calling `.first()` for every GLM causes performance regression, especially in Python.
Author: Xiangrui Meng <meng@databricks.com>
Closes #4647 from mengxr/SPARK-5858 and squashes the following commits:
036dc7f [Xiangrui Meng] remove unnecessary first() call
12c5548 [Xiangrui Meng] check numFeatures only once
|
|
|
|
|
|
|
|
|
|
| |
If we need to transform the input data, we should cache the output to avoid re-computing feature vectors every iteration. dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #4593 from mengxr/SPARK-5802 and squashes the following commits:
ae3be84 [Xiangrui Meng] cache transformed data in glm
|
|
|
|
|
|
|
|
|
|
|
| |
On a big dataset explicitly unpersist train and validation folds allows to load more data into memory in the next loop iteration. On my environment (single node 8Gb worker RAM, 2 GB dataset file, 3 folds for cross validation), saved more than 5 minutes.
Author: Peter Rudenko <petro.rudenko@gmail.com>
Closes #4595 from petro-rudenko/patch-2 and squashes the following commits:
66a7cfb [Peter Rudenko] Move validationDataset cache to declaration
c5f3265 [Peter Rudenko] [Ml] SPARK-5804 Explicitly manage cache in Crossvalidator k-fold loop
|
|
|
|
|
|
|
|
|
|
| |
If it's a last estimator in Pipeline there's no need to transform data, since there's no next stage that would consume this data.
Author: Peter Rudenko <petro.rudenko@gmail.com>
Closes #4590 from petro-rudenko/patch-1 and squashes the following commits:
d13ec33 [Peter Rudenko] [Ml] SPARK-5796 Don't transform data on a last estimator in Pipeline
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- The old implicit would convert RDDs directly to DataFrames, and that added too many methods.
- toDataFrame -> toDF
- Dsl -> functions
- implicits moved into SQLContext.implicits
- addColumn -> withColumn
- renameColumn -> withColumnRenamed
Python changes:
- toDataFrame -> toDF
- Dsl -> functions package
- addColumn -> withColumn
- renameColumn -> withColumnRenamed
- add toDF functions to RDD on SQLContext init
- add flatMap to DataFrame
Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>
Closes #4556 from rxin/SPARK-5752 and squashes the following commits:
5ef9910 [Reynold Xin] More fix
61d3fca [Reynold Xin] Merge branch 'df5' of github.com:davies/spark into SPARK-5752
ff5832c [Reynold Xin] Fix python
749c675 [Reynold Xin] count(*) fixes.
5806df0 [Reynold Xin] Fix build break again.
d941f3d [Reynold Xin] Fixed explode compilation break.
fe1267a [Davies Liu] flatMap
c4afb8e [Reynold Xin] style
d9de47f [Davies Liu] add comment
b783994 [Davies Liu] add comment for toDF
e2154e5 [Davies Liu] schema() -> schema
3a1004f [Davies Liu] Dsl -> functions, toDF()
fb256af [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
0dd74eb [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
97dd47c [Davies Liu] fix mistake
6168f74 [Davies Liu] fix test
1fc0199 [Davies Liu] fix test
a075cd5 [Davies Liu] clean up, toPandas
663d314 [Davies Liu] add test for agg('*')
9e214d5 [Reynold Xin] count(*) fixes.
1ed7136 [Reynold Xin] Fix build break again.
921b2e3 [Reynold Xin] Fixed explode compilation break.
14698d4 [Davies Liu] flatMap
ba3e12d [Reynold Xin] style
d08c92d [Davies Liu] add comment
5c8b524 [Davies Liu] add comment for toDF
a4e5e66 [Davies Liu] schema() -> schema
d377fc9 [Davies Liu] Dsl -> functions, toDF()
6b3086c [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
807e8b1 [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR adds three groups to the ScalaDoc: `param`, `setParam`, and `getParam`. Params will show up in the generated Scala API doc as the top group. Setters/getters will be at the bottom.
Preview:
![screen shot 2015-02-13 at 2 47 49 pm](https://cloud.githubusercontent.com/assets/829644/6196657/5740c240-b38f-11e4-94bb-bd8ef5a796c5.png)
Author: Xiangrui Meng <meng@databricks.com>
Closes #4600 from mengxr/SPARK-5730 and squashes the following commits:
febed9a [Xiangrui Meng] add doc groups to spark.ml components
|