| Commit message | Author | Age | Files | Lines |
| |
I implemented a simple PCA wrapper that makes it easy to transform vectors with PCA, for example the features of a LabeledPoint or of another, more complicated structure.
Example of usage:
```scala
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Parse "label,f1 f2 ..." lines into LabeledPoints (sc is the spark-shell SparkContext).
val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Fit PCA with half as many components as there are input features.
val pca = new PCA(training.first().features.size / 2).fit(data.map(_.features))

// Replace the features of every point with their PCA projection.
val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))

// Train one model on the raw features and one on the projected features.
val numIterations = 100
val model = LinearRegressionWithSGD.train(training, numIterations)
val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)

val valuesAndPreds = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
val valuesAndPreds_pca = test_pca.map { point =>
  val score = model_pca.predict(point.features)
  (score, point.label)
}

// Compare the mean squared error of both models on the held-out split.
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
val MSE_pca = valuesAndPreds_pca.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("Mean Squared Error = " + MSE)
println("PCA Mean Squared Error = " + MSE_pca)
```
Author: Kirill A. Korinskiy <catap@catap.ru>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4304 from catap/pca and squashes the following commits:
501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit(). In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors
(cherry picked from commit 8c07c75c9831d6c34f69fe840edb6470d4dfdfef)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
| |
See https://issues.apache.org/jira/browse/SPARK-5726
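The transformer named in this change multiplies each input vector by a fixed scaling vector, component by component (the Hadamard product). A minimal sketch of that operation in plain Scala, without Spark; the object name `ElementwiseProductSketch` and its `transform` signature are hypothetical and are not the MLlib API:

```scala
// Hypothetical stand-in for illustration only; not the MLlib ElementwiseProduct class.
object ElementwiseProductSketch {
  /** Multiplies v by scalingVec component by component (Hadamard product). */
  def transform(scalingVec: Array[Double], v: Array[Double]): Array[Double] = {
    require(scalingVec.length == v.length, "vectors must have the same length")
    v.zip(scalingVec).map { case (x, w) => x * w }
  }
}
```

For example, `ElementwiseProductSketch.transform(Array(0.0, 1.0, 2.0), Array(1.0, 2.0, 3.0))` zeroes the first component and scales the remaining two by their weights.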
Author: Octavian Geagla <ogeagla@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4580 from ogeagla/spark-mllib-weighting and squashes the following commits:
fac12ad [Octavian Geagla] [SPARK-5726] [MLLIB] Use new createTransformFunc.
90f7e39 [Joseph K. Bradley] small cleanups
4595165 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove erroneous test case.
ded3ac6 [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
37d4705 [Octavian Geagla] [SPARK-5726] [MLLIB] Incorporated feedback.
1dffeee [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
e436896 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove 'TF' from 'ElementwiseProductTF'
cb520e6 [Octavian Geagla] [SPARK-5726] [MLLIB] Rename HadamardProduct to ElementwiseProduct
4922722 [Octavian Geagla] [SPARK-5726] [MLLIB] Hadamard Vector Product Transformer
(cherry picked from commit 658a478d3f86456df09d0fbb1ba438fb36d8725c)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
| |
Fixes:
* typo in Scala example
* Removed comment "usually applied on sparse data" since that is debatable
* small edits to text for clarity
CC: avulanov I noticed a typo post-hoc and ended up making a few small edits. Do the changes look OK?
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4732 from jkbradley/chisqselector-docs and squashes the following commits:
9656a3b [Joseph K. Bradley] added Java example for ChiSqSelector to guide
3f3f9f4 [Joseph K. Bradley] small fixes to ChiSqSelector docs
| |
Added a description of ChiSqSelector and a few words about feature selection in general. I could add a code example; however, it would not look reasonable in the absence of a feature discretizer or a dataset in the `data` folder that has redundant features.
Author: Alexander Ulanov <nashb@yandex.ru>
Closes #4709 from avulanov/SPARK-5912 and squashes the following commits:
19a8a4e [Alexander Ulanov] Addressing reviewers comments @jkbradley
58d9e4d [Alexander Ulanov] Addressing reviewers comments @jkbradley
eb6b9fe [Alexander Ulanov] Typo
2921a1d [Alexander Ulanov] ChiSqSelector example of use
c845350 [Alexander Ulanov] ChiSqSelector docs
| |
This seems complete. The duplication of tests for provided means/variances might be overkill; I would appreciate some feedback.
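To illustrate what a scaler model built from provided statistics does, here is a minimal sketch in plain Scala, without Spark. The class name `ScalerModelSketch`, its argument order, and the rule that a zero standard deviation maps the component to 0.0 are assumptions of this sketch, not the MLlib API:

```scala
// Hypothetical stand-in for illustration only; not the MLlib StandardScalerModel class.
final case class ScalerModelSketch(
    std: Array[Double],
    mean: Array[Double],
    withStd: Boolean,
    withMean: Boolean) {
  require(std.length == mean.length, "std and mean must have the same length")

  /** Optionally centers each component, then optionally divides it by its standard deviation. */
  def transform(v: Array[Double]): Array[Double] = {
    require(v.length == mean.length, "input vector has the wrong length")
    v.indices.map { i =>
      val centered = if (withMean) v(i) - mean(i) else v(i)
      if (!withStd) centered
      else if (std(i) != 0.0) centered / std(i)
      else 0.0 // assumption: constant features are mapped to zero
    }.toArray
  }
}
```

A call such as `ScalerModelSketch(Array(2.0, 0.0), Array(1.0, 5.0), withStd = true, withMean = true).transform(Array(5.0, 7.0))` centers and scales the first component and zeroes the second, whose standard deviation is 0.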
Author: Octavian Geagla <ogeagla@gmail.com>
Closes #4140 from ogeagla/SPARK-5207 and squashes the following commits:
fa64dfa [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel to take stddev instead of variance
9078fe0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] Incorporate code review feedback: change arg ordering, add dev api annotations, do better null checking, add another test and some doc for this.
997d2e0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] make withMean and withStd public, add constructor which uses defaults, un-refactor test class
64408a4 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel contructor to not be private to mllib, added tests for newly-exposed functionality
| |
This pull request refers to issue: https://issues.apache.org/jira/browse/SPARK-3838
Python example for word2vec
mengxr
Author: Anant <anant.asty@gmail.com>
Closes #2952 from anantasty/SPARK-3838 and squashes the following commits:
87bd723 [Anant] remove stop line
4bd439e [Anant] Changes as per code review. Fized error in word2vec python example, simplified example in docs.
3d3c9ee [Anant] Added empty line after python imports
0c90c31 [Anant] Fixed erroneous code. I was still treating each line to be a single word instead of 16 words
ee4f5f6 [Anant] Fixes from code review comments
c637bcf [Anant] Added word2vec python example to docs
269f31f [Anant] added example in docs
c015b14 [Anant] Added python example for word2vec
| |
Added a complete Python API for MLlib.feature:
* Normalizer
* StandardScalerModel
* StandardScaler
* HashingTF
* IDFModel
* IDF
cc mengxr
Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>
Closes #2819 from davies/feature and squashes the following commits:
4f48f48 [Davies Liu] add a note for HashingTF
67f6d21 [Davies Liu] address comments
b628693 [Davies Liu] rollback changes in Word2Vec
efb4f4f [Davies Liu] Merge branch 'master' into feature
806c7c2 [Davies Liu] address comments
3abb8c2 [Davies Liu] address comments
59781b9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into feature
a405ae7 [Davies Liu] fix tests
7a1891a [Davies Liu] fix tests
486795f [Davies Liu] update programming guide, HashTF -> HashingTF
8a50584 [Davies Liu] Python API for mllib.feature
| |
There are inconsistent spellings, 'MLlib' and 'MLLib', in some documents and source code.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #2903 from sarutak/SPARK-4055 and squashes the following commits:
b031640 [Kousuke Saruta] Fixed inconsistent spelling "MLlib and MLLib"
| |
This PR for [SPARK-3614](https://issues.apache.org/jira/browse/SPARK-3614) adds functionality for filtering out terms which do not appear in at least a minimum number of documents.
This is implemented using a minimumOccurence parameter (default 0). When terms' document frequencies are less than minimumOccurence, their IDFs are set to 0, just like when the DF is 0. As a result, the TF-IDFs for the terms are found to be 0, as if the terms were not present in the documents.
This PR makes the following changes:
* Add a minimumOccurence parameter to the IDF and DocumentFrequencyAggregator classes.
* Create a parameter-less constructor for IDF with a default minimumOccurence value of 0 to remain backwards-compatible with the original IDF API.
* Set the IDFs to 0 for terms whose DFs are less than minimumOccurence.
* Add tests to the Spark IDFSuite and Java JavaTfIdfSuite test suites.
* Update the MLlib Feature Extraction programming guide to describe the new feature.
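The filtering rule described above can be sketched in plain Scala, without Spark. The object `IdfSketch` and its method names are hypothetical, and the smoothed formula log((m + 1) / (df + 1)) is an assumption of this sketch; the description above only specifies that terms below the threshold get an IDF of 0:

```scala
// Hypothetical stand-in for illustration only; not the MLlib IDF class.
object IdfSketch {
  /** Counts, for every term index, the documents in which the term occurs. */
  def documentFrequencies(docs: Seq[Array[Double]]): Array[Long] = {
    require(docs.nonEmpty, "need at least one document")
    val numTerms = docs.head.length
    val df = new Array[Long](numTerms)
    for (doc <- docs; j <- 0 until numTerms if doc(j) > 0.0) df(j) += 1L
    df
  }

  /** IDF per term; terms seen in fewer than minimumOccurence documents get 0. */
  def idf(docs: Seq[Array[Double]], minimumOccurence: Int = 0): Array[Double] = {
    val m = docs.size.toDouble
    documentFrequencies(docs).map { df =>
      // Assumed smoothing: log((m + 1) / (df + 1)).
      if (df >= minimumOccurence) math.log((m + 1.0) / (df + 1.0)) else 0.0
    }
  }

  /** TF-IDF is the component-wise product, so a zero IDF removes the term. */
  def tfIdf(tf: Array[Double], idfWeights: Array[Double]): Array[Double] =
    tf.zip(idfWeights).map { case (t, w) => t * w }
}
```

With the default threshold of 0 no term is filtered, which matches the backwards-compatible constructor. With a threshold of 2, a term that occurs in only one document gets an IDF of 0, so its TF-IDF is 0 in every document, as if the term were absent.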
Author: RJ Nowling <rnowling@gmail.com>
Closes #2494 from rnowling/spark-3614-idf-filter and squashes the following commits:
0aa3c63 [RJ Nowling] Fix identation
e6523a8 [RJ Nowling] Remove unnecessary toDouble's from IDFSuite
bfa82ec [RJ Nowling] Add space after if
30d20b3 [RJ Nowling] Add spaces around equals signs
9013447 [RJ Nowling] Add space before division operator
79978fc [RJ Nowling] Remove unnecessary semi-colon
40fd70c [RJ Nowling] Change minimumOccurence to minDocFreq in code and docs
47850ab [RJ Nowling] Changed minimumOccurence to Int from Long
9fb4093 [RJ Nowling] Remove unnecessary lines from IDF class docs
1fc09d8 [RJ Nowling] Add backwards-compatible constructor to DocumentFrequencyAggregator
1801fd2 [RJ Nowling] Fix style errors in IDF.scala
6897252 [RJ Nowling] Preface minimumOccurence members with val to make them final and immutable
a200bab [RJ Nowling] Remove unnecessary else statement
4b974f5 [RJ Nowling] Remove accidentally-added import from testing
c0cc643 [RJ Nowling] Add minimumOccurence filtering to IDF
| |
Extraction guide
Author: RJ Nowling <rnowling@gmail.com>
Closes #2459 from rnowling/tfidf-fix and squashes the following commits:
b370a91 [RJ Nowling] Fix variable name misspelling in MLLib Feature Extraction guide
| |
Documentation for newly added feature transformations:
1. TF-IDF
2. StandardScaler
3. Normalizer
Author: DB Tsai <dbtsai@alpinenow.com>
Closes #2068 from dbtsai/transformer-documentation and squashes the following commits:
109f324 [DB Tsai] address feedback
| |
Moved TF-IDF before Word2Vec because the former is more basic. I also added a link for Word2Vec. atalwalkar
Author: Xiangrui Meng <meng@databricks.com>
Closes #2061 from mengxr/tfidf-doc and squashes the following commits:
ca04c70 [Xiangrui Meng] address comments
a5ea4b4 [Xiangrui Meng] add tf-idf user guide
| |
mengxr
Documentation for Word2Vec
Author: Liquan Pei <liquanpei@gmail.com>
Closes #2003 from Ishiihara/Word2Vec-doc and squashes the following commits:
4ff11d4 [Liquan Pei] minor fix
8d7458f [Liquan Pei] code reformat
6df0dcb [Liquan Pei] add Word2Vec documentation
|
As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.
Author: Ameet Talwalkar <atalwalkar@gmail.com>
Closes #1908 from atalwalkar/master and squashes the following commits:
fe6938a [Ameet Talwalkar] made xiangruis suggested changes
840028b [Ameet Talwalkar] made xiangruis suggested changes
7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation
|