| Commit message (Collapse) | Author | Age | Files | Lines |
| |
## What changes were proposed in this pull request?
1. Create a libsvm-type dataset for LDA: `data/mllib/sample_lda_libsvm_data.txt`
2. Add a Python example
3. Read the data file directly in the examples
4. BTW, change to `SparkSession` in `aft_survival_regression.py`
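For readers unfamiliar with the libsvm text format mentioned in item 1, here is a minimal sketch of how one line can be parsed. It assumes the usual `label index:value ...` layout; the function name `parse_libsvm_line` and the sample row are hypothetical and are not taken from the data file or from Spark.

```python
def parse_libsvm_line(line):
    """Parse one libsvm-format line into (label, {index: value})."""
    parts = line.split()
    # The first token is the label; the rest are sparse index:value pairs.
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


# Hypothetical row: label 0 with non-zero values at indices 1, 3 and 7.
label, features = parse_libsvm_line("0 1:1 3:2 7:4")
print(label, features)
```

In the Spark examples themselves the file would be read through Spark's libsvm data source rather than parsed by hand; the sketch only illustrates the file layout.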
## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/lda_example.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12927 from zhengruifeng/lda_pe.
| |
## What changes were proposed in this pull request?
A Python example for ml.kmeans already exists, but it is not included in the user guide.
1. Small changes like: `example_on` `example_off`
2. Add it to the user guide
3. Update the examples to read the data file directly
## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/kmeans_example.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12925 from zhengruifeng/km_pe.
| |
ml.BisectingKMeans
## What changes were proposed in this pull request?
1. Add BisectingKMeans to ml-clustering.md
2. Add the missing Scala BisectingKMeansExample
3. Create a new data file `data/mllib/sample_kmeans_data.txt`
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #11844 from zhengruifeng/doc_bkm.
| |
## What changes were proposed in this pull request?
1. Add a Python example for OneVsRest
2. Remove args-parsing
## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/one_vs_rest_example.py`
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12920 from zhengruifeng/ovr_pe.
| |
This PR removes `sqlContext` from the examples. Actual usages were all replaced in https://github.com/apache/spark/pull/12809, but some remain in comments.
Manual style checking.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #13006 from HyukjinKwon/minor-docs.
| |
Cleans up ALS examples by removing unnecessary casts to double for `rating` and `prediction` columns, since `RegressionEvaluator` now supports `Double` & `Float` input types.
## How was this patch tested?
Manual compile and run with `run-example ml.ALSExample` and `spark-submit examples/src/main/python/ml/als_example.py`.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #12892 from MLnick/als-examples-cleanup.
| |
## What changes were proposed in this pull request?
Add the missing Python example for QuantileDiscretizer.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12281 from zhengruifeng/discret_pe.
| |
binary_classification_metrics_example.py
## What changes were proposed in this pull request?
This issue addresses the comments in SPARK-15031 and also fixes java-linter errors.
- Use multiline format in SparkSession builder patterns.
- Update `binary_classification_metrics_example.py` to use `SparkSession`.
- Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)
## How was this patch tested?
Passed the Jenkins tests and ran `dev/lint-java` manually.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12911 from dongjoon-hyun/SPARK-15134.
| |
## What changes were proposed in this pull request?
This PR aims to update the Scala/Python/Java examples by replacing `SQLContext` with the newly added `SparkSession`.
- Use **SparkSession Builder Pattern** in 154 (Scala 55, Java 52, Python 47) files.
- Add `getConf` in Python SparkContext class: `python/pyspark/context.py`
- Replace **SQLContext Singleton Pattern** with **SparkSession Singleton Pattern**:
- `SqlNetworkWordCount.scala`
- `JavaSqlNetworkWordCount.java`
- `sql_network_wordcount.py`
Now, `SQLContexts` are used only in the R examples and the following two Python examples. These Python examples are untouched in this PR since they already fail due to some unknown issue.
- `simple_params_example.py`
- `aft_survival_regression.py`
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12809 from dongjoon-hyun/SPARK-15031.
| |
## What changes were proposed in this pull request?
Add Python 3 compatibility to the Python examples.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12868 from zhengruifeng/fix_gmm_py.
| |
PySpark.
## What changes were proposed in this pull request?
This is a Python port of the corresponding Scala builder pattern code. `sql.py` is modified as a target example case.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12860 from dongjoon-hyun/SPARK-15084.
| |
## What changes were proposed in this pull request?
Add the missing python example for VectorSlicer
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12282 from zhengruifeng/vecslicer_pe.
| |
First, make all dependencies in the examples module provided, and explicitly
list a couple of ones that somehow are promoted to compile by maven. This
means that to run streaming examples, the streaming connector package needs
to be provided to run-examples using --packages or --jars, just like regular
apps.
Also, remove a couple of outdated examples. HBase has had Spark bindings for
a while and is even including them in the HBase distribution in the next
version, making the examples obsolete. The same applies to Cassandra, which
seems to have a proper Spark binding library already.
I just tested the build, which passes, and ran SparkPi. The examples jars
directory now has only two jars:
```
$ ls -1 examples/target/scala-2.11/jars/
scopt_2.11-3.3.0.jar
spark-examples_2.11-2.0.0-SNAPSHOT.jar
```
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #12544 from vanzin/SPARK-14744.
| |
## What changes were proposed in this pull request?
Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
## How was this patch tested?
unit tests and doc generation
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #12454 from hhbyyh/tfdoc.
| |
## What changes were proposed in this pull request?
Add the missing python example for ChiSqSelector
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12283 from zhengruifeng/chi2_pe.
| |
jira: https://issues.apache.org/jira/browse/SPARK-13089
Add section in ml-classification.md for NaiveBayes DataFrame-based API, plus example code (using include_example to clip code from examples/ folder files).
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #11015 from hhbyyh/naiveBayesDoc.
| |
## What changes were proposed in this pull request?
Add python CountVectorizerExample
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #11917 from zhengruifeng/cv_pe.
| |
## What changes were proposed in this pull request?
add three python examples
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12063 from zhengruifeng/dct_pe.
| |
streaming-mqtt and streaming-twitter
## What changes were proposed in this pull request?
This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects, since I have already copied them to https://github.com/spark-packages
Also removes mqtt_wordcount.py, which I forgot to remove previously.
## How was this patch tested?
Jenkins PR Build.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #11824 from zsxwing/remove-doc.
| |
using include_example
Replace example code in mllib-feature-extraction.md using include_example
https://issues.apache.org/jira/browse/SPARK-13017
The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.
Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/TFIDFExample.scala` and pick code blocks marked "example" and replace code block in
`{% highlight %}`
in the markdown.
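The selection step described above (picking only the code blocks marked "example") can be sketched as follows. This is an illustrative Python sketch of the idea, not the Jekyll plugin itself. The marker strings `$example on$` and `$example off$`, the function name `extract_example`, and the sample source are all assumptions made for the sketch.

```python
# Assumed marker strings; the real plugin's markers may differ.
EXAMPLE_ON = "$example on$"
EXAMPLE_OFF = "$example off$"


def extract_example(source):
    """Return only the lines that sit between on/off marker comments."""
    selected = []
    inside = False
    for line in source.splitlines():
        if EXAMPLE_ON in line:
            inside = True       # start copying after this marker line
        elif EXAMPLE_OFF in line:
            inside = False      # stop copying at this marker line
        elif inside:
            selected.append(line)
    return "\n".join(selected)


# Hypothetical example file: setup code outside the markers is dropped.
sample = """\
import sys
# $example on$
data = [1, 2, 3]
total = sum(data)
# $example off$
print(total)
"""

print(extract_example(sample))
```

Only the two lines between the markers survive, which is what lets an example file carry setup and teardown code that the user guide does not show.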
See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
Author: Xin Ren <iamshrek@126.com>
Closes #11142 from keypointt/SPARK-13017.
| |
mllib-statistics.md using include_example
## What changes were proposed in this pull request?
This PR for ticket SPARK-13019 is based on a previous PR (https://github.com/apache/spark/pull/11108).
Since that PR (https://github.com/apache/spark/pull/11108) broke the scala-2.10 build, more work was needed to fix the build errors.
What is new in this PR is the keyword argument for `fractions`:
` val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)`
` val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)`
I reopened the ticket on JIRA, but I don't know how to reopen a GitHub pull request, so I am submitting a new pull request.
## How was this patch tested?
Manual build testing on local machine, build based on scala-2.10.
Author: Xin Ren <iamshrek@126.com>
Closes #11901 from keypointt/SPARK-13019.
| |
using include_example"
This reverts commit 1af8de200c4d3357bcb09e7bbc6deece00e885f2.
| |
include_example
https://issues.apache.org/jira/browse/SPARK-13019
The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.
Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala` and pick code blocks marked "example" and replace code block in
`{% highlight %}`
in the markdown.
See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
Author: Xin Ren <iamshrek@126.com>
Closes #11108 from keypointt/SPARK-13019.
| |
JIRA: https://issues.apache.org/jira/browse/SPARK-13814
## What changes were proposed in this pull request?
Delete unnecessary imports in Python example files.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #11651 from zhengruifeng/del_import_pe.
| |
JIRA: https://issues.apache.org/jira/browse/SPARK-13672
## What changes were proposed in this pull request?
add two python examples of BisectingKMeans for ml and mllib
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #11515 from zhengruifeng/mllib_bkm_pe.
| |
## What changes were proposed in this pull request?
This pull request adds a python example for train validation split.
## How was this patch tested?
This was style tested through lint-python, generally tested with ./dev/run-tests, and run in notebook and shell environments. It was viewed in docs locally with jekyll serve.
This contribution is my original work and I license it to Spark under its open source license.
Author: JeremyNixon <jnixon2@gmail.com>
Closes #11547 from JeremyNixon/tvs_example.
| |
## What changes were proposed in this pull request?
This PR fixes typos in code comments and test case names.
## How was this patch tested?
manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
| |
include_example
Replace example code in mllib-clustering.md using include_example
https://issues.apache.org/jira/browse/SPARK-13013
The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.
Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala` and pick code blocks marked "example" and replace code block in
`{% highlight %}`
in the markdown.
See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
Author: Xin Ren <iamshrek@126.com>
Closes #11116 from keypointt/SPARK-13013.
| |
include_example
## What changes were proposed in this pull request?
This PR replaces the example code in `mllib-linear-methods.md` using `include_example`
by doing the following:
* Extracts the example code (Scala, Java, Python) as files in the `example` module.
* Merges some dialog-style examples into a single file.
* Hides redundant code in HTML for consistency with other docs.
## How was this patch tested?
Manual test.
This PR can be tested by generating the documentation: `SKIP_API=1 jekyll build`.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #11320 from dongjoon-hyun/SPARK-11381.
| |
This pull request uses {%include_example%} to add an example for the Python cross validator to ml-guide.
Author: JeremyNixon <jnixon2@gmail.com>
Closes #11240 from JeremyNixon/pipeline_include_example.
| |
after loading it
Refine naive Bayes example by checking model after loading it
Author: movelikeriver <mars.lenjoy@gmail.com>
Closes #11125 from movelikeriver/naive_bayes.
| |
include_example
Replaced example code in ml-guide.md using include_example
Author: Devaraj K <devaraj@apache.org>
Closes #11053 from devaraj-kavali/SPARK-13012.
| |
filtering in general
This documents the implementation of ALS in `spark.ml` with example code in Scala, Java and Python.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10411 from BenFradet/SPARK-12247.
| |
Without importing `print_function`, later lines like ```print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)``` fail when using python2.*. The import fixes that problem and doesn't break anything on python3 either.
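A minimal sketch of the fix follows. The helper name `print_usage` and its `stream` parameter are hypothetical, added only so the behaviour can be checked without writing to the real stderr. Note that a `from __future__` import has to be the first statement in the file.

```python
from __future__ import print_function

import sys


def print_usage(stream=sys.stderr):
    # With print_function imported, print is a function on Python 2 as well,
    # so the file= keyword argument is accepted there.
    print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=stream)


if __name__ == "__main__":
    print_usage()
```

On Python 3 the import is a no-op, which is why it is safe to keep in examples that must run on both versions.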
Author: Mark Grover <mark@apache.org>
Closes #10872 from markgrover/python2_compat.
| |
single instance predict/predictSoft
PySpark MLlib ```GaussianMixtureModel``` should support single instance ```predict/predictSoft``` just like Scala does.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10552 from yanboliang/spark-12603.
| |
According to the documentation, the sortByKey method does not take a lambda as an argument, so the example is flawed. Removed the argument completely, as this defaults to an ascending sort.
Author: Udo Klein <git@blinkenlight.net>
Closes #10640 from udoklein/patch-1.
| |
Author: Udo Klein <git@blinkenlight.net>
Closes #10642 from udoklein/patch-2.
| |
Streaming
This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10385 from zsxwing/accumulator-broadcast-example.
| |
sentiment values
Example of joining a static RDD of word sentiments to a streaming RDD of Tweets in order to demo the usage of the transform() method.
Author: Jeff L <sha0lin@alumni.carnegiemellon.edu>
Closes #8431 from Agent007/SPARK-9057.
| |
https://issues.apache.org/jira/browse/SPARK-12199
Follow-up PR of SPARK-11551. Fix some errors in ml-features.md
mengxr
Author: Xusen Yin <yinxusen@gmail.com>
Closes #10193 from yinxusen/SPARK-12199.
| |
dataframe_example.py
Since ```Dataset``` has a new meaning in Spark 1.6, we should rename it to avoid confusion.
#9873 finished the work on the Scala example; here we focus on the Python one.
Move dataset_example.py to ```examples/ml``` and rename it to ```dataframe_example.py```.
BTW, fix minor issues missed in #9873.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9957 from yanboliang/SPARK-11978.
| |
Adds the ability to define an initial state RDD for use with updateStateByKey in PySpark. Added a unit test and changed the stateful_network_wordcount example to use an initial RDD.
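To illustrate what an initial state means for a running count, here is a plain-Python simulation that does not use Spark. The names `update_func` and `run_batches`, and the sample data, are hypothetical. The sketch assumes an update function that receives the new values for a key plus the previous state (or `None` for an unseen key).

```python
def update_func(new_values, last_sum):
    # New values seen for a key in this batch, added to the previous state.
    return sum(new_values) + (last_sum or 0)


def run_batches(batches, initial_state):
    """Apply update_func batch by batch, starting from a seeded state."""
    state = dict(initial_state)
    for batch in batches:
        # Group this batch's values by key, as a keyed stream would.
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        for key, values in grouped.items():
            state[key] = update_func(values, state.get(key))
    return state


# Hypothetical initial state and two micro-batches of (word, 1) pairs.
initial = [("hello", 1), ("world", 1)]
batches = [[("hello", 1), ("spark", 1)], [("hello", 1)]]
print(run_batches(batches, initial))
```

Keys present only in the initial state keep their seeded value, while keys that also appear in later batches continue counting from it. The simulation skips keys with no new values in a batch, which leaves their sum unchanged for this particular update function.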
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes #10082 from BryanCutler/initial-rdd-updateStateByKey-SPARK-11713.
| |
PR on behalf of somideshmukh, thanks!
Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>
Closes #10219 from yinxusen/SPARK-11551.
| |
Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10166 from BenFradet/SPARK-12159.
| |
This reverts PR #10002, commit 78209b0ccaf3f22b5e2345dfb2b98edfdb746819.
The original PR wasn't tested on Jenkins before being merged.
Author: Cheng Lian <lian@databricks.com>
Closes #10200 from liancheng/revert-pr-10002.
| |
Add a ```SQLTransformer``` user guide and example code, and make the Scala API doc clearer.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10006 from yanboliang/spark-11958.
| |
include_example
Made a new patch containing only the markdown examples moved to the example folder.
Only three Java examples were not moved since they contained compilation errors. These classes are:
1) StandardScale 2) NormalizerExample 3) VectorIndexer
Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>
Closes #10002 from somideshmukh/SomilBranch1.33.
| |
Remove duplicate mllib examples (DT/RF/GBT in Java/Python).
Since we have tutorial code for DT/RF/GBT classification/regression in Scala/Java/Python and example applications for DT/RF/GBT in Scala, we mark these as duplicates and remove them.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9954 from yanboliang/SPARK-11975.
| |
Remove duplicate ml examples (only for ml). mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9933 from yanboliang/SPARK-11685.
| |
examples and user guide doc
ML ```LinearRegression``` uses ```data/mllib/sample_libsvm_data.txt``` as the dataset in the examples and user guide doc, but it is actually a classification dataset rather than a regression dataset. We should use ```data/mllib/sample_linear_regression_data.txt``` instead.
The deeper cause is that ```LinearRegression``` with the "normal" solver cannot solve this dataset correctly, possibly due to ill-conditioning and unreasonable labels. This issue has been reported at [SPARK-11918](https://issues.apache.org/jira/browse/SPARK-11918).
It will confuse users if they run the example code and get an exception, so we should make this change, which clearly illustrates the usage of the ```LinearRegression``` algorithm.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9905 from yanboliang/spark-11920.
|