| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Otherwise, extra params get ignored in `PipelineModel.transform`. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #6622 from mengxr/SPARK-8087 and squashes the following commits:
0e4c8c4 [Xiangrui Meng] fix merge issues
26fc1f0 [Xiangrui Meng] address comments
e607a04 [Xiangrui Meng] merge master
b85b57e [Xiangrui Meng] fix examples/compile
d6f7891 [Xiangrui Meng] rename defaultCopyWithParams to defaultCopy
84ec278 [Xiangrui Meng] remove setter checks due to generics
2cf2ed0 [Xiangrui Meng] snapshot
291814f [Xiangrui Meng] OneVsRest.copy
1dfe3bd [Xiangrui Meng] PipelineModel.copy should copy stages
(cherry picked from commit 43c7ec6384e51105dedf3a53354b6a3732cc27b2)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Ram Sriharsha <rsriharsha@hw11853.local>
Closes #6443 from harsha2010/SPARK-6013 and squashes the following commits:
732506e [Ram Sriharsha] Code Review Feedback
121c211 [Ram Sriharsha] python style fix
5f9b8c3 [Ram Sriharsha] python style fixes
925ca86 [Ram Sriharsha] Simple Params Example
8b372b1 [Ram Sriharsha] GBT Example
965ec14 [Ram Sriharsha] Random Forest Example
(cherry picked from commit dbf8ff38de0f95f467b874a5b527dcf59439efe8)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
to src
Since `spark-streaming-kafka` now is published for both Scala 2.10 and 2.11, we can move `KafkaWordCount` and `DirectKafkaWordCount` from `examples/scala-2.10/src/` to `examples/src/` so that they will appear in `spark-examples-***-jar` for Scala 2.11.
Author: zsxwing <zsxwing@gmail.com>
Closes #6436 from zsxwing/SPARK-7895 and squashes the following commits:
c6052f1 [zsxwing] Update examples/pom.xml
0bcfa87 [zsxwing] Fix the sleep time
b9d1256 [zsxwing] Move Kafka examples from scala-2.10/src to src
(cherry picked from commit 000df2f0d6af068bb188e81bbb207f0c2f43bf16)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Also moved all the deprecated functions into one place for SQLContext and DataFrame, and updated tests to use the new API.
Author: Reynold Xin <rxin@databricks.com>
Closes #6210 from rxin/df-writer-reader-jdbc and squashes the following commits:
7465c2c [Reynold Xin] Fixed unit test.
118e609 [Reynold Xin] Updated tests.
3441b57 [Reynold Xin] Updated javadoc.
13cdd1c [Reynold Xin] [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface.
(cherry picked from commit 517eb37a85e0a28820bcfd5d98c50d02df6521c6)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch introduces DataFrameWriter and DataFrameReader.
DataFrameReader interface, accessible through SQLContext.read, contains methods that create DataFrames. These methods used to reside in SQLContext. Example usage:
```scala
sqlContext.read.json("...")
sqlContext.read.parquet("...")
```
DataFrameWriter interface, accessible through DataFrame.write, implements a builder pattern to avoid the proliferation of options in writing DataFrame out. It currently implements:
- mode
- format (e.g. "parquet", "json")
- options (generic options passed down into data sources)
- partitionBy (partitioning columns)
Example usage:
```scala
df.write.mode("append").format("json").partitionBy("date").saveAsTable("myJsonTable")
```
TODO:
- [ ] Documentation update
- [ ] Move JDBC into reader / writer?
- [ ] Deprecate the old interfaces
- [ ] Move the generic load interface into reader.
- [ ] Update example code and documentation
Author: Reynold Xin <rxin@databricks.com>
Closes #6175 from rxin/reader-writer and squashes the following commits:
b146c95 [Reynold Xin] Deprecation of old APIs.
bd8abdf [Reynold Xin] Fixed merge conflict.
26abea2 [Reynold Xin] Added general load methods.
244fbec [Reynold Xin] Added equivalent to example.
4f15d92 [Reynold Xin] Added documentation for partitionBy.
7e91611 [Reynold Xin] [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API.
(cherry picked from commit 578bfeeff514228f6fd4b07a536815fbb3510f7e)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Java and Scala examples for OneVsRest. Fixes the base classifier to be Logistic Regression and accepts the configuration parameters of the base classifier.
Author: Ram Sriharsha <rsriharsha@hw11853.local>
Closes #6115 from harsha2010/SPARK-7575 and squashes the following commits:
87ad3c7 [Ram Sriharsha] extra line
f5d9891 [Ram Sriharsha] Merge branch 'master' into SPARK-7575
7076084 [Ram Sriharsha] cleanup
dfd660c [Ram Sriharsha] cleanup
8703e4f [Ram Sriharsha] update doc
cb23995 [Ram Sriharsha] fix commandline options for JavaOneVsRestExample
69e91f8 [Ram Sriharsha] cleanup
7f4e127 [Ram Sriharsha] cleanup
d4c40d0 [Ram Sriharsha] Code Review fixes
461eb38 [Ram Sriharsha] cleanup
e0106d9 [Ram Sriharsha] Fix typo
935cf56 [Ram Sriharsha] Try to match Java and Scala Example Commandline options
5323ff9 [Ram Sriharsha] cleanup
196a59a [Ram Sriharsha] cleanup
6adfa0c [Ram Sriharsha] Style Fix
8cfc5d5 [Ram Sriharsha] [SPARK-7575] Example code for OneVsRest
(cherry picked from commit cc12a86fb049f2be1f45baf461d202ec356ccf8f)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The difference is because we previously don't fit the intercept in Spark 1.3. Here, we change the input `String` so that the probability of instance 6 can be classified as `1.0` without any ambiguity.
with lambda = 0.001 in current LOR implementation, the prediction is
```
(4, spark i j k) --> prob=[0.1596407738787411,0.8403592261212589], prediction=1.0
(5, l m n) --> prob=[0.8378325685476612,0.16216743145233883], prediction=0.0
(6, spark hadoop spark) --> prob=[0.0692663313297627,0.9307336686702373], prediction=1.0
(7, apache hadoop) --> prob=[0.9821575333444208,0.01784246665557917], prediction=0.0
```
and the training accuracy is
```
(0, a b c d e spark) --> prob=[0.0021342419881406746,0.9978657580118594], prediction=1.0
(1, b d) --> prob=[0.9959176174854043,0.004082382514595685], prediction=0.0
(2, spark f g h) --> prob=[0.0014541569986711233,0.9985458430013289], prediction=1.0
(3, hadoop mapreduce) --> prob=[0.9982978367343561,0.0017021632656438518], prediction=0.0
```
Author: DB Tsai <dbt@netflix.com>
Closes #6109 from dbtsai/lor-example and squashes the following commits:
ac63ce4 [DB Tsai] first commit
(cherry picked from commit c1080b6fddb22d84694da2453e46a03fbc041576)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A param instance is strongly attached to an parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name. So it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the amount of diff and moved `parent` as a mutable field.
This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:
c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
520f0a2 [Xiangrui Meng] address comments
2569168 [Xiangrui Meng] fix tests
873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
409ea08 [Xiangrui Meng] minor updates
83a163c [Xiangrui Meng] update JavaDeveloperApiExample
5db5325 [Xiangrui Meng] update OneVsRest
7bde7ae [Xiangrui Meng] merge master
697fdf9 [Xiangrui Meng] update Bucketizer
7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
629d402 [Xiangrui Meng] fix LRSuite
154516f [Xiangrui Meng] merge master
aa4a611 [Xiangrui Meng] fix examples/compile
a4794dd [Xiangrui Meng] change Param to use to reduce the size of diff
fdbc415 [Xiangrui Meng] all tests passed
c255f17 [Xiangrui Meng] fix tests in ParamsSuite
818e1db [Xiangrui Meng] merge master
e1160cf [Xiangrui Meng] fix tests
fbc39f0 [Xiangrui Meng] pass test:compile
108937e [Xiangrui Meng] pass compile
8726d39 [Xiangrui Meng] use parent uid in Param
eaeed35 [Xiangrui Meng] update Identifiable
(cherry picked from commit 1b8625f4258d6d1a049d0ba60e39e9757f5a568b)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values given the default implementation of `fit(dataset, extra)`:
~~~scala
def fit(dataset: DataFrame, extra: ParamMap): Model = {
copy(extra).fit(dataset)
}
~~~
Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:
~~~scala
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
~~~
Meta-algorithm like `Pipeline` implements its own `copy(extra)`. So the fitted pipeline model stored all copied stages (no matter whether it is a transformer or a model).
Other changes:
* `Params$.inheritValues` is moved to `Params!.copyValues` and returns the target instance.
* `fittingParamMap` was removed because the `parent` carries this information.
* `validate` was renamed to `validateParams` to be more precise.
TODOs:
* [x] add tests for newly added methods
* [ ] update documentation
jkbradley dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #5820 from mengxr/SPARK-5956 and squashes the following commits:
7bef88d [Xiangrui Meng] address comments
05229c3 [Xiangrui Meng] assert -> assertEquals
b2927b1 [Xiangrui Meng] organize imports
f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
93e7924 [Xiangrui Meng] add tests for hasParam & copy
463ecae [Xiangrui Meng] merge master
2b954c3 [Xiangrui Meng] update Binarizer
465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
282a1a8 [Xiangrui Meng] fix test
819dd2d [Xiangrui Meng] merge master
b642872 [Xiangrui Meng] example code runs
5a67779 [Xiangrui Meng] examples compile
c76b4d1 [Xiangrui Meng] fix all unit tests
0f4fd64 [Xiangrui Meng] fix some tests
9286a22 [Xiangrui Meng] copyValues to trained models
53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues
9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams
d882afc [Xiangrui Meng] test compile
f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components
(cherry picked from commit e0833c5958bbd73ff27cfe6865648d7b6e5a99bc)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Main change: Added isValid field to Param. Modified all usages to use isValid when relevant. Added helper methods in ParamValidate.
Also overrode Params.validate() in:
* CrossValidator + model
* Pipeline + model
I made a few updates for the elastic net patch:
* I changed "tol" to "convergenceTol"
* I added some documentation
This PR is Scala + Java only. Python will be in a follow-up PR.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #5740 from jkbradley/enforce-validate and squashes the following commits:
ad9c6c1 [Joseph K. Bradley] re-generated sharedParams after merging with current master
76415e8 [Joseph K. Bradley] reverted convergenceTol to tol
af62f4b [Joseph K. Bradley] Removed changes to SparkBuild, python linalg. Fixed test failures. Renamed ParamValidate to ParamValidators. Removed explicit type from ParamValidators calls where possible.
bb2665a [Joseph K. Bradley] merged with elastic net pr
ecda302 [Joseph K. Bradley] fix rat tests, plus add a little doc
6895dfc [Joseph K. Bradley] small cleanups
069ac6d [Joseph K. Bradley] many cleanups
928fb84 [Joseph K. Bradley] Maybe done
a910ac7 [Joseph K. Bradley] still workin
6d60e2e [Joseph K. Bradley] Still workin
b987319 [Joseph K. Bradley] Partly done with adding checks, but blocking on adding checking functionality to Param
dbc9fb2 [Joseph K. Bradley] merged with master. enforcing Params.validate
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
extensibility
jira: https://issues.apache.org/jira/browse/SPARK-7090
LDA was implemented with extensibility in mind. And with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms.
As Joseph Bradley jkbradley proposed in https://github.com/apache/spark/pull/4807 and with some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly.
Basically class LDA would be a common entrance for LDA computing. And each LDA object will refer to a LDAOptimizer for the concrete algorithm implementation. Users can customize LDAOptimizer with specific parameters and assign it to LDA.
Concrete changes:
1. Add a trait `LDAOptimizer`, which defines the common iterface for concrete implementations. Each subClass is a wrapper for a specific LDA algorithm.
2. Move EMOptimizer to file LDAOptimizer and inherits from LDAOptimizer, rename to EMLDAOptimizer. (in case a more generic EMOptimizer comes in the future)
-adjust the constructor of EMOptimizer, since all the parameters should be passed in through initialState method. This can avoid unwanted confusion or overwrite.
-move the code from LDA.initalState to initalState of EMLDAOptimizer
3. Add property ldaOptimizer to LDA and its getter/setter, and EMLDAOptimizer is the default Optimizer.
4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.
Further work:
add OnlineLDAOptimizer and other possible Optimizers once ready.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #5661 from hhbyyh/ldaRefactor and squashes the following commits:
0e2e006 [Yuhao Yang] respond to review comments
08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
e756ce4 [Yuhao Yang] solve mima exception
d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
0bb8400 [Yuhao Yang] refactor LDA with Optimizer
ec2f857 [Yuhao Yang] protoptype for discussion
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The design doc was posted on the JIRA page. Python changes will be in a follow-up PR. jkbradley
1. Use codegen for shared params.
1. Move shared params to package `ml.param.shared`.
1. Set default values in `Params` instead of in `Param`.
1. Add a few methods to `Params` and `ParamMap`.
1. Move schema handling to `SchemaUtils` from `Params`.
- [x] check visibility of the methods added
Author: Xiangrui Meng <meng@databricks.com>
Closes #5431 from mengxr/SPARK-5957 and squashes the following commits:
d19236d [Xiangrui Meng] fix test
26ae2d7 [Xiangrui Meng] re-gen code and mark clear protected
38b78c7 [Xiangrui Meng] update Param.toString and remove Params.explain()
409e2d5 [Xiangrui Meng] address comments
2d637bd [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5957
eec2264 [Xiangrui Meng] make get* public in Params
4090d95 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5957
4fee9e7 [Xiangrui Meng] re-gen shared params
2737c2d [Xiangrui Meng] rename SharedParamCodeGen to SharedParamsCodeGen
e938f81 [Xiangrui Meng] update code to set default parameter values
28ed322 [Xiangrui Meng] merge master
55be1f3 [Xiangrui Meng] merge master
d63b5cc [Xiangrui Meng] fix examples
29b004c [Xiangrui Meng] update ParamsSuite
94fd98e [Xiangrui Meng] fix explain params
48d0e84 [Xiangrui Meng] add remove and update explainParams
4ac6348 [Xiangrui Meng] move schema utils to SchemaUtils add a few methods to Params
0d9594e [Xiangrui Meng] add getOrElse to ParamMap
eeeffe8 [Xiangrui Meng] map ++ paramMap => extractValues
0d3fc5b [Xiangrui Meng] setDefault after param
a9dbf59 [Xiangrui Meng] minor updates
d9302b8 [Xiangrui Meng] generate default values
1c72579 [Xiangrui Meng] pass test compile
abb7a3b [Xiangrui Meng] update default values handling
dcab97a [Xiangrui Meng] add codegen for shared params
|
|
|
|
|
|
|
|
|
|
|
| |
Use `sqlContext` in PySpark shell, make it consistent with SQL programming guide. `sqlCtx` is also kept for compatibility.
Author: Davies Liu <davies@databricks.com>
Closes #5425 from davies/sqlCtx and squashes the following commits:
af67340 [Davies Liu] sqlCtx -> sqlContext
15a278f [Davies Liu] use sqlContext in python shell
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added Scala, Java and Python streaming examples showing DataFrame and SQL operations within streaming.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #4975 from tdas/streaming-sql-examples and squashes the following commits:
705cba1 [Tathagata Das] Fixed python lint error
75a3fad [Tathagata Das] Fixed python lint error
5fbf789 [Tathagata Das] Removed empty lines at the end
874b943 [Tathagata Das] Added examples streaming + sql examples.
|
|
|
|
|
|
|
|
|
|
|
| |
Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
Author: Sean Owen <sowen@cloudera.com>
Closes #4950 from srowen/SPARK-6225 and squashes the following commits:
3080972 [Sean Owen] Ordered imports: Java, Scala, 3rd party, Spark
c67985b [Sean Owen] Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add parameter parsing in FPGrowth example app in Scala and Java
And a sample data file is added in data/mllib folder
Author: Jacky Li <jacky.likun@huawei.com>
Closes #4714 from jackylk/parameter and squashes the following commits:
8c478b3 [Jacky Li] fix according to comments
3bb74f6 [Jacky Li] make FPGrowth exampl app take parameters
f0e4d10 [Jacky Li] make FPGrowth exampl app take parameters
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.
For SPARK-5892:
* Fix Python docs
* Various other cleanups
BTW, I accidentally merged this with master. If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]
CC: mengxr (ML), davies (Python docs)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:
f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example. Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the previous version, PIC stores clustering assignments as an `RDD[(Long, Int)]`. This is mapped to `RDD<Tuple2<Object, Object>>` in Java and hence Java users have to cast types manually. We should either create a new method called `javaAssignments` that returns `JavaRDD[(java.lang.Long, java.lang.Int)]` or wrap the result pair in a class. I chose the latter approach in this PR. Now assignments are stored as an `RDD[Assignment]`, where `Assignment` is a class with `id` and `cluster`.
Similarly, in FPGrowth, the frequent itemsets are stored as an `RDD[(Array[Item], Long)]`, which is mapped to `RDD<Tuple2<Object, Object>>`. Though we provide a "Java-friendly" method `javaFreqItemsets` that returns `JavaRDD[(Array[Item], java.lang.Long)]`. It doesn't really work because `Array[Item]` is mapped to `Object` in Java. So in this PR I created a class `FreqItemset` to wrap the results. It has `items` and `freq`, as well as a `javaItems` method that returns `List<Item>` in Java.
I'm not certain that the names I chose are proper: `Assignment`/`id`/`cluster` and `FreqItemset`/`items`/`freq`. Please let me know if there are better suggestions.
CC: jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #4695 from mengxr/SPARK-5900 and squashes the following commits:
865b5ca [Xiangrui Meng] make Assignment serializable
cffa96e [Xiangrui Meng] fix test
9c0e590 [Xiangrui Meng] remove unused Tuple2
1b9db3d [Xiangrui Meng] make PIC and FPGrowth Java-friendly
|
|
|
|
|
|
|
|
|
|
|
|
| |
Updated PIC user guide to reflect API changes and added a simple Java example. The API is still not very Java-friendly. I created SPARK-5990 for this issue.
Author: Xiangrui Meng <meng@databricks.com>
Closes #4680 from mengxr/SPARK-5897 and squashes the following commits:
847d216 [Xiangrui Meng] apache header
87719a2 [Xiangrui Meng] remove PIC image
2dd921f [Xiangrui Meng] update PIC user guide and add a Java example
|
|
|
|
|
|
|
|
|
|
| |
The API is still not very Java-friendly because `Array[Item]` in `freqItemsets` is recognized as `Object` in Java. We might want to define a case class to wrap the return pair to make it Java friendly.
Author: Xiangrui Meng <meng@databricks.com>
Closes #4661 from mengxr/SPARK-5519 and squashes the following commits:
58ccc25 [Xiangrui Meng] add user guide with example code for fp-growth
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, Spark Streaming Programming Guide after updateStateByKey explanation links to file stateful_network_wordcount.py and note "For the complete Scala code ..." for any language tab selected. This is an incoherence.
I've changed the guide and link its pertinent example file. JavaStatefulNetworkWordCount.java example was not created so I added to the commit.
Author: gasparms <gmunoz@stratio.com>
Closes #4589 from gasparms/feature/streaming-guide and squashes the following commits:
7f37f89 [gasparms] More style changes
ec202b0 [gasparms] Follow spark style guide
f527328 [gasparms] Improve example to look like scala example
4d8785c [gasparms] Remove throw exception
e92e6b8 [gasparms] Fix incoherence
92db405 [gasparms] Fix Streaming Programming Guide. Change files according the selected language
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Deprecate inferSchema() and applySchema(), use createDataFrame() instead, which could take an optional `schema` to create an DataFrame from an RDD. The `schema` could be StructType or list of names of columns.
Author: Davies Liu <davies@databricks.com>
Closes #4498 from davies/create and squashes the following commits:
08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Trivial. add sc stop and reorganize import
https://issues.apache.org/jira/browse/SPARK-5717
Author: JqueryFan <firing@126.com>
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #4503 from hhbyyh/scstop and squashes the following commits:
7837a2c [JqueryFan] revert import change
2e85cc1 [Yuhao Yang] add stop and reorganize import
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is the LDA user guide from jkbradley with Java and Scala code example.
Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4465 from mengxr/lda-guide and squashes the following commits:
6dcb7d1 [Xiangrui Meng] update java example in the user guide
76169ff [Xiangrui Meng] update java example
36c3ae2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lda-guide
c2a1efe [Joseph K. Bradley] Added LDA programming guide, plus Java example (which is in the guide and probably should be removed).
|
|
|
|
|
|
|
|
| |
Author: Xiangrui Meng <meng@databricks.com>
Closes #4442 from mengxr/java6-fix and squashes the following commits:
2098500 [Xiangrui Meng] fix a compilation error with java 6
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
**UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion. Here is a list of changes which are public:
* new output columns: rawPrediction, probabilities
* The “score” column is now called “rawPrediction”
* Classifiers now provide numClasses
* Params.get and .set are now protected instead of private[ml].
* ParamMap now has a size method.
* new classes: LinearRegression, LinearRegressionModel
* LogisticRegression now has an intercept.
### Sketch of APIs (most of which are private[spark] for now)
Abstract classes for learning algorithms (+ corresponding Model abstractions):
* Classifier (+ ClassificationModel)
* ProbabilisticClassifier (+ ProbabilisticClassificationModel)
* Regressor (+ RegressionModel)
* Predictor (+ PredictionModel)
* *For all of these*:
* There is no strongly typed training-time API.
* There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.
Concrete classes: learning algorithms
* LinearRegression
* LogisticRegression (updated to use new abstract classes)
* Also, removed "score" in favor of "probability" output column. Changed BinaryClassificationEvaluator to match. (SPARK-5031)
Other updates:
* params.scala: Changed Params.set/get to be protected instead of private[ml]
* This was needed for the example of defining a class from outside of the MLlib namespace.
* VectorUDT: Will later change from private[spark] to public.
* This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
* Also, added equals() method.f
* SPARK-4942 : ML Transformers should allow output cols to be turned on,off
* Update validateAndTransformSchema
* Update transform
* (Updated examples, test suites according to other changes)
New examples:
* DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
* Added Java version too
Test Suites:
* LinearRegressionSuite
* LogisticRegressionSuite
* + Java versions of above suites
CC: mengxr etrain shivaram
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #3637 from jkbradley/ml-api-part1 and squashes the following commits:
405bfb8 [Joseph K. Bradley] Last edits based on code review. Small cleanups
fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types.
8316d5e [Joseph K. Bradley] fixes after rebasing on master
fc62406 [Joseph K. Bradley] fixed test suites after last commit
bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame)
9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api
f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it
216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged
f549e34 [Joseph K. Bradley] Updates based on code review. Major ones are: * Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters. * Made Predictor.featuresDataType have a default value of VectorUDT. * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value.
343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package
82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR). Fixed Java suites
0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites
c3c8da5 [Joseph K. Bradley] small cleanup
934f97b [Joseph K. Bradley] Fixed bugs from previous commit.
1c61723 [Joseph K. Bradley] * Made ProbabilisticClassificationModel into a subclass of ClassificationModel. Also introduced ProbabilisticClassifier. * This was to support output column “probabilityCol” in transform().
4e2f711 [Joseph K. Bradley] rat fix
bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite
8d13233 [Joseph K. Bradley] Added methods: * Classifier: batch predictRaw() * Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities() * Java versions of all above batch methods + others
1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0
adbe50a [Joseph K. Bradley] * fixed LinearRegression train() to use embedded paramMap * added Predictor.predict(RDD[Vector]) method * updated Linear/LogisticRegressionSuites
58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap.
57d54ab [Joseph K. Bradley] * Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap. * remove threshold_internal from logreg * Added Predictor.copy() * Extended LogisticRegressionSuite
e433872 [Joseph K. Bradley] Updated docs. Added LabeledPointSuite to spark.ml
54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly
0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString). Fixed bug in persisting logreg data. Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup).
601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString. Cleaned up classes in class hierarchy, before implementing tests and examples.
d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch
52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification
d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet
bfade12 [Joseph K. Bradley] Added lots of classes for new ML API:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This pull request redesigns the existing Spark SQL dsl, which already provides data frame like functionalities.
TODOs:
With the exception of Python support, other tasks can be done in separate, follow-up PRs.
- [ ] Audit of the API
- [ ] Documentation
- [ ] More test cases to cover the new API
- [x] Python support
- [ ] Type alias SchemaRDD
Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>
Closes #4173 from rxin/df1 and squashes the following commits:
0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1
23b4427 [Reynold Xin] Mima.
828f70d [Reynold Xin] Merge pull request #7 from davies/df
257b9e6 [Davies Liu] add repartition
6bf2b73 [Davies Liu] fix collect with UDT and tests
e971078 [Reynold Xin] Missing quotes.
b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now.
a728bf2 [Reynold Xin] Example rename.
e8aa3d3 [Reynold Xin] groupby -> groupBy.
9662c9e [Davies Liu] improve DataFrame Python API
4ae51ea [Davies Liu] python API for dataframe
1e5e454 [Reynold Xin] Fixed a bug with symbol conversion.
2ca74db [Reynold Xin] Couple minor fixes.
ea98ea1 [Reynold Xin] Documentation & literal expressions.
2b22684 [Reynold Xin] Got rid of IntelliJ problems.
02bbfbc [Reynold Xin] Tightening imports.
ffbce66 [Reynold Xin] Fixed compilation error.
59b6d8b [Reynold Xin] Style violation.
b85edfb [Reynold Xin] ALS.
8c37f0a [Reynold Xin] Made MLlib and examples compile
6d53134 [Reynold Xin] Hive module.
d35efd5 [Reynold Xin] Fixed compilation error.
ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite.
66d5ef1 [Reynold Xin] SQLContext minor patch.
c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After the following patches, the main (Scala) API is now usable for Java users directly.
https://github.com/apache/spark/pull/4056
https://github.com/apache/spark/pull/4054
https://github.com/apache/spark/pull/4049
https://github.com/apache/spark/pull/4030
https://github.com/apache/spark/pull/3965
https://github.com/apache/spark/pull/3958
Author: Reynold Xin <rxin@databricks.com>
Closes #4065 from rxin/sql-java-api and squashes the following commits:
b1fd860 [Reynold Xin] Fix Mima
6d86578 [Reynold Xin] Ok one more attempt in fixing Python...
e8f1455 [Reynold Xin] Fix Python again...
3e53f91 [Reynold Xin] Fixed Python.
83735da [Reynold Xin] Fix BigDecimal test.
e9f1de3 [Reynold Xin] Use scala BigDecimal.
500d2c4 [Reynold Xin] Fix Decimal.
ba3bfa2 [Reynold Xin] Updated javadoc for RowFactory.
c4ae1c5 [Reynold Xin] [SPARK-5193][SQL] Remove Spark SQL Java-specific API.
|
|
|
|
|
|
|
|
|
|
|
|
| |
and some minor changes in ScalaDoc.
Author: Xiangrui Meng <meng@databricks.com>
Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:
c559768 [Xiangrui Meng] minor code update
ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
0b5c182 [Xiangrui Meng] fix links in ml-guide
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Documentation:
* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md
Examples:
* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)
Bug fixes:
* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)
CC: mengxr shivaram etrain Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.
Author: Joseph K. Bradley <joseph@databricks.com>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:
d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml
c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
DecisionTree API fix
Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)
Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
* Use train/test split, and compute test error instead of training error.
* Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)
Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming
I have run all examples and relevant unit tests.
CC: mengxr manishamde codedeft
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:
70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are some inconsistencies in the gradient boosting APIs. The target is a general boosting meta-algorithm, but the implementation is attached to trees. This was partially due to the delay of SPARK-1856. But for the 1.2 release, we should make the APIs consistent.
1. WeightedEnsembleModel -> private[tree] TreeEnsembleModel and renamed members accordingly.
1. GradientBoosting -> GradientBoostedTrees
1. Add RandomForestModel and GradientBoostedTreesModel and hide CombiningStrategy
1. Slightly refactored TreeEnsembleModel (Vote takes weights into consideration.)
1. Remove `trainClassifier` and `trainRegressor` from `GradientBoostedTrees` because they are the same as `train`
1. Rename class `train` method to `run` because it hides the static methods with the same name in Java. Deprecated `DecisionTree.train` class method.
1. Simplify BoostingStrategy and make sure the input strategy is not modified. Users should put algo and numClasses in treeStrategy. We create ensembleStrategy inside boosting.
1. Fix a bug in GradientBoostedTreesSuite with AbsoluteError
1. doc updates
manishamde jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #3374 from mengxr/SPARK-4486 and squashes the following commits:
7097251 [Xiangrui Meng] address joseph's comments
98dea09 [Xiangrui Meng] address manish's comments
4aae3b7 [Xiangrui Meng] add RandomForestModel and GradientBoostedTreesModel, hide CombiningStrategy
ea4c467 [Xiangrui Meng] fix unit tests
751da4e [Xiangrui Meng] rename class method train -> run
19030a5 [Xiangrui Meng] update boosting public APIs
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR refactors / extends the status API introduced in #2696.
- Change StatusAPI from a mixin trait to a class. Before, the new status API methods were directly accessible through SparkContext, whereas now they're accessed through a `sc.statusAPI` field. As long as we were going to add these methods directly to SparkContext, the mixin trait seemed like a good idea, but this might be simpler to reason about and may avoid pitfalls that I've run into while attempting to refactor other parts of SparkContext to use mixins (see #3071, for example).
- Change the name from SparkStatusAPI to SparkStatusTracker.
- Make `getJobIdsForGroup(null)` return ids for jobs that aren't associated with any job group.
- Add `getActiveStageIds()` and `getActiveJobIds()` methods that return the ids of whatever's currently active in this SparkContext. This should simplify davies's progress bar code.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3197 from JoshRosen/progress-api-improvements and squashes the following commits:
30b0afa [Josh Rosen] Rename SparkStatusAPI to SparkStatusTracker.
d1b08d8 [Josh Rosen] Add missing newlines
2cc7353 [Josh Rosen] Add missing file.
d5eab1f [Josh Rosen] Add getActive[Stage|Job]Ids() methods.
a227984 [Josh Rosen] getJobIdsForGroup(null) should return jobs for default group
c47e294 [Josh Rosen] Remove StatusAPI mixin trait.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR adds package "org.apache.spark.ml" with pipeline and parameters, as discussed on the JIRA. This is a joint work of jkbradley etrain shivaram and many others who helped on the design, also with help from marmbrus and liancheng on the Spark SQL side. The design doc can be found at:
https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
**org.apache.spark.ml**
This is a new package with new set of ML APIs that address practical machine learning pipelines. (Sorry for taking so long!) It will be an alpha component, so this is definitely not something set in stone. The new set of APIs, inspired by the MLI project from AMPLab and scikit-learn, takes leverage on Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline:
1. Transformer, which transforms a dataset into another
2. Estimator, which fits models to data, where models are transformers
3. Evaluator, which evaluates model output and returns a scalar metric
4. Pipeline, a simple pipeline that consists of transformers and estimators
Parameters could be supplied at fit/transform or embedded with components.
1. Param: a strong-typed parameter key with self-contained doc
2. ParamMap: a param -> value map
3. Params: trait for components with parameters
For any component that implements `Params`, user can easily check the doc by calling `explainParams`:
~~~
> val lr = new LogisticRegression
> lr.explainParams
maxIter: max number of iterations (default: 100)
regParam: regularization constant (default: 0.1)
labelCol: label column name (default: label)
featuresCol: features column name (default: features)
~~~
or user can check individual param:
~~~
> lr.maxIter
maxIter: max number of iterations (default: 100)
~~~
**Please start with the example code in test suites and under `org.apache.spark.examples.ml`, where I put several examples:**
1. run a simple logistic regression job
~~~
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(1.0)
val model = lr.fit(dataset)
model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
.select('label, 'score, 'prediction).collect()
.foreach(println)
~~~
2. run logistic regression with cross-validation and grid search using areaUnderROC (default) as the metric
~~~
val lr = new LogisticRegression
val lrParamMaps = new ParamGridBuilder()
.addGrid(lr.regParam, Array(0.1, 100.0))
.addGrid(lr.maxIter, Array(0, 5))
.build()
val eval = new BinaryClassificationEvaluator
val cv = new CrossValidator()
.setEstimator(lr)
.setEstimatorParamMaps(lrParamMaps)
.setEvaluator(eval)
.setNumFolds(3)
val bestModel = cv.fit(dataset)
~~~
3. run a pipeline that consists of a standard scaler and a logistic regression component
~~~
val scaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
val lr = new LogisticRegression()
.setFeaturesCol(scaler.getOutputCol)
val pipeline = new Pipeline()
.setStages(Array(scaler, lr))
val model = pipeline.fit(dataset)
val predictions = model.transform(dataset)
.select('label, 'score, 'prediction)
.collect()
.foreach(println)
~~~
4. a simple text classification pipeline, which recognizes "spark":
~~~
val training = sparkContext.parallelize(Seq(
LabeledDocument(0L, "a b c d e spark", 1.0),
LabeledDocument(1L, "b d", 0.0),
LabeledDocument(2L, "spark f g h", 1.0),
LabeledDocument(3L, "hadoop mapreduce", 0.0)))
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
val test = sparkContext.parallelize(Seq(
Document(4L, "spark i j k"),
Document(5L, "l m"),
Document(6L, "mapreduce spark"),
Document(7L, "apache hadoop")))
model.transform(test)
.select('id, 'text, 'prediction, 'score)
.collect()
.foreach(println)
~~~
Java examples are very similar. I put example code that creates a simple text classification pipeline in Scala and Java, where a simple tokenizer is defined as a transformer outside `org.apache.spark.ml`.
**What are missing now and will be added soon:**
1. ~~Runtime check of schemas. So before we touch the data, we will go through the schema and make sure column names and types match the input parameters.~~
2. ~~Java examples.~~
3. ~~Store training parameters in trained models.~~
4. (later) Serialization and Python API.
Author: Xiangrui Meng <meng@databricks.com>
Closes #3099 from mengxr/SPARK-3530 and squashes the following commits:
2cc93fd [Xiangrui Meng] hide APIs as much as I can
34319ba [Xiangrui Meng] use local instead local[2] for unit tests
2524251 [Xiangrui Meng] rename PipelineStage.transform to transformSchema
c9daab4 [Xiangrui Meng] remove mockito version
1397ab5 [Xiangrui Meng] use sqlContext from LocalSparkContext instead of TestSQLContext
6ffc389 [Xiangrui Meng] try to fix unit test
a59d8b7 [Xiangrui Meng] doc updates
977fd9d [Xiangrui Meng] add scala ml package object
6d97fe6 [Xiangrui Meng] add AlphaComponent annotation
731f0e4 [Xiangrui Meng] update package doc
0435076 [Xiangrui Meng] remove ;this from setters
fa21d9b [Xiangrui Meng] update extends indentation
f1091b3 [Xiangrui Meng] typo
228a9f4 [Xiangrui Meng] do not persist before calling binary classification metrics
f51cd27 [Xiangrui Meng] rename default to defaultValue
b3be094 [Xiangrui Meng] refactor schema transform in lr
8791e8e [Xiangrui Meng] rename copyValues to inheritValues and make it do the right thing
51f1c06 [Xiangrui Meng] remove leftover code in Transformer
494b632 [Xiangrui Meng] compure score once
ad678e9 [Xiangrui Meng] more doc for Transformer
4306ed4 [Xiangrui Meng] org imports in text pipeline
6e7c1c7 [Xiangrui Meng] update pipeline
4f9e34f [Xiangrui Meng] more doc for pipeline
aa5dbd4 [Xiangrui Meng] fix typo
11be383 [Xiangrui Meng] fix unit tests
3df7952 [Xiangrui Meng] clean up
986593e [Xiangrui Meng] re-org java test suites
2b11211 [Xiangrui Meng] remove external data deps
9fd4933 [Xiangrui Meng] add unit test for pipeline
2a0df46 [Xiangrui Meng] update tests
2d52e4d [Xiangrui Meng] add @AlphaComponent to package-info
27582a4 [Xiangrui Meng] doc changes
73a000b [Xiangrui Meng] add schema transformation layer
6736e87 [Xiangrui Meng] more doc / remove HasMetricName trait
80a8b5e [Xiangrui Meng] rename SimpleTransformer to UnaryTransformer
62ca2bb [Xiangrui Meng] check param parent in set/get
1622349 [Xiangrui Meng] add getModel to PipelineModel
a0e0054 [Xiangrui Meng] update StandardScaler to use SimpleTransformer
d0faa04 [Xiangrui Meng] remove implicit mapping from ParamMap
c7f6921 [Xiangrui Meng] move ParamGridBuilder test to ParamGridBuilderSuite
e246f29 [Xiangrui Meng] re-org:
7772430 [Xiangrui Meng] remove modelParams add a simple text classification pipeline
b95c408 [Xiangrui Meng] remove implicits add unit tests to params
bab3e5b [Xiangrui Meng] update params
fe0ee92 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
6e86d98 [Xiangrui Meng] some code clean-up
2d040b3 [Xiangrui Meng] implement setters inside each class, add Params.copyValues [ci skip]
fd751fc [Xiangrui Meng] add java-friendly versions of fit and tranform
3f810cd [Xiangrui Meng] use multi-model training api in cv
5b8f413 [Xiangrui Meng] rename model to modelParams
9d2d35d [Xiangrui Meng] test varargs and chain model params
f46e927 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
1ef26e0 [Xiangrui Meng] specialize methods/types for Java
df293ed [Xiangrui Meng] switch to setter/getter
376db0a [Xiangrui Meng] pipeline and parameters
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Let's give this another go using a version of Hive that shades its JLine dependency.
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>
Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits:
e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script.
f65d17d [Patrick Wendell] Fixing build issue due to merge conflict
a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state.
7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant
583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver
3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests."
935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily."
925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily.
2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future.
8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven.
5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs.
2121071 [Patrick Wendell] Migrating version detection to PySpark
b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests.
1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11
f5cad4e [Patrick Wendell] Add Scala 2.11 docs
210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline"
48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles.
e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only"
67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check
8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only
e22b104 [Patrick Wendell] Small fix in pom file
ec402ab [Patrick Wendell] Various fixes
0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline
4eaec65 [Prashant Sharma] Changed scripts to ignore target.
5167bea [Prashant Sharma] small correction
a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins.
80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests.
034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt.
d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11.
6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10
e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted.
937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION
cb059b0 [Prashant Sharma] Code review
0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Based on SPARK-2434, this PR generates runtime warnings for example implementations (Python, Scala) of PageRank.
Author: Varadharajan Mukundan <srinathsmn@gmail.com>
Closes #2894 from varadharajan/SPARK-4047 and squashes the following commits:
5f9406b [Varadharajan Mukundan] [SPARK-4047] - Point users to LogisticRegressionWithSGD and LogisticRegressionWithLBFGS instead of LogisticRegressionModel
252f595 [Varadharajan Mukundan] a. Generate runtime warnings for
05a018b [Varadharajan Mukundan] Fix PageRank implementation's package reference
5c2bf54 [Varadharajan Mukundan] [SPARK-4047] - Generate runtime warnings for example implementation of PageRank
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here's my attempt to re-port `RecoverableNetworkWordCount` to Java, following the example of its Scala and Java siblings. I fixed a few minor doc/formatting issues along the way I believe.
Author: Sean Owen <sowen@cloudera.com>
Closes #2564 from srowen/SPARK-2548 and squashes the following commits:
0d0bf29 [Sean Owen] Update checkpoint call as in https://github.com/apache/spark/pull/2735
35f23e3 [Sean Owen] Remove old comment about running in standalone mode
179b3c2 [Sean Owen] Re-port RecoverableNetworkWordCount to Java example, and touch up doc / formatting in related examples
|
|
|
|
|
|
|
|
|
|
| |
数组下标越界
Author: xiao321 <1042460381@qq.com>
Closes #3153 from xiao321/patch-1 and squashes the following commits:
0ed17b5 [xiao321] Update JavaCustomReceiver.java
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
### Summary
* Made it easier to construct default Strategy and BoostingStrategy and to set parameters using simple types.
* Added Scala and Java examples for GradientBoostedTrees
* small cleanups and fixes
### Details
GradientBoosting bug fixes (“bug” = bad default options)
* Force boostingStrategy.weakLearnerParams.algo = Regression
* Force boostingStrategy.weakLearnerParams.impurity = impurity.Variance
* Only persist data if not yet persisted (since it causes an error if persisted twice)
BoostingStrategy
* numEstimators: renamed to numIterations
* removed subsamplingRate (duplicated by Strategy)
* removed categoricalFeaturesInfo since it belongs with the weak learner params (since boosting can be oblivious to feature type)
* Changed algo to var (not val) and added BeanProperty, with overload taking String argument
* Added assertValid() method
* Updated defaultParams() method and eliminated defaultWeakLearnerParams() since that belongs in Strategy
Strategy (for DecisionTree)
* Changed algo to var (not val) and added BeanProperty, with overload taking String argument
* Added setCategoricalFeaturesInfo method taking Java Map.
* Cleaned up assertValid
* Changed val’s to def’s since parameters can now be changed.
CC: manishamde mengxr codedeft
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #3094 from jkbradley/gbt-api and squashes the following commits:
7a27e22 [Joseph K. Bradley] scalastyle fix
52013d5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into gbt-api
e9b8410 [Joseph K. Bradley] Summary of changes
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This pull request is a first step towards the implementation of a stable, pull-based progress / status API for Spark (see [SPARK-2321](https://issues.apache.org/jira/browse/SPARK-2321)). For now, I'd like to discuss the basic implementation, API names, and overall interface design. Once we arrive at a good design, I'll go back and add additional methods to expose more information via these API.
#### Design goals:
- Pull-based API
- Usable from Java / Scala / Python (eventually, likely with a wrapper)
- Can be extended to expose more information without introducing binary incompatibilities.
- Returns immutable objects.
- Don't leak any implementation details, preserving our freedom to change the implementation.
#### Implementation:
- Add public methods (`getJobInfo`, `getStageInfo`) to SparkContext to allow status / progress information to be retrieved.
- Add public interfaces (`SparkJobInfo`, `SparkStageInfo`) for our API return values. These interfaces consist entirely of Java-style getter methods. The interfaces are currently implemented in Java. I decided to explicitly separate the interface from its implementation (`SparkJobInfoImpl`, `SparkStageInfoImpl`) in order to prevent users from constructing these responses themselves.
-Allow an existing JobProgressListener to be used when constructing a live SparkUI. This allows us to re-use this listeners in the implementation of this status API. There are a few reasons why this listener re-use makes sense:
- The status API and web UI are guaranteed to show consistent information.
- These listeners are already well-tested.
- The same garbage-collection / information retention configurations can apply to both this API and the web UI.
- Extend JobProgressListener to maintain `jobId -> Job` and `stageId -> Stage` mappings.
The progress API methods are implemented in a separate trait that's mixed into SparkContext. This helps to avoid SparkContext.scala from becoming larger and more difficult to read.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <joshrosen@apache.org>
Closes #2696 from JoshRosen/progress-reporting-api and squashes the following commits:
e6aa78d [Josh Rosen] Add tests.
b585c16 [Josh Rosen] Accept SparkListenerBus instead of more specific subclasses.
c96402d [Josh Rosen] Address review comments.
2707f98 [Josh Rosen] Expose current stage attempt id
c28ba76 [Josh Rosen] Update demo code:
646ff1d [Josh Rosen] Document spark.ui.retainedJobs.
7f47d6d [Josh Rosen] Clean up SparkUI constructors, per Andrew's feedback.
b77b3d8 [Josh Rosen] Merge remote-tracking branch 'origin/master' into progress-reporting-api
787444c [Josh Rosen] Move status API methods into trait that can be mixed into SparkContext.
f9a9a00 [Josh Rosen] More review comments:
3dc79af [Josh Rosen] Remove creation of unused listeners in SparkContext.
249ca16 [Josh Rosen] Address several review comments:
da5648e [Josh Rosen] Add example of basic progress reporting in Java.
7319ffd [Josh Rosen] Add getJobIdsForGroup() and num*Tasks() methods.
cc568e5 [Josh Rosen] Add note explaining that interfaces should not be implemented outside of Spark.
6e840d4 [Josh Rosen] Remove getter-style names and "consistent snapshot" semantics:
08cbec9 [Josh Rosen] Begin to sketch the interfaces for a stable, public status API.
ac2d13a [Josh Rosen] Add jobId->stage, stageId->stage mappings in JobProgressListener
24de263 [Josh Rosen] Create UI listeners in SparkContext instead of in Tabs:
|
|
|
|
|
|
|
|
|
|
| |
Thare are some inconsistent spellings 'MLlib' and 'MLLib' in some documents and source codes.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #2903 from sarutak/SPARK-4055 and squashes the following commits:
b031640 [Kousuke Saruta] Fixed inconsistent spelling "MLlib and MLLib"
|
|
|
|
|
|
|
|
|
|
| |
Changed the usage string to correctly reflect the file name.
Author: Karthik <karthik.gomadam@gmail.com>
Closes #2699 from namelessnerd/patch-1 and squashes the following commits:
8570e33 [Karthik] Update JavaCustomReceiver.java
|
|
|
|
|
|
|
|
|
|
| |
Call SparkContext.stop() in all examples (and touch up minor nearby code style issues while at it)
Author: Sean Owen <sowen@cloudera.com>
Closes #2575 from srowen/SPARK-2626 and squashes the following commits:
5b2baae [Sean Owen] Call SparkContext.stop() in all examples (and touch up minor nearby code style issues while at it)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adjust the default values of decision tree, based on the memory requirement discussed in https://github.com/apache/spark/pull/2125 :
1. maxMemoryInMB: 128 -> 256
2. maxBins: 100 -> 32
3. maxDepth: 4 -> 5 (in some example code)
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #2322 from mengxr/tree-defaults and squashes the following commits:
cda453a [Xiangrui Meng] fix tests
5900445 [Xiangrui Meng] update comments
8c81831 [Xiangrui Meng] update default values of tree:
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #1895 from sarutak/SPARK-2976 and squashes the following commits:
1cf7e69 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2976
d1e0666 [Kousuke Saruta] Modified styles
c5e80a4 [Kousuke Saruta] Remove tab from JavaPageRank.java and JavaKinesisWordCountASL.java
c003b36 [Kousuke Saruta] Removed tab from sorttable.js
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Updated DecisionTree documentation, with examples for Java, Python.
Added same Java example to code as well.
CC: @mengxr @manishamde @atalwalkar
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Closes #2063 from jkbradley/dt-docs and squashes the following commits:
2dd2c19 [Joseph K. Bradley] Last updates based on github review.
9dd1b6b [Joseph K. Bradley] Updated decision tree doc.
d802369 [Joseph K. Bradley] Updates based on comments: cache data, corrected doc text.
b9bee04 [Joseph K. Bradley] Updated DT examples
57eee9f [Joseph K. Bradley] Created JavaDecisionTree example from example in docs, and corrected doc example as needed.
d939a92 [Joseph K. Bradley] Updated DecisionTree documentation. Added Java, Python examples.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There have been user complaints that the difference between `registerAsTable` and `saveAsTable` is too subtle. This PR addresses this by renaming `registerAsTable` to `registerTempTable`, which more clearly reflects what is happening. `registerAsTable` remains, but will cause a deprecation warning.
Author: Michael Armbrust <michael@databricks.com>
Closes #1743 from marmbrus/registerTempTable and squashes the following commits:
d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
4dff086 [Michael Armbrust] Fix .java files too
89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
JIRA: https://issues.apache.org/jira/browse/SPARK-2060
Programming guide: http://yhuai.github.io/site/sql-programming-guide.html
Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #999 from yhuai/newJson and squashes the following commits:
227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
ce8eedd [Yin Huai] rxin's comments.
bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
94ffdaa [Yin Huai] Remove "get" from method names.
ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
79ea9ba [Yin Huai] Fix typos.
5428451 [Yin Huai] Newline
1f908ce [Yin Huai] Remove extra line.
d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
7ea750e [Yin Huai] marmbrus's comments.
6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
83013fb [Yin Huai] Update Java Example.
e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map.
6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
4fbddf0 [Yin Huai] Programming guide.
9df8c5a [Yin Huai] Python API.
7027634 [Yin Huai] Java API.
cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset.
d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
ab810b0 [Yin Huai] Make JsonRDD private.
6df0891 [Yin Huai] Apache header.
8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema.
8ffed79 [Yin Huai] Update the example.
a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution.
65b87f0 [Yin Huai] Fix sampling...
8846af5 [Yin Huai] API doc.
52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
0387523 [Yin Huai] Address PR comments.
666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
a2313a6 [Yin Huai] Address PR comments.
f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used.
0576406 [Yin Huai] Add Apache license header.
af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD.
f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Pretty self-explanatory
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #722 from tdas/example-fix and squashes the following commits:
7839979 [Tathagata Das] Minor changes.
0673441 [Tathagata Das] Fixed java docs of java streaming example
e687123 [Tathagata Das] Fixed scala style errors.
9b8d112 [Tathagata Das] Fixed streaming examples docs to use run-example instead of spark-submit.
|