path: root/examples/src/main
Commit message | Author | Age | Files | Lines
...
* [SPARKR][DOCS] minor code sample update in R programming guide | Felix Cheung | 2016-07-18 | 1 | -1/+1

  ## What changes were proposed in this pull request?
  Fix code style from ad hoc review of RC4 doc

  ## How was this patch tested?
  manual

  shivaram

  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #14250 from felixcheung/rdocs2rc4.

* [SPARK-16403][EXAMPLES] Cleanup to remove unused imports, consistent style, minor fixes | Bryan Cutler | 2016-07-14 | 56 | -646/+142

  ## What changes were proposed in this pull request?
  Cleanup of examples, mostly from PySpark-ML to fix minor issues: unused imports, style consistency, pipeline_example is a duplicate, use future print function, and a spelling error.

  * The "Pipeline Example" is duplicated by "Simple Text Classification Pipeline" in Scala, Python, and Java.
  * "Estimator Transformer Param Example" is duplicated by "Simple Params Example" in Scala, Python and Java
  * Synced random_forest_classifier_example.py with Scala by adding IndexToString label converter
  * Synced train_validation_split.py (in Scala ModelSelectionViaTrainValidationExample) by adjusting data split, adding grid for intercept.
  * RegexTokenizer was doing nothing in tokenizer_example.py and JavaTokenizerExample.java, synced with Scala version

  ## How was this patch tested?
  local tests and run modified examples

  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #14081 from BryanCutler/examples-cleanup-SPARK-16403.

* [SPARKR][MINOR] R examples and test updates | Felix Cheung | 2016-07-13 | 2 | -1/+4

  ## What changes were proposed in this pull request?
  Minor example updates

  ## How was this patch tested?
  manual

  shivaram

  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #14171 from felixcheung/rexample.

* [SPARK-16303][DOCS][EXAMPLES] Updated SQL programming guide and examples | aokolnychyi | 2016-07-13 | 8 | -269/+1193

  - Hard-coded Spark SQL sample snippets were moved into source files under examples sub-project.
  - Removed the inconsistency between Scala and Java Spark SQL examples
  - Scala and Java Spark SQL examples were updated

  The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review.

  ![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png)

  Author: aokolnychyi <okolnychyyanton@gmail.com>
  Closes #14119 from aokolnychyi/spark_16303.

* [SPARK-16114][SQL] structured streaming event time window example | James Thomas | 2016-07-11 | 6 | -4/+326

  ## What changes were proposed in this pull request?
  A structured streaming example with event time windowing.

  ## How was this patch tested?
  Run locally

  Author: James Thomas <jamesjoethomas@gmail.com>
  Closes #13957 from jjthomas/current.

* [SPARKR][DOC] SparkR ML user guides update for 2.0 | Yanbo Liang | 2016-07-11 | 1 | -11/+11

  ## What changes were proposed in this pull request?
  * Update SparkR ML section to make it consistent with SparkR API docs.
  * Since #13972 adds labelling support for the ```include_example``` Jekyll plugin, we can split the single ```ml.R``` example file into multiple line blocks with different labels, and include them under different algorithms/models in the generated HTML page.

  ## How was this patch tested?
  Only docs update, manually check the generated docs.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #14011 from yanboliang/r-user-guide-update.

* [SPARK-16381][SQL][SPARKR] Update SQL examples and programming guide for R language binding | Xin Ren | 2016-07-11 | 3 | -2/+199

  https://issues.apache.org/jira/browse/SPARK-16381

  ## What changes were proposed in this pull request?
  Update SQL examples and programming guide for R language binding.

  Here I follow the example at https://github.com/apache/spark/compare/master...liancheng:example-snippet-extraction and create a separate R file to store all the example code.

  ## How was this patch tested?
  Manual test on my local machine. Screenshot as below:

  ![screen shot 2016-07-06 at 4 52 25 pm](https://cloud.githubusercontent.com/assets/3925641/16638180/13925a58-439a-11e6-8d57-8451a63dcae9.png)

  Author: Xin Ren <iamshrek@126.com>
  Closes #14082 from keypointt/SPARK-16381.

* [SPARK-16260][ML][EXAMPLE] PySpark ML Example Improvements and Cleanup | wm624@hotmail.com | 2016-07-03 | 8 | -7/+8

  ## What changes were proposed in this pull request?
  1). Remove unused import in Scala example;
  2). Move spark session import outside example off;
  3). Change parameter setting to be the same as Scala;
  4). Change comment to be consistent;
  5). Make sure that Scala and Python use the same data set;

  I did one pass and fixed the above issues. There are missing examples in Python, which might be added later.

  TODO: Some examples have comments on how to run them, but many are missing. We can add them later.

  ## How was this patch tested?
  Manually test them

  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #14021 from wangmiao1981/ann.

* [SPARK-16345][DOCUMENTATION][EXAMPLES][GRAPHX] Extract graphx programming guide example snippets from source files instead of hard-coding them | WeichenXu | 2016-07-02 | 6 | -0/+420

  ## What changes were proposed in this pull request?
  I extract 6 example programs from the GraphX programming guide and replace them with `include_example` labels.

  The 6 example programs are:
  - AggregateMessagesExample.scala
  - SSSPExample.scala
  - TriangleCountingExample.scala
  - ConnectedComponentsExample.scala
  - ComprehensiveExample.scala
  - PageRankExample.scala

  All the example code can be run using `bin/run-example graphx.EXAMPLE_NAME`

  ## How was this patch tested?
  Manual.

  Author: WeichenXu <WeichenXu123@outlook.com>
  Closes #14015 from WeichenXu123/graphx_example_plugin.

* [SPARK-16294][SQL] Labelling support for the include_example Jekyll plugin | Cheng Lian | 2016-06-29 | 3 | -2/+18

  ## What changes were proposed in this pull request?
  This PR adds labelling support for the `include_example` Jekyll plugin, so that we may split a single source file into multiple line blocks with different labels, and include them in multiple code snippets in the generated HTML page.

  ## How was this patch tested?
  Manually tested.

  <img width="923" alt="screenshot at jun 29 19-53-21" src="https://cloud.githubusercontent.com/assets/230655/16451099/66a76db2-3e33-11e6-84fb-63104c2f0688.png">

  Author: Cheng Lian <lian@databricks.com>
  Closes #13972 from liancheng/include-example-with-labels.

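The labelled-block idea described in this commit can be sketched in a few lines of Python. This is not the Jekyll plugin itself (which is not shown in this log); the marker syntax `$example on:LABEL$` / `$example off:LABEL$`, the function name `extract_labelled_blocks`, and the sample source are all assumptions made for illustration.

```python
import re

# Assumed marker syntax for this sketch: "$example on:LABEL$" / "$example off:LABEL$".
# An unlabelled marker ("$example on$") is filed under the empty label "".
MARKER = re.compile(r"\$example (on|off)(?::(\w+))?\$")


def extract_labelled_blocks(source: str) -> dict:
    """Collect the lines between matching on/off markers, keyed by label."""
    blocks = {}
    active = set()
    for line in source.splitlines():
        match = MARKER.search(line)
        if match:
            state, label = match.group(1), match.group(2) or ""
            if state == "on":
                active.add(label)
                blocks.setdefault(label, [])
            else:
                active.discard(label)
            continue  # marker lines never become part of a snippet
        for label in active:
            blocks[label].append(line)
    return blocks


# Hypothetical source file with two labelled blocks.
SAMPLE = """\
import spark.implicits._
// $example on:init_session$
val spark = SparkSession.builder().getOrCreate()
// $example off:init_session$
val unrelated = 1
// $example on:create_df$
val df = spark.read.json("people.json")
// $example off:create_df$
"""

if __name__ == "__main__":
    for label, lines in extract_labelled_blocks(SAMPLE).items():
        print(label, "->", lines)
```

Keying the result by label is what lets one source file feed several snippets on the same generated page, which is the use case the commit message names.
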
* [SPARK-16261][EXAMPLES][ML] Fixed incorrect appNames in ML Examples | Bryan Cutler | 2016-06-29 | 4 | -4/+4

  ## What changes were proposed in this pull request?
  Some appNames in ML examples are incorrect, mostly in PySpark but one in Scala. This corrects the names.

  ## How was this patch tested?
  Style, local tests

  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #13949 from BryanCutler/pyspark-example-appNames-fix-SPARK-16261.

* [SPARK-16114][SQL] structured streaming network word count examples | James Thomas | 2016-06-28 | 3 | -0/+234

  ## What changes were proposed in this pull request?
  Network word count example for structured streaming

  ## How was this patch tested?
  Run locally

  Author: James Thomas <jamesjoethomas@gmail.com>
  Author: James Thomas <jamesthomas@Jamess-MacBook-Pro.local>
  Closes #13816 from jjthomas/master.

* [SPARK-16231][PYSPARK][ML][EXAMPLES] dataframe_example.py fails to convert ML style vectors | Bryan Cutler | 2016-06-27 | 1 | -1/+3

  ## What changes were proposed in this pull request?
  Need to convert ML Vectors to the old MLlib style before doing Statistics.colStats operations on the DataFrame

  ## How was this patch tested?
  Ran example, local tests

  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #13928 from BryanCutler/pyspark-ml-example-vector-conv-SPARK-16231.

* [SPARK-16214][EXAMPLES] fix the denominator of SparkPi | 杨浩 | 2016-06-27 | 1 | -1/+1

  ## What changes were proposed in this pull request?
  reduce the denominator of SparkPi by 1

  ## How was this patch tested?
  integration tests

  Author: 杨浩 <yanghaogn@163.com>
  Closes #13910 from yanghaogn/patch-1.

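A minimal Python sketch of why an off-by-one denominator matters in a Monte Carlo estimate of pi. It assumes, as is not stated in the commit message, that the example draws one point per index in a half-open range `1 until n`, which yields `n - 1` points rather than `n`; the function name `estimate_pi` and the seed are illustrative only.

```python
import math
import random


def estimate_pi(n: int, seed: int = 0):
    """Monte Carlo estimate of pi over the indices 1 until n (n - 1 samples, n >= 2)."""
    rng = random.Random(seed)
    samples = range(1, n)  # mirrors a half-open range: n itself is excluded
    inside = 0
    for _ in samples:
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Divide by the number of points actually drawn (n - 1), not by n.
    return 4.0 * inside / len(samples), len(samples)


if __name__ == "__main__":
    estimate, drawn = estimate_pi(100_000, seed=42)
    print(f"drew {drawn} points, estimate {estimate:.4f}, error {abs(estimate - math.pi):.4f}")
```

Dividing by `n` instead would bias the estimate low by a factor of `(n - 1) / n`, which is negligible for large `n` but visible for small ones.
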
* [SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer | GayathriMurali | 2016-06-24 | 3 | -6/+21

  ## What changes were proposed in this pull request?
  Made changes to HashingTF, QuantileVectorizer and CountVectorizer

  Author: GayathriMurali <gayathri.m@intel.com>
  Closes #13745 from GayathriMurali/SPARK-15997.

* [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates | Felix Cheung | 2016-06-20 | 3 | -27/+22

  ## What changes were proposed in this pull request?
  roxygen2 doc, programming guide, example updates

  ## How was this patch tested?
  manual checks

  shivaram

  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #13751 from felixcheung/rsparksessiondoc.

* [SPARK-15129][R][DOC] R API changes in ML | GayathriMurali | 2016-06-17 | 1 | -2/+2

  ## What changes were proposed in this pull request?
  Make user guide changes to SparkR documentation for all changes that happened in 2.0 to Machine Learning APIs

  Author: GayathriMurali <gayathri.m@intel.com>
  Closes #13285 from GayathriMurali/SPARK-15129.

* [SPARK-15608][ML][EXAMPLES][DOC] add examples and documents of ml.isotonic regression | WeichenXu | 2016-06-16 | 6 | -14/+203

  ## What changes were proposed in this pull request?
  - add ml doc for ml isotonic regression
  - add scala example for ml isotonic regression
  - add java example for ml isotonic regression
  - add python example for ml isotonic regression
  - modify scala example for mllib isotonic regression
  - modify java example for mllib isotonic regression
  - modify python example for mllib isotonic regression
  - add data/mllib/sample_isotonic_regression_libsvm_data.txt
  - delete data/mllib/sample_isotonic_regression_data.txt

  ## How was this patch tested?
  N/A

  Author: WeichenXu <WeichenXu123@outlook.com>
  Closes #13381 from WeichenXu123/add_isotonic_regression_doc.

* [SPARK-15996][R] Fix R examples by removing deprecated functions | Dongjoon Hyun | 2016-06-16 | 2 | -8/+11

  ## What changes were proposed in this pull request?
  Currently, R examples (`dataframe.R` and `data-manipulation.R`) fail like the following. We had better update them before releasing 2.0 RC. This PR updates them to use up-to-date APIs.

  ```bash
  $ bin/spark-submit examples/src/main/r/dataframe.R
  ...
  Warning message:
  'createDataFrame(sqlContext...)' is deprecated.
  Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead.
  See help("Deprecated")
  ...
  Warning message:
  'read.json(sqlContext...)' is deprecated.
  Use 'read.json(path)' instead.
  See help("Deprecated")
  ...
  Error: could not find function "registerTempTable"
  Execution halted
  ```

  ## How was this patch tested?
  Manual.

  ```
  curl -LO http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv
  bin/spark-submit examples/src/main/r/dataframe.R
  bin/spark-submit examples/src/main/r/data-manipulation.R flights.csv
  ```

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13714 from dongjoon-hyun/SPARK-15996.

* [SPARK-15898][SQL] DataFrameReader.text should return DataFrame | Wenchen Fan | 2016-06-12 | 10 | -10/+10

  ## What changes were proposed in this pull request?
  We want to maintain API compatibility for DataFrameReader.text, and will introduce a new API called DataFrameReader.textFile which returns Dataset[String].

  affected PRs:
  https://github.com/apache/spark/pull/11731
  https://github.com/apache/spark/pull/13104
  https://github.com/apache/spark/pull/13184

  ## How was this patch tested?
  N/A

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #13604 from cloud-fan/revert.

* [SPARK-15086][CORE][STREAMING] Deprecate old Java accumulator API | Sean Owen | 2016-06-12 | 1 | -5/+5

  ## What changes were proposed in this pull request?
  - Deprecate old Java accumulator API; should use Scala now
  - Update Java tests and examples
  - Don't bother testing old accumulator API in Java 8 (too)
  - (fix a misspelling too)

  ## How was this patch tested?
  Jenkins tests

  Author: Sean Owen <sowen@cloudera.com>
  Closes #13606 from srowen/SPARK-15086.

* [SPARK-14615][ML][FOLLOWUP] Fix Python examples to use the new ML Vector and Matrix APIs in the ML pipeline based algorithms | hyukjinkwon | 2016-06-10 | 10 | -19/+18

  ## What changes were proposed in this pull request?
  This PR fixes Python examples to use the new ML Vector and Matrix APIs in the ML pipeline based algorithms.

  I first ran the shell command `grep -r "from pyspark.mllib" .` and then executed all the examples it listed. Some of the tests in `ml` produced error messages like the one below:

  ```
  pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Input type must be VectorUDT but got org.apache.spark.mllib.linalg.VectorUDTf71b0bce.'
  ```

  So, I fixed them to use the new APIs, in the same way as some Python tests were fixed in https://github.com/apache/spark/pull/12627

  ## How was this patch tested?
  Manually tested for all the examples listed by `grep -r "from pyspark.mllib" .`.

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #13393 from HyukjinKwon/SPARK-14615.

* [SPARK-15773][CORE][EXAMPLE] Avoid creating local variable `sc` in examples if possible | Dongjoon Hyun | 2016-06-10 | 12 | -52/+30

  ## What changes were proposed in this pull request?
  Instead of using a local variable `sc` as in the following example, this PR uses `spark.sparkContext`. This makes examples more concise, and also fixes some misleading code, i.e., creating SparkContext from SparkSession.

  ```
  - println("Creating SparkContext")
  - val sc = spark.sparkContext
  -
    println("Writing local file to DFS")
    val dfsFilename = dfsDirPath + "/dfs_read_write_test"
  - val fileRDD = sc.parallelize(fileContents)
  + val fileRDD = spark.sparkContext.parallelize(fileContents)
  ```

  This will change 12 files (+30 lines, -52 lines).

  ## How was this patch tested?
  Manual.

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13520 from dongjoon-hyun/SPARK-15773.

* [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public | Joseph K. Bradley | 2016-06-06 | 1 | -0/+122

  ## What changes were proposed in this pull request?
  Made DefaultParamsReadable, DefaultParamsWritable public. Also added relevant doc and annotations. Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable,Writable.

  ## How was this patch tested?
  Wrote example making use of the now-public APIs. Compiled and ran locally

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #13461 from jkbradley/defaultparamswritable.

* [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples | Yanbo Liang | 2016-06-06 | 18 | -36/+36

  ## What changes were proposed in this pull request?
  Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples are broken.

  ```python
  pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
  ```

  We should use ```accuracy``` to replace ```precision``` in these examples.

  ## How was this patch tested?
  Offline tests.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13519 from yanboliang/spark-15771.

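One way to see why swapping `precision` for `accuracy` need not change the numbers an example reports: for single-label multiclass predictions, overall (micro-averaged) precision and accuracy are the same quantity. The sketch below is plain Python, not Spark code, and it assumes the deprecated metric was the micro-averaged one, which the commit message does not state; the function names and data are illustrative.

```python
from collections import Counter


def accuracy(labels, predictions):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def micro_precision(labels, predictions):
    """Sum of true positives over all classes divided by all positive calls."""
    true_pos = Counter()
    false_pos = Counter()
    for y, p in zip(labels, predictions):
        if y == p:
            true_pos[p] += 1
        else:
            false_pos[p] += 1  # predicted class p, but the label was something else
    total_tp = sum(true_pos.values())
    total_fp = sum(false_pos.values())
    return total_tp / (total_tp + total_fp)


if __name__ == "__main__":
    labels = [0, 1, 2, 2, 1, 0]
    predictions = [0, 2, 2, 2, 1, 1]
    print(accuracy(labels, predictions), micro_precision(labels, predictions))
```

Every prediction is either a true positive or a false positive for the class it names, so the micro-averaged denominator is simply the number of predictions.
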
* [SPARK-15605][ML][EXAMPLES] Fix broken ML JavaDeveloperApiExample. | Yanbo Liang | 2016-06-02 | 1 | -240/+0

  ## What changes were proposed in this pull request?
  See [SPARK-15605](https://issues.apache.org/jira/browse/SPARK-15605) for the details of this bug. This PR fixes 2 major bugs in this example:

  * The Java example class uses Param ```maxIter```, which fails when calling ```Param.shouldOwn```. We need to add a public method which returns the ```maxIter``` object, because ```Params.params``` uses Java reflection to list all public methods whose return type is ```Param```, and invokes them to get all param objects defined in the instance.
  * The ```uid``` member defined in the Java class is initialized after Scala traits such as ```HasFeaturesCol```. So when ```HasFeaturesCol``` is being constructed, it gets a null ```uid```, which causes the ```Param.shouldOwn``` check to fail.

  So, here are my changes:

  * Add public method: ```public IntParam getMaxIterParam() {return maxIter;}```
  * Use a Java anonymous class overriding ```uid()``` to define the ```uid```, which solves the second problem described above.
  * To make ```getMaxIterParam``` invocable via Java reflection, the two classes (MyJavaLogisticRegression and MyJavaLogisticRegressionModel) must be public, so I made them public static inner classes.

  ## How was this patch tested?
  Offline tests.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13353 from yanboliang/spark-15605.

* [SPARK-15208][WIP][CORE][STREAMING][DOCS] Update Spark examples with AccumulatorV2 | Liwei Lin | 2016-06-02 | 1 | -2/+1

  ## What changes were proposed in this pull request?
  The patch updates the codes & docs in the example module as well as the related doc module:

  - [ ] [docs] `streaming-programming-guide.md`
    - [x] scala code part
    - [ ] java code part
    - [ ] python code part
  - [x] [examples] `RecoverableNetworkWordCount.scala`
  - [ ] [examples] `JavaRecoverableNetworkWordCount.java`
  - [ ] [examples] `recoverable_network_wordcount.py`

  ## How was this patch tested?
  Ran the examples and verified results manually.

  Author: Liwei Lin <lwlin7@gmail.com>
  Closes #12981 from lw-lin/accumulatorV2-examples.

* [SPARK-15618][SQL][MLLIB] Use SparkSession.builder.sparkContext if applicable. | Dongjoon Hyun | 2016-05-31 | 3 | -9/+4

  ## What changes were proposed in this pull request?
  This PR changes function `SparkSession.builder.sparkContext(..)` from **private[sql]** into **private[spark]**, and uses it where applicable, as follows.

  ```
  - val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
  + val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
  ```

  ## How was this patch tested?
  Pass the existing Jenkins tests.

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13365 from dongjoon-hyun/SPARK-15618.

* [SPARK-15645][STREAMING] Fix some typos of Streaming module | Xin Ren | 2016-05-30 | 2 | -2/+2

  ## What changes were proposed in this pull request?
  No code change, just some typo fixing.

  ## How was this patch tested?
  Manually run project build with testing, and build is successful.

  Author: Xin Ren <iamshrek@126.com>
  Closes #13385 from keypointt/codeWalkThroughStreaming.

* [SPARK-15562][ML] Delete temp directory after program exit in DataFrameExample | dding3 | 2016-05-27 | 1 | -2/+2

  ## What changes were proposed in this pull request?
  The temp directory used to save records is not deleted after program exit in DataFrameExample. Although the code calls deleteOnExit, this doesn't work because the directory is not empty. Similar things happened in ContextCleanerSuite. Update the code to make sure the temp directory is deleted after program exit.

  ## How was this patch tested?
  unit tests and local build.

  Author: dding3 <ding.ding@intel.com>
  Closes #13328 from dding3/master.

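The failure mode described here, a plain directory delete silently doing nothing when the directory still holds files, has a direct analogue in Python. The sketch below is that analogue, not the Scala fix from the commit: `os.rmdir` refuses a non-empty directory, while a recursive removal succeeds. The helper name `remove_temp_dir` and the file name are illustrative.

```python
import os
import shutil
import tempfile


def remove_temp_dir(path: str) -> None:
    """Recursively delete a directory and everything inside it."""
    shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    tmp = tempfile.mkdtemp(prefix="dataframe_example_")
    with open(os.path.join(tmp, "part-00000"), "w") as handle:
        handle.write("record\n")

    try:
        os.rmdir(tmp)  # non-recursive delete, comparable to deleting an empty dir only
    except OSError:
        print("plain delete refused: directory is not empty")

    remove_temp_dir(tmp)
    print("still exists:", os.path.exists(tmp))
```

Registering the recursive removal with `atexit.register(remove_temp_dir, tmp)` would give delete-on-exit behaviour in a Python script.
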
* [SPARK-15449][MLLIB][EXAMPLE] Wrong Data Format - Documentation Issue | wm624@hotmail.com | 2016-05-27 | 3 | -21/+10

  ## What changes were proposed in this pull request?
  In the MLlib naive Bayes example, the Scala and Python examples don't use libsvm data, but the Java example does. I changed the Scala and Python examples to use the same libsvm data as the Java example.

  ## How was this patch tested?
  Manual tests

  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #13301 from wangmiao1981/example.

* [MINOR] Fix Typos 'a -> an' | Zheng RuiFeng | 2016-05-26 | 2 | -2/+2

  ## What changes were proposed in this pull request?
  `a` -> `an`

  I use regex to generate potential error lines:
  `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
  and review them line by line.

  ## How was this patch tested?
  local build
  `lint-java` checking

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13317 from zhengruifeng/a_an.

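The `grep -in ' a [aeiou]'` search can be reproduced with Python's `re` module, which also makes it easy to see why the commit reviews hits line by line: the pattern flags candidates, not confirmed errors. The function name `candidate_lines` and the sample text are illustrative, not taken from the Spark sources.

```python
import re

# Same pattern as the grep call: " a " followed by a vowel, case-insensitive (-i).
PATTERN = re.compile(r" a [aeiou]", re.IGNORECASE)


def candidate_lines(text: str):
    """Return (line_number, line) pairs, numbered from 1 like grep -n."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if PATTERN.search(line)
    ]


if __name__ == "__main__":
    sample = "\n".join([
        "Fits a estimator to the data.",   # real error: should be "an estimator"
        "Fits a model to the data.",       # no match
        "Creates a user-defined param.",   # match, but "a user" is correct English
    ])
    for number, line in candidate_lines(sample):
        print(f"{number}: {line}")
```

The third sample line shows the false-positive case: a vowel letter does not always mean a vowel sound, so each hit still needs a human decision.
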
* [SPARK-15457][MLLIB][ML] Eliminate some warnings from MLlib about deprecations | Sean Owen | 2016-05-26 | 11 | -23/+19

  ## What changes were proposed in this pull request?
  Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:

  * WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
    * Use in PythonMLlibAPI: Change to using private constructors
    * Streaming algs: No warnings after we un-deprecate the classes
    * Examples: Deprecate or change ones which use deprecated APIs
  * MulticlassMetrics fields (precision, etc.)
  * LinearRegressionSummary.model field

  ## How was this patch tested?
  Existing tests. Checked for warnings manually.

  Author: Sean Owen <sowen@cloudera.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #13314 from jkbradley/warning-cleanups.

* [SPARK-15492][ML][DOC] Binarization scala example copy & paste to spark-shell error | wm624@hotmail.com | 2016-05-26 | 2 | -4/+3

  ## What changes were proposed in this pull request?
  The Binarization Scala example declares `val dataFrame : Dataframe = spark.createDataFrame(data).toDF("label", "feature")`, which can't be pasted into the spark-shell because Dataframe is not imported. Compared with other examples, this explicit type is not required, so I removed Dataframe from the code.

  ## How was this patch tested?
  Manually tested

  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #13266 from wangmiao1981/unit.

* [MINOR] [PYSPARK] [EXAMPLES] Changed examples to use SparkSession.sparkContext instead of _sc | Bryan Cutler | 2016-05-25 | 5 | -5/+5

  ## What changes were proposed in this pull request?
  Some PySpark examples need a SparkContext and get it by accessing _sc directly from the session. These examples should use the provided property `sparkContext` in `SparkSession` instead.

  ## How was this patch tested?
  Ran modified examples

  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #13303 from BryanCutler/pyspark-session-sparkContext-MINOR.

* [SPARK-15396][SQL][DOC] It can't connect hive metastore database | gatorsmile | 2016-05-21 | 1 | -4/+7

  #### What changes were proposed in this pull request?
  The `hive.metastore.warehouse.dir` property in hive-site.xml is deprecated since Spark 2.0.0. Users might not be able to connect to the existing metastore if they do not use the new conf parameter `spark.sql.warehouse.dir`.

  This PR updates the document and example to explain the latest changes in the configuration of the default database location.

  Below is the screenshot of the latest generated docs:

  <img width="681" alt="screenshot 2016-05-20 08 38 10" src="https://cloud.githubusercontent.com/assets/11567269/15433296/a05c4ace-1e66-11e6-8d2b-73682b32e9c2.png">

  <img width="789" alt="screenshot 2016-05-20 08 53 26" src="https://cloud.githubusercontent.com/assets/11567269/15433734/645dc42e-1e68-11e6-9476-effc9f8721bb.png">

  <img width="789" alt="screenshot 2016-05-20 08 53 37" src="https://cloud.githubusercontent.com/assets/11567269/15433738/68569f92-1e68-11e6-83d3-ef5bb221a8d8.png">

  No change is made in the R example.

  <img width="860" alt="screenshot 2016-05-20 08 54 38" src="https://cloud.githubusercontent.com/assets/11567269/15433779/965b8312-1e68-11e6-8bc4-53c88ceacde2.png">

  #### How was this patch tested?
  N/A

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #13225 from gatorsmile/document.

* [SPARK-15031][EXAMPLE] Use SparkSession in examples | Zheng RuiFeng | 2016-05-20 | 33 | -143/+276

  ## What changes were proposed in this pull request?
  Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)

  `MLLIB` is not recommended for use now, so examples in `MLLIB` are ignored in this PR.
  `StreamingContext` can not be directly obtained from `SparkSession`, so examples in `Streaming` are ignored too.

  cc andrewor14

  ## How was this patch tested?
  manual tests with spark-submit

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13164 from zhengruifeng/use_sparksession_ii.

* [SPARK-15222][SPARKR][ML] SparkR ML examples update in 2.0 | Yanbo Liang | 2016-05-20 | 1 | -17/+112

  ## What changes were proposed in this pull request?
  Update example code in examples/src/main/r/ml.R to reflect the new algorithms.

  * spark.glm and glm
  * spark.survreg
  * spark.naiveBayes
  * spark.kmeans

  ## How was this patch tested?
  Offline test.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13000 from yanboliang/spark-15222.

* [SPARK-15398][ML] Update the warning message to recommend ML usage | Zheng RuiFeng | 2016-05-19 | 12 | -37/+28

  ## What changes were proposed in this pull request?
  MLlib is not recommended for use, and some methods are even deprecated. Update the warning message to recommend ML usage.

  ```
  def showWarning() {
    System.err.println(
      """WARN: This is a naive implementation of Logistic Regression and is given as an example!
        |Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
        |org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
        |for more conventional use.
      """.stripMargin)
  }
  ```

  To

  ```
  def showWarning() {
    System.err.println(
      """WARN: This is a naive implementation of Logistic Regression and is given as an example!
        |Please use org.apache.spark.ml.classification.LogisticRegression
        |for more conventional use.
      """.stripMargin)
  }
  ```

  ## How was this patch tested?
  local build

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13190 from zhengruifeng/update_recd.

* [SPARK-15363][ML][EXAMPLE] Example code shouldn't use VectorImplicits._, asML/fromML | wm624@hotmail.com | 2016-05-19 | 1 | -2/+2

  ## What changes were proposed in this pull request?
  In this DataFrame example, we use VectorImplicits._, which is a private API.

  Since the Vectors object has a public API, we use Vectors.fromML instead of implicits.

  ## How was this patch tested?
  Manually run the example.

  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #13213 from wangmiao1981/ml.

* [SPARK-15296][MLLIB] Refactor All Java Tests that use SparkSession | Sandeep Singh | 2016-05-19 | 1 | -1/+1

  ## What changes were proposed in this pull request?
  Refactor all Java tests that use SparkSession to extend SharedSparkSession

  ## How was this patch tested?
  Existing Tests

  Author: Sandeep Singh <sandeep@techaddict.me>
  Closes #13101 from techaddict/SPARK-15296.

* [SPARK-15031][EXAMPLES][FOLLOW-UP] Make Python param example working with SparkSession | hyukjinkwon | 2016-05-19 | 3 | -15/+13

  ## What changes were proposed in this pull request?
  It seems most of the Python examples were changed to use SparkSession by https://github.com/apache/spark/pull/12809. That PR said that the two examples below:

  - `simple_params_example.py`
  - `aft_survival_regression.py`

  were not changed because they did not work. It seems `aft_survival_regression.py` was changed by https://github.com/apache/spark/pull/13050 but `simple_params_example.py` has not been changed yet.

  This PR corrects the example and makes it use SparkSession.

  In more detail, it seems `threshold` was replaced with `thresholds` here and there by https://github.com/apache/spark/commit/5a23213c148bfe362514f9c71f5273ebda0a848a. However, calling `lr.fit(training, paramMap)` overwrites the values. So, `threshold` was 5 and `thresholds` becomes 5.5 (by `1 / (1 + thresholds(0) / thresholds(1))`).

  According to the comment below, this is not allowed: https://github.com/apache/spark/blob/354f8f11bd4b20fa99bd67a98da3525fd3d75c81/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L58-L61.

  So, this PR sets the equivalent value so that this does not throw an exception.

  ## How was this patch tested?
  Manually (`mvn package -DskipTests && spark-submit simple_params_example.py`)

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #13135 from HyukjinKwon/SPARK-15031.

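The conversion formula quoted in this commit, `1 / (1 + thresholds(0) / thresholds(1))`, is easy to check numerically. The sketch below is plain Python, not the Spark implementation; the function names and the inverse mapping `[1 - t, t]` are assumptions chosen so that the two settings agree, which is what "sets the equivalent value" amounts to.

```python
def equivalent_threshold(thresholds):
    """Binary threshold implied by a two-element thresholds list (thresholds[1] != 0)."""
    t0, t1 = thresholds
    return 1.0 / (1.0 + t0 / t1)


def thresholds_for(threshold):
    """Assumed inverse: a thresholds pair that maps back to the given threshold."""
    return [1.0 - threshold, threshold]


if __name__ == "__main__":
    for pair in ([0.5, 0.5], [0.45, 0.55], [0.2, 0.8]):
        print(pair, "->", round(equivalent_threshold(pair), 6))
```

Setting both parameters is only consistent when `equivalent_threshold(thresholds)` equals `threshold`; any other combination is the conflicting case the linked comment rules out.
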
* [SPARK-15322][MLLIB][CORE][SQL] update deprecate accumulator usage into accumulatorV2 in spark project | WeichenXu | 2016-05-18 | 1 | -5/+6

  ## What changes were proposed in this pull request?
  I used IntelliJ IDEA to search for usages of the deprecated SparkContext.accumulator in the whole Spark project, and updated the code (except the test code for the accumulator method itself).

  ## How was this patch tested?
  Existing unit tests

  Author: WeichenXu <WeichenXu123@outlook.com>
  Closes #13112 from WeichenXu123/update_accuV2_in_mllib.

* [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable | Sean Zhong | 2016-05-18 | 7 | -13/+13

  ## What changes were proposed in this pull request?
  Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.

  ## How was this patch tested?
  This PR only changes the unit test code, examples, and comments. It should be safe.
  This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.

  Author: Sean Zhong <seanzhong@databricks.com>
  Closes #13098 from clockfly/spark-15171-remove-deprecation.

* [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms | DB Tsai | 2016-05-17 | 20 | -27/+28

  ## What changes were proposed in this pull request?
  Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.

  ## How was this patch tested?
  Unit tests

  Author: DB Tsai <dbt@netflix.com>
  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #12627 from dbtsai/SPARK-14615-NewML.

* [SPARK-15318][ML][EXAMPLE] spark.ml Collaborative Filtering example does not work in spark-shell | wm624@hotmail.com | 2016-05-17 | 1 | -7/+12

  ## What changes were proposed in this pull request?
  When we copy & paste the example in ml-collaborative-filtering.html into spark-shell, we see the following errors.

      scala> case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
      defined class Rating

      scala> object Rating {
      def parseRating(str: String): Rating = {
      | val fields = str.split("::")
      | assert(fields.size == 4)
      | Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
      | }
      }
      <console>:29: error: Rating.type does not take parameters
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
      ^

  The standard Scala REPL shows the same error.

  The Scala/spark-shell REPL has some quirks (e.g. packages are also not well supported).

  The reason for the errors is that the Scala/spark-shell REPL discards previous definitions when we define an object with the same name as the class. Solution: rename the object Rating.

  ## How was this patch tested?
  Manually tested:
  1). ./bin/run-example ALSExample
  2). copy & paste the example from the generated document. It works fine.

  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #13110 from wangmiao1981/repl.

* [SPARK-14434][ML] User guide doc and examples for GaussianMixture in spark.ml | wm624@hotmail.com | 2016-05-17 | 3 | -0/+170

  ## What changes were proposed in this pull request?
  Add guide doc and examples for GaussianMixture in Spark.ml in Java, Scala and Python.

  ## How was this patch tested?
  Manual compile and test all examples

  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #12788 from wangmiao1981/example.

* [SPARK-14979][ML][PYSPARK] Add examples for GeneralizedLinearRegression | Yanbo Liang | 2016-05-16 | 3 | -0/+227

  ## What changes were proposed in this pull request?
  Add Scala/Java/Python examples for ```GeneralizedLinearRegression```.

  ## How was this patch tested?
  They are examples and have been tested offline.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12754 from yanboliang/spark-14979.

* [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView | Sean Zhong | 2016-05-12 | 6 | -10/+10

  ## What changes were proposed in this pull request?
  Deprecates registerTempTable and adds dataset.createTempView, dataset.createOrReplaceTempView.

  ## How was this patch tested?
  Unit tests.

  Author: Sean Zhong <seanzhong@databricks.com>
  Closes #12945 from clockfly/spark-15171.

* [SPARK-15031][SPARK-15134][EXAMPLE][DOC] Use SparkSession and update indent in examples | Zheng RuiFeng | 2016-05-11 | 34 | -151/+192

  ## What changes were proposed in this pull request?
  1, Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)
  2, Update indent for `SparkContext` according to [SPARK-15134](https://issues.apache.org/jira/browse/SPARK-15134)
  3, BTW, remove some duplicate space and add missing '.'

  ## How was this patch tested?
  manual tests

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13050 from zhengruifeng/use_sparksession.