path: root/examples/src/main/java
Commit message / Author / Age / Files / Lines
* [MINOR][BUILD] Fix Java CheckStyle Error (Weiqing Yang, 2016-08-24, 1 file, -5/+6)
  ## What changes were proposed in this pull request?
  As Spark 2.0.1 will be released soon (mentioned in the spark dev mailing list), besides the critical bugs, it's better to fix the code style errors before the release.
  Before:
  ```
  ./dev/lint-java
  Checkstyle checks failed at following occurrences:
  [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525] (sizes) LineLength: Line is longer than 100 characters (found 119).
  [ERROR] src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
  ```
  After:
  ```
  ./dev/lint-java
  Using `mvn` from path: /usr/local/bin/mvn
  Checkstyle checks passed.
  ```
  ## How was this patch tested?
  Manual.
  Author: Weiqing Yang <yangweiqing001@gmail.com>
  Closes #14768 from Sherry302/fixjavastyle.
* [SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation (hyukjinkwon, 2016-08-11, 2 files, -17/+19)
  ## What changes were proposed in this pull request?
  Originally this PR was based on #14491, but I realised that fixing the examples is more sensible than fixing the comments. This PR fixes three things below:
  - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>`, not `Dataset<String>`, in Java.
  - Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces; these are inconsistent across the examples.
  - Fix `StructuredNetworkWordCountWindowed` and `StructuredNetworkWordCount` in Java and Scala to initially load a `DataFrame` / `Dataset<Row>`, to be consistent with the comments and some examples in `structured-streaming-programming-guide.md`, and to match the Scala and Java versions to the Python one (the Python one loads it as a `DataFrame` initially).
  ## How was this patch tested?
  N/A
  Closes https://github.com/apache/spark/pull/14491
  Author: hyukjinkwon <gurwls223@gmail.com>
  Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>
  Closes #14564 from HyukjinKwon/SPARK-16886.
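For readers following the Dataset typing point above, here is a minimal Java sketch (not part of the commit; the class name and socket source are illustrative) showing that `load()` yields an untyped `Dataset<Row>` and that an explicit `as(Encoders.STRING())` is needed to obtain a `Dataset<String>`:

```java
import java.util.Arrays;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TypedLoadSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession
      .builder()
      .appName("TypedLoadSketch")
      .getOrCreate();

    // load() alone returns an untyped Dataset<Row> (a DataFrame in Scala terms).
    Dataset<Row> lines = spark
      .readStream()
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load();

    // Converting to a typed Dataset<String> requires an explicit encoder.
    Dataset<String> words = lines
      .as(Encoders.STRING())
      .flatMap(
        (FlatMapFunction<String, String>) line -> Arrays.asList(line.split(" ")).iterator(),
        Encoders.STRING());

    words.printSchema();
  }
}
```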
* [SPARK-16945] Fix Java Lint errors (Weiqing Yang, 2016-08-08, 1 file, -1/+2)
  ## What changes were proposed in this pull request?
  This PR is to fix the minor Java linter errors as follows:
  [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[42,10] (modifier) RedundantModifier: Redundant 'final' modifier.
  [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[97,10] (modifier) RedundantModifier: Redundant 'final' modifier.
  ## How was this patch tested?
  Manual test:
  dev/lint-java
  Using `mvn` from path: /usr/local/bin/mvn
  Checkstyle checks passed.
  Author: Weiqing Yang <yangweiqing001@gmail.com>
  Closes #14532 from Sherry302/master.
* [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs (Bryan Cutler, 2016-08-05, 27 files, -79/+195)
  ## What changes were proposed in this pull request?
  Improve example outputs to better reflect the functionality that is being presented. This mostly consisted of modifying what was printed at the end of the example, such as calling show() with truncate=False, but sometimes required minor tweaks in the example data to get relevant output. Explicitly set parameters when they are used as part of the example. Fixed Java examples that failed to run because of using old-style MLlib Vectors or a problem with the schema. Synced examples between different APIs.
  ## How was this patch tested?
  Ran each example for Scala, Python, and Java and made sure the output was legible on a terminal of width 100.
  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #14308 from BryanCutler/ml-examples-improve-output-SPARK-16260.
* [SPARK-16816] Modify Java example which is also reflected in documentation examples (sandy, 2016-08-02, 1 file, -0/+16)
  ## What changes were proposed in this pull request?
  Modify the Java example, which is also reflected in the documentation.
  ## How was this patch tested?
  Run test cases.
  Author: sandy <phalodi@gmail.com>
  Closes #14436 from phalodi/SPARK-16816.
* [SPARK-16734][EXAMPLES][SQL] Revise examples of all language bindings (Cheng Lian, 2016-08-02, 2 files, -7/+18)
  ## What changes were proposed in this pull request?
  This PR makes various minor updates to examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (JDBC example in Scala/Java/Python) are also fixed.
  ## How was this patch tested?
  Manually tested.
  Author: Cheng Lian <lian@databricks.com>
  Closes #14368 from liancheng/revise-examples.
* [SPARK-16800][EXAMPLES][ML] Fix Java examples that fail to run due to exception (Bryan Cutler, 2016-07-30, 12 files, -38/+49)
  ## What changes were proposed in this pull request?
  Some Java examples are using mllib.linalg.Vectors instead of ml.linalg.Vectors, which causes an exception when run. Also, some Java examples incorrectly specify data types in the schema, also causing an exception.
  ## How was this patch tested?
  Ran the corrected examples locally.
  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #14405 from BryanCutler/java-examples-ml.Vectors-fix-SPARK-16800.
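A minimal Java sketch of the pattern the fix moves to (the class name and data are illustrative, not from the commit): build rows with `org.apache.spark.ml.linalg.Vectors` and declare the vector column with the matching ml `VectorUDT` in the schema, as the bundled Java examples do:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class MlVectorsSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("MlVectorsSketch").getOrCreate();

    // Use org.apache.spark.ml.linalg.Vectors (not mllib.linalg) with the DataFrame-based ML API.
    List<Row> data = Arrays.asList(
      RowFactory.create(0.0, Vectors.dense(1.0, 0.5, -1.0)),
      RowFactory.create(1.0, Vectors.dense(2.0, 1.0, 1.0)));

    // The vector column must be declared with the ml VectorUDT in the schema.
    StructType schema = new StructType(new StructField[]{
      new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
      new StructField("features", new VectorUDT(), false, Metadata.empty())
    });

    Dataset<Row> df = spark.createDataFrame(data, schema);
    df.show();
    spark.stop();
  }
}
```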
* [SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding (Cheng Lian, 2016-07-23, 1 file, -1/+1)
  This PR is based on PR #14098 authored by wangmiao1981.
  ## What changes were proposed in this pull request?
  This PR replaces the original Python Spark SQL example file with the following three files:
  - `sql/basic.py`: Demonstrates basic Spark SQL features.
  - `sql/datasource.py`: Demonstrates various Spark SQL data sources.
  - `sql/hive.py`: Demonstrates Spark SQL Hive interaction.
  This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.
  ## How was this patch tested?
  Manually tested.
  Author: wm624@hotmail.com <wm624@hotmail.com>
  Author: Cheng Lian <lian@databricks.com>
  Closes #14317 from liancheng/py-examples-update.
* [MINOR][BUILD] Fix Java Linter `LineLength` errors (Dongjoon Hyun, 2016-07-19, 1 file, -2/+4)
  ## What changes were proposed in this pull request?
  This PR fixes four Java linter `LineLength` errors. Those are all `LineLength` errors, but we had better remove all Java linter errors before the release.
  ## How was this patch tested?
  After passing Jenkins, `./dev/lint-java`.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #14255 from dongjoon-hyun/minor_java_linter.
* [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update (Cheng Lian, 2016-07-18, 2 files, -2/+2)
  ## What changes were proposed in this pull request?
  This PR moves one last hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that all "Sql" in the file names are updated to "SQL".
  ## How was this patch tested?
  Manually verified the generated HTML page.
  Author: Cheng Lian <lian@databricks.com>
  Closes #14245 from liancheng/minor-scala-example-update.
* [SPARK-16403][EXAMPLES] Cleanup to remove unused imports, consistent style, minor fixes (Bryan Cutler, 2016-07-14, 5 files, -215/+16)
  ## What changes were proposed in this pull request?
  Cleanup of examples, mostly from PySpark-ML, to fix minor issues: unused imports, style consistency, pipeline_example is a duplicate, use the future print function, and a spelling error.
  * The "Pipeline Example" is duplicated by "Simple Text Classification Pipeline" in Scala, Python, and Java.
  * "Estimator Transformer Param Example" is duplicated by "Simple Params Example" in Scala, Python and Java.
  * Synced random_forest_classifier_example.py with Scala by adding an IndexToString label converter.
  * Synced train_validation_split.py (in Scala, ModelSelectionViaTrainValidationExample) by adjusting the data split and adding a grid for the intercept.
  * RegexTokenizer was doing nothing in tokenizer_example.py and JavaTokenizerExample.java; synced with the Scala version.
  ## How was this patch tested?
  Local tests and running the modified examples.
  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #14081 from BryanCutler/examples-cleanup-SPARK-16403.
* [SPARK-16303][DOCS][EXAMPLES] Updated SQL programming guide and examples (aokolnychyi, 2016-07-13, 4 files, -186/+684)
  - Hard-coded Spark SQL sample snippets were moved into source files under the examples sub-project.
  - Removed the inconsistency between Scala and Java Spark SQL examples.
  - Scala and Java Spark SQL examples were updated.
  The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review.
  ![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png)
  Author: aokolnychyi <okolnychyyanton@gmail.com>
  Closes #14119 from aokolnychyi/spark_16303.
* [SPARK-16114][SQL] structured streaming event time window example (James Thomas, 2016-07-11, 2 files, -2/+118)
  ## What changes were proposed in this pull request?
  A structured streaming example with event time windowing.
  ## How was this patch tested?
  Run locally
  Author: James Thomas <jamesjoethomas@gmail.com>
  Closes #13957 from jjthomas/current.
* [SPARK-16294][SQL] Labelling support for the include_example Jekyll plugin (Cheng Lian, 2016-06-29, 1 file, -0/+5)
  ## What changes were proposed in this pull request?
  This PR adds labelling support for the `include_example` Jekyll plugin, so that we may split a single source file into multiple line blocks with different labels, and include them in multiple code snippets in the generated HTML page.
  ## How was this patch tested?
  Manually tested.
  <img width="923" alt="screenshot at jun 29 19-53-21" src="https://cloud.githubusercontent.com/assets/230655/16451099/66a76db2-3e33-11e6-84fb-63104c2f0688.png">
  Author: Cheng Lian <lian@databricks.com>
  Closes #13972 from liancheng/include-example-with-labels.
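As an illustration only (the class name and label are hypothetical, and the exact marker syntax should be checked against the plugin source), a labelled block in a Java example file looks roughly like the sketch below; the guide would then pull in just that block by referencing the label together with the file path in an `include_example` tag:

```java
import org.apache.spark.sql.SparkSession;

public class LabelledSnippetSketch {
  public static void main(String[] args) {
    // $example on:init_session$
    // Only the lines between the labelled on/off markers are rendered in the guide.
    SparkSession spark = SparkSession
      .builder()
      .appName("LabelledSnippetSketch")
      .getOrCreate();
    // $example off:init_session$

    spark.stop();
  }
}
```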
* [SPARK-16114][SQL] structured streaming network word count examples (James Thomas, 2016-06-28, 1 file, -0/+82)
  ## What changes were proposed in this pull request?
  Network word count example for structured streaming.
  ## How was this patch tested?
  Run locally
  Author: James Thomas <jamesjoethomas@gmail.com>
  Author: James Thomas <jamesthomas@Jamess-MacBook-Pro.local>
  Closes #13816 from jjthomas/master.
* [SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer (GayathriMurali, 2016-06-24, 1 file, -1/+6)
  ## What changes were proposed in this pull request?
  Made changes to HashingTF, QuantileVectorizer and CountVectorizer.
  Author: GayathriMurali <gayathri.m@intel.com>
  Closes #13745 from GayathriMurali/SPARK-15997.
* [SPARK-15608][ML][EXAMPLES][DOC] add examples and documents of ml.isotonic regression (WeichenXu, 2016-06-16, 2 files, -7/+74)
  ## What changes were proposed in this pull request?
  - add ml doc for ml isotonic regression
  - add scala example for ml isotonic regression
  - add java example for ml isotonic regression
  - add python example for ml isotonic regression
  - modify scala example for mllib isotonic regression
  - modify java example for mllib isotonic regression
  - modify python example for mllib isotonic regression
  - add data/mllib/sample_isotonic_regression_libsvm_data.txt
  - delete data/mllib/sample_isotonic_regression_data.txt
  ## How was this patch tested?
  N/A
  Author: WeichenXu <WeichenXu123@outlook.com>
  Closes #13381 from WeichenXu123/add_isotonic_regression_doc.
* [SPARK-15898][SQL] DataFrameReader.text should return DataFrame (Wenchen Fan, 2016-06-12, 5 files, -5/+5)
  ## What changes were proposed in this pull request?
  We want to maintain API compatibility for DataFrameReader.text, and will introduce a new API called DataFrameReader.textFile which returns Dataset[String].
  Affected PRs:
  https://github.com/apache/spark/pull/11731
  https://github.com/apache/spark/pull/13104
  https://github.com/apache/spark/pull/13184
  ## How was this patch tested?
  N/A
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #13604 from cloud-fan/revert.
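A short Java sketch of the distinction (class name and file path are illustrative): `text()` keeps the old untyped result, while the new `textFile()` returns a typed Dataset of strings:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TextVsTextFileSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("TextVsTextFileSketch").getOrCreate();

    // text() keeps the old behaviour: an untyped DataFrame (Dataset<Row>) with a single "value" column.
    Dataset<Row> df = spark.read().text("examples/src/main/resources/people.txt");

    // textFile() is the new API that returns a typed Dataset<String>.
    Dataset<String> ds = spark.read().textFile("examples/src/main/resources/people.txt");

    df.printSchema();
    ds.printSchema();
    spark.stop();
  }
}
```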
* [SPARK-15086][CORE][STREAMING] Deprecate old Java accumulator API (Sean Owen, 2016-06-12, 1 file, -5/+5)
  ## What changes were proposed in this pull request?
  - Deprecate old Java accumulator API; should use Scala now
  - Update Java tests and examples
  - Don't bother testing old accumulator API in Java 8 (too)
  - (fix a misspelling too)
  ## How was this patch tested?
  Jenkins tests
  Author: Sean Owen <sowen@cloudera.com>
  Closes #13606 from srowen/SPARK-15086.
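A small Java sketch (illustrative, not from the commit) of the non-deprecated accumulator style: obtain a `LongAccumulator` from the underlying Scala `SparkContext` instead of calling the old `JavaSparkContext.accumulator(...)` API:

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.LongAccumulator;

public class AccumulatorSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("AccumulatorSketch").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // New-style accumulator obtained from the Scala SparkContext;
    // jsc.accumulator(0) is the old, now-deprecated path.
    LongAccumulator counter = jsc.sc().longAccumulator("counter");

    jsc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> counter.add(1));
    System.out.println("count = " + counter.value());
    spark.stop();
  }
}
```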
* [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples (Yanbo Liang, 2016-06-06, 6 files, -12/+12)
  ## What changes were proposed in this pull request?
  Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated `precision` in `MulticlassClassificationEvaluator`, many ML examples are broken:
  ```python
  pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
  ```
  We should use `accuracy` to replace `precision` in these examples.
  ## How was this patch tested?
  Offline tests.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13519 from yanboliang/spark-15771.
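A minimal Java sketch of the replacement metric name (the evaluator column names are illustrative):

```java
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;

public class EvaluatorMetricSketch {
  public static void main(String[] args) {
    // "precision" was deprecated by SPARK-15617; "accuracy" is the replacement metric name.
    MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy");

    System.out.println("metric: " + evaluator.getMetricName());
  }
}
```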
* [SPARK-15605][ML][EXAMPLES] Fix broken ML JavaDeveloperApiExample. (Yanbo Liang, 2016-06-02, 1 file, -240/+0)
  ## What changes were proposed in this pull request?
  See [SPARK-15605](https://issues.apache.org/jira/browse/SPARK-15605) for the detail of this bug. This PR fixes 2 major bugs in this example:
  * The Java example class uses the Param `maxIter`, and it will fail when calling `Param.shouldOwn`. We need to add a public method which returns the `maxIter` object, because `Params.params` uses Java reflection to list all public methods whose return type is `Param`, and invokes them to get all defined param objects in the instance.
  * The `uid` member defined in the Java class will be initialized after Scala traits such as `HasFeaturesCol`. So when `HasFeaturesCol` is being constructed, it gets `uid` as null, which causes the `Param.shouldOwn` check to fail.
  So, here are my changes:
  * Add a public method: `public IntParam getMaxIterParam() {return maxIter;}`
  * Use a Java anonymous class overriding `uid()` to define the `uid`, which solves the second problem described above.
  * To make `getMaxIterParam` invokable via Java reflection, the two classes (MyJavaLogisticRegression and MyJavaLogisticRegressionModel) must be public, so I made them inner public static classes.
  ## How was this patch tested?
  Offline tests.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13353 from yanboliang/spark-15605.
* [SPARK-15449][MLLIB][EXAMPLE] Wrong Data Format - Documentation Issue (wm624@hotmail.com, 2016-05-27, 1 file, -2/+2)
  ## What changes were proposed in this pull request?
  In the MLlib naive Bayes example, the Scala and Python examples don't use libsvm data, but the Java one does. I made changes in the Scala and Python examples to use the same libsvm data as the Java example.
  ## How was this patch tested?
  Manual tests
  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #13301 from wangmiao1981/example.
* [MINOR] Fix Typos 'a -> an' (Zheng RuiFeng, 2016-05-26, 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  `a` -> `an`
  I used a regex to generate potential error lines: `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala` and reviewed them line by line.
  ## How was this patch tested?
  Local build, `lint-java` checking
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13317 from zhengruifeng/a_an.
* [SPARK-15457][MLLIB][ML] Eliminate some warnings from MLlib about deprecations (Sean Owen, 2016-05-26, 2 files, -5/+3)
  ## What changes were proposed in this pull request?
  Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
  * WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
  * Use in PythonMLlibAPI: Change to using private constructors
  * Streaming algs: No warnings after we un-deprecate the classes
  * Examples: Deprecate or change ones which use deprecated APIs
  * MulticlassMetrics fields (precision, etc.)
  * LinearRegressionSummary.model field
  ## How was this patch tested?
  Existing tests. Checked for warnings manually.
  Author: Sean Owen <sowen@cloudera.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #13314 from jkbradley/warning-cleanups.
* [SPARK-15031][EXAMPLE] Use SparkSession in examples (Zheng RuiFeng, 2016-05-20, 7 files, -38/+62)
  ## What changes were proposed in this pull request?
  Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031).
  `MLLIB` is not recommended to use now, so examples in `MLLIB` are ignored in this PR.
  `StreamingContext` can not be directly obtained from `SparkSession`, so examples in `Streaming` are ignored too.
  cc andrewor14
  ## How was this patch tested?
  Manual tests with spark-submit.
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13164 from zhengruifeng/use_sparksession_ii.
* [SPARK-15398][ML] Update the warning message to recommend ML usage (Zheng RuiFeng, 2016-05-19, 1 file, -4/+2)
  ## What changes were proposed in this pull request?
  MLlib is not recommended to use, and some methods are even deprecated. Update the warning message to recommend ML usage.
  ```
  def showWarning() {
    System.err.println(
      """WARN: This is a naive implementation of Logistic Regression and is given as an example!
        |Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
        |org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
        |for more conventional use.
      """.stripMargin)
  }
  ```
  To
  ```
  def showWarning() {
    System.err.println(
      """WARN: This is a naive implementation of Logistic Regression and is given as an example!
        |Please use org.apache.spark.ml.classification.LogisticRegression
        |for more conventional use.
      """.stripMargin)
  }
  ```
  ## How was this patch tested?
  local build
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13190 from zhengruifeng/update_recd.
* [SPARK-15296][MLLIB] Refactor All Java Tests that use SparkSession (Sandeep Singh, 2016-05-19, 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  Refactor all Java tests that use SparkSession to extend SharedSparkSession.
  ## How was this patch tested?
  Existing tests
  Author: Sandeep Singh <sandeep@techaddict.me>
  Closes #13101 from techaddict/SPARK-15296.
* [SPARK-15031][EXAMPLES][FOLLOW-UP] Make Python param example working with SparkSession (hyukjinkwon, 2016-05-19, 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  It seems most of the Python examples were changed to use SparkSession by https://github.com/apache/spark/pull/12809. That PR said both examples below:
  - `simple_params_example.py`
  - `aft_survival_regression.py`
  are not changed because they do not work. It seems `aft_survival_regression.py` was changed by https://github.com/apache/spark/pull/13050 but `simple_params_example.py` is not yet.
  This PR corrects the example and makes it use SparkSession.
  In more detail, it seems `threshold` was replaced with `thresholds` here and there by https://github.com/apache/spark/commit/5a23213c148bfe362514f9c71f5273ebda0a848a. However, when it calls `lr.fit(training, paramMap)` this overwrites the values. So, `threshold` was 5 and `thresholds` becomes 5.5 (by `1 / (1 + thresholds(0) / thresholds(1)`). According to the comment below, this is not allowed: https://github.com/apache/spark/blob/354f8f11bd4b20fa99bd67a98da3525fd3d75c81/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L58-L61. So, in this PR, it sets the equivalent value so that this does not throw an exception.
  ## How was this patch tested?
  Manually (`mvn package -DskipTests && spark-submit simple_params_example.py`)
  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #13135 from HyukjinKwon/SPARK-15031.
* [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable (Sean Zhong, 2016-05-18, 2 files, -5/+5)
  ## What changes were proposed in this pull request?
  Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.
  ## How was this patch tested?
  This PR only changes the unit test code, examples, and comments. It should be safe. This is a follow up of PR https://github.com/apache/spark/pull/12945, which was merged.
  Author: Sean Zhong <seanzhong@databricks.com>
  Closes #13098 from clockfly/spark-15171-remove-deprecation.
* [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms (DB Tsai, 2016-05-17, 4 files, -9/+9)
  ## What changes were proposed in this pull request?
  Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix types in the new ML-pipeline-based APIs.
  ## How was this patch tested?
  Unit tests
  Author: DB Tsai <dbt@netflix.com>
  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #12627 from dbtsai/SPARK-14615-NewML.
* [SPARK-14434][ML] User guide doc and examples for GaussianMixture in spark.ml (wm624@hotmail.com, 2016-05-17, 1 file, -0/+64)
  ## What changes were proposed in this pull request?
  Add a guide doc and examples for GaussianMixture in spark.ml, in Java, Scala and Python.
  ## How was this patch tested?
  Manually compiled and tested all examples.
  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #12788 from wangmiao1981/example.
* [SPARK-14979][ML][PYSPARK] Add examples for GeneralizedLinearRegression (Yanbo Liang, 2016-05-16, 1 file, -0/+83)
  ## What changes were proposed in this pull request?
  Add Scala/Java/Python examples for `GeneralizedLinearRegression`.
  ## How was this patch tested?
  They are examples and have been tested offline.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12754 from yanboliang/spark-14979.
* [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView (Sean Zhong, 2016-05-12, 2 files, -5/+5)
  ## What changes were proposed in this pull request?
  Deprecates registerTempTable and adds dataset.createTempView and dataset.createOrReplaceTempView.
  ## How was this patch tested?
  Unit tests.
  Author: Sean Zhong <seanzhong@databricks.com>
  Closes #12945 from clockfly/spark-15171.
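A short Java sketch of the replacement API (the class name is illustrative; the JSON file is the standard example resource): register a temporary view instead of calling the deprecated `registerTempTable`, then query it with SQL:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TempViewSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("TempViewSketch").getOrCreate();

    Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");

    // Replacement for the deprecated df.registerTempTable("people").
    df.createOrReplaceTempView("people");

    Dataset<Row> adults = spark.sql("SELECT name FROM people WHERE age > 20");
    adults.show();
    spark.stop();
  }
}
```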
* [SPARK-15031][SPARK-15134][EXAMPLE][DOC] Use SparkSession and update indent in examples (Zheng RuiFeng, 2016-05-11, 17 files, -42/+70)
  ## What changes were proposed in this pull request?
  1. Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)
  2. Update indent for `SparkContext` according to [SPARK-15134](https://issues.apache.org/jira/browse/SPARK-15134)
  3. BTW, remove some duplicate spaces and add missing '.'
  ## How was this patch tested?
  manual tests
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13050 from zhengruifeng/use_sparksession.
* [SPARK-14933][HOTFIX] Replace `sqlContext` with `spark`. (Dongjoon Hyun, 2016-05-11, 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  This fixes compile errors.
  ## How was this patch tested?
  Pass the Jenkins tests.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13053 from dongjoon-hyun/hotfix_sqlquerysuite.
* [SPARK-15150][EXAMPLE][DOC] Update LDA examples (Zheng RuiFeng, 2016-05-11, 2 files, -47/+27)
  ## What changes were proposed in this pull request?
  1. Create a libsvm-type dataset for LDA: `data/mllib/sample_lda_libsvm_data.txt`
  2. Add a Python example
  3. Directly read the datafile in examples
  4. BTW, change to `SparkSession` in `aft_survival_regression.py`
  ## How was this patch tested?
  Manual tests: `./bin/spark-submit examples/src/main/python/ml/lda_example.py`
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12927 from zhengruifeng/lda_pe.
* [SPARK-15149][EXAMPLE][DOC] update kmeans example (Zheng RuiFeng, 2016-05-11, 1 file, -46/+14)
  ## What changes were proposed in this pull request?
  A Python example for ml.kmeans already exists, but is not included in the user guide.
  1. Small changes like: `example_on` `example_off`
  2. Add it to the user guide
  3. Update examples to directly read the datafile
  ## How was this patch tested?
  Manual tests: `./bin/spark-submit examples/src/main/python/ml/kmeans_example.py`
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12925 from zhengruifeng/km_pe.
* [SPARK-14340][EXAMPLE][DOC] Update Examples and User Guide for ml.BisectingKMeans (Zheng RuiFeng, 2016-05-11, 1 file, -31/+18)
  ## What changes were proposed in this pull request?
  1. Add BisectingKMeans to ml-clustering.md
  2. Add the missing Scala BisectingKMeansExample
  3. Create a new datafile `data/mllib/sample_kmeans_data.txt`
  ## How was this patch tested?
  manual tests
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #11844 from zhengruifeng/doc_bkm.
* [SPARK-15141][EXAMPLE][DOC] Update OneVsRest Examples (Zheng RuiFeng, 2016-05-11, 1 file, -184/+30)
  ## What changes were proposed in this pull request?
  1. Add a Python example for OneVsRest
  2. Remove args-parsing
  ## How was this patch tested?
  Manual tests: `./bin/spark-submit examples/src/main/python/ml/one_vs_rest_example.py`
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12920 from zhengruifeng/ovr_pe.
* [MINOR][DOCS] Remove remaining sqlContext in documentation at examples (hyukjinkwon, 2016-05-09, 1 file, -1/+1)
  This PR removes `sqlContext` in examples. Actual usage was all replaced in https://github.com/apache/spark/pull/12809 but there are some in comments.
  Manual style checking.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #13006 from HyukjinKwon/minor-docs.
* [MINOR][ML][PYSPARK] ALS example cleanup (Nick Pentreath, 2016-05-07, 1 file, -5/+1)
  Cleans up ALS examples by removing unnecessary casts to double for the `rating` and `prediction` columns, since `RegressionEvaluator` now supports `Double` & `Float` input types.
  ## How was this patch tested?
  Manual compile and run with `run-example ml.ALSExample` and `spark-submit examples/src/main/python/ml/als_example.py`.
  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #12892 from MLnick/als-examples-cleanup.
* [SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update binary_classification_metrics_example.py (Dongjoon Hyun, 2016-05-05, 38 files, -58/+165)
  ## What changes were proposed in this pull request?
  This issue addresses the comments in SPARK-15031 and also fixes java-linter errors.
  - Use multiline format in SparkSession builder patterns.
  - Update `binary_classification_metrics_example.py` to use `SparkSession`.
  - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)
  ## How was this patch tested?
  After passing the Jenkins tests and running `dev/lint-java` manually.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12911 from dongjoon-hyun/SPARK-15134.
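The multiline builder format referred to above, as a minimal Java sketch (app name and config key are placeholders):

```java
import org.apache.spark.sql.SparkSession;

public class BuilderPatternSketch {
  public static void main(String[] args) {
    // Multiline SparkSession builder format, one call per line.
    SparkSession spark = SparkSession
      .builder()
      .appName("BuilderPatternSketch")
      .config("spark.some.config.option", "some-value")
      .getOrCreate();

    System.out.println("Spark version: " + spark.version());
    spark.stop();
  }
}
```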
* [SPARK-15031][EXAMPLE] Use SparkSession in Scala/Python/Java example. (Dongjoon Hyun, 2016-05-04, 52 files, -509/+319)
  ## What changes were proposed in this pull request?
  This PR aims to update Scala/Python/Java examples by replacing `SQLContext` with the newly added `SparkSession`.
  - Use the **SparkSession Builder Pattern** in 154 (Scala 55, Java 52, Python 47) files.
  - Add `getConf` in the Python SparkContext class: `python/pyspark/context.py`
  - Replace the **SQLContext Singleton Pattern** with the **SparkSession Singleton Pattern**:
    - `SqlNetworkWordCount.scala`
    - `JavaSqlNetworkWordCount.java`
    - `sql_network_wordcount.py`
  Now, `SQLContext`s are used only in R examples and the following two Python examples. The Python examples are untouched in this PR since they already fail with some unknown issue.
  - `simple_params_example.py`
  - `aft_survival_regression.py`
  ## How was this patch tested?
  Manual.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12809 from dongjoon-hyun/SPARK-15031.
* [SPARK-14756][CORE] Use parseLong instead of valueOf (Azeem Jiva, 2016-04-26, 1 file, -4/+4)
  ## What changes were proposed in this pull request?
  Use Long.parseLong, which returns a primitive. Using a series of append() calls reduces the creation of an extra StringBuilder.
  ## How was this patch tested?
  Unit tests
  Author: Azeem Jiva <azeemj@gmail.com>
  Closes #12520 from javawithjiva/minor.
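A tiny Java illustration (the values are made up) of the two points in the commit message:

```java
public class ParseLongSketch {
  public static void main(String[] args) {
    String field = "1462538400000";

    // Long.valueOf returns a boxed java.lang.Long.
    Long boxed = Long.valueOf(field);

    // Long.parseLong returns the primitive long directly, avoiding the extra box.
    long primitive = Long.parseLong(field);

    // Chained append() calls reuse a single StringBuilder instead of creating
    // intermediate strings via repeated concatenation.
    String line = new StringBuilder()
      .append("timestamp=").append(primitive)
      .append(", boxed=").append(boxed)
      .toString();

    System.out.println(line);
  }
}
```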
* [SPARK-14873][CORE] Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object (Sean Owen, 2016-04-23, 1 file, -12/+9)
  ## What changes were proposed in this pull request?
  Java `sampleByKey` methods should accept `Map` with `java.lang.Double` values.
  ## How was this patch tested?
  Existing (updated) Jenkins tests
  Author: Sean Owen <sowen@cloudera.com>
  Closes #12637 from srowen/SPARK-14873.
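A compact Java sketch (keys and values are illustrative) of how a caller supplies per-key sampling fractions as plain `java.lang.Double` values after the change:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class SampleByKeySketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("SampleByKeySketch").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    JavaPairRDD<Integer, String> data = jsc.parallelizePairs(Arrays.asList(
      new Tuple2<>(1, "a"), new Tuple2<>(1, "b"), new Tuple2<>(2, "c"), new Tuple2<>(2, "d")));

    // Per-key sampling fractions, keyed by the pair key, with java.lang.Double values.
    Map<Integer, Double> fractions = new HashMap<>();
    fractions.put(1, 0.5);
    fractions.put(2, 1.0);

    JavaPairRDD<Integer, String> sample = data.sampleByKey(false, fractions);
    System.out.println(sample.collect());
    spark.stop();
  }
}
```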
* [SPARK-8393][STREAMING] JavaStreamingContext#awaitTermination() throws non-declared InterruptedException (Sean Owen, 2016-04-21, 9 files, -11/+13)
  ## What changes were proposed in this pull request?
  `JavaStreamingContext.awaitTermination` methods should be declared as `throws[InterruptedException]` so that this exception can be handled in Java code. Note this is not just a doc change, but an API change, since now (in Java) the method has a checked exception to handle. All await-like methods in Java APIs behave this way, so seems worthwhile for 2.0.
  ## How was this patch tested?
  Jenkins tests
  Author: Sean Owen <sowen@cloudera.com>
  Closes #12418 from srowen/SPARK-8393.
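A minimal Java sketch (host, port, and batch interval are placeholders) of how the now-checked exception surfaces to Java callers that do not re-declare it:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class AwaitTerminationSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("AwaitTerminationSketch");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
    lines.print();

    jssc.start();
    try {
      // Declared as "throws InterruptedException" after this change, so Java callers
      // must catch it (or declare it) explicitly.
      jssc.awaitTermination();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```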
* [SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF (Yuhao Yang, 2016-04-20, 1 file, -0/+2)
  ## What changes were proposed in this pull request?
  Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
  ## How was this patch tested?
  Unit tests and doc generation
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #12454 from hhbyyh/tfdoc.
* [SPARK-13089][ML] [Doc] spark.ml Naive Bayes user guide and examples (Yuhao Yang, 2016-04-13, 1 file, -0/+64)
  JIRA: https://issues.apache.org/jira/browse/SPARK-13089
  Add a section in ml-classification.md for the NaiveBayes DataFrame-based API, plus example code (using include_example to clip code from examples/ folder files).
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #11015 from hhbyyh/naiveBayesDoc.
* [MINOR][ML] Fixed MLlib build warnings (Joseph K. Bradley, 2016-04-12, 1 file, -0/+1)
  ## What changes were proposed in this pull request?
  Fixes to eliminate warnings during package and doc builds.
  ## How was this patch tested?
  Existing unit tests
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12263 from jkbradley/warning-cleanups.
* [SPARK-14500] [ML] Accept Dataset[_] instead of DataFrame in MLlib APIs (Xiangrui Meng, 2016-04-11, 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  This PR updates MLlib APIs to accept `Dataset[_]` as input where `DataFrame` was the input type. This PR doesn't change the output type. In Java, `Dataset[_]` maps to `Dataset<?>`, which includes `Dataset<Row>`. Some implementations were changed in order to return `DataFrame`. Tests and examples were updated. Note that this is a breaking change for subclasses of Transformer/Estimator.
  Lol, we don't have to rename the input argument, which has been `dataset` since Spark 1.2.
  TODOs:
  - [x] update MiMaExcludes (seems all covered by explicit filters from SPARK-13920)
  - [x] Python
  - [x] add a new test to accept Dataset[LabeledPoint]
  - [x] remove unused imports of Dataset
  ## How was this patch tested?
  Existing unit tests with some modifications. cc: rxin jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #12274 from mengxr/SPARK-14500.
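A small Java sketch of passing a `Dataset<Row>` where the Java signature is now `Dataset<?>` (the estimator choice is illustrative; the libsvm file is the one mentioned in the SPARK-14340 entry above):

```java
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DatasetWildcardSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("DatasetWildcardSketch").getOrCreate();

    // fit()/transform() accept Dataset<?> in Java, so a Dataset<Row> can be passed directly.
    Dataset<Row> dataset = spark.read().format("libsvm")
      .load("data/mllib/sample_kmeans_data.txt");

    KMeansModel model = new KMeans().setK(2).setSeed(1L).fit(dataset);
    model.transform(dataset).show();
    spark.stop();
  }
}
```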