aboutsummaryrefslogtreecommitdiff
path: root/examples/src/main
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-14744][EXAMPLES] Clean up examples packaging, remove outdated examples.Marcelo Vanzin2016-04-259-904/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | First, make all dependencies in the examples module provided, and explicitly list a couple of ones that somehow are promoted to compile by maven. This means that to run streaming examples, the streaming connector package needs to be provided to run-examples using --packages or --jars, just like regular apps. Also, remove a couple of outdated examples. HBase has had Spark bindings for a while and is even including them in the HBase distribution in the next version, making the examples obsolete. The same applies to Cassandra, which seems to have a proper Spark binding library already. I just tested the build, which passes, and ran SparkPi. The examples jars directory now has only two jars: ``` $ ls -1 examples/target/scala-2.11/jars/ scopt_2.11-3.3.0.jar spark-examples_2.11-2.0.0-SNAPSHOT.jar ``` Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #12544 from vanzin/SPARK-14744.
* [SPARK-14883][DOCS] Fix wrong R examples and make them up-to-dateDongjoon Hyun2016-04-242-12/+12
| | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules. - Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later. - Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency - Fix datatypes in `sparkr.md`. - Update a data result in `sparkr.md`. - Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`. - Other minor syntax fixes and a typo. ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12649 from dongjoon-hyun/SPARK-14883.
* [SPARK-14873][CORE] Java sampleByKey methods take ju.Map but with Scala ↵Sean Owen2016-04-231-12/+9
| | | | | | | | | | | | | | | | Double values; results in type Object ## What changes were proposed in this pull request? Java `sampleByKey` methods should accept `Map` with `java.lang.Double` values ## How was this patch tested? Existing (updated) Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #12637 from srowen/SPARK-14873.
* [SPARK-8393][STREAMING] JavaStreamingContext#awaitTermination() throws ↵Sean Owen2016-04-219-11/+13
| | | | | | | | | | | | | | | | non-declared InterruptedException ## What changes were proposed in this pull request? `JavaStreamingContext.awaitTermination` methods should be declared as `throws[InterruptedException]` so that this exception can be handled in Java code. Note this is not just a doc change, but an API change, since now (in Java) the method has a checked exception to handle. All await-like methods in Java APIs behave this way, so seems worthwhile for 2.0. ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #12418 from srowen/SPARK-8393.
* [SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTFYuhao Yang2016-04-203-0/+6
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this. ## How was this patch tested? unit tests and doc generation Author: Yuhao Yang <hhbyyh@gmail.com> Closes #12454 from hhbyyh/tfdoc.
* [SPARK-14515][DOC] Add python example for ChiSqSelectorZheng RuiFeng2016-04-181-0/+44
| | | | | | | | | | | | ## What changes were proposed in this pull request? Add the missing python example for ChiSqSelector ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12283 from zhengruifeng/chi2_pe.
* [MINOR] Revert removing explicit typing (changed in some examples and ↵hyukjinkwon2016-04-182-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | StatFunctions) ## What changes were proposed in this pull request? This PR reverts some changes in https://github.com/apache/spark/pull/12413. (please see the discussion in that PR). from ```scala words.foreachRDD { (rdd, time) => ... ``` to ```scala words.foreachRDD { (rdd: RDD[String], time: Time) => ... ``` Also, this was discussed in dev-mailing list, [here](http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-Scala-style-explicit-typing-within-transformation-functions-and-anonymous-val-td17173.html) ## How was this patch tested? This was tested with `sbt scalastyle`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12452 from HyukjinKwon/revert-explicit-typing.
* [SPARK-14299][EXAMPLES] Remove duplications for scala.examples.mlXusen Yin2016-04-184-192/+17
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14299 Delete duplications in scala/examples/ml. TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ## How was this patch tested? Existing tests passed. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Xusen Yin <yinxusen@gmail.com> Closes #12366 from yinxusen/SPARK-14299-2.
* [MINOR] Remove inappropriate type notation and extra anonymous closure ↵hyukjinkwon2016-04-163-8/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | within functional transformations ## What changes were proposed in this pull request? This PR removes - Inappropriate type notations For example, from ```scala words.foreachRDD { (rdd: RDD[String], time: Time) => ... ``` to ```scala words.foreachRDD { (rdd, time) => ... ``` - Extra anonymous closure within functional transformations. For example, ```scala .map(item => { ... }) ``` which can be just simply as below: ```scala .map { item => ... } ``` and corrects some obvious style nits. ## How was this patch tested? This was tested after adding rules in `scalastyle-config.xml`, which ended up with not finding all perfectly. The rules applied were below: - For the first correction, ```xml <check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true"> <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters> </check> ``` ```xml <check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true"> <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters> </check> ``` - For the second correction ```xml <check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true"> <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters> </check> ``` **Those rules were not added** Author: hyukjinkwon <gurwls223@gmail.com> Closes #12413 from HyukjinKwon/SPARK-style.
* [MINOR][SQL] Remove extra anonymous closure within functional transformationshyukjinkwon2016-04-142-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR removes extra anonymous closure within functional transformations. For example, ```scala .map(item => { ... }) ``` which can be just simply as below: ```scala .map { item => ... } ``` ## How was this patch tested? Related unit tests and `sbt scalastyle`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12382 from HyukjinKwon/minor-extra-closers.
* [SPARK-13089][ML] [Doc] spark.ml Naive Bayes user guide and examplesYuhao Yang2016-04-133-0/+175
| | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-13089 Add section in ml-classification.md for NaiveBayes DataFrame-based API, plus example code (using include_example to clip code from examples/ folder files). Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11015 from hhbyyh/naiveBayesDoc.
* [SPARK-14509][DOC] Add python CountVectorizerExampleZheng RuiFeng2016-04-131-0/+44
| | | | | | | | | | | | ## What changes were proposed in this pull request? Add python CountVectorizerExample ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11917 from zhengruifeng/cv_pe.
* [SPARK-14508][BUILD] Add a new ScalaStyle Rule `OmitBracesInCase`Dongjoon Hyun2016-04-125-20/+10
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule. ``` case: Always omit braces in case clauses. ``` This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code. ## How was this patch tested? Pass the Jenkins tests (including Scala style checking) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12280 from dongjoon-hyun/SPARK-14508.
* [MINOR][ML] Fixed MLlib build warningsJoseph K. Bradley2016-04-121-0/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fixes to eliminate warnings during package and doc builds. ## How was this patch tested? Existing unit tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #12263 from jkbradley/warning-cleanups.
* [SPARK-14500] [ML] Accept Dataset[_] instead of DataFrame in MLlib APIsXiangrui Meng2016-04-112-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR updates MLlib APIs to accept `Dataset[_]` as input where `DataFrame` was the input type. This PR doesn't change the output type. In Java, `Dataset[_]` maps to `Dataset<?>`, which includes `Dataset<Row>`. Some implementations were changed in order to return `DataFrame`. Tests and examples were updated. Note that this is a breaking change for subclasses of Transformer/Estimator. Lol, we don't have to rename the input argument, which has been `dataset` since Spark 1.2. TODOs: - [x] update MiMaExcludes (seems all covered by explicit filters from SPARK-13920) - [x] Python - [x] add a new test to accept Dataset[LabeledPoint] - [x] remove unused imports of Dataset ## How was this patch tested? Exiting unit tests with some modifications. cc: rxin jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #12274 from mengxr/SPARK-14500.
* Update KMeansExample.scalaÖrjan Lundberg2016-04-101-1/+1
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? example does not work wo DataFrame import ## How was this patch tested? example doc only example does not work wo DataFrame import Author: Örjan Lundberg <orjan.lundberg@gmail.com> Closes #12277 from oluies/patch-1.
* [SPARK-14301][EXAMPLES] Java examples code merge and clean up.Yong Tang2016-04-108-534/+16
| | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This fix tries to remove duplicate Java code in examples/mllib and examples/ml. The following changes have been made: ``` deleted: ml/JavaCrossValidatorExample.java (duplicate of JavaModelSelectionViaCrossValidationExample.java) deleted: ml/JavaTrainValidationSplitExample.java (duplicated of JavaModelSelectionViaTrainValidationSplitExample.java) deleted: mllib/JavaFPGrowthExample.java (duplicate of JavaSimpleFPGrowth.java) deleted: mllib/JavaLDAExample.java (duplicate of JavaLatentDirichletAllocationExample.java) deleted: mllib/JavaKMeans.java (merged with JavaKMeansExample.java) deleted: mllib/JavaLR.java (duplicate of JavaLinearRegressionWithSGDExample.java) updated: mllib/JavaKMeansExample.java (merged with mllib/JavaKMeans.java) ``` ## How was this patch tested? Existing tests passed. Author: Yong Tang <yong.tang.github@outlook.com> Closes #12143 from yongtang/SPARK-14301.
* [SPARK-14339][DOC] Add python examples for DCT,MinMaxScaler,MaxAbsScalerZheng RuiFeng2016-04-093-0/+131
| | | | | | | | | | | | ## What changes were proposed in this pull request? add three python examples ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12063 from zhengruifeng/dct_pe.
* [SPARK-14444][BUILD] Add a new scalastyle `NoScalaDoc` to prevent ↵Dongjoon Hyun2016-04-061-2/+4
| | | | | | | | | | | | | | | | | | | | | ScalaDoc-style multiline comments ## What changes were proposed in this pull request? According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the followings. ``` /** In Spark, we don't use the ScalaDoc style so this * is not correct. */ ``` ## How was this patch tested? Pass the Jenkins tests (including `lint-scala`). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12221 from dongjoon-hyun/SPARK-14444.
* [SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static ↵Dongjoon Hyun2016-04-035-17/+17
| | | | | | | | | | | | | | | | | | | | | analysis results ## What changes were proposed in this pull request? This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines). - Fix typos(exception/log strings, testcase name, comments) in 44 lines. - Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011) - Use diamond operators in 40 lines. (New codes after SPARK-13702) - Fix redundant semicolon in 5 lines. - Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala. ## How was this patch tested? Manual and pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12139 from dongjoon-hyun/SPARK-14355.
* [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code.Dongjoon Hyun2016-04-028-41/+43
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR aims to fix all Scala-Style multiline comments into Java-Style multiline comments in Scala codes. (All comment-only changes over 77 files: +786 lines, −747 lines) ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.
* [MINOR] Typo fixesJacek Laskowski2016-04-021-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Typo fixes. No functional changes. ## How was this patch tested? Built the sources and ran with samples. Author: Jacek Laskowski <jacek@japila.pl> Closes #11802 from jaceklaskowski/typo-fixes.
* [MINOR] Fix newly added java-lint errorsDongjoon Hyun2016-03-262-2/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes some newly added java-lint errors(unused-imports, line-lengsth). ## How was this patch tested? Pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11968 from dongjoon-hyun/SPARK-14167.
* [SPARK-13874][DOC] Remove docs of streaming-akka, streaming-zeromq, ↵Shixiong Zhu2016-03-261-59/+0
| | | | | | | | | | | | | | | | | | streaming-mqtt and streaming-twitter ## What changes were proposed in this pull request? This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects since I have already copied them to https://github.com/spark-packages Also remove mqtt_wordcount.py that I forgot to remove previously. ## How was this patch tested? Jenkins PR Build. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11824 from zsxwing/remove-doc.
* [SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to SparkShixiong Zhu2016-03-253-0/+212
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR moves flume back to Spark as per the discussion in the dev mail-list. ## How was this patch tested? Existing Jenkins tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11895 from zsxwing/move-flume-back.
* [SPARK-13017][DOCS] Replace example code in mllib-feature-extraction.md ↵Xin Ren2016-03-2414-0/+847
| | | | | | | | | | | | | | | | | | | | | using include_example Replace example code in mllib-feature-extraction.md using include_example https://issues.apache.org/jira/browse/SPARK-13017 The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. `{% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}` Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/TFIDFExample.scala` and pick code blocks marked "example" and replace code block in `{% highlight %}` in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 Author: Xin Ren <iamshrek@126.com> Closes #11142 from keypointt/SPARK-13017.
* [SPARK-13019][DOCS] fix for scala-2.10 build: Replace example code in ↵Xin Ren2016-03-2418-0/+1020
| | | | | | | | | | | | | | | | | | | | | | mllib-statistics.md using include_example ## What changes were proposed in this pull request? This PR for ticket SPARK-13019 is based on previous PR(https://github.com/apache/spark/pull/11108). Since PR(https://github.com/apache/spark/pull/11108) is breaking scala-2.10 build, more work is needed to fix build errors. What I did new in this PR is adding keyword argument for 'fractions': ` val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)` ` val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)` I reopened ticket on JIRA but sorry I don't know how to reopen a GitHub pull request, so I just submitting a new pull request. ## How was this patch tested? Manual build testing on local machine, build based on scala-2.10. Author: Xin Ren <iamshrek@126.com> Closes #11901 from keypointt/SPARK-13019.
* Revert "[SPARK-13019][DOCS] Replace example code in mllib-statistics.md ↵Xiangrui Meng2016-03-2118-1020/+0
| | | | | | using include_example" This reverts commit 1af8de200c4d3357bcb09e7bbc6deece00e885f2.
* [SPARK-13019][DOCS] Replace example code in mllib-statistics.md using ↵Xin Ren2016-03-2118-0/+1020
| | | | | | | | | | | | | | | | | | | | include_example https://issues.apache.org/jira/browse/SPARK-13019 The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. `{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}` Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala` and pick code blocks marked "example" and replace code block in `{% highlight %}` in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 Author: Xin Ren <iamshrek@126.com> Closes #11108 from keypointt/SPARK-13019.
* [SPARK-14011][CORE][SQL] Enable `LineLength` Java checkstyle ruleDongjoon Hyun2016-03-2112-34/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? [Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`. ```xml - <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places --> - <!-- <module name="LineLength"> <property name="max" value="100"/> <property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/> </module> - --> <module name="NoLineWrap"/> <module name="EmptyBlock"> <property name="option" value="TEXT"/> -167,5 +164,7 </module> <module name="CommentsIndentation"/> <module name="UnusedImports"/> + <module name="RedundantImport"/> + <module name="RedundantModifier"/> ``` ## How was this patch tested? Currently, `lint-java` is disabled in Jenkins. It needs a manual test. After passing the Jenkins tests, `dev/lint-java` should passes locally. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11831 from dongjoon-hyun/SPARK-14011.
* [MINOR][DOC] Fix nits in JavaStreamingTestExampleZheng RuiFeng2016-03-181-9/+11
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix some nits discussed in https://github.com/apache/spark/pull/11776#issuecomment-198207419 use !rdd.isEmpty instead of rdd.count > 0 use static instead of AtomicInteger remove unneeded "throws Exception" ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11821 from zhengruifeng/je_fix.
* [SPARK-13928] Move org.apache.spark.Logging into ↵Wenchen Fan2016-03-173-2/+4
| | | | | | | | | | | | | | | | org.apache.spark.internal.Logging ## What changes were proposed in this pull request? Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11764 from cloud-fan/logger.
* [MINOR][DOC] Add JavaStreamingTestExampleZheng RuiFeng2016-03-171-0/+121
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add the java example of StreamingTest ## How was this patch tested? manual tests in CLI: bin/run-example mllib.JavaStreamingTestExample dataDir 5 100 Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11776 from zhengruifeng/streaming_je.
* [SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, ↵Shixiong Zhu2016-03-1412-1321/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | streaming-zeromq, streaming-akka, streaming-twitter to Spark packages ## What changes were proposed in this pull request? Currently there are a few sub-projects, each for integrating with different external sources for Streaming. Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages - streaming-flume - streaming-akka - streaming-mqtt - streaming-zeromq - streaming-twitter They are just some ancillary packages and considering the overhead of maintenance, running tests and PR failures, it's better to maintain them out of Spark. In addition, these projects can have their different release cycles and we can release them faster. I have already copied these projects to https://github.com/spark-packages ## How was this patch tested? Jenkins tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #11672 from zsxwing/remove-external-pkg.
* [MINOR][DOCS] Fix more typos in comments/strings.Dongjoon Hyun2016-03-141-1/+1
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes 135 typos over 107 files: * 121 typos in comments * 11 typos in testcase name * 3 typos in log messages ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11689 from dongjoon-hyun/fix_more_typos.
* [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> ↵Sean Owen2016-03-132-2/+6
| | | | | | | | | | | | | | | | | | | | byte[] conversions (and remaining Coverity items) ## What changes were proposed in this pull request? - Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8 - Same for `InputStreamReader` and `OutputStreamWriter` constructors - Standardizes on UTF-8 everywhere - Standardizes specifying the encoding with `StandardCharsets.UTF-8`, not the Guava constant or "UTF-8" (which means handling `UnuspportedEncodingException`) - (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit https://github.com/srowen/spark/commit/1deecd8d9ca986d8adb1a42d315890ce5349d29c ) ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11657 from srowen/SPARK-13823.
* [SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows()Cheng Lian2016-03-1313-14/+17
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful. ## How was this patch tested? Existing tests should do the work. Author: Cheng Lian <lian@databricks.com> Closes #11678 from liancheng/remove-collect-rows-and-take-rows.
* [SPARK-13814] [PYSPARK] Delete unnecessary imports in python examples filesZheng RuiFeng2016-03-1115-27/+0
| | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-13814 ## What changes were proposed in this pull request? delete unnecessary imports in python examples files ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11651 from zhengruifeng/del_import_pe.
* [SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrameCheng Lian2016-03-1110-10/+0
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? PR #11443 temporarily disabled MiMA check, this PR re-enables it. One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API changes. ## How was this patch tested? Tested by MiMA check triggered by Jenkins. Author: Cheng Lian <lian@databricks.com> Closes #11656 from liancheng/re-enable-mima.
* [HOT-FIX][SQL][ML] Fix compile error from use of DataFrame in Java ↵Nick Pentreath2016-03-111-3/+4
| | | | | | | | | | | | | | | | MaxAbsScaler example ## What changes were proposed in this pull request? Fix build failure introduced in #11392 (change `DataFrame` -> `Dataset<Row>`). ## How was this patch tested? Existing build/unit tests Author: Nick Pentreath <nick.pentreath@gmail.com> Closes #11653 from MLnick/java-maxabs-example-fix.
* [SPARK-13512][ML] add example and doc for MaxAbsScalerYuhao Yang2016-03-112-0/+101
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-13512 Add example and doc for ml.feature.MaxAbsScaler. ## How was this patch tested? unit tests Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11392 from hhbyyh/maxabsdoc.
* [SPARK-13672][ML] Add python examples of BisectingKMeans in ML and MLLIBZheng RuiFeng2016-03-112-0/+107
| | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-13672 ## What changes were proposed in this pull request? add two python examples of BisectingKMeans for ml and mllib ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11515 from zhengruifeng/mllib_bkm_pe.
* [SPARK-13244][SQL] Migrates DataFrame to DatasetCheng Lian2016-03-1052-213/+252
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and make `DataFrame` a type alias of `Dataset[Row]`. Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`). There are several noticeable API changes related to those returning arrays: 1. `collect`/`take` - Old APIs in class `DataFrame`: ```scala def collect(): Array[Row] def take(n: Int): Array[Row] ``` - New APIs in class `Dataset[T]`: ```scala def collect(): Array[T] def take(n: Int): Array[T] def collectRows(): Array[Row] def takeRows(n: Int): Array[Row] ``` Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side. Normally, Java users may fall back to `collectAsList` and `takeAsList`. The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here). 1. `randomSplit` - Old APIs in class `DataFrame`: ```scala def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] def randomSplit(weights: Array[Double]): Array[DataFrame] ``` - New APIs in class `Dataset[T]`: ```scala def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]] def randomSplit(weights: Array[Double]): Array[Dataset[T]] ``` Similar problem as above, but hasn't been addressed for Java API yet. We can probably add `randomSplitAsList` to fix this one. 1. `groupBy` Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods. To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`. Other noticeable changes: 1. Dataset always do eager analysis now We used to support disabling DataFrame eager analysis to help reporting partially analyzed malformed logical plan on analysis failure. However, Dataset encoders requires eager analysi during Dataset construction. To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures. This plan is passed by `QueryExecution.assertAnalyzed`. ## How was this patch tested? Existing tests do the work. ## TODO - [ ] Fix all tests - [ ] Re-enable MiMA check - [ ] Update ScalaDoc (`since`, `group`, and example code) Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Author: Cheng Lian <liancheng@users.noreply.github.com> Closes #11443 from liancheng/ds-to-df.
* [SPARK-3854][BUILD] Scala style: require spaces before `{`.Dongjoon Hyun2016-03-106-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule to prevent '){' pattern for the following majority pattern and fixes the code accordingly. If we enforce this in ScalaStyle from now, it will improve the Scala code quality and reduce review time. ``` // Correct: if (true) { println("Wow!") } // Incorrect: if (true){ println("Wow!") } ``` IntelliJ also shows new warnings based on this. ## How was this patch tested? Pass the Jenkins ScalaStyle test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11637 from dongjoon-hyun/SPARK-3854.
* [SPARK-13706][ML] Add Python Example for Train Validation SplitJeremyNixon2016-03-101-0/+68
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This pull request adds a python example for train validation split. ## How was this patch tested? This was style tested through lint-python, generally tested with ./dev/run-tests, and run in notebook and shell environments. It was viewed in docs locally with jekyll serve. This contribution is my original work and I license it to Spark under its open source license. Author: JeremyNixon <jnixon2@gmail.com> Closes #11547 from JeremyNixon/tvs_example.
* [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance ↵Dongjoon Hyun2016-03-0925-42/+40
| | | | | | | | | | | | | | | | | | | | | | | | | creation in Java code. ## What changes were proposed in this pull request? In order to make `docs/examples` (and other related code) more simple/readable/user-friendly, this PR replaces existing codes like the followings by using `diamond` operator. ``` - final ArrayList<Product2<Object, Object>> dataToWrite = - new ArrayList<Product2<Object, Object>>(); + final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>(); ``` Java 7 or higher supports **diamond** operator which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code use mixed usage of this. ## How was this patch tested? Manual. Pass the existing tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11541 from dongjoon-hyun/SPARK-13702.
* [SPARK-13692][CORE][SQL] Fix trivial Coverity/Checkstyle defectsDongjoon Hyun2016-03-0922-4/+46
| | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle. - Implement both null and type checking in equals functions. - Fix wrong type casting logic in SimpleJavaBean2.equals. - Add `implement Cloneable` to `UTF8String` and `SortedIterator`. - Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. - Fix coding style: Add '{}' to single `for` statement in mllib examples. - Remove unused imports in `ColumnarBatch` and `JavaKinesisStreamSuite`. - Remove unused fields in `ChunkFetchIntegrationSuite`. - Add `stop()` to prevent resource leak. Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583](https://issues.apache.org/jira/browse/SPARK-13583). ## How was this patch tested? manual via `./dev/lint-java` and Coverity site. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11530 from dongjoon-hyun/SPARK-13692.
* Fixing the type of the sentiment happiness valueYury Liavitski2016-03-071-2/+2
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Added the conversion to int for the 'happiness value' read from the file. Otherwise, later on line 75 the multiplication will multiply a string by a number, yielding values like "-2-2" instead of -4. ## How was this patch tested? Tested manually. Author: Yury Liavitski <seconds.before@gmail.com> Author: Yury Liavitski <yury.liavitski@il111.ice.local> Closes #11540 from heliocentrist/fix-sentiment-value-type.
* [MINOR] Fix typos in comments and testcase name of codeDongjoon Hyun2016-03-039-10/+10
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes typos in comments and testcase name of code. ## How was this patch tested? manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
* [SPARK-13013][DOCS] Replace example code in mllib-clustering.md using ↵Xin Ren2016-03-0314-3/+697
| | | | | | | | | | | | | | | | | | | | | include_example Replace example code in mllib-clustering.md using include_example https://issues.apache.org/jira/browse/SPARK-13013 The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. `{% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %}` Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala` and pick code blocks marked "example" and replace code block in `{% highlight %}` in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 Author: Xin Ren <iamshrek@126.com> Closes #11116 from keypointt/SPARK-13013.