Commit message | Author | Age | Files | Lines
* [SPARK-10060] [ML] [DOC] spark.ml DecisionTree user guide | Joseph K. Bradley | 2015-08-19 | 5 | -13/+519
New user guide section ml-decision-tree.md, including code examples. I have run all examples, including the Java ones.
CC: manishamde yanboliang mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8244 from jkbradley/ml-dt-docs.
* [SPARK-8949] Print warnings when using preferred locations feature | Han JU | 2015-08-19 | 1 | -0/+5
Add warnings according to SPARK-8949 in `SparkContext`:
- warnings in scaladoc
- log warnings when the preferred locations feature is used through `SparkContext`'s constructor
However, I didn't find any documentation reference to this feature. Please direct me to one if you know of any.
Author: Han JU <ju.han.felix@gmail.com>
Closes #7874 from darkjh/SPARK-8949.
* [SPARK-9977] [DOCS] Update documentation for StringIndexer | lewuathe | 2015-08-19 | 1 | -1/+5
`StringIndexer` writes the indexed label to a new column, so a following estimator in the pipeline must use this new column if it wants the string-indexed label. I think it is better to make this explicit in the documentation.
Author: lewuathe <lewuathe@me.com>
Closes #8205 from Lewuathe/SPARK-9977.
* [DOCS] [SQL] [PYSPARK] Fix typo in ntile function | Moussa Taifi | 2015-08-19 | 1 | -1/+1
Fix typo in ntile function.
Author: Moussa Taifi <moutai10@gmail.com>
Closes #8261 from moutai/patch-2.
* [SPARK-10070] [DOCS] Remove Guava dependencies in user guides | Sean Owen | 2015-08-19 | 2 | -35/+38
`Lists.newArrayList` -> `Arrays.asList`
CC jkbradley feynmanliang
Anybody into replacing usages of `Lists.newArrayList` in the examples / source code too? This method isn't useful in Java 7 and beyond.
Author: Sean Owen <sowen@cloudera.com>
Closes #8272 from srowen/SPARK-10070.
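
The replacement described above can be sketched as follows. This is a minimal illustration, not code from the patch: the class and method names are hypothetical. One difference worth keeping in mind is that `Arrays.asList` returns a fixed-size list, whereas Guava's `Lists.newArrayList` returns a mutable `ArrayList`, so examples that later add elements need an explicit `ArrayList` wrapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch class; not part of the Spark user guides.
final class ArraysAsListSketch {
    // Fixed-size list: enough for read-only example data.
    static List<String> fixedSize() {
        return Arrays.asList("a", "b", "c");
    }

    // Wrap in an ArrayList when the example needs to add or remove elements later.
    static List<String> growable() {
        return new ArrayList<>(Arrays.asList("a", "b", "c"));
    }

    public static void main(String[] args) {
        System.out.println(fixedSize());
        List<String> items = growable();
        items.add("d");
        System.out.println(items);
    }
}
```
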
* Fix Broken Link | Bill Chambers | 2015-08-19 | 1 | -1/+1
Link was broken because it included tick marks.
Author: Bill Chambers <wchambers@ischool.berkeley.edu>
Closes #8302 from anabranch/patch-1.
* [SPARK-9967] [SPARK-10099] [STREAMING] Renamed conf spark.streaming.backpressure.{enable-->enabled} and fixed deprecated annotations | Tathagata Das | 2015-08-18 | 4 | -8/+8
Small changes:
- Renamed conf spark.streaming.backpressure.{enable --> enabled}
- Changed Java Deprecated annotations to Scala deprecated annotations with more information.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8299 from tdas/SPARK-9967.
* [SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs | Josh Rosen | 2015-08-18 | 4 | -16/+18
In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine):
```scala
val numItems = 100000
val s = Seq.fill(numItems)(1)
for (i <- 0 until numItems) s(i)
```
It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput.
This patch fixes this by replacing `Seq` with `Array`.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8178 from JoshRosen/dagscheduler-perf.
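
The same pitfall can be reproduced outside Scala. The sketch below is a Java analogue, not the DAGScheduler code: it swaps Scala's `List` for `java.util.LinkedList`, where `get(i)` also walks the list, and mirrors the fix by copying once into an indexed structure. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch class illustrating indexed access on linked vs. array-backed lists.
final class IndexedAccessSketch {
    // Sums by index. On a LinkedList each get(i) walks the list, so the loop is O(n^2);
    // on an ArrayList each get(i) is O(1), so the loop is O(n).
    static long sumByIndex(List<Integer> items) {
        long sum = 0;
        for (int i = 0; i < items.size(); i++) {
            sum += items.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        int numItems = 20000; // kept small so the slow variant still finishes quickly
        List<Integer> linked = new LinkedList<>();
        for (int i = 0; i < numItems; i++) {
            linked.add(1);
        }
        // Analogous to replacing Seq with Array: copy once, then index cheaply.
        List<Integer> indexed = new ArrayList<>(linked);
        System.out.println(sumByIndex(linked) + " " + sumByIndex(indexed));
    }
}
```

Both calls return the same sum; only the cost of each `get(i)` differs.
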
* [SPARK-9508] GraphX Pregel docs update with new Pregel code | Alexander Ulanov | 2015-08-18 | 1 | -10/+8
SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be modified accordingly, since it lists the old Pregel code.
Author: Alexander Ulanov <nashb@yandex.ru>
Closes #7831 from avulanov/SPARK-9508-pregel-doc2.
* [SPARK-9705] [DOC] fix docs about Python version | Davies Liu | 2015-08-18 | 2 | -3/+15
cc JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes #8245 from davies/python_doc.
* [SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors & fix UDFs on complex types | Reynold Xin | 2015-08-18 | 4 | -7/+68
This is kind of a weird case, but given a sufficiently complex query plan (in this case a TungstenProject with an Exchange underneath), we could get NPEs on the executors because of when we were calling transformAllExpressions.
In general we should ensure that all transformations occur on the driver and not on the executors. Some reasons for avoiding executor-side transformations include:
* (this case) Some operator constructors require state such as access to the Spark/SQL conf, so doing a makeCopy on the executor can fail.
* (unrelated reason for avoiding executor transformations) ExprIds are calculated using an atomic integer, so you can violate their uniqueness constraint by constructing them anywhere other than the driver.
This subsumes #8285.
Author: Reynold Xin <rxin@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #8295 from rxin/SPARK-10096.
* [SPARK-10095] [SQL] use public API of BigInteger | Davies Liu | 2015-08-18 | 3 | -45/+11
In UnsafeRow, we use the private field of BigInteger for better performance, but it didn't actually contribute much (3% in one benchmark) to end-to-end runtime, and it makes the code non-portable (it may fail on other JVM implementations). So we should use the public API instead.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes #8286 from davies/portable_decimal.
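
As an illustration of what staying on the public API can look like, here is a minimal sketch that round-trips a `BigInteger` through `toByteArray()` and the `BigInteger(byte[])` constructor. Which public methods the patch actually uses is not stated in the message above, so this choice is an assumption, and the class and method names are hypothetical.

```java
import java.math.BigInteger;

// Hypothetical sketch class; it does not reproduce UnsafeRow.
final class BigIntegerPublicApiSketch {
    // Serialize with the public API: big-endian two's-complement bytes.
    static byte[] toBytes(BigInteger value) {
        return value.toByteArray();
    }

    // Rebuild the value from those bytes with the public constructor.
    static BigInteger fromBytes(byte[] bytes) {
        return new BigInteger(bytes);
    }

    public static void main(String[] args) {
        BigInteger original = new BigInteger("-123456789012345678901234567890");
        BigInteger restored = fromBytes(toBytes(original));
        System.out.println(original.equals(restored));
    }
}
```

Because both calls are part of the documented `java.math.BigInteger` contract, the round trip does not depend on any JVM's internal field layout.
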
* [SPARK-10075] [SPARKR] Add `when` expression function in SparkR | Yu ISHIKAWA | 2015-08-18 | 5 | -0/+45
- Add `when` and `otherwise` as `Column` methods
- Add `When` as an expression function
- Add `%otherwise%` infix as an alias of `otherwise`
Since R doesn't support a feature like method chaining, the `otherwise(when(condition, value), value)` style is a little annoying to me. If `%otherwise%` looks strange to shivaram, I can remove it. What do you think?
### JIRA
[[SPARK-10075] Add `when` expression function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8266 from yu-iskw/SPARK-10075.
* [SPARK-9939] [SQL] Resorts to Java process API in CliSuite, HiveSparkSubmitSuite and HiveThriftServer2 test suites | Cheng Lian | 2015-08-19 | 5 | -91/+149
Scala process API has a known bug ([SI-8768] [1]), which may be the reason why several test suites which fork sub-processes are flaky.
This PR replaces Scala process API with Java process API in `CliSuite`, `HiveSparkSubmitSuite`, and `HiveThriftServer2` related test suites to see whether it fixes these flaky tests.
[1]: https://issues.scala-lang.org/browse/SI-8768
Author: Cheng Lian <lian@databricks.com>
Closes #8168 from liancheng/spark-9939/use-java-process-api.
* [SPARK-10102] [STREAMING] Fix a race condition that startReceiver may happen before setting trackerState to Started | zsxwing | 2015-08-18 | 1 | -3/+8
Test failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3305/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/stop_gracefully/
There is a race condition in which setting `trackerState` to `Started` could happen after calling `startReceiver`. Then `startReceiver` won't start the receivers because it uses `! isTrackerStarted` to check if ReceiverTracker is stopping or stopped. But actually, `trackerState` is `Initialized` and will be changed to `Started` soon. Therefore, we should use `isTrackerStopping || isTrackerStopped`.
Author: zsxwing <zsxwing@gmail.com>
Closes #8294 from zsxwing/SPARK-9504.
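
The difference between the two checks can be sketched with a small state enum. This is an illustration only, not the ReceiverTracker code: the enum, its `STOPPING` and `STOPPED` values (inferred from the predicate names above) and the method names are hypothetical.

```java
// Hypothetical sketch class modelling the two guard conditions described above.
final class TrackerStateSketch {
    enum TrackerState { INITIALIZED, STARTED, STOPPING, STOPPED }

    // Old guard (!isTrackerStarted): skips starting receivers in every state except
    // STARTED, including INITIALIZED, which is about to become STARTED.
    static boolean skipStartOld(TrackerState state) {
        return state != TrackerState.STARTED;
    }

    // New guard (isTrackerStopping || isTrackerStopped): skips only when shutting down.
    static boolean skipStartNew(TrackerState state) {
        return state == TrackerState.STOPPING || state == TrackerState.STOPPED;
    }

    public static void main(String[] args) {
        for (TrackerState state : TrackerState.values()) {
            System.out.println(state + " old=" + skipStartOld(state) + " new=" + skipStartNew(state));
        }
    }
}
```

The two guards disagree only for the `INITIALIZED` state, which is exactly the window in which the race occurs.
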
* [SPARK-10072] [STREAMING] BlockGenerator can deadlock when the queue of generated blocks fills up to capacity | Tathagata Das | 2015-08-18 | 1 | -10/+19
Generated blocks are inserted into an ArrayBlockingQueue, and another thread pulls blocks from the ArrayBlockingQueue and pushes them into BlockManager. If that queue fills up to capacity (default is 10 blocks), then the insert into the queue (done in the function updateCurrentBuffer) blocks inside a synchronized block. However, the thread that pulls blocks from the queue uses the same lock to check the current state (active or stopped) while pulling from the queue. Since the block-generating thread is blocked (as the queue is full) while holding the lock, the thread that is supposed to drain the queue gets blocked as well. Ergo, deadlock.
Solution: Moved the blocking call to ArrayBlockingQueue outside the synchronized block to prevent deadlock.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8257 from tdas/SPARK-10072.
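
The shape of that fix can be sketched as follows: swap the buffer under the lock, then perform the blocking `put` only after the lock is released. This is a simplified illustration under assumptions, not the BlockGenerator code. Apart from `updateCurrentBuffer`, which is named in the message above, all class, field and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch class; a simplified stand-in for the producer/drainer pair.
final class BlockGeneratorSketch {
    private final Object lock = new Object();
    private final BlockingQueue<List<String>> blocksForPushing;
    private List<String> currentBuffer = new ArrayList<>();
    private boolean active = true;

    BlockGeneratorSketch(int capacity) {
        blocksForPushing = new ArrayBlockingQueue<>(capacity);
    }

    void add(String record) {
        synchronized (lock) {
            currentBuffer.add(record);
        }
    }

    // Swap the buffer while holding the lock, but call put() after releasing it.
    void updateCurrentBuffer() {
        List<String> block = null;
        synchronized (lock) {
            if (active && !currentBuffer.isEmpty()) {
                block = currentBuffer;
                currentBuffer = new ArrayList<>();
            }
        }
        if (block != null) {
            try {
                blocksForPushing.put(block); // may block on a full queue; lock is not held
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
            }
        }
    }

    // Drain side: takes the lock only to read the state, so it cannot get stuck
    // behind a producer that is waiting on a full queue.
    List<String> pollBlock() {
        boolean isActive;
        synchronized (lock) {
            isActive = active;
        }
        return isActive ? blocksForPushing.poll() : null;
    }

    public static void main(String[] args) {
        BlockGeneratorSketch generator = new BlockGeneratorSketch(10);
        generator.add("record-1");
        generator.updateCurrentBuffer();
        System.out.println(generator.pollBlock());
    }
}
```

With the `put` inside the synchronized block, a full queue would leave the producer waiting while holding the lock that `pollBlock` needs, which is the deadlock described above.
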
* [SPARKR] [MINOR] Get rid of a long line warning | Yu ISHIKAWA | 2015-08-18 | 1 | -1/+3
```
R/functions.R:74:1: style: lines should not be more than 100 characters.
jc <- callJStatic("org.apache.spark.sql.functions", "lit", ifelse(class(x) == "Column", xjc, x))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8297 from yu-iskw/minor-lint-r.
* [SPARK-9969] [YARN] Remove old MR classpath API support | jerryshao | 2015-08-18 | 1 | -11/+1
This proposes removing the old MRJobConfig#DEFAULT_APPLICATION_CLASSPATH support, since we have now moved to the Yarn stable API.
vanzin and sryza, any opinion on this? If we still want to support the old API, I can close it. But as far as I know, major Hadoop releases have now moved to the stable API.
Author: jerryshao <sshao@hortonworks.com>
Closes #8192 from jerryshao/SPARK-9969.
* Bump SparkR version string to 1.5.0 | Hossein | 2015-08-18 | 1 | -1/+1
This patch is against master, but we need to apply it to 1.5 branch as well.
cc shivaram and rxin
Author: Hossein <hossein@databricks.com>
Closes #8291 from falaki/SparkRVersion1.5.
* [SPARK-8473] [SPARK-9889] [ML] User guide and example code for DCT | Feynman Liang | 2015-08-18 | 1 | -0/+71
mengxr jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes #8184 from feynmanliang/SPARK-9889-DCT-docs.
* [SPARK-10098] [STREAMING] [TEST] Cleanup active context after test in FailureSuite | Tathagata Das | 2015-08-18 | 1 | -10/+17
Failures in streaming.FailureSuite can leak StreamingContext and SparkContext, which makes all subsequent tests fail.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8289 from tdas/SPARK-10098.
* [SPARK-10012] [ML] Missing test case for Params#arrayLengthGt | lewuathe | 2015-08-18 | 1 | -0/+3
Currently there is no test case for `Params#arrayLengthGt`.
Author: lewuathe <lewuathe@me.com>
Closes #8223 from Lewuathe/SPARK-10012.
* [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree | Bryan Cutler | 2015-08-18 | 24 | -1/+157
Added since tags to mllib.tree
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes #7380 from BryanCutler/sinceTag-mllibTree-8924.
* [SPARK-10088] [SQL] Add support for "stored as avro" in HiveQL parser. | Marcelo Vanzin | 2015-08-18 | 2 | -10/+13
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8282 from vanzin/SPARK-10088.
* [SPARK-10089] [SQL] Add missing golden files. | Marcelo Vanzin | 2015-08-18 | 2 | -0/+503
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8283 from vanzin/SPARK-10089.
* [SPARK-9782] [YARN] Support YARN application tags via SparkConf | Dennis Huo | 2015-08-18 | 3 | -0/+65
Add a new test case in yarn/ClientSuite which checks how the various SparkConf and ClientArguments propagate into the ApplicationSubmissionContext.
Author: Dennis Huo <dhuo@google.com>
Closes #8072 from dennishuo/dhuo-yarn-application-tags.
* [SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation | Michael Armbrust | 2015-08-18 | 3 | -11/+22
Turns out that inner classes of inner objects are referenced directly, and thus moving it will break binary compatibility.
Author: Michael Armbrust <michael@databricks.com>
Closes #8281 from marmbrus/binaryCompat.
* [SPARK-9574] [STREAMING] Remove unnecessary contents of spark-streaming-XXX-assembly jars | zsxwing | 2015-08-18 | 5 | -1/+249
Removed contents already included in Spark assembly jar from spark-streaming-XXX-assembly jars.
Author: zsxwing <zsxwing@gmail.com>
Closes #8069 from zsxwing/SPARK-9574.
* [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import | Piotr Migdal | 2015-08-18 | 1 | -2/+0
See https://issues.apache.org/jira/browse/SPARK-10085
Author: Piotr Migdal <pmigdal@gmail.com>
Closes #8284 from stared/spark-10085.
* [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide | Yanbo Liang | 2015-08-18 | 1 | -0/+28
Add Python example for mllib LDAModel user guide
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8227 from yanboliang/spark-10032.
* [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide | Yanbo Liang | 2015-08-18 | 1 | -0/+35
Add Python examples for mllib IsotonicRegression user guide
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8225 from yanboliang/spark-10029.
* [SPARK-9900] [MLLIB] User guide for Association Rules | Feynman Liang | 2015-08-18 | 3 | -15/+118
Updates FPM user guide to include Association Rules.
Author: Feynman Liang <fliang@databricks.com>
Closes #8207 from feynmanliang/SPARK-9900-arules.
* [SPARK-7736] [CORE] Fix a race introduced in PythonRunner. | Marcelo Vanzin | 2015-08-18 | 1 | -1/+7
The fix for SPARK-7736 introduced a race where a port value of "-1" could be passed down to the pyspark process, causing it to fail to connect back to the JVM. This change adds code to fix that race.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8258 from vanzin/SPARK-7736.
* [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel | Yuhao Yang | 2015-08-18 | 4 | -155/+402
jira: https://issues.apache.org/jira/browse/SPARK-9028
Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.
I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn.
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #7388 from hhbyyh/cvEstimator.
* [SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple parameters functions | Yuu ISHIKAWA | 2015-08-18 | 1 | -3/+47
### JIRA
[[SPARK-10007] Update `NAMESPACE` file in SparkR for simple parameters functions - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10007)
Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8277 from yu-iskw/SPARK-10007.
* [SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4J | Cheng Lian | 2015-08-18 | 5 | -43/+47
Parquet hard-codes a JUL logger which always writes to stdout. This PR redirects it via the SLF4J JUL bridge handler, so that we can control Parquet logs via `log4j.properties`.
This solution is inspired by https://github.com/Parquet/parquet-mr/issues/390#issuecomment-46064909.
Author: Cheng Lian <lian@databricks.com>
Closes #8196 from liancheng/spark-8118/redirect-parquet-jul.
* [MINOR] fix the comments in IndexShuffleBlockResolver | CodingCat | 2015-08-18 | 1 | -1/+1
It might be a typo introduced at the very beginning, or a leftover from some renaming.
The method accessing the index file is called `getBlockData` now (not `getBlockLocation` as indicated in the comments).
Author: CodingCat <zhunansjtu@gmail.com>
Closes #8238 from CodingCat/minor_1.
* [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public | Yanbo Liang | 2015-08-17 | 1 | -2/+2
Fix the issue that `layers` and `weights` should be public variables of `MultilayerPerceptronClassificationModel`. Users currently cannot get `layers` and `weights` from a `MultilayerPerceptronClassificationModel`.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8263 from yanboliang/mlp-public.
* [SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary in ArrayData | Davies Liu | 2015-08-17 | 2 | -4/+29
The type for an array of arrays in Java is slightly different from that of arrays of other types.
cc cloud-fan
Author: Davies Liu <davies@databricks.com>
Closes #8250 from davies/array_binary.
* [MINOR] Format the comment of `translate` at `functions.scala` | Yu ISHIKAWA | 2015-08-17 | 1 | -8/+9
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8265 from yu-iskw/minor-translate-comment.
* [SPARK-7808] [ML] add package doc for ml.feature | Xiangrui Meng | 2015-08-17 | 1 | -0/+89
This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #8260 from mengxr/SPARK-7808.
* [SPARK-10059] [YARN] Explicitly add JSP dependencies for tests. | Marcelo Vanzin | 2015-08-17 | 1 | -3/+19
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8251 from vanzin/SPARK-10059.
* [SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test | jose.cambronero | 2015-08-17 | 1 | -4/+47
Added doc examples for Python.
Author: jose.cambronero <jose.cambronero@cloudera.com>
Closes #8154 from josepablocam/spark_9902.
* [SPARK-7707] User guide and example code for KernelDensity | Sandy Ryza | 2015-08-17 | 1 | -0/+77
Author: Sandy Ryza <sandy@cloudera.com>
Closes #8230 from sryza/sandy-spark-7707.
* [SPARK-9898] [MLLIB] Prefix Span user guide | Feynman Liang | 2015-08-17 | 2 | -0/+97
Adds user guide for `PrefixSpan`, including Scala and Java example code.
mengxr zhangjiajin
Author: Feynman Liang <fliang@databricks.com>
Closes #8253 from feynmanliang/SPARK-9898.
* SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression | Prayag Chandran | 2015-08-17 | 9 | -12/+168
Added since tags to mllib.regression
Author: Prayag Chandran <prayagchandran@gmail.com>
Closes #7518 from prayagchandran/sinceTags and squashes the following commits:
fa4dda2 [Prayag Chandran] Re-formatting
6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags
1a0365f [Prayag Chandran] Reformating and adding a few more tags
89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
* [SPARK-9768] [PYSPARK] [ML] Add Python API and user guide for ml.feature.ElementwiseProduct | Yanbo Liang | 2015-08-17 | 2 | -9/+81
Add Python API, user guide and example for ml.feature.ElementwiseProduct.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8061 from yanboliang/SPARK-9768.
* [SPARK-9974] [BUILD] [SQL] Makes sure com.twitter:parquet-hadoop-bundle:1.6.0 is in SBT assembly jar | Cheng Lian | 2015-08-17 | 1 | -1/+1
PR #7967 enables Spark SQL to persist Parquet tables in Hive compatible format when possible. One consequence is that we have to set input/output classes to `MapredParquetInputFormat`/`MapredParquetOutputFormat`, which rely on com.twitter:parquet-hadoop:1.6.0 bundled with Hive 1.2.1.
When loading such a table in Spark SQL, `o.a.h.h.ql.metadata.Table` first loads these input/output format classes, and thus classes in com.twitter:parquet-hadoop:1.6.0. However, the scope of this dependency is defined as "runtime", and it is not packaged into the Spark assembly jar. This results in a `ClassNotFoundException`.
This issue can be worked around by asking users to add parquet-hadoop 1.6.0 via the `--driver-class-path` option. However, considering that the Maven build is immune to this problem, I feel it can be confusing and inconvenient for users.
So this PR fixes this issue by changing the scope of parquet-hadoop 1.6.0 to "compile".
Author: Cheng Lian <lian@databricks.com>
Closes #8198 from liancheng/spark-9974/bundle-parquet-1.6.0.
* [SPARK-8920] [MLLIB] Add @since tags to mllib.linalg | Sameer Abhyankar | 2015-08-17 | 8 | -17/+227
Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.Samavihome>
Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.local>
Closes #7729 from sabhyankar/branch_8920.
* [SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listing | Feynman Liang | 2015-08-17 | 1 | -13/+13
mengxr jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes #8255 from feynmanliang/SPARK-10068.