aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors & fix ↵Reynold Xin2015-08-184-7/+68
| | | | | | | | | | | | | | | | | | UDFs on complex types This is kind of a weird case, but given a sufficiently complex query plan (in this case a TungstenProject with an Exchange underneath), we could have NPEs on the executors due to the time when we were calling transformAllExpressions In general we should ensure that all transformations occur on the driver and not on the executors. Some reasons for avoid executor side transformations include: * (this case) Some operator constructors require state such as access to the Spark/SQL conf so doing a makeCopy on the executor can fail. * (unrelated reason for avoid executor transformations) ExprIds are calculated using an atomic integer, so you can violate their uniqueness constraint by constructing them anywhere other than the driver. This subsumes #8285. Author: Reynold Xin <rxin@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #8295 from rxin/SPARK-10096.
* [SPARK-10095] [SQL] use public API of BigIntegerDavies Liu2015-08-183-45/+11
| | | | | | | | | | | | In UnsafeRow, we use the private field of BigInteger for better performance, but it actually didn't contribute much (3% in one benchmark) to end-to-end runtime, and make it not portable (may fail on other JVM implementations). So we should use the public API instead. cc rxin Author: Davies Liu <davies@databricks.com> Closes #8286 from davies/portable_decimal.
* [SPARK-10075] [SPARKR] Add `when` expressino function in SparkRYu ISHIKAWA2015-08-185-0/+45
| | | | | | | | | | | | | | | - Add `when` and `otherwise` as `Column` methods - Add `When` as an expression function - Add `%otherwise%` infix as an alias of `otherwise` Since R doesn't support a feature like method chaining, `otherwise(when(condition, value), value)` style is a little annoying for me. If `%otherwise%` looks strange for shivaram, I can remove it. What do you think? ### JIRA [[SPARK-10075] Add `when` expressino function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8266 from yu-iskw/SPARK-10075.
* [SPARK-9939] [SQL] Resorts to Java process API in CliSuite, ↵Cheng Lian2015-08-195-91/+149
| | | | | | | | | | | | | | HiveSparkSubmitSuite and HiveThriftServer2 test suites Scala process API has a known bug ([SI-8768] [1]), which may be the reason why several test suites which fork sub-processes are flaky. This PR replaces Scala process API with Java process API in `CliSuite`, `HiveSparkSubmitSuite`, and `HiveThriftServer2` related test suites to see whether it fix these flaky tests. [1]: https://issues.scala-lang.org/browse/SI-8768 Author: Cheng Lian <lian@databricks.com> Closes #8168 from liancheng/spark-9939/use-java-process-api.
* [SPARK-10102] [STREAMING] Fix a race condition that startReceiver may happen ↵zsxwing2015-08-181-3/+8
| | | | | | | | | | | | | | before setting trackerState to Started Test failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3305/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/stop_gracefully/ There is a race condition that setting `trackerState` to `Started` could happen after calling `startReceiver`. Then `startReceiver` won't start the receivers because it uses `! isTrackerStarted` to check if ReceiverTracker is stopping or stopped. But actually, `trackerState` is `Initialized` and will be changed to `Started` soon. Therefore, we should use `isTrackerStopping || isTrackerStopped`. Author: zsxwing <zsxwing@gmail.com> Closes #8294 from zsxwing/SPARK-9504.
* [SPARK-10072] [STREAMING] BlockGenerator can deadlock when the queue of ↵Tathagata Das2015-08-181-10/+19
| | | | | | | | | | | | generate blocks fills up to capacity Generated blocks are inserted into an ArrayBlockingQueue, and another thread pulls stuff from the ArrayBlockingQueue and pushes it into BlockManager. Now if that queue fills up to capacity (default is 10 blocks), then the inserting into queue (done in the function updateCurrentBuffer) get blocked inside a synchronized block. However, the thread that is pulling blocks from the queue uses the same lock to check the current (active or stopped) while pulling from the queue. Since the block generating threads is blocked (as the queue is full) on the lock, this thread that is supposed to drain the queue gets blocked. Ergo, deadlock. Solution: Moved blocking call to ArrayBlockingQueue outside the synchronized to prevent deadlock. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8257 from tdas/SPARK-10072.
* [SPARKR] [MINOR] Get rid of a long line warningYu ISHIKAWA2015-08-181-1/+3
| | | | | | | | | | | | ``` R/functions.R:74:1: style: lines should not be more than 100 characters. jc <- callJStatic("org.apache.spark.sql.functions", "lit", ifelse(class(x) == "Column", xjc, x)) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8297 from yu-iskw/minor-lint-r.
* [SPARK-9969] [YARN] Remove old MR classpath API supportjerryshao2015-08-181-11/+1
| | | | | | | | | | Here propose to remove old MRJobConfig#DEFAULT_APPLICATION_CLASSPATH support, since we now move to Yarn stable API. vanzin and sryza , any opinion on this? If we still want to support old API, I can close it. But as far as I know now major Hadoop releases has moved to stable API. Author: jerryshao <sshao@hortonworks.com> Closes #8192 from jerryshao/SPARK-9969.
* Bump SparkR version string to 1.5.0Hossein2015-08-181-1/+1
| | | | | | | | | | This patch is against master, but we need to apply it to 1.5 branch as well. cc shivaram and rxin Author: Hossein <hossein@databricks.com> Closes #8291 from falaki/SparkRVersion1.5.
* [SPARK-8473] [SPARK-9889] [ML] User guide and example code for DCTFeynman Liang2015-08-181-0/+71
| | | | | | | | mengxr jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8184 from feynmanliang/SPARK-9889-DCT-docs.
* [SPARK-10098] [STREAMING] [TEST] Cleanup active context after test in ↵Tathagata Das2015-08-181-10/+17
| | | | | | | | | | FailureSuite Failures in streaming.FailureSuite can leak StreamingContext and SparkContext which fails all subsequent tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8289 from tdas/SPARK-10098.
* [SPARK-10012] [ML] Missing test case for Params#arrayLengthGtlewuathe2015-08-181-0/+3
| | | | | | | | Currently there is no test case for `Params#arrayLengthGt`. Author: lewuathe <lewuathe@me.com> Closes #8223 from Lewuathe/SPARK-10012.
* [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.treeBryan Cutler2015-08-1824-1/+157
| | | | | | | | Added since tags to mllib.tree Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #7380 from BryanCutler/sinceTag-mllibTree-8924.
* [SPARK-10088] [SQL] Add support for "stored as avro" in HiveQL parser.Marcelo Vanzin2015-08-182-10/+13
| | | | | | Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8282 from vanzin/SPARK-10088.
* [SPARK-10089] [SQL] Add missing golden files.Marcelo Vanzin2015-08-182-0/+503
| | | | | | Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8283 from vanzin/SPARK-10089.
* [SPARK-9782] [YARN] Support YARN application tags via SparkConfDennis Huo2015-08-183-0/+65
| | | | | | | | | Add a new test case in yarn/ClientSuite which checks how the various SparkConf and ClientArguments propagate into the ApplicationSubmissionContext. Author: Dennis Huo <dhuo@google.com> Closes #8072 from dennishuo/dhuo-yarn-application-tags.
* [SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolationMichael Armbrust2015-08-183-11/+22
| | | | | | | | Turns out that inner classes of inner objects are referenced directly, and thus moving it will break binary compatibility. Author: Michael Armbrust <michael@databricks.com> Closes #8281 from marmbrus/binaryCompat.
* [SPARK-9574] [STREAMING] Remove unnecessary contents of ↵zsxwing2015-08-185-1/+249
| | | | | | | | | | spark-streaming-XXX-assembly jars Removed contents already included in Spark assembly jar from spark-streaming-XXX-assembly jars. Author: zsxwing <zsxwing@gmail.com> Closes #8069 from zsxwing/SPARK-9574.
* [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array importPiotr Migdal2015-08-181-2/+0
| | | | | | | | See https://issues.apache.org/jira/browse/SPARK-10085 Author: Piotr Migdal <pmigdal@gmail.com> Closes #8284 from stared/spark-10085.
* [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guideYanbo Liang2015-08-181-0/+28
| | | | | | | | Add Python example for mllib LDAModel user guide Author: Yanbo Liang <ybliang8@gmail.com> Closes #8227 from yanboliang/spark-10032.
* [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression ↵Yanbo Liang2015-08-181-0/+35
| | | | | | | | | | user guide Add Python examples for mllib IsotonicRegression user guide Author: Yanbo Liang <ybliang8@gmail.com> Closes #8225 from yanboliang/spark-10029.
* [SPARK-9900] [MLLIB] User guide for Association RulesFeynman Liang2015-08-183-15/+118
| | | | | | | | Updates FPM user guide to include Association Rules. Author: Feynman Liang <fliang@databricks.com> Closes #8207 from feynmanliang/SPARK-9900-arules.
* [SPARK-7736] [CORE] Fix a race introduced in PythonRunner.Marcelo Vanzin2015-08-181-1/+7
| | | | | | | | | | The fix for SPARK-7736 introduced a race where a port value of "-1" could be passed down to the pyspark process, causing it to fail to connect back to the JVM. This change adds code to fix that race. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8258 from vanzin/SPARK-7736.
* [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate ↵Yuhao Yang2015-08-184-155/+402
| | | | | | | | | | | | | | | CountVectorizerModel jira: https://issues.apache.org/jira/browse/SPARK-9028 Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency. I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn. Author: Yuhao Yang <hhbyyh@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #7388 from hhbyyh/cvEstimator.
* [SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple ↵Yuu ISHIKAWA2015-08-181-3/+47
| | | | | | | | | | | parameters functions ### JIRA [[SPARK-10007] Update `NAMESPACE` file in SparkR for simple parameters functions - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10007) Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8277 from yu-iskw/SPARK-10007.
* [SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4JCheng Lian2015-08-185-43/+47
| | | | | | | | | | Parquet hard coded a JUL logger which always writes to stdout. This PR redirects it via SLF4j JUL bridge handler, so that we can control Parquet logs via `log4j.properties`. This solution is inspired by https://github.com/Parquet/parquet-mr/issues/390#issuecomment-46064909. Author: Cheng Lian <lian@databricks.com> Closes #8196 from liancheng/spark-8118/redirect-parquet-jul.
* [MINOR] fix the comments in IndexShuffleBlockResolverCodingCat2015-08-181-1/+1
| | | | | | | | | | it might be a typo introduced at the first moment or some leftover after some renaming...... the name of the method accessing the index file is called `getBlockData` now (not `getBlockLocation` as indicated in the comments) Author: CodingCat <zhunansjtu@gmail.com> Closes #8238 from CodingCat/minor_1.
* [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights publicYanbo Liang2015-08-171-2/+2
| | | | | | | | Fix the issue that ```layers``` and ```weights``` should be public variables of ```MultilayerPerceptronClassificationModel```. Users can not get ```layers``` and ```weights``` from a ```MultilayerPerceptronClassificationModel``` currently. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8263 from yanboliang/mlp-public.
* [SPARK-10038] [SQL] fix bug in generated unsafe projection when there is ↵Davies Liu2015-08-172-4/+29
| | | | | | | | | | | | binary in ArrayData The type for array of array in Java is slightly different than array of others. cc cloud-fan Author: Davies Liu <davies@databricks.com> Closes #8250 from davies/array_binary.
* [MINOR] Format the comment of `translate` at `functions.scala`Yu ISHIKAWA2015-08-171-8/+9
| | | | | | Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8265 from yu-iskw/minor-translate-comment.
* [SPARK-7808] [ML] add package doc for ml.featureXiangrui Meng2015-08-171-0/+89
| | | | | | | | This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8260 from mengxr/SPARK-7808.
* [SPARK-10059] [YARN] Explicitly add JSP dependencies for tests.Marcelo Vanzin2015-08-171-3/+19
| | | | | | Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8251 from vanzin/SPARK-10059.
* [SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample ↵jose.cambronero2015-08-171-4/+47
| | | | | | | | | | KS test added doc examples for python. Author: jose.cambronero <jose.cambronero@cloudera.com> Closes #8154 from josepablocam/spark_9902.
* [SPARK-7707] User guide and example code for KernelDensitySandy Ryza2015-08-171-0/+77
| | | | | | Author: Sandy Ryza <sandy@cloudera.com> Closes #8230 from sryza/sandy-spark-7707.
* [SPARK-9898] [MLLIB] Prefix Span user guideFeynman Liang2015-08-172-0/+97
| | | | | | | | | | Adds user guide for `PrefixSpan`, including Scala and Java example code. mengxr zhangjiajin Author: Feynman Liang <fliang@databricks.com> Closes #8253 from feynmanliang/SPARK-9898.
* SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regressionPrayag Chandran2015-08-179-12/+168
| | | | | | | | | | | | | Added since tags to mllib.regression Author: Prayag Chandran <prayagchandran@gmail.com> Closes #7518 from prayagchandran/sinceTags and squashes the following commits: fa4dda2 [Prayag Chandran] Re-formatting 6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags 1a0365f [Prayag Chandran] Reformating and adding a few more tags 89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
* [SPARK-9768] [PYSPARK] [ML] Add Python API and user guide for ↵Yanbo Liang2015-08-172-9/+81
| | | | | | | | | | ml.feature.ElementwiseProduct Add Python API, user guide and example for ml.feature.ElementwiseProduct. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8061 from yanboliang/SPARK-9768.
* [SPARK-9974] [BUILD] [SQL] Makes sure ↵Cheng Lian2015-08-171-1/+1
| | | | | | | | | | | | | | | | com.twitter:parquet-hadoop-bundle:1.6.0 is in SBT assembly jar PR #7967 enables Spark SQL to persist Parquet tables in Hive compatible format when possible. One of the consequence is that, we have to set input/output classes to `MapredParquetInputFormat`/`MapredParquetOutputFormat`, which rely on com.twitter:parquet-hadoop:1.6.0 bundled with Hive 1.2.1. When loading such a table in Spark SQL, `o.a.h.h.ql.metadata.Table` first loads these input/output format classes, and thus classes in com.twitter:parquet-hadoop:1.6.0. However, the scope of this dependency is defined as "runtime", and is not packaged into Spark assembly jar. This results in a `ClassNotFoundException`. This issue can be worked around by asking users to add parquet-hadoop 1.6.0 via the `--driver-class-path` option. However, considering Maven build is immune to this problem, I feel it can be confusing and inconvenient for users. So this PR fixes this issue by changing scope of parquet-hadoop 1.6.0 to "compile". Author: Cheng Lian <lian@databricks.com> Closes #8198 from liancheng/spark-9974/bundle-parquet-1.6.0.
* [SPARK-8920] [MLLIB] Add @since tags to mllib.linalgSameer Abhyankar2015-08-178-17/+227
| | | | | | | Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.Samavihome> Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.local> Closes #7729 from sabhyankar/branch_8920.
* [SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listingFeynman Liang2015-08-171-13/+13
| | | | | | | | mengxr jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8255 from feynmanliang/SPARK-10068.
* [SPARK-9592] [SQL] Fix Last function implemented based on AggregateExpression1.Yin Huai2015-08-172-2/+22
| | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-9592 #8113 has the fundamental fix. But, if we want to minimize the number of changed lines, we can go with this one. Then, in 1.6, we merge #8113. Author: Yin Huai <yhuai@databricks.com> Closes #8172 from yhuai/lastFix and squashes the following commits: b28c42a [Yin Huai] Regression test. af87086 [Yin Huai] Fix last.
* [SPARK-9526] [SQL] Utilize randomized tests to reveal potential bugs in sql ↵Yijie Shen2015-08-1710-6/+410
| | | | | | | | | | | | expressions JIRA: https://issues.apache.org/jira/browse/SPARK-9526 This PR is a follow up of #7830, aiming at utilizing randomized tests to reveal more potential bugs in sql expression. Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7855 from yjshen/property_check.
* [SPARK-10036] [SQL] Load JDBC driver in DataFrameReader.jdbc and ↵zsxwing2015-08-174-7/+20
| | | | | | | | | | | | | DataFrameWriter.jdbc This PR uses `JDBCRDD.getConnector` to load JDBC driver before creating connection in `DataFrameReader.jdbc` and `DataFrameWriter.jdbc`. Author: zsxwing <zsxwing@gmail.com> Closes #8232 from zsxwing/SPARK-10036 and squashes the following commits: adf75de [zsxwing] Add extraOptions to the connection properties 57f59d4 [zsxwing] Load JDBC driver in DataFrameReader.jdbc and DataFrameWriter.jdbc
* [SPARK-9950] [SQL] Wrong Analysis Error for grouping/aggregating on struct ↵Wenchen Fan2015-08-171-0/+5
| | | | | | | | | | | | | fields This issue has been fixed by https://github.com/apache/spark/pull/8215, this PR added regression test for it. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8222 from cloud-fan/minor and squashes the following commits: 0bbfb1c [Wenchen Fan] fix style... 7e2d8d9 [Wenchen Fan] add test
* [SPARK-7736] [CORE] [YARN] Make pyspark fail YARN app on failure.Marcelo Vanzin2015-08-174-8/+40
| | | | | | | | | | | | | | | The YARN backend doesn't like when user code calls `System.exit`, since it cannot know the exit status and thus cannot set an appropriate final status for the application. So, for pyspark, avoid that call and instead throw an exception with the exit code. SparkSubmit handles that exception and exits with the given exit code, while YARN uses the exit code as the failure code for the Spark app. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7751 from vanzin/SPARK-9416.
* [SPARK-9924] [WEB UI] Don't schedule checkForLogs while some of them are ↵Rohit Agarwal2015-08-171-7/+21
| | | | | | | | already running. Author: Rohit Agarwal <rohita@qubole.com> Closes #8153 from mindprince/SPARK-9924.
* [SPARK-7837] [SQL] Avoids double closing output writers when commitTask() failsCheng Lian2015-08-182-6/+61
| | | | | | | | When inserting data into a `HadoopFsRelation`, if `commitTask()` of the writer container fails, `abortTask()` will be invoked. However, both `commitTask()` and `abortTask()` try to close the output writer(s). The problem is that, closing underlying writers may not be an idempotent operation. E.g., `ParquetRecordWriter.close()` throws NPE when called twice. Author: Cheng Lian <lian@databricks.com> Closes #8236 from liancheng/spark-7837/double-closing.
* [SPARK-9959] [MLLIB] Association Rules Java CompatibilityFeynman Liang2015-08-171-2/+28
| | | | | | | | mengxr Author: Feynman Liang <fliang@databricks.com> Closes #8206 from feynmanliang/SPARK-9959-arules-java.
* [SPARK-9199] [CORE] Upgrade Tachyon version from 0.7.0 -> 0.7.1.Calvin Jia2015-08-172-2/+2
| | | | | | | | | | | | Updates the tachyon-client version to the latest release. The main difference between 0.7.0 and 0.7.1 on the client side is to support running Tachyon on local file system by default. No new non-Tachyon dependencies are added, and no code changes are required since the client API has not changed. Author: Calvin Jia <jia.calvin@gmail.com> Closes #8235 from calvinjia/spark-9199-master.
* [SPARK-9871] [SPARKR] Add expression functions into SparkR which have a ↵Yu ISHIKAWA2015-08-164-0/+75
| | | | | | | | | | | | | | | | | | variable parameter ### Summary - Add `lit` function - Add `concat`, `greatest`, `least` functions I think we need to improve `collect` function in order to implement `struct` function. Since `collect` doesn't work with arguments which includes a nested `list` variable. It seems that a list against `struct` still has `jobj` classes. So it would be better to solve this problem on another issue. ### JIRA [[SPARK-9871] Add expression functions into SparkR which have a variable parameter - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9871) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8194 from yu-iskw/SPARK-9856.