aboutsummaryrefslogtreecommitdiff
path: root/project
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-15085][STREAMING][KAFKA] Rename streaming-kafka artifactcody koeninger2016-05-112-5/+13
| | | | | | | | | | | | ## What changes were proposed in this pull request? Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions ## How was this patch tested? Unit tests Author: cody koeninger <cody@koeninger.org> Closes #12946 from koeninger/SPARK-15085.
* [SPARK-15250][SQL] Remove deprecated json API in DataFrameReaderhyukjinkwon2016-05-101-0/+3
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR removes the old `json(path: String)` API which is covered by the new `json(paths: String*)`. ## How was this patch tested? Jenkins tests (existing tests should cover this) Author: hyukjinkwon <gurwls223@gmail.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Closes #13040 from HyukjinKwon/SPARK-15250.
* [SPARK-14542][CORE] PipeRDD should allow configurable buffer size for…Sital Kedia2016-05-101-0/+4
| | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently PipedRDD internally uses PrintWriter to write data to the stdin of the piped process, which by default uses a BufferedWriter of buffer size 8k. In our experiment, we have seen that 8k buffer size is too small and the job spends significant amount of CPU time in system calls to copy the data. We should have a way to configure the buffer size for the writer. ## How was this patch tested? Ran PipedRDDSuite tests. Author: Sital Kedia <skedia@fb.com> Closes #12309 from sitalkedia/bufferedPipedRDD.
* [SPARK-10653][CORE] Remove unnecessary things from SparkEnvAlex Bozarth2016-05-091-0/+4
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Removed blockTransferService and sparkFilesDir from SparkEnv since they're rarely used and don't need to be in stored in the env. Edited their few usages to accommodate the change. ## How was this patch tested? ran dev/run-tests locally Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #12970 from ajbozarth/spark10653.
* [SPARK-14738][BUILD] Separate docker integration tests from main buildLuciano Resende2016-05-061-1/+2
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Create a maven profile for executing the docker integration tests using maven Remove docker integration tests from main sbt build Update documentation on how to run docker integration tests from sbt ## How was this patch tested? Manual test of the docker integration tests as in : mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 compile test ## Other comments Note that the the DB2 Docker Tests are still disabled as there is a kernel version issue on the AMPLab Jenkins slaves and we would need to get them on the right level before enabling those tests. They do run ok locally with the updates from PR #12348 Author: Luciano Resende <lresende@apache.org> Closes #12508 from lresende/docker.
* [SPARK-14589][SQL] Enhance DB2 JDBC Dialect docker testsLuciano Resende2016-05-051-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Enhance the DB2 JDBC Dialect docker tests as they seemed to have had some issues on previous merge causing some tests to fail. ## How was this patch tested? By running the integration tests locally. Author: Luciano Resende <lresende@apache.org> Closes #12348 from lresende/SPARK-14589.
* [SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0Herman van Hovell2016-04-301-0/+5
| | | | | | | | | | | | | | | | | | #### What changes were proposed in this pull request? This PR removes three methods the were deprecated in 1.6.0: - `PortableDataStream.close()` - `LinearRegression.weights` - `LogisticRegression.weights` The rationale for doing this is that the impact is small and that Spark 2.0 is a major release. #### How was this patch tested? Compilation succeded. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12732 from hvanhovell/SPARK-14952.
* [SPARK-14511][BUILD] Upgrade genjavadoc to latest upstreamJakob Odersky2016-04-291-3/+7
| | | | | | | | | | | | | ## What changes were proposed in this pull request? In the past, genjavadoc had issues with package private members which led the spark project to use a forked version. This issue has been fixed upstream (typesafehub/genjavadoc#70) and a release is available for scala versions 2.10, 2.11 **and 2.12**, hence a forked version for spark is no longer necessary. This pull request updates the build configuration to use the newest upstream genjavadoc. ## How was this patch tested? The build was run `sbt unidoc`. During the process javadoc emits some errors on the generated java stubs, however these errors were also present before the upgrade. Furthermore, the produced html is fine. Author: Jakob Odersky <jakob@odersky.com> Closes #12707 from jodersky/SPARK-14511-genjavadoc.
* Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in ↵Yin Huai2016-04-282-9/+5
| | | | | | spark-mllib-local" This reverts commit dae538a4d7c36191c1feb02ba87ffc624ab960dc.
* [SPARK-14613][ML] Add @Since into the matrix and vector classes in ↵Pravin Gadakh2016-04-282-5/+9
| | | | | | | | | | | | | | | | spark-mllib-local ## What changes were proposed in this pull request? This PR adds `since` tag into the matrix and vector classes in spark-mllib-local. ## How was this patch tested? Scala-style checks passed. Author: Pravin Gadakh <prgadakh@in.ibm.com> Closes #12416 from pravingadakh/SPARK-14613.
* [SPARK-14654][CORE] New accumulator APIWenchen Fan2016-04-281-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR introduces a new accumulator API which is much simpler than before: 1. the type hierarchy is simplified, now we only have an `Accumulator` class 2. Combine `initialValue` and `zeroValue` concepts into just one concept: `zeroValue` 3. there in only one `register` method, the accumulator registration and cleanup registration are combined. 4. the `id`,`name` and `countFailedValues` are combined into an `AccumulatorMetadata`, and is provided during registration. `SQLMetric` is a good example to show the simplicity of this new API. What we break: 1. no `setValue` anymore. In the new API, the intermedia type can be different from the result type, it's very hard to implement a general `setValue` 2. accumulator can't be serialized before registered. Problems need to be addressed in follow-ups: 1. with this new API, `AccumulatorInfo` doesn't make a lot of sense, the partial output is not partial updates, we need to expose the intermediate value. 2. `ExceptionFailure` should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics, how about we sending a heartbeat to update internal metrics after the failure event? 3. the public event `SparkListenerTaskEnd` carries a `TaskMetrics`. Ideally this `TaskMetrics` don't need to carry external accumulators, as the only method of `TaskMetrics` that can access external accumulators is `private[spark]`. However, `SQLListener` use it to retrieve sql metrics. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #12612 from cloud-fan/acc.
* Revert "[SPARK-14683][DOCUMENTATION] Configure external links in ScalaDoc"Josh Rosen2016-04-271-2/+0
| | | | This reverts commit 3f49afee937a66d458e0c194e46d6a9e380e054e.
* [SPARK-14861][SQL] Replace internal usages of SQLContext with SparkSessionAndrew Or2016-04-251-0/+14
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere since that's supposed to be not the main user-facing API anymore. In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait. **Reviewers**: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little. ## How was this patch tested? No change in functionality intended. Author: Andrew Or <andrew@databricks.com> Closes #12625 from andrewor14/spark-session-refactor.
* [SPARK-14790] Always run scalastyle on sbt compile and testEric Liang2016-04-251-1/+83
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Sbt compile and test should also run scalastyle. This makes it less likely you forget to run scalastyle and fail in jenkins. Scalastyle results are cached for efficiency. This patch was originally written by ahirreddy; I just fixed it up to work with scalastyle 0.8.0. ## How was this patch tested? Tested manually with `build/sbt package`. Author: Eric Liang <ekl@databricks.com> Closes #12555 from ericl/scalastyle.
* [SPARK-14807] Create a compatibility moduleYin Huai2016-04-221-3/+3
| | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR creates a compatibility module in sql (called `hive-1-x-compatibility`), which will host HiveContext in Spark 2.0 (moving HiveContext to here will be done separately). This module is not included in assembly because only users who still want to access HiveContext need it. ## How was this patch tested? I manually tested `sbt/sbt -Phive package` and `mvn -Phive package -DskipTests`. Author: Yin Huai <yhuai@databricks.com> Closes #12580 from yhuai/compatibility.
* [SPARK-6429] Implement hashCode and equals togetherJoan2016-04-221-0/+4
| | | | | | | | | | | ## What changes were proposed in this pull request? Implement some `hashCode` and `equals` together in order to enable the scalastyle. This is a first batch, I will continue to implement them but I wanted to know your thoughts. Author: Joan <joan@goyeau.com> Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
* [SPARK-14734][ML][MLLIB] Added asML, fromML methods for all spark.mllib ↵Joseph K. Bradley2016-04-211-0/+4
| | | | | | | | | | | | | | | | | Vector, Matrix types ## What changes were proposed in this pull request? For maintaining wrappers around spark.mllib algorithms in spark.ml, it will be useful to have ```private[spark]``` methods for converting from one linear algebra representation to another. This PR adds toNew, fromNew methods for all spark.mllib Vector and Matrix types. ## How was this patch tested? Unit tests for all conversions Author: Joseph K. Bradley <joseph@databricks.com> Closes #12504 from jkbradley/linalg-conversions.
* [SPARK-13643][SQL] Implement SparkSessionAndrew Or2016-04-211-0/+3
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? After removing most of `HiveContext` in 8fc267ab3322e46db81e725a5cb1adb5a71b2b4d we can now move existing functionality in `SQLContext` to `SparkSession`. As of this PR `SQLContext` becomes a simple wrapper that has a `SparkSession` and delegates all functionality to it. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #12553 from andrewor14/implement-spark-session.
* [SPARK-4452] [CORE] Shuffle data structures can starve others on the same ↵Lianhui Wang2016-04-211-0/+3
| | | | | | | | | | | | | | | thread for memory ## What changes were proposed in this pull request? In #9241 It implemented a mechanism to call spill() on those SQL operators that support spilling if there is not enough memory for execution. But ExternalSorter and AppendOnlyMap in Spark core are not worked. So this PR make them benefit from #9241. Now when there is not enough memory for execution, it can get memory by spilling ExternalSorter and AppendOnlyMap in Spark core. ## How was this patch tested? add two unit tests for it. Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #10024 from lianhuiwang/SPARK-4452-2.
* [SPARK-14704][CORE] create accumulators in TaskMetricsWenchen Fan2016-04-191-0/+4
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Before this PR, we create accumulators at driver side(and register them) and send them to executor side, then we create `TaskMetrics` with these accumulators at executor side. After this PR, we will create `TaskMetrics` at driver side and send it to executor side, so that we can create accumulators inside `TaskMetrics` directly, which is cleaner. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12472 from cloud-fan/acc.
* [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into ↵Cheng Lian2016-04-191-0/+4
| | | | | | | | | | | | | | | | | | | execution/datasources package #12178 ## What changes were proposed in this pull request? This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package. Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
* [SPARK-14042][CORE] Add custom coalescer supportNezih Yigitbasi2016-04-191-0/+4
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR adds support for specifying an optional custom coalescer to the `coalesce()` method. Currently I have only added this feature to the `RDD` interface, and once we sort out the details we can proceed with adding this feature to the other APIs (`Dataset` etc.) ## How was this patch tested? Added a unit test for this functionality. /cc rxin (per our discussion on the mailing list) Author: Nezih Yigitbasi <nyigitbasi@netflix.com> Closes #11865 from nezihyigitbasi/custom_coalesce_policy.
* [SPARK-14628][CORE][FOLLLOW-UP] Always tracking read/write metricsWenchen Fan2016-04-181-0/+14
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR is a follow up for https://github.com/apache/spark/pull/12417, now we always track input/output/shuffle metrics in spark JSON protocol and status API. Most of the line changes are because of re-generating the gold answer for `HistoryServerSuite`, and we add a lot of 0 values for read/write metrics. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12462 from cloud-fan/follow.
* [SPARK-14504][SQL] Enable Oracle docker testsLuciano Resende2016-04-181-2/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Enable Oracle docker tests ## How was this patch tested? Existing tests Author: Luciano Resende <lresende@apache.org> Closes #12270 from lresende/oracle.
* [SPARK-14683][DOCUMENTATION] Configure external links in ScalaDoc杨博 (Yang Bo)2016-04-161-0/+2
| | | | | | | | | | Right now Spark's Scaladoc does not link to Scala standard library and other dependencies. This would bother Spark starters because they may be not experienced Scala programmers. This patch fixes these links in ScalaDoc. Author: 杨博 (Yang Bo) <pop.atry@gmail.com> Closes #12444 from Atry/patch-1.
* [SPARK-14628][CORE] Simplify task metrics by always tracking read/write metricsReynold Xin2016-04-151-1/+4
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Part of the reason why TaskMetrics and its callers are complicated are due to the optional metrics we collect, including input, output, shuffle read, and shuffle write. I think we can always track them and just assign 0 as the initial values. It is usually very obvious whether a task is supposed to read any data or not. By always tracking them, we can remove a lot of map, foreach, flatMap, getOrElse(0L) calls throughout Spark. This patch also changes a few behaviors. 1. Removed the distinction of data read/write methods (e.g. Hadoop, Memory, Network, etc). 2. Accumulate all data reads and writes, rather than only the first method. (Fixes SPARK-5225) ## How was this patch tested? existing tests. This is bases on https://github.com/apache/spark/pull/12388, with more test fixes. Author: Reynold Xin <rxin@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Closes #12417 from cloud-fan/metrics-refactor.
* [SPARK-14617] Remove deprecated APIs in TaskMetricsReynold Xin2016-04-141-1/+4
| | | | | | | | | | | | ## What changes were proposed in this pull request? This patch removes some of the deprecated APIs in TaskMetrics. This is part of my bigger effort to simplify accumulators and task metrics. ## How was this patch tested? N/A - only removals Author: Reynold Xin <rxin@databricks.com> Closes #12375 from rxin/SPARK-14617.
* [SPARK-14596][SQL] Remove not used SqlNewHadoopRDD and some more unused importshyukjinkwon2016-04-141-5/+0
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Old `HadoopFsRelation` API includes `buildInternalScan()` which uses `SqlNewHadoopRDD` in `ParquetRelation`. Because now the old API is removed, `SqlNewHadoopRDD` is not used anymore. So, this PR removes `SqlNewHadoopRDD` and several unused imports. This was discussed in https://github.com/apache/spark/pull/12326. ## How was this patch tested? Several related existing unit tests and `sbt scalastyle`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12354 from HyukjinKwon/SPARK-14596.
* [SPARK-14475] Propagate user-defined context from driver to executorsEric Liang2016-04-111-0/+3
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This adds a new API call `TaskContext.getLocalProperty` for getting properties set in the driver from executors. These local properties are automatically propagated from the driver to executors. For streaming, the context for streaming tasks will be the initial driver context when ssc.start() is called. ## How was this patch tested? Unit tests. cc JoshRosen Author: Eric Liang <ekl@databricks.com> Closes #12248 from ericl/sc-2813.
* [SPARK-10521][SQL] Utilize Docker for test DB2 JDBC Dialect supportLuciano Resende2016-04-111-1/+3
| | | | | | | | Add integration tests based on docker to test DB2 JDBC dialect support Author: Luciano Resende <lresende@apache.org> Closes #9893 from lresende/SPARK-10521.
* [SPARK-14462][ML][MLLIB] Add the mllib-local build to maven pomDB Tsai2016-04-111-3/+3
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies. The previous PR was failing the build because of `spark-core:test` dependency, and that was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` in mllib-local test was used, similar to sketch. Thanks. ## How was this patch tested? Unit tests mengxr tedyu holdenk Author: DB Tsai <dbt@netflix.com> Closes #12298 from dbtsai/dbtsai-mllib-local-build-fix.
* Revert "[SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom"Xiangrui Meng2016-04-091-3/+3
| | | | This reverts commit 1598d11bb0248384872cf88bc2b16f3b238046ad.
* [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pomDB Tsai2016-04-091-3/+3
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. Couple platform independent classes will be moved to this package to demonstrate how this work. ## How was this patch tested? Unit tests Author: DB Tsai <dbt@netflix.com> Closes #12241 from dbtsai/dbtsai-mllib-local-build.
* [SPARK-14451][SQL] Move encoder definition into Aggregator interfaceReynold Xin2016-04-091-0/+5
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? When we first introduced Aggregators, we required the user of Aggregators to (implicitly) specify the encoders. It would actually make more sense to have the encoders be specified by the implementation of Aggregators, since each implementation should have the most state about how to encode its own data type. Note that this simplifies the Java API because Java users no longer need to explicitly specify encoders for aggregators. ## How was this patch tested? Updated unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12231 from rxin/SPARK-14451.
* [SPARK-14437][CORE] Use the address that NettyBlockTransferService listens ↵Shixiong Zhu2016-04-081-0/+3
| | | | | | | | | | | | | | | | | | | to create BlockManagerId ## What changes were proposed in this pull request? Here is why SPARK-14437 happens: BlockManagerId is created using NettyBlockTransferService.hostName which comes from `customHostname`. And `Executor` will set `customHostname` to the hostname which is detected by the driver. However, the driver may not be able to detect the correct address in some complicated network (Netty's Channel.remoteAddress doesn't always return a connectable address). In such case, `BlockManagerId` will be created using a wrong hostname. To fix this issue, this PR uses `hostname` provided by `SparkEnv.create` to create `NettyBlockTransferService` and set `NettyBlockTransferService.hostname` to this one directly. A bonus of this approach is NettyBlockTransferService won't bound to `0.0.0.0` which is much safer. ## How was this patch tested? Manually checked the bound address using local-cluster. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12240 from zsxwing/SPARK-14437.
* [SPARK-13048][ML][MLLIB] keepLastCheckpoint option for LDA EM optimizerJoseph K. Bradley2016-04-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The EMLDAOptimizer should generally not delete its last checkpoint since that can cause failures when DistributedLDAModel methods are called (if any partitions need to be recovered from the checkpoint). This PR adds a "deleteLastCheckpoint" option which defaults to false. This is a change in behavior from Spark 1.6, in that the last checkpoint will not be removed by default. This involves adding the deleteLastCheckpoint option to both spark.ml and spark.mllib, and modifying PeriodicCheckpointer to support the option. This also: * Makes MLlibTestSparkContext extend TempDirectory and set the checkpointDir to tempDir * Updates LibSVMRelationSuite because of a name conflict with "tempDir" (and fixes a bug where it failed to delete a temp directory) * Adds a MIMA exclude for DistributedLDAModel constructor, which is already ```private[clustering]``` ## How was this patch tested? Added 2 new unit tests to spark.ml LDASuite, which calls into spark.mllib. Author: Joseph K. Bradley <joseph@databricks.com> Closes #12166 from jkbradley/emlda-save-checkpoint.
* [SPARK-14449][SQL] SparkContext should use SparkListenerInterfaceMichael Armbrust2016-04-071-0/+1
| | | | | | | | | | | | Currently all `SparkFirehoseListener` implementations are broken since we expect listeners to extend `SparkListener`, while the fire hose only extends `SparkListenerInterface`. This changes the addListener function and the config based injection to use the interface instead. The existing tests in SparkListenerSuite are improved such that they would have caught this. Follow-up to #12142 Author: Michael Armbrust <michael@databricks.com> Closes #12227 from marmbrus/fixListener.
* [SPARK-13430][PYSPARK][ML] Python API for training summaries of linear and ↵Bryan Cutler2016-04-061-0/+3
| | | | | | | | | | | | | | | logistic regression ## What changes were proposed in this pull request? Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML. ## How was this patch tested? Added unit tests to exercise the api calls for the summary classes. Also, manually verified values are expected and match those from Scala directly. Author: Bryan Cutler <cutlerb@gmail.com> Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
* [SPARK-529][SQL] Modify SQLConf to use new config API from core.Marcelo Vanzin2016-04-051-2/+16
| | | | | | | | | | | | Because SQL keeps track of all known configs, some customization was needed in SQLConf to allow that, since the core API does not have that feature. Tested via existing (and slightly updated) unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11570 from vanzin/SPARK-529-sql.
* [SPARK-14366] Remove sbt-idea pluginLuciano Resende2016-04-041-2/+0
| | | | | | | | | | ## What changes were proposed in this pull request? Remove sbt-idea plugin as importing sbt project provides much better support. Author: Luciano Resende <lresende@apache.org> Closes #12151 from lresende/SPARK-14366.
* [SPARK-13579][BUILD] Stop building the main Spark assembly.Marcelo Vanzin2016-04-041-23/+22
| | | | | | | | | | | | | | | | | | | | This change modifies the "assembly/" module to just copy needed dependencies to its build directory, and modifies the packaging script to pick those up (and remove duplicate jars packages in the examples module). I also made some minor adjustments to dependencies to remove some test jars from the final packaging, and remove jars that conflict with each other when packaged separately (e.g. servlet api). Also note that this change restores guava in applications' classpaths, even though it's still shaded inside Spark. This is now needed for the Hadoop libraries that are packaged with Spark, which now are not processed by the shade plugin. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11796 from vanzin/SPARK-13579.
* [SPARK-14358] Change SparkListener from a trait to an abstract classReynold Xin2016-04-041-1/+10
| | | | | | | | | | | | ## What changes were proposed in this pull request? Scala traits are difficult to maintain binary compatibility on, and as a result we had to introduce JavaSparkListener. In Spark 2.0 we can change SparkListener from a trait to an abstract class and then remove JavaSparkListener. ## How was this patch tested? Updated related unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12142 from rxin/SPARK-14358.
* [SPARK-13674] [SQL] Add wholestage codegen support to SampleLiang-Chi Hsieh2016-04-011-0/+4
| | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-13674 ## What changes were proposed in this pull request? Sample operator doesn't support wholestage codegen now. This pr is to add support to it. ## How was this patch tested? A test is added into `BenchmarkWholeStageCodegen`. Besides, all tests should be passed. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11517 from viirya/add-wholestage-sample.
* [SPARK-11262][ML] Unit test for gradient, loss layers, memory management for ↵Alexander Ulanov2016-03-311-0/+5
| | | | | | | | | | | | | | | | | | | | | multilayer perceptron 1.Implement LossFunction trait and implement squared error and cross entropy loss with it 2.Implement unit test for gradient and loss 3.Implement InPlace trait and in-place layer evaluation 4.Refactor interface for ActivationFunction 5.Update of Layer and LayerModel interfaces 6.Fix random weights assignment 7.Implement memory allocation by MLP model instead of individual layers These features decreased the memory usage and increased flexibility of internal API. Author: Alexander Ulanov <nashb@yandex.ru> Author: avulanov <avulanov@gmail.com> Closes #9229 from avulanov/mlp-refactoring.
* [SPARK-14211][SQL] Remove ANTLR3 based parserHerman van Hovell2016-03-312-55/+2
| | | | | | | | | | | | | | | | ### What changes were proposed in this pull request? This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`. ### How was this patch tested? Existing unit tests. cc rxin andrewor14 yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12071 from hvanhovell/SPARK-14211.
* [SPARK-14205][SQL] remove trait QueryableWenchen Fan2016-03-281-0/+3
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? After DataFrame and Dataset are merged, the trait `Queryable` becomes unnecessary as it has only one implementation. We should remove it. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12001 from cloud-fan/df-ds.
* [SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4Herman van Hovell2016-03-282-2/+12
| | | | | | | | | | | | | | | | | | | | | | ### What changes were proposed in this pull request? The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4. This parser is based on the [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality is currently missing, the plan is to add this in follow-up PRs. This PR is a work in progress, and work needs to be done in the following area's: - [x] Error handling should be improved. - [x] Documentation should be improved. - [x] Multi-Insert needs to be tested. - [ ] Naming and package locations. ### How was this patch tested? Catalyst and SQL unit tests. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #11557 from hvanhovell/ngParser.
* [SPARK-14089][CORE][MLLIB] Remove methods that has been deprecated since ↵Liwei Lin2016-03-261-0/+13
| | | | | | | | | | | | | | | | | | 1.1, 1.2, 1.3, 1.4, and 1.5 ## What changes were proposed in this pull request? Removed methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5. ## How was this patch tested? - manully checked that no codes in Spark call these methods any more - existing test suits Author: Liwei Lin <lwlin7@gmail.com> Author: proflin <proflin.me@gmail.com> Closes #11910 from lw-lin/remove-deprecates.
* [SPARK-13874][DOC] Remove docs of streaming-akka, streaming-zeromq, ↵Shixiong Zhu2016-03-261-6/+1
| | | | | | | | | | | | | | | | | | streaming-mqtt and streaming-twitter ## What changes were proposed in this pull request? This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects since I have already copied them to https://github.com/spark-packages Also remove mqtt_wordcount.py that I forgot to remove previously. ## How was this patch tested? Jenkins PR Build. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11824 from zsxwing/remove-doc.
* [SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to SparkShixiong Zhu2016-03-251-8/+14
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR moves flume back to Spark as per the discussion in the dev mail-list. ## How was this patch tested? Existing Jenkins tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11895 from zsxwing/move-flume-back.