path: root/pom.xml
* [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version (Josh Rosen, 2016-01-30; 1 file changed, -4/+4)
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds). The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2x faster for the Maven compile-only builds, for instance). After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10608 from JoshRosen/SPARK-6363.
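Concretely, a default flip like this is a small property change in the parent POM; a minimal sketch, assuming the usual `scala.version`/`scala.binary.version` properties (the exact patch version shown is illustrative, not read from the commit):

```xml
<properties>
  <!-- Illustrative: Scala 2.11 becomes the default; 2.10 stays buildable via a profile -->
  <scala.version>2.11.7</scala.version>
  <scala.binary.version>2.11</scala.binary.version>
</properties>
```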
* [SPARK-12933][SQL] Initial implementation of Count-Min sketch (Cheng Lian, 2016-01-23; 1 file changed, -0/+1)
This PR adds an initial implementation of Count-Min sketch, contained in a new module spark-sketch under `common/sketch`. The implementation is based on the [`CountMinSketch` class in stream-lib][1]. As required by the [design doc][2], spark-sketch should have no external dependency. Two classes, `Murmur3_x86_32` and `Platform`, are copied to spark-sketch from spark-unsafe for hashing facilities. They'll also be used in the upcoming bloom filter implementation. The following features will be added in future follow-up PRs:
- Serialization support
- DataFrame API integration
[1]: https://github.com/addthis/stream-lib/blob/aac6b4d23a8686b000f80baa447e0922ecac3bcb/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java
[2]: https://issues.apache.org/jira/secure/attachment/12782378/BloomFilterandCount-MinSketchinSpark2.0.pdf
Author: Cheng Lian <lian@databricks.com>
Closes #10851 from liancheng/count-min-sketch.
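The one-line pom.xml change registers the new module; a minimal sketch (the surrounding module entries are elided):

```xml
<modules>
  <!-- ... existing modules ... -->
  <module>common/sketch</module>
</modules>
```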
* [SPARK-7997][CORE] Remove Akka from Spark Core and Streaming (Shixiong Zhu, 2016-01-22; 1 file changed, -0/+5)
- Remove the Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer.
- Remove Akka configs from SparkConf and SSLOptions.
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth keeping this config because the choice between `DirectTaskResult` and `IndirectTaskResult` depends on it.
- Update comments and docs.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10854 from zsxwing/remove-akka.
* [SPARK-7799][SPARK-12786][STREAMING] Add "streaming-akka" project (Shixiong Zhu, 2016-01-20; 1 file changed, -0/+1)
Includes the following changes:
1. Add the "streaming-akka" project and org.apache.spark.streaming.akka.AkkaUtils for creating an actorStream
2. Remove "StreamingContext.actorStream" and "JavaStreamingContext.actorStream"
3. Update the ActorWordCount example and add the JavaActorWordCount example
4. Make "streaming-zeromq" depend on "streaming-akka" and update the code accordingly
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10744 from zsxwing/streaming-akka-2.
* [SPARK-12842][TEST-HADOOP2.7] Add Hadoop 2.7 build profile (Josh Rosen, 2016-01-15; 1 file changed, -0/+10)
This patch adds a Hadoop 2.7 build profile in order to let us automate tests against that version.
/cc rxin srowen
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10775 from JoshRosen/add-hadoop-2.7-profile.
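A hedged sketch of what such a profile adds to the parent POM; the shape follows Spark's other Hadoop profiles, and the pinned version here is an assumption rather than the committed value:

```xml
<profile>
  <id>hadoop-2.7</id>
  <properties>
    <!-- Assumed value for illustration; the commit pins its own versions -->
    <hadoop.version>2.7.1</hadoop.version>
  </properties>
</profile>
```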
* [SPARK-12269][STREAMING][KINESIS] Update aws-java-sdk version (BrianLondon, 2016-01-11; 1 file changed, -3/+3)
The current Spark Streaming Kinesis connector references a quite old version of the AWS Java SDK, 1.9.40 (1.10.40 is current). Numerous AWS features, including Kinesis Firehose, are unavailable in 1.9. Those two versions of the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 respectively), such that one cannot include the current AWS SDK in a project that also uses the Spark Streaming Kinesis ASL.
Author: BrianLondon <brian@seatgeek.com>
Closes #10256 from BrianLondon/master.
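In the parent POM this amounts to bumping a version property; a minimal sketch, with the property name assumed (the -3/+3 diffstat is consistent with a few such one-line bumps):

```xml
<properties>
  <!-- Assumed property name; 1.10.40 is the SDK version named in the message -->
  <aws.java.sdk.version>1.10.40</aws.java.sdk.version>
</properties>
```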
* [SPARK-12734][HOTFIX][TEST-MAVEN] Fix bug in Netty exclusions (Josh Rosen, 2016-01-11; 1 file changed, -43/+7)
This is a hotfix for a build bug introduced by the Netty exclusion changes in #10672. We can't exclude `io.netty:netty` because Akka depends on it. There's not a direct conflict between `io.netty:netty` and `io.netty:netty-all`, because the former puts classes in the `org.jboss.netty` namespace while the latter uses the `io.netty` namespace. However, there still is a conflict between `org.jboss.netty:netty` and `io.netty:netty`, so we need to continue to exclude the JBoss version of that artifact. While the diff here looks somewhat large, note that this is only a revert of some of the changes from #10672. You can see the net changes in pom.xml at https://github.com/apache/spark/compare/3119206b7188c23055621dfeaf6874f21c711a82...5211ab8#diff-600376dffeb79835ede4a0b285078036
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10693 from JoshRosen/netty-hotfix.
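For orientation, a minimal sketch of the exclusion that must stay in place; the dependent artifact here is a hypothetical placeholder, and only the excluded coordinate is the point:

```xml
<dependency>
  <!-- com.example:pulls-in-netty is a hypothetical dependent artifact -->
  <groupId>com.example</groupId>
  <artifactId>pulls-in-netty</artifactId>
  <exclusions>
    <exclusion>
      <!-- Keep excluding the JBoss-namespaced Netty; io.netty:netty must NOT be excluded -->
      <groupId>org.jboss.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```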
* [SPARK-12734][BUILD] Fix Netty exclusion and use Maven Enforcer to prevent future bugs (Josh Rosen, 2016-01-10; 1 file changed, -1/+56)
Netty classes are published under multiple artifacts with different names, so our build needs to exclude the `io.netty:netty` and `org.jboss.netty:netty` versions of the Netty artifact. However, our existing exclusions were incomplete, leading to situations where duplicate Netty classes would wind up on the classpath and cause compile errors (or worse). This patch fixes the exclusion issue by adding more exclusions and uses Maven Enforcer's [banned dependencies](https://maven.apache.org/enforcer/enforcer-rules/bannedDependencies.html) rule to prevent these classes from accidentally being reintroduced. I also updated `dev/test-dependencies.sh` to run `mvn validate` so that the enforcer rules can run as part of pull request builds.
/cc rxin srowen pwendell. I'd like to backport at least the exclusion portion of this fix to `branch-1.5` in order to fix the documentation publishing job, which fails nondeterministically due to incompatible versions of Netty classes taking precedence on the compile-time classpath.
Author: Josh Rosen <rosenville@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10672 from JoshRosen/enforce-netty-exclusions.
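A minimal sketch of an Enforcer banned-dependencies rule of the kind described, following the linked Enforcer documentation rather than the commit's exact configuration:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <executions>
    <execution>
      <id>enforce-banned-dependencies</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <bannedDependencies>
            <excludes>
              <!-- Fail the build if the legacy Netty coordinate sneaks back in -->
              <exclude>org.jboss.netty:netty</exclude>
            </excludes>
          </bannedDependencies>
        </rules>
        <fail>true</fail>
      </configuration>
    </execution>
  </executions>
</plugin>
```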
* [SPARK-4628][BUILD] Remove all non-Maven-Central repositories from build (Josh Rosen, 2016-01-08; 1 file changed, -87/+0)
This patch removes all non-Maven-Central repositories from Spark's build, thereby avoiding any risk of future build breaks due to us accidentally depending on an artifact which is not present in an immutable public Maven repository. I tested this by running

```
build/mvn \
  -Phive \
  -Phive-thriftserver \
  -Pkinesis-asl \
  -Pspark-ganglia-lgpl \
  -Pyarn \
  dependency:go-offline
```

inside of a fresh Ubuntu Docker container with no Ivy or Maven caches (I did a similar test for SBT).
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10659 from JoshRosen/SPARK-4628.
* [SPARK-4819] Remove Guava's "Optional" from public API (Sean Owen, 2016-01-08; 1 file changed, -11/+0)
Replace Guava `Optional` with (an API clone of) Java 8 `java.util.Optional` (edit: and a clone of Guava `Optional`).
See also https://github.com/apache/spark/pull/10512
Author: Sean Owen <sowen@cloudera.com>
Closes #10513 from srowen/SPARK-4819.
* [SPARK-12573][SPARK-12574][SQL] Move SQL Parser from Hive to Catalyst (Herman van Hovell, 2016-01-06; 1 file changed, -0/+6)
This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to start using this parser for all of our SQL parsing. The following key changes have been made:
The ANTLR parser and supporting classes have been moved to the Catalyst project. They are now part of the `org.apache.spark.sql.catalyst.parser` package. These classes contained quite a bit of code that was originally from the Hive project; I have added acknowledgements wherever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean up the `ASTNode` class and to improve the error handling.
The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project:
- `CatalystQl`: implements Query and Expression parsing functionality.
- `SparkQl`: a subclass of CatalystQl that provides SQL/Core-only functionality such as Explain and Describe.
- `HiveQl`: a subclass of `SparkQl` that adds Hive-only functionality to the parser, such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive.
cc rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #10583 from hvanhovell/SPARK-12575.
* [SPARK-12453][STREAMING] Remove explicit dependency on aws-java-sdk (BrianLondon, 2016-01-05; 1 file changed, -1/+0)
Successfully ran the Kinesis demo on a live, AWS-hosted Kinesis stream against the master and 1.6 branches. For reasons I don't entirely understand, it required a manual merge to 1.5, which I did as shown here: https://github.com/BrianLondon/spark/commit/075c22e89bc99d5e99be21f40e0d72154a1e23a2
The demo ran successfully on the 1.5 branch as well. According to `mvn dependency:tree` it is still pulling a fairly old version of the aws-java-sdk (1.9.37), but this appears to have fixed the Kinesis regression in 1.5.2.
Author: BrianLondon <brian@seatgeek.com>
Closes #10492 from BrianLondon/remove-only.
* [SPARK-12362][SQL][WIP] Inline Hive Parser (Herman van Hovell, 2016-01-01; 1 file changed, -0/+5)
This PR inlines the Hive SQL parser in Spark SQL. The previous (merged) incarnation of this PR passed all tests, but had and still has problems with the build. These problems are caused by the fact that, for some reason, in some cases the ANTLR-generated code is not included in the compilation phase. This PR is a WIP and should not be merged until we have sorted out the build issues.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>
Closes #10525 from hvanhovell/SPARK-12362.
* [SPARK-10359] Enumerate dependencies in a file and diff against it for new pull requests (Josh Rosen, 2015-12-30; 1 file changed, -0/+17)
This patch adds a new build check which enumerates Spark's resolved runtime classpath and saves it to a file, then diffs against that file to detect whether pull requests have introduced dependency changes. The aim of this check is to make it simpler to reason about whether a pull request which modifies the build has introduced new dependencies or changed transitive dependencies in a way that affects the final classpath. This supplants the checks added in SPARK-4123 / #5093, which are currently disabled due to bugs. This patch is based on pwendell's work in #8531. Closes #8531.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Patrick Wendell <patrick@databricks.com>
Closes #10461 from JoshRosen/SPARK-10359.
* Revert "[SPARK-12362][SQL][WIP] Inline Hive Parser"Reynold Xin2015-12-301-5/+0
| | | | This reverts commit b600bccf41a7b1958e33d8301a19214e6517e388 due to non-deterministic build breaks.
* [SPARK-12362][SQL][WIP] Inline Hive Parser (Nong Li, 2015-12-29; 1 file changed, -0/+5)
This is a WIP. The PR has been taken over from nongli (see https://github.com/apache/spark/pull/10420). I have removed some additional dead code, and fixed a few issues which were caused by the fact that the inlined Hive parser is newer than the Hive parser we currently use in Spark. I am submitting this PR in order to get some feedback and testing done. There is quite a bit of work to do:
- [ ] Get it to pass the Jenkins build/test.
- [ ] Acknowledge the Hive project for using their parser.
- [ ] Refactorings between HiveQl and the Java classes.
- [ ] Create our own ASTNode and integrate the current implicit extensions.
- [ ] Move remaining `SemanticAnalyzer` and `ParseUtils` functionality to `HiveQl`.
- [ ] Remove Hive dependencies from the parser. This will require some edits in the grammar files.
- [ ] Introduce our own context, which needs to contain a `TokenRewriteStream`.
- [ ] Add `useSQL11ReservedKeywordsForIdentifier` and `allowQuotedId` to the catalyst or sql configuration.
- [ ] Remove `HiveConf` from the grammar files and HiveQl, and pass in our own configuration.
- [ ] Move the parser into sql/core.
cc nongli rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>
Closes #10509 from hvanhovell/SPARK-12362.
* [SPARK-11807] Remove support for Hadoop < 2.2 (Reynold Xin, 2015-12-21; 1 file changed, -13/+0)
i.e. Hadoop 1 and Hadoop 2.0.
Author: Reynold Xin <rxin@databricks.com>
Closes #10404 from rxin/SPARK-11807.
* [SPARK-11808] Remove Bagel. (Reynold Xin, 2015-12-19; 1 file changed, -2/+1)
Author: Reynold Xin <rxin@databricks.com>
Closes #10395 from rxin/SPARK-11808.
* Bump master version to 2.0.0-SNAPSHOT. (Reynold Xin, 2015-12-19; 1 file changed, -1/+1)
Author: Reynold Xin <rxin@databricks.com>
Closes #10387 from rxin/version-bump.
* [SPARK-11796] Fix httpclient and httpcore dependency issues related to docker-client (Mark Grover, 2015-12-09; 1 file changed, -0/+28)
This commit fixes dependency issues which prevented the Docker-based JDBC integration tests from running in the Maven build.
Author: Mark Grover <mgrover@cloudera.com>
Closes #9876 from markgrover/master_docker.
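The usual Maven fix for this class of conflict is pinning the contested artifacts in `<dependencyManagement>`; a hedged sketch (the versions shown are assumptions, not the committed ones):

```xml
<dependencyManagement>
  <dependencies>
    <!-- Illustrative pins so docker-client and Spark agree on one HttpComponents version -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.3.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
      <version>4.3.2</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```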
* [SPARK-11652][CORE] Remote code execution with InvokerTransformer (Sean Owen, 2015-12-08; 1 file changed, -1/+1)
Fix the commons-collection group ID to commons-collections for version 3.x.
Patches the earlier PR at https://github.com/apache/spark/pull/9731
Author: Sean Owen <sowen@cloudera.com>
Closes #10198 from srowen/SPARK-11652.2.
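For reference, the corrected coordinates for the 3.x line, where group ID and artifact ID are both `commons-collections`; a sketch rather than the literal diff:

```xml
<dependency>
  <groupId>commons-collections</groupId>
  <artifactId>commons-collections</artifactId>
  <version>3.2.2</version>
</dependency>
```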
* [SPARK-12112][BUILD] Upgrade to SBT 0.13.9 (Josh Rosen, 2015-12-05; 1 file changed, -1/+1)
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin). I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
* [SPARK-6990][BUILD] Add Java linting script; fix minor warnings (Dmitry Erastov, 2015-12-04; 1 file changed, -0/+24)
This replaces https://github.com/apache/spark/pull/9696. Invoke Checkstyle and print any errors to the console, failing the step. Use Google's style rules modified according to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide. Some important checks are disabled (see TODOs in `checkstyle.xml`) due to multiple violations being present in the codebase; I suggest fixing those TODOs in separate PR(s). More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).
Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)), duplicated because I ran the build twice with different profiles:
> Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1
Also fix some of the minor violations that didn't require sweeping changes. Apologies for the previous botched PRs; I finally figured out the issue. cr: JoshRosen, pwendell
> I state that the contribution is my original work, and I license the work to the project under the project's open source license.
Author: Dmitry Erastov <derastov@gmail.com>
Closes #9867 from dskrvk/master.
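A minimal sketch of wiring Checkstyle into a Maven build this way; the plugin coordinates and parameters are standard `maven-checkstyle-plugin` ones, but the config location is an assumption, not taken from the commit:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-checkstyle-plugin</artifactId>
  <configuration>
    <!-- Assumed path to the project's Checkstyle rules (with its TODO-disabled checks) -->
    <configLocation>checkstyle.xml</configLocation>
    <failOnViolation>true</failOnViolation>
  </configuration>
</plugin>
```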
* [SPARK-4424] Remove spark.driver.allowMultipleContexts override in tests (Josh Rosen, 2015-11-23; 1 file changed, -2/+0)
This patch removes `spark.driver.allowMultipleContexts=true` from our test configuration. The multiple SparkContexts check was originally disabled because certain test suites in SQL needed to create multiple contexts. As far as I know, this configuration change is no longer necessary, so we should remove it in order to make it easier to find test cleanup bugs.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #9865 from JoshRosen/SPARK-4424.
* [SPARK-11652][CORE] Remote code execution with InvokerTransformer (Sean Owen, 2015-11-18; 1 file changed, -0/+7)
Update to Commons Collections 3.2.2 to avoid any potential remote code execution vulnerability.
Author: Sean Owen <sowen@cloudera.com>
Closes #9731 from srowen/SPARK-11652.
* [SPARK-11583] [CORE] MapStatus Using RoaringBitmap More Properly (Kent Yao, 2015-11-17; 1 file changed, -1/+1)
This PR upgrades the version of RoaringBitmap to 0.5.10 to optimize the memory layout; MapStatus will be much smaller when most of the blocks are empty. This PR is based on #9661 (fixing conflicts); see all of the comments at https://github.com/apache/spark/pull/9661.
Author: Kent Yao <yaooqinn@hotmail.com>
Author: Davies Liu <davies@databricks.com>
Author: Charles Allen <charles@allen-net.com>
Closes #9746 from davies/roaring_mapstatus.
* Revert "[SPARK-11271][SPARK-11016][CORE] Use Spark BitSet instead of ↵Davies Liu2015-11-161-0/+5
| | | | | | RoaringBitmap to reduce memory usage" This reverts commit e209fa271ae57dc8849f8b1241bf1ea7d6d3d62c.
* [SPARK-6152] Use shaded ASM5 to support closure cleaning of Java 8 compiled classes (Josh Rosen, 2015-11-11; 1 file changed, -0/+8)
This patch modifies Spark's closure cleaner (and a few other places) to use ASM 5, which is necessary in order to support cleaning of closures that were compiled by Java 8. In order to avoid ASM dependency conflicts, Spark excludes ASM from all of its dependencies and uses a shaded version of ASM 4 that comes from `reflectasm` (see [SPARK-782](https://issues.apache.org/jira/browse/SPARK-782) and #232). This patch updates Spark to use a shaded version of ASM 5.0.4 that was published by the Apache XBean project; the POM used to create the shaded artifact can be found at https://github.com/apache/geronimo-xbean/blob/xbean-4.4/xbean-asm5-shaded/pom.xml. http://movingfulcrum.tumblr.com/post/80826553604/asm-framework-50-the-missing-migration-guide was a useful resource while upgrading the code to use the new ASM5 opcodes. I also added new regression tests in the `java8-tests` subproject; the existing tests were insufficient to catch this bug, which only affected Scala 2.11 user code which was compiled targeting Java 8.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #9512 from JoshRosen/SPARK-6152.
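A sketch of the resulting dependency on the XBean-shaded ASM 5 artifact; the coordinates follow the linked xbean-4.4 POM, and the exact version should be treated as an assumption:

```xml
<dependency>
  <groupId>org.apache.xbean</groupId>
  <artifactId>xbean-asm5-shaded</artifactId>
  <version>4.4</version>
</dependency>
```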
* [SPARK-9818] Re-enable Docker tests for JDBC data source (Josh Rosen, 2015-11-10; 1 file changed, -0/+14)
This patch re-enables tests for the Docker JDBC data source. These tests were reverted in #4872 due to transitive dependency conflicts introduced by the `docker-client` library. This patch should avoid those problems by using a version of `docker-client` which shades its transitive dependencies and by performing some build magic to work around problems with that shaded JAR. In addition, I significantly refactored the tests to simplify the setup and teardown code and to fix several Docker networking issues which caused problems when running in `boot2docker`. Closes #8101.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #9503 from JoshRosen/docker-jdbc-tests.
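A hedged sketch of pulling in the shaded flavor of `docker-client` via a classifier; the classifier-based selection is the usual mechanism for such shaded artifacts, while the coordinates and version handling here are assumptions:

```xml
<dependency>
  <groupId>com.spotify</groupId>
  <artifactId>docker-client</artifactId>
  <!-- Select the artifact that shades its transitive dependencies -->
  <classifier>shaded</classifier>
  <scope>test</scope>
  <!-- Version assumed to be pinned in <dependencyManagement> -->
</dependency>
```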
* [SPARK-11198][STREAMING][KINESIS] Support de-aggregation of records during recovery (Burak Yavuz, 2015-11-09; 1 file changed, -0/+2)
While the KCL handles de-aggregation during regular operation, during recovery we use the lower-level API, and therefore need to de-aggregate the records ourselves. tdas: testing is an issue; we need protobuf magic to produce the aggregated records. Maybe we could depend on KPL for tests?
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #9403 from brkyvz/kinesis-deaggregation.
* [SPARK-11491] Update build to use Scala 2.10.5 (Josh Rosen, 2015-11-04; 1 file changed, -2/+2)
Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479
Author: Josh Rosen <joshrosen@databricks.com>
Closes #9450 from JoshRosen/upgrade-to-scala-2.10.5.
* [SPARK-10949] Update Snappy version to 1.1.2 (Adam Roberts, 2015-11-04; 1 file changed, -1/+1)
This is an updated version of #8995 by a-roberts. Original description follows:
Snappy now supports concatenation of serialized streams; this patch contains a version number change, and the "does not support" test is now a "supports" test. The Snappy 1.1.2 changelog mentions:
> snappy-java-1.1.2 (22 September 2015)
> This is a backward compatible release for 1.1.x.
> Add AIX (32-bit) support.
> There is no upgrade for the native libraries of the other platforms.
> A major change since 1.1.1 is a support for reading concatenated results of SnappyOutputStream(s)
> snappy-java-1.1.2-RC2 (18 May 2015)
> Fix #107: SnappyOutputStream.close() is not idempotent
> snappy-java-1.1.2-RC1 (13 May 2015)
> SnappyInputStream now supports reading concatenated compressed results of SnappyOutputStream
> There has been no compressed format change since 1.0.5.x. So you can read the compressed results interchangeably between these versions.
> Fixes a problem when java.io.tmpdir does not exist.
Closes #8995.
Author: Adam Roberts <aroberts@uk.ibm.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes #9439 from JoshRosen/update-snappy.
* [SPARK-11413][BUILD] Bump joda-time version to 2.9 for Java 8 and S3 (Yongjia Wang, 2015-11-02; 1 file changed, -1/+1)
It's a known issue that joda-time before 2.8.1 is incompatible with Java 1.8u60 or later, which causes S3 requests to fail. This affects Spark when using S3 as a data source. https://github.com/aws/aws-sdk-java/issues/444
Author: Yongjia Wang <yongjiaw@gmail.com>
Closes #9379 from yongjiaw/SPARK-11413.
* [SPARK-11271][SPARK-11016][CORE] Use Spark BitSet instead of RoaringBitmap to reduce memory usage (Liang-Chi Hsieh, 2015-11-02; 1 file changed, -5/+0)
JIRA: https://issues.apache.org/jira/browse/SPARK-11271
As reported in the JIRA ticket, when there are too many tasks, the memory usage of MapStatus will cause problems. Using BitSet instead of RoaringBitmap should be more efficient in memory usage.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #9243 from viirya/mapstatus-bitset.
* [SPARK-11127][STREAMING] Upgrade AWS SDK and Kinesis Client Library (KCL) (Xiangrui Meng, 2015-10-25; 1 file changed, -2/+2)
AWS SDK 1.9.40 is the latest 1.9.x release. KCL 1.5.1 is the latest release that uses AWS SDK 1.9.x. The main goal is to have the Kinesis consumer be able to read messages generated by the Kinesis Producer Library (KPL). The API should be compatible with old versions. tdas brkyvz
Author: Xiangrui Meng <meng@databricks.com>
Closes #9153 from mengxr/SPARK-11127.
* [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. (Marcelo Vanzin, 2015-10-07; 1 file changed, -2/+22)
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8775 from vanzin/SPARK-10300.
* [SPARK-10889] [STREAMING] Bump KCL to add MillisBehindLatest metric (Avrohom Katz, 2015-10-04; 1 file changed, -1/+1)
I don't believe the API changed at all.
Author: Avrohom Katz <iambpentameter@gmail.com>
Closes #8957 from akatz/kcl-upgrade.
* Revert "[SPARK-10300] [BUILD] [TESTS] Add support for test tags in ↵Marcelo Vanzin2015-09-151-15/+2
| | | | | | run-tests.py." This reverts commit 8abef21dac1a6538c4e4e0140323b83d804d602b.
* [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. (Marcelo Vanzin, 2015-09-15; 1 file changed, -2/+15)
This change does two things:
- Tag a few tests and add the mechanism in the build to be able to disable those tags, both in Maven and sbt, for both JUnit and ScalaTest suites.
- Add some logic to run-tests.py to disable some tags depending on what files have changed; that's used to disable expensive tests when a module hasn't explicitly been changed, to speed up testing for changes that don't directly affect those modules.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8437 from vanzin/test-tags.
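A hedged sketch of how a tag-exclusion knob of this kind typically surfaces on the Maven side; the `test.exclude.tags` property name is an assumption, while Surefire's `excludedGroups` parameter is standard:

```xml
<!-- Assumed property, left empty by default and filled in by run-tests.py -->
<properties>
  <test.exclude.tags></test.exclude.tags>
</properties>

<!-- JUnit side; the ScalaTest plugin has an analogous tagsToExclude setting -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <excludedGroups>${test.exclude.tags}</excludedGroups>
  </configuration>
</plugin>
```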
* Update version to 1.6.0-SNAPSHOT. (Reynold Xin, 2015-09-15; 1 file changed, -1/+1)
Author: Reynold Xin <rxin@databricks.com>
Closes #8350 from rxin/1.6.
* [SPARK-10222] [GRAPHX] [DOCS] More thoroughly deprecate Bagel in favor of GraphX (Sean Owen, 2015-09-13; 1 file changed, -1/+1)
Finish deprecating Bagel; remove reference to nonexistent example.
Author: Sean Owen <sowen@cloudera.com>
Closes #8731 from srowen/SPARK-10222.
* [SPARK-9284] [TESTS] Allow all tests to run without an assembly. (Marcelo Vanzin, 2015-08-28; 1 file changed, -0/+8)
This change aims at speeding up the dev cycle a little bit, by making sure that all tests behave the same w.r.t. where the code to be tested is loaded from. Namely, that means that tests don't rely on the assembly anymore, rather loading all needed classes from the build directories.
The main change is to make sure all build directories (classes and test-classes) are added to the classpath of child processes when running tests. YarnClusterSuite required some custom code since the executors are run differently (i.e. not through the launcher library, like standalone and Mesos do). I also found a couple of tests that could leak a SparkContext on failure, and added code to handle those.
With this patch, it's possible to run the following command from a clean source directory and have all tests pass: mvn -Pyarn -Phadoop-2.4 -Phive-thriftserver install
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #7629 from vanzin/SPARK-9284.
* [SPARK-6196] [BUILD] Remove MapR profiles in favor of hadoop-provided (Sean Owen, 2015-08-25; 1 file changed, -38/+0)
Follow-up to https://github.com/apache/spark/pull/7047. pwendell mentioned that MapR should use `hadoop-provided` now, and indeed the new build script does not produce `mapr3`/`mapr4` artifacts anymore. Hence the action seems to be to remove the profiles, which are now not used. CC trystanleftwich
Author: Sean Owen <sowen@cloudera.com>
Closes #8338 from srowen/SPARK-6196.
* [SPARK-9439] [YARN] External shuffle service robust to NM restarts using leveldb (Imran Rashid, 2015-08-21; 1 file changed, -0/+5)
https://issues.apache.org/jira/browse/SPARK-9439
In general, YARN apps should be robust to NodeManager restarts. However, if you run Spark with the external shuffle service on, after an NM restart all shuffles fail, because the shuffle service has lost some state with info on each executor. (Note the shuffle data is perfectly fine on disk across an NM restart; the problem is we've lost the small bit of state that lets us *find* those files.)
The solution proposed here is that the external shuffle service can write out its state to leveldb (backed by a local file) every time an executor is added. When running with YARN, that file is in the NM's local dir. Whenever the service is started, it looks for that file, and if it exists, it reads the file and re-registers all executors there.
Nothing is changed in non-YARN modes with this patch. The service is not given a place to save the state to, so it operates the same as before. This should make it easy to update other cluster managers as well, by just supplying the right file & the equivalent of YARN's `initializeApplication` -- I'm not familiar enough with those modes to know how to do that.
Author: Imran Rashid <irashid@cloudera.com>
Closes #7943 from squito/leveldb_external_shuffle_service_NM_restart and squashes the following commits: 0d285d3 [Imran Rashid] review feedback 70951d6 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart 5c71c8c [Imran Rashid] save executor to db before registering; style 2499c8c [Imran Rashid] explicit dependency on jackson-annotations 795d28f [Imran Rashid] review feedback 81f80e2 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart 594d520 [Imran Rashid] use json to serialize application executor info 1a7980b [Imran Rashid] version 8267d2a [Imran Rashid] style e9f99e8 [Imran Rashid] cleanup the handling of bad dbs a little 9378ba3 [Imran Rashid] fail gracefully on corrupt leveldb files acedb62 [Imran Rashid] switch to writing out one record per executor 79922b7 [Imran Rashid] rely on yarn to call stopApplication; assorted cleanup 12b6a35 [Imran Rashid] save registered executors when apps are removed; add tests c878fbe [Imran Rashid] better explanation of shuffle service port handling 694934c [Imran Rashid] only open leveldb connection once per service d596410 [Imran Rashid] store executor data in leveldb 59800b7 [Imran Rashid] Files.move in case renaming is unsupported 32fe5ae [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart d7450f0 [Imran Rashid] style f729e2b [Imran Rashid] debugging 4492835 [Imran Rashid] lol, dont use a PrintWriter b/c of scalastyle checks 0a39b98 [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart 55f49fc [Imran Rashid] make sure the service doesnt die if the registered executor file is corrupt; add tests 245db19 [Imran Rashid] style 62586a6 [Imran Rashid] just serialize the whole executors map bdbbf0d [Imran Rashid] comments, remove some unnecessary changes 857331a [Imran Rashid] better tests & comments bb9d1e6 [Imran Rashid] formatting bdc4b32 [Imran Rashid] rename 86e0cb9 [Imran Rashid] for tests, shuffle service finds an open port 23994ff [Imran Rashid] style 7504de8 [Imran Rashid] style a36729c [Imran Rashid] cleanup efb6195 [Imran Rashid] proper unit test, and no longer leak if apps stop during NM restart dd93dc0 [Imran Rashid] test for shuffle service w/ NM restarts d596969 [Imran Rashid] cleanup imports 0e9d69b [Imran Rashid] better names 9eae119 [Imran Rashid] cleanup lots of duplication 1136f44 [Imran Rashid] test needs to have an actual shuffle 0b588bd [Imran Rashid] more fixes ... ad122ef [Imran Rashid] more fixes 5e5a7c3 [Imran Rashid] fix build c69f46b [Imran Rashid] maybe working version, needs tests & cleanup ... bb3ba49 [Imran Rashid] minor cleanup 36127d3 [Imran Rashid] wip b9d2ced [Imran Rashid] incomplete setup for external shuffle service tests
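The pom.xml side of this is a small dependency addition; a hedged sketch using the LevelDB JNI binding commonly used on the JVM (coordinates and version are assumptions, not read from the commit):

```xml
<dependency>
  <groupId>org.fusesource.leveldbjni</groupId>
  <artifactId>leveldbjni-all</artifactId>
  <version>1.8</version>
</dependency>
```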
* [SPARK-9574] [STREAMING] Remove unnecessary contents of spark-streaming-XXX-assembly jars (zsxwing, 2015-08-18; 1 file changed, -1/+1)
Removed contents already included in the Spark assembly jar from the spark-streaming-XXX-assembly jars.
Author: zsxwing <zsxwing@gmail.com>
Closes #8069 from zsxwing/SPARK-9574.
* [SPARK-9974] [BUILD] [SQL] Makes sure com.twitter:parquet-hadoop-bundle:1.6.0 is in SBT assembly jar (Cheng Lian, 2015-08-17; 1 file changed, -1/+1)
PR #7967 enables Spark SQL to persist Parquet tables in Hive-compatible format when possible. One of the consequences is that we have to set the input/output classes to `MapredParquetInputFormat`/`MapredParquetOutputFormat`, which rely on com.twitter:parquet-hadoop:1.6.0 bundled with Hive 1.2.1. When loading such a table in Spark SQL, `o.a.h.h.ql.metadata.Table` first loads these input/output format classes, and thus classes in com.twitter:parquet-hadoop:1.6.0. However, the scope of this dependency is defined as "runtime", and it is not packaged into the Spark assembly jar. This results in a `ClassNotFoundException`.
This issue can be worked around by asking users to add parquet-hadoop 1.6.0 via the `--driver-class-path` option. However, considering that the Maven build is immune to this problem, I feel it can be confusing and inconvenient for users. So this PR fixes the issue by changing the scope of parquet-hadoop 1.6.0 to "compile".
Author: Cheng Lian <lian@databricks.com>
Closes #8198 from liancheng/spark-9974/bundle-parquet-1.6.0.
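A sketch of what the dependency looks like after the scope change; the coordinates and version come from the title, while the surrounding structure is assumed:

```xml
<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-hadoop-bundle</artifactId>
  <version>1.6.0</version>
  <!-- was "runtime", which kept the classes out of the assembly jar -->
  <scope>compile</scope>
</dependency>
```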
* [SPARK-9649] Fix flaky test MasterSuite again - disable REST (Andrew Or, 2015-08-11; 1 file changed, -0/+1)
The REST server is not actually used in most tests, so we can disable it; it is a source of flakiness because it tries to bind to a specific port in vain. There was also some code that avoided the shuffle service in tests. This is actually not necessary because the shuffle service is already off by default.
Author: Andrew Or <andrew@databricks.com>
Closes #8084 from andrewor14/fix-master-suite-again.
* [SPARK-5155] [PYSPARK] [STREAMING] MQTT streaming support in Python (Prabeesh K, 2015-08-10; 1 file changed, -0/+1)
This PR is based on #4229; thanks prabeesh. Closes #4229.
Author: Prabeesh K <prabsmails@gmail.com>
Author: zsxwing <zsxwing@gmail.com>
Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabeesh.k@namshi.com>
Closes #7833 from zsxwing/pr4229 and squashes the following commits: 9570bec [zsxwing] Fix the variable name and check null in finally 4a9c79e [zsxwing] Fix pom.xml indentation abf5f18 [zsxwing] Merge branch 'master' into pr4229 935615c [zsxwing] Fix the flaky MQTT tests 47278c5 [zsxwing] Include the project class files 478f844 [zsxwing] Add unpack 5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests 734db99 [zsxwing] Merge branch 'master' into pr4229 126608a [Prabeesh K] address the comments b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229 d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test a6747cb [Prabeesh K] wait for starting the receiver before publishing data 87fc677 [Prabeesh K] address the comments: 97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt 80474d1 [Prabeesh K] fix 1f0cfe9 [Prabeesh K] python style fix e1ee016 [Prabeesh K] scala style fix a5a8f9f [Prabeesh K] added Python test 9767d82 [Prabeesh K] implemented Python-friendly class a11968b [Prabeesh K] fixed python style 795ec27 [Prabeesh K] address comments ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly 3f4df12 [Prabeesh K] updated version b34c3c1 [prabs] adress comments 3aa7fff [prabs] Added Python streaming mqtt word count example b7d42ff [prabs] Mqtt streaming support in Python
* [SPARK-8064] [BUILD] Follow-up. Undo change from SPARK-9507 that was accidentally reverted (tedyu, 2015-08-04; 1 file changed, -3/+0)
This PR removes the dependency-reduced POM hack brought back by #7191.
Author: tedyu <yuzhihong@gmail.com>
Closes #7919 from tedyu/master and squashes the following commits: 1bfbd7b [tedyu] [BUILD] Remove dependency reduced POM hack
* [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition (Sean Owen, 2015-08-04; 1 file changed, -0/+4)
Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process. I'll explain several of the changes inline in comments.
Author: Sean Owen <sowen@cloudera.com>
Closes #7862 from srowen/SPARK-9534 and squashes the following commits: ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
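A hedged sketch of enabling javac lint warnings through the Maven compiler plugin; the exact lint categories the commit enables are not reproduced here, and the selective suppressions shown are an assumed common pattern:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <compilerArgs>
      <!-- Illustrative: enable all lints, then mute categories that are too noisy -->
      <arg>-Xlint:all,-serial,-path</arg>
    </compilerArgs>
  </configuration>
</plugin>
```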