Commit message | Author | Age | Files | Lines
* [SPARK-11823][SQL] Fix flaky JDBC cancellation test in HiveThriftBinaryServerSuite | Josh Rosen | 2015-12-21 | 1 | -29/+56
  This patch fixes a flaky "test jdbc cancel" test in HiveThriftBinaryServerSuite. This test is prone to a race condition which causes it to block indefinitely while waiting for an extremely slow query to complete, which caused many Jenkins builds to time out. For more background, see my comments on #6207 (the PR which introduced this test). Author: Josh Rosen <joshrosen@databricks.com> Closes #10425 from JoshRosen/SPARK-11823.
* [MINOR] Fix typos in JavaStreamingContext | Shixiong Zhu | 2015-12-21 | 1 | -4/+4
  Author: Shixiong Zhu <shixiong@databricks.com> Closes #10424 from zsxwing/typo.
* [SPARK-11807] Remove support for Hadoop < 2.2 | Reynold Xin | 2015-12-21 | 9 | -62/+9
  i.e. Hadoop 1 and Hadoop 2.0. Author: Reynold Xin <rxin@databricks.com> Closes #10404 from rxin/SPARK-11807.
* [SPARK-12388] Change default compression to lz4 | Davies Liu | 2015-12-21 | 6 | -14/+276
  According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy. After changing the compressor to LZ4, I saw 20% improvement on end-to-end time for a TPCDS query (Q4). [1] https://github.com/ning/jvm-compressor-benchmark/wiki cc rxin Author: Davies Liu <davies@databricks.com> Closes #10342 from davies/lz4.
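  The codec is controlled by an ordinary Spark configuration key; a minimal sketch of pinning it explicitly ("lz4" is the new default after this change, "snappy" restores the old behavior):
  ```scala
  import org.apache.spark.SparkConf

  // Selects the codec used for shuffle outputs, broadcasts, and RDD blocks.
  val conf = new SparkConf()
    .setAppName("codec-demo")
    .set("spark.io.compression.codec", "lz4")
  ```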
* [SPARK-12466] Fix harmless NPE in tests | Andrew Or | 2015-12-21 | 1 | -1/+5
  ``` [info] ReplayListenerSuite: [info] - Simple replay (58 milliseconds) java.lang.NullPointerException at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) ``` https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (but does not actually fail the tests). Tested locally to verify that the NPE is gone. Author: Andrew Or <andrew@databricks.com> Closes #10417 from andrewor14/fix-harmless-npe.
* [SPARK-2331] SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T] | Reynold Xin | 2015-12-21 | 2 | -1/+4
  Author: Reynold Xin <rxin@databricks.com> Closes #10394 from rxin/SPARK-2331.
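  After this change the public signature exposes only the abstract `RDD[T]`; a minimal sketch of the call site:
  ```scala
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  val sc = new SparkContext(new SparkConf().setAppName("empty-rdd").setMaster("local[*]"))
  // Typed as RDD[Int] rather than the internal EmptyRDD[Int] subclass.
  val empty: RDD[Int] = sc.emptyRDD[Int]
  assert(empty.count() == 0)
  ```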
* [SPARK-12339][SPARK-11206][WEBUI] Added a null check that was removed in SPARK-11206 | Alex Bozarth | 2015-12-21 | 1 | -6/+8
  Updates made in SPARK-11206 missed an edge case which causes a NullPointerException when a task is killed. In some cases, when a task ends in failure, taskMetrics is initialized as null (see JobProgressListener.onTaskEnd()). To address this, a null check was added. Before the changes in SPARK-11206 this null check was called at the start of the updateTaskAccumulatorValues() function. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #10405 from ajbozarth/spark12339.
* Doc typo: ltrim = trim from left end, not right | pshearer | 2015-12-21 | 1 | -1/+1
  Author: pshearer <pshearer@massmutual.com> Closes #10414 from pshearer/patch-1.
* [SPARK-5882][GRAPHX] Add a test for GraphLoader.edgeListFile | Takeshi YAMAMURO | 2015-12-21 | 1 | -0/+47
  Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #4674 from maropu/AddGraphLoaderSuite.
* [SPARK-12392][CORE] Optimize the location order of broadcast blocks by considering preferred local hosts | Takeshi YAMAMURO | 2015-12-21 | 2 | -2/+29
  When multiple workers exist on a host, we can bypass unnecessary remote access for broadcasts; block managers fetch broadcast blocks from the same host instead of remote hosts. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #10346 from maropu/OptimizeBlockLocationOrder.
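  A minimal sketch of the ordering idea (the helper name is hypothetical, not the exact BlockManager code): replicas on this host are tried before shuffled remote ones.
  ```scala
  import scala.util.Random
  import org.apache.spark.storage.BlockManagerId

  // Hypothetical helper: same-host locations first, then remote ones in random order.
  def sortLocations(locations: Seq[BlockManagerId], self: BlockManagerId): Seq[BlockManagerId] = {
    val (local, remote) = Random.shuffle(locations).partition(_.host == self.host)
    local ++ remote
  }
  ```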
* [SPARK-12374][SPARK-12150][SQL] Adding logical/physical operators for Range | gatorsmile | 2015-12-21 | 7 | -8/+119
  Based on the suggestions from marmbrus, added logical/physical operators for Range to improve performance. Also added another API for resolving the JIRA SPARK-12150. Could you take a look at my implementation, marmbrus? If it is not good, I can rework it. :) Thank you very much! Author: gatorsmile <gatorsmile@gmail.com> Closes #10335 from gatorsmile/rangeOperators.
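  For context, `range` is exposed on `SQLContext` and produces a single-column DataFrame of longs; a minimal usage sketch, assuming an in-scope `sqlContext`:
  ```scala
  // range(start, end) yields a DataFrame with one LongType column named "id";
  // with dedicated logical/physical operators it no longer needs to go through
  // a parallelized collection.
  val df = sqlContext.range(0L, 1000L)
  df.filter("id % 100 = 0").show()
  ```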
* [SPARK-12321][SQL] JSON format for TreeNode (use reflection) | Wenchen Fan | 2015-12-21 | 13 | -75/+472
  An alternative solution for https://github.com/apache/spark/pull/10295, instead of implementing JSON format for all logical/physical plans and expressions, use reflection to implement it in `TreeNode`. Here I use pre-order traversal to flatten a plan tree to a plan list, and add an extra field `num-children` to each plan node, so that we can reconstruct the tree from the list. Example JSON for a logical plan tree: ``` [ { "class" : "org.apache.spark.sql.catalyst.plans.logical.Sort", "num-children" : 1, "order" : [ [ { "class" : "org.apache.spark.sql.catalyst.expressions.SortOrder", "num-children" : 1, "child" : 0, "direction" : "Ascending" }, { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", "num-children" : 0, "name" : "i", "dataType" : "integer", "nullable" : true, "metadata" : { }, "exprId" : { "id" : 10, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" }, "qualifiers" : [ ] } ] ], "global" : false, "child" : 0 }, { "class" : "org.apache.spark.sql.catalyst.plans.logical.Project", "num-children" : 1, "projectList" : [ [ { "class" : "org.apache.spark.sql.catalyst.expressions.Alias", "num-children" : 1, "child" : 0, "name" : "i", "exprId" : { "id" : 10, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" }, "qualifiers" : [ ] }, { "class" : "org.apache.spark.sql.catalyst.expressions.Add", "num-children" : 2, "left" : 0, "right" : 1 }, { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", "num-children" : 0, "name" : "a", "dataType" : "integer", "nullable" : true, "metadata" : { }, "exprId" : { "id" : 0, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" }, "qualifiers" : [ ] }, { "class" : "org.apache.spark.sql.catalyst.expressions.Literal", "num-children" : 0, "value" : "1", "dataType" : "integer" } ], [ { "class" : "org.apache.spark.sql.catalyst.expressions.Alias", "num-children" : 1, "child" : 0, "name" : "j", "exprId" : { "id" : 11, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" }, "qualifiers" : [ ] }, { "class" : "org.apache.spark.sql.catalyst.expressions.Multiply", "num-children" : 2, "left" : 0, "right" : 1 }, { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", "num-children" : 0, "name" : "a", "dataType" : "integer", "nullable" : true, "metadata" : { }, "exprId" : { "id" : 0, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" }, "qualifiers" : [ ] }, { "class" : "org.apache.spark.sql.catalyst.expressions.Literal", "num-children" : 0, "value" : "2", "dataType" : "integer" } ] ], "child" : 0 }, { "class" : "org.apache.spark.sql.catalyst.plans.logical.LocalRelation", "num-children" : 0, "output" : [ [ { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", "num-children" : 0, "name" : "a", "dataType" : "integer", "nullable" : true, "metadata" : { }, "exprId" : { "id" : 0, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" }, "qualifiers" : [ ] } ] ], "data" : [ ] } ] ``` Author: Wenchen Fan <wenchen@databricks.com> Closes #10311 from cloud-fan/toJson-reflection.
* [SPARK-12398] Smart truncation of DataFrame / Dataset toString | Dilip Biswal | 2015-12-21 | 4 | -1/+73
  When a DataFrame or Dataset has a long schema, we should intelligently truncate to avoid flooding the screen with unreadable information. // Standard output [a: int, b: int] // Truncate many top level fields [a: int, b: string ... 10 more fields] // Truncate long inner structs [a: struct<a: Int ... 10 more fields>] Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10373 from dilipbiswal/spark-12398.
* [PYSPARK] Pyspark typo & Add missing abstractmethod annotation | Jeff Zhang | 2015-12-21 | 2 | -2/+3
  No JIRA is created since this is a trivial change. davies Please help review it. Author: Jeff Zhang <zjffdu@apache.org> Closes #10143 from zjffdu/pyspark_typo.
* [SPARK-12349][ML] Make spark.ml PCAModel load backwards compatible | Sean Owen | 2015-12-21 | 1 | -5/+28
  Only load explainedVariance in PCAModel if it was written with Spark 1.6+. jkbradley, is this kind of what you had in mind? Author: Sean Owen <sowen@cloudera.com> Closes #10327 from srowen/SPARK-12349.
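  A rough sketch of the version gate (regex and metadata field are assumed here, not the exact MLlib reader code):
  ```scala
  // Hypothetical check: explainedVariance only exists in metadata written by 1.6+.
  val VersionRegex = """^(\d+)\.(\d+).*""".r
  def hasExplainedVariance(sparkVersion: String): Boolean = sparkVersion match {
    case VersionRegex(major, minor) =>
      major.toInt > 1 || (major.toInt == 1 && minor.toInt >= 6)
    case _ => false
  }
  ```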
* [SPARK-10158][PYSPARK][MLLIB] ALS better error message when using Long IDs | Bryan Cutler | 2015-12-20 | 2 | -1/+28
  Added a catch for the cast exception raised when PySpark ALS Ratings with Long IDs are serialized. It is easy to accidentally use Long IDs for user/product, and before, it would fail with a somewhat cryptic "ClassCastException: java.lang.Long cannot be cast to java.lang.Integer." Now if this is done, a more descriptive error is shown, e.g. "PickleException: Ratings id 1205640308657491975 exceeds max integer value of 2147483647." Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9361 from BryanCutler/als-pyspark-long-id-error-SPARK-10158.
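  A minimal sketch of the guard on the JVM side (method name hypothetical; the real check lives in the Ratings pickler and raises PickleException):
  ```scala
  // Hypothetical guard: fail fast with a descriptive message instead of a raw ClassCastException.
  def toIntId(id: Long): Int = {
    if (id > Int.MaxValue) {
      throw new IllegalArgumentException(
        s"Ratings id $id exceeds max integer value of ${Int.MaxValue}")
    }
    id.toInt
  }
  ```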
* [SPARK-11808] Remove Bagel. | Reynold Xin | 2015-12-19 | 15 | -613/+9
  Author: Reynold Xin <rxin@databricks.com> Closes #10395 from rxin/SPARK-11808.
* HOTFIX for the previous hot fix. | Reynold Xin | 2015-12-19 | 1 | -0/+1
* HOTFIX: Disable Java style test. | Reynold Xin | 2015-12-19 | 1 | -1/+1
* Bump master version to 2.0.0-SNAPSHOT. | Reynold Xin | 2015-12-19 | 40 | -39/+176
  Author: Reynold Xin <rxin@databricks.com> Closes #10387 from rxin/version-bump.
* [SQL] Fix mistaken doc of join type for dataframe.join | Yanbo Liang | 2015-12-19 | 1 | -1/+1
  Fix the mistaken doc of join type for ```dataframe.join```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10378 from yanboliang/leftsemi.
* [SPARK-12091] [PYSPARK] Deprecate the JAVA-specific deserialized storage levels | gatorsmile | 2015-12-18 | 10 | -31/+45
  The current default storage level of the Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official document and RDD APIs. davies Is this inconsistency intentional? Thanks! Updates: Since the data is always serialized on the Python side, the storage levels of JAVA-specific deserialization are not removed, such as MEMORY_ONLY. Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`. Author: gatorsmile <gatorsmile@gmail.com> Closes #10092 from gatorsmile/persistStorageLevel.
* Revert "[SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode." | Andrew Or | 2015-12-18 | 2 | -7/+2
  This reverts commit ad8c1f0b840284d05da737fb2cc5ebf8848f4490.
* Revert "[SPARK-12345][MESOS] Properly filter out SPARK_HOME in the Mesos REST server" | Andrew Or | 2015-12-18 | 1 | -1/+1
  This reverts commit 8184568810e8a2e7d5371db2c6a0366ef4841f70.
* Revert "[SPARK-12413] Fix Mesos ZK persistence" | Andrew Or | 2015-12-18 | 1 | -5/+1
  This reverts commit 2bebaa39d9da33bc93ef682959cd42c1968a6a3e.
* [SPARK-12345][CORE] Do not send SPARK_HOME through Spark submit REST interface | Luc Bourlier | 2015-12-18 | 1 | -2/+4
  It is usually an invalid location on the remote machine executing the job. It is picked up by the Mesos support in cluster mode, and most of the time causes the job to fail. Fixes SPARK-12345. Author: Luc Bourlier <luc.bourlier@typesafe.com> Closes #10329 from skyluc/issue/SPARK_HOME.
* [SPARK-11097][CORE] Add channelActive callback to RpcHandler to monitor new connections | Shixiong Zhu | 2015-12-18 | 11 | -92/+148
  Added `channelActive` to `RpcHandler` so that `NettyRpcHandler` doesn't need `clients` any more. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10301 from zsxwing/network-events.
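  A rough Scala sketch of the callback shape (the real `RpcHandler` lives in Spark's network-common Java module; the signatures here are assumed, not verbatim):
  ```scala
  import java.nio.ByteBuffer

  // Mirrors the idea: handlers can now react to connections as they are
  // established, instead of keeping their own registry of clients.
  trait ConnectionEvents {
    def receive(remoteAddress: String, message: ByteBuffer): Unit
    def channelActive(remoteAddress: String): Unit   // fired once per new connection
    def channelInactive(remoteAddress: String): Unit // fired when the connection closes
  }
  ```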
* [SPARK-12411][CORE] Decrease executor heartbeat timeout to match heartbeat interval | Nong Li | 2015-12-18 | 1 | -1/+3
  Previously, the rpc timeout was the default network timeout, which is the same value the driver uses to determine dead executors. This means if there is a network issue, the executor is determined dead after one heartbeat attempt. There is a separate config for the heartbeat interval which is a better value to use for the heartbeat RPC. With this change, the executor will make multiple heartbeat attempts even with RPC issues. Author: Nong Li <nong@databricks.com> Closes #10365 from nongli/spark-12411.
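  Both knobs are ordinary Spark configuration keys; a minimal sketch of how they relate (values illustrative):
  ```scala
  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // Interval between executor -> driver heartbeats; after this change it also
    // bounds the timeout of each individual heartbeat RPC.
    .set("spark.executor.heartbeatInterval", "10s")
    // Overall timeout after which the driver declares an executor dead; with a
    // 10s interval the executor gets many attempts before that happens.
    .set("spark.network.timeout", "120s")
  ```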
* [SPARK-9552] Return "false" when there is nothing to kill in killExecutors | Grace | 2015-12-18 | 3 | -17/+24
  In discussion (SPARK-9552), we proposed a force kill in `killExecutors`. But if there is nothing to kill, it will return true (acknowledgement), and then the executor(s) that are not eligible to kill get added to the pendingToRemove list for further actions. In this patch, we'd like to change the return semantics: if there is nothing to kill, we will return "false", and therefore all those non-eligible executors won't be added to the pendingToRemove list. vanzin andrewor14 As the follow-up of PR #7888, please let me know your comments. Author: Grace <jie.huang@intel.com> Author: Jie Huang <hjie@fosun.com> Author: Andrew Or <andrew@databricks.com> Closes #9796 from GraceH/emptyPendingToRemove.
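  A minimal sketch of the changed semantics (the helper names below are hypothetical stand-ins, not Spark's actual internals):
  ```scala
  // Stubs for illustration only.
  def isEligibleToKill(id: String): Boolean = !id.startsWith("busy")
  def doKill(id: String): Unit = println(s"killing $id")

  // Acknowledge (true) only when something was actually killed.
  def killExecutors(executorIds: Seq[String]): Boolean = {
    val killable = executorIds.filter(isEligibleToKill)
    if (killable.isEmpty) {
      false // nothing to kill: do NOT acknowledge, so nothing lands in pendingToRemove
    } else {
      killable.foreach(doKill)
      true
    }
  }
  ```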
* [SPARK-11985][STREAMING][KINESIS][DOCS] Update Kinesis docs | Burak Yavuz | 2015-12-18 | 1 | -9/+45
  - Provide example on `message handler` - Provide bit on KPL record de-aggregation - Fix typos Author: Burak Yavuz <brkyvz@gmail.com> Closes #9970 from brkyvz/kinesis-docs.
* [SPARK-12404][SQL] Ensure objects passed to StaticInvoke are Serializable | Kousuke Saruta | 2015-12-18 | 6 | -26/+88
  Now `StaticInvoke` receives `Any` as an object and `StaticInvoke` can be serialized, but sometimes the object passed is not serializable. For example, the following code raises an exception because `RowEncoder#extractorsFor`, invoked indirectly, makes a `StaticInvoke`. ``` case class TimestampContainer(timestamp: java.sql.Timestamp) val rdd = sc.parallelize(1 to 2).map(_ => TimestampContainer(System.currentTimeMillis)) val df = rdd.toDF val ds = df.as[TimestampContainer] val rdd2 = ds.rdd <----------------- invokes extractorsFor indirectly ``` I'll add test cases. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Author: Michael Armbrust <michael@databricks.com> Closes #10357 from sarutak/SPARK-12404.
* [SPARK-12218][SQL] Invalid splitting of nested AND expressions in Data Source filter API | Yin Huai | 2015-12-18 | 4 | -13/+60
  JIRA: https://issues.apache.org/jira/browse/SPARK-12218 When creating filters for Parquet/ORC, we should not push nested AND expressions partially. Author: Yin Huai <yhuai@databricks.com> Closes #10362 from yhuai/SPARK-12218.
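  A minimal sketch of the rule with simplified types (the real translation lives in the data source strategy): an `And` may be pushed down only when both children translate, because translating a nested `And` partially can make the pushed predicate disagree with the original expression.
  ```scala
  // Simplified mirror of source-filter translation; names and types illustrative.
  sealed trait Pred
  case class And(l: Pred, r: Pred) extends Pred
  case class Leaf(name: String, supported: Boolean) extends Pred

  sealed trait PushedFilter
  case class PushedAnd(l: PushedFilter, r: PushedFilter) extends PushedFilter
  case class PushedLeaf(name: String) extends PushedFilter

  def translate(p: Pred): Option[PushedFilter] = p match {
    // Both conjuncts must translate, otherwise drop the whole And.
    case And(l, r) =>
      for { lt <- translate(l); rt <- translate(r) } yield PushedAnd(lt, rt)
    case Leaf(n, true)  => Some(PushedLeaf(n))
    case Leaf(_, false) => None
  }
  ```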
* [SPARK-12054] [SQL] Consider nullability of expression in codegen | Davies Liu | 2015-12-18 | 27 | -226/+261
  This could simplify the generated code for expressions that are not nullable. This PR fixes lots of bugs around nullability. Author: Davies Liu <davies@databricks.com> Closes #10333 from davies/skip_nullable.
* [SPARK-11619][SQL] cannot use UDTF in DataFrame.selectExpr | Dilip Biswal | 2015-12-18 | 7 | -14/+31
  Description of the problem from cloud-fan: Actually this line: https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L689 When we use `selectExpr`, we pass in `UnresolvedFunction` to `DataFrame.select` and fall into the last case. A workaround is to do special handling for UDTF like we did for `explode` (and `json_tuple` in 1.6), wrapping it with `MultiAlias`. Another workaround is using `expr`, for example, `df.select(expr("explode(a)").as(Nil))`. I think `selectExpr` is no longer needed after we have the `expr` function. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #9981 from dilipbiswal/spark-11619.
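  A short usage sketch, assuming a DataFrame `df` with an array column `a`: the `expr` workaround quoted above versus the now-working `selectExpr` call.
  ```scala
  import org.apache.spark.sql.functions.expr

  // Workaround quoted in the message: wrap the UDTF with expr and an empty alias list.
  val exploded1 = df.select(expr("explode(a)").as(Nil))
  // After the fix, the UDTF can be used directly in selectExpr.
  val exploded2 = df.selectExpr("explode(a)")
  ```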
* [SPARK-12350][CORE] Don't log errors when requested stream is not found. | Marcelo Vanzin | 2015-12-18 | 5 | -14/+39
  If a client requests a non-existent stream, just send a failure message back, without logging any error on the server side (since it's not a server error). On the executor side, avoid error logs by translating any errors during transfer to a `ClassNotFoundException`, so that loading the class is retried on the parent class loader. This can mask IO errors during transmission, but the most common cause is that the class is not served by the remote end. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10337 from vanzin/SPARK-12350.
* [SPARK-9057][STREAMING] Twitter example joining to static RDD of word sentiment values | Jeff L | 2015-12-18 | 4 | -0/+2830
  Example of joining a static RDD of word sentiments to a streaming RDD of Tweets in order to demo the usage of the transform() method. Author: Jeff L <sha0lin@alumni.carnegiemellon.edu> Closes #8431 from Agent007/SPARK-9057.
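  The `transform()` pattern the example demonstrates, as a minimal sketch (the socket source and sentiment table here are illustrative stand-ins for the Twitter stream and the real word list):
  ```scala
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(
    new SparkConf().setAppName("sentiment").setMaster("local[2]"), Seconds(1))

  // Static lookup table, built once on the driver.
  val wordSentiments = ssc.sparkContext.parallelize(Seq(("good", 1.0), ("bad", -1.0)))

  // transform() exposes each micro-batch as an RDD, so ordinary RDD joins apply.
  val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
  val scored = words.transform { batch =>
    batch.map(w => (w, 1)).join(wordSentiments).map { case (w, (n, s)) => (w, n * s) }
  }
  scored.print()
  ```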
* [SPARK-12413] Fix Mesos ZK persistence | Michael Gummelt | 2015-12-18 | 1 | -1/+5
  I believe this fixes SPARK-12413. I'm currently running an integration test to verify. Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #10366 from mgummelt/fix-zk-mesos.
* [CORE][TESTS] minor fix of JavaSerializerSuite | Jeff Zhang | 2015-12-18 | 1 | -2/+7
  No JIRA is created. The original test passed because the class cast is lazy (it only fails when the object's method is invoked). Author: Jeff Zhang <zjffdu@apache.org> Closes #10371 from zjffdu/minor_fix.
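  A small illustration of why such a cast can pass silently: with generic types, erasure defers the ClassCastException until an element is actually used.
  ```scala
  val xs: Any = Seq("a", "b")
  // Succeeds at runtime: the element type is erased, so nothing is checked here.
  val ys = xs.asInstanceOf[Seq[Int]]
  // Only an access like this throws ClassCastException:
  // val n = ys.head + 1
  ```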
* [MINOR] Hide the error logs for 'SQLListenerMemoryLeakSuite' | Shixiong Zhu | 2015-12-17 | 1 | -29/+35
  Hide the error logs for 'SQLListenerMemoryLeakSuite' to avoid noise. Most of the changes are whitespace changes. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10363 from zsxwing/hide-log.
* [SPARK-11749][STREAMING] Fix duplicate RDD creation in file stream when recovering from checkpoint data | jhu-chang | 2015-12-17 | 2 | -9/+62
  Add a transient flag `DStream.restoredFromCheckpointData` to control the restore processing in DStream and avoid duplicate work: check this flag first in `DStream.restoreCheckpointData`; only when it is `false` is the restore process executed. Author: jhu-chang <gt.hu.chang@gmail.com> Closes #9765 from jhu-chang/SPARK-11749.
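  A minimal sketch of the guard described above (field name taken from the message; the surrounding class is simplified, not the actual DStream code):
  ```scala
  class CheckpointedStream {
    @transient private var restoredFromCheckpointData = false

    def restoreCheckpointData(): Unit = {
      if (!restoredFromCheckpointData) {
        // ... recreate RDDs from checkpointed metadata, exactly once ...
        restoredFromCheckpointData = true
      }
    }
  }
  ```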
* [SPARK-8641][SQL] Native Spark Window functions | Herman van Hovell | 2015-12-17 | 15 | -746/+1148
  This PR removes Hive window functions from Spark and replaces them with (native) Spark ones. The PR is on par with Hive in terms of features. This has the following advantages: * Better memory management. * The ability to use Spark UDAFs in Window functions. cc rxin / yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9819 from hvanhovell/SPARK-8641-2.
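  For reference, the user-facing window API that the native implementation now backs; a minimal usage sketch assuming a DataFrame `df` with `dept` and `salary` columns:
  ```scala
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{desc, rank}

  val byDept = Window.partitionBy("dept").orderBy(desc("salary"))
  // rank() is evaluated by the (now native) window physical operator.
  val ranked = df.withColumn("salary_rank", rank().over(byDept))
  ```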
* [SPARK-12376][TESTS] Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method | Evan Chen | 2015-12-17 | 1 | -3/+8
  org.apache.spark.streaming.Java8APISuite.java is failing due to trying to sort an immutable list in the assertOrderInvariantEquals method. Author: Evan Chen <chene@us.ibm.com> Closes #10336 from evanyc15/SPARK-12376-StreamingJavaAPISuite.
* [SPARK-12397][SQL] Improve error messages for data sources when they are not found | Reynold Xin | 2015-12-17 | 2 | -18/+49
  Point users to spark-packages.org to find them. Author: Reynold Xin <rxin@databricks.com> Closes #10351 from rxin/SPARK-12397.
* [SPARK-12410][STREAMING] Fix places that use '.' and '|' directly in split | Shixiong Zhu | 2015-12-17 | 2 | -2/+2
  String.split accepts a regular expression, so we should escape "." and "|". Author: Shixiong Zhu <shixiong@databricks.com> Closes #10361 from zsxwing/reg-bug.
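  A quick demonstration of the bug class, since `split`'s regex semantics are easy to trip over:
  ```scala
  "1.2.3".split(".")    // Array(): "." matches every character, leaving only empty tokens
  "1.2.3".split("\\.")  // Array("1", "2", "3"): the escaped form splits on literal dots
  "a|b".split("\\|")    // Array("a", "b"): the same escaping is needed for "|"
  ```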
* [SPARK-12345][MESOS] Properly filter out SPARK_HOME in the Mesos REST server | Iulian Dragos | 2015-12-18 | 1 | -1/+1
  Fix for the problem with #10332; this one should fix cluster mode on Mesos. Author: Iulian Dragos <jaguarul@gmail.com> Closes #10359 from dragos/issue/fix-spark-12345-one-more-time.
* [SPARK-12220][CORE] Make Utils.fetchFile support files that contain special characters | Shixiong Zhu | 2015-12-17 | 5 | -6/+46
  This PR encodes and decodes the file name to fix the issue. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10208 from zsxwing/uri.
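  One way to percent-encode a bare file name for use in a URI path, as a sketch of the encode/decode idea (the helper name is hypothetical, not necessarily the exact Utils method):
  ```scala
  import java.net.URI

  // Hypothetical helper: let java.net.URI percent-encode one path segment.
  def encodeFileNameToURIRawPath(name: String): String =
    new URI(null, null, "/" + name, null).getRawPath.stripPrefix("/")

  encodeFileNameToURIRawPath("a b#c.jar")  // "a%20b%23c.jar"
  ```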
* [SQL] Update SQLContext.read.text doc | Yanbo Liang | 2015-12-17 | 3 | -3/+3
  Since we renamed the column of a DataFrame loaded by ```SQLContext.read.text``` from ```text``` to ```value```, we need to update the doc. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10349 from yanboliang/text-value.
* [SPARK-12395] [SQL] Fix resulting columns of outer join | Davies Liu | 2015-12-17 | 2 | -9/+36
  For the API DataFrame.join(right, usingColumns, joinType), if the joinType is right_outer or full_outer, the resulting join columns could be wrong (they would be null). The order of columns has been changed to match that of MySQL and PostgreSQL [1]. This PR also fixes the nullability of output for outer join. [1] http://www.postgresql.org/docs/9.2/static/queries-table-expressions.html Author: Davies Liu <davies@databricks.com> Closes #10353 from davies/fix_join.
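  A small illustration of the affected API, assuming a `sqlContext` with its implicits imported:
  ```scala
  import sqlContext.implicits._

  val left  = Seq((1, "a"), (2, "b")).toDF("k", "v1")
  val right = Seq((2, "x"), (3, "y")).toDF("k", "v2")

  // Before the fix, the shared "k" column could come back null for rows that
  // exist only on one side of a right_outer/full_outer join; afterwards it
  // behaves like coalesce(left.k, right.k).
  left.join(right, Seq("k"), "full_outer").show()
  ```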
* Revert "Once the driver registers successfully, stop it from reconnecting to the master." | Davies Liu | 2015-12-17 | 1 | -1/+0
  This reverts commit 5a514b61bbfb609c505d8d65f2483068a56f1f70.
* Once the driver registers successfully, stop it from reconnecting to the master. | echo2mei | 2015-12-17 | 1 | -0/+1
  This commit is to resolve SPARK-12396. Author: echo2mei <534384876@qq.com> Closes #10354 from echoTomei/master.