spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	[SPARK-10359] Enumerate dependencies in a file and diff against it for new ↵	Josh Rosen	2015-12-30	10	-120/+512
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pull requests This patch adds a new build check which enumerates Spark's resolved runtime classpath and saves it to a file, then diffs against that file to detect whether pull requests have introduced dependency changes. The aim of this check is to make it simpler to reason about whether pull request which modify the build have introduced new dependencies or changed transitive dependencies in a way that affects the final classpath. This supplants the checks added in SPARK-4123 / #5093, which are currently disabled due to bugs. This patch is based on pwendell's work in #8531. Closes #8531. Author: Josh Rosen <joshrosen@databricks.com> Author: Patrick Wendell <patrick@databricks.com> Closes #10461 from JoshRosen/SPARK-10359.
*	[SPARK-12300] [SQL] [PYSPARK] fix schema inferance on local collections	Holden Karau	2015-12-30	2	-7/+14
\| \| \| \| \| \| \| \|	Current schema inference for local python collections halts as soon as there are no NullTypes. This is different than when we specify a sampling ratio of 1.0 on a distributed collection. This could result in incomplete schema information. Author: Holden Karau <holden@us.ibm.com> Closes #10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.
*	[SPARK-12495][SQL] use true as default value for propagateNull in NewInstance	Wenchen Fan	2015-12-30	7	-37/+38
\| \| \| \| \| \| \| \| \| \|	Most of cases we should propagate null when call `NewInstance`, and so far there is only one case we should stop null propagation: create product/java bean. So I think it makes more sense to propagate null by dafault. This also fixes a bug when encode null array/map, which is firstly discovered in https://github.com/apache/spark/pull/10401 Author: Wenchen Fan <wenchen@databricks.com> Closes #10443 from cloud-fan/encoder.
*	[SPARK-12263][DOCS] IllegalStateException: Memory can't be 0 for ↵	Neelesh Srinivas Salian	2015-12-30	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	SPARK_WORKER_MEMORY without unit Updated the Worker Unit IllegalStateException message to indicate no values less than 1MB instead of 0 to help solve this. Requesting review Author: Neelesh Srinivas Salian <nsalian@cloudera.com> Closes #10483 from nssalian/SPARK-12263.
*	Revert "[SPARK-12362][SQL][WIP] Inline Hive Parser"	Reynold Xin	2015-12-30	18	-5402/+72
\| \| \| \|	This reverts commit b600bccf41a7b1958e33d8301a19214e6517e388 due to non-deterministic build breaks.
*	[SPARK-12564][SQL] Improve missing column AnalysisException	gatorsmile	2015-12-29	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	``` org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input columns text; ``` lets put a `:` after `columns` and put the columns in `[]` so that they match the toString of DataFrame. Author: gatorsmile <gatorsmile@gmail.com> Closes #10518 from gatorsmile/improveAnalysisExceptionMsg.
*	[SPARK-12490][CORE] Limit the css style scope to fix the Streaming UI	Shixiong Zhu	2015-12-29	3	-3/+5
\| \| \| \| \| \| \| \| \| \| \| \|	#10441 broke the Streaming UI because of the new CSS style. <img width="503" alt="screen shot 2015-12-29 at 4 49 04 pm" src="https://cloud.githubusercontent.com/assets/1000778/12044763/1efce0fe-ae4c-11e5-9f8b-39df08426bf8.png"> This PR just added a class for the new style and only applied them to the paged tables. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10517 from zsxwing/fix-streaming-ui.
*	[SPARK-12362][SQL][WIP] Inline Hive Parser	Nong Li	2015-12-29	18	-72/+5402
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a WIP. The PR has been taken over from nongli (see https://github.com/apache/spark/pull/10420). I have removed some additional dead code, and fixed a few issues which were caused by the fact that the inlined Hive parser is newer than the Hive parser we currently use in Spark. I am submitting this PR in order to get some feedback and testing done. There is quite a bit of work to do: - [ ] Get it to pass jenkins build/test. - [ ] Aknowledge Hive-project for using their parser. - [ ] Refactorings between HiveQl and the java classes. - [ ] Create our own ASTNode and integrate the current implicit extentions. - [ ] Move remaining ```SemanticAnalyzer``` and ```ParseUtils``` functionality to ```HiveQl```. - [ ] Removing Hive dependencies from the parser. This will require some edits in the grammar files. - [ ] Introduce our own context which needs to contain a ```TokenRewriteStream```. - [ ] Add ```useSQL11ReservedKeywordsForIdentifier``` and ```allowQuotedId``` to the catalyst or sql configuration. - [ ] Remove ```HiveConf``` from grammar files &HiveQl, and pass in our own configuration. - [ ] Moving the parser into sql/core. cc nongli rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Author: Nong Li <nong@databricks.com> Author: Nong Li <nongli@gmail.com> Closes #10509 from hvanhovell/SPARK-12362.
*	[SPARK-12549][SQL] Take Option[Seq[DataType]] in UDF input type specification.	Reynold Xin	2015-12-29	5	-68/+75
\| \| \| \| \| \| \| \|	In Spark we allow UDFs to declare its expected input types in order to apply type coercion. The expected input type parameter takes a Seq[DataType] and uses Nil when no type coercion is applied. It makes more sense to take Option[Seq[DataType]] instead, so we can differentiate a no-arg function vs function with no expected input type specified. Author: Reynold Xin <rxin@databricks.com> Closes #10504 from rxin/SPARK-12549.
*	[SPARK-12349][SPARK-12349][ML] Fix typo in Spark version regex introduced in ↵	Sean Owen	2015-12-29	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	/ PR 10327 Sorry jkbradley Ref: https://github.com/apache/spark/pull/10327#discussion_r48502942 Author: Sean Owen <sowen@cloudera.com> Closes #10508 from srowen/SPARK-12349.2.
*	[SPARK-11199][SPARKR] Improve R context management story and add getOrCreate	Hossein	2015-12-29	2	-1/+5
\| \| \| \| \| \| \| \| \| \| \|	* Changes api.r.SQLUtils to use ```SQLContext.getOrCreate``` instead of creating a new context. * Adds a simple test [SPARK-11199] #comment link with JIRA Author: Hossein <hossein@databricks.com> Closes #9185 from falaki/SPARK-11199.
*	[SPARK-12530][BUILD] Fix build break at Spark-Master-Maven-Snapshots from #1293	Kazuaki Ishizaki	2015-12-29	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \|	Compilation error caused due to string concatenations that are not a constant Use raw string literal to avoid string concatenations https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/ Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #10488 from kiszk/SPARK-12530.
*	[SPARK-12526][SPARKR] ifelse`, `when`, `otherwise` unable to take Column as ↵	Forest Fang	2015-12-29	3	-7/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	value `ifelse`, `when`, `otherwise` is unable to take `Column` typed S4 object as values. For example: ```r ifelse(lit(1) == lit(1), lit(2), lit(3)) ifelse(df$mpg > 0, df$mpg, 0) ``` will both fail with ```r attempt to replicate an object of type 'environment' ``` The PR replaces `ifelse` calls with `if ... else ...` inside the function implementations to avoid attempt to vectorize(i.e. `rep()`). It remains to be discussed whether we should instead support vectorization in these functions for consistency because `ifelse` in base R is vectorized but I cannot foresee any scenarios these functions will want to be vectorized in SparkR. For reference, added test cases which trigger failures: ```r . Error: when(), otherwise() and ifelse() with column on a DataFrame ---------- error in evaluating the argument 'x' in selecting a method for function 'collect': error in evaluating the argument 'col' in selecting a method for function 'select': attempt to replicate an object of type 'environment' Calls: when -> when -> ifelse -> ifelse 1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage")) 2: eval(code, new_test_environment) 3: eval(expr, envir, enclos) 4: expect_equal(collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))[, 1], c(NA, 1)) at test_sparkSQL.R:1126 5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label) 6: condition(object) 7: compare(actual, expected, ...) 8: collect(select(df, when(df$a > 1 & df$b > 2, lit(1)))) Error: Test failures Execution halted ``` Author: Forest Fang <forest.fang@outlook.com> Closes #10481 from saurfang/spark-12526.
*	[SPARK-11394][SQL] Throw IllegalArgumentException for unsupported types in ↵	Takeshi YAMAMURO	2015-12-28	2	-0/+5
\| \| \| \| \| \| \| \| \| \| \|	postgresql If DataFrame has BYTE types, throws an exception: org.postgresql.util.PSQLException: ERROR: type "byte" does not exist Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #9350 from maropu/FixBugInPostgreJdbc.
*	[SPARK-12547][SQL] Tighten scala style checker enforcement for UDF registration	Reynold Xin	2015-12-28	2	-29/+30
\| \| \| \| \| \| \| \| \| \|	We use scalastyle:off to turn off style checks in certain places where it is not possible to follow the style guide. This is usually ok. However, in udf registration, we disable the checker for a large amount of code simply because some of them exceed 100 char line limit. It is better to just disable the line limit check rather than everything. In this pull request, I only disabled line length check, and fixed a problem (lack explicit types for public methods). Author: Reynold Xin <rxin@databricks.com> Closes #10501 from rxin/SPARK-12547.
*	[SPARK-12522][SQL][MINOR] Add the missing document strings for the SQL ↵	gatorsmile	2015-12-28	3	-8/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	configuration Fixing the missing the document for the configuration. We can see the missing messages "TODO" when issuing the command "SET -V". ``` spark.sql.columnNameOfCorruptRecord spark.sql.hive.verifyPartitionPath spark.sql.sources.parallelPartitionDiscovery.threshold spark.sql.hive.convertMetastoreParquet.mergeSchema spark.sql.hive.convertCTAS spark.sql.hive.thriftServer.async ``` Author: gatorsmile <gatorsmile@gmail.com> Closes #10471 from gatorsmile/commandDesc.
*	[SPARK-12490] Don't use Javascript for web UI's paginated table controls	Josh Rosen	2015-12-28	5	-97/+178
\| \| \| \| \| \| \| \| \| \|	The web UI's paginated table uses Javascript to implement certain navigation controls, such as table sorting and the "go to page" form. This is unnecessary and should be simplified to use plain HTML form controls and links. /cc zsxwing, who wrote this original code, and yhuai. Author: Josh Rosen <joshrosen@databricks.com> Closes #10441 from JoshRosen/simplify-paginated-table-sorting.
*	[SPARK-12489][CORE][SQL][MLIB] Fix minor issues found by FindBugs	Shixiong Zhu	2015-12-28	7	-32/+51
\| \| \| \| \| \| \| \| \| \| \| \|	Include the following changes: 1. Close `java.sql.Statement` 2. Fix incorrect `asInstanceOf`. 3. Remove unnecessary `synchronized` and `ReentrantLock`. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10440 from zsxwing/findbugs.
*	[SPARK-12525] Fix fatal compiler warnings in Kinesis ASL due to @transient ↵	Josh Rosen	2015-12-28	2	-8/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	annotations The Scala 2.11 SBT build currently fails for Spark 1.6.0 and master due to warnings about the `transient` annotation: ``` [error] [warn] /Users/joshrosen/Documents/spark/extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala:73: no valid targets for annotation on value sc - it is discarded unused. You may specify targets with meta-annotations, e.g. (transient param) [error] [warn] transient sc: SparkContext, ``` This fix implemented here is the same as what we did in #8433: remove the `transient` annotations when they are not necessary and replace use `transient private val` in the remaining cases. Author: Josh Rosen <joshrosen@databricks.com> Closes #10479 from JoshRosen/fix-sbt-2.11.
*	[SPARK-12222][CORE] Deserialize RoaringBitmap using Kryo serializer throw ↵	Daoyuan Wang	2015-12-29	1	-6/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Buffer underflow exception Since we only need to implement `def skipBytes(n: Int)`, code in #10213 could be simplified. davies scwf Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #10253 from adrian-wang/kryo.
*	[SPARK-12441][SQL] Fixing missingInput in ↵	gatorsmile	2015-12-28	15	-18/+63
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Generate/MapPartitions/AppendColumns/MapGroups/CoGroup When explain any plan with Generate, we will see an exclamation mark in the plan. Normally, when we see this mark, it means the plan has an error. This PR is to correct the `missingInput` in `Generate`. For example, ```scala val df = Seq((1, "a b c"), (2, "a b"), (3, "a")).toDF("number", "letters") val df2 = df.explode('letters) { case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq } df2.explain(true) ``` Before the fix, the plan is like ``` == Parsed Logical Plan == 'Generate UserDefinedGenerator('letters), true, false, None +- Project [_1#0 AS number#2,_2#1 AS letters#3] +- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]] == Analyzed Logical Plan == number: int, letters: string, _1: string Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8] +- Project [_1#0 AS number#2,_2#1 AS letters#3] +- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]] == Optimized Logical Plan == Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8] +- LocalRelation [number#2,letters#3], [[1,a b c],[2,a b],[3,a]] == Physical Plan == !Generate UserDefinedGenerator(letters#3), true, false, [number#2,letters#3,_1#8] +- LocalTableScan [number#2,letters#3], [[1,a b c],[2,a b],[3,a]] ``` Updates: The same issues are also found in the other four Dataset operators: `MapPartitions`/`AppendColumns`/`MapGroups`/`CoGroup`. Fixed all these four. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10393 from gatorsmile/generateExplain.
*	[SPARK-7727][SQL] Avoid inner classes in RuleExecutor	Stephan Kessler	2015-12-28	3	-5/+74
\| \| \| \| \| \| \| \| \| \|	Moved (case) classes Strategy, Once, FixedPoint and Batch to the companion object. This is necessary if we want to have the Optimizer easily extendable in the following sense: Usually a user wants to add additional rules, and just take the ones that are already there. However, inner classes made that impossible since the code did not compile This allows easy extension of existing Optimizers see the DefaultOptimizerExtendableSuite for a corresponding test case. Author: Stephan Kessler <stephan.kessler@sap.com> Closes #10174 from stephankessler/SPARK-7727.
*	[SPARK-12424][ML] The implementation of ParamMap#filter is wrong.	Kousuke Saruta	2015-12-29	2	-2/+34
\| \| \| \| \| \| \| \| \|	ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKey` is collection.Map, not mutable.Map but the result is casted to mutable.Map using `asInstanceOf` so we get `ClassCastException`. Also, the return type of Map#filterKeys is not Serializable. It's the issue of Scala (https://issues.scala-lang.org/browse/SI-6654). Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10381 from sarutak/SPARK-12424.
*	[SPARK-12287][SQL] Support UnsafeRow in MapPartitions/MapGroups/CoGroup	gatorsmile	2015-12-28	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \|	Support Unsafe Row in MapPartitions/MapGroups/CoGroup. Added a test case for MapPartitions. Since MapGroups and CoGroup are built on AppendColumns, all the related dataset test cases already can verify the correctness when MapGroups and CoGroup processing unsafe rows. davies cloud-fan Not sure if my understanding is right, please correct me. Thank you! Author: gatorsmile <gatorsmile@gmail.com> Closes #10398 from gatorsmile/unsafeRowMapGroup.
*	[SPARK-12517] add default RDD name for one created via sc.textFile	Yaron Weinsberg	2015-12-29	2	-2/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The feature was first added at commit: 7b877b27053bfb7092e250e01a3b887e1b50a109 but was later removed (probably by mistake) at commit: fc8b58195afa67fbb75b4c8303e022f703cbf007. This change sets the default path of RDDs created via sc.textFile(...) to the path argument. Here is the symptom: * Using spark-1.5.2-bin-hadoop2.6: scala> sc.textFile("/home/root/.bashrc").name res5: String = null scala> sc.binaryFiles("/home/root/.bashrc").name res6: String = /home/root/.bashrc * while using Spark 1.3.1: scala> sc.textFile("/home/root/.bashrc").name res0: String = /home/root/.bashrc scala> sc.binaryFiles("/home/root/.bashrc").name res1: String = /home/root/.bashrc Author: Yaron Weinsberg <wyaron@gmail.com> Author: yaron <yaron@il.ibm.com> Closes #10456 from wyaron/master.
*	[SPARK-12231][SQL] create a combineFilters' projection when we call ↵	Kevin Yu	2015-12-28	2	-5/+64
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	buildPartitionedTableScan Hello Michael & All: We have some issues to submit the new codes in the other PR(#10299), so we closed that PR and open this one with the fix. The reason for the previous failure is that the projection for the scan when there is a filter that is not pushed down (the "left-over" filter) could be different, in elements or ordering, from the original projection. With this new codes, the approach to solve this problem is: Insert a new Project if the "left-over" filter is nonempty and (the original projection is not empty and the projection for the scan has more than one elements which could otherwise cause different ordering in projection). We create 3 test cases to cover the otherwise failure cases. Author: Kevin Yu <qyu@us.ibm.com> Closes #10388 from kevinyu98/spark-12231.
*	[HOT-FIX] bypass hive test when parse logical plan to json	Wenchen Fan	2015-12-28	1	-3/+3
\| \| \| \| \| \| \| \| \| \|	https://github.com/apache/spark/pull/10311 introduces some rare, non-deterministic flakiness for hive udf tests, see https://github.com/apache/spark/pull/10311#issuecomment-166548851 I can't reproduce it locally, and may need more time to investigate, a quick solution is: bypass hive tests for json serialization. Author: Wenchen Fan <wenchen@databricks.com> Closes #10430 from cloud-fan/hot-fix.
*	[SPARK-12508][PROJECT-INFRA] Fix minor bugs in ↵	Josh Rosen	2015-12-28	1	-17/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	dev/tests/pr_public_classes.sh script This patch fixes a handful of minor bugs in the `dev/tests/pr_public_classes.sh` script, which is used by the `run_tests_jenkins` script to detect the addition of new public classes: - Account for differences between BSD and GNU `sed` in order to allow the script to run on OS X. - Diff `$ghprbActualCommit^...$ghprbActualCommit ` instead of `master...$ghprbActualCommit`: since `ghprbActualCommit` is a merge commit which results from merging the PR into the target branch, this will give us the desired diff and will avoid certain race-conditions which could lead to false-positives. - Use `echo -e` instead of `echo` so that newline characters are handled correctly in output. This should fix a formatting glitch which caused the output to appear on a single line in the GitHub comment (see [the SC2028 page](https://github.com/koalaman/shellcheck/wiki/SC2028) on the Shellcheck wiki for more details). Author: Josh Rosen <joshrosen@databricks.com> Closes #10455 from JoshRosen/fix-pr-public-classes-test.
*	[SPARK-12218] Fixes ORC conjunction predicate push down	Cheng Lian	2015-12-28	3	-30/+112
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR is a follow-up of PR #10362. Two major changes: 1. The fix introduced in #10362 is OK for Parquet, but may disable ORC PPD in many cases PR #10362 stops converting an `AND` predicate if any branch is inconvertible. On the other hand, `OrcFilters` combines all filters into a single big conjunction first and then tries to convert it into ORC `SearchArgument`. This means, if any filter is inconvertible, no filters can be pushed down. This PR fixes this issue by finding out all convertible filters first before doing the actual conversion. The reason behind the current implementation is mostly due to the limitation of ORC `SearchArgument` builder, which is documented in this PR in detail. 1. Copied the `AND` predicate fix for ORC from #10362 to avoid merge conflict. Same as #10362, this PR targets master (2.0.0-SNAPSHOT), branch-1.6, and branch-1.5. Author: Cheng Lian <lian@databricks.com> Closes #10377 from liancheng/spark-12218.fix-orc-conjunction-ppd.
*	[SPARK-12353][STREAMING][PYSPARK] Fix countByValue inconsistent output in ↵	jerryshao	2015-12-28	2	-5/+16
\| \| \| \| \| \| \| \| \| \|	Python API The semantics of Python countByValue is different from Scala API, it is more like countDistinctValue, so here change to make it consistent with Scala/Java API. Author: jerryshao <sshao@hortonworks.com> Closes #10350 from jerryshao/SPARK-12353.
*	[SPARK-12515][SQL][DOC] minor doc update for read.jdbc	felixcheung	2015-12-28	1	-5/+6
\| \| \| \| \| \|	Author: felixcheung <felixcheung_m@hotmail.com> Closes #10465 from felixcheung/dfreaderjdbcdoc.
*	[SPARK-12520] [PYSPARK] Correct Descriptions and Add Use Cases in Equi-Join	gatorsmile	2015-12-27	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After reading the JIRA https://issues.apache.org/jira/browse/SPARK-12520, I double checked the code. For example, users can do the Equi-Join like ```df.join(df2, 'name', 'outer').select('name', 'height').collect()``` - There exists a bug in 1.5 and 1.4. The code just ignores the third parameter (join type) users pass. However, the join type we called is `Inner`, even if the user-specified type is the other type (e.g., `Outer`). - After a PR: https://github.com/apache/spark/pull/8600, the 1.6 does not have such an issue, but the description has not been updated. Plan to submit another PR to fix 1.5 and issue an error message if users specify a non-inner join type when using Equi-Join. Author: gatorsmile <gatorsmile@gmail.com> Closes #10477 from gatorsmile/pyOuterJoin.
*	[SPARK-12396][CORE] Modify the function scheduleAtFixedRate to schedule.	echo2mei	2015-12-25	1	-2/+2
\| \| \| \| \| \| \| \| \|	Instead of just cancel the registrationRetryTimer to avoid driver retry connect to master, change the function to schedule. It is no need to register to master iteratively. Author: echo2mei <534384876@qq.com> Closes #10447 from echoTomei/master.
*	[SPARK-12440][CORE] Avoid setCheckpoint warning when directory is not local	pierre-borckmans	2015-12-24	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In SparkContext method `setCheckpointDir`, a warning is issued when spark master is not local and the passed directory for the checkpoint dir appears to be local. In practice, when relying on HDFS configuration file and using a relative path for the checkpoint directory (using an incomplete URI without HDFS scheme, ...), this warning should not be issued and might be confusing. In fact, in this case, the checkpoint directory is successfully created, and the checkpointing mechanism works as expected. This PR uses the `FileSystem` instance created with the given directory, and checks whether it is local or not. (The rationale is that since this same `FileSystem` instance is used to create the checkpoint dir anyway and can therefore be reliably used to determine if it is local or not). The warning is only issued if the directory is not local, on top of the existing conditions. Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com> Closes #10392 from pierre-borckmans/SPARK-12440_CheckpointDir_Warning_NonLocal.
*	[SPARK-12010][SQL] Spark JDBC requires support for column-name-free INSERT ↵	CK50	2015-12-24	1	-8/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	syntax In the past Spark JDBC write only worked with technologies which support the following INSERT statement syntax (JdbcUtils.scala: insertStatement()): INSERT INTO $table VALUES ( ?, ?, ..., ? ) But some technologies require a list of column names: INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? ) This was blocking the use of e.g. the Progress JDBC Driver for Cassandra. Another limitation is that syntax 1 relies no the dataframe field ordering match that of the target table. This works fine, as long as the target table has been created by writer.jdbc(). If the target table contains more columns (not created by writer.jdbc()), then the insert fails due mismatch of number of columns or their data types. This PR switches to the recommended second INSERT syntax. Column names are taken from datafram field names. Author: CK50 <christian.kurz@oracle.com> Closes #10380 from CK50/master-SPARK-12010-2.
*	[SPARK-12311][CORE] Restore previous value of "os.arch" property in test ↵	Kazuaki Ishizaki	2015-12-24	49	-142/+338
\| \| \| \| \| \| \| \| \| \| \| \|	suites after forcing to set specific value to "os.arch" property Restore the original value of os.arch property after each test Since some of tests forced to set the specific value to os.arch property, we need to set the original value. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #10289 from kiszk/SPARK-12311.
*	[SPARK-12502][BUILD][PYTHON] Script /dev/run-tests fails when IBM Java is used	Kazuaki Ishizaki	2015-12-24	1	-4/+3
\| \| \| \| \| \| \| \|	fix an exception with IBM JDK by removing update field from a JavaVersion tuple. This is because IBM JDK does not have information on update '_xx' Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #10463 from kiszk/SPARK-12502.
*	[SPARK-12499][BUILD] don't force MAVEN_OPTS	Adrian Bridgett	2015-12-23	1	-1/+1
\| \| \| \| \| \| \| \|	allow the user to override MAVEN_OPTS (2GB wasn't sufficient for me) Author: Adrian Bridgett <adrian@smop.co.uk> Closes #10448 from abridgett/feature/do_not_force_maven_opts.
*	[SPARK-12500][CORE] Fix Tachyon deprecations; pull Tachyon dependency into ↵	Sean Owen	2015-12-23	3	-84/+104
\| \| \| \| \| \| \| \| \| \| \| \|	one class Fix Tachyon deprecations; pull Tachyon dependency into `TachyonBlockManager` only CC calvinjia as I probably need a double-check that the usage of the new API is correct. Author: Sean Owen <sowen@cloudera.com> Closes #10449 from srowen/SPARK-12500.
*	[SPARK-12477][SQL] - Tungsten projection fails for null values in array fields	pierre-borckmans	2015-12-22	2	-1/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Accessing null elements in an array field fails when tungsten is enabled. It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled. This PR solves this by checking if the accessed element in the array field is null, in the generated code. Example: ``` // Array of String case class AS( as: Seq[String] ) val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF dfAS.registerTempTable("T_AS") for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))} ``` With Tungsten disabled: ``` 0 = [a] 1 = [null] 2 = [b] ``` With Tungsten enabled: ``` 0 = [a] 15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90) at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ``` Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com> Closes #10429 from pierre-borckmans/SPARK-12477_Tungsten-Projection-Null-Element-In-Array.
*	[SPARK-11164][SQL] Add InSet pushdown filter back for Parquet	Liang-Chi Hsieh	2015-12-23	3	-8/+45
\| \| \| \| \| \| \| \| \| \|	When the filter is ```"b in ('1', '2')"```, the filter is not pushed down to Parquet. Thanks! Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10278 from gatorsmile/parquetFilterNot.
*	[SPARK-12478][SQL] Bugfix: Dataset fields of product types can't be null	Cheng Lian	2015-12-23	2	-4/+15
\| \| \| \| \| \| \| \| \| \| \| \|	When creating extractors for product types (i.e. case classes and tuples), a null check is missing, thus we always assume input product values are non-null. This PR adds a null check in the extractor expression for product types. The null check is stripped off for top level product fields, which are mapped to the outermost `Row`s, since they can't be null. Thanks cloud-fan for helping investigating this issue! Author: Cheng Lian <lian@databricks.com> Closes #10431 from liancheng/spark-12478.top-level-null-field.
*	[SPARK-12429][STREAMING][DOC] Add Accumulator and Broadcast example for ↵	Shixiong Zhu	2015-12-22	5	-13/+325
\| \| \| \| \| \| \| \| \| \|	Streaming This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10385 from zsxwing/accumulator-broadcast-example.
*	[SPARK-12487][STREAMING][DOCUMENT] Add docs for Kafka message handler	Shixiong Zhu	2015-12-22	1	-0/+3
\| \| \| \| \| \|	Author: Shixiong Zhu <shixiong@databricks.com> Closes #10439 from zsxwing/kafka-message-handler-doc.
*	[SPARK-12102][SQL] Cast a non-nullable struct field to a nullable field ↵	Dilip Biswal	2015-12-22	2	-1/+9
\| \| \| \| \| \| \| \| \| \|	during analysis Compare both left and right side of the case expression ignoring nullablity when checking for type equality. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10156 from dilipbiswal/spark-12102.
*	[SPARK-12471][CORE] Spark daemons will log their pid on start up.	Nong Li	2015-12-22	9	-20/+35
\| \| \| \| \| \|	Author: Nong Li <nong@databricks.com> Closes #10422 from nongli/12471-pids.
*	Minor corrections, i.e. typo fixes and follow deprecated	Jacek Laskowski	2015-12-22	5	-6/+6
\| \| \| \| \| \|	Author: Jacek Laskowski <jacek@japila.pl> Closes #10432 from jaceklaskowski/minor-corrections.
*	[SPARK-12456][SQL] Add ExpressionDescription to misc functions	Xiu Guo	2015-12-22	4	-0/+29
\| \| \| \| \| \| \| \|	First try, not sure how much information we need to provide in the usage part. Author: Xiu Guo <xguo27@gmail.com> Closes #10423 from xguo27/SPARK-12456.
*	[SPARK-12475][BUILD] Upgrade Zinc from 0.3.5.3 to 0.3.9	Josh Rosen	2015-12-22	1	-3/+3
\| \| \| \| \| \| \| \|	We should update to the latest version of Zinc in order to match our SBT version. Author: Josh Rosen <joshrosen@databricks.com> Closes #10426 from JoshRosen/update-zinc.
*	[SPARK-11677][SQL][FOLLOW-UP] Add tests for checking the ORC filter creation ↵	hyukjinkwon	2015-12-23	1	-0/+236
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	against pushed down filters. https://issues.apache.org/jira/browse/SPARK-11677 Although it checks correctly the filters by the number of results if ORC filter-push-down is enabled, the filters themselves are not being tested. So, this PR includes the test similarly with `ParquetFilterSuite`. Since the results are checked by `OrcQuerySuite`, this `OrcFilterSuite` only checks if the appropriate filters are created. One thing different with `ParquetFilterSuite` here is, it does not check the results because that is checked in `OrcQuerySuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #10341 from HyukjinKwon/SPARK-11677-followup.