spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-18333][SQL] Revert hacks in parquet and orc reader to support case ↵	Eric Liang	2016-11-09	3	-44/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	insensitive resolution ## What changes were proposed in this pull request? These are no longer needed after https://issues.apache.org/jira/browse/SPARK-17183 cc cloud-fan ## How was this patch tested? Existing parquet and orc tests. Author: Eric Liang <ekl@databricks.com> Closes #15799 from ericl/sc-4929.
*	[SPARK-18239][SPARKR] Gradient Boosted Tree for R	Felix Cheung	2016-11-08	10	-66/+696
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Gradient Boosted Tree in R. With a few minor improvements to RandomForest in R. Since this is relatively isolated I'd like to target this for branch-2.1 ## How was this patch tested? manual tests, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15746 from felixcheung/rgbt.
*	[SPARK-18342] Make rename failures fatal in HDFSBackedStateStore	Burak Yavuz	2016-11-08	2	-7/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? If the rename operation in the state store fails (`fs.rename` returns `false`), the StateStore should throw an exception and have the task retry. Currently if renames fail, nothing happens during execution immediately. However, you will observe that snapshot operations will fail, and then any attempt at recovery (executor failure / checkpoint recovery) also fails. ## How was this patch tested? Unit test Author: Burak Yavuz <brkyvz@gmail.com> Closes #15804 from brkyvz/rename-state.
*	[SPARK-18280][CORE] Fix potential deadlock in `StandaloneSchedulerBackend.dead`	Shixiong Zhu	2016-11-08	5	-2/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? "StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not call "SparkContext.stop" in the same thread. "SparkContext.stop" will block until all RPC threads exit, if it's called inside a RPC thread, it will be dead-lock. This PR add a thread local flag inside RPC threads. `SparkContext.stop` uses it to decide if launching a new thread to stop the SparkContext. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15775 from zsxwing/SPARK-18280.
*	[SPARK-17748][ML] Minor cleanups to one-pass linear regression with elastic net	Joseph K. Bradley	2016-11-08	3	-12/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? * Made SingularMatrixException private ml * WeightedLeastSquares: Changed to allow tol >= 0 instead of only tol > 0 ## How was this patch tested? existing tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #15779 from jkbradley/wls-cleanups.
*	[SPARK-18357] Fix yarn files/archive broken issue andd unit tests	Kishor Patil	2016-11-08	2	-1/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The #15627 broke functionality with yarn --files --archives does not accept any files. This patch ensures that --files and --archives accept unique files. ## How was this patch tested? A. I added unit tests. B. Also, manually tested --files with --archives to throw exception if duplicate files are specified and continue if unique files are specified. Author: Kishor Patil <kpatil@yahoo-inc.com> Closes #15810 from kishorvpatil/SPARK18357.
*	[SPARK-18191][CORE] Port RDD API to use commit protocol	jiangxingbo	2016-11-08	9	-176/+280
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR port RDD API to use commit protocol, the changes made here: 1. Add new internal helper class that saves an RDD using a Hadoop OutputFormat named `SparkNewHadoopWriter`, it's similar with `SparkHadoopWriter` but uses commit protocol. This class supports the newer `mapreduce` API, instead of the old `mapred` API which is supported by `SparkHadoopWriter`; 2. Rewrite `PairRDDFunctions.saveAsNewAPIHadoopDataset` function, so it uses commit protocol now. ## How was this patch tested? Exsiting test cases. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15769 from jiangxb1987/rdd-commit.
*	[SPARK-18346][SQL] TRUNCATE TABLE should fail if no partition is matched for ↵	Wenchen Fan	2016-11-08	3	-24/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	the given non-partial partition spec ## What changes were proposed in this pull request? a follow up of https://github.com/apache/spark/pull/15688 ## How was this patch tested? updated test in `DDLSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #15805 from cloud-fan/truncate.
*	[SPARK-17868][SQL] Do not use bitmasks during parsing and analysis of ↵	jiangxingbo	2016-11-08	5	-137/+474
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	CUBE/ROLLUP/GROUPING SETS ## What changes were proposed in this pull request? We generate bitmasks for grouping sets during the parsing process, and use these during analysis. These bitmasks are difficult to work with in practice and have lead to numerous bugs. This PR removes these and use actual sets instead, however we still need to generate these offsets for the grouping_id. This PR does the following works: 1. Replace bitmasks by actual grouping sets durning Parsing/Analysis stage of CUBE/ROLLUP/GROUPING SETS; 2. Add new testsuite `ResolveGroupingAnalyticsSuite` to test the `Analyzer.ResolveGroupingAnalytics` rule directly; 3. Fix a minor bug in `ResolveGroupingAnalytics`. ## How was this patch tested? By existing test cases, and add new testsuite `ResolveGroupingAnalyticsSuite` to test directly. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15484 from jiangxb1987/group-set.
*	[MINOR][DOC] Unify example marks	Zheng RuiFeng	2016-11-08	5	-21/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? 1, `Example` => `Examples`, because more algos use `Examples`. 2, delete `### Examples` in `Isotonic regression`, because it's not that special in http://spark.apache.org/docs/latest/ml-classification-regression.html 3, add missing marks for `LDA` and other algos. ## How was this patch tested? No tests for it only modify doc Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15783 from zhengruifeng/doc_fix.
*	[SPARK-13770][DOCUMENTATION][ML] Document the ML feature Interaction	chie8842	2016-11-08	3	-0/+208
\| \| \| \| \| \| \| \|	I created Scala and Java example and added documentation. Author: chie8842 <hayashidac@nttdata.co.jp> Closes #15658 from hayashidac/SPARK-13770.
*	[SPARK-18137][SQL] Fix RewriteDistinctAggregates UnresolvedException when a ↵	root	2016-11-08	2	-9/+61
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	UDAF has a foldable TypeCheck ## What changes were proposed in this pull request? In RewriteDistinctAggregates rewrite funtion,after the UDAF's childs are mapped to AttributeRefference, If the UDAF(such as ApproximatePercentile) has a foldable TypeCheck for the input, It will failed because the AttributeRefference is not foldable,then the UDAF is not resolved, and then nullify on the unresolved object will throw a Exception. In this PR, only map Unfoldable child to AttributeRefference, this can avoid the UDAF's foldable TypeCheck. and then only Expand Unfoldable child, there is no need to Expand a static value(foldable value). Before sql result > select percentile_approxy(key,0.99999),count(distinct key),sume(distinc key) from src limit 1 > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'percentile_approx(CAST(src.`key` AS DOUBLE), CAST(0.99999BD AS DOUBLE), 10000) > at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:92) > at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$nullify(RewriteDistinctAggregates.scala:261) After sql result > select percentile_approxy(key,0.99999),count(distinct key),sume(distinc key) from src limit 1 > [498.0,309,79136] ## How was this patch tested? Add a test case in HiveUDFSuit. Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)> Closes #15668 from windpiger/RewriteDistinctUDAFUnresolveExcep.
*	[SPARK-18207][SQL] Fix a compilation error due to HashExpression.doGenCode	Kazuaki Ishizaki	2016-11-08	2	-7/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR avoids a compilation error due to more than 64KB Java byte code size. This error occur since generate java code for computing a hash value for a row is too big. This PR fixes this compilation error by splitting a big code chunk into multiple methods by calling `CodegenContext.splitExpression` at `HashExpression.doGenCode` The test case requires a calculation of hash code for a row that includes 1000 String fields. `HashExpression.doGenCode` generate a lot of Java code for this computation into one function. As a result, the size of the corresponding Java bytecode is more than 64 KB. Generated code without this PR ````java /* 027 / public UnsafeRow apply(InternalRow i) { / 028 / boolean isNull = false; / 029 / / 030 / int value1 = 42; / 031 / / 032 / boolean isNull2 = i.isNullAt(0); / 033 / UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); / 034 / if (!isNull2) { / 035 / value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value2.getBaseObject(), value2.getBaseOffset(), value2.numBytes(), value1); / 036 / } / 037 / / 038 / / 039 / boolean isNull3 = i.isNullAt(1); / 040 / UTF8String value3 = isNull3 ? null : (i.getUTF8String(1)); / 041 / if (!isNull3) { / 042 / value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value3.getBaseObject(), value3.getBaseOffset(), value3.numBytes(), value1); / 043 / } / 044 / / 045 / ... / 7024 / / 7025 / boolean isNull1001 = i.isNullAt(999); / 7026 / UTF8String value1001 = isNull1001 ? null : (i.getUTF8String(999)); / 7027 / if (!isNull1001) { / 7028 / value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value1001.getBaseObject(), value1001.getBaseOffset(), value1001.numBytes(), value1); / 7029 / } / 7030 / / 7031 / / 7032 / boolean isNull1002 = i.isNullAt(1000); / 7033 / UTF8String value1002 = isNull1002 ? null : (i.getUTF8String(1000)); / 7034 / if (!isNull1002) { / 7035 / value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value1002.getBaseObject(), value1002.getBaseOffset(), value1002.numBytes(), value1); / 7036 / } ```` Generated code with this PR ````java / 3807 / private void apply_249(InternalRow i) { / 3808 / / 3809 / boolean isNull998 = i.isNullAt(996); / 3810 / UTF8String value998 = isNull998 ? null : (i.getUTF8String(996)); / 3811 / if (!isNull998) { / 3812 / value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value998.getBaseObject(), value998.getBaseOffset(), value998.numBytes(), value1); / 3813 / } / 3814 / / 3815 / boolean isNull999 = i.isNullAt(997); / 3816 / UTF8String value999 = isNull999 ? null : (i.getUTF8String(997)); / 3817 / if (!isNull999) { / 3818 / value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value999.getBaseObject(), value999.getBaseOffset(), value999.numBytes(), value1); / 3819 / } / 3820 / / 3821 / boolean isNull1000 = i.isNullAt(998); / 3822 / UTF8String value1000 = isNull1000 ? null : (i.getUTF8String(998)); / 3823 / if (!isNull1000) { / 3824 / value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value1000.getBaseObject(), value1000.getBaseOffset(), value1000.numBytes(), value1); / 3825 / } / 3826 / / 3827 / boolean isNull1001 = i.isNullAt(999); / 3828 / UTF8String value1001 = isNull1001 ? null : (i.getUTF8String(999)); / 3829 / if (!isNull1001) { / 3830 / value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value1001.getBaseObject(), value1001.getBaseOffset(), value1001.numBytes(), value1); / 3831 / } / 3832 / / 3833 / } / 3834 / ... / 4532 / private void apply_0(InternalRow i) { / 4533 / / 4534 / boolean isNull2 = i.isNullAt(0); / 4535 / UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); / 4536 / if (!isNull2) { / 4537 / value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value2.getBaseObject(), value2.getBaseOffset(), value2.numBytes(), value1); / 4538 / } / 4539 / / 4540 / boolean isNull3 = i.isNullAt(1); / 4541 / UTF8String value3 = isNull3 ? null : (i.getUTF8String(1)); / 4542 / if (!isNull3) { / 4543 / value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value3.getBaseObject(), value3.getBaseOffset(), value3.numBytes(), value1); / 4544 / } / 4545 / / 4546 / boolean isNull4 = i.isNullAt(2); / 4547 / UTF8String value4 = isNull4 ? null : (i.getUTF8String(2)); / 4548 / if (!isNull4) { / 4549 / value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value4.getBaseObject(), value4.getBaseOffset(), value4.numBytes(), value1); / 4550 / } / 4551 / / 4552 / boolean isNull5 = i.isNullAt(3); / 4553 / UTF8String value5 = isNull5 ? null : (i.getUTF8String(3)); / 4554 / if (!isNull5) { / 4555 / value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value5.getBaseObject(), value5.getBaseOffset(), value5.numBytes(), value1); / 4556 / } / 4557 / / 4558 / } ... / 7344 / public UnsafeRow apply(InternalRow i) { / 7345 / boolean isNull = false; / 7346 / / 7347 / value1 = 42; / 7348 / apply_0(i); / 7349 / apply_1(i); ... / 7596 / apply_248(i); / 7597 / apply_249(i); / 7598 / apply_250(i); / 7599 */ apply_251(i); ... ```` ## How was this patch tested? Add a new test in `DataFrameSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #15745 from kiszk/SPARK-18207.
*	[SPARK-16575][CORE] partition calculation mismatch with sc.binaryFiles	fidato	2016-11-07	4	-5/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This Pull request comprises of the critical bug SPARK-16575 changes. This change rectifies the issue with BinaryFileRDD partition calculations as upon creating an RDD with sc.binaryFiles, the resulting RDD always just consisted of two partitions only. ## How was this patch tested? The original issue ie. getNumPartitions on binary Files RDD (always having two partitions) was first replicated and then tested upon the changes. Also the unit tests have been checked and passed. This contribution is my original work and I licence the work to the project under the project's open source license srowen hvanhovell rxin vanzin skyluc kmader zsxwing datafarmer Please have a look . Author: fidato <fidato.july13@gmail.com> Closes #15327 from fidato13/SPARK-16575.
*	[SPARK-18217][SQL] Disallow creating permanent views based on temporary ↵	gatorsmile	2016-11-07	5	-12/+172
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	views or UDFs ### What changes were proposed in this pull request? Based on the discussion in [SPARK-18209](https://issues.apache.org/jira/browse/SPARK-18209). It doesn't really make sense to create permanent views based on temporary views or temporary UDFs. To disallow the supports and issue the exceptions, this PR needs to detect whether a temporary view/UDF is being used when defining a permanent view. Basically, this PR can be split to two sub-tasks: Task 1: detecting a temporary view from the query plan of view definition. When finding an unresolved temporary view, Analyzer replaces it by a `SubqueryAlias` with the corresponding logical plan, which is stored in an in-memory HashMap. After replacement, it is impossible to detect whether the `SubqueryAlias` is added/generated from a temporary view. Thus, to detect the usage of a temporary view in view definition, this PR traverses the unresolved logical plan and uses the name of an `UnresolvedRelation` to detect whether it is a (global) temporary view. Task 2: detecting a temporary UDF from the query plan of view definition. Detecting usage of a temporary UDF in view definition is not straightfoward. First, in the analyzed plan, we are having different forms to represent the functions. More importantly, some classes (e.g., `HiveGenericUDF`) are not accessible from `CreateViewCommand`, which is part of `sql/core`. Thus, we used the unanalyzed plan `child` of `CreateViewCommand` to detect the usage of a temporary UDF. Because the plan has already been successfully analyzed, we can assume the functions have been defined/registered. Second, in Spark, the functions have four forms: Spark built-in functions, built-in hash functions, permanent UDFs and temporary UDFs. We do not have any direct way to determine whether a function is temporary or not. Thus, we introduced a function `isTemporaryFunction` in `SessionCatalog`. This function contains the detailed logics to determine whether a function is temporary or not. ### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #15764 from gatorsmile/blockTempFromPermViewCreation.
*	[SPARK-18261][STRUCTURED STREAMING] Add statistics to MemorySink for joining	Liwei Lin	2016-11-07	2	-1/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Right now, there is no way to join the output of a memory sink with any table: > UnsupportedOperationException: LeafNode MemoryPlan must implement statistics This patch adds statistics to MemorySink, making joining snapshots of memory streams with tables possible. ## How was this patch tested? Added a test case. Author: Liwei Lin <lwlin7@gmail.com> Closes #15786 from lw-lin/memory-sink-stat.
*	[SPARK-18086] Add support for Hive session vars.	Ryan Blue	2016-11-07	4	-5/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This adds support for Hive variables: * Makes values set via `spark-sql --hivevar name=value` accessible * Adds `getHiveVar` and `setHiveVar` to the `HiveClient` interface * Adds a SessionVariables trait for sessions like Hive that support variables (including Hive vars) * Adds SessionVariables support to variable substitution * Adds SessionVariables support to the SET command ## How was this patch tested? * Adds a test to all supported Hive versions for accessing Hive variables * Adds HiveVariableSubstitutionSuite Author: Ryan Blue <blue@apache.org> Closes #15738 from rdblue/SPARK-18086-add-hivevar-support.
*	[SPARK-18295][SQL] Make to_json function null safe (matching it to from_json)	hyukjinkwon	2016-11-07	3	-11/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR proposes to match up the behaviour of `to_json` to `from_json` function for null-safety. Currently, it throws `NullPointException` but this PR fixes this to produce `null` instead. with the data below: ```scala import spark.implicits._ val df = Seq(Some(Tuple1(Tuple1(1))), None).toDF("a") df.show() ``` ``` +----+ \| a\| +----+ \| [1]\| \|null\| +----+ ``` the codes below ```scala import org.apache.spark.sql.functions._ df.select(to_json($"a")).show() ``` produces.. Before throws `NullPointException` as below: ``` java.lang.NullPointerException at org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeFields(JacksonGenerator.scala:138) at org.apache.spark.sql.catalyst.json.JacksonGenerator$$anonfun$write$1.apply$mcV$sp(JacksonGenerator.scala:194) at org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeObject(JacksonGenerator.scala:131) at org.apache.spark.sql.catalyst.json.JacksonGenerator.write(JacksonGenerator.scala:193) at org.apache.spark.sql.catalyst.expressions.StructToJson.eval(jsonExpressions.scala:544) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142) at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48) at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) ``` After ``` +---------------+ \|structtojson(a)\| +---------------+ \| {"_1":1}\| \| null\| +---------------+ ``` ## How was this patch tested? Unit test in `JsonExpressionsSuite.scala` and `JsonFunctionsSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15792 from HyukjinKwon/SPARK-18295.
*	[SPARK-18236] Reduce duplicate objects in Spark UI and HistoryServer	Josh Rosen	2016-11-07	8	-38/+84
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? When profiling heap dumps from the HistoryServer and live Spark web UIs, I found a large amount of memory being wasted on duplicated objects and strings. This patch's changes remove most of this duplication, resulting in over 40% memory savings for some benchmarks. - Task metrics (6441f0624dfcda9c7193a64bfb416a145b5aabdf): previously, every `TaskUIData` object would have its own instances of `InputMetricsUIData`, `OutputMetricsUIData`, `ShuffleReadMetrics`, and `ShuffleWriteMetrics`, but for many tasks these metrics are irrelevant because they're all zero. This patch changes how we construct these metrics in order to re-use a single immutable "empty" value for the cases where these metrics are empty. - TaskInfo.accumulables (ade86db901127bf13c0e0bdc3f09c933a093bb76): Previously, every `TaskInfo` object had its own empty `ListBuffer` for holding updates from named accumulators. Tasks which didn't use named accumulators still paid for the cost of allocating and storing this empty buffer. To avoid this overhead, I changed the `val` with a mutable buffer into a `var` which holds an immutable Scala list, allowing tasks which do not have named accumulator updates to share the same singleton `Nil` object. - String.intern() in JSONProtocol (7e05630e9a78c455db8c8c499f0590c864624e05): in the HistoryServer, executor hostnames and ids are deserialized from JSON, leading to massive duplication of these string objects. By calling `String.intern()` on the deserialized values we can remove all of this duplication. Since Spark now requires Java 7+ we don't have to worry about string interning exhausting the permgen (see http://java-performance.info/string-intern-in-java-6-7-8/). ## How was this patch tested? I ran ``` sc.parallelize(1 to 100000, 100000).count() ``` in `spark-shell` with event logging enabled, then loaded that event log in the HistoryServer, performed a full GC, and took a heap dump. According to YourKit, the changes in this patch reduced memory consumption by roughly 28 megabytes (or 770k Java objects): ![image](https://cloud.githubusercontent.com/assets/50748/19953276/4f3a28aa-a129-11e6-93df-d7fa91396f66.png) Here's a table illustrating the drop in objects due to deduplication (the drop is <100k for some objects because some events were dropped from the listener bus; this is a separate, existing bug that I'll address separately after CPU-profiling): ![image](https://cloud.githubusercontent.com/assets/50748/19953290/6a271290-a129-11e6-93ad-b825f1448886.png) Author: Josh Rosen <joshrosen@databricks.com> Closes #15743 from JoshRosen/spark-ui-memory-usage.
*	[SPARK-17490][SQL] Optimize SerializeFromObject() for a primitive array	Kazuaki Ishizaki	2016-11-08	7	-14/+203
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Waiting for merging #13680 This PR optimizes `SerializeFromObject()` for an primitive array. This is derived from #13758 to address one of problems by using a simple way in #13758. The current implementation always generates `GenericArrayData` from `SerializeFromObject()` for any type of an array in a logical plan. This involves a boxing at a constructor of `GenericArrayData` when `SerializedFromObject()` has an primitive array. This PR enables to generate `UnsafeArrayData` from `SerializeFromObject()` for a primitive array. It can avoid boxing to create an instance of `ArrayData` in the generated code by Catalyst. This PR also generate `UnsafeArrayData` in a case for `RowEncoder.serializeFor` or `CatalystTypeConverters.createToCatalystConverter`. Performance improvement of `SerializeFromObject()` is up to 2.0x ``` OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64 Intel Xeon E3-12xx v2 (Ivy Bridge) Without this PR Write an array in Dataset: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Int 556 / 608 15.1 66.3 1.0X Double 1668 / 1746 5.0 198.8 0.3X with this PR Write an array in Dataset: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Int 352 / 401 23.8 42.0 1.0X Double 821 / 885 10.2 97.9 0.4X ``` Here is an example program that will happen in mllib as described in [SPARK-16070](https://issues.apache.org/jira/browse/SPARK-16070). ``` sparkContext.parallelize(Seq(Array(1, 2)), 1).toDS.map(e => e).show ``` Generated code before applying this PR ``` java /* 039 / protected void processNext() throws java.io.IOException { / 040 / while (inputadapter_input.hasNext()) { / 041 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 042 / int[] inputadapter_value = (int[])inputadapter_row.get(0, null); / 043 / / 044 / Object mapelements_obj = ((Expression) references[0]).eval(null); / 045 / scala.Function1 mapelements_value1 = (scala.Function1) mapelements_obj; / 046 / / 047 / boolean mapelements_isNull = false \|\| false; / 048 / int[] mapelements_value = null; / 049 / if (!mapelements_isNull) { / 050 / Object mapelements_funcResult = null; / 051 / mapelements_funcResult = mapelements_value1.apply(inputadapter_value); / 052 / if (mapelements_funcResult == null) { / 053 / mapelements_isNull = true; / 054 / } else { / 055 / mapelements_value = (int[]) mapelements_funcResult; / 056 / } / 057 / / 058 / } / 059 / mapelements_isNull = mapelements_value == null; / 060 / / 061 / serializefromobject_argIsNulls[0] = mapelements_isNull; / 062 / serializefromobject_argValue = mapelements_value; / 063 / / 064 / boolean serializefromobject_isNull = false; / 065 / for (int idx = 0; idx < 1; idx++) { / 066 / if (serializefromobject_argIsNulls[idx]) { serializefromobject_isNull = true; break; } / 067 / } / 068 / / 069 / final ArrayData serializefromobject_value = serializefromobject_isNull ? null : new org.apache.spark.sql.catalyst.util.GenericArrayData(serializefromobject_argValue); / 070 / serializefromobject_holder.reset(); / 071 / / 072 / serializefromobject_rowWriter.zeroOutNullBytes(); / 073 / / 074 / if (serializefromobject_isNull) { / 075 / serializefromobject_rowWriter.setNullAt(0); / 076 / } else { / 077 / // Remember the current cursor so that we can calculate how many bytes are / 078 / // written later. / 079 / final int serializefromobject_tmpCursor = serializefromobject_holder.cursor; / 080 / / 081 / if (serializefromobject_value instanceof UnsafeArrayData) { / 082 / final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes(); / 083 / // grow the global buffer before writing data. / 084 / serializefromobject_holder.grow(serializefromobject_sizeInBytes); / 085 / ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor); / 086 / serializefromobject_holder.cursor += serializefromobject_sizeInBytes; / 087 / / 088 / } else { / 089 / final int serializefromobject_numElements = serializefromobject_value.numElements(); / 090 / serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 4); / 091 / / 092 / for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) { / 093 / if (serializefromobject_value.isNullAt(serializefromobject_index)) { / 094 / serializefromobject_arrayWriter.setNullInt(serializefromobject_index); / 095 / } else { / 096 / final int serializefromobject_element = serializefromobject_value.getInt(serializefromobject_index); / 097 / serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element); / 098 / } / 099 / } / 100 / } / 101 / / 102 / serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor); / 103 / } / 104 / serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize()); / 105 / append(serializefromobject_result); / 106 / if (shouldStop()) return; / 107 / } / 108 / } / 109 / } ``` Generated code after applying this PR ``` java / 035 / protected void processNext() throws java.io.IOException { / 036 / while (inputadapter_input.hasNext()) { / 037 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 038 / int[] inputadapter_value = (int[])inputadapter_row.get(0, null); / 039 / / 040 / Object mapelements_obj = ((Expression) references[0]).eval(null); / 041 / scala.Function1 mapelements_value1 = (scala.Function1) mapelements_obj; / 042 / / 043 / boolean mapelements_isNull = false \|\| false; / 044 / int[] mapelements_value = null; / 045 / if (!mapelements_isNull) { / 046 / Object mapelements_funcResult = null; / 047 / mapelements_funcResult = mapelements_value1.apply(inputadapter_value); / 048 / if (mapelements_funcResult == null) { / 049 / mapelements_isNull = true; / 050 / } else { / 051 / mapelements_value = (int[]) mapelements_funcResult; / 052 / } / 053 / / 054 / } / 055 / mapelements_isNull = mapelements_value == null; / 056 / / 057 / boolean serializefromobject_isNull = mapelements_isNull; / 058 / final ArrayData serializefromobject_value = serializefromobject_isNull ? null : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(mapelements_value); / 059 / serializefromobject_isNull = serializefromobject_value == null; / 060 / serializefromobject_holder.reset(); / 061 / / 062 / serializefromobject_rowWriter.zeroOutNullBytes(); / 063 / / 064 / if (serializefromobject_isNull) { / 065 / serializefromobject_rowWriter.setNullAt(0); / 066 / } else { / 067 / // Remember the current cursor so that we can calculate how many bytes are / 068 / // written later. / 069 / final int serializefromobject_tmpCursor = serializefromobject_holder.cursor; / 070 / / 071 / if (serializefromobject_value instanceof UnsafeArrayData) { / 072 / final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes(); / 073 / // grow the global buffer before writing data. / 074 / serializefromobject_holder.grow(serializefromobject_sizeInBytes); / 075 / ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor); / 076 / serializefromobject_holder.cursor += serializefromobject_sizeInBytes; / 077 / / 078 / } else { / 079 / final int serializefromobject_numElements = serializefromobject_value.numElements(); / 080 / serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 4); / 081 / / 082 / for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) { / 083 / if (serializefromobject_value.isNullAt(serializefromobject_index)) { / 084 / serializefromobject_arrayWriter.setNullInt(serializefromobject_index); / 085 / } else { / 086 / final int serializefromobject_element = serializefromobject_value.getInt(serializefromobject_index); / 087 / serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element); / 088 / } / 089 / } / 090 / } / 091 / / 092 / serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor); / 093 / } / 094 / serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize()); / 095 / append(serializefromobject_result); / 096 / if (shouldStop()) return; / 097 / } / 098 / } / 099 */ } ``` ## How was this patch tested? Added a test in `DatasetSuite`, `RowEncoderSuite`, and `CatalystTypeConvertersSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #15044 from kiszk/SPARK-17490.
*	[SPARK-14914][CORE] Fix Resource not closed after using, mostly for unit tests	Hyukjin Kwon	2016-11-07	12	-49/+93
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Close `FileStreams`, `ZipFiles` etc to release the resources after using. Not closing the resources will cause IO Exception to be raised while deleting temp files. ## How was this patch tested? Existing tests Author: U-FAREAST\tl <tl@microsoft.com> Author: hyukjinkwon <gurwls223@gmail.com> Author: Tao LI <tl@microsoft.com> Closes #15618 from HyukjinKwon/SPARK-14914-1.
*	[SPARK-17108][SQL] Fix BIGINT and INT comparison failure in spark sql	Weiqing Yang	2016-11-07	2	-1/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add a function to check if two integers are compatible when invoking `acceptsType()` in `DataType`. ## How was this patch tested? Manually. E.g. ``` spark.sql("create table t3(a map<bigint, array<string>>)") spark.sql("select * from t3 where a[1] is not null") ``` Before: ``` cannot resolve 't.`a`[1]' due to data type mismatch: argument 2 requires bigint type, however, '1' is of int type.; line 1 pos 22 org.apache.spark.sql.AnalysisException: cannot resolve 't.`a`[1]' due to data type mismatch: argument 2 requires bigint type, however, '1' is of int type.; line 1 pos 22 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:307) ``` After: Run the sql queries above. No errors. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #15448 from weiqingy/SPARK_17108.
*	[SPARK-18283][STRUCTURED STREAMING][KAFKA] Added test to check whether ↵	Tathagata Das	2016-11-07	1	-0/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	default starting offset in latest ## What changes were proposed in this pull request? Added test to check whether default starting offset in latest ## How was this patch tested? new unit test Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #15778 from tdas/SPARK-18283.
*	[SPARK-18291][SPARKR][ML] SparkR glm predict should output original label ↵	Yanbo Liang	2016-11-07	2	-13/+84
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	when family = binomial. ## What changes were proposed in this pull request? SparkR ```spark.glm``` predict should output original label when family = "binomial". ## How was this patch tested? Add unit test. You can also run the following code to test: ```R training <- suppressWarnings(createDataFrame(iris)) training <- training[training$Species %in% c("versicolor", "virginica"), ] model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width,family = binomial(link = "logit")) showDF(predict(model, training)) ``` Before this change: ``` +------------+-----------+------------+-----------+----------+-----+-------------------+ \|Sepal_Length\|Sepal_Width\|Petal_Length\|Petal_Width\| Species\|label\| prediction\| +------------+-----------+------------+-----------+----------+-----+-------------------+ \| 7.0\| 3.2\| 4.7\| 1.4\|versicolor\| 0.0\| 0.8271421517601544\| \| 6.4\| 3.2\| 4.5\| 1.5\|versicolor\| 0.0\| 0.6044595910413112\| \| 6.9\| 3.1\| 4.9\| 1.5\|versicolor\| 0.0\| 0.7916340858281998\| \| 5.5\| 2.3\| 4.0\| 1.3\|versicolor\| 0.0\|0.16080518180591158\| \| 6.5\| 2.8\| 4.6\| 1.5\|versicolor\| 0.0\| 0.6112229217050189\| \| 5.7\| 2.8\| 4.5\| 1.3\|versicolor\| 0.0\| 0.2555087295500885\| \| 6.3\| 3.3\| 4.7\| 1.6\|versicolor\| 0.0\| 0.5681507664364834\| \| 4.9\| 2.4\| 3.3\| 1.0\|versicolor\| 0.0\|0.05990570219972002\| \| 6.6\| 2.9\| 4.6\| 1.3\|versicolor\| 0.0\| 0.6644434078306246\| \| 5.2\| 2.7\| 3.9\| 1.4\|versicolor\| 0.0\|0.11293577405862379\| \| 5.0\| 2.0\| 3.5\| 1.0\|versicolor\| 0.0\|0.06152372321585971\| \| 5.9\| 3.0\| 4.2\| 1.5\|versicolor\| 0.0\|0.35250697207602555\| \| 6.0\| 2.2\| 4.0\| 1.0\|versicolor\| 0.0\|0.32267018290814303\| \| 6.1\| 2.9\| 4.7\| 1.4\|versicolor\| 0.0\| 0.433391153814592\| \| 5.6\| 2.9\| 3.6\| 1.3\|versicolor\| 0.0\| 0.2280744262436993\| \| 6.7\| 3.1\| 4.4\| 1.4\|versicolor\| 0.0\| 0.7219848389339459\| \| 5.6\| 3.0\| 4.5\| 1.5\|versicolor\| 0.0\|0.23527698971404695\| \| 5.8\| 2.7\| 4.1\| 1.0\|versicolor\| 0.0\| 0.285024533520016\| \| 6.2\| 2.2\| 4.5\| 1.5\|versicolor\| 0.0\| 0.4107047877447493\| \| 5.6\| 2.5\| 3.9\| 1.1\|versicolor\| 0.0\|0.20083561961645083\| +------------+-----------+------------+-----------+----------+-----+-------------------+ ``` After this change: ``` +------------+-----------+------------+-----------+----------+-----+----------+ \|Sepal_Length\|Sepal_Width\|Petal_Length\|Petal_Width\| Species\|label\|prediction\| +------------+-----------+------------+-----------+----------+-----+----------+ \| 7.0\| 3.2\| 4.7\| 1.4\|versicolor\| 0.0\| virginica\| \| 6.4\| 3.2\| 4.5\| 1.5\|versicolor\| 0.0\| virginica\| \| 6.9\| 3.1\| 4.9\| 1.5\|versicolor\| 0.0\| virginica\| \| 5.5\| 2.3\| 4.0\| 1.3\|versicolor\| 0.0\|versicolor\| \| 6.5\| 2.8\| 4.6\| 1.5\|versicolor\| 0.0\| virginica\| \| 5.7\| 2.8\| 4.5\| 1.3\|versicolor\| 0.0\|versicolor\| \| 6.3\| 3.3\| 4.7\| 1.6\|versicolor\| 0.0\| virginica\| \| 4.9\| 2.4\| 3.3\| 1.0\|versicolor\| 0.0\|versicolor\| \| 6.6\| 2.9\| 4.6\| 1.3\|versicolor\| 0.0\| virginica\| \| 5.2\| 2.7\| 3.9\| 1.4\|versicolor\| 0.0\|versicolor\| \| 5.0\| 2.0\| 3.5\| 1.0\|versicolor\| 0.0\|versicolor\| \| 5.9\| 3.0\| 4.2\| 1.5\|versicolor\| 0.0\|versicolor\| \| 6.0\| 2.2\| 4.0\| 1.0\|versicolor\| 0.0\|versicolor\| \| 6.1\| 2.9\| 4.7\| 1.4\|versicolor\| 0.0\|versicolor\| \| 5.6\| 2.9\| 3.6\| 1.3\|versicolor\| 0.0\|versicolor\| \| 6.7\| 3.1\| 4.4\| 1.4\|versicolor\| 0.0\| virginica\| \| 5.6\| 3.0\| 4.5\| 1.5\|versicolor\| 0.0\|versicolor\| \| 5.8\| 2.7\| 4.1\| 1.0\|versicolor\| 0.0\|versicolor\| \| 6.2\| 2.2\| 4.5\| 1.5\|versicolor\| 0.0\|versicolor\| \| 5.6\| 2.5\| 3.9\| 1.1\|versicolor\| 0.0\|versicolor\| +------------+-----------+------------+-----------+----------+-----+----------+ ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #15788 from yanboliang/spark-18291.
*	[SPARK-18125][SQL] Fix a compilation error in codegen due to splitExpression	Liang-Chi Hsieh	2016-11-07	2	-6/+58
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? As reported in the jira, sometimes the generated java code in codegen will cause compilation error. Code snippet to test it: case class Route(src: String, dest: String, cost: Int) case class GroupedRoutes(src: String, dest: String, routes: Seq[Route]) val ds = sc.parallelize(Array( Route("a", "b", 1), Route("a", "b", 2), Route("a", "c", 2), Route("a", "d", 10), Route("b", "a", 1), Route("b", "a", 5), Route("b", "c", 6)) ).toDF.as[Route] val grped = ds.map(r => GroupedRoutes(r.src, r.dest, Seq(r))) .groupByKey(r => (r.src, r.dest)) .reduceGroups { (g1: GroupedRoutes, g2: GroupedRoutes) => GroupedRoutes(g1.src, g1.dest, g1.routes ++ g2.routes) }.map(_._2) The problem here is, in `ReferenceToExpressions` we evaluate the children vars to local variables. Then the result expression is evaluated to use those children variables. In the above case, the result expression code is too long and will be split by `CodegenContext.splitExpression`. So those local variables cannot be accessed and cause compilation error. ## How was this patch tested? Jenkins tests. Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #15693 from viirya/fix-codege-compilation-error.
*	[SPARK-16904][SQL] Removal of Hive Built-in Hash Functions and ↵	gatorsmile	2016-11-07	3	-50/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TestHiveFunctionRegistry ### What changes were proposed in this pull request? Currently, the Hive built-in `hash` function is not being used in Spark since Spark 2.0. The public interface does not allow users to unregister the Spark built-in functions. Thus, users will never use Hive's built-in `hash` function. The only exception here is `TestHiveFunctionRegistry`, which allows users to unregister the built-in functions. Thus, we can load Hive's hash function in the test cases. If we disable it, 10+ test cases will fail because the results are different from the Hive golden answer files. This PR is to remove `hash` from the list of `hiveFunctions` in `HiveSessionCatalog`. It will also remove `TestHiveFunctionRegistry`. This removal makes us easier to remove `TestHiveSessionState` in the future. ### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #14498 from gatorsmile/removeHash.
*	[SPARK-18296][SQL] Use consistent naming for expression test suites	Reynold Xin	2016-11-06	6	-9/+8
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We have an undocumented naming convention to call expression unit tests ExpressionsSuite, and the end-to-end tests FunctionsSuite. It'd be great to make all test suites consistent with this naming convention. ## How was this patch tested? This is a test-only naming change. Author: Reynold Xin <rxin@databricks.com> Closes #15793 from rxin/SPARK-18296.
*	[SPARK-18167][SQL] Disable flaky hive partition pruning test.	Reynold Xin	2016-11-06	1	-1/+1
\|
*	[SPARK-18173][SQL] data source tables should support truncating partition	Wenchen Fan	2016-11-06	5	-17/+146
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Previously `TRUNCATE TABLE ... PARTITION` will always truncate the whole table for data source tables, this PR fixes it and improve `InMemoryCatalog` to make this command work with it. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #15688 from cloud-fan/truncate.
*	[SPARK-18269][SQL] CSV datasource should read null properly when schema is ↵	hyukjinkwon	2016-11-06	4	-45/+81
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	lager than parsed tokens ## What changes were proposed in this pull request? Currently, there are the three cases when reading CSV by datasource when it is `PERMISSIVE` parse mode. - schema == parsed tokens (from each line) No problem to cast the value in the tokens to the field in the schema as they are equal. - schema < parsed tokens (from each line) It slices the tokens into the number of fields in schema. - schema > parsed tokens (from each line) It appends `null` into parsed tokens so that safely values can be casted with the schema. However, when `null` is appended in the third case, we should take `null` into account when casting the values. In case of `StringType`, it is fine as `UTF8String.fromString(datum)` produces `null` when the input is `null`. Therefore, this case will happen only when schema is explicitly given and schema includes data types that are not `StringType`. The codes below: ```scala val path = "/tmp/a" Seq("1").toDF().write.text(path.getAbsolutePath) val schema = StructType( StructField("a", IntegerType, true) :: StructField("b", IntegerType, true) :: Nil) spark.read.schema(schema).option("header", "false").csv(path).show() ``` prints Before ``` java.lang.NumberFormatException: null at java.lang.Integer.parseInt(Integer.java:542) at java.lang.Integer.parseInt(Integer.java:615) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272) at scala.collection.immutable.StringOps.toInt(StringOps.scala:29) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:24) ``` After ``` +---+----+ \| a\| b\| +---+----+ \| 1\|null\| +---+----+ ``` ## How was this patch tested? Unit test in `CSVSuite.scala` and `CSVTypeCastSuite.scala` Author: hyukjinkwon <gurwls223@gmail.com> Closes #15767 from HyukjinKwon/SPARK-18269.
*	[SPARK-18210][ML] Pipeline.copy does not create an instance with the same UID	Wojciech Szymanski	2016-11-06	2	-3/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Motivation: `org.apache.spark.ml.Pipeline.copy(extra: ParamMap)` does not create an instance with the same UID. It does not conform to the method specification from its base class `org.apache.spark.ml.param.Params.copy(extra: ParamMap)` Solution: - fix for Pipeline UID - introduced new tests for `org.apache.spark.ml.Pipeline.copy` - minor improvements in test for `org.apache.spark.ml.PipelineModel.copy` ## How was this patch tested? Introduced new unit test: `org.apache.spark.ml.PipelineSuite."Pipeline.copy"` Improved existing unit test: `org.apache.spark.ml.PipelineSuite."PipelineModel.copy"` Author: Wojciech Szymanski <wk.szymanski@gmail.com> Closes #15759 from wojtek-szymanski/SPARK-18210.
*	[SPARK-17854][SQL] rand/randn allows null/long as input seed	hyukjinkwon	2016-11-06	4	-22/+135
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR proposes `rand`/`randn` accept `null` as input in Scala/SQL and `LongType` as input in SQL. In this case, it treats the values as `0`. So, this PR includes both changes below: - `null` support It seems MySQL also accepts this. ``` sql mysql> select rand(0); +---------------------+ \| rand(0) \| +---------------------+ \| 0.15522042769493574 \| +---------------------+ 1 row in set (0.00 sec) mysql> select rand(NULL); +---------------------+ \| rand(NULL) \| +---------------------+ \| 0.15522042769493574 \| +---------------------+ 1 row in set (0.00 sec) ``` and also Hive does according to [HIVE-14694](https://issues.apache.org/jira/browse/HIVE-14694) So the codes below: ``` scala spark.range(1).selectExpr("rand(null)").show() ``` prints.. Before ``` Input argument to rand must be an integer literal.;; line 1 pos 0 org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:444) ``` After ``` +-----------------------+ \|rand(CAST(NULL AS INT))\| +-----------------------+ \| 0.13385709732307427\| +-----------------------+ ``` - `LongType` support in SQL. In addition, it make the function allows to take `LongType` consistently within Scala/SQL. In more details, the codes below: ``` scala spark.range(1).select(rand(1), rand(1L)).show() spark.range(1).selectExpr("rand(1)", "rand(1L)").show() ``` prints.. Before ``` +------------------+------------------+ \| rand(1)\| rand(1)\| +------------------+------------------+ \|0.2630967864682161\|0.2630967864682161\| +------------------+------------------+ Input argument to rand must be an integer literal.;; line 1 pos 0 org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465) at ``` After ``` +------------------+------------------+ \| rand(1)\| rand(1)\| +------------------+------------------+ \|0.2630967864682161\|0.2630967864682161\| +------------------+------------------+ +------------------+------------------+ \| rand(1)\| rand(1)\| +------------------+------------------+ \|0.2630967864682161\|0.2630967864682161\| +------------------+------------------+ ``` ## How was this patch tested? Unit tests in `DataFrameSuite.scala` and `RandomSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15432 from HyukjinKwon/SPARK-17854.
*	[SPARK-18276][ML] ML models should copy the training summary and set parent	sethah	2016-11-05	12	-20/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Only some of the models which contain a training summary currently set the summaries in the copy method. Linear/Logistic regression do, GLR, GMM, KM, and BKM do not. Additionally, these copy methods did not set the parent pointer of the copied model. This patch modifies the copy methods of the four models mentioned above to copy the training summary and set the parent. ## How was this patch tested? Add unit tests in Linear/Logistic/GeneralizedLinear regression and GaussianMixture/KMeans/BisectingKMeans to check the parent pointer of the copied model and check that the copied model has a summary. Author: sethah <seth.hendrickson16@gmail.com> Closes #15773 from sethah/SPARK-18276.
*	[MINOR][DOCUMENTATION] Fix some minor descriptions in functions consistently ↵	hyukjinkwon	2016-11-05	3	-36/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	with expressions ## What changes were proposed in this pull request? This PR proposes to improve documentation and fix some descriptions equivalent to several minor fixes identified in https://github.com/apache/spark/pull/15677 Also, this suggests to change `Note:` and `NOTE:` to `.. note::` consistently with the others which marks up pretty. ## How was this patch tested? Jenkins tests and manually. For PySpark, `Note:` and `NOTE:` to `.. note::` make the document as below: From ![2016-11-04 6 53 35](https://cloud.githubusercontent.com/assets/6477701/20002648/42989922-a2c5-11e6-8a32-b73eda49e8c3.png) ![2016-11-04 6 53 45](https://cloud.githubusercontent.com/assets/6477701/20002650/429fb310-a2c5-11e6-926b-e030d7eb0185.png) ![2016-11-04 6 54 11](https://cloud.githubusercontent.com/assets/6477701/20002649/429d570a-a2c5-11e6-9e7e-44090f337e32.png) ![2016-11-04 6 53 51](https://cloud.githubusercontent.com/assets/6477701/20002647/4297fc74-a2c5-11e6-801a-b89fbcbfca44.png) ![2016-11-04 6 53 51](https://cloud.githubusercontent.com/assets/6477701/20002697/749f5780-a2c5-11e6-835f-022e1f2f82e3.png) To ![2016-11-04 7 03 48](https://cloud.githubusercontent.com/assets/6477701/20002659/4961b504-a2c5-11e6-9ee0-ef0751482f47.png) ![2016-11-04 7 04 03](https://cloud.githubusercontent.com/assets/6477701/20002660/49871d3a-a2c5-11e6-85ea-d9a5d11efeff.png) ![2016-11-04 7 04 28](https://cloud.githubusercontent.com/assets/6477701/20002662/498e0f14-a2c5-11e6-803d-c0c5aeda4153.png) ![2016-11-04 7 33 39](https://cloud.githubusercontent.com/assets/6477701/20002731/a76e30d2-a2c5-11e6-993b-0481b8342d6b.png) ![2016-11-04 7 33 39](https://cloud.githubusercontent.com/assets/6477701/20002731/a76e30d2-a2c5-11e6-993b-0481b8342d6b.png) Author: hyukjinkwon <gurwls223@gmail.com> Closes #15765 from HyukjinKwon/minor-function-doc.
*	[SPARK-17964][SPARKR] Enable SparkR with Mesos client mode and cluster mode	Susan X. Huynh	2016-11-05	2	-8/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Enabled SparkR with Mesos client mode and cluster mode. Just a few changes were required to get this working on Mesos: (1) removed the SparkR on Mesos error checks and (2) do not require "--class" to be specified for R apps. The logic to check spark.mesos.executor.home was already in there. sun-rui ## How was this patch tested? 1. SparkSubmitSuite 2. On local mesos cluster (on laptop): ran SparkR shell, spark-submit client mode, and spark-submit cluster mode, with the "examples/src/main/R/dataframe.R" example application. 3. On multi-node mesos cluster: ran SparkR shell, spark-submit client mode, and spark-submit cluster mode, with the "examples/src/main/R/dataframe.R" example application. I tested with the following --conf values set: spark.mesos.executor.docker.image and spark.mesos.executor.home This contribution is my original work and I license the work to the project under the project's open source license. Author: Susan X. Huynh <xhuynh@mesosphere.com> Closes #15700 from susanxhuynh/susan-r-branch.
*	[SPARK-17849][SQL] Fix NPE problem when using grouping sets	wangyang	2016-11-05	3	-2/+66
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Prior this pr, the following code would cause an NPE: `case class point(a:String, b:String, c:String, d: Int)` `val data = Seq( point("1","2","3", 1), point("4","5","6", 1), point("7","8","9", 1) )` `sc.parallelize(data).toDF().registerTempTable("table")` `spark.sql("select a, b, c, count(d) from table group by a, b, c GROUPING SETS ((a)) ").show()` The reason is that when the grouping_id() behavior was changed in #10677, some code (which should be changed) was left out. Take the above code for example, prior #10677, the bit mask for set "(a)" was `001`, while after #10677 the bit mask was changed to `011`. However, the `nonNullBitmask` was not changed accordingly. This pr will fix this problem. ## How was this patch tested? add integration tests Author: wangyang <wangyang@haizhi.com> Closes #15416 from yangw1234/groupingid.
*	[SPARK-18192][MINOR][FOLLOWUP] Missed json test in FileStreamSinkSuite	hyukjinkwon	2016-11-05	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR proposes to fix ```diff test("FileStreamSink - json") { - testFormat(Some("text")) + testFormat(Some("json")) } ``` `text` is being tested above ``` test("FileStreamSink - text") { testFormat(Some("text")) } ``` ## How was this patch tested? Fixed test in `FileStreamSinkSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15785 from HyukjinKwon/SPARK-18192.
*	[SPARK-18287][SQL] Move hash expressions from misc.scala into hash.scala	Reynold Xin	2016-11-05	4	-880/+932
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? As the title suggests, this patch moves hash expressions from misc.scala into hash.scala, to make it easier to find the hash functions. I wanted to do this a while ago but decided to wait for the branch-2.1 cut so the chance of conflicts will be smaller. ## How was this patch tested? Test cases were also moved out of MiscFunctionsSuite into HashExpressionsSuite. Author: Reynold Xin <rxin@databricks.com> Closes #15784 from rxin/SPARK-18287.
*	[SPARK-17183][SPARK-17983][SPARK-18101][SQL] put hive serde table schema to ↵	Wenchen Fan	2016-11-05	17	-97/+245
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	table properties like data source table ## What changes were proposed in this pull request? For data source tables, we will put its table schema, partition columns, etc. to table properties, to work around some hive metastore issues, e.g. not case-preserving, bad decimal type support, etc. We should also do this for hive serde tables, to reduce the difference between hive serde tables and data source tables, e.g. column names should be case preserving. ## How was this patch tested? existing tests, and a new test in `HiveExternalCatalog` Author: Wenchen Fan <wenchen@databricks.com> Closes #14750 from cloud-fan/minor1.
*	[SPARK-18260] Make from_json null safe	Burak Yavuz	2016-11-05	2	-1/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `from_json` is currently not safe against `null` rows. This PR adds a fix and a regression test for it. ## How was this patch tested? Regression test Author: Burak Yavuz <brkyvz@gmail.com> Closes #15771 from brkyvz/json_fix.
*	[SPARK-17710][FOLLOW UP] Add comments to state why 'Utils.classForName' is ↵	Weiqing Yang	2016-11-04	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	not used ## What changes were proposed in this pull request? Add comments. ## How was this patch tested? Build passed. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #15776 from weiqingy/SPARK-17710.
*	[SPARK-18189] [SQL] [Followup] Move test from ReplSuite to prevent ↵	Reynold Xin	2016-11-04	2	-17/+12
\| \| \| \| \| \|	java.lang.ClassCircularityError closes #15774
*	[SPARK-18256] Improve the performance of event log replay in HistoryServer	Josh Rosen	2016-11-04	2	-42/+70
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch significantly improves the performance of event log replay in the HistoryServer via two simple changes: - Don't use `extractOpt`: it turns out that `json4s`'s `extractOpt` method uses exceptions for control flow, causing huge performance bottlenecks due to the overhead of initializing exceptions. To avoid this overhead, we can simply use our own` Utils.jsonOption` method. This patch replaces all uses of `extractOpt` with `Utils.jsonOption` and adds a style checker rule to ban the use of the slow `extractOpt` method. - Don't call `Utils.getFormattedClassName` for every event: the old code called` Utils.getFormattedClassName` dozens of times per replayed event in order to match up class names in events with SparkListener event names. By simply storing the results of these calls in constants rather than recomputing them, we're able to eliminate a huge performance hotspot by removing thousands of expensive `Class.getSimpleName` calls. ## How was this patch tested? Tested by profiling the replay of a long event log using YourKit. For an event log containing 1000+ jobs, each of which had thousands of tasks, the changes in this patch cut the replay time in half: ![image](https://cloud.githubusercontent.com/assets/50748/19980953/31154622-a1bd-11e6-9be4-21fbb9b3f9a7.png) Prior to this patch's changes, the two slowest methods in log replay were internal exceptions thrown by `Json4S` and calls to `Class.getSimpleName()`: ![image](https://cloud.githubusercontent.com/assets/50748/19981052/87416cce-a1bd-11e6-9f25-06a7cd391822.png) After this patch, these hotspots are completely eliminated. Author: Josh Rosen <joshrosen@databricks.com> Closes #15756 from JoshRosen/speed-up-jsonprotocol.
*	[SPARK-18167] Re-enable the non-flaky parts of SQLQuerySuite	Eric Liang	2016-11-04	1	-21/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? It seems the proximate cause of the test failures is that `cast(str as decimal)` in derby will raise an exception instead of returning NULL. This is a problem since Hive sometimes inserts `__HIVE_DEFAULT_PARTITION__` entries into the partition table as documented here: https://github.com/apache/hive/blob/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L1034 Basically, when these special default partitions are present, partition pruning pushdown using the SQL-direct mode will fail due this cast exception. As commented on in `MetaStoreDirectSql.java` above, this is normally fine since Hive falls back to JDO pruning, however when the pruning predicate contains an unsupported operator such as `>`, that will fail as well. The only remaining question is why this behavior is nondeterministic. We know that when the test flakes, retries do not help, therefore the cause must be environmental. The current best hypothesis is that some config is different between different jenkins runs, which is why this PR prints out the Spark SQL and Hive confs for the test. The hope is that by comparing the config state for failure vs success we can isolate the root cause of the flakiness. Update: we could not isolate the issue. It does not seem to be due to configuration differences. As such, I'm going to enable the non-flaky parts of the test since we are fairly confident these issues only occur with Derby (which is not used in production). ## How was this patch tested? N/A Author: Eric Liang <ekl@databricks.com> Closes #15725 from ericl/print-confs-out.
*	[SPARK-17337][SQL] Do not pushdown predicates through filters with ↵	Herman van Hovell	2016-11-04	2	-5/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	predicate subqueries ## What changes were proposed in this pull request? The `PushDownPredicate` rule can create a wrong result if we try to push a filter containing a predicate subquery through a project when the subquery and the project share attributes (have the same source). The current PR fixes this by making sure that we do not push down when there is a predicate subquery that outputs the same attributes as the filters new child plan. ## How was this patch tested? Added a test to `SubquerySuite`. nsyca has done previous work this. I have taken test from his initial PR. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15761 from hvanhovell/SPARK-17337.
*	[SPARK-18197][CORE] Optimise AppendOnlyMap implementation	Adam Roberts	2016-11-04	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This improvement works by using the fastest comparison test first and we observed a 1% throughput performance improvement on PageRank (HiBench large profile) with this change. We used tprof and before the change in AppendOnlyMap.changeValue (where the optimisation occurs) this method was being used for 8053 profiling ticks representing 0.72% of the overall application time. After this change we observed this method only occurring for 2786 ticks and for 0.25% of the overall time. ## How was this patch tested? Existing unit tests and for performance we used HiBench large, profiling with tprof and IBM Healthcenter. Author: Adam Roberts <aroberts@uk.ibm.com> Closes #15714 from a-roberts/patch-9.
*	Closing some stale/invalid pull requests	Reynold Xin	2016-11-04	0	-0/+0
\| \| \| \| \| \|	Closes #15758 Closes #15753 Closes #12708
*	[SPARK-18200][GRAPHX][FOLLOW-UP] Support zero as an initial capacity in ↵	Dongjoon Hyun	2016-11-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	OpenHashSet ## What changes were proposed in this pull request? This is a follow-up PR of #15741 in order to keep `nextPowerOf2` consistent. Before ``` nextPowerOf2(0) => 2 nextPowerOf2(1) => 1 nextPowerOf2(2) => 2 nextPowerOf2(3) => 4 nextPowerOf2(4) => 4 nextPowerOf2(5) => 8 ``` After ``` nextPowerOf2(0) => 1 nextPowerOf2(1) => 1 nextPowerOf2(2) => 2 nextPowerOf2(3) => 4 nextPowerOf2(4) => 4 nextPowerOf2(5) => 8 ``` ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15754 from dongjoon-hyun/SPARK-18200-2.
*	[SPARK-14393][SQL][DOC] update doc for python and R	Felix Cheung	2016-11-03	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? minor doc update that should go to master & branch-2.1 ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15747 from felixcheung/pySPARK-14393.
*	[SPARK-18259][SQL] Do not capture Throwable in QueryExecution	Herman van Hovell	2016-11-03	2	-1/+51
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `QueryExecution.toString` currently captures `java.lang.Throwable`s; this is far from a best practice and can lead to confusing situation or invalid application states. This PR fixes this by only capturing `AnalysisException`s. ## How was this patch tested? Added a `QueryExecutionSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15760 from hvanhovell/SPARK-18259.