Commit log: each entry lists the commit message, author, date, files changed, and lines changed (-/+).
* [SPARK-17274][SQL] Move join optimizer rules into a separate file (Reynold Xin, 2016-08-27, 2 files changed, -106/+134)
  ## What changes were proposed in this pull request?
  As part of breaking Optimizer.scala apart, this patch moves various join rules into a single file.
  ## How was this patch tested?
  This should be covered by existing tests.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #14846 from rxin/SPARK-17274.
* [SPARK-17273][SQL] Move expression optimizer rules into a separate file (Reynold Xin, 2016-08-27, 2 files changed, -460/+507)
  ## What changes were proposed in this pull request?
  As part of breaking Optimizer.scala apart, this patch moves various expression optimization rules into a single file.
  ## How was this patch tested?
  This should be covered by existing tests.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #14845 from rxin/SPARK-17273.
* [SPARK-17272][SQL] Move subquery optimizer rules into its own file (Reynold Xin, 2016-08-27, 2 files changed, -323/+356)
  ## What changes were proposed in this pull request?
  As part of breaking Optimizer.scala apart, this patch moves various subquery rules into a single file.
  ## How was this patch tested?
  This should be covered by existing tests.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #14844 from rxin/SPARK-17272.
* [SPARK-17269][SQL] Move finish analysis optimization stage into its own file (Reynold Xin, 2016-08-26, 3 files changed, -39/+66)
  ## What changes were proposed in this pull request?
  As part of breaking Optimizer.scala apart, this patch moves various finish analysis optimization stage rules into a single file. I'm submitting separate pull requests so we can more easily merge this in branch-2.0 to simplify optimizer backports.
  ## How was this patch tested?
  This should be covered by existing tests.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #14838 from rxin/SPARK-17269.
* [SPARK-17270][SQL] Move object optimization rules into its own file (Reynold Xin, 2016-08-26, 2 files changed, -71/+98)
  ## What changes were proposed in this pull request?
  As part of breaking Optimizer.scala apart, this patch moves various Dataset object optimization rules into a single file. I'm submitting separate pull requests so we can more easily merge this in branch-2.0 to simplify optimizer backports.
  ## How was this patch tested?
  This should be covered by existing tests.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #14839 from rxin/SPARK-17270.
* [SPARK-17266][TEST] Add empty strings to the regressionTests of PrefixComparatorsSuite (Yin Huai, 2016-08-26, 1 file changed, -1/+2)
  ## What changes were proposed in this pull request?
  This PR adds a regression test to PrefixComparatorsSuite's "String prefix comparator" because this test failed on Jenkins once (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/). I could not reproduce it locally, but let's add this test case to the regressionTests.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #14837 from yhuai/SPARK-17266.
* [SPARK-17244] Catalyst should not pushdown non-deterministic join conditions (Sameer Agarwal, 2016-08-26, 2 files changed, -7/+28)
  ## What changes were proposed in this pull request?
  Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling such pushdown.
  ## How was this patch tested?
  A new test in `FilterPushdownSuite` that checks catalyst behavior for both deterministic and non-deterministic join conditions.
  Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
  Closes #14815 from sameeragarwal/constraint-inputfile.
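To illustrate the behavior this change governs, here is a minimal sketch (DataFrame names and values are assumptions for illustration, not taken from the patch):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

val spark = SparkSession.builder().master("local[*]").appName("nondeterministic-join").getOrCreate()
import spark.implicits._

val left = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((1, "x"), (2, "y")).toDF("id", "r")

// rand() is non-deterministic (stateful), so the optimizer should keep this
// predicate at the join rather than pushing it below the join inputs.
val joined = left.join(right, left("id") === right("id") && rand() > 0.5)
joined.explain(true)
```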
* [SPARK-17235][SQL] Support purging of old logs in MetadataLog (petermaxlee, 2016-08-26, 3 files changed, -4/+43)
  ## What changes were proposed in this pull request?
  This patch adds a purge interface to MetadataLog, and an implementation in HDFSMetadataLog. The purge function is currently unused, but I will use it to purge old execution and file source logs in follow-up patches. These changes are required in a production structured streaming job that runs for a long period of time.
  ## How was this patch tested?
  Added a unit test case in HDFSMetadataLogSuite.
  Author: petermaxlee <petermaxlee@gmail.com>
  Closes #14802 from petermaxlee/SPARK-17235.
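A rough sketch of the kind of purge hook described above; the trait here is trimmed down, and the method name and signature are assumptions based on the description rather than the actual Spark interface:

```scala
// Minimal sketch, not the real org.apache.spark.sql.execution.streaming.MetadataLog.
trait SimpleMetadataLog[T] {
  def add(batchId: Long, metadata: T): Boolean
  def get(batchId: Long): Option[T]
  // Remove metadata for all batches older than the given threshold, so that
  // long-running streaming jobs do not accumulate log entries forever.
  def purge(thresholdBatchId: Long): Unit
}
```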
* [SPARK-17246][SQL] Add BigDecimal literal (Herman van Hovell, 2016-08-26, 7 files changed, -3/+59)
  ## What changes were proposed in this pull request?
  This PR adds parser support for `BigDecimal` literals. If you append the suffix `BD` to a valid number, it will be interpreted as a `BigDecimal`; for example, `12.0E10BD` will be interpreted as a BigDecimal with scale -9 and precision 3. This is useful in situations where you need exact values.
  ## How was this patch tested?
  Added tests to `ExpressionParserSuite`, `ExpressionSQLBuilderSuite` and `SQLQueryTestSuite`.
  Author: Herman van Hovell <hvanhovell@databricks.com>
  Closes #14819 from hvanhovell/SPARK-17246.
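A small usage sketch of the `BD` suffix described in the commit above, assuming an active SparkSession named `spark`:

```scala
// The BD suffix marks the number as an exact BigDecimal rather than a DOUBLE.
val df = spark.sql("SELECT 1.5BD AS exact_value, 12.0E10BD AS large_value")
df.printSchema()  // both columns should be reported as decimal types
```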
* [SPARK-16967] move mesos to module (Michael Gummelt, 2016-08-26, 43 files changed, -118/+305)
  ## What changes were proposed in this pull request?
  Move Mesos code into a mvn module.
  ## How was this patch tested?
  - unit tests
  - manually submitting a client mode and cluster mode job
  - spark/mesos integration test suite
  Author: Michael Gummelt <mgummelt@mesosphere.io>
  Closes #14637 from mgummelt/mesos-module.
* [SPARK-17207][MLLIB] fix comparing Vector bug in TestingUtils (Peng, Meng, 2016-08-26, 5 files changed, -16/+566)
  ## What changes were proposed in this pull request?
  Fix a bug when comparing Vectors in TestingUtils. The same bug exists for Matrix comparison; how to check the length of a Matrix should be discussed first.
  ## How was this patch tested?
  (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
  Author: Peng, Meng <peng.meng@intel.com>
  Closes #14785 from mpjlu/testUtils.
* [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely (petermaxlee, 2016-08-26, 5 files changed, -36/+285)
  ## What changes were proposed in this pull request?
  Before this change, FileStreamSource uses an in-memory hash set to track the list of files processed by the engine. The list can grow indefinitely, leading to OOM or overflow of the hash set. This patch introduces a new user-defined option called "maxFileAge", defaulting to 24 hours. If a file is older than this age, FileStreamSource will purge it from the in-memory map that was used to track the list of files that have been processed.
  ## How was this patch tested?
  Added unit tests for the underlying utility, and also added an end-to-end test to validate the purge in FileStreamSourceSuite. Also verified the new test cases would fail when the timeout was set to a very large number.
  Author: petermaxlee <petermaxlee@gmail.com>
  Closes #14728 from petermaxlee/SPARK-17165.
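A usage sketch for the option described in the commit above; the option name comes from the description, while the value format, path, and the SparkSession named `spark` are assumptions:

```scala
val stream = spark.readStream
  .format("text")
  .option("maxFileAge", "24h")  // files older than this are dropped from the seen-files map
  .load("/data/incoming")
```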
* [SPARK-17250][SQL] Remove HiveClient and setCurrentDatabase from HiveSessionCatalog (gatorsmile, 2016-08-26, 2 files changed, -8/+0)
  ### What changes were proposed in this pull request?
  This is the first step to remove `HiveClient` from `HiveSessionState`. In the metastore interaction, we always use the fully qualified table name when accessing/operating a table. That means we always specify the database. Thus, it is not necessary to use `HiveClient` to change the active database in the Hive metastore. In `HiveSessionCatalog`, `setCurrentDatabase` is the only function that uses `HiveClient`. Thus, we can remove `HiveClient` there after removing `setCurrentDatabase`.
  ### How was this patch tested?
  The existing test cases.
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #14821 from gatorsmile/setCurrentDB.
* [SPARK-17192][SQL] Issue Exception when Users Specify the Partitioning Columns without a Given Schema (gatorsmile, 2016-08-26, 3 files changed, -29/+29)
  ### What changes were proposed in this pull request?
  Address the comments by yhuai in the original PR: https://github.com/apache/spark/pull/14207
  First, issue an exception instead of logging a warning when users specify the partitioning columns without a given schema. Second, refactor the code a little.
  ### How was this patch tested?
  Fixed the test cases.
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #14572 from gatorsmile/followup16552.
* [SPARKR][MINOR] Fix example of spark.naiveBayes (Junyang Qian, 2016-08-26, 1 file changed, -2/+3)
  ## What changes were proposed in this pull request?
  The original example doesn't work because the features are not categorical. This PR fixes this by changing to another dataset.
  ## How was this patch tested?
  Manual test.
  Author: Junyang Qian <junyangq@databricks.com>
  Closes #14820 from junyangq/SPARK-FixNaiveBayes.
* [SPARK-17187][SQL][FOLLOW-UP] improve document of TypedImperativeAggregate (Wenchen Fan, 2016-08-26, 1 file changed, -40/+61)
  ## What changes were proposed in this pull request?
  Improve the document to make it easier to understand, and also mention the window operator.
  ## How was this patch tested?
  N/A
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #14822 from cloud-fan/object-agg.
* [SPARK-17260][MINOR] move CreateTables to HiveStrategies (Wenchen Fan, 2016-08-26, 4 files changed, -37/+27)
  ## What changes were proposed in this pull request?
  The `CreateTables` rule turns a general `CreateTable` plan into `CreateHiveTableAsSelectCommand` for Hive serde tables. However, this rule is logically a planner strategy, so we should move it to `HiveStrategies`, to be consistent with other DDL commands.
  ## How was this patch tested?
  existing tests.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #14825 from cloud-fan/ctas.
* [SPARK-16216][SQL][FOLLOWUP] Enable timestamp type tests for JSON and verify all unsupported types in CSV (hyukjinkwon, 2016-08-26, 4 files changed, -12/+26)
  ## What changes were proposed in this pull request?
  This PR enables the tests for `TimestampType` for JSON and unifies the logic for verifying the schema when writing in CSV. In more detail, this PR:
  - Enables the tests for `TimestampType` for JSON. These were disabled due to an issue in `DatatypeConverter.parseDateTime` which parses dates incorrectly, for example as below:
    ```scala
    val d = javax.xml.bind.DatatypeConverter.parseDateTime("0900-01-01T00:00:00.000").getTime
    println(d.toString)
    ```
    ```
    Fri Dec 28 00:00:00 KST 899
    ```
    However, since we use `FastDateFormat`, it seems we are safe now.
    ```scala
    val d = FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSS").parse("0900-01-01T00:00:00.000")
    println(d)
    ```
    ```
    Tue Jan 01 00:00:00 PST 900
    ```
  - Verifies all unsupported types in CSV. There is separate logic to verify the schemas in `CSVFileFormat`, which was not quite correct because we don't support `NullType` and `CalendarIntervalType` as well as `StructType`, `ArrayType`, `MapType`. So, this PR adds both types.
  ## How was this patch tested?
  Tests in `JsonHadoopFsRelation` and `CSVSuite`.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #14829 from HyukjinKwon/SPARK-16216-followup.
* [SPARK-17242][DOCUMENT] Update links of external dstream projects (Shixiong Zhu, 2016-08-25, 1 file changed, -6/+2)
  ## What changes were proposed in this pull request?
  Updated links of external dstream projects.
  ## How was this patch tested?
  Just document changes.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #14814 from zsxwing/dstream-link.
* [SPARK-17212][SQL] TypeCoercion supports widening conversion between DateType and TimestampType (hyukjinkwon, 2016-08-26, 2 files changed, -0/+4)
  ## What changes were proposed in this pull request?
  Currently, type-widening does not work between `TimestampType` and `DateType`. This applies to `SetOperation`, `Union`, `In`, `CaseWhen`, `Greatest`, `Least`, `CreateArray`, `CreateMap`, `Coalesce`, `NullIf`, `IfNull`, `Nvl` and `Nvl2`. This PR adds the support for widening `DateType` to `TimestampType` for them.
  For a simple example,
  **Before**
  ```scala
  Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", "b").selectExpr("greatest(a, b)").show()
  ```
  shows below:
  ```
  cannot resolve 'greatest(`a`, `b`)' due to data type mismatch: The expressions should all have the same type, got GREATEST(timestamp, date)
  ```
  or union as below:
  ```scala
  val a = Seq(Tuple1(new Timestamp(0))).toDF()
  val b = Seq(Tuple1(new Date(0))).toDF()
  a.union(b).show()
  ```
  shows below:
  ```
  Union can only be performed on tables with the compatible column types. DateType <> TimestampType at the first column of the second table;
  ```
  **After**
  ```scala
  Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", "b").selectExpr("greatest(a, b)").show()
  ```
  shows below:
  ```
  +----------------------------------------------------+
  |greatest(CAST(a AS TIMESTAMP), CAST(b AS TIMESTAMP))|
  +----------------------------------------------------+
  |                                 1969-12-31 16:00:...|
  +----------------------------------------------------+
  ```
  or union as below:
  ```scala
  val a = Seq(Tuple1(new Timestamp(0))).toDF()
  val b = Seq(Tuple1(new Date(0))).toDF()
  a.union(b).show()
  ```
  shows below:
  ```
  +--------------------+
  |                  _1|
  +--------------------+
  |1969-12-31 16:00:...|
  |1969-12-31 00:00:...|
  +--------------------+
  ```
  ## How was this patch tested?
  Unit tests in `TypeCoercionSuite`.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Author: HyukjinKwon <gurwls223@gmail.com>
  Closes #14786 from HyukjinKwon/SPARK-17212.
* [SPARK-17187][SQL] Supports using arbitrary Java object as internal aggregation buffer object (Sean Zhong, 2016-08-25, 3 files changed, -0/+456)
  ## What changes were proposed in this pull request?
  This PR introduces an abstract class `TypedImperativeAggregate` so that an aggregation function of TypedImperativeAggregate can use **arbitrary** user-defined Java objects as intermediate aggregation buffer objects.
  **This has advantages like:**
  1. It can now support a larger category of aggregation functions. For example, it will be much easier to implement the aggregation function `percentile_approx`, which has a complex aggregation buffer definition.
  2. It can be used to avoid doing serialization/de-serialization for every call of `update` or `merge` when converting a domain-specific aggregation object to the internal Spark SQL storage format.
  3. It is easier to integrate with other existing monoid libraries like algebird, and supports more aggregation functions with high performance.
  Please see `org.apache.spark.sql.TypedImperativeAggregateSuite.TypedMaxAggregate` for an example of how to define a `TypedImperativeAggregate` aggregation function. Please see the Java doc of `TypedImperativeAggregate` and Jira ticket SPARK-17187 for more information.
  ## How was this patch tested?
  Unit tests.
  Author: Sean Zhong <seanzhong@databricks.com>
  Author: Yin Huai <yhuai@databricks.com>
  Closes #14753 from clockfly/object_aggregation_buffer_try_2.
* [SPARK-17240][CORE] Make SparkConf serializable again. (Marcelo Vanzin, 2016-08-25, 2 files changed, -5/+28)
  Make the config reader transient, and initialize it lazily so that serialization works with both java and kryo (and hopefully any other custom serializer).
  Added unit test to make sure SparkConf remains serializable and the reader works with both built-in serializers.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #14813 from vanzin/SPARK-17240.
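The general Scala pattern behind this fix, sketched with made-up class names (not the actual SparkConf code):

```scala
class SettingsReader(settings: Map[String, String]) {
  def get(key: String): Option[String] = settings.get(key)
}

class MyConf(settings: Map[String, String]) extends Serializable {
  // Marked transient so serializers skip it, and lazy so it is rebuilt
  // from the serialized settings on first use after deserialization.
  @transient private lazy val reader = new SettingsReader(settings)

  def get(key: String): Option[String] = reader.get(key)
}
```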
* [SPARK-17205] Literal.sql should handle Infinity and NaN (Josh Rosen, 2016-08-26, 2 files changed, -2/+21)
  This patch updates `Literal.sql` to properly generate SQL for `NaN` and `Infinity` float and double literals: these special values need to be handled differently from regular values, since simply appending a suffix to the value's `toString()` representation will not work for these values.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #14777 from JoshRosen/SPARK-17205.
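The description does not show the exact SQL the patch emits; casting from a string literal is one standard way such special values can be written, and can be checked as below (assuming a SparkSession named `spark`):

```scala
// 'NaN' and 'Infinity' are recognized when cast from string to DOUBLE.
spark.sql("SELECT CAST('NaN' AS DOUBLE) AS nan_val, CAST('Infinity' AS DOUBLE) AS inf_val").show()
```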
* [SPARK-17229][SQL] PostgresDialect shouldn't widen float and short types during reads (Josh Rosen, 2016-08-25, 3 files changed, -5/+28)
  ## What changes were proposed in this pull request?
  When reading float4 and smallint columns from PostgreSQL, Spark's `PostgresDialect` widens these types to Decimal and Integer rather than using the narrower Float and Short types. According to https://www.postgresql.org/docs/7.1/static/datatype.html#DATATYPE-TABLE, Postgres maps the `smallint` type to a signed two-byte integer and the `real` / `float4` types to single precision floating point numbers.
  This patch fixes this by adding more special-cases to `getCatalystType`, similar to what was done for the Derby JDBC dialect. I also fixed a similar problem in the write path which causes Spark to create integer columns in Postgres for what should have been ShortType columns.
  ## How was this patch tested?
  New test cases in `PostgresIntegrationSuite` (which I ran manually because Jenkins can't run it right now).
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #14796 from JoshRosen/postgres-jdbc-type-fixes.
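A simplified sketch of the kind of type special-casing described above (not the actual PostgresDialect code; the helper name is made up):

```scala
import org.apache.spark.sql.types.{DataType, FloatType, ShortType}

// Map narrow Postgres column type names to equally narrow Catalyst types
// instead of widening them to Decimal/Integer.
def narrowPostgresType(typeName: String): Option[DataType] = typeName match {
  case "float4"   => Some(FloatType)  // single-precision floating point
  case "smallint" => Some(ShortType)  // signed two-byte integer
  case _          => None
}
```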
* [SPARKR][BUILD] ignore cran-check.out under R folder (wm624@hotmail.com, 2016-08-25, 1 file changed, -0/+1)
  ## What changes were proposed in this pull request?
  The R CRAN check generates the file cran-check.out. This file should be ignored in git.
  ## How was this patch tested?
  Manual test: ran a clean test and `git status` to make sure the file is not included in git.
  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #14774 from wangmiao1981/ignore.
* [SPARK-17231][CORE] Avoid building debug or trace log messages unless the respective log level is enabled (Michael Allman, 2016-08-25, 9 files changed, -45/+55)
  (This PR addresses https://issues.apache.org/jira/browse/SPARK-17231)
  ## What changes were proposed in this pull request?
  While debugging the performance of a large GraphX connected components computation, we found several places in the `network-common` and `network-shuffle` code bases where trace or debug log messages are constructed even if the respective log level is disabled. According to YourKit, these constructions were creating substantial churn in the eden region. Refactoring the respective code to avoid these unnecessary constructions except where necessary led to a modest but measurable reduction in our job's task time, GC time and the ratio thereof.
  ## How was this patch tested?
  We computed the connected components of a graph with about 2.6 billion vertices and 1.7 billion edges four times. We used four different EC2 clusters each with 8 r3.8xl worker nodes. Two test runs used Spark master. Two used Spark master + this PR.
  The results from the first test run, master and master+PR:
  ![master](https://cloud.githubusercontent.com/assets/833693/17951634/7471cbca-6a18-11e6-9c26-78afe9319685.jpg)
  ![logging_perf_improvements](https://cloud.githubusercontent.com/assets/833693/17951632/7467844e-6a18-11e6-9a0e-053dc7650413.jpg)
  The results from the second test run, master and master+PR:
  ![master 2](https://cloud.githubusercontent.com/assets/833693/17951633/746dd6aa-6a18-11e6-8e27-606680b3f105.jpg)
  ![logging_perf_improvements 2](https://cloud.githubusercontent.com/assets/833693/17951631/74488710-6a18-11e6-8a32-08692f373386.jpg)
  Though modest, I believe these results are significant.
  Author: Michael Allman <michael@videoamp.com>
  Closes #14798 from mallman/spark-17231-logging_perf_improvements.
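The guard pattern described above, sketched in Scala with slf4j (object, method, and parameter names are placeholders):

```scala
import org.slf4j.LoggerFactory

object TraceGuardExample {
  private val logger = LoggerFactory.getLogger(getClass)

  def handle(requestId: Long, payloadSize: Int): Unit = {
    // Build the message only when tracing is actually enabled, so disabled
    // log levels no longer churn the heap with throwaway strings.
    if (logger.isTraceEnabled) {
      logger.trace(s"handling request $requestId with $payloadSize bytes")
    }
  }
}
```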
* [SPARK-16991][SPARK-17099][SPARK-17120][SQL] Fix Outer Join Elimination when Filter's isNotNull Constraints Unable to Filter Out All Null-supplying Rows (gatorsmile, 2016-08-25, 5 files changed, -12/+161)
  ### What changes were proposed in this pull request?
  This PR is to fix an incorrect outer join elimination when a filter's `isNotNull` constraints are unable to filter out all null-supplying rows. For example, `isnotnull(coalesce(b#227, c#238))`.
  Users can hit this error when they try to use `using/natural outer join`, which is converted to a normal outer join with a `coalesce` expression on the `using columns`. For example,
  ```scala
  val a = Seq((1, 2), (2, 3)).toDF("a", "b")
  val b = Seq((2, 5), (3, 4)).toDF("a", "c")
  val c = Seq((3, 1)).toDF("a", "d")
  val ab = a.join(b, Seq("a"), "fullouter")
  ab.join(c, "a").explain(true)
  ```
  The dataframe `ab` is doing a `using full-outer join`, which is converted to a normal outer join with a `coalesce` expression. Constraints inference generates a `Filter` with constraints `isnotnull(coalesce(b#227, c#238))`. Then, it triggers a wrong outer join elimination and generates a wrong result.
  ```
  Project [a#251, b#227, c#237, d#247]
  +- Join Inner, (a#251 = a#246)
     :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
     :  +- Join FullOuter, (a#226 = a#236)
     :     :- Project [_1#223 AS a#226, _2#224 AS b#227]
     :     :  +- LocalRelation [_1#223, _2#224]
     :     +- Project [_1#233 AS a#236, _2#234 AS c#237]
     :        +- LocalRelation [_1#233, _2#234]
     +- Project [_1#243 AS a#246, _2#244 AS d#247]
        +- LocalRelation [_1#243, _2#244]

  == Optimized Logical Plan ==
  Project [a#251, b#227, c#237, d#247]
  +- Join Inner, (a#251 = a#246)
     :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
     :  +- Filter isnotnull(coalesce(a#226, a#236))
     :     +- Join FullOuter, (a#226 = a#236)
     :        :- LocalRelation [a#226, b#227]
     :        +- LocalRelation [a#236, c#237]
     +- LocalRelation [a#246, d#247]
  ```
  **A note to the `Committer`**, please also give the credit to dongjoon-hyun who submitted another PR for fixing this issue. https://github.com/apache/spark/pull/14580
  ### How was this patch tested?
  Added test cases.
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #14661 from gatorsmile/fixOuterJoinElimination.
* [SPARK-12978][SQL] Skip unnecessary final group-by when input data already clustered with group-by keys (Takeshi YAMAMURO, 2016-08-25, 8 files changed, -224/+257)
  This ticket targets the optimization to skip an unnecessary group-by operation below;
  Without opt.:
  ```
  == Physical Plan ==
  TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
  +- TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)], output=[col0#159,sum#200,sum#201,count#202L])
     +- TungstenExchange hashpartitioning(col0#159,200), None
        +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
  ```
  With opt.:
  ```
  == Physical Plan ==
  TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
  +- TungstenExchange hashpartitioning(col0#159,200), None
     +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
  ```
  Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
  Closes #10896 from maropu/SkipGroupbySpike.
* [SPARK-17197][ML][PYSPARK] PySpark LiR/LoR supports tree aggregation level configurable. (Yanbo Liang, 2016-08-25, 4 files changed, -11/+42)
  ## What changes were proposed in this pull request?
  [SPARK-17090](https://issues.apache.org/jira/browse/SPARK-17090) makes the tree aggregation level in LiR/LoR configurable; this PR makes PySpark support this function.
  ## How was this patch tested?
  Since `aggregationDepth` is an expert param, I prefer not to test it in the doctest, which is also used for examples. Here is the offline test result:
  ![image](https://cloud.githubusercontent.com/assets/1962026/17879457/f83d7760-68a6-11e6-9936-d0a884d5d6ec.png)
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #14766 from yanboliang/spark-17197.
* [SPARK-17061][SPARK-17093][SQL] `MapObjects` should make copies of unsafe-backed data (Liwei Lin, 2016-08-25, 3 files changed, -2/+46)
  ## What changes were proposed in this pull request?
  Currently `MapObjects` does not make copies of unsafe-backed data, leading to problems like [SPARK-17061](https://issues.apache.org/jira/browse/SPARK-17061) and [SPARK-17093](https://issues.apache.org/jira/browse/SPARK-17093). This patch makes `MapObjects` make copies of unsafe-backed data.
  Generated code - prior to this patch:
  ```java
  ...
  /* 295 */ if (isNull12) {
  /* 296 */   convertedArray1[loopIndex1] = null;
  /* 297 */ } else {
  /* 298 */   convertedArray1[loopIndex1] = value12;
  /* 299 */ }
  ...
  ```
  Generated code - after this patch:
  ```java
  ...
  /* 295 */ if (isNull12) {
  /* 296 */   convertedArray1[loopIndex1] = null;
  /* 297 */ } else {
  /* 298 */   convertedArray1[loopIndex1] = value12 instanceof UnsafeRow? value12.copy() : value12;
  /* 299 */ }
  ...
  ```
  ## How was this patch tested?
  Added a new test case which would fail without this patch.
  Author: Liwei Lin <lwlin7@gmail.com>
  Closes #14698 from lw-lin/mapobjects-copy.
* [SPARK-17193][CORE] HadoopRDD NPE at DEBUG log level when getLocationInfo == null (Sean Owen, 2016-08-25, 2 files changed, -15/+13)
  ## What changes were proposed in this pull request?
  Handle null from Hadoop getLocationInfo directly instead of catching (and logging) the exception.
  ## How was this patch tested?
  Jenkins tests.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #14760 from srowen/SPARK-17193.
* [SPARK-17215][SQL] Method `SQLContext.parseDataType(dataTypeString: String)` could be removed. (jiangxingbo, 2016-08-24, 7 files changed, -23/+16)
  ## What changes were proposed in this pull request?
  Method `SQLContext.parseDataType(dataTypeString: String)` could be removed; we should use `SparkSession.parseDataType(dataTypeString: String)` instead. This requires updating PySpark.
  ## How was this patch tested?
  Existing test cases.
  Author: jiangxingbo <jiangxb1987@gmail.com>
  Closes #14790 from jiangxb1987/parseDataType.
* [SPARK-17190][SQL] Removal of HiveSharedState (gatorsmile, 2016-08-25, 15 files changed, -99/+88)
  ### What changes were proposed in this pull request?
  Since `HiveClient` is used to interact with the Hive metastore, it should be hidden in `HiveExternalCatalog`. After moving `HiveClient` into `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes straightforward. After removal of `HiveSharedState`, the reflection logic is directly applied on the choice of `ExternalCatalog` types, based on the configuration of `CATALOG_IMPLEMENTATION`.
  ~~`HiveClient` is also used/invoked by the other entities besides HiveExternalCatalog, we defines the following two APIs: getClient and getNewClient~~
  ### How was this patch tested?
  The existing test cases.
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #14757 from gatorsmile/removeHiveClient.
* [SPARK-17228][SQL] Not infer/propagate non-deterministic constraints (Sameer Agarwal, 2016-08-24, 2 files changed, -1/+19)
  ## What changes were proposed in this pull request?
  Given that filters based on non-deterministic constraints shouldn't be pushed down in the query plan, unnecessarily inferring them is confusing and a source of potential bugs. This patch simplifies the inferring logic by simply ignoring them.
  ## How was this patch tested?
  Added a new test in `ConstraintPropagationSuite`.
  Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
  Closes #14795 from sameeragarwal/deterministic-constraints.
* [SPARKR][MINOR] Add installation message for remote master mode and improve other messages (Junyang Qian, 2016-08-24, 3 files changed, -39/+80)
  ## What changes were proposed in this pull request?
  This PR gives an informative message to users when they try to connect to a remote master but don't have the Spark package in their local machine.
  As a clarification, for now, automatic installation will only happen if they start SparkR in the R console (rather than from sparkr-shell) and connect to a local master. In the remote master mode, a local Spark package is still needed, but we will not trigger the install.spark function because the versions have to match those on the cluster, which involves more user input. Instead, we here try to provide a detailed message that may help the users. Some of the other messages have also been slightly changed.
  ## How was this patch tested?
  Manual test.
  Author: Junyang Qian <junyangq@databricks.com>
  Closes #14761 from junyangq/SPARK-16579-V1.
* [SPARKR][MINOR] Add more examples to window function docs (Junyang Qian, 2016-08-24, 2 files changed, -18/+72)
  ## What changes were proposed in this pull request?
  This PR adds more examples to window function docs to make them more accessible to the users. It also fixes default value issues for `lag` and `lead`.
  ## How was this patch tested?
  Manual test, R unit test.
  Author: Junyang Qian <junyangq@databricks.com>
  Closes #14779 from junyangq/SPARKR-FixWindowFunctionDocs.
* [MINOR][SPARKR] fix R MLlib parameter documentation (Felix Cheung, 2016-08-24, 1 file changed, -2/+2)
  ## What changes were proposed in this pull request?
  Fixed several misplaced param tags - they should be on the spark.* method generics.
  ## How was this patch tested?
  Ran knitr. junyangq
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #14792 from felixcheung/rdocmllib.
* [SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and dateFormat/timestampFormat option for CSV and JSON (hyukjinkwon, 2016-08-24, 18 files changed, -90/+454)
  ## What changes were proposed in this pull request?
  ### Default - ISO 8601
  Currently, the CSV datasource writes `Timestamp` and `Date` in numeric form and the JSON datasource writes both as below:
  - CSV
    ```
    // TimestampType
    1414459800000000
    // DateType
    16673
    ```
  - Json
    ```
    // TimestampType
    1970-01-01 11:46:40.0
    // DateType
    1970-01-01
    ```
  So, for CSV we can't read back what we write, and for JSON it becomes ambiguous because the timezone is missing. So, this PR makes both **write** `Timestamp` and `Date` as ISO 8601 formatted strings (please refer to the [ISO 8601 specification](https://www.w3.org/TR/NOTE-datetime)).
  - For `Timestamp` it becomes as below (`yyyy-MM-dd'T'HH:mm:ss.SSSZZ`):
    ```
    1970-01-01T02:00:01.000-01:00
    ```
  - For `Date` it becomes as below (`yyyy-MM-dd`):
    ```
    1970-01-01
    ```
  ### Custom date format option - `dateFormat`
  This PR also adds the support to write and read dates and timestamps in a formatted string as below:
  - **DateType** - With `dateFormat` option (e.g. `yyyy/MM/dd`)
    ```
    +----------+
    |      date|
    +----------+
    |2015/08/26|
    |2014/10/27|
    |2016/01/28|
    +----------+
    ```
  ### Custom date format option - `timestampFormat`
  - **TimestampType** - With `dateFormat` option (e.g. `dd/MM/yyyy HH:mm`)
    ```
    +----------------+
    |            date|
    +----------------+
    |2015/08/26 18:00|
    |2014/10/27 18:30|
    |2016/01/28 20:00|
    +----------------+
    ```
  ## How was this patch tested?
  Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests cover the default cases.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #14279 from HyukjinKwon/SPARK-16216-json-csv.
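A read-side usage sketch of the options described in the commit above; the SparkSession named `spark`, the input path, and the format strings are assumptions:

```scala
// A user-supplied schema with Date/Timestamp columns would normally accompany these options.
val df = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyy/MM/dd")            // how DateType columns are parsed/written
  .option("timestampFormat", "yyyy/MM/dd HH:mm") // how TimestampType columns are parsed/written
  .csv("/data/events.csv")
```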
* [SPARK-15083][WEB UI] History Server can OOM due to unlimited TaskUIData (Alex Bozarth, 2016-08-24, 9 files changed, -228/+256)
  ## What changes were proposed in this pull request?
  Based on #12990 by tankkyo. Since the History Server currently loads all application data, it can OOM if too many applications have a significant task count. `spark.ui.trimTasks` (default: false) can be set to true to trim tasks by `spark.ui.retainedTasks` (default: 10000).
  (This is a "quick fix" to help those running into the problem until an update of how the history server loads app data can be done.)
  ## How was this patch tested?
  Manual testing and dev/run-tests
  ![spark-15083](https://cloud.githubusercontent.com/assets/13952758/17713694/fe82d246-63b0-11e6-9697-b87ea75ff4ef.png)
  Author: Alex Bozarth <ajbozart@us.ibm.com>
  Closes #14673 from ajbozarth/spark15083.
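A configuration sketch using the two settings named in the commit above:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.trimTasks", "true")       // opt in to trimming per-stage task data
  .set("spark.ui.retainedTasks", "10000")  // cap on tasks retained per stage
```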
* [SPARK-16983][SQL] Add `prettyName` for row_number, dense_rank, percent_rank, cume_dist (Dongjoon Hyun, 2016-08-24, 1 file changed, -5/+6)
  ## What changes were proposed in this pull request?
  Currently, two-word window functions like `row_number`, `dense_rank`, `percent_rank`, and `cume_dist` are expressed without `_` in error messages. We had better show the correct names.
  **Before**
  ```scala
  scala> sql("select row_number()").show
  java.lang.UnsupportedOperationException: Cannot evaluate expression: rownumber()
  ```
  **After**
  ```scala
  scala> sql("select row_number()").show
  java.lang.UnsupportedOperationException: Cannot evaluate expression: row_number()
  ```
  ## How was this patch tested?
  Pass the Jenkins and manual.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #14571 from dongjoon-hyun/SPARK-16983.
* [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment (Sean Owen, 2016-08-24, 16 files changed, -16/+16)
  ## What changes were proposed in this pull request?
  Update to py4j 0.10.3 to enable JAVA_HOME support.
  ## How was this patch tested?
  Pyspark tests.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #14748 from srowen/SPARK-16781.
* [SPARK-16445][MLLIB][SPARKR] Multilayer Perceptron Classifier wrapper in SparkR (Xin Ren, 2016-08-24, 6 files changed, -5/+293)
  https://issues.apache.org/jira/browse/SPARK-16445
  ## What changes were proposed in this pull request?
  Create Multilayer Perceptron Classifier wrapper in SparkR.
  ## How was this patch tested?
  Tested manually on local machine.
  Author: Xin Ren <iamshrek@126.com>
  Closes #14447 from keypointt/SPARK-16445.
* [SPARKR][MINOR] Fix doc for show method (Junyang Qian, 2016-08-24, 1 file changed, -2/+2)
  ## What changes were proposed in this pull request?
  The original doc of `show` put methods for multiple classes together, but the text only talks about `SparkDataFrame`. This PR tries to fix this problem.
  ## How was this patch tested?
  Manual test.
  Author: Junyang Qian <junyangq@databricks.com>
  Closes #14776 from junyangq/SPARK-FixShowDoc.
* [MINOR][DOC] Fix wrong ml.feature.Normalizer document. (Yanbo Liang, 2016-08-24, 1 file changed, -1/+1)
  ## What changes were proposed in this pull request?
  The `ml.feature.Normalizer` examples illustrate the L1 norm rather than L2, so we should correct the corresponding document.
  ![image](https://cloud.githubusercontent.com/assets/1962026/17928637/85aec284-69b0-11e6-9b13-d465ee560581.png)
  ## How was this patch tested?
  Doc change, no test.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #14787 from yanboliang/normalizer.
* [SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer when some quantiles are duplicated (VinceShieh, 2016-08-24, 2 files changed, -1/+25)
  ## What changes were proposed in this pull request?
  In cases when `QuantileDiscretizer` is called on a numeric array with duplicated elements, we will take the unique elements generated from approxQuantiles as input for Bucketizer.
  ## How was this patch tested?
  A unit test is added in QuantileDiscretizerSuite.
  QuantileDiscretizer.fit will throw an illegal argument exception when calling setSplits on a list of splits with duplicated elements. Bucketizer.setSplits should only accept a numeric vector of two or more unique cut points, although that may produce fewer buckets than requested.
  Signed-off-by: VinceShieh <vincent.xieintel.com>
  Author: VinceShieh <vincent.xie@intel.com>
  Closes #14747 from VinceShieh/SPARK-17086.
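A minimal sketch of the idea described above, deduplicating candidate split points before handing them to Bucketizer (column names and values are made up for illustration):

```scala
import org.apache.spark.ml.feature.Bucketizer

val rawSplits = Array(Double.NegativeInfinity, 1.0, 1.0, 2.0, Double.PositiveInfinity)
val distinctSplits = rawSplits.distinct  // may yield fewer buckets than requested

val bucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bucket")
  .setSplits(distinctSplits)
```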
* [MINOR][BUILD] Fix Java CheckStyle Error (Weiqing Yang, 2016-08-24, 2 files changed, -6/+8)
  ## What changes were proposed in this pull request?
  As Spark 2.0.1 will be released soon (mentioned in the spark dev mailing list), besides the critical bugs, it's better to fix the code style errors before the release.
  Before:
  ```
  ./dev/lint-java
  Checkstyle checks failed at following occurrences:
  [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525] (sizes) LineLength: Line is longer than 100 characters (found 119).
  [ERROR] src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
  ```
  After:
  ```
  ./dev/lint-java
  Using `mvn` from path: /usr/local/bin/mvn
  Checkstyle checks passed.
  ```
  ## How was this patch tested?
  Manual.
  Author: Weiqing Yang <yangweiqing001@gmail.com>
  Closes #14768 from Sherry302/fixjavastyle.
* [SPARK-17186][SQL] remove catalog table type INDEX (Wenchen Fan, 2016-08-23, 5 files changed, -10/+6)
  ## What changes were proposed in this pull request?
  Spark SQL doesn't actually support indexes; the catalog table type `INDEX` is from Hive. However, most operations in Spark SQL can't handle index tables, e.g. create table, alter table, etc.
  Logically, index tables should be invisible to end users, and Hive also generates special table names for index tables to avoid users accessing them directly. Hive has special SQL syntax to create/show/drop index tables.
  At the Spark SQL side, although we can describe an index table directly, the result is unreadable; we should use the dedicated SQL syntax to do it (e.g. `SHOW INDEX ON tbl`). Spark SQL can also read an index table directly, but the result is always empty. (Can Hive read index tables directly?)
  This PR removes the table type `INDEX`, to make it clear that Spark SQL doesn't support indexes currently.
  ## How was this patch tested?
  existing tests.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #14752 from cloud-fan/minor2.
* [MINOR][SQL] Remove implemented functions from comments of 'HiveSessionCatalog.scala' (Weiqing Yang, 2016-08-23, 1 file changed, -4/+2)
  ## What changes were proposed in this pull request?
  This PR removes implemented functions from the comments of `HiveSessionCatalog.scala`: `java_method`, `posexplode`, `str_to_map`.
  ## How was this patch tested?
  Manual.
  Author: Weiqing Yang <yangweiqing001@gmail.com>
  Closes #14769 from Sherry302/cleanComment.
* [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader` (Tejas Patil, 2016-08-23, 1 file changed, -1/+21)
  ## What changes were proposed in this pull request?
  Jira: https://issues.apache.org/jira/browse/SPARK-16862
  The `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k buffer to read data off disk. This PR makes it configurable to improve on-disk reads. I have made the default value 1 MB, as with that value I observed improved performance.
  ## How was this patch tested?
  I am relying on the existing unit tests.
  ## Performance
  After deploying this change to prod and setting the config to 1 MB, there was a 12% reduction in CPU time and a 19.5% reduction in CPU reservation time.
  Author: Tejas Patil <tejasp@fb.com>
  Closes #14726 from tejasapatil/spill_buffer_2.
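A configuration sketch for this change; the description above does not name the configuration key, so the key below is an assumption and the actual patch should be checked for the exact property name:

```scala
import org.apache.spark.SparkConf

// Assumed key name for the spill reader buffer size (value in bytes).
val conf = new SparkConf()
  .set("spark.unsafe.sorter.spill.reader.buffer.size", (1024 * 1024).toString)  // 1 MB
```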
* [SPARK-17194] Use single quotes when generating SQL for string literals (Josh Rosen, 2016-08-23, 20 files changed, -22/+23)
  When Spark emits SQL for a string literal, it should wrap the string in single quotes, not double quotes. Databases which adhere more strictly to the ANSI SQL standards, such as Postgres, allow only single quotes to be used for denoting string literals (see http://stackoverflow.com/a/1992331/590203).
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #14763 from JoshRosen/SPARK-17194.
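An illustration of the quoting convention described above (assuming a SparkSession named `spark`):

```scala
// Single quotes denote a string literal in ANSI SQL; in stricter databases such as
// Postgres, double quotes denote an identifier instead of a string.
spark.sql("SELECT 'hello' AS greeting").show()
```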