Commit message  Author  Age  Files  Lines
* [SPARK-16598][SQL][TEST] Added a test case for verifying the table identifier parsing  gatorsmile  2016-08-12  1  -0/+8
| #### What changes were proposed in this pull request?
| So far, the test cases of `TableIdentifierParserSuite` do not cover the quoted cases. We add one to avoid regressions.
| #### How was this patch tested?
| N/A
| Author: gatorsmile <gatorsmile@gmail.com>
| Closes #14244 from gatorsmile/quotedIdentifiers.
* [MINOR][DOC] Fix style in examples across documentationhyukjinkwon2016-08-125-47/+47
| ## What changes were proposed in this pull request?
| This PR fixes the documentation as below:
| - Python uses 4 spaces while Java and Scala use 2 spaces (see https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide).
| - Avoid excessive parentheses and curly braces for anonymous functions (see https://github.com/databricks/scala-style-guide#anonymous).
| ## How was this patch tested?
| N/A
| Author: hyukjinkwon <gurwls223@gmail.com>
| Closes #14593 from HyukjinKwon/minor-documentation.
* [SPARK-16985] Change dataFormat from yyyyMMddHHmm to yyyyMMddHHmmsshongshen2016-08-124-5/+5
| ## What changes were proposed in this pull request?
| In our cluster, the SQL output is sometimes overwritten. When several SQL statements that each take less than one minute all insert into the same table, the following can happen:
| 1. sql1, 11:03 insert into table.
| 2. sql2, 11:04:11 insert into table.
| 3. sql3, 11:04:48 insert into table.
| 4. sql4, 11:05 insert into table.
| 5. sql5, 11:06 insert into table.
| sql3's output file overwrites sql2's output file. Here is the log:
| ```
| 16/05/04 11:04:11 INFO hive.SparkHiveHadoopWriter: XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_1204544348/10000/_tmp.p_20160428/attempt_201605041104_0001_m_000000_1
| 16/05/04 11:04:48 INFO hive.SparkHiveHadoopWriter: XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_212180468/10000/_tmp.p_20160428/attempt_201605041104_0001_m_000000_1
| ```
| The reason is that the output file name uses SimpleDateFormat("yyyyMMddHHmm"), so if two SQL statements insert into the same table within the same minute, the output is overwritten. We should change the dateFormat to "yyyyMMddHHmmss"; in our cluster we never finish a SQL statement within one second.
| ## How was this patch tested?
| (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
| (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
| Author: hongshen <shenh062326@126.com>
| Closes #14574 from shenh062326/SPARK-16985.
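For illustration, a minimal standalone sketch (not code from the patch; the timestamps and object name below are made up) of why a minute-granularity `SimpleDateFormat` lets two jobs collide on the same output path while second granularity does not:

```scala
import java.text.SimpleDateFormat
import java.util.GregorianCalendar

object OutputPathCollisionSketch {
  def main(args: Array[String]): Unit = {
    // Two hypothetical job start times inside the same minute: 11:04:11 and 11:04:48.
    val t1 = new GregorianCalendar(2016, 4, 4, 11, 4, 11).getTime
    val t2 = new GregorianCalendar(2016, 4, 4, 11, 4, 48).getTime

    val byMinute = new SimpleDateFormat("yyyyMMddHHmm")
    val bySecond = new SimpleDateFormat("yyyyMMddHHmmss")

    // Minute-level timestamps are identical, so the later job reuses (and overwrites)
    // the earlier job's output path.
    println(byMinute.format(t1) == byMinute.format(t2)) // true  -> collision
    // Second-level timestamps differ, so the paths stay distinct.
    println(bySecond.format(t1) == bySecond.format(t2)) // false -> no collision
  }
}
```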
* [SPARK-17013][SQL] Parse negative numeric literalspetermaxlee2016-08-115-55/+37
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch updates the SQL parser to parse negative numeric literals as numeric literals, instead of unary minus of positive literals. This allows the parser to parse the minimal value for each data type, e.g. "-32768S". ## How was this patch tested? Updated test cases. Author: petermaxlee <petermaxlee@gmail.com> Closes #14608 from petermaxlee/SPARK-17013.
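As a hedged illustration (not taken from the patch; assumes a `SparkSession` named `spark`), once negative numeric literals are parsed directly, the minimal value of an integral type can be written as a single literal such as the "-32768S" mentioned above:

```scala
// Before this change, -32768S was parsed as unary minus applied to 32768S,
// and 32768 overflows the smallint range; parsing it as one literal makes it valid.
spark.sql("SELECT -32768S AS smallest_short").show()  // prints -32768
```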
* [SPARK-16975][SQL] Column-partition path starting '_' should be handled correctly  Dongjoon Hyun  2016-08-12  5  -4/+14
| ## What changes were proposed in this pull request?
| Currently, Spark ignores path names starting with underscore `_` and `.`. This causes read failures for column-partitioned file data sources whose partition column names start with '_', e.g. `_col`.
| **Before**
| ```scala
| scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
| scala> spark.read.parquet("/tmp/parquet")
| org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/parquet20. It must be specified manually;
| ```
| **After**
| ```scala
| scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
| scala> spark.read.parquet("/tmp/parquet")
| res2: org.apache.spark.sql.DataFrame = [id: bigint, _locality_code: int]
| ```
| ## How was this patch tested?
| Pass the Jenkins with a new test case.
| Author: Dongjoon Hyun <dongjoon@apache.org>
| Closes #14585 from dongjoon-hyun/SPARK-16975-PARQUET.
* [MINOR][ML] Rename TreeEnsembleModels to TreeEnsembleModel for PySparkYanbo Liang2016-08-112-6/+6
| ## What changes were proposed in this pull request?
| Fix the typo ```TreeEnsembleModels``` in PySpark; it should be ```TreeEnsembleModel```, which is consistent with Scala. Moreover, it represents a single tree ensemble model, so ```TreeEnsembleModel``` is the more reasonable name. The class is not part of the public API, so this is not a breaking change.
| ## How was this patch tested?
| No new tests; existing ones should pass.
| Author: Yanbo Liang <ybliang8@gmail.com>
| Closes #14454 from yanboliang/TreeEnsembleModel.
* [SPARK-16434][SQL] Avoid per-record type dispatch in JSON when readinghyukjinkwon2016-08-125-216/+297
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently, `JacksonParser.parse` is doing type-based dispatch for each row to convert the tokens to appropriate values for Spark. It might not have to be done like this because the schema is already kept. So, appropriate converters can be created first according to the schema once, and then apply them to each row. This PR corrects `JacksonParser` so that it creates all converters for the schema once and then applies them to each row rather than type dispatching for every row. Benchmark was proceeded with the codes below: #### Parser tests **Before** ```scala test("Benchmark for JSON converter") { val N = 500 << 8 val row = """{"struct":{"field1": true, "field2": 92233720368547758070}, "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]}, "arrayOfString":["str1", "str2"], "arrayOfInteger":[1, 2147483647, -2147483648], "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808], "arrayOfBigInteger":[922337203685477580700, -922337203685477580800], "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308], "arrayOfBoolean":[true, false, true], "arrayOfNull":[null, null, null, null], "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}], "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]], "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]] }""" val data = List.fill(N)(row) val dummyOption = new JSONOptions(Map.empty[String, String]) val schema = InferSchema.infer(spark.sparkContext.parallelize(Seq(row)), "", dummyOption) val factory = new JsonFactory() val benchmark = new Benchmark("JSON converter", N) benchmark.addCase("convert JSON file", 10) { _ => data.foreach { input => val parser = factory.createParser(input) parser.nextToken() JacksonParser.convertRootField(factory, parser, schema) } } benchmark.run() } ``` ``` JSON converter: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ convert JSON file 1697 / 1807 0.1 13256.9 1.0X ``` **After** ```scala test("Benchmark for JSON converter") { val N = 500 << 8 val row = """{"struct":{"field1": true, "field2": 92233720368547758070}, "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]}, "arrayOfString":["str1", "str2"], "arrayOfInteger":[1, 2147483647, -2147483648], "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808], "arrayOfBigInteger":[922337203685477580700, -922337203685477580800], "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308], "arrayOfBoolean":[true, false, true], "arrayOfNull":[null, null, null, null], "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}], "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]], "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]] }""" val data = List.fill(N)(row) val dummyOption = new JSONOptions(Map.empty[String, String], new SQLConf()) val schema = InferSchema.infer(spark.sparkContext.parallelize(Seq(row)), dummyOption) val benchmark = new Benchmark("JSON converter", N) benchmark.addCase("convert JSON file", 10) { _ => val parser = new JacksonParser(schema, dummyOption) data.foreach { input 
=> parser.parse(input) } } benchmark.run() } ``` ``` JSON converter: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ convert JSON file 1401 / 1461 0.1 10947.4 1.0X ``` It seems parsing time is improved by roughly ~20% #### End-to-End test ```scala test("Benchmark for JSON reader") { val N = 500 << 8 val row = """{"struct":{"field1": true, "field2": 92233720368547758070}, "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]}, "arrayOfString":["str1", "str2"], "arrayOfInteger":[1, 2147483647, -2147483648], "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808], "arrayOfBigInteger":[922337203685477580700, -922337203685477580800], "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308], "arrayOfBoolean":[true, false, true], "arrayOfNull":[null, null, null, null], "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}], "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]], "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]] }""" val df = spark.sqlContext.read.json(spark.sparkContext.parallelize(List.fill(N)(row))) withTempPath { path => df.write.format("json").save(path.getCanonicalPath) val benchmark = new Benchmark("JSON reader", N) benchmark.addCase("reading JSON file", 10) { _ => spark.read.format("json").load(path.getCanonicalPath).collect() } benchmark.run() } } ``` **Before** ``` JSON reader: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ reading JSON file 6485 / 6924 0.0 50665.0 1.0X ``` **After** ``` JSON reader: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ reading JSON file 6350 / 6529 0.0 49609.3 1.0X ``` ## How was this patch tested? Existing test cases should cover this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #14102 from HyukjinKwon/SPARK-16434.
* [SPARK-13081][PYSPARK][SPARK_SUBMIT] Allow set pythonExec of driver and executor through conf…  Jeff Zhang  2016-08-11  8  -8/+72
| Before this PR, users had to export environment variables to specify the Python executable of the driver and executors, which is not convenient. This PR allows users to specify the Python executable through the configurations "--pyspark-driver-python" and "--pyspark-executor-python".
| Manually tested in local and yarn mode, for both pyspark-shell and pyspark batch mode.
| Author: Jeff Zhang <zjffdu@apache.org>
| Closes #13146 from zjffdu/SPARK-13081.
* [SPARK-17022][YARN] Handle potential deadlock in driver handling messagesWangTaoTheTonic2016-08-111-3/+15
| ## What changes were proposed in this pull request?
| We send RequestExecutors directly to the AM instead of routing it through yarnSchedulerBackend first, to avoid a potential deadlock.
| ## How was this patch tested?
| Manual tests.
| Author: WangTaoTheTonic <wangtao111@huawei.com>
| Closes #14605 from WangTaoTheTonic/lock.
* [SPARK-16868][WEB UI] Fix executor be both dead and alive on executor ui.huangzhaowei2016-08-111-0/+4
| ## What changes were proposed in this pull request?
| Under heavy load, an executor can register itself with the driver block manager twice (because of heartbeats), so the executor shows up as both dead and alive on the UI, as in the picture:
| ![image](https://cloud.githubusercontent.com/assets/7404824/17467245/c1359094-5d4e-11e6-843a-f6d6347e1bf6.png)
| ## How was this patch tested?
| NA. Details in [SPARK-16868](https://issues.apache.org/jira/browse/SPARK-16868).
| Author: huangzhaowei <carlmartinmax@gmail.com>
| Closes #14530 from SaintBacchus/SPARK-16868.
* [SPARK-13602][CORE] Add shutdown hook to DriverRunner to prevent driver process leak  Bryan Cutler  2016-08-11  2  -50/+142
| ## What changes were proposed in this pull request?
| Added a shutdown hook to DriverRunner to kill the driver process in case the Worker JVM exits suddenly and the `WorkerWatcher` is unable to catch this properly. Also did some cleanup to consolidate driver state management and the setting of finalized vars within the running thread.
| ## How was this patch tested?
| Added unit tests to verify that the final state and exception variables are set accordingly for successful, failed, and errored driver processes. Retrofitted an existing test to verify that killing the mocked process ends with the correct state and stops properly. Manually tested (with deploy-mode=cluster) that the shutdown hook is called by forcibly exiting the `Worker` at various points in the code, with the `WorkerWatcher` both disabled and enabled. Also manually killed the driver through the UI and verified that the `DriverRunner` interrupted, killed the process, and exited properly.
| Author: Bryan Cutler <cutlerb@gmail.com>
| Closes #11746 from BryanCutler/DriverRunner-shutdown-hook-SPARK-13602.
* [SPARK-17018][SQL] literals.sql for testing literal parsingpetermaxlee2016-08-115-62/+476
| | | | | | | | | | | | ## What changes were proposed in this pull request? This patch adds literals.sql for testing literal parsing end-to-end in SQL. ## How was this patch tested? The patch itself is only about adding test cases. Author: petermaxlee <petermaxlee@gmail.com> Closes #14598 from petermaxlee/SPARK-17018-2.
* [SPARK-17021][SQL] simplify the constructor parameters of QuantileSummariesWenchen Fan2016-08-111-10/+11
| ## What changes were proposed in this pull request?
| 1. `sampled` doesn't need to be an `ArrayBuffer`; we never update it, only assign a new value.
| 2. `count` doesn't need to be a `var`; we never mutate it.
| 3. `headSampled` doesn't need to be in the constructor; we never pass a non-empty `headSampled` to the constructor.
| ## How was this patch tested?
| Existing tests.
| Author: Wenchen Fan <wenchen@databricks.com>
| Closes #14603 from cloud-fan/simply.
* [SPARK-16958] [SQL] Reuse subqueries within the same queryDavies Liu2016-08-117-49/+215
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? There could be multiple subqueries that generate same results, we could re-use the result instead of running it multiple times. This PR also cleanup up how we run subqueries. For SQL query ```sql select id,(select avg(id) from t) from t where id > (select avg(id) from t) ``` The explain is ``` == Physical Plan == *Project [id#15L, Subquery subquery29 AS scalarsubquery()#35] : +- Subquery subquery29 : +- *HashAggregate(keys=[], functions=[avg(id#15L)]) : +- Exchange SinglePartition : +- *HashAggregate(keys=[], functions=[partial_avg(id#15L)]) : +- *Range (0, 1000, splits=4) +- *Filter (cast(id#15L as double) > Subquery subquery29) : +- Subquery subquery29 : +- *HashAggregate(keys=[], functions=[avg(id#15L)]) : +- Exchange SinglePartition : +- *HashAggregate(keys=[], functions=[partial_avg(id#15L)]) : +- *Range (0, 1000, splits=4) +- *Range (0, 1000, splits=4) ``` The visualized plan: ![reuse-subquery](https://cloud.githubusercontent.com/assets/40902/17573229/e578d93c-5f0d-11e6-8a3c-0150d81d3aed.png) ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #14548 from davies/subq.
* [SPARK-16952] don't lookup spark home directory when executor uri is setMichael Gummelt2016-08-112-7/+26
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? remove requirement to set spark.mesos.executor.home when spark.executor.uri is used ## How was this patch tested? unit tests Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14552 from mgummelt/fix-spark-home.
* [SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and ↵hyukjinkwon2016-08-115-122/+124
| | | | | | | | | | | | | | | | | | | | | | | | | | | indentation in documentation ## What changes were proposed in this pull request? Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments. This PR fixes three things below: - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java. - Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples. - Fix `StructuredNetworkWordCountWindowed` and `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially). ## How was this patch tested? N/A Closes https://github.com/apache/spark/pull/14491 Author: hyukjinkwon <gurwls223@gmail.com> Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local> Closes #14564 from HyukjinKwon/SPARK-16886.
* [SPARK-16941] Use concurrentHashMap instead of scala Map in SparkSQLOperationManager.  huangzhaowei  2016-08-11  3  -11/+13
| ## What changes were proposed in this pull request?
| The ThriftServer has a thread-safety problem in **SparkSQLOperationManager**. Make the maps in it thread-safe to avoid this problem. Details in [SPARK-16941](https://issues.apache.org/jira/browse/SPARK-16941).
| ## How was this patch tested?
| NA
| Author: huangzhaowei <carlmartinmax@gmail.com>
| Closes #14534 from SaintBacchus/SPARK-16941.
* Correct example value for spark.ssl.YYY.XXX settingsAndrew Ash2016-08-111-2/+4
| Docs adjustment to:
| - link to another relevant section of the docs
| - correct a statement that claimed only one value is supported when other values are in fact supported
| Author: Andrew Ash <andrew@andrewash.com>
| Closes #14581 from ash211/patch-10.
* [SPARK-17015][SQL] group-by/order-by ordinal and arithmetic testspetermaxlee2016-08-118-232/+613
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch adds three test files: 1. arithmetic.sql.out 2. order-by-ordinal.sql 3. group-by-ordinal.sql This includes https://github.com/apache/spark/pull/14594. ## How was this patch tested? This is a test case change. Author: petermaxlee <petermaxlee@gmail.com> Closes #14595 from petermaxlee/SPARK-17015.
* [SPARK-17011][SQL] Support testing exceptions in SQLQueryTestSuitepetermaxlee2016-08-109-68/+161
| ## What changes were proposed in this pull request?
| This patch adds exception testing to SQLQueryTestSuite. When there is an exception in query execution, the query result contains the exception class along with the exception message. As part of this, I moved some additional test cases for limit from SQLQuerySuite over to SQLQueryTestSuite.
| ## How was this patch tested?
| This is a test harness change.
| Author: petermaxlee <petermaxlee@gmail.com>
| Closes #14592 from petermaxlee/SPARK-17011.
* [SPARK-17010][MINOR][DOC] Wrong description in memory management documentTao Wang2016-08-101-1/+1
| ## What changes were proposed in this pull request?
| Correct the remaining-memory percentage to the right value.
| ## How was this patch tested?
| Manual review.
| Author: Tao Wang <wangtao111@huawei.com>
| Closes #14591 from WangTaoTheTonic/patch-1.
* [SPARK-17007][SQL] Move test data files into a test-data folderpetermaxlee2016-08-1040-39/+40
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch moves all the test data files in sql/core/src/test/resources to sql/core/src/test/resources/test-data, so we don't clutter the top level sql/core/src/test/resources. Also deleted sql/core/src/test/resources/old-repeated.parquet since it is no longer used. The change will make it easier to spot sql-tests directory. ## How was this patch tested? This is a test-only change. Author: petermaxlee <petermaxlee@gmail.com> Closes #14589 from petermaxlee/SPARK-17007.
* [SPARK-17008][SPARK-17009][SQL] Normalization and isolation in SQLQueryTestSuite.  petermaxlee  2016-08-10  8  -65/+180
| ## What changes were proposed in this pull request?
| This patch enhances SQLQueryTestSuite in two ways:
| 1. SPARK-17009: Use a new SparkSession for each test case to provide stronger isolation (e.g. config changes in one test case do not impact another). That said, we do not currently isolate catalog changes.
| 2. SPARK-17008: Normalize query output using sorting, inspired by HiveComparisonTest.
| I also ported a few new test cases over from SQLQuerySuite.
| ## How was this patch tested?
| This is a test harness update.
| Author: petermaxlee <petermaxlee@gmail.com>
| Closes #14590 from petermaxlee/SPARK-17008.
* [SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN  jerryshao  2016-08-10  24  -573/+985
| ## What changes were proposed in this pull request?
| Add a configurable token manager for Spark running on YARN.
| ### Current Problems ###
| 1. The supported token providers are hard-coded; currently only hdfs, hbase and hive are supported, and it is impossible for users to add a new token provider without code changes.
| 2. The same problem exists in the timely token renewer and updater.
| ### Changes In This Proposal ###
| To address the problems above and make the current code cleaner and easier to understand, this proposal mainly has 3 changes:
| 1. Abstract a `ServiceTokenProvider` interface (as well as `ServiceTokenRenewable`) for token providers. Each service that wants to communicate with Spark via tokens needs to implement this interface.
| 2. Provide a `ConfigurableTokenManager` to manage all the registered token providers, as well as the token renewer and updater. This class also offers the API for other modules to obtain tokens, get the renewal interval, and so on.
| 3. Implement 3 built-in token providers, `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider`, to keep the same semantics as supported today. Whether to load these built-in token providers is controlled by the configuration "spark.yarn.security.tokens.${service}.enabled"; by default all the built-in token providers are loaded.
| ### Behavior Changes ###
| For the end user there is no behavior change; we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive). A user-implemented token provider (assume its name is "test") needs two configurations:
| 1. `spark.yarn.security.tokens.test.enabled` set to true
| 2. `spark.yarn.security.tokens.test.class` set to the fully qualified class name.
| So we keep the same semantics as the current code while adding one new configuration.
| ### Current Status ###
| - [x] token provider interface and management framework.
| - [x] implement built-in token providers (hdfs, hbase, hive).
| - [x] coverage of unit tests.
| - [x] integrated test with a secure cluster.
| ## How was this patch tested?
| Unit tests and integrated tests. Please suggest and review; any comment is greatly appreciated.
| Author: jerryshao <sshao@hortonworks.com>
| Closes #14065 from jerryshao/SPARK-16342.
* [SPARK-12920][CORE] Honor "spark.ui.retainedStages" to reduce mem-pressureRajesh Balamohan2016-08-102-18/+36
| When a large number of jobs are run concurrently against the Spark thrift server, the thrift server ends up at high CPU due to GC pressure, and job UI retention causes memory pressure with large jobs. The profiler snapshot is at https://issues.apache.org/jira/secure/attachment/12783302/SPARK-12920.profiler_job_progress_listner.png. This PR honors `spark.ui.retainedStages` strictly to reduce memory pressure.
| Manual and unit tests.
| Author: Rajesh Balamohan <rbalamohan@apache.org>
| Closes #10846 from rajeshbalamohan/SPARK-12920.
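A minimal, hedged configuration sketch (the property name comes from the commit message; the app name and the value 500 are arbitrary examples, not recommendations):

```scala
import org.apache.spark.SparkConf

// Cap the number of stages retained by the UI listener to relieve driver memory pressure.
val conf = new SparkConf()
  .setAppName("thrift-server-tuning-example")
  .set("spark.ui.retainedStages", "500")
```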
* [SPARK-16928] [SQL] Recursive call of ColumnVector::getInt() breaks JIT inliningQifan Pu2016-08-104-25/+54
| ## What changes were proposed in this pull request?
| In both `OnHeapColumnVector` and `OffHeapColumnVector`, we implemented `getInt()` with the following code pattern:
| ```
| public int getInt(int rowId) {
|   if (dictionary == null) {
|     return intData[rowId];
|   } else {
|     return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
|   }
| }
| ```
| As `dictionaryIds` is also a `ColumnVector`, this results in a recursive call of `getInt()` and breaks JIT inlining. As a result, `getInt()` will not get inlined. We fix this by adding a separate method `getDictId()` specifically for `dictionaryIds` to use.
| ## How was this patch tested?
| We tested the difference with the following aggregate query on a TPCDS dataset (with scale factor = 5):
| ```
| select max(ss_sold_date_sk) as max_ss_sold_date_sk from store_sales
| ```
| The query runtime is improved from 202ms (before) to 159ms (after).
| Author: Qifan Pu <qifan.pu@gmail.com>
| Closes #14513 from ooq/SPARK-16928.
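A hedged sketch of the idea behind the fix, written here in Scala with invented names (the real classes are the Java `OnHeapColumnVector`/`OffHeapColumnVector`): give the dictionary-id vector its own accessor so that `getInt()` never calls `getInt()` on another column vector, keeping the method small and non-recursive so the JIT can inline it.

```scala
// Illustrative only; field layout and names are simplified stand-ins, not the actual classes.
class IntColumnVectorSketch(
    intData: Array[Int],                // plain encoded values
    dictionary: Option[Array[Int]],     // decoded dictionary values, if dictionary-encoded
    dictionaryIds: Array[Int]) {        // per-row dictionary ids

  // Separate, non-recursive accessor playing the role of getDictId().
  private def getDictId(rowId: Int): Int = dictionaryIds(rowId)

  // getInt() no longer dispatches to getInt() on another vector.
  def getInt(rowId: Int): Int = dictionary match {
    case None       => intData(rowId)
    case Some(dict) => dict(getDictId(rowId))
  }
}
```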
* [SPARK-16579][SPARKR] add install.spark functionJunyang Qian2016-08-107-4/+267
| | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add an install_spark function to the SparkR package. User can run `install_spark()` to install Spark to a local directory within R. Updates: Several changes have been made: - `install.spark()` - check existence of tar file in the cache folder, and download only if not found - trial priority of mirror_url look-up: user-provided -> preferred mirror site from apache website -> hardcoded backup option - use 2.0.0 - `sparkR.session()` - can install spark when not found in `SPARK_HOME` ## How was this patch tested? Manual tests, running the check-cran.sh script added in #14173. Author: Junyang Qian <junyangq@databricks.com> Closes #14258 from junyangq/SPARK-16579.
* [SPARK-16710][SPARKR][ML] spark.glm should support weightColYanbo Liang2016-08-103-5/+36
| ## What changes were proposed in this pull request?
| Training GLMs on a weighted dataset is a very important use case, but it is not currently supported by SparkR. In native R, users can pass the ```weights``` argument to specify the weights vector. For ```spark.glm```, we can pass in ```weightCol```, which is consistent with MLlib.
| ## How was this patch tested?
| Unit test.
| Author: Yanbo Liang <ybliang8@gmail.com>
| Closes #14346 from yanboliang/spark-16710.
* [SPARK-15639] [SPARK-16321] [SQL] Push down filter at RowGroups level for ↵Liang-Chi Hsieh2016-08-105-147/+143
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | parquet reader ## What changes were proposed in this pull request? The base class `SpecificParquetRecordReaderBase` used for vectorized parquet reader will try to get pushed-down filters from the given configuration. This pushed-down filters are used for RowGroups-level filtering. However, we don't set up the filters to push down into the configuration. In other words, the filters are not actually pushed down to do RowGroups-level filtering. This patch is to fix this and tries to set up the filters for pushing down to configuration for the reader. The benchmark that excludes the time of writing Parquet file: test("Benchmark for Parquet") { val N = 500 << 12 withParquetTable((0 until N).map(i => (101, i)), "t") { val benchmark = new Benchmark("Parquet reader", N) benchmark.addCase("reading Parquet file", 10) { iter => sql("SELECT _1 FROM t where t._1 < 100").collect() } benchmark.run() } } `withParquetTable` in default will run tests for vectorized reader non-vectorized readers. I only let it run vectorized reader. When we set the block size of parquet as 1024 to have multiple row groups. The benchmark is: Before this patch: The retrieved row groups: 8063 Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic Intel(R) Core(TM) i7-5557U CPU 3.10GHz Parquet reader: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ reading Parquet file 825 / 1233 2.5 402.6 1.0X After this patch: The retrieved row groups: 0 Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic Intel(R) Core(TM) i7-5557U CPU 3.10GHz Parquet reader: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ reading Parquet file 306 / 503 6.7 149.6 1.0X Next, I run the benchmark for non-pushdown case using the same benchmark code but with disabled pushdown configuration. This time the parquet block size is default value. Before this patch: Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic Intel(R) Core(TM) i7-5557U CPU 3.10GHz Parquet reader: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ reading Parquet file 136 / 238 15.0 66.5 1.0X After this patch: Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic Intel(R) Core(TM) i7-5557U CPU 3.10GHz Parquet reader: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ reading Parquet file 124 / 193 16.5 60.7 1.0X For non-pushdown case, from the results, I think this patch doesn't affect normal code path. I've manually output the `totalRowCount` in `SpecificParquetRecordReaderBase` to see if this patch actually filter the row-groups. When running the above benchmark: After this patch: `totalRowCount = 0` Before this patch: `totalRowCount = 1024000` ## How was this patch tested? Existing tests should be passed. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13701 from viirya/vectorized-reader-push-down-filter2.
* [SPARK-15899][SQL] Fix the construction of the file path with hadoop Pathavulanov2016-08-103-27/+35
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix the construction of the file path. Previous way of construction caused the creation of incorrect path on Windows. ## How was this patch tested? Run SQL unit tests on Windows Author: avulanov <nashb@yandex.ru> Closes #13868 from avulanov/SPARK-15899-file.
* [SPARK-16866][SQL] Infrastructure for file-based SQL end-to-end testspetermaxlee2016-08-106-38/+267
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch introduces SQLQueryTestSuite, a basic framework for end-to-end SQL test cases defined in spark/sql/core/src/test/resources/sql-tests. This is a more standard way to test SQL queries end-to-end in different open source database systems, because it is more manageable to work with files. This is inspired by HiveCompatibilitySuite, but simplified for general Spark SQL tests. Once this is merged, I can work towards porting SQLQuerySuite over, and eventually also move the existing HiveCompatibilitySuite to use this framework. Unlike HiveCompatibilitySuite, SQLQueryTestSuite compares both the output schema and the output data (in string form). When there is a mismatch, the error message looks like the following: ``` [info] - blacklist.sql !!! IGNORED !!! [info] - number-format.sql *** FAILED *** (2 seconds, 405 milliseconds) [info] Expected "...147483648 -214748364[8]", but got "...147483648 -214748364[9]" Result should match for query #1 (SQLQueryTestSuite.scala:171) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at org.scalatest.Assertions$class.assertResult(Assertions.scala:1171) ``` ## How was this patch tested? This is a test infrastructure change. Author: petermaxlee <petermaxlee@gmail.com> Closes #14472 from petermaxlee/SPARK-16866.
* [SPARK-16324][SQL] regexp_extract should doc that it returns empty string ↵Sean Owen2016-08-103-2/+11
| | | | | | | | | | | | | | | | when match fails ## What changes were proposed in this pull request? Doc that regexp_extract returns empty string when regex or group does not match ## How was this patch tested? Jenkins test, with a few new test cases Author: Sean Owen <sowen@cloudera.com> Closes #14525 from srowen/SPARK-16324.
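A small hedged example of the documented behaviour (assumes a `SparkSession` named `spark`): when the requested group does not participate in the match, the result is an empty string rather than null or an error.

```scala
// Group 2, (b)?, does not match anything in 'aaaac', so regexp_extract returns "".
spark.sql("SELECT regexp_extract('aaaac', '(a+)(b)?(c)', 2) AS grp").show()
```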
* [SPARK-16927][SPARK-16923] Override task properties at dispatcher.Timothy Chen2016-08-102-22/+33
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? - enable setting default properties for all jobs submitted through the dispatcher [SPARK-16927] - remove duplication of conf vars on cluster submitted jobs [SPARK-16923] (this is a small fix, so I'm including in the same PR) ## How was this patch tested? mesos/spark integration test suite manual testing Author: Timothy Chen <tnachen@gmail.com> Closes #14511 from mgummelt/override-props.
* Typo: Fow -> ForAndrew Ash2016-08-101-1/+1
| | | | | | Author: Andrew Ash <andrew@andrewash.com> Closes #14563 from ash211/patch-8.
* [SPARK-16185][SQL] Better Error Messages When Creating Table As Select ↵gatorsmile2016-08-103-1/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Without Enabling Hive Support #### What changes were proposed in this pull request? When we do not turn on the Hive Support, the following query generates a confusing error message by Planner: ```Scala sql("CREATE TABLE t2 SELECT a, b from t1") ``` ``` assertion failed: No plan for CreateTable CatalogTable( Table: `t2` Created: Tue Aug 09 23:45:32 PDT 2016 Last Access: Wed Dec 31 15:59:59 PST 1969 Type: MANAGED Provider: hive Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), ErrorIfExists +- Relation[a#19L,b#20L] parquet java.lang.AssertionError: assertion failed: No plan for CreateTable CatalogTable( Table: `t2` Created: Tue Aug 09 23:45:32 PDT 2016 Last Access: Wed Dec 31 15:59:59 PST 1969 Type: MANAGED Provider: hive Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), ErrorIfExists +- Relation[a#19L,b#20L] parquet ``` This PR is to issue a better error message: ``` Hive support is required to use CREATE Hive TABLE AS SELECT ``` #### How was this patch tested? Added test cases in `DDLSuite.scala` Author: gatorsmile <gatorsmile@gmail.com> Closes #13886 from gatorsmile/createCatalogedTableAsSelect.
* [SPARK-10601][SQL] Support `MINUS` set operatorDongjoon Hyun2016-08-104-2/+21
| ## What changes were proposed in this pull request?
| This PR adds the `MINUS` set operator, which is equivalent to `EXCEPT DISTINCT`. This slightly improves compatibility with Oracle.
| ## How was this patch tested?
| Pass the Jenkins with newly added test cases.
| Author: Dongjoon Hyun <dongjoon@apache.org>
| Closes #14570 from dongjoon-hyun/SPARK-10601.
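A hedged usage sketch (assumes a `SparkSession` named `spark`; `t1` and `t2` are hypothetical tables) showing the equivalence described above:

```scala
// MINUS behaves like EXCEPT, which is distinct by default in Spark SQL.
val viaMinus  = spark.sql("SELECT id FROM t1 MINUS SELECT id FROM t2")
val viaExcept = spark.sql("SELECT id FROM t1 EXCEPT SELECT id FROM t2")
// Both return the distinct ids of t1 that do not appear in t2.
```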
* [SPARK-16959][SQL] Rebuild Table Comment when Retrieving Metadata from Hive ↵gatorsmile2016-08-102-3/+7
| ### What changes were proposed in this pull request?
| The `comment` in `CatalogTable` returned from Hive is always empty. We store it in a table property when creating a table; however, when we retrieve the table metadata from the Hive metastore, we do not rebuild it, so the `comment` is always empty. This PR fixes the issue.
| ### How was this patch tested?
| Fixed the test case to verify the change.
| Author: gatorsmile <gatorsmile@gmail.com>
| Closes #14550 from gatorsmile/tableComment.
* [MINOR][SPARKR] R API documentation for "coltypes" is confusingXin Ren2016-08-101-5/+4
| ## What changes were proposed in this pull request?
| The R API documentation for "coltypes" is confusing (found while working on another ticket). In the current version, http://spark.apache.org/docs/2.0.0/api/R/coltypes.html, the parameter list has two entries for "x" (a duplicate), and the example is not very clear.
| ![current](https://cloud.githubusercontent.com/assets/3925641/17386808/effb98ce-59a2-11e6-9657-d477d258a80c.png)
| ![screen shot 2016-08-03 at 5 56 00 pm](https://cloud.githubusercontent.com/assets/3925641/17386884/91831096-59a3-11e6-84af-39890b3d45d8.png)
| ## How was this patch tested?
| Tested manually on a local machine. The screenshots are below:
| ![screen shot 2016-08-07 at 11 29 20 pm](https://cloud.githubusercontent.com/assets/3925641/17471144/df36633c-5cf6-11e6-8238-4e32ead0e529.png)
| ![screen shot 2016-08-03 at 5 56 22 pm](https://cloud.githubusercontent.com/assets/3925641/17386896/9d36cb26-59a3-11e6-9619-6dae29f7ab17.png)
| Author: Xin Ren <iamshrek@126.com>
| Closes #14489 from keypointt/rExample.
* Fixed typoMichał Kiełbowicz2016-08-091-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fixed small typo - "value ... ~~in~~ is null" ## How was this patch tested? Still compiles! Author: Michał Kiełbowicz <jupblb@users.noreply.github.com> Closes #14569 from jupblb/typo-fix.
* Make logDir easily copy/paste-ableAndrew Ash2016-08-091-1/+1
| | | | | | | | | | | In many terminals double-clicking and dragging also includes the trailing period. Simply remove this to make the value more easily copy/pasteable. Example value: `hdfs://mybox-123.net.example.com:8020/spark-events.` Author: Andrew Ash <andrew@andrewash.com> Closes #14566 from ash211/patch-9.
* [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY configurableJosh Rosen2016-08-093-3/+21
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration. **Background:** This application-killing was added in 6b5980da796e0204a7735a31fb454f312bc9daac (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path. **Motivation for making this configurable:** Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running then the Spark master would kill that application, but this behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances then it's possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use-case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative. I'd like to merge this patch into master, branch-2.0, and branch-1.6. ## How was this patch tested? I tested this manually using `spark-shell` and `local-cluster` mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now. Author: Josh Rosen <joshrosen@databricks.com> Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.
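A minimal, hedged sketch of opting out of this application-killing behaviour (the property name and the meaning of a negative value come from the commit message; the app name is made up):

```scala
import org.apache.spark.SparkConf

// In standalone mode, a negative value tells the Master never to kill an application
// because of back-to-back executor failures (useful when all executors can die at once).
val conf = new SparkConf()
  .setAppName("spot-instance-tolerant-app")
  .set("spark.deploy.maxExecutorRetries", "-1")
```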
* [SPARK-16905] SQL DDL: MSCK REPAIR TABLEDavies Liu2016-08-097-9/+211
| ## What changes were proposed in this pull request?
| MSCK REPAIR TABLE can be used to recover the partitions in the external catalog based on the partitions in the file system. Another syntax is: ALTER TABLE table RECOVER PARTITIONS
| The implementation in this PR only lists partitions (not the files within a partition) in the driver (in parallel if needed).
| ## How was this patch tested?
| Added unit tests for it and a Hive compatibility test suite.
| Author: Davies Liu <davies@databricks.com>
| Closes #14500 from davies/repair_table.
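A hedged usage sketch (assumes a `SparkSession` named `spark`; `logs` is a hypothetical partitioned external table); both statements come from the commit description and are equivalent:

```scala
// Recover partitions that exist on the file system but are missing from the catalog.
spark.sql("MSCK REPAIR TABLE logs")
// Equivalent alternative syntax:
spark.sql("ALTER TABLE logs RECOVER PARTITIONS")
```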
* [SPARK-16950] [PYSPARK] fromOffsets parameter support in KafkaUtils.createDirectStream for python3  Mariusz Strzelecki  2016-08-09  3  -13/+10
| ## What changes were proposed in this pull request?
| Allow KafkaUtils.createDirectStream to be used with starting offsets in Python 3 by using java.lang.Number instead of Long during parameter mapping in the Scala helper. This allows py4j to pass either Integer or Long to the map and resolves ClassCastException problems.
| ## How was this patch tested?
| Unit tests. jerryshao - could you please look at this PR?
| Author: Mariusz Strzelecki <mariusz.strzelecki@allegrogroup.com>
| Closes #14540 from szczeles/kafka_pyspark.
* [SPARK-16933][ML] Fix AFTAggregator in AFTSurvivalRegression serializes ↵Yanbo Liang2016-08-091-18/+29
| | | | | | | | | | | | | | | | | | | | unnecessary data. ## What changes were proposed in this pull request? Similar to ```LeastSquaresAggregator``` in #14109, ```AFTAggregator``` used for ```AFTSurvivalRegression``` ends up serializing the ```parameters``` and ```featuresStd```, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. This PR is highly inspired by #14109. ## How was this patch tested? I tested this locally and verified the serialization reduction. Before patch ![image](https://cloud.githubusercontent.com/assets/1962026/17512035/abb93f04-5dda-11e6-97d3-8ae6b61a0dfd.png) After patch ![image](https://cloud.githubusercontent.com/assets/1962026/17512024/9e0dc44c-5dda-11e6-93d0-6e130ba0d6aa.png) Author: Yanbo Liang <ybliang8@gmail.com> Closes #14519 from yanboliang/spark-16933.
* [SPARK-16964][SQL] Remove private[sql] and private[spark] from sql.execution ↵Reynold Xin2016-08-0963-177/+170
| | | | | | | | | | | | | | | | package ## What changes were proposed in this pull request? This package is meant to be internal, and as a result it does not make sense to mark things as private[sql] or private[spark]. It simply makes debugging harder when Spark developers need to inspect the plans at runtime. This patch removes all private[sql] and private[spark] visibility modifiers in org.apache.spark.sql.execution. ## How was this patch tested? N/A - just visibility changes. Author: Reynold Xin <rxin@databricks.com> Closes #14554 from rxin/remote-private.
* [SPARK-16809] enable history server links in dispatcher UIMichael Gummelt2016-08-097-10/+75
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Links the Spark Mesos Dispatcher UI to the history server UI - adds spark.mesos.dispatcher.historyServer.url - explicitly generates frameworkIDs for the launched drivers, so the dispatcher knows how to correlate drivers and frameworkIDs ## How was this patch tested? manual testing Author: Michael Gummelt <mgummelt@mesosphere.io> Author: Sergiusz Urbaniak <sur@mesosphere.io> Closes #14414 from mgummelt/history-server.
* [SPARK-16940][SQL] `checkAnswer` should raise `TestFailedException` for ↵Dongjoon Hyun2016-08-091-1/+9
| | | | | | | | | | | | | | | | | | | | | wrong results ## What changes were proposed in this pull request? This PR fixes the following to make `checkAnswer` raise `TestFailedException` again instead of `java.util.NoSuchElementException: key not found: TZ` in the environments without `TZ` variable. Also, this PR adds `QueryTestSuite` class for testing `QueryTest` itself. ```scala - |Timezone Env: ${sys.env("TZ")} + |Timezone Env: ${sys.env.getOrElse("TZ", "")} ``` ## How was this patch tested? Pass the Jenkins tests with a new test suite. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14528 from dongjoon-hyun/SPARK-16940.
* [SPARK-16522][MESOS] Spark application throws exception on exit.Sun Rui2016-08-092-1/+39
| ## What changes were proposed in this pull request?
| Spark applications running on Mesos throw an exception upon exit. For details, refer to https://issues.apache.org/jira/browse/SPARK-16522. I am not sure if there is any better fix, so I will wait for review comments.
| ## How was this patch tested?
| Manual test. Observed that the exception is gone upon application exit.
| Author: Sun Rui <sunrui2016@gmail.com>
| Closes #14175 from sun-rui/SPARK-16522.
* [SPARK-16606][CORE] Misleading warning for SparkContext.getOrCreate "WARN SparkContext: Use an existing SparkContext, some configuration may not take effect."  Sean Owen  2016-08-09  1  -4/+10
| ## What changes were proposed in this pull request?
| SparkContext.getOrCreate shouldn't warn about ignored config if:
| - it wasn't ignored, because a new context is created with it, or
| - no config was actually provided
| ## How was this patch tested?
| Jenkins + existing tests.
| Author: Sean Owen <sowen@cloudera.com>
| Closes #14533 from srowen/SPARK-16606.
* [SPARK-16610][SQL] Add `orc.compress` as an alias for `compression` option.hyukjinkwon2016-08-092-3/+32
| ## What changes were proposed in this pull request?
| For the ORC source, Spark SQL has a writer option `compression`, which is used to set the codec, and its value is also set to `orc.compress` (the ORC conf used for the codec). However, if a user only sets `orc.compress` in the writer options, we should not use the default value of `compression` (snappy) as the codec; instead, we should respect the value of `orc.compress`. This PR makes the ORC data source not ignore `orc.compress` when `compression` is unset. The resulting behaviour (see the sketch below) is:
| 1. Check `compression` and use it if it is set.
| 2. If `compression` is not set, check `orc.compress` and use it.
| 3. If neither `compression` nor `orc.compress` is set, use the default snappy.
| ## How was this patch tested?
| Unit test in `OrcQuerySuite`.
| Author: hyukjinkwon <gurwls223@gmail.com>
| Closes #14518 from HyukjinKwon/SPARK-16610.
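A hedged writer-side sketch of the precedence described above (assumes a `SparkSession` named `spark`; the DataFrame and output paths are made up):

```scala
val df = spark.range(10).toDF("id")
// Both writes end up using ZLIB after this change; previously the second write silently
// fell back to the snappy default because only `orc.compress` was set.
df.write.format("orc").option("compression", "zlib").save("/tmp/orc-a")   // writer option wins if set
df.write.format("orc").option("orc.compress", "ZLIB").save("/tmp/orc-b")  // ORC conf respected when `compression` is unset
```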