Commit message | Author | Age | Files | Lines
...
* [SPARK-6339][SQL] Supports CREATE TEMPORARY VIEW tableIdentifier AS query (Sean Zhong, 2016-05-04, 5 files, -38/+175)
  ## What changes were proposed in this pull request? This PR supports the new SQL syntax CREATE TEMPORARY VIEW, for example: ``` CREATE TEMPORARY VIEW viewName AS SELECT * FROM xx CREATE OR REPLACE TEMPORARY VIEW viewName AS SELECT * FROM xx CREATE TEMPORARY VIEW viewName (c1 COMMENT 'blabla', c2 COMMENT 'blabla') AS SELECT * FROM xx ``` ## How was this patch tested? Unit tests. Author: Sean Zhong <clockfly@gmail.com> Closes #12872 from clockfly/spark-6399.
* [SPARK-14896][SQL] Deprecate HiveContext in python (Andrew Or, 2016-05-04, 3 files, -4/+9)
  ## What changes were proposed in this pull request? See title. ## How was this patch tested? PySpark tests. Author: Andrew Or <andrew@databricks.com> Closes #12917 from andrewor14/deprecate-hive-context-python.
* [MINOR][SQL] Fix typo in DataFrameReader csv documentation (sethah, 2016-05-04, 1 file, -1/+1)
  ## What changes were proposed in this pull request? Typo fix. ## How was this patch tested? No tests. My apologies for the tiny PR, but I stumbled across this today and wanted to get it corrected for 2.0. Author: sethah <seth.hendrickson16@gmail.com> Closes #12912 from sethah/csv_typo.
* [SPARK-15116] In REPL we should create SparkSession first and get SparkContext from it (Wenchen Fan, 2016-05-04, 4 files, -43/+26)
  ## What changes were proposed in this pull request? see https://github.com/apache/spark/pull/12873#discussion_r61993910. The problem is, if we create `SparkContext` first and then call `SparkSession.builder.enableHiveSupport().getOrCreate()`, we will reuse the existing `SparkContext` and the hive flag won't be set. ## How was this patch tested? verified it locally. Author: Wenchen Fan <wenchen@databricks.com> Closes #12890 from cloud-fan/repl.
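  For reference, a minimal sketch of the initialization order this change establishes (the master and app name are illustrative, not the actual REPL code):
  ```scala
  import org.apache.spark.sql.SparkSession

  // Build the SparkSession first so that enableHiveSupport() can take effect,
  // then obtain the SparkContext from it instead of constructing one up front.
  val spark = SparkSession.builder()
    .master("local[*]")            // assumption: local mode, for illustration only
    .appName("repl-style-init")    // hypothetical app name
    .enableHiveSupport()
    .getOrCreate()

  val sc = spark.sparkContext      // the context is created and owned by the session
  ```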
* [SPARK-13001][CORE][MESOS] Prevent getting offers when max cores is reached (Sebastien Rainville, 2016-05-04, 3 files, -17/+53)
  Similar to https://github.com/apache/spark/pull/8639. This change rejects offers for 120s once `spark.cores.max` is reached in coarse-grained mode, to mitigate offer starvation. This prevents Mesos from sending us offers again and again, starving other frameworks. This is especially problematic when running many small frameworks on the same Mesos cluster, e.g. many small Spark Streaming jobs, and causes the bigger Spark jobs to stop receiving offers. By rejecting the offers for a long period of time, they become available to those other frameworks. Author: Sebastien Rainville <sebastien@hopper.com> Closes #10924 from sebastienrainville/master.
* [SPARK-15031][EXAMPLE] Use SparkSession in Scala/Python/Java examples (Dongjoon Hyun, 2016-05-04, 155 files, -1232/+852)
  ## What changes were proposed in this pull request? This PR aims to update the Scala/Python/Java examples by replacing `SQLContext` with the newly added `SparkSession`. - Use the **SparkSession Builder Pattern** in 154 (Scala 55, Java 52, Python 47) files. - Add `getConf` to the Python SparkContext class: `python/pyspark/context.py` - Replace the **SQLContext Singleton Pattern** with the **SparkSession Singleton Pattern**: - `SqlNetworkWordCount.scala` - `JavaSqlNetworkWordCount.java` - `sql_network_wordcount.py` Now, `SQLContext`s are used only in the R examples and the following two Python examples. The Python examples are untouched in this PR since they already fail with an unknown issue. - `simple_params_example.py` - `aft_survival_regression.py` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12809 from dongjoon-hyun/SPARK-15031.
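  A rough sketch of the kind of change applied across the examples, replacing the SQLContext-based setup with the SparkSession builder (the app name and file path are illustrative, not taken from the actual diff):
  ```scala
  import org.apache.spark.sql.SparkSession

  // Before: val sqlContext = new SQLContext(sc); sqlContext.read.json(...)
  // After: getOrCreate() also doubles as the singleton accessor used by the streaming examples.
  val spark = SparkSession.builder()
    .appName("ExampleApp")   // hypothetical app name
    .getOrCreate()

  val people = spark.read.json("examples/src/main/resources/people.json")
  people.show()
  ```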
* [SPARK-12299][CORE] Remove history serving functionality from Master (Bryan Cutler, 2016-05-04, 11 files, -316/+86)
  Remove history server functionality from the standalone Master. Previously, the Master process rebuilt a SparkUI once the application was completed, which sometimes caused problems, such as OOM, when the application event log is large (see SPARK-6270). Keeping this functionality out of the Master will help to simplify the process and increase stability. Testing for this change included running core unit tests and manually running an application on a standalone cluster to verify that it completed successfully and that the Master UI functioned correctly. Also added 2 unit tests to verify that killing an application and driver from MasterWebUI makes the correct request to the Master. Author: Bryan Cutler <cutlerb@gmail.com> Closes #10991 from BryanCutler/remove-history-master-SPARK-12299.
* [SPARK-15121] Improve logging of external shuffle handler (Thomas Graves, 2016-05-04, 1 file, -1/+3)
  ## What changes were proposed in this pull request? Add more informative logging in the external shuffle service to aid in debugging who is connecting to the YARN NodeManager when the external shuffle service runs under it. ## How was this patch tested? Ran and saw logs coming out in the log file. Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com> Closes #12900 from tgravescs/SPARK-15121.
* [SPARK-15126][SQL] RuntimeConfig.set should return Unit (Reynold Xin, 2016-05-04, 4 files, -16/+11)
  ## What changes were proposed in this pull request? Currently we return RuntimeConfig itself to facilitate chaining. However, it makes the output in interactive environments (e.g. notebooks, scala repl) weird because it'd show the response of calling set as a RuntimeConfig itself. ## How was this patch tested? Updated unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12902 from rxin/SPARK-15126.
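  A small sketch of the resulting call style (the config keys are just examples): since `set` no longer returns the `RuntimeConfig`, calls are written as separate statements rather than a chain, and the REPL prints nothing for them.
  ```scala
  // assumes an existing SparkSession named `spark`
  spark.conf.set("spark.sql.shuffle.partitions", "8")     // now returns Unit
  spark.conf.set("spark.sql.sources.default", "parquet")  // no chained builder-style return value
  println(spark.conf.get("spark.sql.shuffle.partitions"))
  ```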
* [SPARK-15103][SQL] Refactored FileCatalog class to allow StreamFileCatalog to infer partitioning (Tathagata Das, 2016-05-04, 8 files, -285/+410)
  ## What changes were proposed in this pull request? File Stream Sink writes the list of written files in a metadata log. StreamFileCatalog reads the list of the files for processing. However, StreamFileCatalog does not infer partitioning like HDFSFileCatalog. This PR enables that by refactoring HDFSFileCatalog to create an abstract class PartitioningAwareFileCatalog that has all the functionality to infer partitions from a list of leaf files. - HDFSFileCatalog has been renamed to ListingFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning. - StreamFileCatalog has been renamed to MetadataLogFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log. - The above two classes have been moved into their own files as they are not interfaces that should be in fileSourceInterfaces.scala. ## How was this patch tested? - FileStreamSinkSuite was updated to check whether partitioning gets inferred, and whether, on reading, the partitions get pruned correctly based on the query. - Other unit tests are unchanged and pass as expected. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12879 from tdas/SPARK-15103.
* [SPARK-15115][SQL] Reorganize whole stage codegen benchmark suites (Reynold Xin, 2016-05-04, 8 files, -422/+603)
  ## What changes were proposed in this pull request? We currently have a single suite that is very large, making it difficult to maintain and play with specific primitives. This patch reorganizes the file by creating multiple benchmark suites in a single package. Most of the changes are a straightforward move of code. On top of the code moving, I did: 1. Use SparkSession instead of SQLContext. 2. Turned most benchmark scenarios into their own test cases, rather than having multiple scenarios in a single test case, which takes forever to run. ## How was this patch tested? This is a test-only change. Author: Reynold Xin <rxin@databricks.com> Closes #12891 from rxin/SPARK-15115.
* [MINOR] Add python3 compatibility in python examples (Zheng RuiFeng, 2016-05-04, 2 files, -0/+8)
  ## What changes were proposed in this pull request? Add python3 compatibility in python examples ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12868 from zhengruifeng/fix_gmm_py.
* [SPARK-14951] [SQL] Support subexpression elimination in TungstenAggregate (Liang-Chi Hsieh, 2016-05-04, 4 files, -41/+109)
  ## What changes were proposed in this pull request? We can support subexpression elimination in TungstenAggregate by using the current `EquivalentExpressions`, which is already used in subexpression elimination for expression codegen. However, in whole-stage codegen, we can't wrap the common expressions' code in functions as before; we simply generate the code snippets for common expressions. These code snippets are inserted before the common expressions are actually used in the generated Java code. For multiple `TypedAggregateExpression`s used in an aggregation operator, their input types should be the same, so their `inputDeserializer` will be the same too. This patch can also reduce redundant input deserialization. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12729 from viirya/subexpr-elimination-tungstenaggregate.
* [SPARK-15109][SQL] Accept Dataset[_] in joins (Reynold Xin, 2016-05-04, 2 files, -8/+8)
  ## What changes were proposed in this pull request? This patch changes the join API in Dataset so it can accept any Dataset, rather than just DataFrames. ## How was this patch tested? N/A. Author: Reynold Xin <rxin@databricks.com> Closes #12886 from rxin/SPARK-15109.
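  A sketch of what the relaxed signature allows, joining an untyped DataFrame against a typed Dataset (the class, data, and column names are illustrative):
  ```scala
  // assumes an existing SparkSession named `spark`
  import spark.implicits._

  case class Person(name: String, age: Int)
  val people = Seq(Person("alice", 29), Person("bob", 31)).toDS()        // Dataset[Person]
  val salaries = Seq(("alice", 100), ("bob", 200)).toDF("name", "salary") // DataFrame

  // With this change the typed Dataset can be passed to join directly.
  val joined = salaries.join(people, "name")
  joined.show()
  ```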
* [SPARK-15022][SPARK-15023][SQL][STREAMING] Add support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `ManualClock` (Liwei Lin, 2016-05-04, 10 files, -34/+89)
  ## What changes were proposed in this pull request? Currently in `StreamTest`, we have a `StartStream` which will start a streaming query against the trigger `ProcessingTime(intervalMS = 0)` and `SystemClock`. We also need to test cases against `ProcessingTime(intervalMS > 0)`, which often requires `ManualClock`. This patch: - fixes an issue of `ProcessingTimeExecutor`, where for a batch it should run `batchRunner` only once but might run multiple times under certain conditions; - adds support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `AdvanceManualClock`, by specifying them as fields for `StartStream`, and by adding an `AdvanceClock` action; - adds a test, which takes advantage of the new `StartStream` and `AdvanceManualClock`, to test against [PR [SPARK-14942] Reduce delay between batch construction and execution](https://github.com/apache/spark/pull/12725). ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #12797 from lw-lin/add-trigger-test-support.
* [SPARK-4224][CORE][YARN] Support group acls (Dhruve Ashar, 2016-05-04, 11 files, -35/+468)
  ## What changes were proposed in this pull request? Currently only a list of users can be specified for view and modify acls. This change enables a group of admins/devs/users to be provisioned for viewing and modifying Spark jobs. **Changes proposed in the fix** Three new corresponding config entries have been added where the user can specify the groups to be given access: ``` spark.admin.acls.groups spark.modify.acls.groups spark.ui.view.acls.groups ``` New config entries were added because specifying the users and groups explicitly is a better and cleaner way compared to specifying them in the existing config entry using a delimiter. A generic trait has been introduced to provide the user-to-group mapping, which makes it pluggable to support a variety of mapping protocols, similar to the one used in Hadoop. A default Unix-shell-based implementation has been provided. A custom user-to-group mapping protocol can be specified and configured by the entry ```spark.user.groups.mapping``` **How the patch was tested** We ran different Spark jobs setting the config entries in combinations of admin, modify and ui acls. For modify acls we tried killing the job stages from the UI and using yarn commands. For view acls we tried accessing the UI tabs and the logs. Headless accounts were used to launch these jobs and different users tried to modify and view the jobs to ensure that the groups mapping applied correctly. Additional unit tests have been added without modifying the existing ones. These test for different ways of setting the acls through configuration and/or API and validate the expected behavior. Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #12760 from dhruve/impr/SPARK-4224.
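  A hedged sketch of how the new entries might be set on a SparkConf; the group names are made up, and the mapping-provider class name is quoted from memory rather than the PR, so treat it as an assumption to verify:
  ```scala
  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.admin.acls.groups", "spark-admins")         // hypothetical group name
    .set("spark.modify.acls.groups", "etl-devs")            // hypothetical group name
    .set("spark.ui.view.acls.groups", "analysts,auditors")  // hypothetical group names
    // Optional: plug in a user-to-group mapping provider (class name is an assumption;
    // the PR describes a default shell-based implementation).
    .set("spark.user.groups.mapping", "org.apache.spark.security.ShellBasedGroupsMappingProvider")
  ```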
* [SPARK-14844][ML] Add setFeaturesCol and setPredictionCol to KMeansModel (Dominik Jastrzębski, 2016-05-04, 2 files, -0/+23)
  ## What changes were proposed in this pull request? Introduction of setFeaturesCol and setPredictionCol methods to KMeansModel in the ML library. ## How was this patch tested? By running KMeansSuite. Author: Dominik Jastrzębski <dominik.jastrzebski@codilime.com> Closes #12609 from dominik-jastrzebski/master.
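  A minimal sketch of using the new model-level setters (the data and the "cluster" column name are illustrative; the estimator-level setters already existed):
  ```scala
  import org.apache.spark.ml.clustering.KMeans
  import org.apache.spark.ml.linalg.Vectors

  // assumes an existing SparkSession named `spark`
  val data = spark.createDataFrame(Seq(
    (Vectors.dense(0.0, 0.0), 0),
    (Vectors.dense(9.0, 9.0), 1)
  )).toDF("features", "id")

  val model = new KMeans().setK(2).setSeed(1L).fit(data)

  // New in this change: the fitted model itself exposes the column setters.
  val clustered = model
    .setFeaturesCol("features")
    .setPredictionCol("cluster")
    .transform(data)
  ```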
* [SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command (Cheng Lian, 2016-05-04, 12 files, -29/+131)
  ## What changes were proposed in this pull request? This PR implements native `DESC [EXTENDED | FORMATTED] <table>` DDL command. Sample output:
  ```
  scala> spark.sql("desc extended src").show(100, truncate = false)
  +----------------------------+---------------------------------+-------+
  |col_name |data_type |comment|
  +----------------------------+---------------------------------+-------+
  |key |int | |
  |value |string | |
  | | | |
  |# Detailed Table Information|CatalogTable(`default`.`src`, ...| |
  +----------------------------+---------------------------------+-------+
  scala> spark.sql("desc formatted src").show(100, truncate = false)
  +----------------------------+----------------------------------------------------------+-------+
  |col_name |data_type |comment|
  +----------------------------+----------------------------------------------------------+-------+
  |key |int | |
  |value |string | |
  | | | |
  |# Detailed Table Information| | |
  |Database: |default | |
  |Owner: |lian | |
  |Create Time: |Mon Jan 04 17:06:00 CST 2016 | |
  |Last Access Time: |Thu Jan 01 08:00:00 CST 1970 | |
  |Location: |hdfs://localhost:9000/user/hive/warehouse_hive121/src | |
  |Table Type: |MANAGED | |
  |Table Parameters: | | |
  | transient_lastDdlTime |1451898360 | |
  | | | |
  |# Storage Information | | |
  |SerDe Library: |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | |
  |InputFormat: |org.apache.hadoop.mapred.TextInputFormat | |
  |OutputFormat: |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat| |
  |Num Buckets: |-1 | |
  |Bucket Columns: |[] | |
  |Sort Columns: |[] | |
  |Storage Desc Parameters: | | |
  | serialization.format |1 | |
  +----------------------------+----------------------------------------------------------+-------+
  ```
  ## How was this patch tested? A test case is added to `HiveDDLSuite` to check command output. Author: Cheng Lian <lian@databricks.com> Closes #12844 from liancheng/spark-14127-desc-table.
* [SPARK-15029] Improve error message for Generate (Wenchen Fan, 2016-05-04, 11 files, -73/+126)
  ## What changes were proposed in this pull request? This PR improves the error message for `Generate` in 3 cases: 1. the generator is nested in expressions, e.g. `SELECT explode(list) + 1 FROM tbl` 2. the generator appears more than one time in SELECT, e.g. `SELECT explode(list), explode(list) FROM tbl` 3. the generator appears in another operator which is not a project, e.g. `SELECT * FROM tbl SORT BY explode(list)` ## How was this patch tested? new tests in `AnalysisErrorSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12810 from cloud-fan/bug.
* [SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations (Cheng Lian, 2016-05-04, 9 files, -53/+74)
  ## What changes were proposed in this pull request? Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication. A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`. Special data sources, like Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning, override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`. This PR brings two benefits: 1. Apparently, it de-duplicates partition value appending logic. 2. Now the reader function returned by `buildReader()` is only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`, because the safe-to-unsafe conversion is also performed while appending partition values. This makes 3rd-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`. ## How was this patch tested? Existing tests should do the work. Author: Cheng Lian <lian@databricks.com> Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending.
* [SPARK-15107][SQL] Allow varying # iterations by test case in Benchmark (Reynold Xin, 2016-05-03, 3 files, -67/+93)
  ## What changes were proposed in this pull request? This patch changes our micro-benchmark util to allow setting different iteration numbers for different test cases. For some of our benchmarks, turning off whole-stage codegen can make the runtime 20X slower, making it very difficult to run a large number of times without substantially shortening the input cardinality. With this change, I set the default num iterations to 2 for whole stage codegen off, and 5 for whole stage codegen on. I also updated some results. ## How was this patch tested? N/A - this is a test util. Author: Reynold Xin <rxin@databricks.com> Closes #12884 from rxin/SPARK-15107.
* [SPARK-15095][SQL] remove HiveSessionHook from ThriftServer (Davies Liu, 2016-05-03, 2 files, -57/+0)
  ## What changes were proposed in this pull request? Remove HiveSessionHook ## How was this patch tested? No tests needed. Author: Davies Liu <davies@databricks.com> Closes #12881 from davies/remove_hooks.
* [SPARK-14414][SQL] Make DDL exceptions more consistent (Andrew Or, 2016-05-03, 20 files, -435/+141)
  ## What changes were proposed in this pull request? Just a bunch of small tweaks on DDL exception messages. ## How was this patch tested? `DDLCommandSuite` et al. Author: Andrew Or <andrew@databricks.com> Closes #12853 from andrewor14/make-exceptions-consistent.
* [SPARK-15097][SQL] Make Dataset.sqlContext a stable identifier for imports (Koert Kuipers, 2016-05-03, 2 files, -1/+15)
  ## What changes were proposed in this pull request? Make Dataset.sqlContext a lazy val so that it's a stable identifier and can be used for imports. Now this works again: import someDataset.sqlContext.implicits._ ## How was this patch tested? Added a unit test to DatasetSuite that uses the import shown above. Author: Koert Kuipers <koert@tresata.com> Closes #12877 from koertkuipers/feat-sqlcontext-stable-import.
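  A minimal sketch of the import pattern this change restores (the dataset name and the toDF columns are arbitrary):
  ```scala
  // assumes an existing SparkSession named `spark`
  val someDataset = spark.range(5)

  // Works again because sqlContext is now a lazy val, i.e. a stable identifier:
  import someDataset.sqlContext.implicits._

  val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
  df.show()
  ```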
* [SPARK-15084][PYTHON][SQL] Use builder pattern to create SparkSession in PySpark (Dongjoon Hyun, 2016-05-03, 2 files, -21/+105)
  ## What changes were proposed in this pull request? This is a Python port of the corresponding Scala builder pattern code. `sql.py` is modified as a target example case. ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12860 from dongjoon-hyun/SPARK-15084.
* [SPARK-14645][MESOS] Fix Python running on Mesos cluster mode to have non-local URIs (Timothy Chen, 2016-05-03, 1 file, -2/+3)
  ## What changes were proposed in this pull request? Fix SparkSubmit to allow non-local Python URIs. ## How was this patch tested? Manually tested with mesos-spark-dispatcher. Author: Timothy Chen <tnachen@gmail.com> Closes #12403 from tnachen/enable_remote_python.
* [SPARK-14422][SQL] Improve handling of optional configs in SQLConf (Sandeep Singh, 2016-05-03, 4 files, -10/+25)
  ## What changes were proposed in this pull request? Create a new API for handling optional configs in SQLConf. Right now `getConf` for `OptionalConfigEntry[T]` returns a value of type `T` and throws an exception if it doesn't exist. Add a new method `getOptionalConf` (suggestions on naming welcome) which returns a value of type `Option[T]` (so if the entry doesn't exist it returns `None`). ## How was this patch tested? Added a test and ran tests locally. Author: Sandeep Singh <sandeep@techaddict.me> Closes #12846 from techaddict/SPARK-14422.
* [MINOR][DOC] Fixed some Python snippets in MLlib data types documentation (Shuai Lin, 2016-05-03, 1 file, -3/+3)
  ## What changes were proposed in this pull request? Some Python snippets are using Scala imports and comments. ## How was this patch tested? Generated the docs locally with `SKIP_API=1 jekyll build` and viewed the changes in the browser. Author: Shuai Lin <linshuai2012@gmail.com> Closes #12869 from lins05/fix-mllib-python-snippets.
* [SPARK-15104] Fix spacing in log line (Andrew Ash, 2016-05-03, 1 file, -1/+1)
  Otherwise we get logs that look like this (note there is no space before NODE_LOCAL): ``` INFO [2016-05-03 21:18:51,477] org.apache.spark.scheduler.TaskSetManager: Starting task 0.0 in stage 101.0 (TID 7029, localhost, partition 0,NODE_LOCAL, 1894 bytes) ``` Author: Andrew Ash <andrew@andrewash.com> Closes #12880 from ash211/patch-7.
* [SPARK-15102][SQL] remove delegation token support from ThriftServer (Davies Liu, 2016-05-03, 1 file, -58/+7)
  ## What changes were proposed in this pull request? These APIs are only useful for Hadoop and may not work for Spark SQL. The APIs are kept for source compatibility. ## How was this patch tested? No unit tests needed. Author: Davies Liu <davies@databricks.com> Closes #12878 from davies/remove_delegate.
* [SPARK-15056][SQL] Parse Unsupported Sampling Syntax and Issue Better Exceptions (gatorsmile, 2016-05-03, 3 files, -3/+22)
  #### What changes were proposed in this pull request? Compared with the current Spark parser, there are two extra sampling syntaxes supported in Hive: - In `ON` clauses, `rand()` is used to indicate sampling on the entire row instead of an individual column. For example, ```SQL SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s; ``` - Users can specify the total length to be read. For example, ```SQL SELECT * FROM source TABLESAMPLE(100M) s; ``` Below is the link for reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling This PR is to parse and capture these two extra syntaxes, and issue a better error message. #### How was this patch tested? Added test cases to verify the thrown exceptions. Author: gatorsmile <gatorsmile@gmail.com> Closes #12838 from gatorsmile/bucketOnRand.
* [SPARK-14973][ML] The CrossValidator and TrainValidationSplit miss the seed when saving and loading (yinxusen, 2016-05-03, 5 files, -18/+30)
  ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14973 Add seed support when saving/loading CrossValidator and TrainValidationSplit. ## How was this patch tested? Spark unit test. Author: yinxusen <yinxusen@gmail.com> Closes #12825 from yinxusen/SPARK-14973.
* [SPARK-15095][SQL] drop binary mode in ThriftServer (Davies Liu, 2016-05-03, 3 files, -47/+14)
  ## What changes were proposed in this pull request? This PR drops the support for binary mode in ThriftServer; only HTTP mode is supported now, to reduce the maintenance burden. The code to support binary mode is still kept, just in case we want it in the future. ## How was this patch tested? Updated tests to use HTTP mode. Author: Davies Liu <davies@databricks.com> Closes #12876 from davies/hide_binary.
* [SPARK-15073][SQL] Hide SparkSession constructor from the public (Andrew Or, 2016-05-03, 4 files, -12/+19)
  ## What changes were proposed in this pull request? Users should use the builder pattern instead. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #12873 from andrewor14/spark-session-constructor.
* [SPARK-11316] coalesce doesn't handle UnionRDD with partial locality properly (Thomas Graves, 2016-05-03, 2 files, -62/+165)
  ## What changes were proposed in this pull request? coalesce doesn't handle a UnionRDD with partial locality properly. I had a user who had a UnionRDD that was made up of a mapPartitions RDD without preferred locations and a checkpointed RDD with preferred locations (reading from HDFS). It took the driver over 20 minutes to set up the groups and put the partitions into those groups before it even started any tasks. Perhaps even worse, it didn't end up with the number of partitions he was asking for because it didn't put a partition in each of the groups properly. The changes in this patch get rid of an n^2 while loop that was causing the 20 minutes, properly distribute the partitions to have at least one per group, and change from using the rotation iterator, which fetched the preferred locations many times, to getting all the preferred locations once up front. Note that the n^2 while loop that I removed in setupGroups took so long because all of the partitions with preferred locations were already assigned to a group, so it basically looped through every single one and wasn't ever able to assign it. At the time I had 960 partitions with preferred locations and 1020 without, and it ran the outer while loop 319 times because that is the # of groups left to create. Note that each of those times through the inner while loop is going off to HDFS to get the block locations, so this is extremely inefficient. ## How was this patch tested? Added unit tests for this case and ran existing ones that applied to make sure no regressions. Also manually tested on the user's production job to make sure it fixed their issue. It created the proper number of partitions and now it takes about 6 seconds rather than 20 minutes. I also ran some basic manual tests with spark-shell doing coalesce to a smaller number, the same number, and then a greater number with shuffle. Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com> Closes #11327 from tgravescs/SPARK-11316.
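  A rough sketch of the scenario described above (the path and sizes are illustrative): a union of an RDD without preferred locations and an HDFS-backed RDD with locations, coalesced down to fewer partitions.
  ```scala
  // assumes an existing SparkContext named `sc`
  val noLocality = sc.parallelize(1 to 100000, 960)              // in-memory RDD, no preferred locations
  val withLocality = sc.textFile("hdfs:///user/example/events")  // hypothetical path; has HDFS block locations
    .map(_.length)

  val unioned = noLocality.union(withLocality)
  // Previously this setup step could take many minutes and produce uneven groups;
  // with the fix the partitions are distributed across the requested number of groups.
  val coalesced = unioned.coalesce(500)
  println(coalesced.getNumPartitions)
  ```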
* [SPARK-14521] [SQL] StackOverflowError in Kryo when executing TPC-DS (yzhou2001, 2016-05-03, 2 files, -42/+129)
  ## What changes were proposed in this pull request? Observed a StackOverflowError in Kryo when executing TPC-DS Query 27. The Spark thrift server disables kryo reference tracking (if not specified in conf). When "spark.kryo.referenceTracking" is set to true explicitly in spark-defaults.conf, the query executes successfully. The root cause is that the TaskMemoryManager inside MemoryConsumer and LongToUnsafeRowMap were not transient and thus were serialized and broadcast around from within LongHashedRelation, which could potentially cause circular references inside Kryo. But the TaskMemoryManager is per task and should not be passed around in the first place. This fix makes it transient. ## How was this patch tested? core/test, hive/test, sql/test, catalyst/test, dev/lint-scala, org.apache.spark.sql.hive.execution.HiveCompatibilitySuite, dev/scalastyle, manual test of TPC-DS Query 27 with 1GB data but without the "limit 100", which would cause an NPE due to SPARK-14752. Author: yzhou2001 <yzhou_1999@yahoo.com> Closes #12598 from yzhou2001/master.
* [SPARK-14234][CORE] Executor crashes for TaskRunner thread interruption (Devaraj K, 2016-05-03, 1 file, -1/+25)
  ## What changes were proposed in this pull request? Resetting the task interruption status before updating the task status. ## How was this patch tested? I have verified it manually by running multiple applications; with the patch changes the executor doesn't crash and updates the status to the driver without any exceptions. Author: Devaraj K <devaraj@apache.org> Closes #12031 from devaraj-kavali/SPARK-14234.
* [SPARK-15059][CORE] Remove fine-grained lock in ChildFirstURLClassLoader to avoid deadlock (Zheng Tan, 2016-05-03, 1 file, -26/+5)
  ## What changes were proposed in this pull request? In some cases, the fine-grained lock has a race condition with the class-loader lock and has caused deadlock issues. It is safe to drop this fine-grained lock and load all classes under the single class-loader lock. Author: Zheng Tan <zheng.tan@hulu.com> Closes #12857 from tankkyo/master.
* [SPARK-15082][CORE] Improve unit test coverage for AccumulatorV2 (Sandeep Singh, 2016-05-03, 1 file, -1/+60)
  ## What changes were proposed in this pull request? Added tests for ListAccumulator and LegacyAccumulatorWrapper; the ListAccumulator test is similar to the old collection accumulator tests. ## How was this patch tested? Ran tests locally. cc rxin Author: Sandeep Singh <sandeep@techaddict.me> Closes #12862 from techaddict/SPARK-15082.
* [SPARK-9819][STREAMING][DOCUMENTATION] Clarify doc for invReduceFunc in incremental versions of reduceByWindow (François Garillot, 2016-05-03, 5 files, -5/+11)
  - that reduceFunc and invReduceFunc should be associative - that the intermediate result in iterated applications of inverseReduceFunc is its first argument Author: François Garillot <francois@garillot.net> Closes #8103 from huitseeker/issue/invReduceFuncDoc.
* [SPARK-15087][CORE][SQL] Remove AccumulatorV2.localValue and keep only value (Sandeep Singh, 2016-05-03, 11 files, -38/+30)
  ## What changes were proposed in this pull request? Remove AccumulatorV2.localValue and keep only value ## How was this patch tested? existing tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #12865 from techaddict/SPARK-15087.
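  A small sketch of the caller-side effect (the accumulator name is illustrative): reads now go through `value` only.
  ```scala
  // assumes an existing SparkContext named `sc`
  val recordCount = sc.longAccumulator("records")   // AccumulatorV2-based accumulator
  sc.parallelize(1 to 1000).foreach(_ => recordCount.add(1))
  println(recordCount.value)                        // localValue is gone; value is the single accessor
  ```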
* [SPARK-14860][TESTS] Create a new Waiter in reset to bypass an issue of ScalaTest's Waiter.wait (Shixiong Zhu, 2016-05-03, 1 file, -9/+3)
  ## What changes were proposed in this pull request? This PR updates `QueryStatusCollector.reset` to create a new Waiter instead of calling `await(1 milliseconds)`, to bypass a ScalaTest issue where Waiter.await may block forever. ## How was this patch tested? I created a local stress test to call the code in `test("event ordering")` 100 times. It cannot pass without this patch. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12623 from zsxwing/flaky-test.
* [SPARK-14716][SQL] Added support for partitioning in FileStreamSink (Tathagata Das, 2016-05-03, 9 files, -54/+605)
  # What changes were proposed in this pull request? Support partitioning in the file stream sink. This is implemented using a new, but simpler, code path for writing parquet files - both unpartitioned and partitioned. This new code path does not use Output Committers, as we will eventually write the file names to the metadata log for "committing" them. This patch duplicates < 100 LOC from the WriterContainer. But it's far simpler than WriterContainer, as it does not involve output committing. In addition, it introduces new APIs in FileFormat and OutputWriterFactory in an attempt to simplify the APIs (not have Job in the `FileFormat` API, not have bucket and other stuff in `OutputWriterFactory.newInstance()`). # Tests - New unit tests to test the FileStreamSinkWriter for partitioned and unpartitioned files - New unit test to partially test the FileStreamSink for partitioned files (does not test recovery of partition column data, as that requires a change in the StreamFileCatalog, future PR). - Updated FileStressSuite to test the number of records read from partitioned output files. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12409 from tdas/streaming-partitioned-parquet.
* [SPARK-14884][SQL][STREAMING][WEBUI] Fix call site for continuous queries (Liwei Lin, 2016-05-03, 2 files, -4/+15)
  ## What changes were proposed in this pull request? Since we've been processing continuous queries in separate threads, the call sites are then `run at <unknown>:0`. It's not wrong but provides very little information; in addition, we cannot distinguish two queries from their call sites alone. This patch fixes this. ### Before [Jobs Tab] ![s1a](https://cloud.githubusercontent.com/assets/15843379/14766101/a47246b2-0a30-11e6-8d81-06a9a600113b.png) [SQL Tab] ![s1b](https://cloud.githubusercontent.com/assets/15843379/14766102/a4750226-0a30-11e6-9ada-773d977d902b.png) ### After [Jobs Tab] ![s2a](https://cloud.githubusercontent.com/assets/15843379/14766104/a89705b6-0a30-11e6-9830-0d40ec68527b.png) [SQL Tab] ![s2b](https://cloud.githubusercontent.com/assets/15843379/14766103/a8966728-0a30-11e6-8e4d-c2e326400478.png) ## How was this patch tested? Manual checks - see screenshots above. Author: Liwei Lin <lwlin7@gmail.com> Closes #12650 from lw-lin/fix-call-site.
* [SPARK-15088] [SQL] Remove SparkSqlSerializer (Reynold Xin, 2016-05-03, 2 files, -118/+0)
  ## What changes were proposed in this pull request? This patch removes SparkSqlSerializer. I believe this is now dead code. ## How was this patch tested? Removed a test case related to it. Author: Reynold Xin <rxin@databricks.com> Closes #12864 from rxin/SPARK-15088.
* [SPARK-15091][SPARKR] Fix warnings and a failure in SparkR test cases with testthat version 1.0.1 (Sun Rui, 2016-05-03, 4 files, -10/+9)
  ## What changes were proposed in this pull request? Fix warnings and a failure in SparkR test cases with testthat version 1.0.1 ## How was this patch tested? SparkR unit test cases. Author: Sun Rui <sunrui2016@gmail.com> Closes #12867 from sun-rui/SPARK-15091.
* [SPARK-14971][ML][PYSPARK] PySpark ML Params setter code clean up (Yanbo Liang, 2016-05-03, 10 files, -219/+110)
  ## What changes were proposed in this pull request? PySpark ML Params setter code clean up. For example, ```setInputCol``` can be simplified from ``` self._set(inputCol=value) return self ``` to: ``` return self._set(inputCol=value) ``` This is a pretty big sweep, and we cleaned up wherever possible. ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #12749 from yanboliang/spark-14971.
* [SPARK-15057][GRAPHX] Remove stale TODO comment for making `enum` in GraphGenerators (Dongjoon Hyun, 2016-05-03, 1 file, -1/+0)
  ## What changes were proposed in this pull request? This PR removes a stale TODO comment in `GraphGenerators.scala` ## How was this patch tested? Just comment removed. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12839 from dongjoon-hyun/SPARK-15057.
* [SPARK-14897][CORE] Upgrade Jetty to latest version of 8 (Sean Owen, 2016-05-03, 1 file, -1/+1)
  ## What changes were proposed in this pull request? Update Jetty 8.1 to the latest 2016/02 release, from a 2013/10 release, for security and bug fixes. This does not necessarily resolve the JIRA, as it's still worth considering an update to 9.3. ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #12842 from srowen/SPARK-14897.
* [SPARK-15081] Move AccumulatorV2 and subclasses into util package (Reynold Xin, 2016-05-03, 30 files, -34/+44)
  ## What changes were proposed in this pull request? This patch moves AccumulatorV2 and subclasses into util package. ## How was this patch tested? Updated relevant tests. Author: Reynold Xin <rxin@databricks.com> Closes #12863 from rxin/SPARK-15081.
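  A minimal sketch of the import-level impact for user code that references the classes directly (the accumulator usage is illustrative; before this change the classes lived under a different package):
  ```scala
  // After this change the AccumulatorV2 family is imported from org.apache.spark.util.
  import org.apache.spark.util.{AccumulatorV2, LongAccumulator}

  val acc = new LongAccumulator()
  acc.add(3L)
  assert(acc.value == 3L)
  ```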