Commit message | Author | Age | Files | Lines
* [SPARK-14410][SQL] Push functions existence check into catalogAndrew Or2016-04-0713-114/+126
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This is a followup to #12117 and addresses some of the TODOs introduced there. In particular, the resolution of database is now pushed into session catalog, which knows about the current database. Further, the logic for checking whether a function exists is pushed into the external catalog. No change in functionality is expected. ## How was this patch tested? `SessionCatalogSuite`, `DDLSuite` Author: Andrew Or <andrew@databricks.com> Closes #12198 from andrewor14/function-exists.
* [SPARK-12740] [SPARK-13932] support grouping()/grouping_id() in having/order ↵Davies Liu2016-04-073-56/+211
| | | | | | | | | | | | | | | | | | | | | | clause ## What changes were proposed in this pull request? This PR adds support for using grouping()/grouping_id() in the HAVING and ORDER BY clauses. The resolved grouping()/grouping_id() is replaced by the unresolved "spark_grouping_id" virtual attribute, which is then resolved by ResolveMissingAttribute. This PR also fixes a HAVING clause that accesses a grouping column that is not present in the SELECT clause, for example: ```sql select count(1) from (select 1 as a) t group by a having a > 0 ``` ## How was this patch tested? Add new tests. Author: Davies Liu <davies@databricks.com> Closes #12235 from davies/grouping_having.
* [SPARK-14456][SQL][MINOR] Remove unused variables and logics in DataSourceKousuke Saruta2016-04-071-10/+0
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? In DataSource#write method, the variables `dataSchema` and `equality`, and related logics are no longer used. Let's remove them. ## How was this patch tested? Existing tests. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #12237 from sarutak/SPARK-14456.
* [SQL][TESTS] Fix for flaky test in ContinuousQueryManagerSuiteTathagata Das2016-04-071-2/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? The timeouts were lower than the other timeouts in the test. Other tests were stable over the last month. ## How was this patch tested? Jenkins tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12219 from tdas/flaky-test-fix.
* [SPARK-12384] Enables spark-clients to set the min(-Xms) and max(*.memory ↵Dhruve Ashar2016-04-078-20/+35
| | | | | | | | | | | | | | | | | | config) j… ## What changes were proposed in this pull request? Currently Spark clients are started with the same memory setting for Xms and Xmx, which reserves unnecessarily large amounts of memory. This behavior is changed, and the clients can now specify an initial heap size using the extraJavaOptions in the config for the driver, executor and AM individually. Note that only -Xms can be provided through this config option; if the client wants to set the max size (-Xmx), this has to be done via the *.memory configuration knobs which are currently supported. ## How was this patch tested? Monitored executor and yarn logs in debug mode to verify the commands through which they are being launched in client and cluster mode. The driver memory was verified locally using jps -v. Setting the -Xmx parameter in the extraJavaOptions raises an exception with the info provided. Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #12115 from dhruve/impr/SPARK-12384.
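The rule in this commit (initial heap via extraJavaOptions, maximum heap only via the *.memory settings) can be illustrated with a small standalone check. This is a minimal sketch in Python, not Spark's launcher code; the function names `validate_extra_java_options` and `build_jvm_args` and the error text are hypothetical.

```python
import shlex
from typing import List


def validate_extra_java_options(extra_java_options: str) -> List[str]:
    """Parse JVM options and reject any attempt to set the max heap.

    Hypothetical helper: -Xms may be passed through extraJavaOptions,
    while -Xmx must come from the *.memory settings.
    """
    options = shlex.split(extra_java_options)
    for opt in options:
        if opt.startswith("-Xmx"):
            raise ValueError(
                "Max heap (-Xmx) is not allowed in Java options: %s. "
                "Use the corresponding *.memory setting instead." % opt
            )
    return options


def build_jvm_args(memory: str, extra_java_options: str = "") -> List[str]:
    """Compose JVM arguments: -Xmx from the memory setting, the rest from the options."""
    return ["-Xmx" + memory] + validate_extra_java_options(extra_java_options)
```

With these assumptions, `build_jvm_args("4g", "-Xms1g")` yields a JVM that starts with a 1g heap and may grow to 4g, while passing `-Xmx8g` in the options raises an error.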
* [SPARK-14245][WEB UI] Display the user in the application viewAlex Bozarth2016-04-073-0/+10
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The Spark UI (both active and history) should show the user who ran the application somewhere when you are in the application view. This was added under the Jobs view by total uptime and scheduler mode. ## How was this patch tested? Manual testing <img width="191" alt="username" src="https://cloud.githubusercontent.com/assets/13952758/14222830/6d1fe542-f82a-11e5-885f-c05ee2cdf857.png"> Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #12123 from ajbozarth/spark14245.
* Better host description for multi-master mesosMalte2016-04-071-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Since not having the correct zk url causes job failure, the documentation should include all parameters ## How was this patch tested? no tests necessary Author: Malte <elmalto@users.noreply.github.com> Closes #12218 from elmalto/patch-1.
* [SPARK-10063][SQL] Remove DirectParquetOutputCommitterReynold Xin2016-04-076-224/+5
| | | | | | | | | | | | ## What changes were proposed in this pull request? This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer only works when there are no failures. If there are multiple attempts of the same task (e.g. speculation or task failures or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue. ## How was this patch tested? Removed the related tests also. Author: Reynold Xin <rxin@databricks.com> Closes #12229 from rxin/SPARK-10063.
* [SPARK-14452][SQL] Explicit APIs in Scala for specifying encodersReynold Xin2016-04-073-236/+327
| | | | | | | | | | | | ## What changes were proposed in this pull request? The Scala Dataset public API currently only allows users to specify encoders through SQLContext.implicits. This is OK but sometimes people want to explicitly get encoders without a SQLContext (e.g. Aggregator implementations). This patch adds public APIs to Encoders class for getting Scala encoders. ## How was this patch tested? None - I will update test cases once https://github.com/apache/spark/pull/12231 is merged. Author: Reynold Xin <rxin@databricks.com> Closes #12232 from rxin/SPARK-14452.
* [SPARK-14134][CORE] Change the package name used for shading classes.Marcelo Vanzin2016-04-0619-28/+27
| | | | | | | | | | | | | | | The current package name uses a dash, which is a little weird but seemed to work. That is, until a new test tried to mock a class that references one of those shaded types, and then things started failing. Most changes are just noise to fix the logging configs. For reference, SPARK-8815 also raised this issue, although at the time it did not cause any issues in Spark, so it was not addressed. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11941 from vanzin/SPARK-14134.
* [SPARK-12610][SQL] Left Anti JoinHerman van Hovell2016-04-0616-108/+231
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ### What changes were proposed in this pull request? This PR adds support for `LEFT ANTI JOIN` to Spark SQL. A `LEFT ANTI JOIN` is the exact opposite of a `LEFT SEMI JOIN` and can be used to identify rows in one dataset that are not in another dataset. Note that `nulls` on the left side of the join cannot match a row on the right hand side of the join; the result is that left anti join will always select a row with a `null` in one or more of its keys. We currently add support for the following SQL join syntax: SELECT * FROM tbl1 A LEFT ANTI JOIN tbl2 B ON A.Id = B.Id Or using a dataframe: tbl1.as("a").join(tbl2.as("b"), $"a.id" === $"b.id", "left_anti") This PR serves as the basis for implementing `NOT EXISTS` and `NOT IN (...)` correlated sub-queries. It would also serve as a good basis for implementing a more efficient `EXCEPT` operator. The PR has been (loosely) based on PRs by both davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/10563); credit should be given where credit is due. This PR adds support for `LEFT ANTI JOIN` to `BroadcastHashJoin` (including code generation), `ShuffledHashJoin` and `BroadcastNestedLoopJoin`. ### How was this patch tested? Added tests to `JoinSuite` and ported `ExistenceJoinSuite` from https://github.com/apache/spark/pull/10563. cc davies chenghao-intel rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12214 from hvanhovell/SPARK-12610.
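The join semantics described in this commit, including the treatment of nulls, can be sketched without Spark. The following is a minimal Python illustration on lists of dicts; `left_anti_join` is a hypothetical helper, not part of any Spark API, and `None` stands in for SQL `NULL`.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[int]]


def left_anti_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Keep the left rows for which no right row has an equal, non-null key."""
    # SQL equality never matches NULL, so null keys on the right are ignored.
    right_keys = {r[key] for r in right if r[key] is not None}
    # A left row with a null key can never match, so it is always kept.
    return [row for row in left if row[key] is None or row[key] not in right_keys]
```

For example, with left ids `1, 2, NULL` and right ids `2, NULL`, the rows with id `1` and id `NULL` survive: `2` has a match, and the null key never matches anything.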
* [SPARK-14446][TESTS] Fix ReplSuite for Scala 2.10.Marcelo Vanzin2016-04-061-1/+1
| | | | | | | | Just use the same test code as the 2.11 version, which seems to pass. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #12223 from vanzin/SPARK-14446.
* [SPARK-12555][SQL] Result should not be corrupted after input columns are ↵Luciano Resende2016-04-071-0/+19
| | | | | | | | | | reordered This PR adds the test case described in SPARK-12555 to validate that correct data is returned when input data is reordered and to avoid future regressions. Author: Luciano Resende <lresende@apache.org> Closes #11623 from lresende/SPARK-12555.
* [SPARK-12382][ML] Remove mllib GBT implementation and wrap mlsethah2016-04-066-226/+207
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch removes the implementation of gradient boosted trees in mllib/tree/GradientBoostedTrees.scala and changes mllib GBTs to call the implementation in spark.ML. Primary changes: * Removed `boost` method in mllib GradientBoostedTrees.scala * Created new test suite GradientBoostedTreesSuite in ML, which contains unit tests that were specific to GBT internals from mllib Other changes: * Added an `updatePrediction` method in GradientBoostedTrees package. This method is added to provide consistency for methods that build predictions from boosted models. There are several methods that hard code the method of predicting as: sum_{i=1}^{numTrees} (treePrediction*treeWeight). Calling this function ensures that test methods that check accuracy use the same prediction method that the algorithm uses during training * Added methods that were previously only used in testing, but were public methods, to GradientBoostedTrees. This includes `computeError` (previously part of `Loss` trait) and `evaluateEachIteration`. These are used in the new spark.ML unit tests. They are left in mllib as well so as to not break the API. ## How was this patch tested? Existing unit tests which compare ML and MLlib ensure that mllib GBTs have not changed. Only a single unit test was moved to ML, which verifies that `runWithValidation` performs as expected. Author: sethah <seth.hendrickson16@gmail.com> Closes #12050 from sethah/SPARK-12382.
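The prediction rule quoted in this commit, sum_{i=1}^{numTrees} (treePrediction*treeWeight), can be written as a running update. The sketch below is plain Python with trees modelled as callables; the signatures are simplified assumptions and do not reproduce the Scala methods in spark.ML.

```python
from typing import Callable, Sequence

Tree = Callable[[Sequence[float]], float]


def update_prediction(features: Sequence[float], prediction: float,
                      tree: Tree, weight: float) -> float:
    """Add one tree's weighted contribution to the running prediction."""
    return prediction + tree(features) * weight


def predict(features: Sequence[float], trees: Sequence[Tree],
            tree_weights: Sequence[float]) -> float:
    """Sum treePrediction * treeWeight over all trees via the shared update."""
    prediction = 0.0
    for tree, weight in zip(trees, tree_weights):
        prediction = update_prediction(features, prediction, tree, weight)
    return prediction
```

Routing both training and evaluation through one update function is the design point the commit makes: accuracy checks then use exactly the rule the algorithm used while boosting.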
* [SPARK-14436][SQL] Make JavaDatasetAggregatorSuiteBase public.Marcelo Vanzin2016-04-062-53/+83
| | | | | | | | | | Without this, unit tests that extend that class fail for me locally on maven, because JUnit tries to run methods in that class and gets an IllegalAccessError. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #12212 from vanzin/SPARK-14436.
* [SPARK-13112][CORE] Make sure RegisterExecutorResponse arrive before LaunchTaskShixiong Zhu2016-04-064-10/+16
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Send `RegisterExecutorResponse` using `executorRef` in order to make sure RegisterExecutorResponse and LaunchTask are both sent using the same channel. Then RegisterExecutorResponse will always arrive before LaunchTask ## How was this patch tested? Existing unit tests Closes #12078 Author: Shixiong Zhu <shixiong@databricks.com> Closes #12211 from zsxwing/SPARK-13112.
* [SPARK-14290][CORE][NETWORK] avoid significant memory copy in netty's transferToZhang, Liye2016-04-061-1/+29
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? When netty transfers data that is not a `FileRegion`, the data is in the form of a `ByteBuf`. If the data is large, there is a significant performance issue because of the memory copy underlying `sun.nio.ch.IOUtil.write`: the CPU is 100% used and network throughput is very low. In this PR, if the data size is large, we split it into small chunks when calling `WritableByteChannel.write()`, which avoids the wasted memory copy. This matters because the data can't be written within a single write, so `transferTo` is called multiple times. ## How was this patch tested? Spark unit test and manual test. Manual test: `sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length` For more details, please refer to [SPARK-14290](https://issues.apache.org/jira/browse/SPARK-14290) Author: Zhang, Liye <liye.zhang@intel.com> Closes #12083 from liyezhang556520/spark-14290.
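The chunking idea in this commit can be sketched independently of netty. Below is a minimal Python illustration that hands a writer at most a fixed number of bytes per call; `write_in_chunks` and the 256 KiB `CHUNK_LIMIT` are assumptions made for the example, and `channel` is any object with a `write()` method that returns the number of bytes accepted.

```python
import io

CHUNK_LIMIT = 256 * 1024  # assumed per-write limit in bytes


def write_in_chunks(channel, data: bytes, chunk_limit: int = CHUNK_LIMIT) -> int:
    """Write `data` in slices of at most `chunk_limit` bytes; return bytes written."""
    view = memoryview(data)  # slicing a memoryview does not copy the payload
    written = 0
    while written < len(view):
        chunk = view[written:written + chunk_limit]
        accepted = channel.write(chunk)
        if not accepted:
            # The channel cannot take more right now; the caller retries later.
            break
        written += accepted
    return written
```

Usage: `write_in_chunks(io.BytesIO(), payload)` writes the whole payload, while a channel that accepts only part of it leaves the remainder for a later call, which is the situation the commit describes for repeated `transferTo` calls.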
* [SPARK-14444][BUILD] Add a new scalastyle `NoScalaDoc` to prevent ↵Dongjoon Hyun2016-04-0614-44/+59
| | | | | | | | | | | | | | | | | | | | | ScalaDoc-style multiline comments ## What changes were proposed in this pull request? According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the following. ``` /** In Spark, we don't use the ScalaDoc style so this * is not correct. */ ``` ## How was this patch tested? Pass the Jenkins tests (including `lint-scala`). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12221 from dongjoon-hyun/SPARK-14444.
* [SPARK-14424][BUILD][DOCS] Update the build docs to switch from assembly to ↵Holden Karau2016-04-062-11/+4
| | | | | | | | | | | | | | | | | package and add a no… ## What changes were proposed in this pull request? Change our build docs & shell scripts so that developers are aware of the change from "assembly" to "package" ## How was this patch tested? Manually ran ./bin/spark-shell after ./build/sbt assembly and verified error message printed, ran new suggested build target and verified ./bin/spark-shell runs after this. Author: Holden Karau <holden@pigscanfly.ca> Author: Holden Karau <holden@us.ibm.com> Closes #12197 from holdenk/SPARK-1424-spark-class-broken-fix-build-docs.
* [SPARK-12133][STREAMING] Streaming dynamic allocationTathagata Das2016-04-068-3/+683
| | | | | | | | | | | | | ## What changes were proposed in this pull request? Added a new Executor Allocation Manager for the Streaming scheduler for doing Streaming Dynamic Allocation. ## How was this patch tested Unit tests, and cluster tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12154 from tdas/streaming-dynamic-allocation.
* [SPARK-14391][LAUNCHER] Increase test timeouts.Marcelo Vanzin2016-04-061-3/+3
| | | | | | | | | | Most of the time tests should still pass really quickly; it's just when machines are overloaded that the tests may take a little time, but that's still preferable over just failing the test. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #12210 from vanzin/SPARK-14391.
* [SPARK-14224] [SPARK-14223] [SPARK-14310] [SQL] fix RowEncoder and parquet ↵Davies Liu2016-04-0613-234/+267
| | | | | | | | | | | | | | | | | | | | | reader for wide table ## What changes were proposed in this pull request? 1) Fix the RowEncoder for wide tables (many columns) by splitting the generated code into multiple functions. 2) Separate DataSourceScan into RowDataSourceScan and BatchedDataSourceScan. 3) Disable returning columnar batches in the parquet reader if there are many columns. 4) Add an internal config for the maximum number of fields (including nested columns) supported by whole stage codegen. Closes #12098 ## How was this patch tested? Add a test for a table with 1000 columns. Author: Davies Liu <davies@databricks.com> Closes #12047 from davies/many_columns.
* [SPARK-14382][SQL] QueryProgress should be post after committedOffsets is ↵Shixiong Zhu2016-04-062-12/+6
| | | | | | | | | | | | | | | | | | updated ## What changes were proposed in this pull request? Make sure QueryProgress is posted after committedOffsets is updated. If QueryProgress is posted before committedOffsets is updated, the listener may see a wrong sinkStatus (created from committedOffsets). See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/644/testReport/junit/org.apache.spark.sql.util/ContinuousQueryListenerSuite/single_listener/ for an example of the failure. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12155 from zsxwing/SPARK-14382.
* [SPARK-13430][PYSPARK][ML] Python API for training summaries of linear and ↵Bryan Cutler2016-04-067-30/+602
| | | | | | | | | | | | | | | logistic regression ## What changes were proposed in this pull request? Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML. ## How was this patch tested? Added unit tests to exercise the api calls for the summary classes. Also, manually verified values are expected and match those from Scala directly. Author: Bryan Cutler <cutlerb@gmail.com> Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
* [SPARK-14320][SQL] Make ColumnarBatch.Row mutableSameer Agarwal2016-04-065-8/+135
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? In order to leverage a data structure like `AggregateHashMap` (https://github.com/apache/spark/pull/12055) to speed up aggregates with keys, we need to make `ColumnarBatch.Row` mutable. ## How was this patch tested? Unit test in `ColumnarBatchSuite`. Also, tested via `BenchmarkWholeStageCodegen`. Author: Sameer Agarwal <sameer@databricks.com> Closes #12103 from sameeragarwal/mutable-row.
* [SPARK-13538][ML] Add GaussianMixture to MLZheng RuiFeng2016-04-062-0/+444
| | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-13538 ## What changes were proposed in this pull request? Add GaussianMixture and GaussianMixtureModel to ML package ## How was this patch tested? unit tests and manual tests were done. Local Scalastyle checks passed. Author: Zheng RuiFeng <ruifengz@foxmail.com> Author: Ruifeng Zheng <ruifengz@foxmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #11419 from zhengruifeng/mlgmm.
* [SPARK-14322][MLLIB] Use treeAggregate instead of reduce in OnlineLDAOptimizerYuhao Yang2016-04-061-2/+3
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-14322 OnlineLDAOptimizer uses RDD.reduce in two places where it could use treeAggregate. This can cause scalability issues. This should be an easy fix. This is also a bug since it modifies the first argument to reduce, so we should use aggregate or treeAggregate. See this line: https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452 and a few lines below it. ## How was this patch tested? unit tests Author: Yuhao Yang <hhbyyh@gmail.com> Closes #12106 from hhbyyh/ldaTreeReduce.
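The two points in this commit (aggregate in a tree rather than funnelling everything to one place, and never mutate an input element as `reduce` with an in-place operator would) can be sketched in plain Python. `tree_aggregate` below is a hypothetical single-process stand-in for the RDD method: each partition folds into its own fresh accumulator, and partial results are then combined in rounds.

```python
from functools import reduce
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def tree_aggregate(partitions: Sequence[Sequence[T]],
                   zero: Callable[[], U],
                   seq_op: Callable[[U, T], U],
                   comb_op: Callable[[U, U], U],
                   fanout: int = 2) -> U:
    """Fold each partition into a fresh accumulator, then merge partials in rounds."""
    # zero() is called once per partition, so seq_op may mutate its accumulator
    # without ever touching the input elements.
    partials: List[U] = [reduce(seq_op, part, zero()) for part in partitions]
    if not partials:
        return zero()
    # Combine `fanout` partial results at a time until one remains.
    while len(partials) > 1:
        partials = [reduce(comb_op, partials[i:i + fanout])
                    for i in range(0, len(partials), fanout)]
    return partials[0]
```

Because the in-place operator only ever mutates accumulators created by `zero()`, the input vectors stay intact, which is the property plain `reduce` with a mutating function does not give.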
* [SPARK-13786][ML][PYSPARK] Add save/load for pyspark.ml.tuningXusen Yin2016-04-066-111/+404
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13786 Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model. ## How was this patch tested? Test with Python doctest. Author: Xusen Yin <yinxusen@gmail.com> Closes #12020 from yinxusen/SPARK-13786.
* [SPARK-14383][SQL] missing "|" in the g4 filebomeng2016-04-062-1/+8
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? A very trivial one. It missed "|" between DISTRIBUTE and UNSET. ## How was this patch tested? I do not think it is really needed. Author: bomeng <bmeng@us.ibm.com> Closes #12156 from bomeng/SPARK-14383.
* [SPARK-14429][SQL] Improve LIKE pattern in "SHOW TABLES / FUNCTIONS LIKE ↵bomeng2016-04-065-24/+46
| | | | | | | | | | | | | | | | | | | | | | <pattern>" DDL LIKE <pattern> is commonly used in SHOW TABLES / FUNCTIONS etc DDL. In the pattern, users can use `*` as a wildcard and `|` to separate alternatives. 1. Currently, we used `replaceAll()` to replace `*` with `.*`, but the replacement was scattered in several places; I have created a utility method and use it in all the places; 2. Consistency with Hive: the pattern is case insensitive in Hive and white spaces will be trimmed, but current pattern matching does not do that. For example, suppose we have tables (t1, t2, t3), `SHOW TABLES LIKE ' T* ' ` will list all the t-tables. Please use Hive to verify it. 3. Combined with `|`, the result will be sorted. For pattern like `' B*|a* '`, it will list the result in a-b order. I've made some changes to the utility method to make sure we will get the same result as Hive does. A new method was created in StringUtil and test cases were added. andrewor14 Author: bomeng <bmeng@us.ibm.com> Closes #12206 from bomeng/SPARK-14429.
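The matching rules listed in this commit (trim the pattern, split on `|`, treat `*` as a wildcard, ignore case, sort the result) can be sketched as a small Python function. `filter_pattern` here is a hypothetical illustration, not the Scala utility method the commit adds; escaping the literal parts with `re.escape` is a choice made for this sketch.

```python
import re
from typing import Iterable, List


def filter_pattern(names: Iterable[str], pattern: str) -> List[str]:
    """Return the sorted names matching a SHOW ... LIKE style pattern."""
    candidates = list(names)
    matched = set()
    # Surrounding whitespace is trimmed; `|` separates alternatives.
    for sub_pattern in pattern.strip().split("|"):
        # Escape literal text and turn each `*` into the regex `.*`.
        regex = ".*".join(re.escape(part) for part in sub_pattern.split("*"))
        matched.update(
            name for name in candidates
            if re.fullmatch(regex, name, flags=re.IGNORECASE)
        )
    return sorted(matched)
```

With tables `t1`, `t2`, `t3`, the pattern `' T* '` returns all three, since matching is case insensitive and the padding is ignored.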
* [SPARK-14426][SQL] Merge ParserUtils and ParseUtilsKousuke Saruta2016-04-065-138/+144
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? We have ParserUtils and ParseUtils which are both utility collections for use during the parsing process. Their names and purposes are very similar, so I think we can merge them. Also, the original unescapeSQLString method may have a fault. When "\u0061" style character literals are passed to the method, they are not unescaped successfully. This patch fixes the bug. ## How was this patch tested? Added a new test case. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #12199 from sarutak/merge-ParseUtils-and-ParserUtils.
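To make the `\u0061` case concrete, here is a minimal Python sketch of unescaping a quoted SQL string literal. It is not a port of Spark's `unescapeSQLString`; the function name, the set of single-character escapes, and the fallback for unknown escapes are assumptions made for the example.

```python
import string

# Assumed set of single-character escapes for this sketch.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "0": "\0",
                  "\\": "\\", "'": "'", '"': '"'}


def unescape_sql_string(literal: str) -> str:
    """Strip the surrounding quotes and resolve backslash escapes."""
    body = literal[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\" or i + 1 >= len(body):
            out.append(ch)
            i += 1
            continue
        nxt = body[i + 1]
        hex_digits = body[i + 2:i + 6]
        if nxt == "u" and len(hex_digits) == 4 and all(
                c in string.hexdigits for c in hex_digits):
            # \uXXXX: four hex digits name one UTF-16 code unit.
            out.append(chr(int(hex_digits, 16)))
            i += 6
        else:
            # Known escapes map to a control character; others keep the char.
            out.append(SIMPLE_ESCAPES.get(nxt, nxt))
            i += 2
    return "".join(out)
```

Under these assumptions the literal `'\u0061bc'` unescapes to `abc`, which is the behavior the commit says was previously broken.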
* [SPARK-14418][PYSPARK] fix unpersist of Broadcast in PythonDavies Liu2016-04-062-1/+31
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently, Broadcast.unpersist() will remove the file of broadcast, which should be the behavior of destroy(). This PR added destroy() for Broadcast in Python, to match the semantics in Scala. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #12189 from davies/py_unpersist.
* [SPARK-14288][SQL] Memory Sink for streamingMichael Armbrust2016-04-065-18/+159
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR exposes the internal testing `MemorySink` through the data source API. This will allow users to easily test streaming applications in the Spark shell or other local tests. Usage: ```scala inputStream.write .format("memory") .queryName("memStream") .startStream() // Now you can query the result of the stream here. sqlContext.table("memStream") ``` The most complicated part of the logic is choosing the checkpoint directory. There are a few requirements we are attempting to satisfy here: - when working in the shell locally, it should just work with no extra configuration. - when working on a cluster you should be able to make it easily create the checkpoint on a distributed file system so you can test aggregation (state checkpoints are also stored in this directory and must be accessible from workers). - it should be clear that you can't resume since the data is just in memory. The chosen algorithm proceeds as follows: - the user gives a checkpoint directory, use it - if the conf has a checkpoint location, use `$location/$queryName` - if neither, create a local directory - always check to make sure there are no offsets written to the directory Author: Michael Armbrust <michael@databricks.com> Closes #12119 from marmbrus/memorySink.
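The checkpoint-directory algorithm listed in this commit can be sketched as a short function. This is a Python illustration on the local file system only; the function name, the `offsets` subdirectory name, the temporary-directory prefix, and the error text are all assumptions, not Spark's implementation.

```python
import os
import tempfile
from typing import Optional


def choose_checkpoint_dir(query_name: str,
                          user_dir: Optional[str] = None,
                          conf_location: Optional[str] = None) -> str:
    """Pick a checkpoint directory following the precedence in the commit."""
    if user_dir is not None:
        # 1. An explicit directory from the user always wins.
        path = user_dir
    elif conf_location is not None:
        # 2. Otherwise use $location/$queryName from the configuration.
        path = os.path.join(conf_location, query_name)
    else:
        # 3. Otherwise create a fresh local directory.
        path = tempfile.mkdtemp(prefix="memory.checkpoint-")
    # 4. Always refuse a directory that already holds offsets, because
    #    an in-memory sink cannot resume a previous run.
    offsets = os.path.join(path, "offsets")  # assumed layout
    if os.path.isdir(offsets) and os.listdir(offsets):
        raise ValueError("Cannot resume an in-memory query from " + path)
    return path
```

The final check is what makes the "you can't resume" requirement visible to the user: a reused directory fails fast instead of silently starting from stale offsets.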
* [SPARK-14430][BUILD] use https while downloading binaries from build/mvnPrajwal Tuladhar2016-04-061-3/+3
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? The `./build/mvn` script was downloading binaries over plain HTTP rather than HTTPS. This PR fixes it. ## How was this patch tested? By running `./build/mvn clean package` locally Author: Prajwal Tuladhar <praj@infynyxx.com> Closes #12182 from infynyxx/mvn_use_https.
* Added omitted word in error messageVictor Chima2016-04-061-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Added an omitted word in the error message displayed by the GraphX Pregel API when `maxIterations <= 0` ## How was this patch tested? Manual test Author: Victor Chima <blazy2k9@gmail.com> Closes #12205 from blazy2k9/hotfix/pregel-error-message.
* [SPARK-14396][BUILD][HOT] Fix compilation against Scala 2.10gatorsmile2016-04-061-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | #### What changes were proposed in this pull request? This PR is to fix the compilation errors in Scala 2.10 build, as shown in the link: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-compile-maven-scala-2.10/735/console ``` [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:266: value contains is not a member of Option[String] [error] assert(desc.viewText.contains("SELECT * FROM tab1")) [error] ^ [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:267: value contains is not a member of Option[String] [error] assert(desc.viewOriginalText.contains("SELECT * FROM tab1")) [error] ^ [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:293: value contains is not a member of Option[String] [error] assert(desc.viewText.contains("SELECT * FROM tab1")) [error] ^ [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:294: value contains is not a member of Option[String] [error] assert(desc.viewOriginalText.contains("SELECT * FROM tab1")) [error] ^ [error] four errors found [error] Compile failed at Apr 5, 2016 10:59:09 PM [10.502s] ``` #### How was this patch tested? Not sure how to trigger Scala 2.10 compilation in the test environment. Author: gatorsmile <gatorsmile@gmail.com> Closes #12201 from gatorsmile/buildBreak2.10.
* [SPARK-14252] Executors do not try to download remote cached blocksEric Liang2016-04-052-0/+21
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? As mentioned in the ticket this was because one get path in the refactored `BlockManager` did not check for remote storage. ## How was this patch tested? Unit test, also verified manually with reproduction in the ticket. cc JoshRosen Author: Eric Liang <ekl@databricks.com> Closes #12193 from ericl/spark-14252.
* [SPARK-14396][SQL] Throw Exceptions for DDLs of Partitioned Viewsgatorsmile2016-04-055-44/+94
| | | | | | | | | | | | | | | | | | | | | | | | | | | #### What changes were proposed in this pull request? Because the concept of partitioning is associated with physical tables, we disable all support for partitioned views, which are defined in the following three commands in [Hive DDL Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView): ``` ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...]; ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec; CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ] [COMMENT view_comment] [TBLPROPERTIES (property_name = property_value, ...)] AS SELECT ...; ``` An exception is thrown when users issue any of these three DDL commands. #### How was this patch tested? Added test cases for parsing create view and changed the existing test cases to verify that the exceptions are thrown. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12169 from gatorsmile/viewPartition.
* [SPARK-14416][CORE] Add thread-safe comments for ↵Shixiong Zhu2016-04-052-16/+30
| | | | | | | | | | | | | | | | CoarseGrainedSchedulerBackend's fields ## What changes were proposed in this pull request? While I was reviewing #12078, I found that most of CoarseGrainedSchedulerBackend's mutable fields don't have any comments about the thread-safety assumptions, and it's hard for people to figure out which parts of the code should be protected by the lock. This PR just added comments/annotations for them and also added strict access modifiers for some fields. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12188 from zsxwing/comments.
* [SPARK-14128][SQL] Alter table DDL followupAndrew Or2016-04-052-5/+21
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This is just a followup to #12121, which implemented the alter table DDLs using the `SessionCatalog`. Specifically, this corrects the behavior of setting the location of a datasource table. For datasource tables, we need to set the `locationUri` in addition to the `path` entry in the serde properties. Additionally, changing the location of a datasource table partition is not allowed. ## How was this patch tested? `DDLSuite` Author: Andrew Or <andrew@databricks.com> Closes #12186 from andrewor14/alter-table-ddl-followup.
* [SPARK-14296][SQL] whole stage codegen support for Dataset.mapWenchen Fan2016-04-0611-41/+247
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR adds a new operator `MapElements` for `Dataset.map`, it's a 1-1 mapping and is easier to adapt to whole stage codegen framework. ## How was this patch tested? new test in `WholeStageCodegenSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12087 from cloud-fan/map.
* [SPARK-13211][STREAMING] StreamingContext throws NoSuchElementException when ↵Sean Owen2016-04-053-9/+10
| | | | | | | | | | | | | | | | created from non-existent checkpoint directory ## What changes were proposed in this pull request? Take 2: avoid None.get NoSuchElementException in favor of more descriptive IllegalArgumentException if a non-existent checkpoint dir is used without a SparkContext ## How was this patch tested? Jenkins test plus new test for this particular case Author: Sean Owen <sowen@cloudera.com> Closes #12174 from srowen/SPARK-13211.
* [SPARK-14359] Unit tests for java 8 lambda syntax with typed aggregatesEric Liang2016-04-053-41/+118
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Adds unit tests for java 8 lambda syntax with typed aggregates as a follow-up to #12168 ## How was this patch tested? Unit tests. Author: Eric Liang <ekl@databricks.com> Closes #12181 from ericl/sc-2794-2.
* [SPARK-14353] Dataset Time Window `window` API for RBurak Yavuz2016-04-055-1/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008). This PR adds the R API for this function. With this PR, SQL, Java, and Scala will share the same APIs as in users can use: - `window(timeColumn, windowDuration)` - `window(timeColumn, windowDuration, slideDuration)` - `window(timeColumn, windowDuration, slideDuration, startTime)` In Python and R, users can access all APIs above, but in addition they can do - In R: `window(timeColumn, windowDuration, startTime=...)` that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows. ## How was this patch tested? Unit tests + manual tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #12141 from brkyvz/R-windows.
* [HOTFIX] Fix `optional` to `createOptional`.Dongjoon Hyun2016-04-051-1/+1
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes the following line. ``` private[spark] val STAGING_DIR = ConfigBuilder("spark.yarn.stagingDir") .doc("Staging directory used while submitting applications.") .stringConf - .optional + .createOptional ``` ## How was this patch tested? Pass the build. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12187 from dongjoon-hyun/hotfix.
* [SPARK-529][SQL] Modify SQLConf to use new config API from core.Marcelo Vanzin2016-04-0510-590/+551
| | | | | | | | | | | | Because SQL keeps track of all known configs, some customization was needed in SQLConf to allow that, since the core API does not have that feature. Tested via existing (and slightly updated) unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11570 from vanzin/SPARK-529-sql.
* [SPARK-14411][SQL] Add a note to warn that onQueryProgress is asynchronousShixiong Zhu2016-04-051-2/+10
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? onQueryProgress is asynchronous so the user may see some future status of `ContinuousQuery`. This PR just updated comments to warn it. ## How was this patch tested? Only updated comments. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12180 from zsxwing/ContinuousQueryListener-doc.
* [SPARK-14129][SPARK-14128][SQL] Alter table DDL commandsAndrew Or2016-04-059-300/+562
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In Spark 2.0, we want to handle the most common `ALTER TABLE` commands ourselves instead of passing the entire query text to Hive. This is done using the new `SessionCatalog` API introduced recently. The commands supported in this patch include: ``` ALTER TABLE ... RENAME TO ... ALTER TABLE ... SET TBLPROPERTIES ... ALTER TABLE ... UNSET TBLPROPERTIES ... ALTER TABLE ... SET LOCATION ... ALTER TABLE ... SET SERDE ... ``` The commands we explicitly do not support are: ``` ALTER TABLE ... CLUSTERED BY ... ALTER TABLE ... SKEWED BY ... ALTER TABLE ... NOT CLUSTERED ALTER TABLE ... NOT SORTED ALTER TABLE ... NOT SKEWED ALTER TABLE ... NOT STORED AS DIRECTORIES ``` For these we throw exceptions complaining that they are not supported. ## How was this patch tested? `DDLSuite` Author: Andrew Or <andrew@databricks.com> Closes #12121 from andrewor14/alter-table-ddl.
* [SPARK-14402][SQL] initcap UDF doesn't match Hive/Oracle behavior in ↵Dongjoon Hyun2016-04-053-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | lowercasing rest of string ## What changes were proposed in this pull request? Currently, SparkSQL `initCap` uses the `toTitleCase` function. However, the `UTF8String.toTitleCase` implementation changes only the first letter and just copies the other letters: e.g. sParK --> SParK. This is the correct behavior for `toTitleCase`, but it does not match Hive's `initcap`. ``` hive> select initcap('sParK'); Spark ``` ``` scala> sql("select initcap('sParK')").head res0: org.apache.spark.sql.Row = [SParK] ``` This PR updates the implementation of `initcap` using `toLowerCase` and `toTitleCase`. ## How was this patch tested? Pass the Jenkins tests (including new testcase). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12175 from dongjoon-hyun/SPARK-14402.
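The difference between the two behaviors in this commit is easy to show in isolation. The following Python sketch mimics the description only: `to_title_case` upper-cases the first letter of each space-separated word and copies the rest, and `initcap` lower-cases first. Both functions are hypothetical stand-ins, not the `UTF8String` methods.

```python
def to_title_case(s: str) -> str:
    """Upper-case the first letter of each space-separated word; copy the rest."""
    return " ".join(word[:1].upper() + word[1:] for word in s.split(" "))


def initcap(s: str) -> str:
    """Lower-case everything first, then title-case, as the commit describes."""
    return to_title_case(s.lower())
```

Here `to_title_case("sParK")` gives `SParK`, while `initcap("sParK")` gives `Spark`, matching the Hive output quoted in the commit.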
* [SPARK-14353] Dataset Time Window `window` API for Python, and SQLBurak Yavuz2016-04-057-15/+204
| | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008). This PR adds the Python, and SQL, API for this function. With this PR, SQL, Java, and Scala will share the same APIs as in users can use: - `window(timeColumn, windowDuration)` - `window(timeColumn, windowDuration, slideDuration)` - `window(timeColumn, windowDuration, slideDuration, startTime)` In Python, users can access all APIs above, but in addition they can do - In Python: `window(timeColumn, windowDuration, startTime=...)` that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows. ## How was this patch tested? Unit tests + manual tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #12136 from brkyvz/python-windows.
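How a timestamp maps to windows under the three parameters in this commit can be sketched with integer arithmetic. The Python function below is an illustration of the usual tumbling/sliding window definition, not Spark's implementation; `windows_for` is a hypothetical name, times are plain integers in one unit, and omitting `slide` produces tumbling windows as the commit describes for the `startTime`-only call.

```python
from typing import List, Optional, Tuple


def windows_for(ts: int, window: int, slide: Optional[int] = None,
                start: int = 0) -> List[Tuple[int, int]]:
    """Return every [begin, end) window that contains timestamp `ts`.

    Windows begin at start + k * slide and last `window` time units.
    """
    if slide is None:
        slide = window  # tumbling windows: no overlap
    # Latest window start that is not after ts.
    begin = ts - (ts - start) % slide
    result = []
    # Walk back one slide at a time while the window still covers ts.
    while begin > ts - window:
        result.append((begin, begin + window))
        begin -= slide
    return sorted(result)
```

For instance, with a window of 10 and no slide, timestamp 12 falls in the single window `[10, 20)`; with a slide of 5 it falls in both `[5, 15)` and `[10, 20)`; with `start=3` and no slide it falls in `[3, 13)`.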