aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
...
* [SPARK-7286][SQL] Deprecate !== in favour of =!=Jakob Odersky2016-03-087-15/+33
| | | | | | | | | | | | | | | This PR replaces #9925 which had issues with CI. **Please see the original PR for any previous discussions.** ## What changes were proposed in this pull request? Deprecate the SparkSQL column operator !== and use =!= as an alternative. Fixes subtle issues related to operator precedence (basically, !== does not have the same priority as its logical negation, ===). ## How was this patch tested? All currently existing tests. Author: Jakob Odersky <jodersky@gmail.com> Closes #11588 from jodersky/SPARK-7286.
* [SPARK-13754] Keep old data source name for backwards compatibilityHossein2016-03-082-1/+12
| | | | | | | | | | | | | | | ## Motivation CSV data source was contributed by Databricks. It is the inlined version of https://github.com/databricks/spark-csv. The data source name was `com.databricks.spark.csv`. As a result there are many tables created on older versions of spark with that name as the source. For backwards compatibility we should keep the old name. ## Proposed changes `com.databricks.spark.csv` was added to list of `backwardCompatibilityMap` in `ResolvedDataSource.scala` ## Tests A unit test was added to `CSVSuite` to parse a csv file using the old name. Author: Hossein <hossein@databricks.com> Closes #11589 from falaki/SPARK-13754.
* [SPARK-13750][SQL] fix sizeInBytes of HadoopFsRelationDavies Liu2016-03-082-0/+44
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fix the sizeInBytes of HadoopFsRelation. ## How was this patch tested? Added regression test for that. Author: Davies Liu <davies@databricks.com> Closes #11590 from davies/fix_sizeInBytes.
* [SPARK-13625][PYSPARK][ML] Added a check to see if an attribute is a ↵Bryan Cutler2016-03-082-1/+19
| | | | | | | | | | | | | | | | property when getting param list ## What changes were proposed in this pull request? Added a check in pyspark.ml.param.Param.params() to see if an attribute is a property (decorated with `property`) before checking if it is a `Param` instance. This prevents the property from being invoked to 'get' this attribute, which could possibly cause an error. ## How was this patch tested? Added a test case with a class has a property that will raise an error when invoked and then call`Param.params` to verify that the property is not invoked, but still able to find another property in the class. Also ran pyspark-ml test before fix that will trigger an error, and again after the fix to verify that the error was resolved and the method was working properly. Author: Bryan Cutler <cutlerb@gmail.com> Closes #11476 from BryanCutler/pyspark-ml-property-attr-SPARK-13625.
* [SPARK-13755] Escape quotes in SQL plan visualization node labelsJosh Rosen2016-03-081-7/+7
| | | | | | | | When generating Graphviz DOT files in the SQL query visualization we need to escape double-quotes inside node labels. This is a followup to #11309, which fixed a similar graph in Spark Core's DAG visualization. Author: Josh Rosen <joshrosen@databricks.com> Closes #11587 from JoshRosen/graphviz-escaping.
* [SPARK-13668][SQL] Reorder filter/join predicates to short-circuit isNotNull ↵Sameer Agarwal2016-03-083-14/+150
| | | | | | | | | | | | | | | | | | checks ## What changes were proposed in this pull request? If a filter predicate or a join condition consists of `IsNotNull` checks, we should reorder these checks such that these non-nullability checks are evaluated before the rest of the predicates. For e.g., if a filter predicate is of the form `a > 5 && isNotNull(b)`, we should rewrite this as `isNotNull(b) && a > 5` during physical plan generation. ## How was this patch tested? new unit tests that verify the physical plan for both filters and joins in `ReorderedPredicateSuite` Author: Sameer Agarwal <sameer@databricks.com> Closes #11511 from sameeragarwal/reorder-isnotnull.
* [SPARK-13738][SQL] Cleanup Data Source resolutionMichael Armbrust2016-03-0812-197/+197
| | | | | | | | | | | Follow-up to #11509, that simply refactors the interface that we use when resolving a pluggable `DataSource`. - Multiple functions share the same set of arguments so we make this a case class, called `DataSource`. Actual resolution is now done by calling a function on this class. - Instead of having multiple methods named `apply` (some of which do writing some of which do reading) we now explicitly have `resolveRelation()` and `write(mode, df)`. - Get rid of `Array[String]` since this is an internal API and was forcing us to awkwardly call `toArray` in a bunch of places. Author: Michael Armbrust <michael@databricks.com> Closes #11572 from marmbrus/dataSourceResolution.
* [SPARK-13400] Stop using deprecated Octal escape literalsDongjoon Hyun2016-03-082-2/+2
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This removes the remaining deprecated Octal escape literals. The followings are the warnings on those two lines. ``` LiteralExpressionSuite.scala:99: Octal escape literals are deprecated, use \u0000 instead. HiveQlSuite.scala:74: Octal escape literals are deprecated, use \u002c instead. ``` ## How was this patch tested? Manual. During building, there should be no warning on `Octal escape literals`. ``` mvn -DskipTests clean install ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11584 from dongjoon-hyun/SPARK-13400.
* [SPARK-13593] [SQL] improve the `createDataFrame` to accept data type string ↵Wenchen Fan2016-03-083-23/+214
| | | | | | | | | | | | | | | | | | and verify the data ## What changes were proposed in this pull request? This PR improves the `createDataFrame` method to make it also accept datatype string, then users can convert python RDD to DataFrame easily, for example, `df = rdd.toDF("a: int, b: string")`. It also supports flat schema so users can convert an RDD of int to DataFrame directly, we will automatically wrap int to row for users. If schema is given, now we checks if the real data matches the given schema, and throw error if it doesn't. ## How was this patch tested? new tests in `test.py` and doc test in `types.py` Author: Wenchen Fan <wenchen@databricks.com> Closes #11444 from cloud-fan/pyrdd.
* [SPARK-13740][SQL] add null check for _verify_type in types.pyWenchen Fan2016-03-081-7/+26
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR adds null check in `_verify_type` according to the nullability information. ## How was this patch tested? new doc tests Author: Wenchen Fan <wenchen@databricks.com> Closes #11574 from cloud-fan/py-null-check.
* [ML] testEstimatorAndModelReadWrite should call checkModelDataYanbo Liang2016-03-082-1/+5
| | | | | | | | | | | | | ## What changes were proposed in this pull request? Although we defined ```checkModelData``` in [```read/write``` test](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L994) of ML estimators/models and pass it to ```testEstimatorAndModelReadWrite```, ```testEstimatorAndModelReadWrite``` omits to call ```checkModelData``` to check the equality of model data. So actually we did not run the check of model data equality for all test cases currently, we should fix it. BTW, fix the bug of LDA read/write test which did not set ```docConcentration```. This bug should have failed test, but it does not complain because we did not run ```checkModelData``` actually. cc jkbradley mengxr ## How was this patch tested? No new unit test, should pass the exist ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11513 from yanboliang/ml-check-model-data.
* [SPARK-12727][SQL] support SQL generation for aggregate with multi-distinctWenchen Fan2016-03-084-13/+6
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR add SQL generation support for aggregate with multi-distinct, by simply moving the `DistinctAggregationRewriter` rule to optimizer. More discussions are needed as this breaks an import contract: analyzed plan should be able to run without optimization. However, the `ComputeCurrentTime` rule has kind of broken it already, and I think maybe we should add a new phase for this kind of rules, because strictly speaking they don't belong to analysis and is coupled with the physical plan implementation. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #11579 from cloud-fan/distinct.
* [SPARK-13695] Don't cache MEMORY_AND_DISK blocks as bytes in memory after spillsJosh Rosen2016-03-083-13/+26
| | | | | | | | | | | | | | | | | When a cached block is spilled to disk and read back in serialized form (i.e. as bytes), the current BlockManager implementation will attempt to re-insert the serialized block into the MemoryStore even if the block's storage level requests deserialized caching. This behavior adds some complexity to the MemoryStore but I don't think it offers many performance benefits and I'd like to remove it in order to simplify a larger refactoring patch. Therefore, this patch changes the behavior so that disk store reads will only cache bytes in the memory store for blocks with serialized storage levels. There are two places where we request serialized bytes from the BlockStore: 1. getLocalBytes(), which is only called when reading local copies of TorrentBroadcast pieces. Broadcast pieces are always cached using a serialized storage level, so this won't lead to a mismatch in serialization forms if spilled bytes read from disk are cached as bytes in the memory store. 2. the non-shuffle-block branch in getBlockData(), which is only called by the NettyBlockRpcServer when responding to requests to read remote blocks. Caching the serialized bytes in memory will only benefit us if those cached bytes are read before they're evicted and the likelihood of that happening seems low since the frequency of remote reads of non-broadcast cached blocks seems very low. Caching these bytes when they have a low probability of being read is bad if it risks the eviction of blocks which are cached in their expected serialized/deserialized forms, since those blocks seem more likely to be read in local computation. Given the argument above, I think this change is unlikely to cause performance regressions. Author: Josh Rosen <joshrosen@databricks.com> Closes #11533 from JoshRosen/remove-memorystore-level-mismatch.
* [SPARK-13657] [SQL] Support parsing very long AND/OR expressionsDavies Liu2016-03-082-2/+51
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In order to avoid StackOverflow when parse a expression with hundreds of ORs, we should use loop instead of recursive functions to flatten the tree as list. This PR also build a balanced tree to reduce the depth of generated And/Or expression, to avoid StackOverflow in analyzer/optimizer. ## How was this patch tested? Add new unit tests. Manually tested with TPCDS Q3 with hundreds predicates in it [1]. These predicates help to reduce the number of partitions, then the query time went from 60 seconds to 8 seconds. [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql Author: Davies Liu <davies@databricks.com> Closes #11501 from davies/long_or.
* [SPARK-13715][MLLIB] Remove last usages of jblas in testsSean Owen2016-03-0810-121/+107
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Remove last usage of jblas, in tests ## How was this patch tested? Jenkins tests -- the same ones that are being modified. Author: Sean Owen <sowen@cloudera.com> Closes #11560 from srowen/SPARK-13715.
* [HOTFIX][YARN] Fix yarn cluster mode fire and forget regressionjerryshao2016-03-081-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fire and forget is disabled by default, with this patch #10205 it is enabled by default, so this is a regression should be fixed. ## How was this patch tested? Manually verified this change. Author: jerryshao <sshao@hortonworks.com> Closes #11577 from jerryshao/hot-fix-yarn-cluster.
* [SPARK-13637][SQL] use more information to simplify the code in Expand builderWenchen Fan2016-03-082-29/+23
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The code in `Expand.apply` can be simplified by existing information: * the `groupByExprs` parameter are all `Attribute`s * the `child` parameter is a `Project` that append aliased group by expressions to its child's output ## How was this patch tested? by existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11485 from cloud-fan/expand.
* [SPARK-13675][UI] Fix wrong historyserver url link for application running ↵jerryshao2016-03-082-4/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | in yarn cluster mode ## What changes were proposed in this pull request? Current URL for each application to access history UI is like: http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or http://localhost:18080/history/application_1457058760338_0016/2/jobs/ Here **1** or **2** represents the number of attempts in `historypage.js`, but it will parse to attempt id in `HistoryServer`, while the correct attempt id should be like "appattempt_1457058760338_0016_000002", so it will fail to parse to a correct attempt id in HistoryServer. This is OK in yarn client mode, since we don't need this attempt id to fetch out the app cache, but it is failed in yarn cluster mode, where attempt id "1" or "2" is actually wrong. So here we should fix this url to parse the correct application id and attempt id. Also the suffix "jobs/" is not needed. Here is the screenshot: ![screen shot 2016-02-29 at 3 57 32 pm](https://cloud.githubusercontent.com/assets/850797/13524377/d4b44348-e235-11e5-8b3e-bc06de306e87.png) ## How was this patch tested? This patch is tested manually, with different master and deploy mode. ![image](https://cloud.githubusercontent.com/assets/850797/13524419/118be5a0-e236-11e5-8022-3ff613ccde46.png) Author: jerryshao <sshao@hortonworks.com> Closes #11518 from jerryshao/SPARK-13675.
* [SPARK-13117][WEB UI] WebUI should use the local ip not 0.0.0.0Devaraj K2016-03-081-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In WebUI, now Jetty Server starts with SPARK_LOCAL_IP config value if it is configured otherwise it starts with default value as '0.0.0.0'. It is continuation as per the closed PR https://github.com/apache/spark/pull/11133 for the JIRA SPARK-13117 and discussion in SPARK-13117. ## How was this patch tested? This has been verified using the command 'netstat -tnlp | grep <PID>' to check on which IP/hostname is binding with the below steps. In the below results, mentioned PID in the command is the corresponding process id. #### Without the patch changes, Web UI(Jetty Server) is not taking the value configured for SPARK_LOCAL_IP and it is listening to all the interfaces. ###### Master ``` [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 3930 tcp6 0 0 :::8080 :::* LISTEN 3930/java ``` ###### Worker ``` [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4090 tcp6 0 0 :::8081 :::* LISTEN 4090/java ``` ###### History Server Process, ``` [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 2471 tcp6 0 0 :::18080 :::* LISTEN 2471/java ``` ###### Driver ``` [devarajstobdtserver2 spark-master]$ netstat -tnlp | grep 6556 tcp6 0 0 :::4040 :::* LISTEN 6556/java ``` #### With the patch changes ##### i. With SPARK_LOCAL_IP configured If the SPARK_LOCAL_IP is configured then all the processes Web UI(Jetty Server) is getting bind to the configured value. ###### Master ``` [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 1561 tcp6 0 0 x.x.x.x:8080 :::* LISTEN 1561/java ``` ###### Worker ``` [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 2229 tcp6 0 0 x.x.x.x:8081 :::* LISTEN 2229/java ``` ###### History Server ``` [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 3747 tcp6 0 0 x.x.x.x:18080 :::* LISTEN 3747/java ``` ###### Driver ``` [devarajstobdtserver2 spark-master]$ netstat -tnlp | grep 6013 tcp6 0 0 x.x.x.x:4040 :::* LISTEN 6013/java ``` ##### ii. Without SPARK_LOCAL_IP configured If the SPARK_LOCAL_IP is not configured then all the processes Web UI(Jetty Server) will start with the '0.0.0.0' as default value. ###### Master ``` [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4573 tcp6 0 0 :::8080 :::* LISTEN 4573/java ``` ###### Worker ``` [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4703 tcp6 0 0 :::8081 :::* LISTEN 4703/java ``` ###### History Server ``` [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4846 tcp6 0 0 :::18080 :::* LISTEN 4846/java ``` ###### Driver ``` [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 5437 tcp6 0 0 :::4040 :::* LISTEN 5437/java ``` Author: Devaraj K <devaraj@apache.org> Closes #11490 from devaraj-kavali/SPARK-13117-v1.
* [HOT-FIX][BUILD] Use the new location of `checkstyle-suppressions.xml`Dongjoon Hyun2016-03-082-2/+2
| | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes `dev/lint-java` and `mvn checkstyle:check` failures due the recent file location change. The following is the error message of current master. ``` Checkstyle checks failed at following occurrences: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (default-cli) on project spark-parent_2.11: Failed during checkstyle configuration: cannot initialize module SuppressionFilter - Cannot set property 'file' to 'checkstyle-suppressions.xml' in module SuppressionFilter: InvocationTargetException: Unable to find: checkstyle-suppressions.xml -> [Help 1] ``` ## How was this patch tested? Manual. The following command should run correctly. ``` ./dev/lint-java mvn checkstyle:check ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11567 from dongjoon-hyun/hotfix_checkstyle_suppression.
* [SPARK-13659] Refactor BlockStore put*() APIs to remove returnValuesJosh Rosen2016-03-076-273/+223
| | | | | | | | | | | | | | In preparation for larger refactoring, this patch removes the confusing `returnValues` option from the BlockStore put() APIs: returning the value is only useful in one place (caching) and in other situations, such as block replication, it's simpler to put() and then get(). As part of this change, I needed to refactor `BlockManager.doPut()`'s block replication code. I also changed `doPut()` to access the memory and disk stores directly rather than calling them through the BlockStore interface; this is in anticipation of a followup patch to remove the BlockStore interface so that the disk store can expose a binary-data-oriented API which is not concerned with Java objects or serialization. These changes should be covered by the existing storage unit tests. The best way to review this patch is probably to look at the individual commits, all of which are small and have useful descriptions to guide the review. /cc davies for review. Author: Josh Rosen <joshrosen@databricks.com> Closes #11502 from JoshRosen/remove-returnvalues.
* [SPARK-13711][CORE] Don't call SparkUncaughtExceptionHandler in AppClient as ↵Shixiong Zhu2016-03-071-10/+8
| | | | | | | | | | | | | | | | it's in driver ## What changes were proposed in this pull request? AppClient runs in the driver side. It should not call `Utils.tryOrExit` as it will send exception to SparkUncaughtExceptionHandler and call `System.exit`. This PR just removed `Utils.tryOrExit`. ## How was this patch tested? manual tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11566 from zsxwing/SPARK-13711.
* [SPARK-13404] [SQL] Create variables for input row when it's actually usedDavies Liu2016-03-079-155/+224
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR change the way how we generate the code for the output variables passing from a plan to it's parent. Right now, they are generated before call consume() of it's parent. It's not efficient, if the parent is a Filter or Join, which could filter out most the rows, the time to access some of the columns that are not used by the Filter or Join are wasted. This PR try to improve this by defering the access of columns until they are actually used by a plan. After this PR, a plan does not need to generate code to evaluate the variables for output, just passing the ExprCode to its parent by `consume()`. In `parent.consumeChild()`, it will check the output from child and `usedInputs`, generate the code for those columns that is part of `usedInputs` before calling `doConsume()`. This PR also change the `if` from ``` if (cond) { xxx } ``` to ``` if (!cond) continue; xxx ``` The new one could help to reduce the nested indents for multiple levels of Filter and BroadcastHashJoin. It also added some comments for operators. ## How was the this patch tested? Unit tests. Manually ran TPCDS Q55, this PR improve the performance about 30% (scale=10, from 2.56s to 1.96s) Author: Davies Liu <davies@databricks.com> Closes #11274 from davies/gen_defer.
* [SPARK-13689][SQL] Move helper things in CatalystQl to new utils objectAndrew Or2016-03-074-143/+177
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? When we add more DDL parsing logic in the future, SparkQl will become very big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to parse alter table commands. However, these parser objects will need to access some helper methods that exist in CatalystQl. The proposal is to move those methods to an isolated ParserUtils object. This is based on viirya's changes in #11048. It prefaces the bigger fix for SPARK-13139 to make the diff of that patch smaller. ## How was this patch tested? No change in functionality, so just Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #11529 from andrewor14/parser-utils.
* [SPARK-13648] Add Hive Cli to classes for isolated classloaderTim Preece2016-03-071-1/+1
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Adding the hive-cli classes to the classloader ## How was this patch tested? The hive Versionssuite tests were run This is my original work and I license the work to the project under the project's open source license. Author: Tim Preece <tim.preece.in.oz@gmail.com> Closes #11495 from preecet/master.
* [SPARK-13665][SQL] Separate the concerns of HadoopFsRelationMichael Armbrust2016-03-0750-2656/+1450
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | `HadoopFsRelation` is used for reading most files into Spark SQL. However today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency. This PR is a first cut at separating this into several components / interfaces that are each described below. Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, svmlib) have been ported to the new API `FileFormat`. External libraries, such as spark-avro will also need to be ported to work with Spark 2.0. ### HadoopFsRelation A simple `case class` that acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed. This an internal representation that no longer needs to be exposed to developers. ```scala case class HadoopFsRelation( sqlContext: SQLContext, location: FileCatalog, partitionSchema: StructType, dataSchema: StructType, bucketSpec: Option[BucketSpec], fileFormat: FileFormat, options: Map[String, String]) extends BaseRelation ``` ### FileFormat The primary interface that will be implemented by each different format including external libraries. Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`. A format can optionally return a schema that is inferred from a set of files. ```scala trait FileFormat { def inferSchema( sqlContext: SQLContext, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] def prepareWrite( sqlContext: SQLContext, job: Job, options: Map[String, String], dataSchema: StructType): OutputWriterFactory def buildInternalScan( sqlContext: SQLContext, dataSchema: StructType, requiredColumns: Array[String], filters: Array[Filter], bucketSet: Option[BitSet], inputFiles: Array[FileStatus], broadcastedConf: Broadcast[SerializableConfiguration], options: Map[String, String]): RDD[InternalRow] } ``` The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner). Additionally, scans are still returning `RDD`s instead of iterators for single files. In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file. ### FileCatalog This interface is used to list the files that make up a given relation, as well as handle directory based partitioning. ```scala trait FileCatalog { def paths: Seq[Path] def partitionSpec(schema: Option[StructType]): PartitionSpec def allFiles(): Seq[FileStatus] def getStatus(path: Path): Array[FileStatus] def refresh(): Unit } ``` Currently there are two implementations: - `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`. Infers partitioning by recursive listing and caches this data for performance - `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore. ### ResolvedDataSource Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore): - `paths: Seq[String] = Nil` - `userSpecifiedSchema: Option[StructType] = None` - `partitionColumns: Array[String] = Array.empty` - `bucketSpec: Option[BucketSpec] = None` - `provider: String` - `options: Map[String, String]` This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones). All reconciliation of partitions, buckets, schema from metastores or inference is done here. ### DataSourceAnalysis / DataSourceStrategy Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including: - pruning the files from partitions that will be read based on filters. - appending partition columns* - applying additional filters when a data source can not evaluate them internally. - constructing an RDD that is bucketed correctly when required* - sanity checking schema match-up and other analysis when writing. *In the future we should do that following: - Break out file handling into its own Strategy as its sufficiently complex / isolated. - Push the appending of partition columns down in to `FileFormat` to avoid an extra copy / unvectorization. - Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2` Author: Michael Armbrust <michael@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Closes #11509 from marmbrus/fileDataSource.
* [SPARK-13596][BUILD] Move misc top-level build files into appropriate subdirsSean Owen2016-03-0714-49/+18
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Move many top-level files in dev/ or other appropriate directory. In particular, put `make-distribution.sh` in `dev` and update docs accordingly. Remove deprecated `sbt/sbt`. I was (so far) unable to figure out how to move `tox.ini`. `scalastyle-config.xml` should be movable but edits to the project `.sbt` files didn't work; config file location is updatable for compile but not test scope. ## How was this patch tested? `./dev/run-tests` to verify RAT and checkstyle work. Jenkins tests for the rest. Author: Sean Owen <sowen@cloudera.com> Closes #11522 from srowen/SPARK-13596.
* [SPARK-13442][SQL] Make type inference recognize boolean typeshyukjinkwon2016-03-074-0/+38
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13442 This PR adds the support for inferring `BooleanType` for schema. It supports to infer case-insensitive `true` / `false` as `BooleanType`. Unittests were added for `CSVInferSchemaSuite` and `CSVSuite` for end-to-end test. ## How was the this patch tested? This was tested with unittests and with `dev/run_tests` for coding style Author: hyukjinkwon <gurwls223@gmail.com> Closes #11315 from HyukjinKwon/SPARK-13442.
* [SPARK-529][CORE][YARN] Add type-safe config keys to SparkConf.Marcelo Vanzin2016-03-0720-255/+1019
| | | | | | | | | | | | | | | | This is, in a way, the basics to enable SPARK-529 (which was closed as won't fix but I think is still valuable). In fact, Spark SQL created something for that, and this change basically factors out that code and inserts it into SparkConf, with some extra bells and whistles. To showcase the usage of this pattern, I modified the YARN backend to use the new config keys (defined in the new `config` package object under `o.a.s.deploy.yarn`). Most of the changes are mechanic, although logic had to be slightly modified in a handful of places. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10205 from vanzin/conf-opts.
* [SPARK-13655] Improve isolation between tests in KinesisBackedBlockRDDSuiteJosh Rosen2016-03-071-13/+17
| | | | | | | | | | | | This patch modifies `KinesisBackedBlockRDDTests` to increase the isolation between tests in order to fix a bug which causes the tests to hang. See #11558 for more details. /cc zsxwing srowen Author: Josh Rosen <joshrosen@databricks.com> Closes #11564 from JoshRosen/SPARK-13655.
* [SPARK-13722][SQL] No Push Down for Non-deterministics Predicates through ↵gatorsmile2016-03-072-1/+19
| | | | | | | | | | | | | | | | Generate #### What changes were proposed in this pull request? Non-deterministic predicates should not be pushed through Generate. #### How was this patch tested? Added a test case in `FilterPushdownSuite.scala` Author: gatorsmile <gatorsmile@gmail.com> Closes #11562 from gatorsmile/pushPredicateDownWindow.
* [MINOR][DOC] improve the doc for "spark.memory.offHeap.size"CodingCat2016-03-071-1/+1
| | | | | | | | | | | | The description of "spark.memory.offHeap.size" in the current document does not clearly state that memory is counted with bytes.... This PR contains a small fix for this tiny issue document fix Author: CodingCat <zhunansjtu@gmail.com> Closes #11561 from CodingCat/master.
* [SPARK-12243][BUILD][PYTHON] PySpark tests are slow in Jenkins.Dongjoon Hyun2016-03-071-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In the Jenkins pull request builder, PySpark tests take around [962 seconds ](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52530/console) of end-to-end time to run, despite the fact that we run four Python test suites in parallel. According to the log, the basic reason is that the long running test starts at the end due to FIFO queue. We first try to reduce the test time by just starting some long running tests first with simple priority queue. ``` ======================================================================== Running PySpark tests ======================================================================== ... Finished test(python3.4): pyspark.streaming.tests (213s) Finished test(pypy): pyspark.sql.tests (92s) Finished test(pypy): pyspark.streaming.tests (280s) Tests passed in 962 seconds ``` ## How was this patch tested? Manual check. Check 'Running PySpark tests' part of the Jenkins log. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11551 from dongjoon-hyun/SPARK-12243.
* [SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins ↵Sameer Agarwal2016-03-078-20/+182
| | | | | | | | | | | | | | | | | | | | | | based on their data constraints ## What changes were proposed in this pull request? This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting `isNotNull` filters is the query plan. These filters are currently inserted beneath existing `Filter` and `Join` operators and are inferred based on their data constraints. Note: While this optimization is applicable to all types of join, it primarily benefits `Inner` and `LeftSemi` joins. ## How was this patch tested? 1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in the query plan for joins and filters. Also, tests interaction with the `CombineFilters` optimizer rules. 2. Test generated ExpressionTrees via `OrcFilterSuite` 3. Test filter source pushdown logic via `SimpleTextHadoopFsRelationSuite` cc yhuai nongli Author: Sameer Agarwal <sameer@databricks.com> Closes #11372 from sameeragarwal/gen-isnotnull.
* [SPARK-13694][SQL] QueryPlan.expressions should always include all expressionsWenchen Fan2016-03-074-7/+3
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? It's weird that expressions don't always have all the expressions in it. This PR marks `QueryPlan.expressions` final to forbid sub classes overriding it to exclude some expressions. Currently only `Generate` override it, we can use `producedAttributes` to fix the unresolved attribute problem for it. Note that this PR doesn't fix the problem in #11497 ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11532 from cloud-fan/generate.
* [SPARK-13651] Generator outputs are not resolved correctly resulting in run ↵Dilip Biswal2016-03-072-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | time error ## What changes were proposed in this pull request? ``` Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src") sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100, 'key2', 200)) t1 AS key, value") ``` Results in following logical plan ``` Project [key#2,value#3] +- Generate explode(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap(key1,100,key2,200)), true, false, Some(genoutput), [key#2,value#3] +- SubqueryAlias src +- Project [_1#0 AS key#2,_2#1 AS value#3] +- LocalRelation [_1#0,_2#1], [[id1,value1]] ``` The above query fails with following runtime error. ``` java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42) at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98) at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96) at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) <stack-trace omitted.....> ``` In this case the generated outputs are wrongly resolved from its child (LocalRelation) due to https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L537-L548 ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Added unit tests in hive/SQLQuerySuite and AnalysisSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #11497 from dilipbiswal/spark-13651.
* Fixing the type of the sentiment happiness valueYury Liavitski2016-03-071-2/+2
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Added the conversion to int for the 'happiness value' read from the file. Otherwise, later on line 75 the multiplication will multiply a string by a number, yielding values like "-2-2" instead of -4. ## How was this patch tested? Tested manually. Author: Yury Liavitski <seconds.before@gmail.com> Author: Yury Liavitski <yury.liavitski@il111.ice.local> Closes #11540 from heliocentrist/fix-sentiment-value-type.
* [SPARK-13705][DOCS] UpdateStateByKey Operation documentation incorrectly ↵rmishra2016-03-071-4/+1
| | | | | | | | | | | | | | refers to StatefulNetworkWordCount ## What changes were proposed in this pull request? The reference to StatefulNetworkWordCount.scala from updateStatesByKey documentation should be removed, till there is a example for updateStatesByKey. ## How was this patch tested? Have tested the new documentation with jekyll build. Author: rmishra <rmishra@pivotal.io> Closes #11545 from rishitesh/SPARK-13705.
* [SPARK-13685][SQL] Rename catalog.Catalog to ExternalCatalogAndrew Or2016-03-079-27/+34
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Today we have `analysis.Catalog` and `catalog.Catalog`. In the future the former will call the latter. When that happens, if both of them are still called `Catalog` it will be very confusing. This patch renames the latter `ExternalCatalog` because it is expected to talk to external systems. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #11526 from andrewor14/rename-catalog.
* [SPARK-13697] [PYSPARK] Fix the missing module name of ↵Shixiong Zhu2016-03-062-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TransformFunctionSerializer.loads ## What changes were proposed in this pull request? Set the function's module name to `__main__` if it's missing in `TransformFunctionSerializer.loads`. ## How was this patch tested? Manually test in the shell. Before this patch: ``` >>> from pyspark.streaming import StreamingContext >>> from pyspark.streaming.util import TransformFunction >>> ssc = StreamingContext(sc, 1) >>> func = TransformFunction(sc, lambda x: x, sc.serializer) >>> func.rdd_wrapper(lambda x: x) TransformFunction(<function <lambda> at 0x106ac8b18>) >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers))) >>> func2 = ssc._transformerSerializer.loads(bytes) >>> print(func2.func.__module__) None >>> print(func2.rdd_wrap_func.__module__) None >>> ``` After this patch: ``` >>> from pyspark.streaming import StreamingContext >>> from pyspark.streaming.util import TransformFunction >>> ssc = StreamingContext(sc, 1) >>> func = TransformFunction(sc, lambda x: x, sc.serializer) >>> func.rdd_wrapper(lambda x: x) TransformFunction(<function <lambda> at 0x108bf1b90>) >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers))) >>> func2 = ssc._transformerSerializer.loads(bytes) >>> print(func2.func.__module__) __main__ >>> print(func2.rdd_wrap_func.__module__) __main__ >>> ``` Author: Shixiong Zhu <shixiong@databricks.com> Closes #11535 from zsxwing/loads-module.
* Revert "[SPARK-13616][SQL] Let SQLBuilder convert logical plan without a ↵Cheng Lian2016-03-062-63/+1
| | | | | | | | | | | | project on top of it" This reverts commit f87ce0504ea0697969ac3e67690c78697b76e94a. According to discussion in #11466, let's revert PR #11466 for safe. Author: Cheng Lian <lian@databricks.com> Closes #11539 from liancheng/revert-pr-11466.
* [SPARK-13693][STREAMING][TESTS] Stop StreamingContext before deleting ↵Shixiong Zhu2016-03-051-1/+1
| | | | | | | | | | | | | | | | | | checkpoint dir ## What changes were proposed in this pull request? Stop StreamingContext before deleting checkpoint dir to avoid the race condition that deleting the checkpoint dir and writing checkpoint happen at the same time. The flaky test log is here: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/ ## How was this patch tested? unit tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #11531 from zsxwing/SPARK-13693.
* [SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping Setsgatorsmile2016-03-054-8/+226
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | #### What changes were proposed in this pull request? This PR is for supporting SQL generation for cube, rollup and grouping sets. For example, a query using rollup: ```SQL SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP ``` Original logical plan: ``` Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46], [(count(1),mode=Complete,isDistinct=false) AS cnt#43L, (key#17L % cast(5 as bigint))#47L AS _c1#45L, grouping__id#46 AS _c2#44] +- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0), List(key#17L, value#18, null, 1)], [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46] +- Project [key#17L, value#18, (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L] +- Subquery t1 +- Relation[key#17L,value#18] ParquetRelation ``` Converted SQL: ```SQL SELECT count( 1) AS `cnt`, (`t1`.`key` % CAST(5 AS BIGINT)), grouping_id() AS `_c2` FROM `default`.`t1` GROUP BY (`t1`.`key` % CAST(5 AS BIGINT)) GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ()) ``` #### How was the this patch tested? Added eight test cases in `LogicalPlanToSQLSuite`. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11283 from gatorsmile/groupingSetsToSQL.
* [SPARK-12073][STREAMING] backpressure rate controller consumes events ↵Jason White2016-03-047-30/+101
| | | | | | | | | | | | | | | | | | | | | | preferentially from lagg… …ing partitions I'm pretty sure this is the reason we couldn't easily recover from an unbalanced Kafka partition under heavy load when using backpressure. `maxMessagesPerPartition` calculates an appropriate limit for the message rate from all partitions, and then divides by the number of partitions to determine how many messages to retrieve per partition. The problem with this approach is that when one partition is behind by millions of records (due to random Kafka issues), but the rate estimator calculates only 100k total messages can be retrieved, each partition (out of say 32) only retrieves max 100k/32=3125 messages. This PR (still needing a test) determines a per-partition desired message count by using the current lag for each partition to preferentially weight the total message limit among the partitions. In this situation, if each partition gets 1k messages, but 1 partition starts 1M behind, then the total number of messages to retrieve is (32 * 1k + 1M) = 1032000 messages, of which the one partition needs 1001000. So, it gets (1001000 / 1032000) = 97% of the 100k messages, and the other 31 partitions share the remaining 3%. Assuming all of 100k the messages are retrieved and processed within the batch window, the rate calculator will increase the number of messages to retrieve in the next batch, until it reaches a new stable point or the backlog is finished processed. We're going to try deploying this internally at Shopify to see if this resolves our issue. tdas koeninger holdenk Author: Jason White <jason.white@shopify.com> Closes #10089 from JasonMWhite/rate_controller_offsets.
* [SPARK-13255] [SQL] Update vectorized reader to directly return ↵Nong Li2016-03-049-45/+284
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ColumnarBatch instead of InternalRows. ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) Currently, the parquet reader returns rows one by one which is bad for performance. This patch updates the reader to directly return ColumnarBatches. This is only enabled with whole stage codegen, which is the only operator currently that is able to consume ColumnarBatches (instead of rows). The current implementation is a bit of a hack to get this to work and we should do more refactoring of these low level interfaces to make this work better. ## How was this patch tested? ``` Results: TPCDS: Best/Avg Time(ms) Rate(M/s) Per Row(ns) --------------------------------------------------------------------------------- q55 (before) 8897 / 9265 12.9 77.2 q55 5486 / 5753 21.0 47.6 ``` Author: Nong Li <nong@databricks.com> Closes #11435 from nongli/spark-13255.
* [SPARK-13459][WEB UI] Separate Alive and Dead Executors in Executor Totals TableAlex Bozarth2016-03-041-39/+45
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Now that dead executors are shown in the executors table (#10058) the totals table is updated to include the separate totals for alive and dead executors as well as the current total, as originally discussed in #10668 ## How was this patch tested? Manually verified by running the Standalone Web UI in the latest Safari and Firefox ESR Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #11381 from ajbozarth/spark13459.
* [SPARK-13633][SQL] Move things into catalyst.parser packageAndrew Or2016-03-0416-18/+25
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch simply moves things to existing package `o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in #11048. This is conceptually the same as a recently merged patch #11482. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #11506 from andrewor14/parser-package.
* [SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.pyXusen Yin2016-03-043-48/+341
| | | | | | | | | | Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`. In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in https://github.com/apache/spark/pull/9884. I'll add them in this PR if https://github.com/apache/spark/pull/9884 gets merged first. Or add a follow-up JIRA for `RFormula`. Author: Xusen Yin <yinxusen@gmail.com> Closes #11203 from yinxusen/SPARK-13036.
* [SPARK-13676] Fix mismatched default values for regParam in LogisticRegressionDongjoon Hyun2016-03-041-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The default value of regularization parameter for `LogisticRegression` algorithm is different in Scala and Python. We should provide the same value. **Scala** ``` scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam res0: Double = 0.0 ``` **Python** ``` >>> from pyspark.ml.classification import LogisticRegression >>> LogisticRegression().getRegParam() 0.1 ``` ## How was this patch tested? manual. Check the following in `pyspark`. ``` >>> from pyspark.ml.classification import LogisticRegression >>> LogisticRegression().getRegParam() 0.0 ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11519 from dongjoon-hyun/SPARK-13676.
* [SPARK-13673][WINDOWS] Fixed not to pollute environment variables.Masayoshi TSUZUKI2016-03-041-2/+1
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch fixes the problem that `bin\beeline.cmd` pollutes environment variables. The similar problem is reported and fixed in https://issues.apache.org/jira/browse/SPARK-3943, but `bin\beeline.cmd` seems to be added later. ## How was this patch tested? manual tests: I executed the new `bin\beeline.cmd` and confirmed that %SPARK_HOME% doesn't remain in the command prompt. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #11516 from tsudukim/feature/SPARK-13673.