path: root/sql/core/src
Commit message | Author | Date | Files | Lines (-/+)
* [SPARK-14145][SQL] Remove the untyped version of Dataset.groupByKey | Reynold Xin | 2016-03-24 | 4 | -99/+1
| | | | | | | | | | | | ## What changes were proposed in this pull request? Dataset has two variants of groupByKey, one for untyped and the other for typed. It actually doesn't make as much sense to have an untyped API here, since apps that want to use untyped APIs should just use the groupBy "DataFrame" API. ## How was this patch tested? This patch removes a method, and removes the associated tests. Author: Reynold Xin <rxin@databricks.com> Closes #11949 from rxin/SPARK-14145.
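To make the remaining split concrete, here is a small hypothetical Scala sketch (the data, names, and `sqlContext.implicits._` import are assumptions, not part of the patch): untyped, relational-style grouping now goes through `groupBy`, while the function-based `groupByKey` stays for typed aggregation.
```scala
import sqlContext.implicits._

case class Person(name: String, age: Int)
val ds = Seq(Person("a", 1), Person("b", 2), Person("a", 3)).toDS()

// Untyped, relational-style grouping: use the DataFrame-style groupBy.
val countsDf = ds.groupBy($"name").count()

// Typed grouping: the function-based groupByKey is still available.
val countsDs = ds.groupByKey(_.name).count()
```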
* [SPARK-14142][SQL] Replace internal use of unionAll with union | Reynold Xin | 2016-03-24 | 9 | -15/+15
| | | | | | | | | | | | ## What changes were proposed in this pull request? unionAll has been deprecated in SPARK-14088. ## How was this patch tested? Should be covered by all existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #11946 from rxin/SPARK-14142.
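A minimal illustration of the rename (the DataFrames here are made up, not from the patch):
```scala
val df1 = sqlContext.range(3).toDF("id")
val df2 = sqlContext.range(3).toDF("id")

val combined = df1.union(df2)     // preferred spelling going forward
val legacy   = df1.unionAll(df2)  // still compiles, but emits a deprecation warning
```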
* [SPARK-13957][SQL] Support Group By Ordinal in SQL | gatorsmile | 2016-03-25 | 2 | -7/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | #### What changes were proposed in this pull request? This PR is to support group by position in SQL. For example, when users input the following query ```SQL select c1 as a, c2, c3, sum(*) from tbl group by 1, 3, c4 ``` The ordinals are recognized as the positions in the select list. Thus, `Analyzer` converts it to ```SQL select c1, c2, c3, sum(*) from tbl group by c1, c3, c4 ``` This is controlled by the config option `spark.sql.groupByOrdinal`. - When true, the ordinal numbers in group by clauses are treated as the position in the select list. - When false, the ordinal numbers are ignored. - Only convert integer literals (not foldable expressions). If found foldable expressions, ignore them. - When the positions specified in the group by clauses correspond to the aggregate functions in select list, output an exception message. - star is not allowed to use in the select list when users specify ordinals in group by Note: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li Also cc all the people who are involved in the previous discussion: rxin cloud-fan marmbrus yhuai hvanhovell adrian-wang chenghao-intel tejasapatil #### How was this patch tested? Added a few test cases for both positive and negative test cases. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11846 from gatorsmile/groupByOrdinal.
* Revert "[SPARK-14014][SQL] Replace existing catalog with SessionCatalog"Andrew Or2016-03-2310-105/+73
| | | | This reverts commit 5dfc01976bb0d72489620b4f32cc12d620bb6260.
* [SPARK-14085][SQL] Star Expansion for Hash | gatorsmile | 2016-03-24 | 2 | -0/+27
| | | | | | | | | | | | | | | | | | | | | | | | #### What changes were proposed in this pull request? This PR is to support star expansion in hash. For example, ```SQL val structDf = testData2.select("a", "b").as("record") structDf.select(hash($"*") ``` In addition, it refactors the codes for the rule `ResolveStar` and fixes a regression for star expansion in group by when using SQL API. For example, ```SQL SELECT * FROM testData2 group by a, b ``` cc cloud-fan Now, the code for star resolution is much cleaner. The coverage is better. Could you check if this refactoring is good? Thanks! #### How was this patch tested? Added a few test cases to cover it. Author: gatorsmile <gatorsmile@gmail.com> Closes #11904 from gatorsmile/starResolution.
* [SPARK-14014][SQL] Replace existing catalog with SessionCatalog | Andrew Or | 2016-03-23 | 10 | -73/+105
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? `SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`. As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely: - SPARK-14013: Properly implement temporary functions in `SessionCatalog` - SPARK-13879: Decide which DDL/DML commands to support natively in Spark - SPARK-?????: Implement the ones we do want to support through `SessionCatalog`. - SPARK-?????: Merge SQL/HiveContext ## How was this patch tested? This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`. Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #11836 from andrewor14/use-session-catalog.
* [SPARK-14078] Streaming Parquet Based FileSink | Michael Armbrust | 2016-03-23 | 14 | -15/+430
| | | | | | | | | | This PR adds a new `Sink` implementation that writes out Parquet files. In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based `DataSource` is initialized for reading, we first check for this log directory and use it instead of file listing when present. Unit tests are added, as well as a stress test that checks the answer after non-deterministic injected failures. Author: Michael Armbrust <michael@databricks.com> Closes #11897 from marmbrus/fileSink.
* [SPARK-13809][SQL] State store for streaming aggregations | Tathagata Das | 2016-03-23 | 11 | -0/+2052
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In this PR, I am implementing a new abstraction for management of streaming state data - State Store. It is a key-value store for persisting running aggregates for aggregate operations in streaming dataframes. The motivation and design is discussed here. https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit# ## How was this patch tested? - [x] Unit tests - [x] Cluster tests **Coverage from unit tests** <img width="952" alt="screen shot 2016-03-21 at 3 09 40 pm" src="https://cloud.githubusercontent.com/assets/663212/13935872/fdc8ba86-ef76-11e5-93e8-9fa310472c7b.png"> ## TODO - [x] Fix updates() iterator to avoid duplicate updates for same key - [x] Use Coordinator in ContinuousQueryManager - [x] Plugging in hadoop conf and other confs - [x] Unit tests - [x] StateStore object lifecycle and methods - [x] StateStoreCoordinator communication and logic - [x] StateStoreRDD fault-tolerance - [x] StateStoreRDD preferred location using StateStoreCoordinator - [ ] Cluster tests - [ ] Whether preferred locations are set correctly - [ ] Whether recovery works correctly with distributed storage - [x] Basic performance tests - [x] Docs Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #11645 from tdas/state-store.
* [SPARK-14015][SQL] Support TimestampType in vectorized parquet reader | Sameer Agarwal | 2016-03-23 | 5 | -18/+39
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR adds support for TimestampType in the vectorized parquet reader ## How was this patch tested? 1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT96)` that made us fall back on parquet-mr for handling timestamps. This condition is now removed. 2. The `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `TimestampType`) fails when the gating condition is removed (https://github.com/apache/spark/pull/11808) and should now pass with this change. Similarly, the `ParquetHiveCompatibilitySuite.SPARK-10177 timestamp` test that fails when the gating condition is removed, should now pass as well. 3. Added tests in `HadoopFsRelationTest` that test both the dictionary encoded and non-encoded versions across all supported datatypes. Author: Sameer Agarwal <sameer@databricks.com> Closes #11882 from sameeragarwal/timestamp-parquet.
* [SPARK-14092] [SQL] move shouldStop() to end of while loop | Davies Liu | 2016-03-23 | 3 | -7/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR rollback some changes in #11274 , which introduced some performance regression when do a simple aggregation on parquet scan with one integer column. Does not really understand how this change introduce this huge impact, maybe related show JIT compiler inline functions. (saw very different stats from profiling). ## How was this patch tested? Manually run the parquet reader benchmark, before this change: ``` Intel(R) Core(TM) i7-4558U CPU 2.80GHz Int and String Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- SQL Parquet Vectorized 2391 / 3107 43.9 22.8 1.0X ``` After this change ``` Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5 Intel(R) Core(TM) i7-4558U CPU 2.80GHz Int and String Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- SQL Parquet Vectorized 2032 / 2626 51.6 19.4 1.0X``` Author: Davies Liu <davies@databricks.com> Closes #11912 from davies/fix_regression.
* [SPARK-14075] Refactor MemoryStore to be testable independent of BlockManager | Josh Rosen | 2016-03-23 | 6 | -4/+16
| | | | | | | | | | | | | This patch refactors the `MemoryStore` so that it can be tested without needing to construct / mock an entire `BlockManager`. - The block manager's serialization- and compression-related methods have been moved from `BlockManager` to `SerializerManager`. - `BlockInfoManager `is now passed directly to classes that need it, rather than being passed via the `BlockManager`. - The `MemoryStore` now calls `dropFromMemory` via a new `BlockEvictionHandler` interface rather than directly calling the `BlockManager`. This change helps to enforce a narrow interface between the `MemoryStore` and `BlockManager` functionality and makes this interface easier to mock in tests. - Several of the block unrolling tests have been moved from `BlockManagerSuite` into a new `MemoryStoreSuite`. Author: Josh Rosen <joshrosen@databricks.com> Closes #11899 from JoshRosen/reduce-memorystore-blockmanager-coupling.
* [SPARK-13817][SQL][MINOR] Renames Dataset.newDataFrame to Dataset.ofRows | Cheng Lian | 2016-03-24 | 19 | -41/+41
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR does the renaming as suggested by marmbrus in [this comment][1]. ## How was this patch tested? Existing tests. [1]: https://github.com/apache/spark/commit/6d37e1eb90054cdb6323b75fb202f78ece604b15#commitcomment-16654694 Author: Cheng Lian <lian@databricks.com> Closes #11889 from liancheng/spark-13817-follow-up.
* [HOTFIX][SQL] Don't stop ContinuousQuery in quietly | Shixiong Zhu | 2016-03-23 | 2 | -25/+12
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Try to fix a flaky hang ## How was this patch tested? Existing Jenkins test Author: Shixiong Zhu <shixiong@databricks.com> Closes #11909 from zsxwing/hotfix2.
* [SPARK-14088][SQL] Some Dataset API touch-up | Reynold Xin | 2016-03-22 | 6 | -38/+41
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? 1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL. 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with rest of the functions in KeyValueGroupedDataset. Also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups. 3. Added a "name" function, which is more natural to name columns than "as" for non-SQL users. 4. Remove "subtract" function since it is just an alias for "except". ## How was this patch tested? All changes should be covered by existing tests. Also added couple test cases to cover "name". Author: Reynold Xin <rxin@databricks.com> Closes #11908 from rxin/SPARK-14088.
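A rough sketch of how the renamed and added methods read after this change (illustrative data; treat the snippet as an assumption-laden example rather than code from the patch):
```scala
import sqlContext.implicits._

val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

// reduce on a grouped Dataset is now reduceGroups.
val sums = ds.groupByKey(_._1).reduceGroups((x, y) => (x._1, x._2 + y._2))

// Column.name is an alternative to as for aliasing outside SQL.
val renamed = ds.select($"_2".name("value"))

// subtract is removed; except covers the same use case.
val diff = ds.toDF().except(ds.toDF().limit(1))
```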
* [MINOR][SQL][DOCS] Update `sql/README.md` and remove some unused imports in `sql` module. | Dongjoon Hyun | 2016-03-22 | 11 | -18/+3
| | | | | | | | | | | | | | | | `sql` module. ## What changes were proposed in this pull request? This PR updates `sql/README.md` according to the latest console output and removes some unused imports in `sql` module. This is done by manually, so there is no guarantee to remove all unused imports. ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11907 from dongjoon-hyun/update_sql_module.
* [SPARK-13401][SQL][TESTS] Fix SQL test warnings. | Yong Tang | 2016-03-22 | 6 | -0/+6
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This fix tries to fix several SQL test warnings under the sql/core/src/test directory. The fixed warnings includes "[unchecked]", "[rawtypes]", and "[varargs]". ## How was this patch tested? All existing tests passed. Author: Yong Tang <yong.tang.github@outlook.com> Closes #11857 from yongtang/SPARK-13401.
* [HOTFIX][SQL] Add a timeout for 'cq.stop' | Shixiong Zhu | 2016-03-22 | 1 | -1/+9
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix an issue that DataFrameReaderWriterSuite may hang forever. ## How was this patch tested? Existing tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11902 from zsxwing/hotfix.
* [SPARK-14060][SQL] Move StringToColumn implicit class into SQLImplicits | Reynold Xin | 2016-03-22 | 3 | -19/+11
| | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch moves StringToColumn implicit class into SQLImplicits. This was kept in SQLContext.implicits object for binary backward compatibility, in the Spark 1.x series. It makes more sense for this API to be in SQLImplicits since that's the single class that defines all the SQL implicits. ## How was this patch tested? Should be covered by existing unit tests. Author: Reynold Xin <rxin@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Closes #11878 from rxin/SPARK-14060.
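For reference, the interpolator this implicit class provides is the familiar `$"..."` column syntax; a short usage sketch (the table and column names are made up):
```scala
import sqlContext.implicits._   // SQLImplicits now carries StringToColumn

val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
df.select($"name").where($"age" > 26).show()
```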
* [SPARK-14063][SQL] SQLContext.range should return Dataset[java.lang.Long] | Reynold Xin | 2016-03-22 | 3 | -7/+16
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch changed the return type for SQLContext.range from `Dataset[Long]` (Scala primitive) to `Dataset[java.lang.Long]` (Java boxed long). Previously, SPARK-13894 changed the return type of range from `Dataset[Row]` to `Dataset[Long]`. The problem is that due to https://issues.scala-lang.org/browse/SI-4388, Scala compiles primitive types in generics into just Object, i.e. range at bytecode level now just returns `Dataset[Object]`. This is really bad for Java users because they are losing type safety and also need to add a type cast every time they use range. Talked to Jason Zaugg from Lightbend (Typesafe) who suggested the best approach is to return `Dataset[java.lang.Long]`. The downside is that when Scala users want to explicitly type a closure used on the dataset returned by range, they would need to use `java.lang.Long` instead of the Scala `Long`. ## How was this patch tested? The signature change should be covered by existing unit tests and API tests. I also added a new test case in DatasetSuite for range. Author: Reynold Xin <rxin@databricks.com> Closes #11880 from rxin/SPARK-14063.
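A small sketch of the consequence for Scala callers (assumed, not taken from the patch): an explicitly typed closure over the result of `range` must now name `java.lang.Long`.
```scala
import sqlContext.implicits._

val ids = sqlContext.range(5)                        // Dataset[java.lang.Long]
val doubled = ids.map((x: java.lang.Long) => x * 2)  // unboxes to scala.Long inside the closure
doubled.show()
```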
* [SPARK-13985][SQL] Deterministic batches with ids | Michael Armbrust | 2016-03-22 | 19 | -241/+319
| | | | | | | | | | | | This PR relaxes the requirements of a `Sink` for structured streaming to only require idempotent appending of data. Previously the `Sink` needed to be able to transactionally append data while recording an opaque offset indicated how far in a stream we have processed. In order to do this, a new write-ahead-log has been added to stream execution, which records the offsets that will are present in each batch. The log is created in the newly added `checkpointLocation`, which defaults to `${spark.sql.streaming.checkpointLocation}/${queryName}` but can be overriden by setting `checkpointLocation` in `DataFrameWriter`. In addition to making sinks easier to write the addition of batchIds and a checkpoint location is done in anticipation of integration with the the `StateStore` (#11645). Author: Michael Armbrust <michael@databricks.com> Closes #11804 from marmbrus/batchIds.
* [SPARK-13774][SQL] - Improve error message for non-existent paths and add tests | Sunitha Kambhampati | 2016-03-22 | 3 | -2/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SPARK-13774: IllegalArgumentException: Can not create a Path from an empty string for incorrect file path **Overview:** - If a non-existent path is given in this call `` scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv") `` it throws the following error: `java.lang.IllegalArgumentException: Can not create a Path from an empty string` ….. `It gets called from inferSchema call in org.apache.spark.sql.execution.datasources.DataSource.resolveRelation` - The purpose of this JIRA is to throw a better error message. - With the fix, you will now get a _Path does not exist_ error message. ``` scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv") org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/ksunitha/trunk/spark/file-path-is-incorrect.csv; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:215) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:204) ... at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:204) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141) ... 49 elided ``` **Details** _Changes include:_ - Check if path exists or not in resolveRelation in DataSource, and throw an AnalysisException with message like “Path does not exist: $path” - AnalysisException is thrown similar to the exceptions thrown in resolveRelation. - The glob path and the non glob path is checked with minimal calls to path exists. If the globPath is empty, then it is a nonexistent glob pattern and an error will be thrown. In the scenario that it is not globPath, it is necessary to only check if the first element in the Seq is valid or not. _Test modifications:_ - Changes went in for 3 tests to account for this error checking. - SQLQuerySuite:test("run sql directly on files") – Error message needed to be updated. - 2 tests failed in MetastoreDataSourcesSuite because they had a dummy path and so test is modified to give a tempdir and allow it to move past so it can continue to test the codepath it meant to test _New Tests:_ 2 new tests are added to DataFrameSuite to validate that glob and non-glob path will throw the new error message. _Testing:_ Unit tests were run with the fix. **Notes/Questions to reviewers:** - There is some code duplication in DataSource.scala in resolveRelation method and also createSource with respect to getting the paths. I have not made any changes to the createSource codepath. Should we make the change there as well ? - From other JIRAs, I know there is restructuring and changes going on in this area, not sure how that will affect these changes, but since this seemed like a starter issue, I looked into it. If we prefer not to add the overhead of the checks, or if there is a better place to do so, let me know. I would appreciate your review. Thanks for your time and comments. Author: Sunitha Kambhampati <skambha@us.ibm.com> Closes #11775 from skambha/improve_errmsg.
* [SPARK-13953][SQL] Specifying the field name for corrupted record via option at JSON datasource | hyukjinkwon | 2016-03-22 | 4 | -6/+46
| | | | | | | | | | | | | | | | | | | | | at JSON datasource ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13953 Currently, JSON data source creates a new field in `PERMISSIVE` mode for storing malformed string. This field can be renamed via `spark.sql.columnNameOfCorruptRecord` option but it is a global configuration. This PR make that option can be applied per read and can be specified via `option()`. This will overwrites `spark.sql.columnNameOfCorruptRecord` if it is set. ## How was this patch tested? Unit tests were used and `./dev/run_tests` for coding style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11881 from HyukjinKwon/SPARK-13953.
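A hypothetical usage sketch of the per-read option (the option key is assumed to mirror the global `spark.sql.columnNameOfCorruptRecord` setting; the path and column names are made up):
```scala
val df = sqlContext.read
  .option("columnNameOfCorruptRecord", "_malformed")  // overrides the global conf for this read
  .json("path/to/records.json")

// Rows that failed to parse keep their raw text in the chosen column.
df.filter(df("_malformed").isNotNull).show()
```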
* [SPARK-14038][SQL] enable native view by default | Wenchen Fan | 2016-03-22 | 1 | -1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? As we have completed the `SQLBuilder`, we can safely turn on native view by default. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11872 from cloud-fan/native-view.
* [SPARK-13883][SQL] Parquet Implementation of FileFormat.buildReader | Michael Armbrust | 2016-03-21 | 12 | -46/+365
| | | | | | | | | | | | This PR add implements the new `buildReader` interface for the Parquet `FileFormat`. An simple implementation of `FileScanRDD` is also included. This code should be tested by the many existing tests for parquet. Author: Michael Armbrust <michael@databricks.com> Author: Sameer Agarwal <sameer@databricks.com> Author: Nong Li <nong@databricks.com> Closes #11709 from marmbrus/parquetReader.
* [SPARK-14016][SQL] Support high-precision decimals in vectorized parquet reader | Sameer Agarwal | 2016-03-21 | 2 | -4/+13
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch adds support for reading `DecimalTypes` with high (> 18) precision in `VectorizedColumnReader` ## How was this patch tested? 1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getDecimalMetadata().getPrecision() > Decimal.MAX_LONG_DIGITS()` that made us fall back on parquet-mr for handling high-precision decimals. This condition is now removed. 2. In particular, the `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `DecimalType(25, 5)`) fails when the gating condition is removed (https://github.com/apache/spark/pull/11808) and should now pass with this change. Author: Sameer Agarwal <sameer@databricks.com> Closes #11869 from sameeragarwal/bigdecimal-parquet.
* [SPARK-13320][SQL] Support Star in CreateStruct/CreateArray and Error Handling when DataFrame/DataSet Functions using Star | gatorsmile | 2016-03-22 | 1 | -7/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Handling when DataFrame/DataSet Functions using Star This PR resolves two issues: First, expanding * inside aggregate functions of structs when using Dataframe/Dataset APIs. For example, ```scala structDf.groupBy($"a").agg(min(struct($"record.*"))) ``` Second, it improves the error messages when having invalid star usage when using Dataframe/Dataset APIs. For example, ```scala pagecounts4PartitionsDS .map(line => (line._1, line._3)) .toDF() .groupBy($"_1") .agg(sum("*") as "sumOccurances") ``` Before the fix, the invalid usage will issue a confusing error message, like: ``` org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns _1, _2; ``` After the fix, the message is like: ``` org.apache.spark.sql.AnalysisException: Invalid usage of '*' in function 'sum' ``` cc: rxin nongli cloud-fan Author: gatorsmile <gatorsmile@gmail.com> Closes #11208 from gatorsmile/sumDataSetResolution.
* [SPARK-13898][SQL] Merge DatasetHolder and DataFrameHolder | Reynold Xin | 2016-03-21 | 10 | -126/+24
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch merges DatasetHolder and DataFrameHolder. This makes more sense because DataFrame/Dataset are now one class. In addition, fixed some minor issues with pull request #11732. ## How was this patch tested? Updated existing unit tests that test these implicits. Author: Reynold Xin <rxin@databricks.com> Closes #11737 from rxin/SPARK-13898.
* [SPARK-13916][SQL] Add a metric to WholeStageCodegen to measure duration. | Nong Li | 2016-03-21 | 6 | -14/+97
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? WholeStageCodegen naturally breaks the execution into pipelines that are easier to measure duration. This is more granular than the task timings (a task can be multiple pipelines) and is integrated with the web ui. We currently report total time (across all tasks), min/mask/median to get a sense of how long each is taking. ## How was this patch tested? Manually tested looking at the web ui. Author: Nong Li <nong@databricks.com> Closes #11741 from nongli/spark-13916.
* [SPARK-14004][FOLLOW-UP] Implementations of NonSQLExpression should not override sql method | Wenchen Fan | 2016-03-21 | 1 | -3/+1
| | | | | | | | | | | | | | | | override sql method ## What changes were proposed in this pull request? There is only one exception: `PythonUDF`. However, I don't think the `PythonUDF#` prefix is useful, as we can only create python udf under python context. This PR removes the `PythonUDF#` prefix from `PythonUDF.toString`, so that it doesn't need to overrde `sql`. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11859 from cloud-fan/tmp.
* [SPARK-13805] [SQL] Generate code that get a value in each column from ColumnVector when ColumnarBatch is used | Kazuaki Ishizaki | 2016-03-21 | 9 | -33/+77
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ColumnVector when ColumnarBatch is used ## What changes were proposed in this pull request? This PR generates code that get a value in each column from ```ColumnVector``` instead of creating ```InternalRow``` when ```ColumnarBatch``` is accessed. This PR improves benchmark program by up to 15%. This PR consists of two parts: 1. Get an ```ColumnVector ``` by using ```ColumnarBatch.column()``` method 2. Get a value of each column by using ```rdd_col${COLIDX}.getInt(ROWIDX)``` instead of ```rdd_row.getInt(COLIDX)``` This is a motivated example. ```` sqlContext.conf.setConfString(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true") sqlContext.conf.setConfString(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true") val values = 10 withTempPath { dir => withTempTable("t1", "tempTable") { sqlContext.range(values).registerTempTable("t1") sqlContext.sql("select id % 2 as p, cast(id as INT) as id from t1") .write.partitionBy("p").parquet(dir.getCanonicalPath) sqlContext.read.parquet(dir.getCanonicalPath).registerTempTable("tempTable") sqlContext.sql("select sum(p) from tempTable").collect } } ```` The original code ````java ... /* 072 */ while (!shouldStop() && rdd_batchIdx < numRows) { /* 073 */ InternalRow rdd_row = rdd_batch.getRow(rdd_batchIdx++); /* 074 */ /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false)], output=[sum#10L]) */ /* 075 */ /* input[0, int] */ /* 076 */ boolean rdd_isNull = rdd_row.isNullAt(0); /* 077 */ int rdd_value = rdd_isNull ? -1 : (rdd_row.getInt(0)); ... ```` The code generated by this PR ````java /* 072 */ while (!shouldStop() && rdd_batchIdx < numRows) { /* 073 */ org.apache.spark.sql.execution.vectorized.ColumnVector rdd_col0 = rdd_batch.column(0); /* 074 */ /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false)], output=[sum#10L]) */ /* 075 */ /* input[0, int] */ /* 076 */ boolean rdd_isNull = rdd_col0.getIsNull(rdd_batchIdx); /* 077 */ int rdd_value = rdd_isNull ? -1 : (rdd_col0.getInt(rdd_batchIdx)); ... /* 128 */ rdd_batchIdx++; /* 129 */ } /* 130 */ if (shouldStop()) return; ```` Performance Without this PR ```` model name : Intel(R) Xeon(R) CPU E5-2667 v2 3.30GHz Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- Read data column 434 / 488 36.3 27.6 1.0X Read partition column 302 / 346 52.1 19.2 1.4X Read both columns 588 / 643 26.8 37.4 0.7X ```` With this PR ```` model name : Intel(R) Xeon(R) CPU E5-2667 v2 3.30GHz Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- Read data column 392 / 516 40.1 24.9 1.0X Read partition column 256 / 318 61.4 16.3 1.5X Read both columns 523 / 539 30.1 33.3 0.7X ```` ## How was this patch tested? Tested by existing test suites and benchmark Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #11636 from kiszk/SPARK-13805.
* [SPARK-14007] [SQL] Manage the memory used by hash map in shuffled hash join | Davies Liu | 2016-03-21 | 4 | -132/+51
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR try acquire the memory for hash map in shuffled hash join, fail the task if there is no enough memory (otherwise it could OOM the executor). It also removed unused HashedRelation. ## How was this patch tested? Existing unit tests. Manual tests with TPCDS Q78. Author: Davies Liu <davies@databricks.com> Closes #11826 from davies/cleanup_hash2.
* [SPARK-13826][SQL] Ad-hoc Dataset API ScalaDoc fixes | Cheng Lian | 2016-03-21 | 1 | -18/+21
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Ad-hoc Dataset API ScalaDoc fixes ## How was this patch tested? By building and checking ScalaDoc locally. Author: Cheng Lian <lian@databricks.com> Closes #11862 from liancheng/ds-doc-fixes.
* [SPARK-14000][SQL] case class with a tuple field can't work in Dataset | Wenchen Fan | 2016-03-21 | 1 | -2/+11
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? When we validate an encoder, we may call `dataType` on unresolved expressions. This PR fix the validation so that we will resolve attributes first. ## How was this patch tested? a new test in `DatasetSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11816 from cloud-fan/encoder.
* [SPARK-12789][SQL] Support Order By Ordinal in SQL | gatorsmile | 2016-03-21 | 2 | -0/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | #### What changes were proposed in this pull request? This PR is to support order by position in SQL, e.g. ```SQL select c1, c2, c3 from tbl order by 1 desc, 3 ``` should be equivalent to ```SQL select c1, c2, c3 from tbl order by c1 desc, c3 asc ``` This is controlled by config option `spark.sql.orderByOrdinal`. - When true, the ordinal numbers are treated as the position in the select list. - When false, the ordinal number in order/sort By clause are ignored. - Only convert integer literals (not foldable expressions). If found foldable expressions, ignore them - This also works with select *. **Question**: Do we still need sort by columns that contain zero reference? In this case, it will have no impact on the sorting results. IMO, we should not allow users do it. rxin cloud-fan marmbrus yhuai hvanhovell -- Update: In these cases, they are ignored in this case. **Note**: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li Also cc all the people who are involved in the previous discussion: adrian-wang chenghao-intel tejasapatil #### How was this patch tested? Added a few test cases for both positive and negative test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #11815 from gatorsmile/orderByPosition.
* [MINOR][DOCS] Add proper periods and spaces for CLI help messages and `config` doc. | Dongjoon Hyun | 2016-03-21 | 1 | -12/+12
| | | | | | | | | | | | | | | | `config` doc. ## What changes were proposed in this pull request? This PR adds some proper periods and spaces to Spark CLI help messages and SQL/YARN conf docs for consistency. ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11848 from dongjoon-hyun/add_proper_period_and_space.
* [SPARK-14011][CORE][SQL] Enable `LineLength` Java checkstyle rule | Dongjoon Hyun | 2016-03-21 | 8 | -126/+128
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? [Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`. ```xml - <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places --> - <!-- <module name="LineLength"> <property name="max" value="100"/> <property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/> </module> - --> <module name="NoLineWrap"/> <module name="EmptyBlock"> <property name="option" value="TEXT"/> -167,5 +164,7 </module> <module name="CommentsIndentation"/> <module name="UnusedImports"/> + <module name="RedundantImport"/> + <module name="RedundantModifier"/> ``` ## How was this patch tested? Currently, `lint-java` is disabled in Jenkins. It needs a manual test. After passing the Jenkins tests, `dev/lint-java` should passes locally. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11831 from dongjoon-hyun/SPARK-14011.
* [SPARK-13764][SQL] Parse modes in JSON data source | hyukjinkwon | 2016-03-21 | 7 | -45/+156
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently, there is no way to control the behaviour when fails to parse corrupt records in JSON data source . This PR adds the support for parse modes just like CSV data source. There are three modes below: - `PERMISSIVE` : When it fails to parse, this sets `null` to to field. This is a default mode when it has been this mode. - `DROPMALFORMED`: When it fails to parse, this drops the whole record. - `FAILFAST`: When it fails to parse, it just throws an exception. This PR also make JSON data source share the `ParseModes` in CSV data source. ## How was this patch tested? Unit tests were used and `./dev/run_tests` for code style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11756 from HyukjinKwon/SPARK-13764.
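Illustrative usage of the new option, with the mode spellings listed in the message above (the file path is made up):
```scala
val permissive = sqlContext.read.option("mode", "PERMISSIVE").json("events.json")
val dropBad    = sqlContext.read.option("mode", "DROPMALFORMED").json("events.json")
val strict     = sqlContext.read.option("mode", "FAILFAST").json("events.json")
```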
* [SPARK-13897][SQL] RelationalGroupedDataset and KeyValueGroupedDataset | Reynold Xin | 2016-03-19 | 4 | -67/+69
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Previously, Dataset.groupBy returns a GroupedData, and Dataset.groupByKey returns a GroupedDataset. The naming is very similar, and unfortunately does not convey the real differences between the two. Assume we are grouping by some keys (K). groupByKey is a key-value style group by, in which the schema of the returned dataset is a tuple of just two fields: key and value. groupBy, on the other hand, is a relational style group by, in which the schema of the returned dataset is flattened and contain |K| + |V| fields. This pull request also removes the experimental tag from RelationalGroupedDataset. It has been with DataFrame since 1.3, and we have enough confidence now to stabilize it. ## How was this patch tested? This is a rename to improve API understandability. Should be covered by all existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #11841 from rxin/SPARK-13897.
* [SPARK-14018][SQL] Use 64-bit num records in BenchmarkWholeStageCodegen | Reynold Xin | 2016-03-19 | 1 | -4/+4
| | | | | | | | | | | | ## What changes were proposed in this pull request? 500L << 20 is actually pretty close to 32-bit int limit. I was trying to increase this to 500L << 23 and got negative numbers instead. ## How was this patch tested? I'm only modifying test code. Author: Reynold Xin <rxin@databricks.com> Closes #11839 from rxin/SPARK-14018.
* [SPARK-14012][SQL] Extract VectorizedColumnReader from VectorizedParquetRecordReader | Sameer Agarwal | 2016-03-18 | 2 | -450/+476
| | | | | | | | | | | | | | | | VectorizedParquetRecordReader ## What changes were proposed in this pull request? This is a minor followup on https://github.com/apache/spark/pull/11799 that extracts out the `VectorizedColumnReader` from `VectorizedParquetRecordReader` into its own file. ## How was this patch tested? N/A (refactoring only) Author: Sameer Agarwal <sameer@databricks.com> Closes #11834 from sameeragarwal/rename.
* [SPARK-13989] [SQL] Remove non-vectorized/unsafe-row parquet record reader | Sameer Agarwal | 2016-03-18 | 8 | -364/+75
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR cleans up the new parquet record reader with the following changes: 1. Removes the non-vectorized parquet reader code from `UnsafeRowParquetRecordReader`. 2. Removes the non-vectorized column reader code from `ColumnReader`. 3. Renames `UnsafeRowParquetRecordReader` to `VectorizedParquetRecordReader` and `ColumnReader` to `VectorizedColumnReader` 4. Deprecate `PARQUET_UNSAFE_ROW_RECORD_READER_ENABLED` ## How was this patch tested? Refactoring only; Existing tests should reveal any problems. Author: Sameer Agarwal <sameer@databricks.com> Closes #11799 from sameeragarwal/vectorized-parquet.
* [SPARK-13977] [SQL] Brings back Shuffled hash join | Davies Liu | 2016-03-18 | 12 | -117/+276
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? ShuffledHashJoin (also outer join) is removed in 1.6, in favor of SortMergeJoin, which is more robust and also fast. ShuffledHashJoin is still useful in this case: 1) one table is much smaller than the other one, then cost to build a hash table on smaller table is smaller than sorting the larger table 2) any partition of the small table could fit in memory. This PR brings back ShuffledHashJoin, basically revert #9645, and fix the conflict. Also merging outer join and left-semi join into the same class. This PR does not implement full outer join, because it's not implemented efficiently (requiring build hash table on both side). A simple benchmark (one table is 5x smaller than other one) show that ShuffledHashJoin could be 2X faster than SortMergeJoin. ## How was this patch tested? Added new unit tests for ShuffledHashJoin. Author: Davies Liu <davies@databricks.com> Closes #11788 from davies/shuffle_join.
* [SPARK-13826][SQL] Addendum: update documentation for Datasets | Reynold Xin | 2016-03-18 | 4 | -31/+70
| | | | | | | | | | | | ## What changes were proposed in this pull request? This patch updates documentations for Datasets. I also updated some internal documentation for exchange/broadcast. ## How was this patch tested? Just documentation/api stability update. Author: Reynold Xin <rxin@databricks.com> Closes #11814 from rxin/dataset-docs.
* [SPARK-13930] [SQL] Apply fast serialization on collect limit operator | Liang-Chi Hsieh | 2016-03-17 | 2 | -28/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? JIRA: https://issues.apache.org/jira/browse/SPARK-13930 Recently the fast serialization has been introduced to collecting DataFrame/Dataset (#11664). The same technology can be used on collect limit operator too. ## How was this patch tested? Add a benchmark for collect limit to `BenchmarkWholeStageCodegen`. Without this patch: model name : Westmere E56xx/L56xx/X56xx (Nehalem-C) collect limit: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- collect limit 1 million 3413 / 3768 0.3 3255.0 1.0X collect limit 2 millions 9728 / 10440 0.1 9277.3 0.4X With this patch: model name : Westmere E56xx/L56xx/X56xx (Nehalem-C) collect limit: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- collect limit 1 million 833 / 1284 1.3 794.4 1.0X collect limit 2 millions 3348 / 4005 0.3 3193.3 0.2X Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #11759 from viirya/execute-take.
* [SPARK-13826][SQL] Revises Dataset ScalaDoc | Cheng Lian | 2016-03-17 | 1 | -319/+522
| | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR revises Dataset API ScalaDoc. All public methods are divided into the following groups * `groupname basic`: Basic Dataset functions * `groupname action`: Actions * `groupname untypedrel`: Untyped Language Integrated Relational Queries * `groupname typedrel`: Typed Language Integrated Relational Queries * `groupname func`: Functional Transformations * `groupname rdd`: RDD Operations * `groupname output`: Output Operations `since` tag and sample code are also updated. We may want to add more sample code for typed APIs. ## How was this patch tested? Documentation change. Checked by building unidoc locally. Author: Cheng Lian <lian@databricks.com> Closes #11769 from liancheng/spark-13826-ds-api-doc.
* [SPARK-13427][SQL] Support USING clause in JOIN. | Dilip Biswal | 2016-03-17 | 2 | -34/+69
| | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Support queries that JOIN tables with USING clause. SELECT * from table1 JOIN table2 USING <column_list> USING clause can be used as a means to simplify the join condition when : 1) Equijoin semantics is desired and 2) The column names in the equijoin have the same name. We already have the support for Natural Join in Spark. This PR makes use of the already existing infrastructure for natural join to form the join condition and also the projection list. ## How was the this patch tested? Have added unit tests in SQLQuerySuite, CatalystQlSuite, ResolveNaturalJoinSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #11297 from dilipbiswal/spark-13427.
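A concrete example of the syntax this enables (the table and column names are hypothetical):
```scala
// USING (dept_id) is shorthand for an equijoin on dept_id, with the join column
// appearing once in the output.
val joined = sqlContext.sql(
  "SELECT * FROM employees JOIN departments USING (dept_id)")
```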
* [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging | Wenchen Fan | 2016-03-17 | 47 | -45/+60
| | | | | | | | | | | | | | | | org.apache.spark.internal.Logging ## What changes were proposed in this pull request? Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11764 from cloud-fan/logger.
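For applications that mixed in `org.apache.spark.Logging`, the message above notes they can declare an equivalent trait themselves; a minimal SLF4J-based sketch (an assumption, not code from this patch):
```scala
import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  // Lazily resolve a logger named after the concrete class mixing this in.
  @transient protected lazy val log: Logger = LoggerFactory.getLogger(getClass.getName)
  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
}
```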
* [SPARK-13926] Automatically use Kryo serializer when shuffling RDDs with simple types | Josh Rosen | 2016-03-16 | 2 | -3/+3
| | | | | | | | | | | | | | simple types Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo. This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers. In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places). Author: Josh Rosen <joshrosen@databricks.com> Closes #11755 from JoshRosen/automatically-pick-best-serializer.
* [SPARK-12855][MINOR][SQL][DOC][TEST] remove spark.sql.dialect from doc and test | Daoyuan Wang | 2016-03-16 | 1 | -1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Since developer API of plug-able parser has been removed in #10801 , docs should be updated accordingly. ## How was this patch tested? This patch will not affect the real code path. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #11758 from adrian-wang/spark12855.
* [MINOR][SQL][BUILD] Remove duplicated lines | Dongjoon Hyun | 2016-03-16 | 1 | -1/+0
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR removes three minor duplicated lines. First one is making the following unreachable code warning. ``` JoinSuite.scala:52: unreachable code [warn] case j: BroadcastHashJoin => j ``` The other two are just consecutive repetitions in `Seq` of MiMa filters. ## How was this patch tested? Pass the existing Jenkins test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11773 from dongjoon-hyun/remove_duplicated_line.