path: root/sql
* [SPARK-11455][SQL] fix case sensitivity of partition by (Wenchen Fan, 2015-11-03, 4 files, -11/+39)
  Depend on `caseSensitive` to do the column name equality check, instead of just `==`.
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9410 from cloud-fan/partition.
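  To make the behavioral change concrete, here is a minimal sketch (the column name and output path are made up, and it assumes a `SQLContext` named `sqlContext` with case-insensitive resolution): with this fix the partition column is matched through the analyzer's resolution rule rather than a plain `==`, so a differently-cased name still resolves.
  ```scala
  // Sketch only: assumes a SQLContext named sqlContext and a writable /tmp path.
  import sqlContext.implicits._

  val df = Seq((2015, "a"), (2016, "b")).toDF("YEAR", "value")

  // The partition column is given in a different case than the schema.
  // With caseSensitive = false this should resolve to the "YEAR" column
  // instead of failing a literal string comparison.
  df.write.partitionBy("year").parquet("/tmp/partitioned_by_year")
  ```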
* [SPARK-11329] [SQL] Cleanup from spark-11329 fix. (Nong, 2015-11-03, 4 files, -52/+55)
  Author: Nong <nong@cloudera.com>. Closes #9442 from nongli/spark-11483.
* [SPARK-11489][SQL] Only include common first order statistics in GroupedData (Reynold Xin, 2015-11-03, 2 files, -119/+28)
  We added a bunch of higher order statistics such as skewness and kurtosis to GroupedData. I don't think they are common enough to justify being listed, since users can always use the normal statistics aggregate functions. That is to say, after this change, we won't support
  ```scala
  df.groupBy("key").kurtosis("colA", "colB")
  ```
  However, we will still support
  ```scala
  df.groupBy("key").agg(kurtosis(col("colA")), kurtosis(col("colB")))
  ```
  Author: Reynold Xin <rxin@databricks.com>. Closes #9446 from rxin/SPARK-11489.
* [SPARK-11477] [SQL] support create Dataset from RDD (Wenchen Fan, 2015-11-04, 3 files, -0/+20)
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9434 from cloud-fan/rdd2ds and squashes the following commits: 0892d72 [Wenchen Fan] support create Dataset from RDD
* [SPARK-11467][SQL] add Python API for stddev/variance (Davies Liu, 2015-11-03, 1 file, -67/+0)
  Add Python API for stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis.
  Author: Davies Liu <davies@databricks.com>. Closes #9424 from davies/py_var.
* [SPARK-10978][SQL] Allow data sources to eliminate filters (Cheng Lian, 2015-11-03, 6 files, -68/+315)
  This PR adds a new method `unhandledFilters` to `BaseRelation`. Data sources which implement this method properly may avoid the overhead of defensive filtering done by Spark SQL.
  Author: Cheng Lian <lian@databricks.com>. Closes #9399 from liancheng/spark-10978.unhandled-filters.
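  A rough sketch of how a data source might implement the new hook; the relation class and the choice to handle only equality filters are illustrative, not taken from the PR:
  ```scala
  import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter}

  // Sketch: a relation that evaluates equality filters itself and hands
  // everything else back to Spark SQL for defensive re-filtering.
  abstract class EqualityOnlyRelation extends BaseRelation {
    override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
      filters.filterNot(_.isInstanceOf[EqualTo]) // only EqualTo is fully handled here
  }
  ```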
* [SPARK-10304] [SQL] Partition discovery should throw an exception if the dir structure is invalid (Liang-Chi Hsieh, 2015-11-03, 2 files, -13/+59)
  JIRA: https://issues.apache.org/jira/browse/SPARK-10304
  This patch detects if the structure of partition directories is not valid. The test cases are from #8547. Thanks zhzhan. cc liancheng
  Author: Liang-Chi Hsieh <viirya@appier.com>. Closes #8840 from viirya/detect_invalid_part_dir.
* [SPARK-10533][SQL] handle scientific notation in sqlParser (Daoyuan Wang, 2015-11-03, 3 files, -5/+32)
  https://issues.apache.org/jira/browse/SPARK-10533
  val df = sqlContext.createDataFrame(Seq(("a",1.0),("b",2.0),("c",3.0)))
  df.filter("_2 < 2.0e1").show
  Scientific notation didn't work.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>. Closes #9085 from adrian-wang/scinotation.
* [SPARK-11404] [SQL] Support for groupBy using column expressions (Michael Armbrust, 2015-11-03, 3 files, -6/+106)
  This PR adds a new method `groupBy(cols: Column*)` to `Dataset` that allows users to group using column expressions instead of a lambda function. Since the return type of these expressions is not known at compile time, we just set the key type as a generic `Row`. If the user would like to work with the key in a type-safe way, they can call `grouped.asKey[Type]`, which is also added in this PR.
  ```scala
  val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
  val grouped = ds.groupBy($"_1").asKey[String]
  val agged = grouped.mapGroups { case (g, iter) => Iterator((g, iter.map(_._2).sum)) }
  agged.collect()

  res0: Array(("a", 30), ("b", 3), ("c", 1))
  ```
  Author: Michael Armbrust <michael@databricks.com>. Closes #9359 from marmbrus/columnGroupBy and squashes the following commits: bbcb03b [Michael Armbrust] Update DatasetSuite.scala; 8fd2908 [Michael Armbrust] Update DatasetSuite.scala; 0b0e2f8 [Michael Armbrust] [SPARK-11404] [SQL] Support for groupBy using column expressions
* [SPARK-11436] [SQL] rebind right encoder when join 2 datasets (Wenchen Fan, 2015-11-03, 2 files, -1/+11)
  When we join 2 datasets, we combine the 2 encoders into a tupled one and use it as the encoder for the joined dataset. Assume both encoders are flat; their `constructExpression`s both reference the first element of the input row. However, when we combine the 2 encoders, the schema of the input row changes, and the right encoder should now reference the second element of the input row. So we should rebind the right encoder to let it know the new schema of the input row before combining it.
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9391 from cloud-fan/join and squashes the following commits: 846d3ab [Wenchen Fan] rebind right encoder when join 2 datasets
* [SPARK-10429] [SQL] make mutableProjection atomic (Davies Liu, 2015-11-03, 3 files, -98/+97)
  Right now, SQL's mutable projection updates every value of the mutable projection after it evaluates the corresponding expression. This makes the behavior of MutableProjection confusing and complicates the implementation of common aggregate functions like stddev, because developers need to be aware that when evaluating the {{i+1}}th expression of a mutable projection, the {{i}}th slot of the mutable row has already been updated. This PR makes the MutableProjection atomic by generating all the results of the expressions first, then copying them into mutableRow. A micro-benchmark showed no notable performance difference between using class members and local variables. cc yhuai
  Author: Davies Liu <davies@databricks.com>. Closes #9422 from davies/atomic_mutable and squashes the following commits: bbc1758 [Davies Liu] support wide table; 8a0ae14 [Davies Liu] fix bug; bec07da [Davies Liu] refactor; 2891628 [Davies Liu] make mutableProjection atomic
* [SPARK-9858][SPARK-9859][SPARK-9861][SQL] Add an ExchangeCoordinator to estimate the number of post-shuffle partitions for aggregates and joins (Yin Huai, 2015-11-03, 9 files, -44/+1115)
  https://issues.apache.org/jira/browse/SPARK-9858
  https://issues.apache.org/jira/browse/SPARK-9859
  https://issues.apache.org/jira/browse/SPARK-9861
  Author: Yin Huai <yhuai@databricks.com>. Closes #9276 from yhuai/numReducer.
* [SPARK-9034][SQL] Reflect field names defined in GenericUDTF (navis.ryu, 2015-11-02, 16 files, -17/+34)
  Hive's GenericUDTF#initialize() defines field names in the returned schema, but the current HiveGenericUDTF drops these names. We might need to reflect them in the logical plan tree.
  Author: navis.ryu <navis@apache.org>. Closes #8456 from navis/SPARK-9034.
* [SPARK-11469][SQL] Allow users to define nondeterministic udfs. (Yin Huai, 2015-11-02, 5 files, -78/+215)
  This is the first task (https://issues.apache.org/jira/browse/SPARK-11469) of https://issues.apache.org/jira/browse/SPARK-11438.
  Author: Yin Huai <yhuai@databricks.com>. Closes #9393 from yhuai/udfNondeterministic.
* [SPARK-11329][SQL] Support star expansion for structs. (Nong Li, 2015-11-02, 6 files, -38/+230)
  1. Support expanding structs in projections, i.e. "SELECT s.*" where s is a struct type. This is fixed by allowing the expand function to handle structs in addition to tables.
  2. Support expanding * inside aggregate functions on structs, e.g. "SELECT max(struct(col1, structCol.*))". This requires recursively expanding the expressions; in this case it is the aggregate expression "max(...)", and we need to recursively expand its children inputs.
  Author: Nong Li <nongli@gmail.com>. Closes #9343 from nongli/spark-11329.
* [SPARK-5354][SQL] Cached tables should preserve partitioning and ordering. (Nong Li, 2015-11-02, 3 files, -9/+97)
  For cached tables, we can just maintain the partitioning and ordering from the source relation.
  Author: Nong Li <nongli@gmail.com>. Closes #9404 from nongli/spark-5354.
* [SPARK-11371] Make "mean" an alias for "avg" operator (tedyu, 2015-11-02, 2 files, -0/+10)
  From Reynold in the thread 'Exception when using some aggregate operators' (http://search-hadoop.com/m/q3RTt0xFr22nXB4/): I don't think these are bugs. The SQL standard for average is "avg", not "mean". Similarly, a distinct count is supposed to be written as "count(distinct col)", not "countDistinct(col)". We can, however, make "mean" an alias for "avg" to improve compatibility between DataFrame and SQL.
  Author: tedyu <yuzhihong@gmail.com>. Closes #9332 from ted-yu/master.
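  A small sketch of the intended equivalence; it assumes a registered table `t` with columns `key` and `value`, and that the alias is reachable from SQL as the commit message suggests:
  ```scala
  // After this change, "mean" can be used wherever "avg" is accepted.
  val byAvg  = sqlContext.sql("SELECT key, avg(value) FROM t GROUP BY key")
  val byMean = sqlContext.sql("SELECT key, mean(value) FROM t GROUP BY key")
  // Both queries should produce the same result.
  ```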
* [SPARK-11311][SQL] spark cannot describe temporary functions (Daoyuan Wang, 2015-11-02, 2 files, -1/+15)
  When describing a temporary function, Spark would return 'Unable to find function', which is not right.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>. Closes #9277 from adrian-wang/functionreg.
* [SPARK-10786][SQL] Take the whole statement to generate the CommandProcessor (huangzhaowei, 2015-11-02, 1 file, -1/+1)
  In the current implementation of `SparkSQLCLIDriver.scala`:
  `val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)), hconf)`
  `CommandProcessorFactory` only takes the first token of the statement, which makes it hard to distinguish the statements `delete jar xxx` and `delete from xxx`. So maybe it's better to pass the whole statement into the `CommandProcessorFactory`. [HiveCommand](https://github.com/SaintBacchus/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/processors/HiveCommand.java#L76) already handles these two statements specially:
  ```java
  if (command.length > 1 && "from".equalsIgnoreCase(command[1])) {
    //special handling for SQL "delete from <table> where..."
    return null;
  }
  ```
  Author: huangzhaowei <carlmartinmax@gmail.com>. Closes #8895 from SaintBacchus/SPARK-10786.
* [SPARK-9298][SQL] Add pearson correlation aggregation function (Liang-Chi Hsieh, 2015-11-01, 7 files, -2/+311)
  JIRA: https://issues.apache.org/jira/browse/SPARK-9298
  This patch adds a Pearson correlation aggregation function based on `AggregateExpression2`.
  Author: Liang-Chi Hsieh <viirya@appier.com>. Closes #8587 from viirya/corr_aggregation.
* [SPARK-11410][SQL] Add APIs to provide functionality similar to Hive's DISTRIBUTE BY and SORT BY. (Nong Li, 2015-11-01, 4 files, -19/+186)
  DISTRIBUTE BY allows the user to hash partition the data by specified exprs. It also allows optional sorting within each resulting partition. There is no required relationship between the exprs for partitioning and sorting (i.e. one does not need to be a prefix of the other). This patch adds two APIs to DataFrames which can be used together to provide this functionality:
  1. distributeBy(), which partitions the data frame into a specified number of partitions using the partitioning exprs.
  2. localSort(), which sorts each partition using the provided sorting exprs.
  To get the DISTRIBUTE BY functionality, the user simply does: df.distributeBy(...).localSort(...)
  Author: Nong Li <nongli@gmail.com>. Closes #9364 from nongli/spark-11410.
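  As a usage sketch: the functionality described above eventually surfaced in the DataFrame API under the names `repartition(numPartitions, exprs)` and `sortWithinPartitions` rather than `distributeBy`/`localSort`; the column names, partition count, and the DataFrame `df` below are illustrative assumptions.
  ```scala
  // Assumes a DataFrame df with columns customerId and ts, and imported implicits for $"".
  // DISTRIBUTE BY customerId into 200 partitions, then SORT BY ts inside each partition.
  val distributedAndSorted = df
    .repartition(200, $"customerId")
    .sortWithinPartitions($"ts")
  ```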
* [SPARK-11117] [SPARK-11345] [SQL] Makes all HadoopFsRelation data sources produce UnsafeRow (Cheng Lian, 2015-10-31, 12 files, -59/+156)
  This PR fixes two issues:
  1. `PhysicalRDD.outputsUnsafeRows` is always `false`, thus a `ConvertToUnsafe` operator is often required even if the underlying data source relation does output `UnsafeRow`.
  2. Internal/external row conversion for `HadoopFsRelation` is kinda messy. Currently we're using `HadoopFsRelation.needConversion` and [dirty type erasure hacks][1] to indicate whether the relation outputs external or internal rows and apply external-to-internal conversion when necessary. Basically, all builtin `HadoopFsRelation` data sources, i.e. Parquet, JSON, ORC, and Text, output `InternalRow`, while typical external `HadoopFsRelation` data sources, e.g. spark-avro and spark-csv, output `Row`.
  This PR adds a `private[sql]` interface method `HadoopFsRelation.buildInternalScan`, which by default invokes `HadoopFsRelation.buildScan` and converts `Row`s to `UnsafeRow`s (which are also `InternalRow`s). All builtin `HadoopFsRelation` data sources override this method and directly output `UnsafeRow`s. In this way, `HadoopFsRelation` now always produces `UnsafeRow`s, and `PhysicalRDD.outputsUnsafeRows` can be properly set by checking whether the underlying data source is a `HadoopFsRelation`.
  A remaining question is: can we assume that all non-builtin `HadoopFsRelation` data sources output external rows? At least all well known ones do so. However, it's possible that some users implemented their own `HadoopFsRelation` data sources that leverage `InternalRow` and thus all those unstable internal data representations. If this assumption is safe, we can deprecate `HadoopFsRelation.needConversion` and clean up some more conversion code (like [here][2] and [here][3]).
  This PR supersedes #9125. Follow-ups:
  1. Make JSON and ORC data sources output `UnsafeRow` directly.
  2. Make `HiveTableScan` output `UnsafeRow` directly. This is related to 1 since the ORC data source shares the same `Writable` unwrapping code with `HiveTableScan`.
  [1]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L353
  [2]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L331-L335
  [3]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L630-L669
  Author: Cheng Lian <lian@databricks.com>. Closes #9305 from liancheng/spark-11345.unsafe-hadoop-fs-relation.
* [SPARK-11024][SQL] Optimize NULL in <inlist-expressions> by folding it to Literal(null) (Dilip Biswal, 2015-10-31, 2 files, -1/+55)
  Add a rule in the optimizer to convert NULL [NOT] IN (expr1,...,expr2) to Literal(null). This is a follow-up to SPARK-8654. cloud-fan Can you please take a look?
  Author: Dilip Biswal <dbiswal@us.ibm.com>. Closes #9348 from dilipbiswal/spark_11024.
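  An illustration of the expressions this rule targets (the table name is hypothetical); a predicate of the form NULL IN (...) always evaluates to NULL, so the optimizer can fold the whole expression to `Literal(null)` instead of evaluating the IN list:
  ```scala
  // Sketch: assumes a temporary table `t` is registered with sqlContext.
  val folded = sqlContext.sql("SELECT * FROM t WHERE null IN (1, 2, 3)")
  folded.explain(true) // the optimized plan should no longer evaluate the IN list
  ```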
* [SPARK-11226][SQL] Empty line in json file should be skipped (Jeff Zhang, 2015-10-31, 3 files, -24/+36)
  Currently an empty line in a json file is parsed into a Row with all null field values. But in json, "{}" represents an empty json object; an empty line is supposed to be skipped. Make a trivial change for this.
  Author: Jeff Zhang <zjffdu@apache.org>. Closes #9211 from zjffdu/SPARK-11226.
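  A sketch of the behavior change (the file path and contents are made up): given a JSON-lines file with a blank line between two records, the blank line previously produced an extra all-null Row and is now skipped.
  ```scala
  // /tmp/people.json contains:
  //   {"name": "alice"}
  //
  //   {"name": "bob"}
  val people = sqlContext.read.json("/tmp/people.json")
  people.count() // 2 after this change; previously an all-null row was included
  ```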
* [SPARK-11434][SPARK-11103][SQL] Fix test ": Filter applied on merged Parquet schema with new column fails" (Yin Huai, 2015-10-30, 1 file, -3/+3)
  https://issues.apache.org/jira/browse/SPARK-11434
  Author: Yin Huai <yhuai@databricks.com>. Closes #9387 from yhuai/SPARK-11434.
* [SPARK-11423] remove MapPartitionsWithPreparationRDD (Davies Liu, 2015-10-30, 4 files, -124/+70)
  Since we do not need to preserve a page before calling compute(), MapPartitionsWithPreparationRDD is not needed anymore. This PR basically reverts #8543, #8511, #8038, #8011.
  Author: Davies Liu <davies@databricks.com>. Closes #9381 from davies/remove_prepare2.
* [SPARK-11393] [SQL] CoGroupedIterator should respect the fact that GroupedIterator.hasNext is not idempotent (Wenchen Fan, 2015-10-30, 2 files, -6/+32)
  When we cogroup 2 `GroupedIterator`s in `CoGroupedIterator`, if the right side is smaller, we will consume the right data and keep the left data unchanged. Then we call `hasNext`, which will call `left.hasNext`. This makes `GroupedIterator` generate an extra group, as the previous one has not been consumed yet.
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9346 from cloud-fan/cogroup and squashes the following commits: 9be67c8 [Wenchen Fan] SPARK-11393
* [SPARK-11103][SQL] Filter applied on merged Parquet schema with new column fails (hyukjinkwon, 2015-10-30, 2 files, -1/+25)
  When enabling mergedSchema and a predicate filter, this fails since Parquet does not accept filters pushed down when the columns of the filters do not exist in the schema. This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389). For now, this PR simply disables predicate push down when using a merged schema.
  Author: hyukjinkwon <gurwls223@gmail.com>. Closes #9327 from HyukjinKwon/SPARK-11103.
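  A sketch of the failing scenario (paths and column names are illustrative): two Parquet files whose schemas differ by one column are read with schema merging enabled and filtered on the newer column; with this change the predicate is evaluated by Spark rather than being pushed down to Parquet.
  ```scala
  import sqlContext.implicits._  // assumes a SQLContext named sqlContext

  // Two files whose schemas differ by the "score" column.
  Seq((1, "a")).toDF("id", "name").write.parquet("/tmp/t/part=1")
  Seq((2, "b", 3.0)).toDF("id", "name", "score").write.parquet("/tmp/t/part=2")

  // Filter on a column that exists in only one of the merged files.
  val merged = sqlContext.read.option("mergeSchema", "true").parquet("/tmp/t")
  merged.filter("score > 1.0").show()
  ```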
* [SPARK-11417] [SQL] no @Override in codegen (Davies Liu, 2015-10-30, 3 files, -9/+0)
  Older versions of Janino (>2.7) do not support Override; we should not use that in codegen.
  Author: Davies Liu <davies@databricks.com>. Closes #9372 from davies/no_override.
* [SPARK-10342] [SPARK-10309] [SPARK-10474] [SPARK-10929] [SQL] Cooperative memory management (Davies Liu, 2015-10-29, 6 files, -89/+21)
  This PR introduces a mechanism to call spill() on those SQL operators that support spilling (for example, BytesToBytesMap, UnsafeExternalSorter and ShuffleExternalSorter) if there is not enough memory for execution. The preserved first page is not needed anymore, so it was removed. Other Spillable objects in Spark core (ExternalSorter and AppendOnlyMap) are not included in this PR, but those could benefit from this (trigger others' spilling). The PrepareRDD may not be needed anymore and could be removed in a follow up PR.
  The following script will fail with OOM before this PR, and finishes in 150 seconds with a 2G heap (it also works in the 1.5 branch, with similar duration).
  ```python
  sqlContext.setConf("spark.sql.shuffle.partitions", "1")
  df = sqlContext.range(1<<25).selectExpr("id", "repeat(id, 2) as s")
  df2 = df.select(df.id.alias('id2'), df.s.alias('s2'))
  j = df.join(df2, df.id==df2.id2).groupBy(df.id).max("id", "id2")
  j.explain()
  print j.count()
  ```
  For thread-safety, here is what I've got:
  1) Without calling spill(), the operators should only be used by a single thread, so no safety problems.
  2) spill() could be triggered in two cases: by itself, or by other operators. We can check trigger == this in spill(), so it's still in the same thread, so no safety problems.
  3) If it's triggered by other operators (right now cache will not trigger spill()), we only spill the data to disk when it's in the scanning stage (building is finished), so the in-memory sorter or memory pages are read-only; we only need to synchronize the iterator and change it.
  4) During scanning, the iterator will only use one record in one page, and we can't free this page because the downstream is currently using it (used by UnsafeRow or other objects). In BytesToBytesMap, we just skip the current page and dump all others into disk. In UnsafeExternalSorter, we keep the page that is used by the current record (having the same baseObject), and free it when loading the next record. In ShuffleExternalSorter, spill() will not trigger during scanning.
  5) In order to avoid deadlock, we don't call acquireMemory during spill (so we reuse the pointer array in InMemorySorter).
  Author: Davies Liu <davies@databricks.com>. Closes #9241 from davies/force_spill.
* [SPARK-11301] [SQL] fix case sensitivity for filter on partitioned columns (Wenchen Fan, 2015-10-29, 2 files, -7/+15)
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9271 from cloud-fan/filter.
* [SPARK-10641][SQL] Add Skewness and Kurtosis Support (sethah, 2015-10-29, 12 files, -11/+823)
  Implementing skewness and kurtosis support based on the following algorithm: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
  Author: sethah <seth.hendrickson16@gmail.com>. Closes #9003 from sethah/SPARK-10641.
* [SPARK-11188][SQL] Elide stacktraces in bin/spark-sql for AnalysisExceptions (Dilip Biswal, 2015-10-29, 3 files, -6/+27)
  Only print the error message to the console for AnalysisExceptions in the sql-shell.
  Author: Dilip Biswal <dbiswal@us.ibm.com>. Closes #9194 from dilipbiswal/spark-11188.
* [SPARK-11246] [SQL] Table cache for Parquet broken in 1.5 (xin Wu, 2015-10-29, 2 files, -0/+16)
  The root cause is that when spark.sql.hive.convertMetastoreParquet=true (the default), the cached InMemoryRelation of the ParquetRelation cannot be looked up from the cachedData of CacheManager, because the key comparison fails even though it is the same LogicalPlan representing the Subquery that wraps the ParquetRelation. The solution in this PR is to override the LogicalPlan.sameResult function in the Subquery case class to eliminate the subquery node first, before directly comparing the child (ParquetRelation), which will find the key to the cached InMemoryRelation.
  Author: xin Wu <xinwu@us.ibm.com>. Closes #9326 from xwu0226/spark-11246-commit.
* [SPARK-11370] [SQL] fix a bug in GroupedIterator and create unit test for it (Wenchen Fan, 2015-10-29, 2 files, -37/+144)
  Before this PR, a user had to consume the iterator of one group before processing the next group, or we would get into infinite loops.
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9330 from cloud-fan/group.
* [SPARK-11379][SQL] ExpressionEncoder can't handle top level primitive type correctly (Wenchen Fan, 2015-10-29, 2 files, -1/+2)
  For an inner primitive type (e.g. inside `Product`), we use `schemaFor` to get the catalyst type for it: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L403. However, for a top level primitive type, we use `dataTypeFor`, which is wrong.
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9337 from cloud-fan/encoder.
* [SPARK-11351] [SQL] support hive interval literal (Wenchen Fan, 2015-10-28, 2 files, -20/+103)
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9304 from cloud-fan/interval.
* [SPARK-11376][SQL] Removes duplicated `mutableRow` field (Cheng Lian, 2015-10-29, 1 file, -2/+0)
  This PR fixes a mistake in the code generated by `GenerateColumnAccessor`. Interestingly, although the code is illegal in Java (the class has two fields with the same name), Janino accepts it happily and it accidentally works properly.
  Author: Cheng Lian <lian@databricks.com>. Closes #9335 from liancheng/spark-11376.fix-generated-code.
* [SPARK-11363] [SQL] LeftSemiJoin should be LeftSemi in SparkStrategies (Liang-Chi Hsieh, 2015-10-28, 1 file, -3/+3)
  JIRA: https://issues.apache.org/jira/browse/SPARK-11363
  Some places in SparkStrategies use LeftSemiJoin; it should be LeftSemi. cc chenghao-intel liancheng
  Author: Liang-Chi Hsieh <viirya@appier.com>. Closes #9318 from viirya/no-left-semi-join.
* [SPARK-11377] [SQL] withNewChildren should not convert StructType to Seq (Michael Armbrust, 2015-10-28, 1 file, -1/+3)
  This is minor, but I ran into it while writing Datasets, and while it wasn't needed for the final solution, it was super confusing, so we should fix it. Basically we recurse into `Seq` to see if they have children. This breaks because we don't preserve the original subclass of `Seq` (and `StructType <:< Seq[StructField]`). Since a struct can never contain children, let's just not recurse into it.
  Author: Michael Armbrust <michael@databricks.com>. Closes #9334 from marmbrus/structMakeCopy.
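  The subtyping that causes the problem can be seen directly (a sketch, not code from the patch): `StructType` extends `Seq[StructField]`, so the generic `Seq` handling in `withNewChildren` matches it, but rebuilding through the generic `Seq` machinery loses the `StructType` subclass.
  ```scala
  import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

  val schema: StructType = StructType(StructField("a", IntegerType) :: Nil)

  // StructType is a Seq[StructField], so generic Seq handling applies to it...
  val asSeq: Seq[StructField] = schema

  // ...but mapping through the generic Seq builder yields a plain Seq,
  // not a StructType, which is what made the copied plans confusing.
  val rebuilt: Seq[StructField] = asSeq.map(identity)
  ```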
* [SPARK-11313][SQL] implement cogroup on DataSets (support 2 datasets) (Wenchen Fan, 2015-10-28, 8 files, -0/+257)
  A simpler version of https://github.com/apache/spark/pull/9279, only supporting 2 datasets.
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9324 from cloud-fan/cogroup2.
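  A usage sketch against the Dataset API as it exists in later Spark releases (`groupByKey`/`cogroup` on `KeyValueGroupedDataset`; the method names in the 1.6-era API added here differed slightly), with made-up data:
  ```scala
  import spark.implicits._  // assumes a SparkSession named spark

  val orders   = Seq(("alice", 10), ("bob", 5)).toDS()
  val payments = Seq(("alice", 7), ("alice", 3)).toDS()

  // For each key, receive the iterators from both sides and emit any number of results.
  val summary = orders.groupByKey(_._1)
    .cogroup(payments.groupByKey(_._1)) { (key, os, ps) =>
      Iterator((key, os.map(_._2).sum, ps.map(_._2).sum))
    }
  ```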
* [SPARK-10484] [SQL] Optimize the cartesian join with broadcast join for some cases (Cheng Hao, 2015-10-27, 10 files, -16/+261)
  In some cases, we can broadcast the smaller relation in a cartesian join, which improves performance significantly.
  Author: Cheng Hao <hao.cheng@intel.com>. Closes #8652 from chenghao-intel/cartesian.
* [SPARK-11347] [SQL] Support for joinWith in Datasets (Michael Armbrust, 2015-10-27, 18 files, -615/+563)
  This PR adds a new operation `joinWith` to a `Dataset`, which returns a `Tuple` for each pair where a given `condition` evaluates to true.
  ```scala
  case class ClassData(a: String, b: Int)

  val ds1 = Seq(ClassData("a", 1), ClassData("b", 2)).toDS()
  val ds2 = Seq(("a", 1), ("b", 2)).toDS()

  > ds1.joinWith(ds2, $"_1" === $"a").collect()
  res0: Array((ClassData("a", 1), ("a", 1)), (ClassData("b", 2), ("b", 2)))
  ```
  This operation is similar to the relational `join` function, with one important difference in the result schema. Since `joinWith` preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names `_1` and `_2`. This type of join can be useful both for preserving type-safety with the original object types as well as for working with relational data where either side of the join has column names in common.
  ## Required Changes to Encoders
  In the process of working on this patch, several deficiencies in the way that we were handling encoders were discovered. Specifically, it turned out to be very difficult to `rebind` the non-expression based encoders to extract the nested objects from the results of joins (and also typed selects that return tuples). As a result the following changes were made.
  - `ClassEncoder` has been renamed to `ExpressionEncoder` and has been improved to also handle primitive types. Additionally, it is now possible to take arbitrary expression encoders and rewrite them into a single encoder that returns a tuple.
  - All internal operations on `Dataset`s now require an `ExpressionEncoder`. If the user tries to pass a non-`ExpressionEncoder` in, an error will be thrown. We can relax this requirement in the future by constructing a wrapper class that uses expressions to project the row to the expected schema, shielding the user's code from the required remapping. This will give us a nice balance where we don't force user encoders to understand attribute references and binding, but still allow our native encoder to leverage runtime code generation to construct specific encoders for a given schema that avoid an extra remapping step.
  - Additionally, the semantics for different types of objects are now better defined. As stated in the `ExpressionEncoder` scaladoc:
    - Classes will have their sub fields extracted by name using `UnresolvedAttribute` expressions and `UnresolvedExtractValue` expressions.
    - Tuples will have their subfields extracted by position using `BoundReference` expressions.
    - Primitives will have their values extracted from the first ordinal with a schema that defaults to the name `value`.
  - Finally, the binding lifecycle for `Encoders` has now been unified across the codebase. Encoders are now `resolved` to the appropriate schema in the constructor of `Dataset`. This process replaces unresolved expressions with concrete `AttributeReference` expressions. Binding then happens on demand, when an encoder is going to be used to construct an object. This closely mirrors the lifecycle for standard expressions when executing normal SQL or `DataFrame` queries.
  Author: Michael Armbrust <michael@databricks.com>. Closes #9300 from marmbrus/datasets-tuples.
* [SPARK-11303][SQL] filter should not be pushed down into sample (Yanbo Liang, 2015-10-27, 2 files, -4/+10)
  When sampling and then filtering a DataFrame, the SQL Optimizer will push down the filter into the sample and produce a wrong result. This is because the sample is computed based on the original scope rather than the scope after filtering.
  Author: Yanbo Liang <ybliang8@gmail.com>. Closes #9294 from yanboliang/spark-11303.
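  An illustration of why the pushdown is unsound (a sketch; the fraction and predicate are arbitrary): with a fixed seed, the sampler decides per input row whether to keep it, so changing the row stream it sees changes which rows come out.
  ```scala
  val df = sqlContext.range(0, 1000)

  // Intended plan: sample roughly 10% of all rows, then keep the even ids.
  val sampledThenFiltered = df.sample(false, 0.1, seed = 42).filter("id % 2 = 0")

  // If the filter were pushed below the sample, the sampler would run over the
  // already-filtered rows and, for the same seed, select a different set of rows.
  ```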
* [SPARK-11277][SQL] sort_array throws exception scala.MatchError (Jia Li, 2015-10-27, 2 files, -1/+12)
  I'm new to spark. I was trying out the sort_array function and hit this exception. I looked into the spark source code and found the root cause is that sort_array does not check for an array of NULLs. It's not meaningful to sort an array of entirely NULLs anyway. I'm adding a check on the input array type to SortArray: if the array consists entirely of NULLs, there is no need to sort such an array. I have also added a test case for this. Please help to review my fix. Thanks!
  Author: Jia Li <jiali@us.ibm.com>. Closes #9247 from jliwork/SPARK-11277.
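  The kind of query that used to trigger the MatchError (a sketch of the repro, not taken verbatim from the PR):
  ```scala
  // Before the fix this threw scala.MatchError when sorting an array whose
  // element type is NullType; now the all-NULL array is returned unchanged.
  sqlContext.sql("SELECT sort_array(array(null, null))").show()
  ```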
* [SPARK-10562] [SQL] support mixed case partitionBy column names for tables stored in metastore (Wenchen Fan, 2015-10-26, 3 files, -27/+54)
  https://issues.apache.org/jira/browse/SPARK-10562
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9226 from cloud-fan/par.
* [SPARK-10947] [SQL] With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings (Stephen De Gennaro, 2015-10-26, 4 files, -11/+171)
  Currently, when a schema is inferred from a JSON file using sqlContext.read.json, the primitive object types are inferred as string, long, boolean, etc. However, if the inferred type is too specific (JSON obviously does not enforce types itself), this can cause issues with merging dataframe schemas. This pull request adds the option "primitivesAsString" to the JSON DataFrameReader which when true (defaults to false if not set) will infer all primitives as strings. Below is an example usage of this new functionality.
  ```
  val jsonDf = sqlContext.read.option("primitivesAsString", "true").json(sampleJsonFile)

  scala> jsonDf.printSchema()
  root
  |-- bigInteger: string (nullable = true)
  |-- boolean: string (nullable = true)
  |-- double: string (nullable = true)
  |-- integer: string (nullable = true)
  |-- long: string (nullable = true)
  |-- null: string (nullable = true)
  |-- string: string (nullable = true)
  ```
  Author: Stephen De Gennaro <stepheng@realitymine.com>. Closes #9249 from stephend-realitymine/stephend-primitives.
* [SPARK-11325] [SQL] Alias 'alias' in Scala's DataFrame API (Nong Li, 2015-10-26, 2 files, -0/+21)
  Author: Nong Li <nongli@gmail.com>. Closes #9286 from nongli/spark-11325.
* [SQL][DOC] Minor document fixes in interfaces.scala (Alexander Slesarenko, 2015-10-26, 1 file, -7/+7)
  rxin just noticed this while reading the code.
  Author: Alexander Slesarenko <avslesarenko@gmail.com>. Closes #9284 from aslesarenko/doc-typos.
* [SPARK-11258] Converting a Spark DataFrame into an R data.frame is slow / requires a lot of memory (Frank Rosner, 2015-10-26, 2 files, -7/+47)
  https://issues.apache.org/jira/browse/SPARK-11258
  I was not able to locate an existing unit test for this function, so I wrote one.
  Author: Frank Rosner <frank@fam-rosner.de>. Closes #9222 from FRosner/master.