Rename ClientInterface -> HiveClient, and ClientWrapper -> HiveClientImpl.
I have some follow-up pull requests to introduce a new internal catalog, and I think this new naming better reflects the functionality of the two classes.
Author: Reynold Xin <rxin@databricks.com>
Closes #10981 from rxin/SPARK-13076.

This is an existing issue uncovered recently by #10835. The exception occurred because the `SQLHistoryListener` receives all sorts of accumulators, not just the ones that represent SQL metrics. For example, the listener receives `internal.metrics.shuffleRead.remoteBlocksFetched`, which is an Int, and then proceeds to cast the Int to a Long, which fails.
The fix is to mark accumulators representing SQL metrics with some internal metadata. Then we can identify which ones are SQL metrics and process only those in the `SQLHistoryListener`.
Author: Andrew Or <andrew@databricks.com>
Closes #10971 from andrewor14/fix-sql-history.

Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and instead transform a logical intersect into a left semi join with distinct. This way, we can take advantage of all the benefits of our join implementations (e.g. managed memory, code generation, broadcast joins).
After a search, I found that one of the mainstream RDBMSs does the same: in its query explain output, Intersect is replaced by a left semi join. The left semi join can also help outer-join elimination in the Optimizer, as shown in the PR: https://github.com/apache/spark/pull/10566
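A hedged sketch of the rewrite at the DataFrame level (not the actual optimizer rule; the data and column names are made up). An intersection is equivalent to a left semi join on every column with null-safe equality, followed by distinct:
```scala
val left = sqlContext.range(10).selectExpr("id % 3 AS a")   // a in {0, 1, 2}
val right = sqlContext.range(10).selectExpr("id % 2 AS a")  // a in {0, 1}

val viaIntersect = left.intersect(right)

// Equivalent formulation: null-safe equality (<=>) matters because NULLs
// compare equal in INTERSECT; distinct removes duplicates from the left side.
val viaSemiJoin = left
  .join(right, left("a") <=> right("a"), "leftsemi")
  .distinct()
// Both return the rows a = 0 and a = 1.
```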
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes #10630 from gatorsmile/IntersectBySemiJoin.

Simplify the generated code for the hash expression (removing several unnecessary local variables) and avoid null checks where possible.
Generated code comparison for `hash(int, double, string, array<int>)`:
**before:**
```
public UnsafeRow apply(InternalRow i) {
  /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */
  int value1 = 42;
  /* input[0, int] */
  int value3 = i.getInt(0);
  if (!false) {
    value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1);
  }
  /* input[1, double] */
  double value5 = i.getDouble(1);
  if (!false) {
    value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1);
  }
  /* input[2, string] */
  boolean isNull6 = i.isNullAt(2);
  UTF8String value7 = isNull6 ? null : (i.getUTF8String(2));
  if (!isNull6) {
    value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1);
  }
  /* input[3, array<int>] */
  boolean isNull8 = i.isNullAt(3);
  ArrayData value9 = isNull8 ? null : (i.getArray(3));
  if (!isNull8) {
    int result10 = value1;
    for (int index11 = 0; index11 < value9.numElements(); index11++) {
      if (!value9.isNullAt(index11)) {
        final int element12 = value9.getInt(index11);
        result10 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element12, result10);
      }
    }
    value1 = result10;
  }
}
```
**after:**
```
public UnsafeRow apply(InternalRow i) {
  /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */
  int value1 = 42;
  /* input[0, int] */
  int value3 = i.getInt(0);
  value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1);
  /* input[1, double] */
  double value5 = i.getDouble(1);
  value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1);
  /* input[2, string] */
  boolean isNull6 = i.isNullAt(2);
  UTF8String value7 = isNull6 ? null : (i.getUTF8String(2));
  if (!isNull6) {
    value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1);
  }
  /* input[3, array<int>] */
  boolean isNull8 = i.isNullAt(3);
  ArrayData value9 = isNull8 ? null : (i.getArray(3));
  if (!isNull8) {
    for (int index10 = 0; index10 < value9.numElements(); index10++) {
      final int element11 = value9.getInt(index10);
      value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element11, value1);
    }
  }
  rowWriter14.write(0, value1);
  return result12;
}
```
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10974 from cloud-fan/codegen.

1. Enable whole-stage codegen during tests even when only one operator supports it.
2. Split doProduce() into two APIs: upstream() and doProduce().
3. Generate a prefix for the fresh names of each operator.
4. Pass UnsafeRow to the parent directly (avoiding getters and re-creating the UnsafeRow).
5. Fix bugs and tests.
This PR re-opens #10944 and fixes the bug.
Author: Davies Liu <davies@databricks.com>
Closes #10977 from davies/gen_refactor.

A simple workaround to avoid getting parameter types when converting a logical plan to JSON.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10970 from cloud-fan/reflection.

JIRA: https://issues.apache.org/jira/browse/SPARK-12968
Implement a command to set the current database.
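A minimal usage sketch (assuming a HiveContext-backed `sqlContext`; the database and table names are made up):
```scala
// Switch the current database; unqualified table names then resolve against it.
sqlContext.sql("CREATE DATABASE IF NOT EXISTS sales_db")
sqlContext.sql("USE sales_db")
sqlContext.sql("SELECT * FROM orders").show()  // resolves to sales_db.orders
```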
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #10916 from viirya/ddl-use-database.

This reverts commit cc18a7199240bf3b03410c1ba6704fe7ce6ae38e.

pushing down filters in Parquet
JIRA: https://issues.apache.org/jira/browse/SPARK-11955
Currently we simply skip pushing down filters in Parquet if schema merging is enabled.
However, we can actually mark particular fields in the merged schema so that filters can safely be pushed down in Parquet.
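For context, a hedged sketch of the scenario this PR targets (paths and column names are made up): schema merging is enabled, and a filter on a merged field could previously never be pushed down.
```scala
// Two Parquet parts with different but mergeable schemas.
sqlContext.range(10).selectExpr("id AS a").write.parquet("/tmp/t/part=1")
sqlContext.range(10).selectExpr("id AS a", "id * 2 AS b").write.parquet("/tmp/t/part=2")

// With mergeSchema on, pushdown used to be skipped wholesale; after this PR,
// fields marked safe in the merged schema can still have filters pushed down.
val df = sqlContext.read.option("mergeSchema", "true").parquet("/tmp/t")
df.filter("a > 5").show()
```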
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #9940 from viirya/safe-pushdown-parquet-filters.

I tried to add this via Jackson's `USE_BIG_DECIMAL_FOR_FLOATS` option with no success.
Added a test for non-complex types. Should I add a test for complex types?
Author: Brandon Bradley <bradleytastic@gmail.com>
Closes #10936 from blbradley/spark-12749.

1. Enable whole-stage codegen during tests even when only one operator supports it.
2. Split doProduce() into two APIs: upstream() and doProduce().
3. Generate a prefix for the fresh names of each operator.
4. Pass UnsafeRow to the parent directly (avoiding getters and re-creating the UnsafeRow).
5. Fix bugs and tests.
Author: Davies Liu <davies@databricks.com>
Closes #10944 from davies/gen_refactor.

configs are being set
Users unknowingly try to set core Spark configs in SQLContext but later realise that it didn't take effect, e.g. sqlContext.sql("SET spark.shuffle.memoryFraction=0.4"). This PR adds a warning message when such operations are attempted.
Author: Tejas Patil <tejasp@fb.com>
Closes #10849 from tejasapatil/SPARK-12926.

This PR is a follow-up of #10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`.
Author: Cheng Lian <lian@databricks.com>
Closes #10968 from liancheng/cms-specialized.

These two classes became identical as the implementation progressed.
Author: Nong Li <nong@databricks.com>
Closes #10952 from nongli/spark-13045.

commands to new Parser
This PR moves all the functionality provided by SparkSQLParser/ExtendedHiveQlParser to the new Parser hierarchy (SparkQl/HiveQl). It also improves SET command parsing: the current implementation swallows ```set role ...``` and ```set autocommit ...``` commands, whereas this PR respects them (and passes them on to Hive).
This PR and https://github.com/apache/spark/pull/10723 end the use of Parser-Combinator parsers for SQL parsing. As a result we can also remove the ```AbstractSQLParser``` in Catalyst.
The PR is marked WIP as long as it doesn't pass all tests.
cc rxin viirya winningsix (this touches https://github.com/apache/spark/pull/10144)
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #10905 from hvanhovell/SPARK-12866.

This PR integrates the Bloom filter from spark-sketch into DataFrame. This version resorts to RDD.aggregate for building the filter. A more performant UDAF version can be built in future follow-up PRs.
This PR also adds two specialized `put` versions (`putBinary` and `putLong`) to `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`.
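A short usage sketch of the resulting API (the column name and sizing parameters are made up; `bloomFilter` lives on `DataFrame.stat`):
```scala
import org.apache.spark.util.sketch.BloomFilter

val df = sqlContext.range(1000).selectExpr("id AS user_id")

// Size the filter by expected item count and tolerated false-positive rate.
val bf: BloomFilter = df.stat.bloomFilter("user_id", expectedNumItems = 1000L, fpp = 0.03)

bf.mightContain(42L)    // true: 42 is in the column
bf.mightContain(5000L)  // usually false; true only with probability ~fpp
```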
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10937 from cloud-fan/bloom-filter.

The high level idea is that instead of having the executors send both accumulator updates and TaskMetrics, we should have them send only accumulator updates. This eliminates the need to maintain both code paths since one can be implemented in terms of the other. This effort is split into two parts:
**SPARK-12895: Implement TaskMetrics using accumulators.** TaskMetrics is basically just a bunch of accumulable fields. This patch makes TaskMetrics a syntactic wrapper around a collection of accumulators so we don't need to send TaskMetrics from the executors to the driver.
**SPARK-12896: Send only accumulator updates to the driver.** Now that TaskMetrics are expressed in terms of accumulators, we can capture all TaskMetrics values if we just send accumulator updates from the executors to the driver. This completes the parent issue SPARK-10620.
While an effort has been made to preserve as much of the public API as possible, there were a few known breaking DeveloperApi changes that would be very awkward to maintain. I will gather the full list shortly and post it here.
Note: This was once part of #10717. This patch is split out into its own patch from there to make it easier for others to review. Other smaller pieces have already been merged into master.
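A toy sketch of the "syntactic wrapper" idea (simplified, hypothetical names, not Spark's actual classes): each metric field is a named accumulator, so the per-task payload is just the accumulators' (name, value) pairs.
```scala
// Simplified stand-in for an internal long accumulator.
class LongAccum(val name: String) {
  private var _value = 0L
  def add(v: Long): Unit = _value += v
  def value: Long = _value
}

// TaskMetrics as a thin wrapper: the fields ARE accumulators, so sending
// accumulator updates already carries every metric value to the driver.
class ToyTaskMetrics {
  val remoteBlocksFetched = new LongAccum("internal.metrics.shuffleRead.remoteBlocksFetched")
  val resultSize = new LongAccum("internal.metrics.resultSize")

  private def all = Seq(remoteBlocksFetched, resultSize)
  def accumUpdates: Seq[(String, Long)] = all.map(a => (a.name, a.value))
}

val tm = new ToyTaskMetrics
tm.remoteBlocksFetched.add(3)
tm.accumUpdates  // Seq((...remoteBlocksFetched, 3), (...resultSize, 0))
```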
Author: Andrew Or <andrew@databricks.com>
Closes #10835 from andrewor14/task-metrics-use-accums.

`None` triggers cryptic failure
The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null has been added to tests.py to show that the fix works.
Author: Jason Lee <cjlee@us.ibm.com>
Closes #8969 from jasoncl/SPARK-10847.

This PR is a follow-up of PR #10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical.
In this PR, a new SQL option `spark.sql.nativeView.canonical` is added. When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from view definition logical plans. If we fail to map the plan to SQL, we fall back to the original native view approach.
One important issue this PR fixes is that we can now use CTEs when defining a view. Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`. However, the HiveQL parser doesn't allow a CTE to appear in a subquery. Namely, something like this is disallowed:
```sql
SELECT n
FROM (
  WITH w AS (SELECT 1 AS n)
  SELECT * FROM w
) v
```
This PR fixes the issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during the analysis phase, so there won't be CTE expressions in the generated SQL query string).
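For illustration, a view definition that should now round-trip (the view name is made up; assumes a HiveContext with the two options above enabled):
```scala
// A CTE at the top level of the view definition no longer needs the extra
// wrapping SELECT, so this CREATE VIEW can be handled via generated SQL.
sqlContext.sql(
  """CREATE VIEW small_numbers AS
    |WITH w AS (SELECT 1 AS n)
    |SELECT n FROM w
  """.stripMargin)
sqlContext.sql("SELECT * FROM small_numbers").show()
```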
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes #10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.

This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.
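A brief usage sketch of the resulting API (the column name and sketch parameters are made up; `countMinSketch` lives on `DataFrame.stat`):
```scala
import org.apache.spark.util.sketch.CountMinSketch

val df = sqlContext.range(1000).selectExpr("id % 10 AS bucket")

// depth and width trade accuracy for space; seed makes the sketch deterministic.
val cms: CountMinSketch = df.stat.countMinSketch("bucket", depth = 10, width = 200, seed = 42)

cms.estimateCount(3L)  // ~100: each bucket value occurs 100 times
```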
Author: Cheng Lian <lian@databricks.com>
Closes #10911 from liancheng/cms-df-api.

This patch adds support for complex types in ColumnarBatch. ColumnarBatch supports structs and arrays; there is a simple mapping between the richer Catalyst types and these two. Strings are treated as an array of bytes.
ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist of just leaf nodes. Structs represent an internal node with one child for each field; arrays are internal nodes with one child. Structs just contain nullability; arrays contain offsets and lengths into the child array. This structure can handle arbitrary nesting. It has the key property that we maintain a columnar layout throughout, and that primitive types are only stored in the leaf nodes and contiguous across rows. For example, if the schema is
```
array<array<int>>
```
there are three columns in the schema. The internal nodes each have one child, and the leaf node contains all the int data stored consecutively.
As part of this, this patch adds append APIs in addition to the put APIs (e.g. putLong(rowId, v) vs. appendLong(v)). These APIs are necessary when the batch contains variable-length elements; the vectors are not fixed length and will grow as necessary. This should make usage a lot simpler for the writer.
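To make the layout concrete, a plain-Scala illustration (not ColumnarBatch's actual API) of how two rows of `array<array<int>>` decompose into three columns: an offsets/lengths pair per internal node plus one contiguous leaf buffer.
```scala
// Two rows: [[1, 2], [3]] and [[4, 5, 6]].
// Column 0 (outer array node): offsets/lengths into column 1's entries.
val outerOffsets = Array(0, 2)  // row 0 starts at inner-array 0; row 1 at inner-array 2
val outerLengths = Array(2, 1)  // row 0 holds 2 inner arrays; row 1 holds 1
// Column 1 (inner array node): offsets/lengths into the leaf int data.
val innerOffsets = Array(0, 2, 3)
val innerLengths = Array(2, 1, 3)
// Column 2 (leaf): all ints stored contiguously across rows.
val leafInts = Array(1, 2, 3, 4, 5, 6)

// Reassemble row 1 to check the layout: List(List(4, 5, 6)).
val row1 = (0 until outerLengths(1)).map { i =>
  val inner = outerOffsets(1) + i
  leafInts.slice(innerOffsets(inner), innerOffsets(inner) + innerLengths(inner)).toList
}.toList
```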
Author: Nong Li <nong@databricks.com>
Closes #10820 from nongli/spark-12854.

Otherwise the `^` character is always marked as an error in IntelliJ, since it represents an unclosed superscript markup tag.
Author: Cheng Lian <lian@databricks.com>
Closes #10926 from liancheng/agg-doc-fix.

metadata format
This PR adds a new table option (`skip_hive_metadata`) that allows the user to skip storing the table metadata in the Hive metadata format. While this could be useful in general, the specific use case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024), which in turn prevents such tables from being queried in Spark SQL.
Author: Sameer Agarwal <sameer@databricks.com>
Closes #10826 from sameeragarwal/skip-hive-metadata.

inconsistent with Scala's Iterator->Iterator
Fix the Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not an Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.
CC rxin pwendell for the API change; tdas since it also touches streaming.
Author: Sean Owen <sowen@cloudera.com>
Closes #10413 from srowen/SPARK-3369.

This pull request simply fixes a few minor coding style issues in csv, as I was reviewing the change post-hoc.
Author: Reynold Xin <rxin@databricks.com>
Closes #10919 from rxin/csv-minor.

As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should document it better and make it easier to use.
This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, do not always point the row to the buffer at the end; we only need to update the size of the row. If all fields are of primitive type, we can even skip the row-size update. Then we can apply this technique to more places easily.
A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:
**old version**
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
unsafe projection:           Avg Time(ms)  Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                       2616.04         102.61         1.00 X
single nullable long              3032.54          88.52         0.86 X
primitive types                   9121.05          29.43         0.29 X
nullable primitive types         12410.60          21.63         0.21 X
```
**new version**
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
unsafe projection:           Avg Time(ms)  Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                       1533.34         175.07         1.00 X
single nullable long              2306.73         116.37         0.66 X
primitive types                   8403.93          31.94         0.18 X
nullable primitive types         12448.39          21.56         0.12 X
```
For a single non-nullable long (the best case), we get about a 1.7x speed-up. Even when it's nullable, we still get a 1.3x speed-up. For the other cases the boost is smaller, as the saved operations account for only a small proportion of the whole process. The benchmark code is included in this PR.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10809 from cloud-fan/unsafe-projection.

Partitioning Columns
When users use `partitionBy` and `bucketBy` at the same time, some bucketing columns might be part of the partitioning columns. For example:
```
df.write
  .format(source)
  .partitionBy("i")
  .bucketBy(8, "i", "k")
  .saveAsTable("bucketed_table")
```
However, in the above case, adding column `i` to `bucketBy` is useless; it just wastes extra CPU when reading or writing bucketed tables. Thus, like Hive, we can issue an exception and let users make the change.
Also added a test case checking that the information about the `sortBy` and `bucketBy` columns is correctly saved in the metastore table.
Could you check if my understanding is correct? cloud-fan rxin marmbrus Thanks!
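For reference, a corrected sketch of the write above, with the partitioning column dropped from `bucketBy` (same hypothetical `source` and table name):
```scala
// `i` already partitions the data on disk; bucketing only needs the rest.
df.write
  .format(source)
  .partitionBy("i")
  .bucketBy(8, "k")
  .saveAsTable("bucketed_table")
```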
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10891 from gatorsmile/commonKeysInPartitionByBucketBy.

This PR brings back visualization for generated operators; they look like:
![sql](https://cloud.githubusercontent.com/assets/40902/12460920/0dc7956a-bf6b-11e5-9c3f-8389f452526e.png)
![stage](https://cloud.githubusercontent.com/assets/40902/12460923/11806ac4-bf6b-11e5-9c72-e84a62c5ea93.png)
Note: SQL metrics are not supported right now because they are very slow; they will be supported once we have batch mode.
Author: Davies Liu <davies@databricks.com>
Closes #10828 from davies/viz_codegen.

Author: Andy Grove <andygrove73@gmail.com>
Closes #10865 from andygrove/SPARK-12932.

class and same format).
https://issues.apache.org/jira/browse/SPARK-12901
This PR refactors the options in the JSON and CSV datasources.
In more detail:
1. `JSONOptions` uses the same format as `CSVOptions`.
2. They are no longer case classes.
3. `CSVRelation` no longer needs to be serializable (it was `with Serializable`, but I removed that).
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #10895 from HyukjinKwon/SPARK-12901.

Python rows
When the actual row length doesn't conform to the specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayIndexOutOfBoundsException`.
Author: Cheng Lian <lian@databricks.com>
Closes #10886 from liancheng/spark-12624.

ErrorPositionSuite and one of the HiveComparisonTest tests have been consistently failing on the Hadoop 2.3 SBT build (but on no other builds). I believe that this is due to test isolation issues (e.g. tests sharing state via the sets of temporary tables that are registered to TestHive).
This patch attempts to improve the isolation of these tests in order to address this issue.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10884 from JoshRosen/fix-failing-hadoop-2.3-hive-tests.

comparisons
This pull request implements strength reduction for comparisons between integral expressions and decimal literals, which are more common now that we parse fractional literals as decimal types (rather than doubles). I added the rules to the existing DecimalPrecision rule, with some refactoring to simplify the control flow. I also moved the DecimalPrecision rule into its own file due to its growing size.
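To make "strength reduction" concrete, a hedged example of the kind of rewrite involved (the query is made up; the exact rule set lives in DecimalPrecision):
```scala
// A predicate comparing an int column with a decimal literal, such as
//   WHERE int_col > 1.5
// can be strength-reduced to a pure integral comparison,
//   WHERE int_col >= 2
// avoiding a per-row cast of int_col to decimal.
val df = sqlContext.range(5).selectExpr("CAST(id AS INT) AS int_col")
df.filter("int_col > 1.5").show()  // same rows as df.filter("int_col >= 2")
```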
Author: Reynold Xin <rxin@databricks.com>
Closes #10882 from rxin/SPARK-12904-1.

JSON datasource
https://issues.apache.org/jira/browse/SPARK-12872
This PR lets the JSON datasource compress its output via an option instead of manually setting Hadoop configurations.
For resolving codecs by name, it is similar to https://github.com/apache/spark/pull/10805.
As `CSVCompressionCodecs` can be shared with other datasources, it became a separate class, shared as `CompressionCodecs`.
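A usage sketch, assuming the option is named `compression` and accepts short codec names (the output path is made up):
```scala
// Write gzip-compressed JSON without touching the Hadoop configuration directly.
sqlContext.range(100)
  .write
  .option("compression", "gzip")
  .json("/tmp/out/json-gzip")
```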
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #10858 from HyukjinKwon/SPARK-12872.

When users turn off bucketing in SQLConf, we should issue a message telling users that these operations will be converted to a normal (non-bucketed) write.
Also added a test case for this scenario and fixed the helper function.
Do you think this PR is helpful when using bucketed tables? cloud-fan Thank you!
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10870 from gatorsmile/bucketTableWritingTestcases.

https://issues.apache.org/jira/browse/SPARK-12747
The Postgres JDBC driver uses "FLOAT4" or "FLOAT8", not "real".
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #10695 from viirya/fix-postgres-jdbc.

partitioning to optimize the memory overhead
The current hash-based writer for dynamic partitioning performs poorly on big data, causing many small files and high GC pressure. With this patch we do an external sort first, so that only one writer needs to be open at a time.
before this patch:
![gc](https://cloud.githubusercontent.com/assets/7018048/9149788/edc48c6e-3dec-11e5-828c-9995b56e4d65.PNG)
after this patch:
![gc-optimize-externalsort](https://cloud.githubusercontent.com/assets/7018048/9149794/60f80c9c-3ded-11e5-8a56-7ae18ddc7a2f.png)
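A toy sketch of the idea (plain Scala over an in-memory collection, not the actual writer): sort rows by partition key first, then stream through them so only one partition writer is ever open.
```scala
case class Row(partKey: String, value: Int)
val rows = Seq(Row("b", 1), Row("a", 2), Row("b", 3), Row("a", 4))

// The hash-based approach keeps one open writer PER partition key; after
// sorting by key, a single pass needs only one open writer at a time.
var currentKey: Option[String] = None
rows.sortBy(_.partKey).foreach { row =>
  if (currentKey != Some(row.partKey)) {
    currentKey.foreach(k => println(s"close writer for partition $k"))
    println(s"open writer for partition ${row.partKey}")
    currentKey = Some(row.partKey)
  }
  println(s"write ${row.value}")
}
currentKey.foreach(k => println(s"close writer for partition $k"))
```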
Author: wangfei <wangfei_hello@126.com>
Author: scwf <wangfei1@huawei.com>
Closes #7336 from scwf/dynamic-optimize-basedon-apachespark.

As discussed in #10786, the generated TungstenAggregate does not support imperative functions.
For a query
```
sqlContext.range(10).filter("id > 1").groupBy().count()
```
The generated code looks like:
```
/* 032 */ if (!initAgg0) {
/* 033 */   initAgg0 = true;
/* 034 */
/* 035 */   // initialize aggregation buffer
/* 037 */   long bufValue2 = 0L;
/* 038 */
/* 039 */
/* 040 */   // initialize Range
/* 041 */   if (!range_initRange5) {
/* 042 */     range_initRange5 = true;
...
/* 071 */   }
/* 072 */
/* 073 */   while (!range_overflow8 && range_number7 < range_partitionEnd6) {
/* 074 */     long range_value9 = range_number7;
/* 075 */     range_number7 += 1L;
/* 076 */     if (range_number7 < range_value9 ^ 1L < 0) {
/* 077 */       range_overflow8 = true;
/* 078 */     }
/* 079 */
/* 085 */     boolean primitive11 = false;
/* 086 */     primitive11 = range_value9 > 1L;
/* 087 */     if (!false && primitive11) {
/* 092 */       // do aggregate and update aggregation buffer
/* 099 */       long primitive17 = -1L;
/* 100 */       primitive17 = bufValue2 + 1L;
/* 101 */       bufValue2 = primitive17;
/* 105 */     }
/* 107 */   }
/* 109 */
/* 110 */   // output the result
/* 112 */   bufferHolder25.reset();
/* 114 */   rowWriter26.initialize(bufferHolder25, 1);
/* 118 */   rowWriter26.write(0, bufValue2);
/* 120 */   result24.pointTo(bufferHolder25.buffer, bufferHolder25.totalSize());
/* 121 */   currentRow = result24;
/* 122 */   return;
/* 124 */ }
/* 125 */
```
cc nongli
Author: Davies Liu <davies@databricks.com>
Closes #10840 from davies/gen_agg.

The current parser turns a decimal literal, for example ```12.1```, into a Double. The problem with this approach is that we convert an exact literal into an inexact ```Double```. This PR changes that behavior: a decimal literal is now converted into an exact ```BigDecimal```.
The behavior for scientific decimals, for example ```12.1e01```, is unchanged; these are still converted into a Double.
This PR replaces the ```BigDecimal``` literal by a ```Double``` literal, because ```BigDecimal``` is the default now. You can use a double literal by appending a 'D' to the value, for instance: ```3.141527D```.
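An illustrative check of the new typing (a sketch; the exact decimal precision/scale shown is an assumption):
```scala
// Plain decimal literals are now exact decimals...
sqlContext.sql("SELECT 12.1 AS x").schema("x").dataType   // e.g. DecimalType(3,1)

// ...while the 'D' suffix and scientific notation still give doubles.
sqlContext.sql("SELECT 12.1D AS x").schema("x").dataType    // DoubleType
sqlContext.sql("SELECT 12.1e01 AS x").schema("x").dataType  // DoubleType
```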
cc davies rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #10796 from hvanhovell/SPARK-12848.

Benchmarked it on 4 different schemas; the results:
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
Hash For simple:             Avg Time(ms)  Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
interpreted version                 31.47         266.54         1.00 X
codegen version                     64.52         130.01         0.49 X
```
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
Hash For normal:             Avg Time(ms)  Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
interpreted version               4068.11           0.26         1.00 X
codegen version                   1175.92           0.89         3.46 X
```
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
Hash For array:              Avg Time(ms)  Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
interpreted version               9276.70           0.06         1.00 X
codegen version                  14762.23           0.04         0.63 X
```
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
Hash For map:                Avg Time(ms)  Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
interpreted version              58869.79           0.01         1.00 X
codegen version                   9285.36           0.06         6.34 X
```
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10816 from cloud-fan/hash-benchmark.

of Children
The existing `Union` logical operator only supports two children. Thus, this adds a new logical operator `Unions` which can have an arbitrary number of children, replacing the existing one.
The `Union` logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files); it is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer to collapse all adjacent `Unions` into a single one. Note that this problem doesn't exist in the physical plan, because the physical `Unions` already supports an arbitrary number of children.
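A self-contained toy sketch of the collapsing rule (over a minimal plan ADT, not Spark's actual LogicalPlan classes):
```scala
sealed trait Plan
case class Leaf(name: String) extends Plan
case class Union(children: Seq[Plan]) extends Plan

// Flatten nested unions bottom-up so adjacent Unions merge into one n-ary Union.
def collapseUnions(plan: Plan): Plan = plan match {
  case Union(children) =>
    Union(children.map(collapseUnions).flatMap {
      case Union(grandchildren) => grandchildren
      case other                => Seq(other)
    })
  case leaf => leaf
}

// Union(Union(a, b), c) becomes Union(a, b, c).
collapseUnions(Union(Seq(Union(Seq(Leaf("a"), Leaf("b"))), Leaf("c"))))
```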
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes #10577 from gatorsmile/unionAllMultiChildren.

Currently, HiveTableScan runs with getCallSite, which is really expensive and shows up when scanning large tables with many partitions (e.g. TPC-DS), slowing down the overall runtime of the job. It would be good to consider using dummyCallSite in HiveTableScan.
Author: Rajesh Balamohan <rbalamohan@apache.org>
Closes #10825 from rajeshbalamohan/SPARK-12898.

Text is in UTF-8, and converting it via "UTF8String.fromString" incurs decoding and encoding, which turns out to be expensive and redundant. Profiler snapshot details are attached in the JIRA (ref: https://issues.apache.org/jira/secure/attachment/12783331/SPARK-12925_profiler_cpu_samples.png).
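A sketch of the kind of change involved (hedged; assumes a Hadoop `Text` record as input): construct the `UTF8String` directly from the already-UTF-8 bytes instead of round-tripping through a Java `String`.
```scala
import org.apache.hadoop.io.Text
import org.apache.spark.unsafe.types.UTF8String

val text = new Text("some line of input")

// Redundant: decode the UTF-8 bytes to a String, then re-encode them as UTF-8.
val slow = UTF8String.fromString(text.toString)

// Direct: wrap the valid prefix of Text's backing buffer (already UTF-8).
val fast = UTF8String.fromBytes(text.getBytes, 0, text.getLength)
```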
Author: Rajesh Balamohan <rbalamohan@apache.org>
Closes #10848 from rajeshbalamohan/SPARK-12925.

Author: Davies Liu <davies@databricks.com>
Closes #10814 from davies/mutable_subexpr.

Also updated documentation to explain why ComputeCurrentTime and EliminateSubQueries are in the optimizer rather than the analyzer.
Author: Reynold Xin <rxin@databricks.com>
Closes #10837 from rxin/optimizer-analyzer-comment.

https://issues.apache.org/jira/browse/SPARK-12871
This PR adds an option to specify the compression codec.
It adds the option `codec` as an alias for `compression`, as filed in [SPARK-12668](https://issues.apache.org/jira/browse/SPARK-12668).
Note that I did not add configurations for Hadoop 1.x, as `CsvRelation` is using the Hadoop 2.x API and I guess Hadoop 1.x support is going to be dropped.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #10805 from HyukjinKwon/SPARK-12420.

The three optimization cases are:
1. If the first branch's condition is a true literal, remove the CaseWhen and use the value from that branch.
2. If a branch's condition is a false or null literal, remove that branch.
3. If only the else branch is left, remove the CaseWhen and use the value from the else branch.
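A self-contained toy version of the three rules (over a minimal expression ADT, not Catalyst's actual CaseWhen):
```scala
sealed trait Expr
case class Lit(value: Any) extends Expr  // Lit(true), Lit(false), Lit(null), ...
case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Expr) extends Expr

def simplify(e: Expr): Expr = e match {
  case CaseWhen(branches, elseValue) =>
    // Rule 2: drop branches whose condition is a false or null literal.
    val kept = branches.filter { case (cond, _) => cond != Lit(false) && cond != Lit(null) }
    kept match {
      case (Lit(true), value) +: _ => value      // Rule 1: leading true literal wins
      case Seq()                   => elseValue  // Rule 3: only the else branch is left
      case _                       => CaseWhen(kept, elseValue)
    }
  case other => other
}

// CASE WHEN false THEN 1 WHEN true THEN 2 ELSE 3 END  ==>  2
simplify(CaseWhen(Seq(Lit(false) -> Lit(1), Lit(true) -> Lit(2)), Lit(3)))  // Lit(2)
```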
Author: Reynold Xin <rxin@databricks.com>
Closes #10827 from rxin/SPARK-12770.

Call `dealias` on local types to fix schema generation for abstract type members, such as
```scala
type KeyValue = (Int, String)
```
Added a simple test.
Author: Jakob Odersky <jodersky@gmail.com>
Closes #10749 from jodersky/aliased-schema.