Author: Reynold Xin <rxin@databricks.com>
Closes #10559 from rxin/remove-deprecated-sql.
length.
The reader was previously not setting the row length, meaning it was wrong if there were variable-length columns. This problem does not usually manifest, since the value in the column is correct and projecting the row fixes the issue.
Author: Nong Li <nong@databricks.com>
Closes #10576 from nongli/spark-12589.
This PR enables cube/rollup as functions, so they can be used like this:
```
select a, b, sum(c) from t group by rollup(a, b)
```
Author: Davies Liu <davies@databricks.com>
Closes #10522 from davies/rollup.
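For comparison, the DataFrame side has long offered the same shape of query; a rough sketch, assuming a DataFrame `t` with columns `a`, `b`, `c` (shown only for reference, not part of this PR):
```scala
import org.apache.spark.sql.functions.sum

// Equivalent of: select a, b, sum(c) from t group by rollup(a, b) / cube(a, b)
val rolled = t.rollup("a", "b").agg(sum("c"))
val cubed  = t.cube("a", "b").agg(sum("c"))
```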
It is currently possible to change the values of the supposedly immutable ```GenericRow``` and ```GenericInternalRow``` classes. This is caused by the fact that Scala's ArrayOps ```toArray``` (returned by calling ```toSeq```) returns the backing array instead of a copy. This PR fixes this problem.
This PR was inspired by https://github.com/apache/spark/pull/10374 by apo1.
cc apo1 sarutak marmbrus cloud-fan nongli (everyone in the previous conversation).
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #10553 from hvanhovell/SPARK-12421.
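A minimal plain-Scala sketch of the aliasing problem, assuming pre-2.13 semantics where `Array#toSeq` returns a `WrappedArray` that shares the backing array (no Spark classes involved):
```scala
object AliasingDemo extends App {
  val backing: Array[Any] = Array(1, "a")
  val exposed: Seq[Any] = backing.toSeq // WrappedArray: shares `backing`, no copy
  backing(0) = 42
  println(exposed(0)) // 42 -- the "immutable" Seq observed the mutation

  val safe: Seq[Any] = backing.clone().toSeq // defensive copy, as in the fix
  backing(1) = "b"
  println(safe(1)) // still "a"
}
```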
also only allocate the required buffer size
Author: Pete Robbins <robbinspg@gmail.com>
Closes #10421 from robbinspg/master.
Avoid the "No such table" exception and throw an analysis exception instead, as per the bug report: SPARK-12533
Author: thomastechs <thomas.sebastian@tcs.com>
Closes #10529 from thomastechs/topic-branch.
A follow-up PR for #9712. Moves the test for arrayOfUDT.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #10538 from viirya/move-udt-test.
Right now, numFields is passed in by pointTo(), and bitSetWidthInBytes is then calculated, making pointTo() a little heavy.
It should instead be part of the constructor of UnsafeRow.
Author: Davies Liu <davies@databricks.com>
Closes #10528 from davies/numFields.
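A simplified sketch of the change, using a hypothetical stand-in class rather than the real `UnsafeRow`:
```scala
// Before: numFields arrived via pointTo(), so bitSetWidthInBytes had to be
// recomputed on every call. After: it is fixed at construction time.
class UnsafeRowSketch(val numFields: Int) {
  // one 64-bit word of null-tracking bits per 64 fields, expressed in bytes
  private val bitSetWidthInBytes: Int = ((numFields + 63) / 64) * 8

  private var baseOffset: Long = 0L

  def pointTo(offset: Long): Unit = {
    baseOffset = offset // now just a cheap pointer update, no per-call computation
  }
}
```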
(docs & tests)
This PR is a follow-up for PR https://github.com/apache/spark/pull/9819. It adds documentation for the window functions and a couple of NULL tests.
The documentation was largely based on the documentation in (the source of) Hive and Presto:
* https://prestodb.io/docs/current/functions/window.html
* https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
I am not sure if we need to add the licenses of these two projects to the licenses directory. They are both under the ASL. srowen any thoughts?
cc yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #10402 from hvanhovell/SPARK-8641-docs.
In most cases we should propagate null when calling `NewInstance`, and so far there is only one case where we should stop null propagation: creating a product/Java bean. So I think it makes more sense to propagate null by default.
This also fixes a bug when encoding a null array/map, which was first discovered in https://github.com/apache/spark/pull/10401
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10443 from cloud-fan/encoder.
```
org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input columns text;
```
Let's put a `:` after `columns` and put the columns in `[]` so that they match the `toString` of DataFrame; the message then reads `cannot resolve 'value' given input columns: [text];`.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10518 from gatorsmile/improveAnalysisExceptionMsg.
In Spark we allow UDFs to declare their expected input types in order to apply type coercion. The expected-input-types parameter takes a Seq[DataType] and uses Nil when no type coercion is applied. It makes more sense to take Option[Seq[DataType]] instead, so we can differentiate a no-arg function from a function with no expected input types specified.
Author: Reynold Xin <rxin@databricks.com>
Closes #10504 from rxin/SPARK-12549.
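A small sketch of why the `Option` matters, using hypothetical `UdfSpec`/`DataType` stand-ins rather than Spark's actual classes:
```scala
sealed trait DataType
case object IntType extends DataType

// With a bare Seq, Nil is ambiguous: zero-arg UDF or "no types declared"?
// Option[Seq[DataType]] separates the two cases cleanly.
case class UdfSpec(expectedInputTypes: Option[Seq[DataType]])

val noArgUdf   = UdfSpec(Some(Nil))          // zero-arg function: empty coercion list
val untypedUdf = UdfSpec(None)               // nothing declared: skip type coercion
val typedUdf   = UdfSpec(Some(Seq(IntType))) // one Int argument expected
```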
Compilation error caused by string concatenations that are not constant.
Use a raw string literal to avoid the string concatenations.
https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #10488 from kiszk/SPARK-12530.
Generate/MapPartitions/AppendColumns/MapGroups/CoGroup
When explaining any plan with Generate, we see an exclamation mark in the plan. Normally, this mark means the plan has an error. This PR corrects the `missingInput` in `Generate`.
For example,
```scala
val df = Seq((1, "a b c"), (2, "a b"), (3, "a")).toDF("number", "letters")
val df2 =
df.explode('letters) {
case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq
}
df2.explain(true)
```
Before the fix, the plan looks like this:
```
== Parsed Logical Plan ==
'Generate UserDefinedGenerator('letters), true, false, None
+- Project [_1#0 AS number#2,_2#1 AS letters#3]
+- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]]
== Analyzed Logical Plan ==
number: int, letters: string, _1: string
Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8]
+- Project [_1#0 AS number#2,_2#1 AS letters#3]
+- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]]
== Optimized Logical Plan ==
Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8]
+- LocalRelation [number#2,letters#3], [[1,a b c],[2,a b],[3,a]]
== Physical Plan ==
!Generate UserDefinedGenerator(letters#3), true, false, [number#2,letters#3,_1#8]
+- LocalTableScan [number#2,letters#3], [[1,a b c],[2,a b],[3,a]]
```
**Updates**: The same issue was also found in the other four Dataset operators: `MapPartitions`/`AppendColumns`/`MapGroups`/`CoGroup`. Fixed all four.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes #10393 from gatorsmile/generateExplain.
Moved the (case) classes Strategy, Once, FixedPoint and Batch to the companion object. This is necessary if we want the Optimizer to be easily extendable, in the following sense: usually a user wants to add additional rules and keep the ones that are already there. However, the inner classes made that impossible, since the code did not compile.
This allows easy extension of existing Optimizers; see DefaultOptimizerExtendableSuite for a corresponding test case.
Author: Stephan Kessler <stephan.kessler@sap.com>
Closes #10174 from stephankessler/SPARK-7727.
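A sketch of the extension pattern this enables; the exact batch wiring and member visibility are assumptions, and `MyRule` is a hypothetical no-op `Rule[LogicalPlan]`:
```scala
import org.apache.spark.sql.catalyst.optimizer.Optimizer
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing rule standing in for whatever the user wants to add.
object MyRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// With Strategy/Once/FixedPoint/Batch reachable from a subclass, extension is
// a matter of overriding `batches` (the built-in batches are elided here).
class ExtendedOptimizer extends Optimizer {
  override def batches: Seq[Batch] = Batch("MyRules", Once, MyRule) :: Nil
}
```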
Accessing null elements in an array field fails when Tungsten is enabled.
It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled.
This PR solves this by checking, in the generated code, whether the accessed element in the array field is null.
Example:
```
// Array of String
case class AS( as: Seq[String] )
val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF
dfAS.registerTempTable("T_AS")
for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))}
```
With Tungsten disabled:
```
0 = [a]
1 = [null]
2 = [b]
```
With Tungsten enabled:
```
0 = [a]
15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
```
Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com>
Closes #10429 from pierre-borckmans/SPARK-12477_Tungsten-Projection-Null-Element-In-Array.
When the filter is ```"b in ('1', '2')"```, the filter is not pushed down to Parquet. Thanks!
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes #10278 from gatorsmile/parquetFilterNot.
When creating extractors for product types (i.e. case classes and tuples), a null check is missing, thus we always assume input product values are non-null.
This PR adds a null check in the extractor expression for product types. The null check is stripped off for top level product fields, which are mapped to the outermost `Row`s, since they can't be null.
Thanks cloud-fan for helping investigate this issue!
Author: Cheng Lian <lian@databricks.com>
Closes #10431 from liancheng/spark-12478.top-level-null-field.
during analysis
Compare both the left and right sides of the CASE expression, ignoring nullability, when checking for type equality.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #10156 from dilipbiswal/spark-12102.
First try, not sure how much information we need to provide in the usage part.
Author: Xiu Guo <xguo27@gmail.com>
Closes #10423 from xguo27/SPARK-12456.
This PR adds a new expression `AssertNotNull` to ensure non-nullable fields of products and case classes don't receive null values at runtime.
Author: Cheng Lian <lian@databricks.com>
Closes #10331 from liancheng/dataset-nullability-check.
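A conceptual sketch of the runtime check such an expression performs; this is a plain standalone helper, not Catalyst's actual `AssertNotNull` implementation:
```scala
def assertNotNull[T](value: T, walkedPath: Seq[String]): T = {
  if (value == null) {
    throw new RuntimeException(
      s"Null value appeared in non-nullable field: ${walkedPath.mkString(".")}")
  }
  value
}

case class Person(name: String, age: Int)

val ok = assertNotNull(Person("a", 1), Seq("person")) // passes through unchanged
// assertNotNull(null: Person, Seq("person"))         // would throw at runtime
```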
Based on the suggestions from marmbrus, added logical/physical operators for Range to improve performance.
Also added another API for resolving the JIRA SPARK-12150.
Could you take a look at my implementation, marmbrus? If it's not good, I can rework it. : )
Thank you very much!
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10335 from gatorsmile/rangeOperators.
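The new operators back APIs along these lines; a sketch assuming the 1.6-era `SQLContext.range(start, end, step, numPartitions)` overload and an in-scope `sqlContext`:
```scala
// Produces a DataFrame of longs in [0, 1000) with step 2, in 4 partitions,
// planned through the new logical/physical Range operators.
val df = sqlContext.range(0L, 1000L, 2L, 4)
df.explain()
```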
An alternative solution to https://github.com/apache/spark/pull/10295: instead of implementing a JSON format for all logical/physical plans and expressions, use reflection to implement it in `TreeNode`.
Here I use a pre-order traversal to flatten a plan tree into a plan list, and add an extra field `num-children` to each plan node, so that we can reconstruct the tree from the list.
Example JSON for a logical plan tree:
```
[ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Sort",
"num-children" : 1,
"order" : [ [ {
"class" : "org.apache.spark.sql.catalyst.expressions.SortOrder",
"num-children" : 1,
"child" : 0,
"direction" : "Ascending"
}, {
"class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "i",
"dataType" : "integer",
"nullable" : true,
"metadata" : { },
"exprId" : {
"id" : 10,
"jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
},
"qualifiers" : [ ]
} ] ],
"global" : false,
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
"num-children" : 1,
"projectList" : [ [ {
"class" : "org.apache.spark.sql.catalyst.expressions.Alias",
"num-children" : 1,
"child" : 0,
"name" : "i",
"exprId" : {
"id" : 10,
"jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
},
"qualifiers" : [ ]
}, {
"class" : "org.apache.spark.sql.catalyst.expressions.Add",
"num-children" : 2,
"left" : 0,
"right" : 1
}, {
"class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "a",
"dataType" : "integer",
"nullable" : true,
"metadata" : { },
"exprId" : {
"id" : 0,
"jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
},
"qualifiers" : [ ]
}, {
"class" : "org.apache.spark.sql.catalyst.expressions.Literal",
"num-children" : 0,
"value" : "1",
"dataType" : "integer"
} ], [ {
"class" : "org.apache.spark.sql.catalyst.expressions.Alias",
"num-children" : 1,
"child" : 0,
"name" : "j",
"exprId" : {
"id" : 11,
"jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
},
"qualifiers" : [ ]
}, {
"class" : "org.apache.spark.sql.catalyst.expressions.Multiply",
"num-children" : 2,
"left" : 0,
"right" : 1
}, {
"class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "a",
"dataType" : "integer",
"nullable" : true,
"metadata" : { },
"exprId" : {
"id" : 0,
"jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
},
"qualifiers" : [ ]
}, {
"class" : "org.apache.spark.sql.catalyst.expressions.Literal",
"num-children" : 0,
"value" : "2",
"dataType" : "integer"
} ] ],
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.LocalRelation",
"num-children" : 0,
"output" : [ [ {
"class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "a",
"dataType" : "integer",
"nullable" : true,
"metadata" : { },
"exprId" : {
"id" : 0,
"jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
},
"qualifiers" : [ ]
} ] ],
"data" : [ ]
} ]
```
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10311 from cloud-fan/toJson-reflection.
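A standalone sketch of the flatten/rebuild scheme described above, using a generic tree stand-in rather than `TreeNode` itself: each node is emitted in pre-order together with its child count, and the tree is reconstructed by consuming the list recursively.
```scala
case class Node(name: String, children: Seq[Node])

// Pre-order flattening: each entry carries the node payload plus its child count.
def flatten(n: Node): Seq[(String, Int)] =
  (n.name, n.children.size) +: n.children.flatMap(flatten)

// Rebuild by walking the list once; `build` returns the node and the next index.
def rebuild(flat: Seq[(String, Int)]): Node = {
  def build(i: Int): (Node, Int) = {
    val (name, numChildren) = flat(i)
    var next = i + 1
    val kids = (0 until numChildren).map { _ =>
      val (child, after) = build(next)
      next = after
      child
    }
    (Node(name, kids), next)
  }
  build(0)._1
}

val tree = Node("Sort", Seq(Node("Project", Seq(Node("LocalRelation", Nil)))))
assert(rebuild(flatten(tree)) == tree)
```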
When a DataFrame or Dataset has a long schema, we should intelligently truncate it to avoid flooding the screen with unreadable information:
```
// Standard output
[a: int, b: int]

// Truncate many top-level fields
[a: int, b: string ... 10 more fields]

// Truncate long inner structs
[a: struct<a: Int ... 10 more fields>]
```
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #10373 from dilipbiswal/spark-12398.
Author: Reynold Xin <rxin@databricks.com>
Closes #10387 from rxin/version-bump.
Now `StaticInvoke` receives `Any` as an object. `StaticInvoke` can be serialized, but sometimes the object passed in is not serializable.
For example, the following code raises an exception because `RowEncoder#extractorsFor`, invoked indirectly, creates a `StaticInvoke`.
```
case class TimestampContainer(timestamp: java.sql.Timestamp)
val rdd = sc.parallelize(1 to 2).map(_ => TimestampContainer(System.currentTimeMillis))
val df = rdd.toDF
val ds = df.as[TimestampContainer]
val rdd2 = ds.rdd <----------------- invokes extractorsFor indirectly
```
I'll add test cases.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: Michael Armbrust <michael@databricks.com>
Closes #10357 from sarutak/SPARK-12404.
This could simplify the generated code for expressions that are not nullable.
This PR fixes lots of bugs related to nullability.
Author: Davies Liu <davies@databricks.com>
Closes #10333 from davies/skip_nullable.
Description of the problem, from cloud-fan:
Actually this line: https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L689
When we use `selectExpr`, we pass an `UnresolvedFunction` to `DataFrame.select` and fall into the last case. A workaround is to do special handling for UDTFs as we did for `explode` (and `json_tuple` in 1.6), wrapping them with `MultiAlias`.
Another workaround is using `expr`, for example `df.select(expr("explode(a)").as(Nil))`. I think `selectExpr` is no longer needed after we have the `expr` function...
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #9981 from dilipbiswal/spark-11619.
This PR removes Hive window functions from Spark and replaces them with (native) Spark ones. The PR is on par with Hive in terms of features.
This has the following advantages:
* Better memory management.
* The ability to use spark UDAFs in Window functions.
cc rxin / yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #9819 from hvanhovell/SPARK-8641-2.
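An example of the kind of query the native implementation now serves, including an aggregate used over a window; the DataFrame `df` with columns `dept`/`name`/`salary` is hypothetical, and the window API shown is the standard DataFrame one:
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byDept = Window.partitionBy("dept").orderBy(col("salary").desc)

df.select(
  col("name"),
  rank().over(byDept).as("salary_rank"),      // ranking window function
  avg("salary").over(byDept).as("dept_avg"))  // aggregate used over a window
```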
for Tuple encoder
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10293 from cloud-fan/err-msg.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes #10316 from davies/remove_generate_projection.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8645 from cloud-fan/test.
schemas.
Author: Nong Li <nong@databricks.com>
Closes #10260 from nongli/spark-11271.
I think it was a mistake, and we had not caught it until https://github.com/apache/spark/pull/10260, which began to check whether the `fromRowExpression` is resolved.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10263 from cloud-fan/encoder.
Currently, we can generate different plans for a query with a single distinct aggregation (depending on spark.sql.specializeSingleDistinctAggPlanning): one works better on low-cardinality columns, the other works better on high-cardinality columns (the default one).
This PR changes that to generate a single plan (three aggregations and two exchanges) that works well in both cases, so we can safely remove the flag `spark.sql.specializeSingleDistinctAggPlanning` (introduced in 1.6).
A query like `SELECT COUNT(DISTINCT a) FROM table` will be planned as:
```
AGG-4 (count distinct)
Shuffle to a single reducer
Partial-AGG-3 (count distinct, no grouping)
Partial-AGG-2 (grouping on a)
Shuffle by a
Partial-AGG-1 (grouping on a)
```
This PR also includes a large refactoring of aggregation (removing 500+ lines of code).
cc yhuai nongli marmbrus
Author: Davies Liu <davies@databricks.com>
Closes #10228 from davies/single_distinct.
This is a follow-up PR for #10259
Author: Davies Liu <davies@databricks.com>
Closes #10266 from davies/null_udf2.
Check nullability and pass it into ScalaUDF.
Closes #10249
Author: Davies Liu <davies@databricks.com>
Closes #10259 from davies/udf_null.
In https://github.com/apache/spark/pull/10133 we found that we should ensure the children of `TreeNode` are all accessible in the `productIterator`, or the behavior will be very confusing.
In this PR, I try to fix this problem by exposing the `loopVar`.
This also fixes SPARK-12131, which was caused by the hacky `MapObjects`.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10239 from cloud-fan/map-objects.
Author: Michael Armbrust <michael@databricks.com>
Closes #10060 from marmbrus/docs.
Delays the application of ResolvePivot until all aggregates are resolved, to prevent problems with UnresolvedFunction, and adds a unit test.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes #10202 from aray/sql-pivot-unresolved-function.
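For reference, the pivot path this rule supports; the DataFrame `df` with columns `year`/`course`/`earnings` is hypothetical:
```scala
import org.apache.spark.sql.functions.sum

// ResolvePivot must wait until `sum(...)` is resolved before rewriting this plan.
val pivoted = df.groupBy("year").pivot("course").agg(sum("earnings"))
```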
This PR adds three more data types to Encoder: `BigDecimal`, `Date` and `Timestamp`.
marmbrus cloud-fan rxin Could you take a quick look at these three types? Not sure if it can be merged to 1.6. Thank you very much!
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10188 from gatorsmile/dataTypesinEncoder.
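A sketch of what this enables; the Dataset-creation route via implicits is an assumption about usage, not code from the PR:
```scala
import java.sql.{Date, Timestamp}
import sqlContext.implicits._

// Tuples containing the newly supported types can be encoded into a Dataset.
val ds = Seq(
  (BigDecimal("1.50"), Date.valueOf("2016-01-04"), new Timestamp(0L))
).toDS()
```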
Checked with Hive: greatest/least should cast their children to the tightest common type,
i.e. `(int, long) => long`, `(int, string) => error`, `(decimal(10,5), decimal(5,10)) => error`.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10196 from cloud-fan/type-coercion.
Currently, the order of joins is exactly the same as in the SQL query, so some conditions may not be pushed down to the correct join; those joins then become cross products and are extremely slow.
This patch tries to re-order the inner joins (which are common in SQL queries), picking the joins that have self-contained conditions first and delaying those that do not have conditions.
After this patch, the TPC-DS queries Q64/Q65 can run hundreds of times faster.
cc marmbrus nongli
Author: Davies Liu <davies@databricks.com>
Closes #10073 from davies/reorder_joins.
When \u appears in a comment block (i.e. in /**/), code gen will break. So, in Expression and CodegenFallback, we escape \u to \\u.
yhuai Please review it. I did reproduce it and it works after the fix. Thanks!
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10155 from gatorsmile/escapeU.
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).
I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
This replaces https://github.com/apache/spark/pull/9696
Invoke Checkstyle and print any errors to the console, failing the step.
Use Google's style rules modified according to
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
multiple violations being present in the codebase.
Suggest fixing those TODOs in separate PR(s).
More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).
Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I run the build twice with different profiles):
> Checkstyle checks failed at following occurrences:
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1
Also fix some of the minor violations that didn't require sweeping changes.
Apologies for the previous botched PRs - I finally figured out the issue.
cr: JoshRosen, pwendell
> I state that the contribution is my original work, and I license the work to the project under the project's open source license.
Author: Dmitry Erastov <derastov@gmail.com>
Closes #9867 from dskrvk/master.
https://issues.apache.org/jira/browse/SPARK-12109
The change of https://issues.apache.org/jira/browse/SPARK-11596 exposed the problem.
In the SQL plan visualization, the filter shows:
![image](https://cloud.githubusercontent.com/assets/2072857/11547075/1a285230-9906-11e5-8481-2bb451e35ef1.png)
After changes in this PR, the viz is back to normal.
![image](https://cloud.githubusercontent.com/assets/2072857/11547080/2bc570f4-9906-11e5-8897-3b3bff173276.png)
Author: Yin Huai <yhuai@databricks.com>
Closes #10111 from yhuai/SPARK-12109.
When examining plans of complex queries with multiple joins, a pain point of mine is that it's hard to immediately see the sibling node of a specific query plan node. This PR adds tree lines to the tree string of a `TreeNode`, so that the result is visually more intuitive.
Author: Cheng Lian <lian@databricks.com>
Closes #10099 from liancheng/prettier-tree-string.
A follow-up to #10038.
We can use bitmasks to determine which grouping expressions need to be set as nullable (see the sketch below).
cc yhuai
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #10067 from viirya/fix-cube-following.
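An illustrative sketch of the bitmask idea; the bit convention here is chosen for illustration and is not necessarily the one Spark uses. A grouping expression must be marked nullable iff at least one grouping set nulls it out:
```scala
val groupByExprs = Seq("a", "b")

// ROLLUP(a, b) yields the grouping sets (a, b), (a) and ().
// Encode each set as a bitmask where bit i set means expression i is nulled out.
val rollupMasks = Seq(0x0, 0x2, 0x3) // (a,b) -> 00, (a) -> 10 (b nulled), () -> 11

def mustBeNullable(i: Int): Boolean =
  rollupMasks.exists(mask => (mask & (1 << i)) != 0)

groupByExprs.zipWithIndex.foreach { case (expr, i) =>
  println(s"$expr must be nullable: ${mustBeNullable(i)}") // both true here
}
```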
In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we will only return the simpleString.
I tested the [following case provided by Cristian](https://issues.apache.org/jira/browse/SPARK-11596?focusedCommentId=15019241&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15019241).
```
val c = (1 to 20).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
println(s"PROCESSING >>>>>>>>>>> $idx")
val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
val union = curr.map(_.unionAll(df)).getOrElse(df)
union.cache()
Some(union)
}
c.get.explain(true)
```
Without the change, `c.get.explain(true)` took 100s. With the change, `c.get.explain(true)` took 26ms.
https://issues.apache.org/jira/browse/SPARK-11596
Author: Yin Huai <yhuai@databricks.com>
Closes #10079 from yhuai/SPARK-11596.