| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
| |
This will be helpful for [SPARK-1495](https://issues.apache.org/jira/browse/SPARK-1495) and other cases where we want to have custom hash join implementations but don't want to repeat the logic for finding the join keys.
Author: Michael Armbrust <michael@databricks.com>
Closes #418 from marmbrus/hashFilter and squashes the following commits:
d5cc79b [Michael Armbrust] Address @rxin 's comments.
366b6d9 [Michael Armbrust] style fixes
14560eb [Michael Armbrust] Generalize pattern for planning hash joins.
f4809c1 [Michael Armbrust] Move common functions to PredicateHelper.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unfortunately, this is not exhaustive - particularly hive tests still fail due to path issues.
Author: Mridul Muralidharan <mridulm80@apache.org>
This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>
Closes #505 from mridulm/windows_fixes and squashes the following commits:
ef12283 [Mridul Muralidharan] Move to org.apache.commons.lang3 for StringEscapeUtils. Earlier version was buggy appparently
cdae406 [Mridul Muralidharan] Remove leaked changes from > 2G fix branch
3267f4b [Mridul Muralidharan] Fix build failures
35b277a [Mridul Muralidharan] Fix Scalastyle failures
bc69d14 [Mridul Muralidharan] Change from hardcoded path separator
10c4d78 [Mridul Muralidharan] Use explicit encoding while using getBytes
1337abd [Mridul Muralidharan] fix classpath while running in windows
|
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #518 from marmbrus/parseArrayIndex and squashes the following commits:
afd2d6b [Michael Armbrust] 100 chars
c3d6026 [Michael Armbrust] Add support for parsing indexing into arrays in SQL.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
copying form previous pull request https://github.com/apache/spark/pull/462
Its probably better to let the underlying language implementation take care of the default . This was easier to do with python as the default value for seed in random and numpy random is None.
In Scala/Java side it might mean propagating an Option or null(oh no!) down the chain until where the Random is constructed. But, looks like the convention in some other methods was to use System.nanoTime. So, followed that convention.
Conflict with overloaded method in sql.SchemaRDD.sample which also defines default params.
sample(fraction, withReplacement=false, seed=math.random)
Scala does not allow more than one overloaded to have default params. I believe the author intended to override the RDD.sample method and not overload it. So, changed it.
If backward compatible is important, 3 new method can be introduced (without default params) like this
sample(fraction)
sample(fraction, withReplacement)
sample(fraction, withReplacement, seed)
Added some tests for the scala RDD takeSample method.
Author: Arun Ramakrishnan <smartnut007@gmail.com>
This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>
Closes #477 from smartnut007/master and squashes the following commits:
07bb06e [Arun Ramakrishnan] SPARK-1438 fixing more space formatting issues
b9ebfe2 [Arun Ramakrishnan] SPARK-1438 removing redundant import of random in python rddsampler
8d05b1a [Arun Ramakrishnan] SPARK-1438 RDD . Replace System.nanoTime with a Random generated number. python: use a separate instance of Random instead of seeding language api global Random instance.
69619c6 [Arun Ramakrishnan] SPARK-1438 fix spacing issue
0c247db [Arun Ramakrishnan] SPARK-1438 RDD language apis to support optional seed in RDD methods sample/takeSample
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Any comments are welcome
Author: Sandeep <sandeep@techaddict.me>
Closes #531 from techaddict/stylefix-1 and squashes the following commits:
7492730 [Sandeep] Pass 4
98b2428 [Sandeep] fix rxin suggestions
b5e2e6f [Sandeep] Pass 3
05932d7 [Sandeep] fix if else styling 2
08690e5 [Sandeep] fix if else styling
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
BooleanType to NumericTy...
...pe.
`Cast` from `BooleanType` to `NumericType` are all using `Int` value.
But it causes `ClassCastException` when the casted value is used by the following evaluation like the code below:
``` scala
scala> import org.apache.spark.sql.catalyst._
import org.apache.spark.sql.catalyst._
scala> import types._
import types._
scala> import expressions._
import expressions._
scala> Add(Cast(Literal(true), ShortType), Literal(1.toShort)).eval()
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Short
at scala.runtime.BoxesRunTime.unboxToShort(BoxesRunTime.java:102)
at scala.math.Numeric$ShortIsIntegral$.plus(Numeric.scala:72)
at org.apache.spark.sql.catalyst.expressions.Add$$anonfun$eval$2.apply(arithmetic.scala:58)
at org.apache.spark.sql.catalyst.expressions.Add$$anonfun$eval$2.apply(arithmetic.scala:58)
at org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:114)
at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:58)
at .<init>(<console>:17)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)
at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717)
at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:581)
at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:588)
at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:591)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:882)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:837)
at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:83)
at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:96)
at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:105)
at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)
```
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #533 from ueshin/issues/SPARK-1610 and squashes the following commits:
70f36e8 [Takuya UESHIN] Fix Cast to use exact type value when cast from BooleanType to NumericType.
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #489 from marmbrus/sqlDocFixes and squashes the following commits:
acee4f3 [Michael Armbrust] Fix visibility / annotation of Spark SQL APIs
|
|
|
|
|
|
|
|
|
|
| |
I think I hit a class loading issue when running JavaSparkSQL example using spark-submit in local mode.
Author: Kan Zhang <kzhang@apache.org>
Closes #484 from kanzhang/SPARK-1570 and squashes the following commits:
feaaeba [Kan Zhang] [SPARK-1570] Fix classloading in JavaSQLContext.applySchema
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
... so that we don't follow an unspoken set of forbidden rules for adding **@AlphaComponent**, **@DeveloperApi**, and **@Experimental** annotations in the code.
In addition, this PR
(1) removes unnecessary `:: * ::` tags,
(2) adds missing `:: * ::` tags, and
(3) removes annotations for internal APIs.
Author: Andrew Or <andrewor14@gmail.com>
Closes #470 from andrewor14/annotations-fix and squashes the following commits:
92a7f42 [Andrew Or] Document + fix annotation usages
|
|
|
|
|
|
|
|
|
| |
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #432 from liancheng/reuseRow and squashes the following commits:
9e6d083 [Cheng Lian] Simplified code with BufferedIterator
52acec9 [Cheng Lian] Reuses Row object in ExistingRdd.productToRowRdd()
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
https://issues.apache.org/jira/browse/SPARK-1483
From the original JIRA: " The parameter name is part of the public API in Scala and Python, since you can pass named parameters to a method, so we should name it to this more descriptive term. Everywhere else we refer to "splits" as partitions." - @mateiz
Author: CodingCat <zhunansjtu@gmail.com>
Closes #430 from CodingCat/SPARK-1483 and squashes the following commits:
4b60541 [CodingCat] deprecate defaultMinSplits
ba2c663 [CodingCat] Rename minSplits to minPartitions in public APIs
|
|
|
|
|
|
|
|
|
|
| |
When running Hive tests, the working directory is `$SPARK_HOME/sql/hive`, while when running `sbt hive/console`, it becomes `$SPARK_HOME`, and test tables are not loaded if `HIVE_DEV_HOME` is not defined.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #417 from liancheng/loadTestTables and squashes the following commits:
7cea8d6 [Cheng Lian] Loads test tables when running "sbt hive/console" without HIVE_DEV_HOME
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This makes it possible to create tables and insert into them using the DSL and SQL for the scala and java apis.
Author: Michael Armbrust <michael@databricks.com>
Closes #354 from marmbrus/insertIntoTable and squashes the following commits:
6c6f227 [Michael Armbrust] Create random temporary files in python parquet unit tests.
f5e6d5c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into insertIntoTable
765c506 [Michael Armbrust] Add to JavaAPI.
77b512c [Michael Armbrust] typos.
5c3ef95 [Michael Armbrust] use names for boolean args.
882afdf [Michael Armbrust] Change createTableAs to saveAsTable. Clean up api annotations.
d07d94b [Michael Armbrust] Add tests, support for creating parquet files and hive tables.
fa3fe81 [Michael Armbrust] Make insertInto available on JavaSchemaRDD as well. Add createTableAs function.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports sql queries.
```
from pyspark.context import SQLContext
sqlCtx = SQLContext(sc)
rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
srdd = sqlCtx.applySchema(rdd)
sqlCtx.registerRDDAsTable(srdd, "table1")
srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
srdd2.collect()
```
The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]```
Author: Ahir Reddy <ahirreddy@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #363 from ahirreddy/pysql and squashes the following commits:
0294497 [Ahir Reddy] Updated log4j properties to supress Hive Warns
307d6e0 [Ahir Reddy] Style fix
6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies
3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py
29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD
f2312c7 [Ahir Reddy] Moved everything into sql.py
a19afe4 [Ahir Reddy] Doc fixes
6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL
521ff6d [Ahir Reddy] Trying to get spark to build with hive
ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins
ded03e7 [Ahir Reddy] Added doc test for HiveContext
22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency
e4da06c [Ahir Reddy] Display message if hive is not built into spark
227a0be [Michael Armbrust] Update API links. Fix Hive example.
58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api. Minor fixes.
4285340 [Michael Armbrust] Fix building of Hive API Docs.
38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs.
337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build
40491c9 [Ahir Reddy] PR Changes + Method Visibility
1836944 [Michael Armbrust] Fix comments.
e00980f [Michael Armbrust] First draft of python sql programming guide.
b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test
f98a422 [Ahir Reddy] HiveContexts
79621cf [Ahir Reddy] cleaning up cruft
b406ba0 [Ahir Reddy] doctest formatting
20936a5 [Ahir Reddy] Added tests and documentation
e4d21b4 [Ahir Reddy] Added pyrolite dependency
79f739d [Ahir Reddy] added more tests
7515ba0 [Ahir Reddy] added more tests :)
d26ec5e [Ahir Reddy] added test
e9f5b8d [Ahir Reddy] adding tests
906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python
251f99d [Ahir Reddy] for now only allow dictionaries as input
09b9980 [Ahir Reddy] made jrdd explicitly lazy
c608947 [Ahir Reddy] SchemaRDD now has all RDD operations
725c91e [Ahir Reddy] awesome row objects
55d1c76 [Ahir Reddy] return row objects
4fe1319 [Ahir Reddy] output dictionaries correctly
be079de [Ahir Reddy] returning dictionaries works
cd5f79f [Ahir Reddy] Switched to using Scala SQLContext
e948bd9 [Ahir Reddy] yippie
4886052 [Ahir Reddy] even better
c0fb1c6 [Ahir Reddy] more working
043ca85 [Ahir Reddy] working
5496f9f [Ahir Reddy] doesn't crash
b8b904b [Ahir Reddy] Added schema rdd class
67ba875 [Ahir Reddy] java to python, and python to java
bcc0f23 [Ahir Reddy] Java to python
ab6025d [Ahir Reddy] compiling
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For your consideration: scalac currently notes a number of feature warnings during compilation:
```
[warn] there were 65 feature warning(s); re-run with -feature for details
```
Warnings are like:
```
[warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled
[warn] by making the implicit value scala.language.implicitConversions visible.
[warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions'
[warn] or by setting the compiler option -language:implicitConversions.
[warn] See the Scala docs for value scala.language.implicitConversions for a discussion
[warn] why the feature should be explicitly enabled.
[warn] implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
[warn] ^
```
scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used.
This PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build.
Author: Sean Owen <sowen@cloudera.com>
Closes #404 from srowen/SPARK-1488 and squashes the following commits:
8598980 [Sean Owen] Quiet scalac warnings about language features by explicitly importing language features.
39bc831 [Sean Owen] Enable -feature in scalac to emit language feature warnings
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fixed several bugs of in-memory columnar storage to make `HiveInMemoryCompatibilitySuite` pass.
@rxin @marmbrus It is reasonable to include `HiveInMemoryCompatibilitySuite` in this PR, but I didn't, since it significantly increases test execution time. What do you think?
**UPDATE** `HiveCompatibilitySuite` has been made to cache tables in memory. `HiveInMemoryCompatibilitySuite` was removed.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #374 from liancheng/inMemBugFix and squashes the following commits:
6ad6d9b [Cheng Lian] Merged HiveCompatibilitySuite and HiveInMemoryCompatibilitySuite
5bdbfe7 [Cheng Lian] Revert 882c538 & 8426ddc, which introduced regression
882c538 [Cheng Lian] Remove attributes field from InMemoryColumnarTableScan
32cc9ce [Cheng Lian] Code style cleanup
99382bf [Cheng Lian] Enable compression by default
4390bcc [Cheng Lian] Report error for any Throwable in HiveComparisonTest
d1df4fd [Michael Armbrust] Remove test tables that might always get created anyway?
ab9e807 [Michael Armbrust] Fix the logged console version of failed test cases to use the new syntax.
1965123 [Michael Armbrust] Don't use coalesce for gathering all data to a single partition, as it does not work correctly with mutable rows.
e36cdd0 [Michael Armbrust] Spelling.
2d0e168 [Michael Armbrust] Run Hive tests in-memory too.
6360723 [Cheng Lian] Made PreInsertionCasts support SparkLogicalPlan and InMemoryColumnarTableScan
c9b0f6f [Cheng Lian] Let InsertIntoTable support InMemoryColumnarTableScan
9c8fc40 [Cheng Lian] Disable compression by default
e619995 [Cheng Lian] Bug fix: incorrect byte order in CompressionScheme.columnHeaderSize
8426ddc [Cheng Lian] Bug fix: InMemoryColumnarTableScan should cache columns specified by the attributes argument
036cd09 [Cheng Lian] Clean up unused imports
44591a5 [Cheng Lian] Bug fix: NullableColumnAccessor.hasNext must take nulls into account
052bf41 [Cheng Lian] Bug fix: should only gather compressibility info for non-null values
95b3301 [Cheng Lian] Fixed bugs in IntegralDelta
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The Spark codebase is a bit fast-and-loose when accessing classloaders and this has caused a few bugs to surface in master.
This patch defines some utility methods for accessing classloaders. This makes the intention when accessing a classloader much more explicit in the code and fixes a few cases where the wrong one was chosen.
case (a) -> We want the classloader that loaded Spark
case (b) -> We want the context class loader, or if not present, we want (a)
This patch provides a better fix for SPARK-1403 (https://issues.apache.org/jira/browse/SPARK-1403) than the current work around, which it reverts. It also fixes a previously unreported bug that the `./spark-submit` script did not work for running with `local` master. It didn't work because the executor classloader did not properly delegate to the context class loader (if it is defined) and in local mode the context class loader is set by the `./spark-submit` script. A unit test is added for that case.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #398 from pwendell/class-loaders and squashes the following commits:
b4a1a58 [Patrick Wendell] Minor clean up
14f1272 [Patrick Wendell] SPARK-1480: Clean up use of classloaders
|
|
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #378 from marmbrus/columnPruning and squashes the following commits:
779da56 [Michael Armbrust] More consistent naming.
1a4e9ea [Michael Armbrust] More comments.
2f4e7b9 [Michael Armbrust] Improve column pruning in the optimizer.
|
|
|
|
|
|
|
|
|
|
| |
stack these together in a commit else they show up chunk by chunk in different commits.
Author: Sandeep <sandeep@techaddict.me>
Closes #380 from techaddict/white_space and squashes the following commits:
b58f294 [Sandeep] Remove Unnecessary Whitespace's
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: witgo <witgo@qq.com>
Closes #325 from witgo/SPARK-1413 and squashes the following commits:
e57cd8e [witgo] use scala reflection to access and call the SLF4JBridgeHandler methods
45c8f40 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
5e35d87 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
0d5f819 [witgo] review commit
45e5b70 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
fa69dcf [witgo] Merge branch 'master' into SPARK-1413
3c98dc4 [witgo] Merge branch 'master' into SPARK-1413
38160cb [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
ba09bcd [witgo] remove set the parquet log level
a63d574 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
5231ecd [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
3feb635 [witgo] parquet logger use parent handler
fa00d5d [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
8bb6ffd [witgo] enableLogForwarding note fix
edd9630 [witgo] move to
f447f50 [witgo] merging master
5ad52bd [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
76670c1 [witgo] review commit
70f3c64 [witgo] Fix SPARK-1413
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch marks some existing classes as private[spark] and adds two types of API annotations:
- `EXPERIMENTAL API` = experimental user-facing module
- `DEVELOPER API - UNSTABLE` = developer-facing API that might change
There is some discussion of the different mechanisms for doing this here:
https://issues.apache.org/jira/browse/SPARK-1081
I was pretty aggressive with marking things private. Keep in mind that if we want to open something up in the future we can, but we can never reduce visibility.
A few notes here:
- In the past we've been inconsistent with the visiblity of the X-RDD classes. This patch marks them private whenever there is an existing function in RDD that can directly creat them (e.g. CoalescedRDD and rdd.coalesce()). One trade-off here is users can't subclass them.
- Noted that compression and serialization formats don't have to be wire compatible across versions.
- Compression codecs and serialization formats are semi-private as users typically don't instantiate them directly.
- Metrics sources are made private - user only interacts with them through Spark's reflection
Author: Patrick Wendell <pwendell@gmail.com>
Author: Andrew Or <andrewor14@gmail.com>
Closes #274 from pwendell/private-apis and squashes the following commits:
44179e4 [Patrick Wendell] Merge remote-tracking branch 'apache-github/master' into private-apis
042c803 [Patrick Wendell] spark.annotations -> spark.annotation
bfe7b52 [Patrick Wendell] Adding experimental for approximate counts
8d0c873 [Patrick Wendell] Warning in SparkEnv
99b223a [Patrick Wendell] Cleaning up annotations
e849f64 [Patrick Wendell] Merge pull request #2 from andrewor14/annotations
982a473 [Andrew Or] Generalize jQuery matching for non Spark-core API docs
a01c076 [Patrick Wendell] Merge pull request #1 from andrewor14/annotations
c1bcb41 [Andrew Or] DeveloperAPI -> DeveloperApi
0d48908 [Andrew Or] Comments and new lines (minor)
f3954e0 [Andrew Or] Add identifier tags in comments to work around scaladocs bug
99192ef [Andrew Or] Dynamically add badges based on annotations
824011b [Andrew Or] Add support for injecting arbitrary JavaScript to API docs
037755c [Patrick Wendell] Some changes after working with andrew or
f7d124f [Patrick Wendell] Small fixes
c318b24 [Patrick Wendell] Use CSS styles
e4c76b9 [Patrick Wendell] Logging
f390b13 [Patrick Wendell] Better visibility for workaround constructors
d6b0afd [Patrick Wendell] Small chang to existing constructor
403ba52 [Patrick Wendell] Style fix
870a7ba [Patrick Wendell] Work around for SI-8479
7fb13b2 [Patrick Wendell] Changes to UnionRDD and EmptyRDD
4a9e90c [Patrick Wendell] EXPERIMENTAL API --> EXPERIMENTAL
c581dce [Patrick Wendell] Changes after building against Shark.
8452309 [Patrick Wendell] Style fixes
1ed27d2 [Patrick Wendell] Formatting and coloring of badges
cd7a465 [Patrick Wendell] Code review feedback
2f706f1 [Patrick Wendell] Don't use floats
542a736 [Patrick Wendell] Small fixes
cf23ec6 [Patrick Wendell] Marking GraphX as alpha
d86818e [Patrick Wendell] Another naming change
5a76ed6 [Patrick Wendell] More visiblity clean-up
42c1f09 [Patrick Wendell] Using better labels
9d48cbf [Patrick Wendell] Initial pass
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
JIRA issue: [SPARK-1402](https://issues.apache.org/jira/browse/SPARK-1402)
This PR provides 3 more compression schemes for Spark SQL in-memory columnar storage:
* `BooleanBitSet`
* `IntDelta`
* `LongDelta`
Now there are 6 compression schemes in total, including the no-op `PassThrough` scheme.
Also fixed a bug in PR #286: not all compression schemes are added as available schemes when accessing an in-memory column, and when a column is compressed with an unrecognised scheme, `ColumnAccessor` throws exception.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #330 from liancheng/moreCompressionSchemes and squashes the following commits:
1d037b8 [Cheng Lian] Fixed SPARK-1436: in-memory column byte buffer must be able to be accessed multiple times
d7c0e8f [Cheng Lian] Added test suite for IntegralDelta (IntDelta & LongDelta)
3c1ad7a [Cheng Lian] Added test suite for BooleanBitSet, refactored other test suites
44fe4b2 [Cheng Lian] Refactored CompressionScheme, added 3 more compression schemes.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
time in seconds (instead of millis).
@marmbrus @chenghao-intel
Author: Reynold Xin <rxin@apache.org>
Closes #352 from rxin/timestamp-cast and squashes the following commits:
18aacd3 [Reynold Xin] Fixed precision for double.
2adb235 [Reynold Xin] Change timestamp cast semantics. When cast to numeric types, return the unix time in seconds (instead of millis).
|
|
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@apache.org>
Closes #349 from rxin/rand and squashes the following commits:
fd11322 [Reynold Xin] Added eval for Rand (without any support for user-defined seed).
|
|
|
|
|
|
|
|
|
|
|
|
| |
of override's in classes I touched.
It is more robust to not provide a default implementation for Expression's.
Author: Reynold Xin <rxin@apache.org>
Closes #350 from rxin/eval-default and squashes the following commits:
0a83b8f [Reynold Xin] Removed the default eval implementation from Expression, and added a bunch of override's in classes I touched.
|
|
|
|
|
|
|
|
|
|
| |
of private[this] to variables.
Author: Reynold Xin <rxin@apache.org>
Closes #348 from rxin/aggregate and squashes the following commits:
f4bc36f [Reynold Xin] Rename execution/aggregates.scala Aggregate.scala, and added a bunch of private[this] to variables.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Also used this opportunity to add a bunch of override's and made some members private.
Author: Reynold Xin <rxin@apache.org>
Closes #340 from rxin/eval and squashes the following commits:
a7c7ca7 [Reynold Xin] Fixed conflicts in merge.
9069de6 [Reynold Xin] Merge branch 'master' into eval
3ccc313 [Reynold Xin] Merge branch 'master' into eval
1a47e10 [Reynold Xin] Renamed apply to eval for generators and added a bunch of override's.
ea061de [Reynold Xin] Rename Expression.apply to eval for better readability.
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #343 from marmbrus/toStringFix and squashes the following commits:
37198fe [Michael Armbrust] Fix toString for SchemaRDD NativeCommands.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Given:
```scala
case class Data(a: Int, b: Int)
val rdd =
sparkContext
.parallelize(1 to 200)
.flatMap(_ => (1 to 50000).map(i => Data(i % 100, i)))
rdd.registerAsTable("data")
cacheTable("data")
```
Before:
```
SELECT COUNT(*) FROM data:[10000000]
16795.567ms
SELECT a, SUM(b) FROM data GROUP BY a
7536.436ms
SELECT SUM(b) FROM data
10954.1ms
```
After:
```
SELECT COUNT(*) FROM data:[10000000]
1372.175ms
SELECT a, SUM(b) FROM data GROUP BY a
2070.446ms
SELECT SUM(b) FROM data
958.969ms
```
Author: Michael Armbrust <michael@databricks.com>
Closes #295 from marmbrus/hashAgg and squashes the following commits:
ec63575 [Michael Armbrust] Add comment.
d0495a9 [Michael Armbrust] Use scaladoc instead.
b4a6887 [Michael Armbrust] Address review comments.
a2d90ba [Michael Armbrust] Capture child output statically to avoid issues with generators and serialization.
7c13112 [Michael Armbrust] Rewrite Aggregate operator to stream input and use projections. Remove unused local RDD functions implicits.
5096f99 [Michael Armbrust] Make HiveUDAF fields transient since object inspectors are not serializable.
6a4b671 [Michael Armbrust] Add option to avoid binding operators expressions automatically.
92cca08 [Michael Armbrust] Always include serialization debug info when running tests.
1279df2 [Michael Armbrust] Increase default number of partitions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, we based our decision regarding including datanucleus jars based on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" is run. This means that a typical and previously supported pathway would start using hive jars.
This patch has the following features/bug fixes:
- Use of SPARK_HIVE (default false) to determine if we should include Hive in the assembly jar.
- Analagous feature in Maven with -Phive (previously, there was no support for adding Hive to any of our jars produced by Maven)
- assemble-deps fixed since we no longer use a different ASSEMBLY_DIR
- avoid adding log message in compute-classpath.sh to the classpath :)
Still TODO before mergeable:
- We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set similar to how sbt downloads itself.
- Spark SQL documentation updates.
Author: Aaron Davidson <aaron@databricks.com>
Closes #237 from aarondav/master and squashes the following commits:
5dc4329 [Aaron Davidson] Typo fixes
dd4f298 [Aaron Davidson] Doc update
dd1a365 [Aaron Davidson] Eliminate need for SPARK_HIVE at runtime by d/ling datanucleus from Maven
a9269b5 [Aaron Davidson] [WIP] Use SPARK_HIVE to determine if we include Hive in packaging
|
|
|
|
|
|
|
|
| |
Author: witgo <witgo@qq.com>
Closes #333 from witgo/SPARK-1420 and squashes the following commits:
902519e [witgo] add dependency scala-reflect to catalyst
|
|
|
|
|
|
|
|
|
|
| |
Use Java 6's recommended equivalent of Java 7's Logger.getGlobal() to retain Java 6 compatibility. See PR #195
Author: Sean Owen <sowen@cloudera.com>
Closes #334 from srowen/FixPR195ForJava6 and squashes the following commits:
f92fbd3 [Sean Owen] Use Java 6's recommended equivalent of Java 7's Logger.getGlobal() to retain Java 6 compatibility
|
|
|
|
|
|
|
|
|
|
|
| |
Now users who want to use HiveQL should explicitly say `hiveql` or `hql`.
Author: Michael Armbrust <michael@databricks.com>
Closes #319 from marmbrus/standardizeSqlHql and squashes the following commits:
de68d0e [Michael Armbrust] Fix sampling test.
fbe4a54 [Michael Armbrust] Make `sql` always use spark sql parser, users of hive context can now use hql or hiveql to run queries using HiveQL instead.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #315 from marmbrus/minorFixes and squashes the following commits:
b23a15d [Michael Armbrust] fix scaladoc
11062ac [Michael Armbrust] Fix registering "SELECT *" queries as tables and caching them. As some tests for this and self-joins.
3997dc9 [Michael Armbrust] Move Row extractor to catalyst.
208bf5e [Michael Armbrust] More idiomatic naming of DSL functions. * subquery => as * for join condition => on, i.e., `r.join(s, condition = 'a == 'b)` =>`r.join(s, on = 'a == 'b)`
87211ce [Michael Armbrust] Correctly handle self joins of in-memory cached tables.
69e195e [Michael Armbrust] Change != to !== in the DSL since != will always translate to != on Any.
01f2dd5 [Michael Armbrust] Correctly assign aliases to tables in SqlParser.
|
|
|
|
|
|
|
|
|
|
| |
The JavaAPI and the Parquet improvements PRs didn't conflict, but broke the build.
Author: Michael Armbrust <michael@databricks.com>
Closes #316 from marmbrus/hotFixJavaApi and squashes the following commits:
0b84c2d [Michael Armbrust] Fix compilation of Spark SQL Java API.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
WIP: Some work remains...
* [x] Hive support
* [x] Tests
* [x] Update docs
Feedback welcome!
Author: Michael Armbrust <michael@databricks.com>
Closes #248 from marmbrus/javaSchemaRDD and squashes the following commits:
b393913 [Michael Armbrust] @srowen 's java style suggestions.
f531eb1 [Michael Armbrust] Address matei's comments.
33a1b1a [Michael Armbrust] Ignore JavaHiveSuite.
822f626 [Michael Armbrust] improve docs.
ab91750 [Michael Armbrust] Improve Java SQL API: * Change JavaRow => Row * Add support for querying RDDs of JavaBeans * Docs * Tests * Hive support
0b859c8 [Michael Armbrust] First draft of java API.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR includes:
1) Add new data type Timestamp
2) Add more data type casting base on Hive's Rule
3) Fix bug missing data type in both parsers (HiveQl & SQLParser).
Author: Cheng Hao <hao.cheng@intel.com>
Closes #275 from chenghao-intel/timestamp and squashes the following commits:
df709e5 [Cheng Hao] Move orc_ends_with_nulls to blacklist
24b04b0 [Cheng Hao] Put 3 cases into the black lists(describe_pretty,describe_syntax,lateral_view_outer)
fc512c2 [Cheng Hao] remove the unnecessary data type equality check in data casting
d0d1919 [Cheng Hao] Add more data type for scala reflection
3259808 [Cheng Hao] Add the new Golden files
3823b97 [Cheng Hao] Update the UnitTest cases & add timestamp type for HiveQL
54a0489 [Cheng Hao] fix bug mapping to 0 (which is supposed to be null) when NumberFormatException occurs
9cb505c [Cheng Hao] Fix issues according to PR comments
e529168 [Cheng Hao] Fix bug of converting from String
6fc8100 [Cheng Hao] Update Unit Test & CodeStyle
8a1d4d6 [Cheng Hao] Add DataType for SqlParser
ce4385e [Cheng Hao] Add TimestampType Support
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A few improvements to the Parquet support for SQL queries:
- Instead of files a ParquetRelation is now backed by a directory, which simplifies importing data from other
sources
- InsertIntoParquetTable operation now supports switching between overwriting or appending (at least in
HiveQL)
- tests now use the new API
- Parquet logging can be set to WARNING level (Default)
- Default compression for Parquet files (GZIP, as in parquet-mr)
Author: Andre Schumacher <andre.schumacher@iki.fi>
Closes #195 from AndreSchumacher/spark_parquet_improvements and squashes the following commits:
54df314 [Andre Schumacher] SPARK-1383 [SQL] Improvements to ParquetRelation
|
|
|
|
|
|
|
|
|
|
| |
schema inference.
Author: Michael Armbrust <michael@databricks.com>
Closes #293 from marmbrus/reflectTypes and squashes the following commits:
f54e8e8 [Michael Armbrust] Improve datatype and test coverage for ScalaReflection schema inference.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases.
2. Renamed TopK to TakeOrdered to be more consistent with Spark RDD API.
3. Avoid breaking lineage in Limit.
4. Added a bunch of override's to execution/basicOperators.scala.
@marmbrus @liancheng
Author: Reynold Xin <rxin@apache.org>
Author: Michael Armbrust <michael@databricks.com>
Closes #233 from rxin/limit and squashes the following commits:
13eb12a [Reynold Xin] Merge pull request #1 from marmbrus/limit
92b9727 [Michael Armbrust] More hacks to make Maps serialize with Kryo.
4fc8b4e [Reynold Xin] Merge branch 'master' of github.com:apache/spark into limit
87b7d37 [Reynold Xin] Use the proper serializer in limit.
9b79246 [Reynold Xin] Updated doc for Limit.
47d3327 [Reynold Xin] Copy tuples in Limit before shuffle.
231af3a [Reynold Xin] Limit/TakeOrdered: 1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases. 2. Renamed TopK to TakeOrdered to be more consistent with Spark RDD API. 3. Avoid breaking lineage in Limit. 4. Added a bunch of override's to execution/basicOperators.scala.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)
(Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)
This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:
* `CompressionScheme`
Each `CompressionScheme` represents a concrete compression algorithm, which basically consists of an `Encoder` for compression and a `Decoder` for decompression. Algorithms implemented include:
* `RunLengthEncoding`
* `DictionaryEncoding`
Algorithms to be implemented include:
* `BooleanBitSet`
* `IntDelta`
* `LongDelta`
* `CompressibleColumnBuilder`
A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns. A best `CompressionScheme` that exhibits lowest compression ratio is chosen for each column according to statistical information gathered while elements are appended into the `ColumnBuilder`. However, if no `CompressionScheme` can achieve a compression ratio better than 80%, no compression will be done for this column to save CPU time.
Memory layout of the final byte buffer is showed below:
```
.--------------------------- Column type ID (4 bytes)
| .----------------------- Null count N (4 bytes)
| | .------------------- Null positions (4 x N bytes, empty if null count is zero)
| | | .------------- Compression scheme ID (4 bytes)
| | | | .--------- Compressed non-null elements
V V V V V
+---+---+-----+---+---------+
| | | ... | | ... ... |
+---+---+-----+---+---------+
\-----------/ \-----------/
header body
```
* `CompressibleColumnAccessor`
A stackable `ColumnAccessor` trait used to iterate (possibly) compressed data column.
* `ColumnStats`
Used to collect statistical information while loading data into in-memory columnar table. Optimizations like partition pruning rely on this information.
Strictly speaking, `ColumnStats` related code is not part of the compression support. It's contained in this PR to ensure and validate the row-based API design (which is used to avoid boxing/unboxing cost whenever possible).
A major refactoring change since PR #205 is:
* Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #285 from liancheng/memColumnarCompression and squashes the following commits:
ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
c298b76 [Cheng Lian] Test suites refactored
2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
211331c [Cheng Lian] WIP: in-memory columnar compression support
85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code
|
|
|
|
|
|
|
|
|
|
| |
@rxin mentioned this might cause issues on windows machines.
Author: Michael Armbrust <michael@databricks.com>
Closes #297 from marmbrus/noStars and squashes the following commits:
263122a [Michael Armbrust] Remove * from test case golden filename.
|
|
|
|
|
|
|
|
|
|
| |
Just a Scala version increment
Author: Mark Hamstra <markhamstra@gmail.com>
Closes #259 from markhamstra/scala-2.10.4 and squashes the following commits:
fbec547 [Mark Hamstra] [SPARK-1342] Bumped Scala version to 2.10.4
|
|
|
|
|
|
|
|
|
|
| |
This doesn't yet support different databases in Hive (though you can probably workaround this by calling `USE <dbname>`). However, given the time constraints for 1.0 I think its probably worth including this now and extending the functionality in the next release.
Author: Michael Armbrust <michael@databricks.com>
Closes #282 from marmbrus/cacheTables and squashes the following commits:
83785db [Michael Armbrust] Support for caching and uncaching tables in a SQLContext.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Before we were materializing everything in memory. This also uses the projection interface so will be easier to plug in code gen (its ported from that branch).
@rxin @liancheng
Author: Michael Armbrust <michael@databricks.com>
Closes #250 from marmbrus/hashJoin and squashes the following commits:
1ad873e [Michael Armbrust] Change hasNext logic back to the correct version.
8e6f2a2 [Michael Armbrust] Review comments.
1e9fb63 [Michael Armbrust] style
bc0cb84 [Michael Armbrust] Rewrite join implementation to allow streaming of one relation.
|
|
|
|
|
|
|
|
| |
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #261 from ScrapCodes/comment-style-check2 and squashes the following commits:
6cde61e [Prashant Sharma] comment style space before ending */ check.
|
|
|
|
|
|
|
|
|
|
| |
Fix attribute unresolved when query with table name as a qualifier in SQLContext with SimplCatelog, details please see [SPARK-1354](https://issues.apache.org/jira/browse/SPARK-1354?jql=project%20%3D%20SPARK).
Author: jerryshao <saisai.shao@intel.com>
Closes #272 from jerryshao/qualifier-fix and squashes the following commits:
7950170 [jerryshao] Add tableName as a qualifier for SimpleCatelogy
|
|
|
|
|
|
|
|
|
|
|
|
| |
@AndreSchumacher, please take a look.
https://spark-project.atlassian.net/browse/SPARK-1354
Author: Michael Armbrust <michael@databricks.com>
Closes #269 from marmbrus/parquetJoin and squashes the following commits:
4081e77 [Michael Armbrust] Create new instances of Parquet relation when multiple copies are in a single plan.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR includes:
1) Unify the unit test for expression evaluation
2) Add implementation of RLike & Like
Author: Cheng Hao <hao.cheng@intel.com>
Closes #224 from chenghao-intel/string_expression and squashes the following commits:
84f72e9 [Cheng Hao] fix bug in RLike/Like & Simplify the unit test
aeeb1d7 [Cheng Hao] Simplify the implementation/unit test of RLike/Like
319edb7 [Cheng Hao] change to spark code style
91cfd33 [Cheng Hao] add implementation for rlike/like
2c8929e [Cheng Hao] Update the unit test for expression evaluation
|
|
|
|
|
|
|
|
|
|
| |
...sql pom files
Author: Thomas Graves <tgraves@apache.org>
Closes #263 from tgravescs/SPARK-1345 and squashes the following commits:
b43a2a0 [Thomas Graves] SPARK-1345 adding missing dependency on avro for hadoop 0.23 to the new sql pom files
|