| Commit message | Author | Age | Files | Lines |
This reverts commit 08f601328ad9e7334ef7deb3a9fff1343a3c4f30.
This reverts commit 54df1b8c31fa2de5b04ee4a5563706b2664f34f3.
This reverts commit 919c87f26a2655bfd5ae03958915b6804367c1d6.
This reverts commit edbd02fc6873676e080101d407916efb64bdf71a.
Author: Reynold Xin <rxin@apache.org>
Closes #1583 from rxin/closureClean and squashes the following commits:
8982fe6 [Reynold Xin] [SPARK-2529] Clean closures in foreach and foreachPartition.
(cherry picked from commit eb82abd8e3d25c912fa75201cf4f429aab8d73c7)
Signed-off-by: Reynold Xin <rxin@apache.org>
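To see why cleaning matters: a function value that reads a field of its enclosing object closes over `this`, so serializing the function drags the whole (possibly non-serializable) object along. The plain-Scala sketch below, with illustrative names and no Spark dependency, shows the difference a local copy makes:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

class Driver {                                                // NOT Serializable
  val factor = 3
  def capturingFn: Int => Int = x => x * factor               // closes over `this`
  def cleanedFn: Int => Int = { val f = factor; x => x * f }  // closes over an Int only
}

def isSerializable(obj: AnyRef): Boolean =
  try { new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj); true }
  catch { case _: NotSerializableException => false }

val d = new Driver
assert(!isSerializable(d.capturingFn)) // fails: tries to write the whole Driver
assert(isSerializable(d.cleanedFn))    // succeeds: only the captured Int is written
```

Spark's ClosureCleaner automates the equivalent of the `cleanedFn` rewrite; this patch applies that cleaning to `foreach` and `foreachPartition` as well.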
This reverts commit 70ee14f76d6c3d3f162db6bbe12797c252a0295a.
This reverts commit baf92a0f2119867b1be540085ebe9f1a1c411ae8.
Stopping the Twitter Receiver would call twitter4j's TwitterStream.shutdown, which in turn causes an exception to be thrown to the listener. This exception caused the Receiver to be restarted. This patch checks whether the receiver was stopped, and only restarts on an exception if it was not.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #1577 from tdas/twitter-stop and squashes the following commits:
011b525 [Tathagata Das] Fixed Twitter stream stopping bug.
(cherry picked from commit a45d5480f65d2e969fc7fbd8f358b1717fb99bef)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
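The shape of the fix can be sketched in plain Scala (illustrative names, not the actual receiver code): remember whether a stop was requested, and only restart on exceptions that arrive while the receiver is still meant to be running:

```scala
import java.util.concurrent.atomic.AtomicBoolean

class SketchReceiver {
  private val stopped = new AtomicBoolean(false)
  var restarts = 0
  def stop(): Unit = stopped.set(true)
  def onError(e: Throwable): Unit =
    if (!stopped.get) restarts += 1   // restart only if not deliberately stopped
}

val r = new SketchReceiver
r.onError(new RuntimeException("stream error"))      // still running: restart
r.stop()                                             // shutdown throws to the listener...
r.onError(new RuntimeException("stream shut down"))  // ...but no restart after stop()
assert(r.restarts == 1)
```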
collections to Scala collections JsonRDD.scala
In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a Scala one. These two operations are expensive because they read every element from the Java Map/List and then copy them into a new Scala Map/List. We can instead use Scala wrappers to wrap those Java collections rather than calling toMap/toList.
I did a quick test to see the performance. I had a 2.9GB cached RDD[String] storing one JSON object per record (twitter dataset). My simple test program is attached below.
```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val jsonData = sc.textFile("...")
jsonData.cache.count
val jsonSchemaRDD = sqlContext.jsonRDD(jsonData)
jsonSchemaRDD.registerAsTable("jt")
sqlContext.sql("select count(*) from jt").collect
```
Stages for the schema inference and the table scan both had 48 tasks. These tasks were executed sequentially. For the current implementation, scanning the JSON dataset will materialize values of all fields of a record. The inferred schema of the dataset can be accessed at https://gist.github.com/yhuai/05fe8a57c638c6666f8d.
From the result, there was no significant difference in running `jsonRDD`. For the simple aggregation query, results are attached below.
```
Original:
Run 1: 26.1s
Run 2: 27.03s
Run 3: 27.035s
With this change:
Run 1: 21.086s
Run 2: 21.035s
Run 3: 21.029s
```
JIRA: https://issues.apache.org/jira/browse/SPARK-2603
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1504 from yhuai/removeToMapToList and squashes the following commits:
6831b77 [Yin Huai] Fix failed tests.
09b9bca [Yin Huai] Merge remote-tracking branch 'upstream/master' into removeToMapToList
d1abdb8 [Yin Huai] Remove unnecessary toMap and toList.
(cherry picked from commit b352ef175c234a2ea86b72c2f40da2ac69658b2e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
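The wrapper-versus-copy distinction is visible with the standard converters alone (a standalone sketch, not the `JsonRDD.scalafy` code itself):

```scala
import scala.jdk.CollectionConverters._

val jmap = new java.util.HashMap[String, Int]()
jmap.put("a", 1)

val wrapped = jmap.asScala        // O(1) view over the Java map, no element copying
val copied  = jmap.asScala.toMap  // eager element-by-element copy

jmap.put("b", 2)
assert(wrapped.size == 2) // the view reflects later changes to the Java map
assert(copied.size == 1)  // the copy paid to read every element up front
```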
Author: Michael Armbrust <michael@databricks.com>
Closes #1556 from marmbrus/fixBooleanEqualsOne and squashes the following commits:
ad8edd4 [Michael Armbrust] Add rule for true = 1 and false = 0.
(cherry picked from commit 78d18fdbaa62d8ed235c29b2e37fd6607263c639)
Signed-off-by: Reynold Xin <rxin@apache.org>
Currently, using "==" in a HiveQL expression causes an exception to be thrown; this patch fixes it.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1522 from chenghao-intel/equal and squashes the following commits:
f62a0ff [Cheng Hao] Add == Support for HiveQl
(cherry picked from commit 79fe7634f6817eb2443bc152c6790a4439721fda)
Signed-off-by: Michael Armbrust <michael@databricks.com>
We need to use the analyzed attributes otherwise we end up with a tree that will never resolve.
Author: Michael Armbrust <michael@databricks.com>
Closes #1470 from marmbrus/fixApplySchema and squashes the following commits:
f968195 [Michael Armbrust] Use analyzed attributes when applying the schema.
4969015 [Michael Armbrust] Add test case.
(cherry picked from commit 511a7314037219c23e824ea5363bf7f1df55bab3)
Signed-off-by: Michael Armbrust <michael@databricks.com>
In CPython, the hash of None differs across machines, which can cause wrong results during a shuffle. This PR fixes that.
Author: Davies Liu <davies.liu@gmail.com>
Closes #1371 from davies/hash_of_none and squashes the following commits:
d01745f [Davies Liu] add comments, remove outdated unit tests
5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
b7118aa [Davies Liu] use __builtin__ instead of __builtins__
839e417 [Davies Liu] hijack hash to make hash of None consistant cross machines
(cherry picked from commit 872538c600a452ead52638c1ccba90643a9fa41c)
Signed-off-by: Matei Zaharia <matei@databricks.com>
for defined classes."
This reverts commit 6e0b7e5308263bef60120debe05577868ebaeea9.
We should fix this in branch-1.0 as well.
Author: Reynold Xin <rxin@apache.org>
Closes #1500 from rxin/rangePartitioner and squashes the following commits:
c0a94f5 [Reynold Xin] [SPARK-2598] RangePartitioner's binary search does not use the given Ordering.
(cherry picked from commit fa51b0fb5bee95a402c7b7f13dcf0b46cf5bb429)
Signed-off-by: Reynold Xin <rxin@apache.org>
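The bug class is easy to state: a generic binary search must compare via the supplied `Ordering`, not the elements' natural order. A standalone sketch (an illustration, not Spark's implementation):

```scala
import scala.annotation.tailrec

// Binary search that consults the given Ordering for every comparison.
def binarySearch[T](arr: Array[T], key: T, ord: Ordering[T]): Int = {
  @tailrec def loop(lo: Int, hi: Int): Int =
    if (lo > hi) -(lo + 1)              // not found: encoded insertion point
    else {
      val mid = (lo + hi) >>> 1
      val c = ord.compare(arr(mid), key)
      if (c == 0) mid
      else if (c < 0) loop(mid + 1, hi)
      else loop(lo, mid - 1)
    }
  loop(0, arr.length - 1)
}

assert(binarySearch(Array(1, 3, 5, 7), 5, Ordering[Int]) == 2)
// With a descending Ordering, a search that ignored it would go the wrong way:
assert(binarySearch(Array(7, 5, 3, 1), 5, Ordering[Int].reverse) == 1)
```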
https://issues.apache.org/jira/browse/SPARK-2524
The spark.deploy.retainedDrivers configuration is undocumented but actually used:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L60
Author: lianhuiwang <lianhuiwang09@gmail.com>
Author: Wang Lianhui <lianhuiwang09@gmail.com>
Author: unknown <Administrator@taguswang-PC1.tencent.com>
Closes #1443 from lianhuiwang/SPARK-2524 and squashes the following commits:
64660fd [Wang Lianhui] address pwendell's comments
5f6bbb7 [Wang Lianhui] missing document about spark.deploy.retainedDrivers
44a3f50 [unknown] Merge remote-tracking branch 'upstream/master'
eacf933 [lianhuiwang] Merge remote-tracking branch 'upstream/master'
8bbfe76 [lianhuiwang] Merge remote-tracking branch 'upstream/master'
480ce94 [lianhuiwang] address aarondav comments
f2b5970 [lianhuiwang] bugfix worker DriverStateChanged state should match DriverState.FAILED
(cherry picked from commit 4da01e3813f0a0413fe691358c14278bbd5508ed)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
Typo fix to the programming guide in the docs. Changed the word "distibuted" to "distributed".
Author: Cesar Arevalo <cesar@zephyrhealthinc.com>
Closes #1495 from cesararevalo/master and squashes the following commits:
0c2e3a7 [Cesar Arevalo] Typo fix to the programming guide in the docs
(cherry picked from commit 0d01e85f42f3c997df7fee942b05b509968bac4b)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1436 from chenghao-intel/unwrapdata and squashes the following commits:
34cc21a [Cheng Hao] update the table scan accodringly since the unwrapData function changed
afc39da [Cheng Hao] Polish the code
39d6475 [Cheng Hao] Add HiveDecimal & HiveVarchar support in unwrap data
(cherry picked from commit 7f1720813793e155743b58eae5228298e894b90d)
Signed-off-by: Michael Armbrust <michael@databricks.com>
New t2 instance types require HVM AMIs; the bailout assumption of PVM causes failures when using t2 instance types.
Author: Basit Mustafa <basitmustafa@computes-things-for-basit.local>
Closes #1446 from 24601/master and squashes the following commits:
01fe128 [Basit Mustafa] Makin' it pretty
392a95e [Basit Mustafa] Added t2 instance types
Conflicts:
ec2/spark_ec2.py
An exception is thrown when running the HiveFromSpark example:
Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145)
at org.apache.spark.examples.sql.hive.HiveFromSpark$.main(HiveFromSpark.scala:45)
at org.apache.spark.examples.sql.hive.HiveFromSpark.main(HiveFromSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1475 from chenghao-intel/hive_from_spark and squashes the following commits:
d4c0500 [Cheng Hao] Fix the bug of ClassCastException
(cherry picked from commit 29809a6d58bfe3700350ce1988ff7083881c4382)
Signed-off-by: Reynold Xin <rxin@apache.org>
(branch-1.0 backport)
This backports #1450 into branch-1.0.
Author: Reynold Xin <rxin@apache.org>
Closes #1469 from rxin/closure-1.0 and squashes the following commits:
b474a92 [Reynold Xin] [SPARK-2534] Avoid pulling in the entire RDD in various operators
If the first pass of CoalescedRDD does not find the target number of locations AND the second pass finds new locations, an exception is thrown, as `groupHash.get(nxt_replica).get` is not valid.
The fix is just to add an ArrayBuffer to `groupHash` for that replica if it didn't already exist.
Author: Aaron Davidson <aaron@databricks.com>
Closes #1337 from aarondav/2412 and squashes the following commits:
f587b5d [Aaron Davidson] getOrElseUpdate
3ad8a3c [Aaron Davidson] [SPARK-2412] CoalescedRDD throws exception with certain pref locs
(cherry picked from commit 7c23c0dc3ed721c95690fc49f435d9de6952523c)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
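In isolation, the `getOrElseUpdate` idiom from the fix looks like this (illustrative names):

```scala
import scala.collection.mutable

val groupHash = mutable.Map.empty[String, mutable.ArrayBuffer[Int]]

// groupHash("replica-1") here would throw NoSuchElementException;
// getOrElseUpdate inserts an empty buffer on first access instead.
groupHash.getOrElseUpdate("replica-1", mutable.ArrayBuffer()) += 42
groupHash.getOrElseUpdate("replica-1", mutable.ArrayBuffer()) += 7

assert(groupHash("replica-1").toList == List(42, 7))
```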
Author: Aaron Davidson <aaron@databricks.com>
Closes #1405 from aarondav/2154 and squashes the following commits:
24e9ef9 [Aaron Davidson] [SPARK-2154] Schedule next Driver when one completes (standalone mode)
(cherry picked from commit 9c249743eaabe5fc8d961c7aa581cc0197f6e950)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
We recently added this lock on 'conf' in order to prevent concurrent creation. However, it turns out that this can introduce a deadlock because Hadoop also synchronizes on the Configuration objects when creating new Configurations (and they do so via a static REGISTRY which contains all created Configurations).
This fix forces all Spark initialization of Configuration objects to occur serially by using a static lock that we control, and thus also prevents introducing the deadlock.
Author: Aaron Davidson <aaron@databricks.com>
Closes #1409 from aarondav/1054 and squashes the following commits:
7d1b769 [Aaron Davidson] SPARK-1097: Do not introduce deadlock while fixing concurrency bug
(cherry picked from commit 8867cd0bc2961fefed84901b8b14e9676ae6ab18)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
This is a follow-up of #1428.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1432 from ueshin/issues/SPARK-2518 and squashes the following commits:
37d1ace [Takuya UESHIN] Fix foldability of Substring expression.
(cherry picked from commit cc965eea510397642830acb21f61127b68c098d6)
Signed-off-by: Reynold Xin <rxin@apache.org>
Spark SQL
JIRA: https://issues.apache.org/jira/browse/SPARK-2525.
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1444 from yhuai/SPARK-2517 and squashes the following commits:
edbac3f [Yin Huai] Removed some compiler type erasure warnings.
(cherry picked from commit df95d82da7c76c074fd4064f7c870d55d99e0d8e)
Signed-off-by: Reynold Xin <rxin@apache.org>
This is a follow-up of #1359 with nullability narrowing.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1426 from ueshin/issues/SPARK-2504 and squashes the following commits:
5157832 [Takuya UESHIN] Remove unnecessary white spaces.
80958ac [Takuya UESHIN] Fix nullability of Substring expression.
(cherry picked from commit 632fb3d9a9ebb3d2218385403145d5b89c41c025)
Signed-off-by: Reynold Xin <rxin@apache.org>
Cases of `Substring` with a `null` literal could be added to `NullPropagation`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1428 from ueshin/issues/SPARK-2509 and squashes the following commits:
d9eb85f [Takuya UESHIN] Add Substring cases to NullPropagation.
(cherry picked from commit 9b38b7c71352bb5e6d359515111ad9ca33299127)
Signed-off-by: Reynold Xin <rxin@apache.org>
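On a toy expression tree the rule's shape is: any `Substring` with a null-literal child folds straight to a null literal (a sketch of the idea, not Catalyst's actual `NullPropagation` rule):

```scala
sealed trait Expr
case class Lit(v: Any) extends Expr  // Lit(null) models a SQL NULL
case class Substr(str: Expr, pos: Expr, len: Expr) extends Expr

def propagateNulls(e: Expr): Expr = e match {
  // If any child is a null literal, the whole expression is null.
  case Substr(a, b, c) if Seq(a, b, c).exists { case Lit(null) => true; case _ => false } =>
    Lit(null)
  case other => other
}

assert(propagateNulls(Substr(Lit(null), Lit(1), Lit(2))) == Lit(null))
assert(propagateNulls(Substr(Lit("ab"), Lit(1), Lit(2))) == Substr(Lit("ab"), Lit(1), Lit(2)))
```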
SchemaRDD implementations.
Author: Aaron Staple <aaron.staple@gmail.com>
Closes #1421 from staple/SPARK-2314 and squashes the following commits:
73e04dc [Aaron Staple] [SPARK-2314] Override collect and take in JavaSchemaRDD, forwarding to SchemaRDD implementations.
(cherry picked from commit 90ca532a0fd95dc85cff8c5722d371e8368b2687)
Signed-off-by: Reynold Xin <rxin@apache.org>
data type objects.
JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2498
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes #1423 from concretevitamin/scala-ref-catalyst and squashes the following commits:
325a149 [Zongheng Yang] Synchronize on a lock when initializing data type objects in Catalyst.
(cherry picked from commit c2048a5165b270f5baf2003fdfef7bc6c5875715)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #1414 from marmbrus/exprIdResolution and squashes the following commits:
97b47bc [Michael Armbrust] Attribute equality comparisons should be done by exprId.
(cherry picked from commit 502f90782ad474e2630ed5be4d3c4be7dab09c34)
Signed-off-by: Michael Armbrust <michael@databricks.com>
This replaces the Hive UDF for SUBSTR(ING) with an implementation in Catalyst
and adds tests to verify correct operation.
Author: William Benton <willb@redhat.com>
Closes #1359 from willb/internalSqlSubstring and squashes the following commits:
ccedc47 [William Benton] Fixed too-long line.
a30a037 [William Benton] replace view bounds with implicit parameters
ec35c80 [William Benton] Adds fixes from review:
4f3bfdb [William Benton] Added internal implementation of SQL SUBSTR()
(cherry picked from commit 61de65bc69f9a5fc396b76713193c6415436d452)
Signed-off-by: Michael Armbrust <michael@databricks.com>
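SQL SUBSTR semantics (1-based start; a negative start counts from the end of the string) can be sketched as follows. This illustrates the behavior being implemented, not the Catalyst code; the treatment of a 0 start varies by dialect:

```scala
// SQL-style SUBSTR: 1-based positive start, negative start counted from the end.
def sqlSubstr(s: String, pos: Int, len: Int = Int.MaxValue): String = {
  val start =
    if (pos > 0) pos - 1                       // 1-based -> 0-based
    else if (pos < 0) math.max(s.length + pos, 0)
    else 0                                     // treat 0 like 1 here (dialect-dependent)
  val end =
    if (len == Int.MaxValue) s.length
    else math.min(start.toLong + len, s.length).toInt
  if (start >= s.length || len <= 0) "" else s.substring(start, end)
}

assert(sqlSubstr("Spark SQL", 7) == "SQL")     // positive, 1-based start
assert(sqlSubstr("Spark SQL", -3) == "SQL")    // negative start counts from the end
assert(sqlSubstr("Spark SQL", 1, 5) == "Spark")
```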
Author: Michael Armbrust <michael@databricks.com>
Closes #1396 from marmbrus/moreTests and squashes the following commits:
6660b60 [Michael Armbrust] Blacklist a test that requires DFS command.
8b6001c [Michael Armbrust] Add golden files.
ccd8f97 [Michael Armbrust] Whitelist more tests.
(cherry picked from commit bcd0c30c7eea4c50301cb732c733fdf4d4142060)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #1411 from marmbrus/nestedRepeated and squashes the following commits:
044fa09 [Michael Armbrust] Fix parsing of repeated, nested data access.
(cherry picked from commit 0f98ef1a2c9ecf328f6c5918808fa5ca486e8afd)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #1412 from marmbrus/lockHiveClient and squashes the following commits:
4bc9d5a [Michael Armbrust] protected[hive]
22e9177 [Michael Armbrust] Add comments.
7aa8554 [Michael Armbrust] Don't lock on hive's object.
a6edc5f [Michael Armbrust] Lock usage of hive client.
(cherry picked from commit c7c7ac83392b10abb011e6aead1bf92e7c73695e)
Signed-off-by: Aaron Davidson <aaron@databricks.com>
groupBy()/groupByKey() is notorious for being a very convenient API that can lead to poor performance when used incorrectly.
This PR just makes it clear that users should be cautious not to rely on this API when they really want a different (more performant) one, such as reduceByKey().
(Note that one source of confusion is the name; this groupBy() is not the same as a SQL GROUP-BY, which is used for aggregation and is more similar in nature to Spark's reduceByKey().)
Author: Aaron Davidson <aaron@databricks.com>
Closes #1380 from aarondav/warning and squashes the following commits:
f60da39 [Aaron Davidson] Give better advice
d0afb68 [Aaron Davidson] Add/increase severity of warning in documentation of groupBy()
(cherry picked from commit a2aa7bebae31e1e7ec23d31aaa436283743b283b)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
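The concern can be illustrated with plain Scala collections: grouping materializes every value per key before aggregating, while a reduceByKey-style fold keeps only one running total per key (a standalone sketch, not Spark's distributed implementation):

```scala
val pairs = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4))

// groupBy-then-sum: first builds the full list of values for each key.
val viaGroup = pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// reduceByKey-style: fold each pair into a running total, never holding all values.
val viaReduce = pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
  acc + (k -> (acc.getOrElse(k, 0) + v))
}

assert(viaGroup == viaReduce)
assert(viaReduce == Map("a" -> 4, "b" -> 6))
```

On a cluster the difference is starker: grouping also shuffles every value, whereas per-key reduction combines values before the shuffle.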
This fix obtains a comparable performance boost as [PR #1390](https://github.com/apache/spark/pull/1390) by moving an array update and deserializer initialization out of a potentially very long loop. Suggested by yhuai. The below results are updated for this fix.
## Benchmarks
Generated a local text file with 10M rows of simple key-value pairs. The data is loaded as a table through Hive. Results are obtained on my local machine using hive/console.
Without the fix:
Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s)
With this fix:
Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.57s (1.46s) | 11.0s (1.69s)
Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s)
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes #1408 from concretevitamin/slow-read-2 and squashes the following commits:
d86e437 [Zongheng Yang] Move update & initialization out of potentially long loop.
(cherry picked from commit d60b09bb60cff106fa0acddebf35714503b20f03)
Signed-off-by: Michael Armbrust <michael@databricks.com>
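The pattern itself is general: hoist loop-invariant initialization out of the hot loop. A standalone sketch, with regex construction standing in for the deserializer setup (names are illustrative):

```scala
// Re-creates (and re-compiles) the regex once per row:
def slow(rows: Seq[String]): Seq[Int] =
  rows.map { r => val parser = "\\d+".r; parser.findFirstIn(r).fold(0)(_.toInt) }

// Initialized once, outside the loop:
val parser = "\\d+".r
def fast(rows: Seq[String]): Seq[Int] =
  rows.map(r => parser.findFirstIn(r).fold(0)(_.toInt))

val data = Seq("a1", "b22", "c")
assert(slow(data) == fast(data))     // same results...
assert(fast(data) == Seq(1, 22, 0))  // ...without per-row setup cost
```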
InMemoryRelation
Reuse byte buffers when creating unique attributes for multiple instances of an InMemoryRelation in a single query plan.
Author: Michael Armbrust <michael@databricks.com>
Closes #1332 from marmbrus/doubleCache and squashes the following commits:
4a19609 [Michael Armbrust] Clean up concurrency story by calculating buffersn the constructor.
b39c931 [Michael Armbrust] Allocations are kind of a side effect.
f67eff7 [Michael Armbrust] Reusue same byte buffers when creating new instance of InMemoryRelation
(cherry picked from commit 1a7d7cc85fb24de21f1cde67d04467171b82e845)
Signed-off-by: Reynold Xin <rxin@apache.org>
Author: Michael Armbrust <michael@databricks.com>
Closes #1366 from marmbrus/partialDistinct and squashes the following commits:
12a31ab [Michael Armbrust] Add more efficient distinct operator.
(cherry picked from commit 7e26b57615f6c1d3f9058f9c19c05ec91f017f4c)
Signed-off-by: Reynold Xin <rxin@apache.org>
VertexPartition and ShippableVertexPartition are contained in RDDs but are not marked Serializable, leading to NotSerializableExceptions when using Java serialization.
The fix is simply to mark them as Serializable. This PR does that and adds a test for serializing them using Java and Kryo serialization.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #1376 from ankurdave/SPARK-2455 and squashes the following commits:
ed4a51b [Ankur Dave] Make (Shippable)VertexPartition serializable
1fd42c5 [Ankur Dave] Add failing tests for Java serialization
(cherry picked from commit 7a0135293192aaefc6ae20b57e15a90945bd8a4e)
Signed-off-by: Reynold Xin <rxin@apache.org>
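The difference the one-line fix makes can be shown with stand-in classes (not GraphX's actual types):

```scala
import java.io._

class PlainPartition(val ids: Array[Long])                        // not Serializable
class MarkedPartition(val ids: Array[Long]) extends Serializable  // the fix

// Attempt Java serialization; None signals a NotSerializableException.
def javaSerialize(obj: AnyRef): Option[Array[Byte]] =
  try {
    val bytes = new ByteArrayOutputStream
    new ObjectOutputStream(bytes).writeObject(obj)
    Some(bytes.toByteArray)
  } catch { case _: NotSerializableException => None }

assert(javaSerialize(new PlainPartition(Array(1L))).isEmpty)    // throws without the marker
assert(javaSerialize(new MarkedPartition(Array(1L))).isDefined) // works with it
```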