| Commit message | Author | Age | Files | Lines |
This reverts commit 08f601328ad9e7334ef7deb3a9fff1343a3c4f30.
This reverts commit 54df1b8c31fa2de5b04ee4a5563706b2664f34f3.
This reverts commit 919c87f26a2655bfd5ae03958915b6804367c1d6.
This reverts commit edbd02fc6873676e080101d407916efb64bdf71a.
Author: Reynold Xin <rxin@apache.org>
Closes #1583 from rxin/closureClean and squashes the following commits:
8982fe6 [Reynold Xin] [SPARK-2529] Clean closures in foreach and foreachPartition.
(cherry picked from commit eb82abd8e3d25c912fa75201cf4f429aab8d73c7)
Signed-off-by: Reynold Xin <rxin@apache.org>
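To see why cleaning matters: a function value that reads a field of its enclosing object closes over `this`, so serializing the function drags the whole (possibly non-serializable) object along. The plain-Scala sketch below, with illustrative names and no Spark dependency, shows the difference a local copy makes:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

class Driver {                                                // NOT Serializable
  val factor = 3
  def capturingFn: Int => Int = x => x * factor               // closes over `this`
  def cleanedFn: Int => Int = { val f = factor; x => x * f }  // closes over an Int only
}

def isSerializable(obj: AnyRef): Boolean =
  try { new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj); true }
  catch { case _: NotSerializableException => false }

val d = new Driver
assert(!isSerializable(d.capturingFn)) // fails: tries to write the whole Driver
assert(isSerializable(d.cleanedFn))    // succeeds: only the captured Int is written
```

Spark's ClosureCleaner automates the equivalent of the `cleanedFn` rewrite; this patch applies that cleaning to `foreach` and `foreachPartition` as well.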
This reverts commit 70ee14f76d6c3d3f162db6bbe12797c252a0295a.
This reverts commit baf92a0f2119867b1be540085ebe9f1a1c411ae8.
Stopping the Twitter Receiver would call twitter4j's TwitterStream.shutdown, which in turn causes an exception to be thrown to the listener. This exception caused the Receiver to be restarted. This patch checks whether the receiver was stopped, and only restarts on an exception if it was not.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #1577 from tdas/twitter-stop and squashes the following commits:
011b525 [Tathagata Das] Fixed Twitter stream stopping bug.
(cherry picked from commit a45d5480f65d2e969fc7fbd8f358b1717fb99bef)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
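The shape of the fix can be sketched in plain Scala (illustrative names, not the actual receiver code): remember whether a stop was requested, and only restart on exceptions that arrive while the receiver is still meant to be running:

```scala
import java.util.concurrent.atomic.AtomicBoolean

class SketchReceiver {
  private val stopped = new AtomicBoolean(false)
  var restarts = 0
  def stop(): Unit = stopped.set(true)
  def onError(e: Throwable): Unit =
    if (!stopped.get) restarts += 1   // restart only if not deliberately stopped
}

val r = new SketchReceiver
r.onError(new RuntimeException("stream error"))      // still running: restart
r.stop()                                             // shutdown throws to the listener...
r.onError(new RuntimeException("stream shut down"))  // ...but no restart after stop()
assert(r.restarts == 1)
```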
collections to Scala collections JsonRDD.scala
In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a Scala one. These two operations are expensive because they read every element from the Java Map/List and then copy them into a new Scala Map/List. We can instead use Scala wrappers to wrap those Java collections rather than calling toMap/toList.
I did a quick test to see the performance. I had a 2.9GB cached RDD[String] storing one JSON object per record (twitter dataset). My simple test program is attached below.
```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val jsonData = sc.textFile("...")
jsonData.cache.count
val jsonSchemaRDD = sqlContext.jsonRDD(jsonData)
jsonSchemaRDD.registerAsTable("jt")
sqlContext.sql("select count(*) from jt").collect
```
Stages for the schema inference and the table scan both had 48 tasks. These tasks were executed sequentially. For the current implementation, scanning the JSON dataset will materialize values of all fields of a record. The inferred schema of the dataset can be accessed at https://gist.github.com/yhuai/05fe8a57c638c6666f8d.
From the result, there was no significant difference in running `jsonRDD`. For the simple aggregation query, results are attached below.
```
Original:
Run 1: 26.1s
Run 2: 27.03s
Run 3: 27.035s
With this change:
Run 1: 21.086s
Run 2: 21.035s
Run 3: 21.029s
```
JIRA: https://issues.apache.org/jira/browse/SPARK-2603
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1504 from yhuai/removeToMapToList and squashes the following commits:
6831b77 [Yin Huai] Fix failed tests.
09b9bca [Yin Huai] Merge remote-tracking branch 'upstream/master' into removeToMapToList
d1abdb8 [Yin Huai] Remove unnecessary toMap and toList.
(cherry picked from commit b352ef175c234a2ea86b72c2f40da2ac69658b2e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
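The wrapper-versus-copy distinction is visible with the standard converters alone (a standalone sketch, not the `JsonRDD.scalafy` code itself):

```scala
import scala.jdk.CollectionConverters._

val jmap = new java.util.HashMap[String, Int]()
jmap.put("a", 1)

val wrapped = jmap.asScala        // O(1) view over the Java map, no element copying
val copied  = jmap.asScala.toMap  // eager element-by-element copy

jmap.put("b", 2)
assert(wrapped.size == 2) // the view reflects later changes to the Java map
assert(copied.size == 1)  // the copy paid to read every element up front
```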
Author: Michael Armbrust <michael@databricks.com>
Closes #1556 from marmbrus/fixBooleanEqualsOne and squashes the following commits:
ad8edd4 [Michael Armbrust] Add rule for true = 1 and false = 0.
(cherry picked from commit 78d18fdbaa62d8ed235c29b2e37fd6607263c639)
Signed-off-by: Reynold Xin <rxin@apache.org>
Currently, using "==" in a HiveQL expression causes an exception to be thrown; this patch fixes it.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1522 from chenghao-intel/equal and squashes the following commits:
f62a0ff [Cheng Hao] Add == Support for HiveQl
(cherry picked from commit 79fe7634f6817eb2443bc152c6790a4439721fda)
Signed-off-by: Michael Armbrust <michael@databricks.com>
We need to use the analyzed attributes otherwise we end up with a tree that will never resolve.
Author: Michael Armbrust <michael@databricks.com>
Closes #1470 from marmbrus/fixApplySchema and squashes the following commits:
f968195 [Michael Armbrust] Use analyzed attributes when applying the schema.
4969015 [Michael Armbrust] Add test case.
(cherry picked from commit 511a7314037219c23e824ea5363bf7f1df55bab3)
Signed-off-by: Michael Armbrust <michael@databricks.com>
In CPython, the hash of None differs across machines, which can cause wrong results during a shuffle. This PR fixes that.
Author: Davies Liu <davies.liu@gmail.com>
Closes #1371 from davies/hash_of_none and squashes the following commits:
d01745f [Davies Liu] add comments, remove outdated unit tests
5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
b7118aa [Davies Liu] use __builtin__ instead of __builtins__
839e417 [Davies Liu] hijack hash to make hash of None consistant cross machines
(cherry picked from commit 872538c600a452ead52638c1ccba90643a9fa41c)
Signed-off-by: Matei Zaharia <matei@databricks.com>
for defined classes."
This reverts commit 6e0b7e5308263bef60120debe05577868ebaeea9.
We should fix this in branch-1.0 as well.
Author: Reynold Xin <rxin@apache.org>
Closes #1500 from rxin/rangePartitioner and squashes the following commits:
c0a94f5 [Reynold Xin] [SPARK-2598] RangePartitioner's binary search does not use the given Ordering.
(cherry picked from commit fa51b0fb5bee95a402c7b7f13dcf0b46cf5bb429)
Signed-off-by: Reynold Xin <rxin@apache.org>
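The bug class is easy to state: a generic binary search must compare via the supplied `Ordering`, not the elements' natural order. A standalone sketch (an illustration, not Spark's implementation):

```scala
import scala.annotation.tailrec

// Binary search that consults the given Ordering for every comparison.
def binarySearch[T](arr: Array[T], key: T, ord: Ordering[T]): Int = {
  @tailrec def loop(lo: Int, hi: Int): Int =
    if (lo > hi) -(lo + 1)              // not found: encoded insertion point
    else {
      val mid = (lo + hi) >>> 1
      val c = ord.compare(arr(mid), key)
      if (c == 0) mid
      else if (c < 0) loop(mid + 1, hi)
      else loop(lo, mid - 1)
    }
  loop(0, arr.length - 1)
}

assert(binarySearch(Array(1, 3, 5, 7), 5, Ordering[Int]) == 2)
// With a descending Ordering, a search that ignored it would go the wrong way:
assert(binarySearch(Array(7, 5, 3, 1), 5, Ordering[Int].reverse) == 1)
```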
https://issues.apache.org/jira/browse/SPARK-2524
The spark.deploy.retainedDrivers configuration is undocumented but actually used:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L60
Author: lianhuiwang <lianhuiwang09@gmail.com>
Author: Wang Lianhui <lianhuiwang09@gmail.com>
Author: unknown <Administrator@taguswang-PC1.tencent.com>
Closes #1443 from lianhuiwang/SPARK-2524 and squashes the following commits:
64660fd [Wang Lianhui] address pwendell's comments
5f6bbb7 [Wang Lianhui] missing document about spark.deploy.retainedDrivers
44a3f50 [unknown] Merge remote-tracking branch 'upstream/master'
eacf933 [lianhuiwang] Merge remote-tracking branch 'upstream/master'
8bbfe76 [lianhuiwang] Merge remote-tracking branch 'upstream/master'
480ce94 [lianhuiwang] address aarondav comments
f2b5970 [lianhuiwang] bugfix worker DriverStateChanged state should match DriverState.FAILED
(cherry picked from commit 4da01e3813f0a0413fe691358c14278bbd5508ed)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
Typo fix to the programming guide in the docs. Changed the word "distibuted" to "distributed".
Author: Cesar Arevalo <cesar@zephyrhealthinc.com>
Closes #1495 from cesararevalo/master and squashes the following commits:
0c2e3a7 [Cesar Arevalo] Typo fix to the programming guide in the docs
(cherry picked from commit 0d01e85f42f3c997df7fee942b05b509968bac4b)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1436 from chenghao-intel/unwrapdata and squashes the following commits:
34cc21a [Cheng Hao] update the table scan accodringly since the unwrapData function changed
afc39da [Cheng Hao] Polish the code
39d6475 [Cheng Hao] Add HiveDecimal & HiveVarchar support in unwrap data
(cherry picked from commit 7f1720813793e155743b58eae5228298e894b90d)
Signed-off-by: Michael Armbrust <michael@databricks.com>
New t2 instance types require HVM AMIs; the bailout assumption of PVM causes failures when using t2 instance types.
Author: Basit Mustafa <basitmustafa@computes-things-for-basit.local>
Closes #1446 from 24601/master and squashes the following commits:
01fe128 [Basit Mustafa] Makin' it pretty
392a95e [Basit Mustafa] Added t2 instance types
Conflicts:
ec2/spark_ec2.py
An exception is thrown when running the HiveFromSpark example:
Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145)
at org.apache.spark.examples.sql.hive.HiveFromSpark$.main(HiveFromSpark.scala:45)
at org.apache.spark.examples.sql.hive.HiveFromSpark.main(HiveFromSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1475 from chenghao-intel/hive_from_spark and squashes the following commits:
d4c0500 [Cheng Hao] Fix the bug of ClassCastException
(cherry picked from commit 29809a6d58bfe3700350ce1988ff7083881c4382)
Signed-off-by: Reynold Xin <rxin@apache.org>
(branch-1.0 backport)
This backports #1450 into branch-1.0.
Author: Reynold Xin <rxin@apache.org>
Closes #1469 from rxin/closure-1.0 and squashes the following commits:
b474a92 [Reynold Xin] [SPARK-2534] Avoid pulling in the entire RDD in various operators
If the first pass of CoalescedRDD does not find the target number of locations AND the second pass finds new locations, an exception is thrown, as `groupHash.get(nxt_replica).get` is not valid.
The fix is just to add an ArrayBuffer to `groupHash` for that replica if it didn't already exist.
Author: Aaron Davidson <aaron@databricks.com>
Closes #1337 from aarondav/2412 and squashes the following commits:
f587b5d [Aaron Davidson] getOrElseUpdate
3ad8a3c [Aaron Davidson] [SPARK-2412] CoalescedRDD throws exception with certain pref locs
(cherry picked from commit 7c23c0dc3ed721c95690fc49f435d9de6952523c)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
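In isolation, the `getOrElseUpdate` idiom from the fix looks like this (illustrative names):

```scala
import scala.collection.mutable

val groupHash = mutable.Map.empty[String, mutable.ArrayBuffer[Int]]

// groupHash("replica-1") here would throw NoSuchElementException;
// getOrElseUpdate inserts an empty buffer on first access instead.
groupHash.getOrElseUpdate("replica-1", mutable.ArrayBuffer()) += 42
groupHash.getOrElseUpdate("replica-1", mutable.ArrayBuffer()) += 7

assert(groupHash("replica-1").toList == List(42, 7))
```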
Author: Aaron Davidson <aaron@databricks.com>
Closes #1405 from aarondav/2154 and squashes the following commits:
24e9ef9 [Aaron Davidson] [SPARK-2154] Schedule next Driver when one completes (standalone mode)
(cherry picked from commit 9c249743eaabe5fc8d961c7aa581cc0197f6e950)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
We recently added this lock on 'conf' in order to prevent concurrent creation. However, it turns out that this can introduce a deadlock because Hadoop also synchronizes on the Configuration objects when creating new Configurations (and they do so via a static REGISTRY which contains all created Configurations).
This fix forces all Spark initialization of Configuration objects to occur serially by using a static lock that we control, and thus also prevents introducing the deadlock.
Author: Aaron Davidson <aaron@databricks.com>
Closes #1409 from aarondav/1054 and squashes the following commits:
7d1b769 [Aaron Davidson] SPARK-1097: Do not introduce deadlock while fixing concurrency bug
(cherry picked from commit 8867cd0bc2961fefed84901b8b14e9676ae6ab18)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
This is a follow-up of #1428.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1432 from ueshin/issues/SPARK-2518 and squashes the following commits:
37d1ace [Takuya UESHIN] Fix foldability of Substring expression.
(cherry picked from commit cc965eea510397642830acb21f61127b68c098d6)
Signed-off-by: Reynold Xin <rxin@apache.org>
Spark SQL
JIRA: https://issues.apache.org/jira/browse/SPARK-2525.
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1444 from yhuai/SPARK-2517 and squashes the following commits:
edbac3f [Yin Huai] Removed some compiler type erasure warnings.
(cherry picked from commit df95d82da7c76c074fd4064f7c870d55d99e0d8e)
Signed-off-by: Reynold Xin <rxin@apache.org>
This is a follow-up of #1359 with nullability narrowing.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1426 from ueshin/issues/SPARK-2504 and squashes the following commits:
5157832 [Takuya UESHIN] Remove unnecessary white spaces.
80958ac [Takuya UESHIN] Fix nullability of Substring expression.
(cherry picked from commit 632fb3d9a9ebb3d2218385403145d5b89c41c025)
Signed-off-by: Reynold Xin <rxin@apache.org>
Cases of `Substring` with a `null` literal could be added to `NullPropagation`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1428 from ueshin/issues/SPARK-2509 and squashes the following commits:
d9eb85f [Takuya UESHIN] Add Substring cases to NullPropagation.
(cherry picked from commit 9b38b7c71352bb5e6d359515111ad9ca33299127)
Signed-off-by: Reynold Xin <rxin@apache.org>
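On a toy expression tree the rule's shape is: any `Substring` with a null-literal child folds straight to a null literal (a sketch of the idea, not Catalyst's actual `NullPropagation` rule):

```scala
sealed trait Expr
case class Lit(v: Any) extends Expr  // Lit(null) models a SQL NULL
case class Substr(str: Expr, pos: Expr, len: Expr) extends Expr

def propagateNulls(e: Expr): Expr = e match {
  // If any child is a null literal, the whole expression is null.
  case Substr(a, b, c) if Seq(a, b, c).exists { case Lit(null) => true; case _ => false } =>
    Lit(null)
  case other => other
}

assert(propagateNulls(Substr(Lit(null), Lit(1), Lit(2))) == Lit(null))
assert(propagateNulls(Substr(Lit("ab"), Lit(1), Lit(2))) == Substr(Lit("ab"), Lit(1), Lit(2)))
```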
SchemaRDD implementations.
Author: Aaron Staple <aaron.staple@gmail.com>
Closes #1421 from staple/SPARK-2314 and squashes the following commits:
73e04dc [Aaron Staple] [SPARK-2314] Override collect and take in JavaSchemaRDD, forwarding to SchemaRDD implementations.
(cherry picked from commit 90ca532a0fd95dc85cff8c5722d371e8368b2687)
Signed-off-by: Reynold Xin <rxin@apache.org>
data type objects.
JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2498
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes #1423 from concretevitamin/scala-ref-catalyst and squashes the following commits:
325a149 [Zongheng Yang] Synchronize on a lock when initializing data type objects in Catalyst.
(cherry picked from commit c2048a5165b270f5baf2003fdfef7bc6c5875715)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #1414 from marmbrus/exprIdResolution and squashes the following commits:
97b47bc [Michael Armbrust] Attribute equality comparisons should be done by exprId.
(cherry picked from commit 502f90782ad474e2630ed5be4d3c4be7dab09c34)
Signed-off-by: Michael Armbrust <michael@databricks.com>
This replaces the Hive UDF for SUBSTR(ING) with an implementation in Catalyst
and adds tests to verify correct operation.
Author: William Benton <willb@redhat.com>
Closes #1359 from willb/internalSqlSubstring and squashes the following commits:
ccedc47 [William Benton] Fixed too-long line.
a30a037 [William Benton] replace view bounds with implicit parameters
ec35c80 [William Benton] Adds fixes from review:
4f3bfdb [William Benton] Added internal implementation of SQL SUBSTR()
(cherry picked from commit 61de65bc69f9a5fc396b76713193c6415436d452)
Signed-off-by: Michael Armbrust <michael@databricks.com>
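SQL SUBSTR semantics (1-based start; a negative start counts from the end of the string) can be sketched as follows. This illustrates the behavior being implemented, not the Catalyst code; the treatment of a 0 start varies by dialect:

```scala
// SQL-style SUBSTR: 1-based positive start, negative start counted from the end.
def sqlSubstr(s: String, pos: Int, len: Int = Int.MaxValue): String = {
  val start =
    if (pos > 0) pos - 1                       // 1-based -> 0-based
    else if (pos < 0) math.max(s.length + pos, 0)
    else 0                                     // treat 0 like 1 here (dialect-dependent)
  val end =
    if (len == Int.MaxValue) s.length
    else math.min(start.toLong + len, s.length).toInt
  if (start >= s.length || len <= 0) "" else s.substring(start, end)
}

assert(sqlSubstr("Spark SQL", 7) == "SQL")     // positive, 1-based start
assert(sqlSubstr("Spark SQL", -3) == "SQL")    // negative start counts from the end
assert(sqlSubstr("Spark SQL", 1, 5) == "Spark")
```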
Author: Michael Armbrust <michael@databricks.com>
Closes #1396 from marmbrus/moreTests and squashes the following commits:
6660b60 [Michael Armbrust] Blacklist a test that requires DFS command.
8b6001c [Michael Armbrust] Add golden files.
ccd8f97 [Michael Armbrust] Whitelist more tests.
(cherry picked from commit bcd0c30c7eea4c50301cb732c733fdf4d4142060)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #1411 from marmbrus/nestedRepeated and squashes the following commits:
044fa09 [Michael Armbrust] Fix parsing of repeated, nested data access.
(cherry picked from commit 0f98ef1a2c9ecf328f6c5918808fa5ca486e8afd)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #1412 from marmbrus/lockHiveClient and squashes the following commits:
4bc9d5a [Michael Armbrust] protected[hive]
22e9177 [Michael Armbrust] Add comments.
7aa8554 [Michael Armbrust] Don't lock on hive's object.
a6edc5f [Michael Armbrust] Lock usage of hive client.
(cherry picked from commit c7c7ac83392b10abb011e6aead1bf92e7c73695e)
Signed-off-by: Aaron Davidson <aaron@databricks.com>
groupBy()/groupByKey() is notorious for being a very convenient API that can lead to poor performance when used incorrectly.
This PR just makes it clear that users should be cautious not to rely on this API when they really want a different (more performant) one, such as reduceByKey().
(Note that one source of confusion is the name; this groupBy() is not the same as a SQL GROUP-BY, which is used for aggregation and is more similar in nature to Spark's reduceByKey().)
Author: Aaron Davidson <aaron@databricks.com>
Closes #1380 from aarondav/warning and squashes the following commits:
f60da39 [Aaron Davidson] Give better advice
d0afb68 [Aaron Davidson] Add/increase severity of warning in documentation of groupBy()
(cherry picked from commit a2aa7bebae31e1e7ec23d31aaa436283743b283b)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
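The concern can be illustrated with plain Scala collections: grouping materializes every value per key before aggregating, while a reduceByKey-style fold keeps only one running total per key (a standalone sketch, not Spark's distributed implementation):

```scala
val pairs = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4))

// groupBy-then-sum: first builds the full list of values for each key.
val viaGroup = pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// reduceByKey-style: fold each pair into a running total, never holding all values.
val viaReduce = pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
  acc + (k -> (acc.getOrElse(k, 0) + v))
}

assert(viaGroup == viaReduce)
assert(viaReduce == Map("a" -> 4, "b" -> 6))
```

On a cluster the difference is starker: grouping also shuffles every value, whereas per-key reduction combines values before the shuffle.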
This fix obtains a comparable performance boost as [PR #1390](https://github.com/apache/spark/pull/1390) by moving an array update and deserializer initialization out of a potentially very long loop. Suggested by yhuai. The below results are updated for this fix.
## Benchmarks
Generated a local text file with 10M rows of simple key-value pairs. The data is loaded as a table through Hive. Results are obtained on my local machine using hive/console.
Without the fix:
Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s)
With this fix:
Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.57s (1.46s) | 11.0s (1.69s)
Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s)
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes #1408 from concretevitamin/slow-read-2 and squashes the following commits:
d86e437 [Zongheng Yang] Move update & initialization out of potentially long loop.
(cherry picked from commit d60b09bb60cff106fa0acddebf35714503b20f03)
Signed-off-by: Michael Armbrust <michael@databricks.com>
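The pattern itself is general: hoist loop-invariant initialization out of the hot loop. A standalone sketch, with regex construction standing in for the deserializer setup (names are illustrative):

```scala
// Re-creates (and re-compiles) the regex once per row:
def slow(rows: Seq[String]): Seq[Int] =
  rows.map { r => val parser = "\\d+".r; parser.findFirstIn(r).fold(0)(_.toInt) }

// Initialized once, outside the loop:
val parser = "\\d+".r
def fast(rows: Seq[String]): Seq[Int] =
  rows.map(r => parser.findFirstIn(r).fold(0)(_.toInt))

val data = Seq("a1", "b22", "c")
assert(slow(data) == fast(data))     // same results...
assert(fast(data) == Seq(1, 22, 0))  // ...without per-row setup cost
```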
InMemoryRelation
Reuse byte buffers when creating unique attributes for multiple instances of an InMemoryRelation in a single query plan.
Author: Michael Armbrust <michael@databricks.com>
Closes #1332 from marmbrus/doubleCache and squashes the following commits:
4a19609 [Michael Armbrust] Clean up concurrency story by calculating buffersn the constructor.
b39c931 [Michael Armbrust] Allocations are kind of a side effect.
f67eff7 [Michael Armbrust] Reusue same byte buffers when creating new instance of InMemoryRelation
(cherry picked from commit 1a7d7cc85fb24de21f1cde67d04467171b82e845)
Signed-off-by: Reynold Xin <rxin@apache.org>
Author: Michael Armbrust <michael@databricks.com>
Closes #1366 from marmbrus/partialDistinct and squashes the following commits:
12a31ab [Michael Armbrust] Add more efficient distinct operator.
(cherry picked from commit 7e26b57615f6c1d3f9058f9c19c05ec91f017f4c)
Signed-off-by: Reynold Xin <rxin@apache.org>
VertexPartition and ShippableVertexPartition are contained in RDDs but are not marked Serializable, leading to NotSerializableExceptions when using Java serialization.
The fix is simply to mark them as Serializable. This PR does that and adds a test for serializing them using Java and Kryo serialization.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #1376 from ankurdave/SPARK-2455 and squashes the following commits:
ed4a51b [Ankur Dave] Make (Shippable)VertexPartition serializable
1fd42c5 [Ankur Dave] Add failing tests for Java serialization
(cherry picked from commit 7a0135293192aaefc6ae20b57e15a90945bd8a4e)
Signed-off-by: Reynold Xin <rxin@apache.org>
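The difference the one-line fix makes can be shown with stand-in classes (not GraphX's actual types):

```scala
import java.io._

class PlainPartition(val ids: Array[Long])                        // not Serializable
class MarkedPartition(val ids: Array[Long]) extends Serializable  // the fix

// Attempt Java serialization; None signals a NotSerializableException.
def javaSerialize(obj: AnyRef): Option[Array[Byte]] =
  try {
    val bytes = new ByteArrayOutputStream
    new ObjectOutputStream(bytes).writeObject(obj)
    Some(bytes.toByteArray)
  } catch { case _: NotSerializableException => None }

assert(javaSerialize(new PlainPartition(Array(1L))).isEmpty)    // throws without the marker
assert(javaSerialize(new MarkedPartition(Array(1L))).isDefined) // works with it
```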