Commit message | Author | Date | Files | Lines
* [SQL] Whitelist more Hive tests. | Michael Armbrust | 2014-07-15 | 105 | -0/+163

  Author: Michael Armbrust <michael@databricks.com>
  Closes #1396 from marmbrus/moreTests and squashes the following commits:
    6660b60 [Michael Armbrust] Blacklist a test that requires DFS command.
    8b6001c [Michael Armbrust] Add golden files.
    ccd8f97 [Michael Armbrust] Whitelist more tests.
  (cherry picked from commit bcd0c30c7eea4c50301cb732c733fdf4d4142060)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2483][SQL] Fix parsing of repeated, nested data access. | Michael Armbrust | 2014-07-15 | 2 | -6/+9

  Author: Michael Armbrust <michael@databricks.com>
  Closes #1411 from marmbrus/nestedRepeated and squashes the following commits:
    044fa09 [Michael Armbrust] Fix parsing of repeated, nested data access.
  (cherry picked from commit 0f98ef1a2c9ecf328f6c5918808fa5ca486e8afd)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2485][SQL] Lock usage of hive client. | Michael Armbrust | 2014-07-15 | 1 | -2/+3

  Author: Michael Armbrust <michael@databricks.com>
  Closes #1412 from marmbrus/lockHiveClient and squashes the following commits:
    4bc9d5a [Michael Armbrust] protected[hive]
    22e9177 [Michael Armbrust] Add comments.
    7aa8554 [Michael Armbrust] Don't lock on hive's object.
    a6edc5f [Michael Armbrust] Lock usage of hive client.
  (cherry picked from commit c7c7ac83392b10abb011e6aead1bf92e7c73695e)
  Signed-off-by: Aaron Davidson <aaron@databricks.com>
* Add/increase severity of warning in documentation of groupBy() | Aaron Davidson | 2014-07-14 | 2 | -9/+21

  groupBy()/groupByKey() is notorious for being a very convenient API that can lead to poor performance when used incorrectly. This PR just makes it clear that users should be cautious not to rely on this API when they really want a different (more performant) one, such as reduceByKey(). (Note that one source of confusion is the name; this groupBy() is not the same as a SQL GROUP-BY, which is used for aggregation and is more similar in nature to Spark's reduceByKey().)

  Author: Aaron Davidson <aaron@databricks.com>
  Closes #1380 from aarondav/warning and squashes the following commits:
    f60da39 [Aaron Davidson] Give better advice
    d0afb68 [Aaron Davidson] Add/increase severity of warning in documentation of groupBy()
  (cherry picked from commit a2aa7bebae31e1e7ec23d31aaa436283743b283b)
  Signed-off-by: Patrick Wendell <pwendell@gmail.com>
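  To make the warning concrete, a minimal sketch of the two patterns, assuming an existing SparkContext `sc` (the sample data is made up):

  ```scala
  // Sum values per key, two ways; both give the same answer.
  val pairs = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1))

  // Convenient but costly: every value is shuffled to the reducer for its key.
  val viaGroup = pairs.groupByKey().mapValues(_.sum)

  // Preferred for aggregation: values are partially summed map-side first.
  val viaReduce = pairs.reduceByKey(_ + _)
  ```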
* [SPARK-2443][SQL] Fix slow read from partitioned tables | Zongheng Yang | 2014-07-14 | 1 | -3/+7

  This fix obtains a comparable performance boost as [PR #1390](https://github.com/apache/spark/pull/1390) by moving an array update and deserializer initialization out of a potentially very long loop. Suggested by yhuai. The below results are updated for this fix.

  ## Benchmarks
  Generated a local text file with 10M rows of simple key-value pairs. The data is loaded as a table through Hive. Results are obtained on my local machine using hive/console.

  Without the fix:

  Type | Non-partitioned | Partitioned (1 part)
  ------------ | ------------ | -------------
  First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
  Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s)

  With this fix:

  Type | Non-partitioned | Partitioned (1 part)
  ------------ | ------------ | -------------
  First run | 9.57s (1.46s) | 11.0s (1.69s)
  Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s)

  Author: Zongheng Yang <zongheng.y@gmail.com>
  Closes #1408 from concretevitamin/slow-read-2 and squashes the following commits:
    d86e437 [Zongheng Yang] Move update & initialization out of potentially long loop.
  (cherry picked from commit d60b09bb60cff106fa0acddebf35714503b20f03)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
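  The shape of the fix, as a sketch with stand-in names (the actual change is in the Hive table-scanning code, not shown here): loop-invariant work is hoisted out of the per-row loop.

  ```scala
  // Stand-in for an expensive-to-build deserializer.
  def makeDeserializer(): Array[Byte] => String = {
    val charset = java.nio.charset.Charset.forName("UTF-8")
    bytes => new String(bytes, charset)
  }

  def readPartition(rows: Iterator[Array[Byte]]): Iterator[String] = {
    val deserialize = makeDeserializer() // built once per partition...
    rows.map(deserialize)                // ...not once per row inside the loop
  }
  ```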
* [maven-release-plugin] prepare for next development iteration | Ubuntu | 2014-07-14 | 21 | -22/+22
* [maven-release-plugin] prepare release v1.0.1-rc3 | Ubuntu | 2014-07-14 | 21 | -22/+22
* [SPARK-2405][SQL] Reuse same byte buffers when creating new instance of InMemoryRelation | Michael Armbrust | 2014-07-12 | 2 | -12/+25

  Reuse byte buffers when creating unique attributes for multiple instances of an InMemoryRelation in a single query plan.

  Author: Michael Armbrust <michael@databricks.com>
  Closes #1332 from marmbrus/doubleCache and squashes the following commits:
    4a19609 [Michael Armbrust] Clean up concurrency story by calculating buffers in the constructor.
    b39c931 [Michael Armbrust] Allocations are kind of a side effect.
    f67eff7 [Michael Armbrust] Reuse same byte buffers when creating new instance of InMemoryRelation
  (cherry picked from commit 1a7d7cc85fb24de21f1cde67d04467171b82e845)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2441][SQL] Add more efficient distinct operator. | Michael Armbrust | 2014-07-12 | 2 | -3/+34

  Author: Michael Armbrust <michael@databricks.com>
  Closes #1366 from marmbrus/partialDistinct and squashes the following commits:
    12a31ab [Michael Armbrust] Add more efficient distinct operator.
  (cherry picked from commit 7e26b57615f6c1d3f9058f9c19c05ec91f017f4c)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2455] Mark (Shippable)VertexPartition serializable | Ankur Dave | 2014-07-12 | 5 | -14/+37

  VertexPartition and ShippableVertexPartition are contained in RDDs but are not marked Serializable, leading to NotSerializableExceptions when using Java serialization. The fix is simply to mark them as Serializable. This PR does that and adds a test for serializing them using Java and Kryo serialization.

  Author: Ankur Dave <ankurdave@gmail.com>
  Closes #1376 from ankurdave/SPARK-2455 and squashes the following commits:
    ed4a51b [Ankur Dave] Make (Shippable)VertexPartition serializable
    1fd42c5 [Ankur Dave] Add failing tests for Java serialization
  (cherry picked from commit 7a0135293192aaefc6ae20b57e15a90945bd8a4e)
  Signed-off-by: Reynold Xin <rxin@apache.org>
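  A minimal sketch of both the failure mode and the test idea (illustrative class names, not the GraphX ones):

  ```scala
  import java.io._

  // Without `extends Serializable`, putting instances of a class like this
  // inside an RDD fails under Java serialization with NotSerializableException.
  class PartitionSketch(val data: Array[Int]) extends Serializable

  // Java-serialization round trip, conceptually what the added test checks.
  def roundTrip[T](obj: T): T = {
    val bos = new ByteArrayOutputStream()
    new ObjectOutputStream(bos).writeObject(obj)
    new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
      .readObject().asInstanceOf[T]
  }

  val restored = roundTrip(new PartitionSketch(Array(1, 2, 3)))
  ```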
* Updating versions for branch-1.0 | Patrick Wendell | 2014-07-12 | 9 | -10/+10
* HOTFIX: Updating Python doc version | Patrick Wendell | 2014-07-12 | 1 | -1/+1
* [SPARK-2415] [SQL] RowWriteSupport should handle empty ArrayType correctly. | Takuya UESHIN | 2014-07-10 | 3 | -16/+16

  `RowWriteSupport` doesn't write an empty `ArrayType` value, so the read value becomes `null`. It should write the empty `ArrayType` value as it is.

  Author: Takuya UESHIN <ueshin@happy-camper.st>
  Closes #1339 from ueshin/issues/SPARK-2415 and squashes the following commits:
    32afc87 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2415
    2f05196 [Takuya UESHIN] Fix RowWriteSupport to handle empty ArrayType correctly.
  (cherry picked from commit f5abd271292f5c98eb8b1974c1df31d08ed388dd)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2431][SQL] Refine StringComparison and related codes. | Takuya UESHIN | 2014-07-10 | 2 | -15/+16

  Refine `StringComparison` and related codes as follows:
  - `StringComparison` could be similar to `StringRegexExpression` or `CaseConversionExpression`.
  - Nullability of `StringRegexExpression` could depend on children's nullabilities.
  - Add a case that the like condition includes no wildcard to `LikeSimplification`.

  Author: Takuya UESHIN <ueshin@happy-camper.st>
  Closes #1357 from ueshin/issues/SPARK-2431 and squashes the following commits:
    77766f5 [Takuya UESHIN] Add a case that the like condition includes no wildcard to LikeSimplification.
    b9da9d2 [Takuya UESHIN] Fix nullability of StringRegexExpression.
    680bb72 [Takuya UESHIN] Refine StringComparison.
  (cherry picked from commit f62c42728990266d5d5099abe241f699189ba025)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* SPARK-2427: Fix Scala examples that use the wrong command line arguments index | Artjom-Metro | 2014-07-10 | 3 | -6/+16

  The Scala examples HBaseTest and HdfsTest don't use the correct indexes for the command line arguments. This is due to the fix of JIRA 1565, where these examples were not correctly adapted to the new usage of the submit script.

  Author: Artjom-Metro <Artjom-Metro@users.noreply.github.com>
  Author: Artjom-Metro <artjom31415@googlemail.com>
  Closes #1353 from Artjom-Metro/fix_examples and squashes the following commits:
    6111801 [Artjom-Metro] Reduce the default number of iterations
    cfaa73c [Artjom-Metro] Fix some examples that use the wrong index to access the command line arguments
  (cherry picked from commit ae8ca4dfbacd5a5197fb41722607ad99c190f768)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-1341] [Streaming] Throttle BlockGenerator to limit rate of data consumption. | Issac Buenrostro | 2014-07-10 | 4 | -1/+118

  Author: Issac Buenrostro <buenrostro@ooyala.com>
  Closes #945 from ibuenros/SPARK-1341-throttle and squashes the following commits:
    5514916 [Issac Buenrostro] Formatting changes, added documentation for streaming throttling, stricter unit tests for throttling.
    62f395f [Issac Buenrostro] Add comments and license to streaming RateLimiter.scala
    7066438 [Issac Buenrostro] Moved throttle code to RateLimiter class, smoother pushing when throttling active
    ccafe09 [Issac Buenrostro] Throttle BlockGenerator to limit rate of data consumption.
  (cherry picked from commit 2dd67248503306bb08946b1796821e9f9ed4d00e)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
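  The throttling idea in miniature: a hand-rolled sketch, not the actual `RateLimiter` class added by this patch.

  ```scala
  // Block the producer once it exceeds a per-second budget, so data is
  // consumed at a bounded rate instead of as fast as the source can push.
  class ThrottleSketch(maxPerSecond: Int) {
    private var windowStart = System.currentTimeMillis()
    private var pushed = 0

    def waitToPush(): Unit = synchronized {
      pushed += 1
      if (pushed >= maxPerSecond) {
        val elapsed = System.currentTimeMillis() - windowStart
        if (elapsed < 1000) Thread.sleep(1000 - elapsed) // finish out the second
        windowStart = System.currentTimeMillis()
        pushed = 0
      }
    }
  }
  ```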
* [SPARK-2417][MLlib] Fix DecisionTree tests | johnnywalleye | 2014-07-09 | 1 | -4/+4

  Fixes test failures introduced by https://github.com/apache/spark/pull/1316. For both the regression and classification cases, val stats is the InformationGainStats for the best tree split. stats.predict is the predicted value for the data, before the split is made. Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that.

  Author: johnnywalleye <jsondag@gmail.com>
  Closes #1343 from johnnywalleye/decision-tree-tests and squashes the following commits:
    ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests
  (cherry picked from commit d35e3db2325931492b64890125a70579bc3b587b)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [STREAMING] SPARK-2343: Fix QueueInputDStream with oneAtATime false | Manuel Laflamme | 2014-07-09 | 2 | -2/+92

  Fix QueueInputDStream which was not removing dequeued items when used with the oneAtATime flag disabled.

  Author: Manuel Laflamme <manuel.laflamme@gmail.com>
  Closes #1285 from mlaflamm/spark-2343 and squashes the following commits:
    61c9e38 [Manuel Laflamme] Unit tests for queue input stream
    c51d029 [Manuel Laflamme] Fix QueueInputDStream with oneAtATime false
  (cherry picked from commit 0eb11527d13083ced215e3fda44ed849198a57cb)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
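  Conceptually, the corrected dequeue semantics look like this sketch (illustrative, not the actual `QueueInputDStream` code):

  ```scala
  import scala.collection.mutable

  // With oneAtATime = true, each batch consumes exactly one queued RDD; with
  // oneAtATime = false, a batch must drain (and remove) everything queued,
  // which is what the fixed code now does instead of leaving items behind.
  def nextBatch[T](queue: mutable.Queue[T], oneAtATime: Boolean): Seq[T] =
    if (oneAtATime) {
      if (queue.isEmpty) Seq.empty else Seq(queue.dequeue())
    } else {
      queue.dequeueAll(_ => true)
    }
  ```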
* [SPARK-2152][MLlib] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160) | johnnywalleye | 2014-07-08 | 1 | -5/+5

  Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala. In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows:

  ```
  rightNodeAgg(featureIndex)(2 * (numBins - 2)) = binData(shift + (2 * numBins - 1))
  ```

  Then there is a loop that sets the rest of the values, as in line 809:

  ```
  rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
    binData(shift + (2 * (numBins - 2 - splitIndex))) +
    rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))
  ```

  But since splitIndex starts at 1, this ends up skipping a set of binData values. The changes here address this issue, for both the Regression and Classification cases.

  Author: johnnywalleye <jsondag@gmail.com>
  Closes #1316 from johnnywalleye/master and squashes the following commits:
    73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations
  (cherry picked from commit 1114207cc8e4ef94cb97bbd5a2ef3ae4d51f73fa)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-2362] Fix for newFilesOnly logic in file DStream | Gabriele Nizzoli | 2014-07-08 | 1 | -1/+1

  The newFilesOnly logic should be inverted: if the flag newFilesOnly == true, then only files newer than the current time should be read. As the code is now, if newFilesOnly == true it will start to read files that are older than 0L (that is: every file in the directory).

  Author: Gabriele Nizzoli <mail@nizzoli.net>
  Closes #1077 from gabrielenizzoli/master and squashes the following commits:
    4f1d261 [Gabriele Nizzoli] Fix for newFilesOnly logic in file DStream
  (cherry picked from commit e6f7bfcfbf6aff7a9f8cd8e0a2166d0bf62b0912)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
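  The corrected threshold in sketch form (stand-in function, assuming the stream records its start time):

  ```scala
  // Files with a modification time below the threshold are ignored.
  def modTimeThreshold(newFilesOnly: Boolean, streamStartTime: Long): Long =
    if (newFilesOnly) streamStartTime // skip files that predate the stream
    else 0L                           // also pick up every pre-existing file
  ```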
* [SPARK-2409] Make SQLConf thread safe. | Reynold Xin | 2014-07-08 | 1 | -5/+5

  Author: Reynold Xin <rxin@apache.org>
  Closes #1334 from rxin/sqlConfThreadSafetuy and squashes the following commits:
    c1e0a5a [Reynold Xin] Fixed the duplicate comment.
    7614372 [Reynold Xin] [SPARK-2409] Make SQLConf thread safe.
  (cherry picked from commit 32516f866a32d51bfaa04685ae77ba216b4202d9)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2403] Catch all errors during serialization in DAGScheduler | Daniel Darabos | 2014-07-08 | 1 | -0/+5

  https://issues.apache.org/jira/browse/SPARK-2403

  Spark hangs for us whenever we forget to register a class with Kryo. This should be a simple fix for that. But let me know if you have a better suggestion.

  I did not write a new test for this. It would be pretty complicated and I'm not sure it's worthwhile for such a simple change. Let me know if you disagree.

  Author: Daniel Darabos <darabos.daniel@gmail.com>
  Closes #1329 from darabos/spark-2403 and squashes the following commits:
    3aceaad [Daniel Darabos] Print full stack trace for miscellaneous exceptions during serialization.
    52c22ba [Daniel Darabos] Only catch NonFatal exceptions.
    361e962 [Daniel Darabos] Catch all errors during serialization in DAGScheduler.
  (cherry picked from commit c8a2313cdf825e0191680a423d17619b5504ff89)
  Signed-off-by: Aaron Davidson <aaron@databricks.com>
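  The catch pattern used here, as a small sketch: `NonFatal` matches ordinary exceptions but lets truly fatal errors (e.g. OutOfMemoryError) propagate.

  ```scala
  import scala.util.control.NonFatal

  // Report serialization failures instead of hanging the job; stand-in names.
  def trySerialize(work: => Array[Byte]): Option[Array[Byte]] =
    try Some(work)
    catch {
      case NonFatal(e) =>
        println(s"Task serialization failed: $e") // e.g. missing Kryo registration
        None
    }
  ```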
* [SPARK-2395][SQL] Optimize common LIKE patterns. | Michael Armbrust | 2014-07-08 | 2 | -0/+74

  Author: Michael Armbrust <michael@databricks.com>
  Closes #1325 from marmbrus/slowLike and squashes the following commits:
    023c3eb [Michael Armbrust] add comment.
    8b421c2 [Michael Armbrust] Handle the case where the final % is actually escaped.
    d34d37e [Michael Armbrust] add periods.
    3bbf35f [Michael Armbrust] Roll back changes to SparkBuild
    53894b1 [Michael Armbrust] Fix grammar.
    4094462 [Michael Armbrust] Fix grammar.
    6d3d0a0 [Michael Armbrust] Optimize common LIKE patterns.
  (cherry picked from commit cc3e0a14daf756ff5c2d4e7916438e175046e5bb)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
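  The gist of the rewrite, as an illustrative sketch (not the actual `LikeSimplification` rule): patterns with a single `%` at one end degrade to cheap string operations, with care taken for an escaped trailing `\%`.

  ```scala
  // Map simple LIKE patterns to string ops; anything else keeps the regex path.
  def fastLike(pattern: String): Option[String => Boolean] = {
    def plain(s: String) = !s.contains("%") && !s.contains("_")
    pattern match {
      case p if plain(p) => Some(_ == p)                 // 'abc'
      case p if p.startsWith("%") && plain(p.drop(1)) =>
        Some(_.endsWith(p.drop(1)))                      // '%abc'
      case p if p.endsWith("%") && !p.endsWith("\\%") && plain(p.dropRight(1)) =>
        Some(_.startsWith(p.dropRight(1)))               // 'abc%'
      case _ => None                                     // fall back to regex
    }
  }
  ```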
* [EC2] Add default history server port to ec2 script | Andrew Or | 2014-07-08 | 1 | -0/+1

  Right now I have to open it manually

  Author: Andrew Or <andrewor14@gmail.com>
  Closes #1296 from andrewor14/hist-serv-port and squashes the following commits:
    8895a1f [Andrew Or] Add default history server port to ec2 script
  (cherry picked from commit 56e009d4f05d990c60e109838fa70457f97f44aa)
  Conflicts:
    ec2/spark_ec2.py
* [SPARK-2391][SQL] Custom take() for LIMIT queries. | Michael Armbrust | 2014-07-08 | 1 | -4/+47

  Using Spark's take can result in an entire in-memory partition being shipped in order to retrieve a single row.

  Author: Michael Armbrust <michael@databricks.com>
  Closes #1318 from marmbrus/takeLimit and squashes the following commits:
    77289a5 [Michael Armbrust] Update scala doc
    32f0674 [Michael Armbrust] Custom take implementation for LIMIT queries.
  (cherry picked from commit 5a4063645dd7bb4cd8bda890785235729804ab09)
  Signed-off-by: Reynold Xin <rxin@apache.org>
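  A conceptual sketch of the incremental strategy (stand-in code; the real implementation works against the RDD's partitions):

  ```scala
  // Collect at most `limit` rows, visiting partitions lazily so a LIMIT 10
  // does not ship a whole partition's worth of rows to the driver.
  def takeSketch[T](partitions: Iterator[Iterator[T]], limit: Int): Seq[T] = {
    val buf = scala.collection.mutable.ArrayBuffer.empty[T]
    while (buf.length < limit && partitions.hasNext) {
      buf ++= partitions.next().take(limit - buf.length)
    }
    buf.toSeq
  }
  ```

  RDD.take itself uses a similar idea, growing the number of partitions it scans each round to avoid many scheduler round trips.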
* Resolve sbt warnings during build Ⅱ | witgo | 2014-07-08 | 10 | -94/+94

  Author: witgo <witgo@qq.com>
  Closes #1153 from witgo/expectResult and squashes the following commits:
    97541d8 [witgo] merge master
    ead26e7 [witgo] Resolve sbt warnings during build
  (cherry picked from commit 3cd5029be709307415f911236472a685e406e763)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2376][SQL] Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException | Yin Huai | 2014-07-07 | 2 | -25/+44

  JIRA: https://issues.apache.org/jira/browse/SPARK-2376

  Author: Yin Huai <huai@cse.ohio-state.edu>
  Closes #1320 from yhuai/SPARK-2376 and squashes the following commits:
    0107417 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2376
    480803d [Yin Huai] Correctly handling JSON arrays in PySpark.
  (cherry picked from commit 4352a2fdaa64efee7158eabef65703460ff284ec)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2375][SQL] JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs | Yin Huai | 2014-07-07 | 3 | -8/+12

  For example, for

  ```
  {"array": [{"field":214748364700}, {"field":1}]}
  ```

  the type of field is resolved as IntType, while for

  ```
  {"array": [{"field":1}, {"field":214748364700}]}
  ```

  the type of field is resolved as LongType.

  JIRA: https://issues.apache.org/jira/browse/SPARK-2375

  Author: Yin Huai <huaiyin.thu@gmail.com>
  Closes #1308 from yhuai/SPARK-2375 and squashes the following commits:
    3e2e312 [Yin Huai] Update unit test.
    1b2ff9f [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2375
    10794eb [Yin Huai] Correctly resolve the type of a field inside an array of structs.
  (cherry picked from commit f0496ee10847db921a028a34f70385f9b740b3f3)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
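  The required property, in sketch form with simplified stand-in types (Catalyst's actual types are `IntegerType`/`LongType`): type resolution must widen commutatively, so element order cannot change the result.

  ```scala
  sealed trait SketchType
  case object IntSketch extends SketchType
  case object LongSketch extends SketchType

  // Commutative widening: a Long anywhere in the array widens the field.
  def widen(a: SketchType, b: SketchType): SketchType = (a, b) match {
    case (IntSketch, IntSketch) => IntSketch
    case _                      => LongSketch
  }

  assert(widen(IntSketch, LongSketch) == widen(LongSketch, IntSketch))
  ```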
* [SPARK-2386] [SQL] RowWriteSupport should use the exact types to cast. | Takuya UESHIN | 2014-07-07 | 2 | -3/+41

  When executing `saveAsParquetFile` with a non-primitive type, `RowWriteSupport` uses the wrong type `Int` for `ByteType` and `ShortType`.

  Author: Takuya UESHIN <ueshin@happy-camper.st>
  Closes #1315 from ueshin/issues/SPARK-2386 and squashes the following commits:
    20d89ec [Takuya UESHIN] Use None instead of null.
    bd88741 [Takuya UESHIN] Add a test.
    323d1d2 [Takuya UESHIN] Modify RowWriteSupport to use the exact types to cast.
  (cherry picked from commit 4deeed17c4847f212a4fa1a8685cfe8a12179263)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2339][SQL] SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery | Yin Huai | 2014-07-07 | 6 | -30/+149

  Reported by http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html

  After we get the table from the catalog, because the table has an alias, we will temporarily insert a Subquery. Then, we convert the table alias to lower case no matter if the parser is case sensitive or not. To see the issue ...

  ```
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.createSchemaRDD
  case class Person(name: String, age: Int)
  val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
  people.registerAsTable("people")
  sqlContext.sql("select PEOPLE.name from people PEOPLE")
  ```

  The plan is ...

  ```
  == Query Plan ==
  Project ['PEOPLE.name]
   ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:176
  ```

  You can find that `PEOPLE.name` is not resolved.

  This PR introduces three changes.
  1. If a table has an alias, the catalog will not lowercase the alias. If a lowercase alias is needed, the analyzer will do the work.
  2. A catalog has a new val caseSensitive that indicates if this catalog is case sensitive or not. For example, a SimpleCatalog is case sensitive, but
  3. Corresponding unit tests.

  With this PR, case sensitivity of database names and table names is handled by the catalog. Case sensitivity of other identifiers is handled by the analyzer.

  JIRA: https://issues.apache.org/jira/browse/SPARK-2339

  Author: Yin Huai <huai@cse.ohio-state.edu>
  Closes #1317 from yhuai/SPARK-2339 and squashes the following commits:
    12d8006 [Yin Huai] Handling case sensitivity correctly.
  (cherry picked from commit c0b4cf097de50eb2c4b0f0e67da53ee92efc1f77)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-1977][MLLIB] register mutable BitSet in MovieLenseALS | Neville Li | 2014-07-07 | 1 | -0/+3

  Author: Neville Li <neville@spotify.com>
  Closes #1319 from nevillelyh/gh/SPARK-1977 and squashes the following commits:
    1f0a355 [Neville Li] [SPARK-1977][MLLIB] register mutable BitSet in MovieLenseALS
  (cherry picked from commit f7ce1b3b48f0354434456241188c6a5d954852e2)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-2327] [SQL] Fix nullabilities of Join/Generate/Aggregate. | Takuya UESHIN | 2014-07-05 | 7 | -21/+60

  Fix nullabilities of `Join`/`Generate`/`Aggregate` because:
  - Output attributes of the opposite side of an `OuterJoin` should be nullable.
  - Output attributes of the generator side of `Generate` should be nullable if `join` is `true` and `outer` is `true`.
  - `AttributeReference` of `computedAggregates` of `Aggregate` should be the same as `aggregateExpression`'s.

  Author: Takuya UESHIN <ueshin@happy-camper.st>
  Closes #1266 from ueshin/issues/SPARK-2327 and squashes the following commits:
    3ace83a [Takuya UESHIN] Add withNullability to Attribute and use it to change nullabilities.
    df1ae53 [Takuya UESHIN] Modify nullabilize to leave attribute if not resolved.
    799ce56 [Takuya UESHIN] Add nullabilization to Generate of SparkPlan.
    a0fc9bc [Takuya UESHIN] Fix scalastyle errors.
    0e31e37 [Takuya UESHIN] Fix Aggregate resultAttribute nullabilities.
    09532ec [Takuya UESHIN] Fix Generate output nullabilities.
    f20f196 [Takuya UESHIN] Fix Join output nullabilities.
  (cherry picked from commit 9d5ecf8205b924dc8a3c13fed68beb78cc5c7553)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2366] [SQL] Add column pruning for the right side of LeftSemi join. | Takuya UESHIN | 2014-07-05 | 1 | -8/+20

  The right side of a `LeftSemi` join needs only the columns used in the join condition.

  Author: Takuya UESHIN <ueshin@happy-camper.st>
  Closes #1301 from ueshin/issues/SPARK-2366 and squashes the following commits:
    7677a39 [Takuya UESHIN] Update comments.
    786d3a0 [Takuya UESHIN] Rename method name.
    e0957b1 [Takuya UESHIN] Add column pruning for the right side of LeftSemi join.
  (cherry picked from commit 3da8df939ec63064692ba64d9188aeea908b305c)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2370][SQL] Decrease metadata retrieved for partitioned hive queries. | Michael Armbrust | 2014-07-04 | 1 | -1/+1

  Author: Michael Armbrust <michael@databricks.com>
  Closes #1305 from marmbrus/usePrunerPartitions and squashes the following commits:
    744aa20 [Michael Armbrust] Use getAllPartitionsForPruner instead of getPartitions, which avoids retrieving auth data
  (cherry picked from commit 9d006c97371ddf357e0b821d5c6d1535d9b6fe41)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* [maven-release-plugin] prepare for next development iteration | Ubuntu | 2014-07-04 | 21 | -22/+22
* [maven-release-plugin] prepare release v1.0.1-rc2 (tag: v1.0.1) | Ubuntu | 2014-07-04 | 21 | -22/+22
* Updating CHANGES.txt file | Patrick Wendell | 2014-07-04 | 1 | -0/+125
* HOTFIX: Merge issue with cf1d46e4. | Patrick Wendell | 2014-07-04 | 1 | -2/+2

  The tests in that patch used a newer constructor for TaskInfo.
* [SPARK-2059][SQL] Add analysis checks | Reynold Xin | 2014-07-04 | 2 | -0/+24

  This replaces #1263 with a test case.

  Author: Reynold Xin <rxin@apache.org>
  Author: Michael Armbrust <michael@databricks.com>
  Closes #1265 from rxin/sql-analysis-error and squashes the following commits:
    a639e01 [Reynold Xin] Added a test case for unresolved attribute analysis.
    7371e1b [Reynold Xin] Merge pull request #1263 from marmbrus/analysisChecks
    448c088 [Michael Armbrust] Add analysis checks
  (cherry picked from commit b3e768e154bd7175db44c3ffc3d8f783f15ab776)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* Update SQLConf.scala | baishuo(白硕) | 2014-07-04 | 1 | -6/+3

  Use concurrent.ConcurrentHashMap instead of util.Collections.synchronizedMap

  Author: baishuo(白硕) <vc_java@hotmail.com>
  Closes #1272 from baishuo/master and squashes the following commits:
    51ec55d [baishuo(白硕)] Update SQLConf.scala
    63da043 [baishuo(白硕)] Update SQLConf.scala
    36b6dbd [baishuo(白硕)] Update SQLConf.scala
    864faa0 [baishuo(白硕)] Update SQLConf.scala
    593096b [baishuo(白硕)] Update SQLConf.scala
    7304d9b [baishuo(白硕)] Update SQLConf.scala
    843581c [baishuo(白硕)] Update SQLConf.scala
    1d3e4a2 [baishuo(白硕)] Update SQLConf.scala
    0740f28 [baishuo(白硕)] Update SQLConf.scala
  (cherry picked from commit 0bbe61223eda3f33bbf8992d2a8f0d47813f4873)
  Signed-off-by: Reynold Xin <rxin@apache.org>
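  The data-structure swap in miniature, assuming fields shaped roughly like SQLConf's (illustrative, not the actual class):

  ```scala
  import java.util.concurrent.ConcurrentHashMap

  // ConcurrentHashMap gives thread-safe gets/puts without serializing every
  // access on one monitor, unlike Collections.synchronizedMap.
  class ConfSketch {
    private val settings = new ConcurrentHashMap[String, String]()

    def set(key: String, value: String): Unit = settings.put(key, value)

    def get(key: String, defaultValue: String): String =
      Option(settings.get(key)).getOrElse(defaultValue)
  }
  ```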
* [SPARK-1199][REPL] Remove VALId and use the original import style for defined classes. | Prashant Sharma | 2014-07-04 | 3 | -11/+31

  This is an alternate solution to #1176.

  Author: Prashant Sharma <prashant.s@imaginea.com>
  Closes #1179 from ScrapCodes/SPARK-1199/repl-fix-second-approach and squashes the following commits:
    820b34b [Prashant Sharma] Here we generate two kinds of import wrappers based on whether it is a class or not.
  (cherry picked from commit d43415075b3468fe8aa56de5d2907d409bb96347)
  Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-2059][SQL] Don't throw TreeNodeException in `execution.ExplainCommand` | Cheng Lian | 2014-07-03 | 1 | -3/+6

  This is a fix for the problem revealed by PR #1265. Currently `HiveComparisonSuite` ignores output of `ExplainCommand` since Catalyst query plan is quite different from Hive query plan. But exceptions thrown from `CheckResolution` still break test cases. This PR catches any `TreeNodeException` and reports it as part of the query explanation.

  After merging this PR, PR #1265 can also be merged safely.

  For a normal query:

  ```
  scala> hql("explain select key from src").foreach(println)
  ...
  [Physical execution plan:]
  [HiveTableScan [key#9], (MetastoreRelation default, src, None), None]
  ```

  For a wrong query with unresolved attribute(s):

  ```
  scala> hql("explain select kay from src").foreach(println)
  ...
  [Error occurred during query planning: ]
  [Unresolved attributes: 'kay, tree:]
  [Project ['kay]]
  [ LowerCaseSchema ]
  [  MetastoreRelation default, src, None]
  ```

  Author: Cheng Lian <lian.cs.zju@gmail.com>
  Closes #1294 from liancheng/safe-explain and squashes the following commits:
    4318911 [Cheng Lian] Don't throw TreeNodeException in `execution.ExplainCommand`
  (cherry picked from commit 544880457de556d1ad52e8cb7e1eca19da95f517)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* SPARK-2282: Reuse PySpark Accumulator sockets to avoid crashing Spark | Aaron Davidson | 2014-07-03 | 1 | -0/+2

  JIRA: https://issues.apache.org/jira/browse/SPARK-2282

  This issue is caused by a buildup of sockets in the TIME_WAIT stage of TCP, a stage that lasts for some period of time after the communication closes. This solution simply allows us to reuse sockets that are in TIME_WAIT, to avoid the buildup caused by rapid creation of these sockets.

  Author: Aaron Davidson <aaron@databricks.com>
  Closes #1220 from aarondav/SPARK-2282 and squashes the following commits:
    2e5cab3 [Aaron Davidson] SPARK-2282: Reuse PySpark Accumulator sockets to avoid crashing Spark
  (cherry picked from commit 97a0bfe1c0261384f09d53f9350de52fb6446d59)
  Signed-off-by: Patrick Wendell <pwendell@gmail.com>
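  The fix itself is on the Python side of PySpark, but the socket option is the same everywhere; a sketch of the idea via the standard Java API:

  ```scala
  import java.net.{InetSocketAddress, ServerSocket}

  // Allow rebinding an address whose previous sockets are still in
  // TIME_WAIT; the option must be set before bind().
  val server = new ServerSocket()
  server.setReuseAddress(true)
  server.bind(new InetSocketAddress(0)) // port 0: pick any free port
  ```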
* [SPARK-2307][Reprise] Correctly report RDD blocks on SparkUI | Andrew Or | 2014-07-03 | 6 | -23/+184

  **Problem.** The existing code in `ExecutorPage.scala` requires a linear scan through all the blocks to filter out the uncached ones. Every refresh could be expensive if there are many blocks and many executors.

  **Solution.** The proper semantics should be the following: `StorageStatusListener` should contain only block statuses that are cached. This means as soon as a block is unpersisted by any means, its status should be removed. This is reflected in the changes made in `StorageStatusListener.scala`.

  Further, the `StorageTab` must stop relying on the `StorageStatusListener` changing a dropped block's status to `StorageLevel.NONE` (which no longer happens). This is reflected in the changes made in `StorageTab.scala` and `StorageUtils.scala`.

  ----------

  If you have been following this chain of PRs like pwendell, you will quickly notice that this reverts the changes in #1249, which reverts the changes in #1080. In other words, we are adding back the changes from #1080, and fixing SPARK-2307 on top of those changes. Please ask questions if you are confused.

  Author: Andrew Or <andrewor14@gmail.com>
  Closes #1255 from andrewor14/storage-ui-fix-reprise and squashes the following commits:
    45416fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into storage-ui-fix-reprise
    a82ea25 [Andrew Or] Add tests for StorageStatusListener
    8773b01 [Andrew Or] Update comment / minor changes
    3afde3f [Andrew Or] Correctly report the number of blocks on SparkUI
  (cherry picked from commit 3894a49be9b532cc026d908a0f49bca850504498)
  Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-2350] Don't NPE while launching drivers | Aaron Davidson | 2014-07-03 | 1 | -1/+1

  Prior to this change, we could throw a NPE if we launch a driver while another one is waiting, because removing from an iterator while iterating over it is not safe.

  Author: Aaron Davidson <aaron@databricks.com>
  Closes #1289 from aarondav/master-fail and squashes the following commits:
    1cf1cf4 [Aaron Davidson] SPARK-2350: Don't NPE while launching drivers
  (cherry picked from commit 586feb5c9528042420f678f78bacb6c254a5eaf8)
  Signed-off-by: Patrick Wendell <pwendell@gmail.com>
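  The general bug class, sketched with stand-in names (not the actual Master code): mutate only after deciding on a snapshot.

  ```scala
  import scala.collection.mutable

  def canLaunch(driver: String): Boolean = driver.nonEmpty // stand-in predicate

  val waitingDrivers = mutable.ArrayBuffer("driver-1", "driver-2")

  // Unsafe: waitingDrivers.foreach(d => if (canLaunch(d)) waitingDrivers -= d)
  // Safe: decide on an immutable snapshot, then mutate the original.
  val launched = waitingDrivers.toList.filter(canLaunch)
  waitingDrivers --= launched
  ```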
* [SPARK-1097] Workaround Hadoop conf ConcurrentModification issue | Raymond Liu | 2014-07-03 | 1 | -2/+2

  Author: Raymond Liu <raymond.liu@intel.com>
  Closes #1273 from colorant/hadoopRDD and squashes the following commits:
    994e98b [Raymond Liu] Address comments
    e2cda3d [Raymond Liu] Workaround Hadoop conf ConcurrentModification issue
  (cherry picked from commit 5fa0a05763ab1d527efe20e3b10539ac5ffc36de)
  Signed-off-by: Aaron Davidson <aaron@databricks.com>
* Streaming programming guide typos | Clément MATHIEU | 2014-07-03 | 1 | -2/+2

  Fix a bad Java code sample and a broken link in the streaming programming guide.

  Author: Clément MATHIEU <clement@unportant.info>
  Closes #1286 from cykl/streaming-programming-guide-typos and squashes the following commits:
    b0908cb [Clément MATHIEU] Fix broken URL
    9d3c535 [Clément MATHIEU] Spark streaming requires at least two working threads (scala version was OK)
  (cherry picked from commit fdc4c112e7c2ac585d108d03209a642aa8bab7c8)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
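  The requirement behind the fixed sample, sketched in Scala (the guide's fix was to the Java sample; the constraint is the same):

  ```scala
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // A local streaming app needs at least two threads: one to run the
  // receiver and at least one to process the received data.
  val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")
  val ssc = new StreamingContext(conf, Seconds(1))
  ```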
* [SPARK-2109] Setting SPARK_MEM for bin/pyspark does not work. | Prashant Sharma | 2014-07-03 | 4 | -19/+18

  Trivial fix.

  Author: Prashant Sharma <prashant.s@imaginea.com>
  Closes #1050 from ScrapCodes/SPARK-2109/pyspark-script-bug and squashes the following commits:
    77072b9 [Prashant Sharma] Changed echos to redirect to STDERR.
    13f48a0 [Prashant Sharma] [SPARK-2109] Setting SPARK_MEM for bin/pyspark does not work.
  (cherry picked from commit 731f683b1bd8abbb83030b6bae14876658bbf098)
  Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-2342] Evaluation helper's output type doesn't conform to input type | Yijie Shen | 2014-07-03 | 1 | -1/+1

  The function `cast` doesn't conform to the intention of the comment "Those expressions are supposed to be in the same data type, and also the return type."

  Author: Yijie Shen <henry.yijieshen@gmail.com>
  Closes #1283 from yijieshen/master and squashes the following commits:
    c7aaa4b [Yijie Shen] [SPARK-2342] Evaluation helper's output type doesn't conform to input type
  (cherry picked from commit a9b52e5623f7fc77fca96b095f9eeaef76e35d54)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK] Fix NPE for ExternalAppendOnlyMap | Andrew Or | 2014-07-03 | 2 | -11/+46

  It did not handle null keys very gracefully before.

  Author: Andrew Or <andrewor14@gmail.com>
  Closes #1288 from andrewor14/fix-external and squashes the following commits:
    312b8d8 [Andrew Or] Abstract key hash code
    ed5adf9 [Andrew Or] Fix NPE for ExternalAppendOnlyMap
  (cherry picked from commit c480537739f9329ebfd580f09c69778e6c976366)
  Signed-off-by: Aaron Davidson <aaron@databricks.com>
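  The hardening in sketch form (illustrative): route key hashing through one helper that gives `null` a fixed bucket instead of dereferencing it.

  ```scala
  // Calling null.hashCode() throws a NullPointerException; map null keys
  // to a fixed hash so they land in a well-defined bucket.
  def keyHash(key: Any): Int =
    if (key == null) 0 else key.hashCode()
  ```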