spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-2412] CoalescedRDD throws exception with certain pref locs	Aaron Davidson	2014-07-17	2	-2/+16
If the first pass of CoalescedRDD does not find the target number of locations AND the second pass finds new locations, an exception is thrown, as "groupHash.get(nxt_replica).get" is not valid.

The fix is just to add an ArrayBuffer to groupHash for that replica if it didn't already exist.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1337 from aarondav/2412 and squashes the following commits:

f587b5d [Aaron Davidson] getOrElseUpdate
3ad8a3c [Aaron Davidson] [SPARK-2412] CoalescedRDD throws exception with certain pref locs
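The failure mode and the fix described above can be sketched with plain Scala collections. This is a minimal illustration, not the CoalescedRDD code: the `GroupHash` alias, the `Int` group ids, and both helper names are assumptions made for the example.

```scala
import scala.collection.mutable

object GroupHashSketch {
  // Hypothetical stand-in for groupHash: host name -> groups placed on that host.
  type GroupHash = mutable.HashMap[String, mutable.ArrayBuffer[Int]]

  // Mirrors the failing pattern: Option.get throws NoSuchElementException
  // when the host has no entry yet.
  def addUnsafe(groupHash: GroupHash, host: String, group: Int): Unit =
    groupHash.get(host).get += group

  // Mirrors the fix: create the buffer on first use, then append to it.
  def addSafe(groupHash: GroupHash, host: String, group: Int): Unit =
    groupHash.getOrElseUpdate(host, mutable.ArrayBuffer.empty[Int]) += group
}
```

`getOrElseUpdate` evaluates its second argument only when the key is missing, so existing buffers are never replaced.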
*	[SPARK-2154] Schedule next Driver when one completes (standalone mode)	Aaron Davidson	2014-07-16	1	-0/+1
Author: Aaron Davidson <aaron@databricks.com>

Closes #1405 from aarondav/2154 and squashes the following commits:

24e9ef9 [Aaron Davidson] [SPARK-2154] Schedule next Driver when one completes (standalone mode)
*	SPARK-1097: Do not introduce deadlock while fixing concurrency bug	Aaron Davidson	2014-07-16	1	-2/+5
We recently added this lock on 'conf' in order to prevent concurrent creation. However, it turns out that this can introduce a deadlock because Hadoop also synchronizes on the Configuration objects when creating new Configurations (and they do so via a static REGISTRY which contains all created Configurations).

This fix forces all Spark initialization of Configuration objects to occur serially by using a static lock that we control, and thus also prevents introducing the deadlock.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1409 from aarondav/1054 and squashes the following commits:

7d1b769 [Aaron Davidson] SPARK-1097: Do not introduce deadlock while fixing concurrency bug
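The idea of serializing creation behind a lock that only our own code ever takes can be sketched as follows. `FakeConf` is a hypothetical stand-in for a Hadoop Configuration, and the object and method names are assumptions for illustration only.

```scala
object ConfInitSketch {
  // Hypothetical stand-in for a Configuration-like object.
  final class FakeConf(val id: Int)

  // A single process-wide lock owned by this code. No third-party library
  // synchronizes on it, so it cannot take part in a lock-order cycle with
  // locks that the library acquires internally.
  private val initLock = new Object
  private var created = 0

  // All construction goes through the same lock, so creations happen serially.
  def newConf(): FakeConf = initLock.synchronized {
    created += 1
    new FakeConf(created)
  }
}
```

Locking on a private object rather than on the shared `conf` instance is the design choice the commit message describes; the sketch does not reproduce Hadoop's own locking.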
*	[SPARK-2317] Improve task logging.	Reynold Xin	2014-07-16	10	-76/+78
We use TID to indicate task logging. However, TID itself does not capture stage or retries, making it harder to correlate with the application itself. This pull request changes all logging messages for tasks to include both the TID and the stage id, stage attempt, task id, and task attempt. I've consulted various people but unfortunately this is a really hard task.

Driver log looks like:

```
14/06/28 18:53:29 INFO DAGScheduler: Submitting 10 missing tasks from Stage 0 (MappedRDD[1] at map at <console>:13)
14/06/28 18:53:29 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
14/06/28 18:53:29 INFO TaskSetManager: Re-computing pending task lists.
14/07/15 19:44:40 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, PROCESS_LOCAL, 1855 bytes)
14/07/15 19:44:40 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1855 bytes)
14/07/15 19:44:40 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1855 bytes)
14/07/15 19:44:40 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1855 bytes)
14/07/15 19:44:40 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 4, localhost, PROCESS_LOCAL, 1855 bytes)
14/07/15 19:44:40 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 5, localhost, PROCESS_LOCAL, 1855 bytes)
14/07/15 19:44:40 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 6, localhost, PROCESS_LOCAL, 1855 bytes)
...
14/07/15 19:44:40 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) in 64 ms on localhost (4/10)
14/07/15 19:44:40 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 4) in 63 ms on localhost (5/10)
14/07/15 19:44:40 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 2) in 63 ms on localhost (6/10)
14/07/15 19:44:40 INFO TaskSetManager: Finished task 7.0 in stage 1.0 (TID 7) in 62 ms on localhost (7/10)
14/07/15 19:44:40 INFO TaskSetManager: Finished task 6.0 in stage 1.0 (TID 6) in 63 ms on localhost (8/10)
14/07/15 19:44:40 INFO TaskSetManager: Finished task 9.0 in stage 1.0 (TID 9) in 8 ms on localhost (9/10)
14/07/15 19:44:40 INFO TaskSetManager: Finished task 8.0 in stage 1.0 (TID 8) in 9 ms on localhost (10/10)
```

Executor log looks like:

```
14/07/15 19:44:40 INFO Executor: Running task 0.0 in stage 1.0 (TID 0)
14/07/15 19:44:40 INFO Executor: Running task 3.0 in stage 1.0 (TID 3)
14/07/15 19:44:40 INFO Executor: Running task 1.0 in stage 1.0 (TID 1)
14/07/15 19:44:40 INFO Executor: Running task 4.0 in stage 1.0 (TID 4)
14/07/15 19:44:40 INFO Executor: Running task 2.0 in stage 1.0 (TID 2)
14/07/15 19:44:40 INFO Executor: Running task 5.0 in stage 1.0 (TID 5)
14/07/15 19:44:40 INFO Executor: Running task 6.0 in stage 1.0 (TID 6)
14/07/15 19:44:40 INFO Executor: Running task 7.0 in stage 1.0 (TID 7)
14/07/15 19:44:40 INFO Executor: Finished task 3.0 in stage 1.0 (TID 3). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 2.0 in stage 1.0 (TID 2). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 1.0 in stage 1.0 (TID 1). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 5.0 in stage 1.0 (TID 5). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 4.0 in stage 1.0 (TID 4). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 6.0 in stage 1.0 (TID 6). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 7.0 in stage 1.0 (TID 7). 847 bytes result sent to driver
```

Author: Reynold Xin <rxin@apache.org>

Closes #1259 from rxin/betterTaskLogging and squashes the following commits:

c28ada1 [Reynold Xin] Fix unit test failure.
987d043 [Reynold Xin] Updated log messages.
c6cfd46 [Reynold Xin] Merge branch 'master' into betterTaskLogging
b7b1bcc [Reynold Xin] Fixed a typo.
f9aba3c [Reynold Xin] Made it compile.
f8a5c06 [Reynold Xin] Merge branch 'master' into betterTaskLogging
07264e6 [Reynold Xin] Defensive check against unknown TaskEndReason.
76bbd18 [Reynold Xin] FailureSuite not serializable reporting.
4659b20 [Reynold Xin] Remove unused variable.
53888e3 [Reynold Xin] [SPARK-2317] Improve task logging.
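The task label that appears in the log lines above can be sketched as a small formatting helper. The helper name and parameter names are hypothetical, and reading "0.0 in stage 1.0" as task id, task attempt, stage id, and stage attempt is an interpretation of the commit message, not something taken from the Spark source.

```scala
object TaskLogSketch {
  // Hypothetical helper: renders "task <id>.<attempt> in stage <stage>.<stageAttempt> (TID <tid>)".
  def taskName(taskId: Int, taskAttempt: Int, stageId: Int, stageAttempt: Int, tid: Long): String =
    s"task $taskId.$taskAttempt in stage $stageId.$stageAttempt (TID $tid)"
}
```

Building the label in one place keeps driver and executor messages consistent, which is what makes the two logs easy to correlate.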
*	fix compile error of streaming project	James Z.M. Gao	2014-07-16	1	-1/+2
explicit return type for implicit function

Author: James Z.M. Gao <gaozhm@mediav.com>

Closes #153 from gzm55/work/streaming-compile and squashes the following commits:

11e9c8d [James Z.M. Gao] fix style error
fe88109 [James Z.M. Gao] fix compile error of streaming project
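Why an explicit result type matters for an implicit conversion can be shown with a small sketch. All names here are hypothetical and unrelated to the streaming code; the point is only that the conversion is used before it is defined in the same file, which the compiler accepts when the result type is written out.

```scala
import scala.language.implicitConversions

object ImplicitTypeSketch {
  final class RichPairs(val pairs: Seq[(String, Int)]) {
    def totals: Map[String, Int] =
      pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  }

  // Use site comes first: `totals` is only reachable through the conversion below.
  def demo: Map[String, Int] = Seq("a" -> 1, "a" -> 2, "b" -> 3).totals

  // Declared after its use, so the result type is spelled out explicitly.
  implicit def toRichPairs(pairs: Seq[(String, Int)]): RichPairs = new RichPairs(pairs)
}
```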
*	[SPARK-2522] set default broadcast factory to torrent	Xiangrui Meng	2014-07-16	2	-2/+2
HttpBroadcastFactory is the current default broadcast factory. It sends the broadcast data to each worker one by one, which is slow when the cluster is big. TorrentBroadcastFactory scales much better than http. Maybe we should make torrent the default broadcast method.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1437 from mengxr/bt-broadcast and squashes the following commits:

ed492fe [Xiangrui Meng] set default broadcast factory to torrent
*	[SPARK-2517] Remove some compiler warnings.	Reynold Xin	2014-07-16	5	-24/+25
Author: Reynold Xin <rxin@apache.org>

Closes #1433 from rxin/compile-warning and squashes the following commits:

8d0b890 [Reynold Xin] Remove some compiler warnings.
*	[SPARK-2518][SQL] Fix foldability of Substring expression.	Takuya UESHIN	2014-07-16	2	-3/+14
This is a follow-up of #1428.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1432 from ueshin/issues/SPARK-2518 and squashes the following commits:

37d1ace [Takuya UESHIN] Fix foldability of Substring expression.
*	SPARK-2519. Eliminate pattern-matching on Tuple2 in performance-critical...	Sandy Ryza	2014-07-16	2	-9/+11
... aggregation code

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1435 from sryza/sandy-spark-2519 and squashes the following commits:

640706a [Sandy Ryza] SPARK-2519. Eliminate pattern-matching on Tuple2 in performance-critical aggregation code
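The two styles named in the title can be compared in a small sketch on plain Scala iterators. Neither function is the Spark aggregation code, and the sketch makes no claim about how much faster the accessor form is in any particular workload; it only shows that the two forms compute the same result.

```scala
import scala.collection.mutable

object TupleAccessSketch {
  // Pattern-matching form: each element is destructured with `case (k, v)`.
  def sumByKeyMatch(pairs: Iterator[(String, Int)]): Map[String, Int] = {
    val acc = mutable.HashMap.empty[String, Int]
    pairs.foreach { case (k, v) => acc(k) = acc.getOrElse(k, 0) + v }
    acc.toMap
  }

  // Accessor form: reads the tuple fields directly with _1 and _2.
  def sumByKeyDirect(pairs: Iterator[(String, Int)]): Map[String, Int] = {
    val acc = mutable.HashMap.empty[String, Int]
    while (pairs.hasNext) {
      val kv = pairs.next()
      acc(kv._1) = acc.getOrElse(kv._1, 0) + kv._2
    }
    acc.toMap
  }
}
```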
*	[SQL] Cleaned up ConstantFolding slightly.	Reynold Xin	2014-07-16	1	-17/+28
Moved a couple of rules out of NullPropagation and added more comments.

Author: Reynold Xin <rxin@apache.org>

Closes #1430 from rxin/sql-folding-rule and squashes the following commits:

7f9a197 [Reynold Xin] Updated documentation for ConstantFolding.
7f8cf61 [Reynold Xin] [SQL] Cleaned up ConstantFolding slightly.
*	[SPARK-2525][SQL] Remove as many compilation warning messages as possible in ↵	Yin Huai	2014-07-16	3	-19/+19
Spark SQL

JIRA: https://issues.apache.org/jira/browse/SPARK-2525.

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1444 from yhuai/SPARK-2517 and squashes the following commits:

edbac3f [Yin Huai] Removed some compiler type erasure warnings.
*	Tightening visibility for various Broadcast related classes.	Reynold Xin	2014-07-16	5	-35/+36
In preparation for SPARK-2521.

Author: Reynold Xin <rxin@apache.org>

Closes #1438 from rxin/broadcast and squashes the following commits:

432f1cc [Reynold Xin] Tightening visibility for various Broadcast related classes.
*	SPARK-2277: make TaskScheduler track hosts on rack	Rui Li	2014-07-16	3	-5/+83
Hi mateiz, I've created [SPARK-2277](https://issues.apache.org/jira/browse/SPARK-2277) to make TaskScheduler track hosts on each rack. Please help to review, thanks.

Author: Rui Li <rui.li@intel.com>

Closes #1212 from lirui-intel/trackHostOnRack and squashes the following commits:

2b4bd0f [Rui Li] SPARK-2277: refine UT
fbde838 [Rui Li] SPARK-2277: add UT
7bbe658 [Rui Li] SPARK-2277: rename the method
5e4ef62 [Rui Li] SPARK-2277: remove unnecessary import
79ac750 [Rui Li] SPARK-2277: make TaskScheduler track hosts on rack
*	[SPARK-2119][SQL] Improved Parquet performance when reading off S3	Cheng Lian	2014-07-16	3	-50/+125
JIRA issue: [SPARK-2119](https://issues.apache.org/jira/browse/SPARK-2119)

Essentially this PR fixed three issues to gain much better performance when reading a large Parquet file off S3.

1. When reading the schema, fetch Parquet metadata from a part-file rather than the `_metadata` file

   The `_metadata` file contains metadata of all row groups, and can be very large if there are many row groups. Since schema information and row group metadata are coupled within a single Thrift object, we have to read the whole `_metadata` to fetch the schema. On the other hand, the schema is replicated among the footers of all part-files, which are fairly small.

2. Only add the root directory of the Parquet file rather than all the part-files to input paths

   The HDFS API can automatically filter out all hidden files and underscore files (`_SUCCESS` & `_metadata`), so there's no need to filter out all part-files and add them individually to input paths. What makes it much worse is that `FileInputFormat.listStatus()` calls `FileSystem.globStatus()` on each individual input path sequentially, each resulting in a blocking remote S3 HTTP request.

3. Worked around [PARQUET-16](https://issues.apache.org/jira/browse/PARQUET-16)

   Essentially PARQUET-16 is similar to the above issue, and results in lots of sequential `FileSystem.getFileStatus()` calls, which are further translated into a bunch of remote S3 HTTP requests. `FilteringParquetRowInputFormat` should be cleaned up once PARQUET-16 is fixed.

Below is the micro benchmark result. The dataset used is an S3 Parquet file consisting of 3,793 partitions, about 110MB per partition on average. The benchmark is done with a 9-node AWS cluster.

- Creating a Parquet `SchemaRDD` (Parquet schema is fetched)

  ```scala
  val tweets = parquetFile(uri)
  ```

  - Before: 17.80s
  - After: 8.61s

- Fetching partition information

  ```scala
  tweets.getPartitions
  ```

  - Before: 700.87s
  - After: 21.47s

- Counting the whole file (both steps above are executed altogether)

  ```scala
  parquetFile(uri).count()
  ```

  - Before: ??? (haven't tested yet)
  - After: 53.26s

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1370 from liancheng/faster-parquet and squashes the following commits:

94a2821 [Cheng Lian] Added comments about schema consistency
d2c4417 [Cheng Lian] Worked around PARQUET-16 to improve Parquet performance
1c0d1b9 [Cheng Lian] Accelerated Parquet schema retrieving
5bd3d29 [Cheng Lian] Fixed Parquet log level
*	[SPARK-2504][SQL] Fix nullability of Substring expression.	Takuya UESHIN	2014-07-15	2	-16/+22
This is a follow-up of #1359 with nullability narrowing.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1426 from ueshin/issues/SPARK-2504 and squashes the following commits:

5157832 [Takuya UESHIN] Remove unnecessary white spaces.
80958ac [Takuya UESHIN] Fix nullability of Substring expression.
*	[SPARK-2509][SQL] Add optimization for Substring.	Takuya UESHIN	2014-07-15	1	-0/+3
`Substring` including `null` literal cases could be added to `NullPropagation`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1428 from ueshin/issues/SPARK-2509 and squashes the following commits:

d9eb85f [Takuya UESHIN] Add Substring cases to NullPropagation.
*	[SPARK-2314][SQL] Override collect and take in JavaSchemaRDD, forwarding to ↵	Aaron Staple	2014-07-15	1	-0/+16
SchemaRDD implementations.

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #1421 from staple/SPARK-2314 and squashes the following commits:

73e04dc [Aaron Staple] [SPARK-2314] Override collect and take in JavaSchemaRDD, forwarding to SchemaRDD implementations.
*	follow pep8 None should be compared using is or is not	Ken Takagiwa	2014-07-15	4	-7/+7
http://legacy.python.org/dev/peps/pep-0008/

## Programming Recommendations
- Comparisons to singletons like None should always be done with is or is not, never the equality operators.

Author: Ken Takagiwa <ken@Kens-MacBook-Pro.local>

Closes #1422 from giwa/apache_master and squashes the following commits:

7b361f3 [Ken Takagiwa] follow pep8 None should be checked using is or is not
*	[SPARK-2500] Move the logInfo for registering BlockManager to ↵	Henry Saputra	2014-07-15	1	-3/+4
BlockManagerMasterActor.register method

PR for SPARK-2500

Move the logInfo call for BlockManager to BlockManagerMasterActor.register instead of the BlockManagerInfo constructor.

Previously the logInfo call for registering a BlockManager happened in the BlockManagerInfo constructor. This is confusing because code could call "new BlockManagerInfo" without actually registering a BlockManager, which could mislead someone reading the log files.

Author: Henry Saputra <henry.saputra@gmail.com>

Closes #1424 from hsaputra/move_registerblockmanager_log_to_registration_method and squashes the following commits:

3370b4a [Henry Saputra] Move the loginfo for BlockManager to BlockManagerMasterActor.register instead of BlockManagerInfo constructor.
*	[SPARK-2469] Use Snappy (instead of LZF) for default shuffle compression codec	Reynold Xin	2014-07-15	2	-3/+3
This reduces shuffle compression memory usage by 3x.

Author: Reynold Xin <rxin@apache.org>

Closes #1415 from rxin/snappy and squashes the following commits:

06c1a01 [Reynold Xin] SPARK-2469: Use Snappy (instead of LZF) for default shuffle compression codec.
*	[SPARK-2498] [SQL] Synchronize on a lock when using scala reflection inside ↵	Zongheng Yang	2014-07-15	1	-15/+19
data type objects.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2498

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1423 from concretevitamin/scala-ref-catalyst and squashes the following commits:

325a149 [Zongheng Yang] Synchronize on a lock when initializing data type objects in Catalyst.
*	[SQL] Attribute equality comparisons should be done by exprId.	Michael Armbrust	2014-07-15	1	-1/+5
Author: Michael Armbrust <michael@databricks.com>

Closes #1414 from marmbrus/exprIdResolution and squashes the following commits:

97b47bc [Michael Armbrust] Attribute equality comparisons should be done by exprId.
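Equality by identifier rather than by name can be sketched with a simplified class. `Attr` is a hypothetical stand-in and not Catalyst's attribute type; the `Long` id is likewise an assumption made for the example.

```scala
object ExprIdSketch {
  // Hypothetical simplified attribute: two attributes are the same
  // exactly when they carry the same exprId, whatever their names are.
  final class Attr(val name: String, val exprId: Long) {
    override def equals(other: Any): Boolean = other match {
      case that: Attr => this.exprId == that.exprId
      case _          => false
    }
    // Kept consistent with equals so hash-based collections behave correctly.
    override def hashCode: Int = exprId.hashCode
  }
}
```

With this definition, an attribute renamed by an alias still compares equal to its original, while two unrelated attributes that happen to share a name do not.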
*	SPARK-2407: Added internal implementation of SQL SUBSTR()	William Benton	2014-07-15	3	-3/+128
This replaces the Hive UDF for SUBSTR(ING) with an implementation in Catalyst and adds tests to verify correct operation.

Author: William Benton <willb@redhat.com>

Closes #1359 from willb/internalSqlSubstring and squashes the following commits:

ccedc47 [William Benton] Fixed too-long line.
a30a037 [William Benton] replace view bounds with implicit parameters
ec35c80 [William Benton] Adds fixes from review:
4f3bfdb [William Benton] Added internal implementation of SQL SUBSTR()
*	[SPARK-2474][SQL] For a registered table in OverrideCatalog, the Analyzer ↵	Yin Huai	2014-07-15	2	-1/+26
failed to resolve references in the format of "tableName.fieldName"

Please refer to JIRA (https://issues.apache.org/jira/browse/SPARK-2474) for how to reproduce the problem and my understanding of the root cause.

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1406 from yhuai/SPARK-2474 and squashes the following commits:

96b1627 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2474
af36d65 [Yin Huai] Fix comment.
be86ba9 [Yin Huai] Correct SQL console settings.
c43ad00 [Yin Huai] Wrap the relation in a Subquery named by the table name in OverrideCatalog.lookupRelation.
a5c2145 [Yin Huai] Support sql/console.
*	[SQL] Whitelist more Hive tests.	Michael Armbrust	2014-07-15	105	-0/+163
Author: Michael Armbrust <michael@databricks.com>

Closes #1396 from marmbrus/moreTests and squashes the following commits:

6660b60 [Michael Armbrust] Blacklist a test that requires DFS command.
8b6001c [Michael Armbrust] Add golden files.
ccd8f97 [Michael Armbrust] Whitelist more tests.
*	[SPARK-2483][SQL] Fix parsing of repeated, nested data access.	Michael Armbrust	2014-07-15	2	-6/+9
Author: Michael Armbrust <michael@databricks.com>

Closes #1411 from marmbrus/nestedRepeated and squashes the following commits:

044fa09 [Michael Armbrust] Fix parsing of repeated, nested data access.
*	[SPARK-2471] remove runtime scope for jets3t	Xiangrui Meng	2014-07-15	1	-1/+0
The assembly jar (built by sbt) doesn't include jets3t if we set it to runtime only, but I don't know whether it was set this way for a particular reason.

CC: srowen ScrapCodes

Author: Xiangrui Meng <meng@databricks.com>

Closes #1402 from mengxr/jets3t and squashes the following commits:

bfa2d17 [Xiangrui Meng] remove runtime scope for jets3t
*	Added LZ4 to compression codec in configuration page.	Reynold Xin	2014-07-15	1	-5/+4
Author: Reynold Xin <rxin@apache.org>

Closes #1417 from rxin/lz4 and squashes the following commits:

472f6a1 [Reynold Xin] Set the proper default.
9cf0b2f [Reynold Xin] Added LZ4 to compression codec in configuration page.
*	SPARK-1291: Link the spark UI to RM ui in yarn-client mode	witgo	2014-07-15	6	-7/+71
Author: witgo <witgo@qq.com>

Closes #1112 from witgo/SPARK-1291 and squashes the following commits:

6022bcd [witgo] review commit
1fbb925 [witgo] add addAmIpFilter to yarn alpha
210299c [witgo] review commit
1b92a07 [witgo] review commit
6896586 [witgo] Add comments to addWebUIFilter
3e9630b [witgo] review commit
142ee29 [witgo] review commit
1fe7710 [witgo] Link the spark UI to RM ui in yarn-client mode
*	SPARK-2480: Resolve sbt warnings "NOTE: SPARK_YARN is deprecated, please use ↵	witgo	2014-07-15	4	-10/+9
-Pyarn flag"

Author: witgo <witgo@qq.com>

Closes #1404 from witgo/run-tests and squashes the following commits:

f703aee [witgo] fix Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type
2944f51 [witgo] Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"
ef59c70 [witgo] fix Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type
6cefee5 [witgo] Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"
*	Reformat multi-line closure argument.	William Benton	2014-07-15	1	-2/+3
Author: William Benton <willb@redhat.com>

Closes #1419 from willb/reformat-2486 and squashes the following commits:

2676231 [William Benton] Reformat multi-line closure argument.
*	[MLLIB] [SPARK-2222] Add multiclass evaluation metrics	Alexander Ulanov	2014-07-15	2	-0/+280
Adding two classes:
1) MulticlassMetrics implements various multiclass evaluation metrics
2) MulticlassMetricsSuite implements unit tests for MulticlassMetrics

Author: Alexander Ulanov <nashb@yandex.ru>
Author: unknown <ulanov@ULANOV1.emea.hpqcorp.net>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1155 from avulanov/master and squashes the following commits:

2eae80f [Alexander Ulanov] Merge pull request #1 from mengxr/avulanov-master
5ebeb08 [Xiangrui Meng] minor updates
79c3555 [Alexander Ulanov] Addressing reviewers comments mengxr
0fa9511 [Alexander Ulanov] Addressing reviewers comments mengxr
f0dadc9 [Alexander Ulanov] Addressing reviewers comments mengxr
4811378 [Alexander Ulanov] Removing println
87fb11f [Alexander Ulanov] Addressing reviewers comments mengxr. Added confusion matrix
e3db569 [Alexander Ulanov] Addressing reviewers comments mengxr. Added true positive rate and false positive rate. Test suite code style.
a7e8bf0 [Alexander Ulanov] Addressing reviewers comments mengxr
c3a77ad [Alexander Ulanov] Addressing reviewers comments mengxr
e2c91c3 [Alexander Ulanov] Fixes to mutliclass metics
d5ce981 [unknown] Comments about Double
a5c8ba4 [unknown] Unit tests. Class rename
fcee82d [unknown] Unit tests. Class rename
d535d62 [unknown] Multiclass evaluation
*	README update: added "for Big Data".	Reynold Xin	2014-07-15	1	-1/+1
*	Update README.md to include a slightly more informative project description.	Reynold Xin	2014-07-15	1	-1/+8
(cherry picked from commit 401083be9f010f95110a819a49837ecae7d9c4ec)
Signed-off-by: Reynold Xin <rxin@apache.org>
*	[SPARK-2477][MLlib] Using appendBias for adding intercept in ↵	DB Tsai	2014-07-15	1	-16/+5
GeneralizedLinearAlgorithm

Instead of the prependOne currently used in GeneralizedLinearAlgorithm, we would like to use appendBias for 1) keeping the indices of the original training set unchanged by adding the intercept as the last element of the vector and 2) using the same public API for consistently adding the intercept.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1410 from dbtsai/SPARK-2477_intercept_with_appendBias and squashes the following commits:

011432c [DB Tsai] From Alpine Data Labs
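The index argument in the message above is easy to see on plain arrays. The two functions below are simplified stand-ins that borrow the names from the commit message; they operate on `Array[Double]` for illustration and are not the MLlib implementations.

```scala
object BiasSketch {
  // Intercept term first: every original feature index shifts by one.
  def prependOne(features: Array[Double]): Array[Double] = 1.0 +: features

  // Intercept term last: original feature indices are unchanged.
  def appendBias(features: Array[Double]): Array[Double] = features :+ 1.0
}
```

After `appendBias`, feature `i` of the input is still at position `i`, so weights learned on the extended vector line up with the original columns.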
*	[SPARK-2399] Add support for LZ4 compression.	Reynold Xin	2014-07-15	5	-1/+46
Based on Greg Bowyer's patch from JIRA https://issues.apache.org/jira/browse/SPARK-2399

Author: Reynold Xin <rxin@apache.org>

Closes #1416 from rxin/lz4 and squashes the following commits:

6c8fefe [Reynold Xin] Fixed typo.
8a14d38 [Reynold Xin] [SPARK-2399] Add support for LZ4 compression.
*	discarded exceeded completedDrivers	lianhuiwang	2014-07-15	1	-0/+5
When the number of completedDrivers exceeds the threshold, the first Max(spark.deploy.retainedDrivers, 1) will be discarded.

Author: lianhuiwang <lianhuiwang09@gmail.com>

Closes #1114 from lianhuiwang/retained-drivers and squashes the following commits:

8789418 [lianhuiwang] discarded exceeded completedDrivers
*	[SPARK-2485][SQL] Lock usage of hive client.	Michael Armbrust	2014-07-15	1	-2/+3
Author: Michael Armbrust <michael@databricks.com>

Closes #1412 from marmbrus/lockHiveClient and squashes the following commits:

4bc9d5a [Michael Armbrust] protected[hive]
22e9177 [Michael Armbrust] Add comments.
7aa8554 [Michael Armbrust] Don't lock on hive's object.
a6edc5f [Michael Armbrust] Lock usage of hive client.
*	[SPARK-2390] Files in staging directory cannot be deleted and wastes the ↵	Kousuke Saruta	2014-07-14	1	-1/+0
space of HDFS

When running jobs in YARN cluster mode with the HistoryServer, the files in the staging directory (~/.sparkStaging on HDFS) cannot be deleted. The HistoryServer uses the directory where the event log is written, and that directory is represented as an instance of o.a.h.f.FileSystem created by FileSystem.get.

ApplicationMaster also has an instance named fs, likewise created by FileSystem.get.

FileSystem.get returns the same cached instance when the URI passed to it represents the same file system and the method is called by the same user. Because of this behavior, when the event log directory is on HDFS, the fs of ApplicationMaster and the fileSystem of FileLogger are the same instance. When ApplicationMaster shuts down, fileSystem.close is called in FileLogger#stop, which is invoked indirectly by SparkContext#stop.

ApplicationMaster#cleanupStagingDir is also called by a JVM shutdown hook, and it invokes fs.delete(stagingDirPath). Because fs.delete in ApplicationMaster is called after fileSystem.close in FileLogger, fs.delete fails and the files in the staging directory are not deleted.

I think calling fileSystem.close is not needed.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1326 from sarutak/SPARK-2390 and squashes the following commits:

10e1a88 [Kousuke Saruta] Removed fileSystem.close from FileLogger.scala not to prevent any other FileSystem operation
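The interaction described above, where two owners unknowingly share one cached handle, can be reproduced in a self-contained sketch. `Handle`, the cache keyed by URI string, and the exception message are all hypothetical; this models the behavior the commit message describes and does not use the Hadoop API.

```scala
import java.io.IOException
import scala.collection.mutable

object SharedHandleSketch {
  // Hypothetical file-system handle: every operation fails once it is closed.
  final class Handle(val uri: String) {
    private var closed = false
    def close(): Unit = { closed = true }
    def delete(path: String): Boolean = {
      if (closed) throw new IOException(s"handle for $uri is closed")
      true
    }
  }

  // Hypothetical cache: the same URI always yields the same instance.
  private val cache = mutable.HashMap.empty[String, Handle]

  def get(uri: String): Handle = cache.synchronized {
    cache.getOrElseUpdate(uri, new Handle(uri))
  }
}
```

If a logger obtains a handle with `get`, closes it on shutdown, and a cleanup hook later calls `delete` on what it believes is its own handle, the delete fails. Not closing a handle that one does not exclusively own avoids the problem.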
*	Add/increase severity of warning in documentation of groupBy()	Aaron Davidson	2014-07-14	2	-9/+21
groupBy()/groupByKey() is notorious for being a very convenient API that can lead to poor performance when used incorrectly.

This PR just makes it clear that users should be cautious not to rely on this API when they really want a different (more performant) one, such as reduceByKey().

(Note that one source of confusion is the name; this groupBy() is not the same as a SQL GROUP-BY, which is used for aggregation and is more similar in nature to Spark's reduceByKey().)

Author: Aaron Davidson <aaron@databricks.com>

Closes #1380 from aarondav/warning and squashes the following commits:

f60da39 [Aaron Davidson] Give better advice
d0afb68 [Aaron Davidson] Add/increase severity of warning in documentation of groupBy()
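The difference between grouping and reducing can be sketched on ordinary Scala collections. This is an analogy only: it runs on a local `Seq`, not on an RDD, and the function names are made up for the example. The first form collects every value for a key before summing; the second keeps a single running value per key.

```scala
object GroupVsReduceSketch {
  // groupBy-style: materializes all values of a key, then aggregates them.
  def sumViaGroup(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // reduceByKey-style: folds each value into one running total per key.
  def sumViaReduce(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }
}
```

Both produce the same totals; the second never holds more than one value per key, which is the property that makes the reduce-style API the better fit for aggregation.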
*	SPARK-2486: Utils.getCallSite is now resilient to bogus frames	William Benton	2014-07-14	1	-1/+5
When running Spark under certain instrumenting profilers, Utils.getCallSite could crash with an NPE. This commit makes it more resilient to failures occurring while inspecting stack frames.

Author: William Benton <willb@redhat.com>

Closes #1413 from willb/spark-2486 and squashes the following commits:

b7c0274 [William Benton] Use explicit null checks instead of Try()
0f0c1ae [William Benton] Utils.getCallSite is now resilient to bogus frames
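Defensive frame inspection with explicit null checks can be sketched as follows. The function name, the prefix parameter, and the returned string format are assumptions for illustration; this is not the body of Utils.getCallSite.

```scala
object CallSiteSketch {
  // Returns "<class>.<method>" for the first frame outside `internalPrefix`,
  // skipping null frames and frames that report a null class name.
  def firstUserFrame(frames: Array[StackTraceElement], internalPrefix: String): Option[String] =
    frames.iterator
      .filter(f => f != null && f.getClassName != null)
      .find(f => !f.getClassName.startsWith(internalPrefix))
      .map(f => s"${f.getClassName}.${f.getMethodName}")
}
```

A call such as `CallSiteSketch.firstUserFrame(Thread.currentThread.getStackTrace, "java.")` then degrades to skipping a bogus frame rather than throwing.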
*	[SPARK-2467] Revert SparkBuild to publish-local to both .m2 and .ivy2.	Takuya UESHIN	2014-07-14	1	-1/+13
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1398 from ueshin/issues/SPARK-2467 and squashes the following commits:

7f01d58 [Takuya UESHIN] Revert SparkBuild to publish-local to both .m2 and .ivy2.
*	[SPARK-2446][SQL] Add BinaryType support to Parquet I/O.	Takuya UESHIN	2014-07-14	5	-45/+57
Note that this commit changes the semantics when loading in data that was created with prior versions of Spark SQL. Before, we were writing out strings as Binary data without adding any other annotations. Thus, when data is read in from prior versions, data that was StringType will now become BinaryType. Users that need strings can CAST that column to a String. It was decided that while this breaks compatibility, it does make us compatible with other systems (Hive, Thrift, etc) and adds support for Binary data, so this is the right decision long term.

To support `BinaryType`, the following changes are needed:
- Make `StringType` use `OriginalType.UTF8`
- Add `BinaryType` using `PrimitiveTypeName.BINARY` without `OriginalType`

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1373 from ueshin/issues/SPARK-2446 and squashes the following commits:

ecacb92 [Takuya UESHIN] Add BinaryType support to Parquet I/O.
616e04a [Takuya UESHIN] Make StringType use OriginalType.UTF8.
*	[SPARK-1946] Submit tasks after (configured ratio) executors have been ↵	li-zhihui	2014-07-14	13	-2/+127
registered

Because submitting tasks and registering executors are asynchronous, in most situations early stages' tasks run without preferred locality.

A simple solution is to sleep for a few seconds in the application so that executors have enough time to register.

This PR adds 2 configuration properties to make TaskScheduler submit tasks only after a configured ratio of executors have been registered.

# Submit tasks only after (registered executors / total executors) reaches the ratio; default value is 0
spark.scheduler.minRegisteredExecutorsRatio = 0.8

# Whether or not minRegisteredExecutorsRatio has been reached, submit tasks after maxRegisteredWaitingTime (milliseconds); default value is 30000
spark.scheduler.maxRegisteredExecutorsWaitingTime = 5000

Author: li-zhihui <zhihui.li@intel.com>

Closes #900 from li-zhihui/master and squashes the following commits:

b9f8326 [li-zhihui] Add logs & edit docs
1ac08b1 [li-zhihui] Add new configs to user docs
22ead12 [li-zhihui] Move waitBackendReady to postStartHook
c6f0522 [li-zhihui] Bug fix: numExecutors wasn't set & use constant DEFAULT_NUMBER_EXECUTORS
4d6d847 [li-zhihui] Move waitBackendReady to TaskSchedulerImpl.start & some code refactor
0ecee9a [li-zhihui] Move waitBackendReady from DAGScheduler.submitStage to TaskSchedulerImpl.submitTasks
4261454 [li-zhihui] Add docs for new configs & code style
ce0868a [li-zhihui] Code style, rename configuration property name of minRegisteredRatio & maxRegisteredWaitingTime
6cfb9ec [li-zhihui] Code style, revert default minRegisteredRatio of yarn to 0, driver get --num-executors in yarn/alpha
812c33c [li-zhihui] Fix driver lost --num-executors option in yarn-cluster mode
e7b6272 [li-zhihui] support yarn-cluster
37f7dc2 [li-zhihui] support yarn mode(percentage style)
3f8c941 [li-zhihui] submit stage after (configured ratio of) executors have been registered
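The rule implied by the two properties can be written as a single predicate. This is a sketch of the decision as the commit message states it, with hypothetical names and parameters; it is not the scheduler backend code.

```scala
object BackendReadySketch {
  // Ready when enough executors have registered, or when the maximum
  // waiting time has elapsed regardless of how many have registered.
  def isReady(
      registered: Int,
      expected: Int,
      minRatio: Double,
      waitedMs: Long,
      maxWaitMs: Long): Boolean =
    registered >= expected * minRatio || waitedMs >= maxWaitMs
}
```

With a ratio of 0, the first condition is always true, so tasks are submitted immediately.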
*	[SPARK-2443][SQL] Fix slow read from partitioned tables	Zongheng Yang	2014-07-14	1	-3/+7
This fix obtains a performance boost comparable to [PR #1390](https://github.com/apache/spark/pull/1390) by moving an array update and deserializer initialization out of a potentially very long loop. Suggested by yhuai. The results below are updated for this fix.

## Benchmarks
Generated a local text file with 10M rows of simple key-value pairs. The data is loaded as a table through Hive. Results are obtained on my local machine using hive/console.

Without the fix:

Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s)

With this fix:

Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.57s (1.46s) | 11.0s (1.69s)
Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s)

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1408 from concretevitamin/slow-read-2 and squashes the following commits:

d86e437 [Zongheng Yang] Move update & initialization out of potentially long loop.
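Moving setup work out of a per-row loop can be illustrated with a counter that records how often the setup runs. Everything here is hypothetical: `makeParser` stands in for an expensive initialization such as building a deserializer, and the `Counter` exists only to make the difference observable.

```scala
object HoistSketch {
  final class Counter { var setups = 0 }

  // Hypothetical expensive setup; each call is counted.
  private def makeParser(c: Counter): String => Int = {
    c.setups += 1
    (s: String) => s.trim.toInt
  }

  // Setup inside the loop: runs once per row.
  def parsePerRow(rows: Seq[String], c: Counter): Seq[Int] =
    rows.map(r => makeParser(c)(r))

  // Setup hoisted out of the loop: runs once in total.
  def parseHoisted(rows: Seq[String], c: Counter): Seq[Int] = {
    val parse = makeParser(c)
    rows.map(parse)
  }
}
```

Both versions return the same values; only the number of setups differs, and that cost grows with the row count in the first version.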
*	move some test file to match src code	Daoyuan	2014-07-14	5	-25/+19
Just move some test suites to the corresponding packages

Author: Daoyuan <daoyuan.wang@intel.com>

Closes #1401 from adrian-wang/movetestfiles and squashes the following commits:

d1a6803 [Daoyuan] move some test file to match src code
*	Made rdd.py pep8 complaint by using Autopep8 and a little manual editing.	Prashant Sharma	2014-07-14	1	-58/+92
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #1354 from ScrapCodes/pep8-comp-1 and squashes the following commits:

9858ea8 [Prashant Sharma] Code Review
d8851b7 [Prashant Sharma] Found # noqa works even inside comment blocks. Not sure if it works with all versions of python.
10c0cef [Prashant Sharma] Made rdd.py pep8 complaint by using Autopep8 and a little manual tweaking.
*	SPARK-2363. Clean MLlib's sample data files	Sean Owen	2014-07-13	18	-16/+16
(Just made a PR for this, mengxr was the reporter of:)

MLlib has sample data under several folders:
1) data/mllib
2) data/
3) mllib/data/*

Per previous discussion with Matei Zaharia, we want to put them under `data/mllib` and clean outdated files.

Author: Sean Owen <sowen@cloudera.com>

Closes #1394 from srowen/SPARK-2363 and squashes the following commits:

54313dd [Sean Owen] Move ML example data from /mllib/data/ and /data/ into /data/mllib/
*	SPARK-2462. Make Vector.apply public.	Sandy Ryza	2014-07-12	1	-1/+1
Apologies if there's an already-discussed reason I missed for why this doesn't make sense.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1389 from sryza/sandy-spark-2462 and squashes the following commits:

2e5e201 [Sandy Ryza] SPARK-2462. Make Vector.apply public.
*	[SPARK-2405][SQL] Reusue same byte buffers when creating new instance of ↵	Michael Armbrust	2014-07-12	2	-12/+25
InMemoryRelation

Reuse byte buffers when creating unique attributes for multiple instances of an InMemoryRelation in a single query plan.

Author: Michael Armbrust <michael@databricks.com>

Closes #1332 from marmbrus/doubleCache and squashes the following commits:

4a19609 [Michael Armbrust] Clean up concurrency story by calculating buffersn the constructor.
b39c931 [Michael Armbrust] Allocations are kind of a side effect.
f67eff7 [Michael Armbrust] Reusue same byte buffers when creating new instance of InMemoryRelation