* [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (Cheng Lian, 2014-07-25; 54 files, -96/+1772)

  JIRA issue:

  - Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
  - Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)

  Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc). (Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)

  TODO

  - [x] Use `spark-submit` to launch the server, the CLI and beeline
  - [x] Migration guideline draft for Shark users

  ----

  Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:

  ```bash
  $ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
  ```

  This actually shows usage information of `SparkSubmit` rather than `BeeLine`.

  ~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~

  **UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert the changes for this bug since it involves more subtle considerations and is worth a separate PR.

  Author: Cheng Lian <lian.cs.zju@gmail.com>

  Closes #1399 from liancheng/thriftserver and squashes the following commits:

  090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
  21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
  fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
  199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
  1083e9d [Cheng Lian] Fixed failed test suites
  7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
  9cc0f06 [Cheng Lian] Starts beeline with spark-submit
  cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
  061880f [Cheng Lian] Addressed all comments by @pwendell
  7755062 [Cheng Lian] Adapts test suites to spark-submit settings
  40bafef [Cheng Lian] Fixed more license header issues
  e214aab [Cheng Lian] Added missing license headers
  b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
  f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
  3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
  a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
  61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
  2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
* [SPARK-2683] unidoc failed because org.apache.spark.util.CallSite uses Java keywords as value names (Yin Huai, 2014-07-25; 6 files, -14/+16)

  Renaming `short` to `shortForm` and `long` to `longForm`.

  JIRA: https://issues.apache.org/jira/browse/SPARK-2683

  Author: Yin Huai <huai@cse.ohio-state.edu>

  Closes #1585 from yhuai/SPARK-2683 and squashes the following commits:

  5ddb843 [Yin Huai] "short" and "long" are Java keywords. In order to generate javadoc, renaming "short" to "shortForm" and "long" to "longForm".
* Replace println with log4j (fireflyc, 2014-07-25; 1 file, -5/+7)

  Our program needs to receive a large amount of data and run for a long time. We set the log level to WARN, but messages such as "Storing iterator" and "received single" were still written to the log file (over YARN).

  Author: fireflyc <fireflyc@126.com>

  Closes #1372 from fireflyc/fix-replace-stdout-log and squashes the following commits:

  e684140 [fireflyc] 'info' modified into the 'debug'
  fa22a38 [fireflyc] replace println to log4j
* [SPARK-2665] [SQL] Add EqualNS & Unit Tests (Cheng Hao, 2014-07-25; 58 files, -3/+683)

  Hive supports the operator `<=>`, which returns the same result as the EQUAL (`=`) operator for non-null operands, but returns TRUE if both operands are NULL and FALSE if only one of them is NULL.

  Author: Cheng Hao <hao.cheng@intel.com>

  Closes #1570 from chenghao-intel/equalns and squashes the following commits:

  8d6c789 [Cheng Hao] Remove the test case orc_predicate_pushdown
  5b2ca88 [Cheng Hao] Add cases into whitelist
  8e66cdd [Cheng Hao] Rename EqualNS to EqualNullSafe
  7af4b0b [Cheng Hao] Add EqualNS & Unit Tests
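  For illustration, the `<=>` truth table can be sketched in Scala with `Option` modeling SQL NULL (this helper is illustrative, not part of the patch):

  ```scala
  // None models SQL NULL. Plain `=` would return NULL when either side is NULL;
  // `<=>` instead returns TRUE for NULL <=> NULL and FALSE when only one side is NULL.
  def equalNullSafe[A](left: Option[A], right: Option[A]): Boolean = (left, right) match {
    case (None, None)          => true       // both NULL => TRUE
    case (None, _) | (_, None) => false      // exactly one NULL => FALSE
    case (Some(a), Some(b))    => a == b     // non-null operands: same as `=`
  }

  // e.g. equalNullSafe(Some(1), Some(1)) == true; equalNullSafe(None, Some(1)) == false
  ```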
* [SPARK-2529] Clean closures in foreach and foreachPartition. (Reynold Xin, 2014-07-25; 1 file, -2/+4)

  Author: Reynold Xin <rxin@apache.org>

  Closes #1583 from rxin/closureClean and squashes the following commits:

  8982fe6 [Reynold Xin] [SPARK-2529] Clean closures in foreach and foreachPartition.
* SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy & cogroup (Matei Zaharia, 2014-07-25; 9 files, -43/+334)

  JIRA: https://issues.apache.org/jira/browse/SPARK-2657

  Our current code uses ArrayBuffers for each group of values in groupBy, as well as for the key's elements in CoGroupedRDD. ArrayBuffers have a lot of overhead if there are few values in them, which is likely to happen in cases such as join. In particular, they have a pointer to an Object[] of size 16 by default, which is 24 bytes for the array header + 128 for the pointers in there, plus at least 32 for the ArrayBuffer data structure.

  This patch replaces the per-group buffers with a CompactBuffer class that can store up to 2 elements more efficiently (in fields of itself) and acts like an ArrayBuffer beyond that. For a key's elements in CoGroupedRDD, we use an Array of CompactBuffers instead of an ArrayBuffer of ArrayBuffers.

  There are some changes throughout the code to deal with CoGroupedRDD returning Array instead. We can also decide not to do that, but CoGroupedRDD is a `DeveloperAPI` so I think it's okay to change it here.

  Author: Matei Zaharia <matei@databricks.com>

  Closes #1555 from mateiz/compact-groupby and squashes the following commits:

  845a356 [Matei Zaharia] Lower initial size of CompactBuffer's vector to 8
  07621a7 [Matei Zaharia] Review comments
  0c1cd12 [Matei Zaharia] Don't use varargs in CompactBuffer.apply
  bdc8a39 [Matei Zaharia] Small tweak to +=, and typos
  f61f040 [Matei Zaharia] Fix line lengths
  59da88b0 [Matei Zaharia] Fix line lengths
  197cde8 [Matei Zaharia] Make CompactBuffer extend Seq to make its toSeq more efficient
  775110f [Matei Zaharia] Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers
  9b4c6e8 [Matei Zaharia] Use CompactBuffer in CoGroupedRDD
  ed577ab [Matei Zaharia] Use CompactBuffer in groupByKey
  10f0de1 [Matei Zaharia] A CompactBuffer that's more memory-efficient than ArrayBuffer for small buffers
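  The core idea, sketched as a simplified Scala class (the real CompactBuffer grows a raw array rather than delegating to ArrayBuffer, and implements Seq; names here are illustrative):

  ```scala
  import scala.collection.mutable.ArrayBuffer

  // Simplified sketch: the first two elements live in fields of the object itself,
  // so groups of size <= 2 (common in joins) never allocate a backing array.
  class MiniCompactBuffer[T] {
    private var element0: T = _
    private var element1: T = _
    private var overflow: ArrayBuffer[T] = null  // only allocated for element 3+
    private var curSize = 0

    def +=(value: T): this.type = {
      curSize match {
        case 0 => element0 = value
        case 1 => element1 = value
        case _ =>
          if (overflow == null) overflow = new ArrayBuffer[T](8)
          overflow += value
      }
      curSize += 1
      this
    }

    def apply(i: Int): T = i match {
      case 0 => element0
      case 1 => element1
      case _ => overflow(i - 2)
    }

    def size: Int = curSize
  }
  ```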
* [SPARK-2656] Python version of stratified sampling (Doris Xin, 2014-07-24; 3 files, -6/+51)

  Exact sample size is not supported for now.

  Author: Doris Xin <doris.s.xin@gmail.com>

  Closes #1554 from dorx/pystratified and squashes the following commits:

  4ba927a [Doris Xin] use rel diff (+- 50%) instead of abs diff (+- 50)
  bdc3f8b [Doris Xin] updated unit to check sample holistically
  7713c7b [Doris Xin] Python version of stratified sampling
* [SPARK-2538] [PySpark] Hash based disk spilling aggregation (Davies Liu, 2014-07-24; 9 files, -25/+611)

  During aggregation in a Python worker, if memory usage rises above spark.executor.memory, the worker performs disk-spilling aggregation. It splits the aggregation into multiple stages; in each stage, it partitions the aggregated data by hash and dumps the partitions to disk. After all the data are aggregated, it merges all the stages together (partition by partition).

  Author: Davies Liu <davies.liu@gmail.com>

  Closes #1460 from davies/spill and squashes the following commits:

  cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
  37d71f7 [Davies Liu] balance the partitions
  902f036 [Davies Liu] add shuffle.py into run-tests
  dcf03a9 [Davies Liu] fix memory_info() of psutil
  67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
  f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
  e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
  400be01 [Davies Liu] address all the comments
  6178844 [Davies Liu] refactor and improve docs
  fdd0a49 [Davies Liu] add long doc string for ExternalMerger
  1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
  e6cc7f9 [Davies Liu] Merge branch 'master' into spill
  3652583 [Davies Liu] address comments
  e78a0a0 [Davies Liu] fix style
  24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
  57ee7ef [Davies Liu] update docs
  286aaff [Davies Liu] let spilled aggregation in Python configurable
  e9a40f6 [Davies Liu] recursive merger
  6edbd1f [Davies Liu] Hash based disk spilling aggregation
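  The overall algorithm, sketched in Scala (the patch itself is Python; in this language-neutral sketch, in-memory "spills" stand in for the on-disk files and an entry-count budget stands in for real memory accounting):

  ```scala
  import scala.collection.mutable

  // Sketch of hash-partitioned spilling aggregation. Each "spill" is an array of
  // hash buckets; the real implementation writes each bucket to its own disk file.
  def externalAggregate[K, V](
      records: Iterator[(K, V)],
      mergeValues: (V, V) => V,
      numPartitions: Int = 4,
      maxEntries: Int = 10000): Iterator[(K, V)] = {

    def bucketOf(k: K) = math.abs(k.hashCode % numPartitions)
    def put(m: mutable.HashMap[K, V], k: K, v: V): Unit =
      m(k) = m.get(k).map(mergeValues(_, v)).getOrElse(v)

    val spills = mutable.ArrayBuffer.empty[Array[mutable.HashMap[K, V]]]
    var current = mutable.HashMap.empty[K, V]

    records.foreach { case (k, v) =>
      put(current, k, v)
      if (current.size >= maxEntries) {   // over budget: partition by hash and spill
        val buckets = Array.fill(numPartitions)(mutable.HashMap.empty[K, V])
        current.foreach { case (k2, v2) => buckets(bucketOf(k2))(k2) = v2 }
        spills += buckets
        current = mutable.HashMap.empty[K, V]
      }
    }

    // Merge phase: handle one hash partition at a time, so only ~1/numPartitions
    // of the spilled data needs to be resident while merging.
    (0 until numPartitions).iterator.flatMap { p =>
      val merged = mutable.HashMap.empty[K, V]
      current.foreach { case (k, v) => if (bucketOf(k) == p) put(merged, k, v) }
      spills.foreach(spill => spill(p).foreach { case (k, v) => put(merged, k, v) })
      merged.iterator
    }
  }
  ```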
* [SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default (Prashant Sharma, 2014-07-24; 3 files, -3/+9)

  Author: Prashant Sharma <prashant.s@imaginea.com>

  Closes #1051 from ScrapCodes/SPARK-2014/pyspark-cache and squashes the following commits:

  f192df7 [Prashant Sharma] Code Review
  2a2f43f [Prashant Sharma] [SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default
* [SPARK-2464][Streaming] Fixed Twitter stream stopping bug (Tathagata Das, 2014-07-24; 1 file, -2/+7)

  Stopping the Twitter receiver calls twitter4j's TwitterStream.shutdown, which in turn causes an exception to be thrown to the listener. This exception caused the receiver to be restarted. This patch checks whether the receiver was stopped or not, and restarts it on exception only if it was not already stopped.

  Author: Tathagata Das <tathagata.das1565@gmail.com>

  Closes #1577 from tdas/twitter-stop and squashes the following commits:

  011b525 [Tathagata Das] Fixed Twitter stream stopping bug.
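  A minimal sketch of that guard, assuming the standard `Receiver` API (`isStopped()` and `restart(...)`); the class and method names below are illustrative, not the actual TwitterReceiver code:

  ```scala
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class SketchTwitterReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
    def onStart(): Unit = { /* open stream, register a twitter4j listener */ }
    def onStop(): Unit = { /* stream.shutdown() -- this itself makes twitter4j raise */ }

    // Invoked from the twitter4j listener whenever the stream throws.
    def onStreamException(e: Exception): Unit = {
      if (!isStopped()) {
        restart("Error receiving tweets", e)  // genuine failure: ask Spark to restart us
      }
      // else: the exception came from our own shutdown() call, so stay stopped
    }
  }
  ```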
* SPARK-2250: show stage RDDs in UI (Neville Li, 2014-07-24; 1 file, -4/+16)

  Author: Neville Li <neville@spotify.com>

  Closes #1188 from nevillelyh/neville/ui and squashes the following commits:

  d3ac425 [Neville Li] SPARK-2250: show persisted RDD in stage UI
  f075db9 [Neville Li] SPARK-2035: show call stack even when description is available
* [SPARK-2037]: yarn client mode doesn't support spark.yarn.max.executor.failures (GuoQiang Li, 2014-07-24; 3 files, -38/+115)

  Author: GuoQiang Li <witgo@qq.com>

  Closes #1180 from witgo/SPARK-2037 and squashes the following commits:

  3d52411 [GuoQiang Li] review commit
  7058f4d [GuoQiang Li] Correctly stop SparkContext
  6d0561f [GuoQiang Li] Fix: yarn client mode doesn't support spark.yarn.max.executor.failures
* [SPARK-2479 (partial)][MLLIB] fix binary metrics unit tests (Xiangrui Meng, 2014-07-24; 1 file, -9/+27)

  Allow small errors in comparison.

  @dbtsai, this unit test blocks https://github.com/apache/spark/pull/1562, so I may need to merge this one first. We can change it to use the tools in https://github.com/apache/spark/pull/1425 after that PR gets merged.

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #1576 from mengxr/fix-binary-metrics-unit-tests and squashes the following commits:

  5076a7f [Xiangrui Meng] fix binary metrics unit tests
* [SPARK-2603][SQL] Remove unnecessary toMap and toList in converting Java collections to Scala collections (JsonRDD.scala) (Yin Huai, 2014-07-24; 1 file, -6/+6)

  In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a Scala one. These two operations are pretty expensive because they read elements from a Java Map/List and then load them into a Scala Map/List. We can use Scala wrappers to wrap those Java collections instead of using toMap/toList.

  I did a quick test to see the performance. I had a 2.9GB cached RDD[String] storing one JSON object per record (twitter dataset). My simple test program is attached below.

  ```scala
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._
  val jsonData = sc.textFile("...")
  jsonData.cache.count
  val jsonSchemaRDD = sqlContext.jsonRDD(jsonData)
  jsonSchemaRDD.registerAsTable("jt")
  sqlContext.sql("select count(*) from jt").collect
  ```

  Stages for the schema inference and the table scan both had 48 tasks. These tasks were executed sequentially. For the current implementation, scanning the JSON dataset will materialize values of all fields of a record. The inferred schema of the dataset can be accessed at https://gist.github.com/yhuai/05fe8a57c638c6666f8d.

  From the result, there was no significant difference on running `jsonRDD`. For the simple aggregation query, results are attached below.

  ```
  Original:
  Run 1: 26.1s
  Run 2: 27.03s
  Run 3: 27.035s

  With this change:
  Run 1: 21.086s
  Run 2: 21.035s
  Run 3: 21.029s
  ```

  JIRA: https://issues.apache.org/jira/browse/SPARK-2603

  Author: Yin Huai <huai@cse.ohio-state.edu>

  Closes #1504 from yhuai/removeToMapToList and squashes the following commits:

  6831b77 [Yin Huai] Fix failed tests.
  09b9bca [Yin Huai] Merge remote-tracking branch 'upstream/master' into removeToMapToList
  d1abdb8 [Yin Huai] Remove unnecessary toMap and toList.
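  The copy-vs-wrap distinction, in a standalone Scala snippet (illustrative; not the JsonRDD code itself):

  ```scala
  import java.util.{ArrayList => JArrayList}
  import scala.collection.JavaConverters._
  import scala.collection.mutable

  val javaList = new JArrayList[String]()
  javaList.add("a"); javaList.add("b")

  // Copying: traverses the Java list and allocates a brand-new immutable Scala List.
  val copied: List[String] = javaList.asScala.toList

  // Wrapping: asScala alone returns a constant-time view backed by the same Java list;
  // no per-element work happens until the elements are actually read.
  val wrapped: mutable.Buffer[String] = javaList.asScala
  ```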
* [Build] SPARK-2619: Configurable filemode for the spark/bin folder in debian package (tzolov, 2014-07-24; 1 file, -1/+2)

  Add a `<deb.bin.filemode>744</deb.bin.filemode>` property to `assembly/pom.xml` that defaults to `744`, and use this property for the ../bin folder's `<filemode>`. This patch doesn't change the current default modes, but allows one to override the modes at build time: `-Ddeb.bin.filemode=<new mode>`

  Author: tzolov <christian.tzolov@gmail.com>

  Closes #1531 from tzolov/SPARK-2619 and squashes the following commits:

  6d95343 [tzolov] [Build] SPARK-2619: Configurable filemode for the spark/bin folder in the .deb package
* SPARK-2150: Provide direct link to finished application UI in yarn resource manager UI (Rahul Singhal, 2014-07-24; 11 files, -13/+46)

  Use the event logger directory to provide a direct link to the finished application UI in the YARN resource manager UI.

  Author: Rahul Singhal <rahul.singhal@guavus.com>

  Closes #1094 from rahulsinghaliitd/SPARK-2150 and squashes the following commits:

  95f230c [Rahul Singhal] SPARK-2150: Provide direct link to finished application UI in yarn resource manager UI
* [SPARK-2661][bagel] Unpersist old processed RDD (Daoyuan, 2014-07-24; 1 file, -0/+5)

  Unpersist the no-longer-needed RDD during a Bagel iteration to make full use of memory.

  Author: Daoyuan <daoyuan.wang@intel.com>

  Closes #1519 from adrian-wang/bagelunpersist and squashes the following commits:

  182c9dd [Daoyuan] rename var nextUseless to lastRDD
  87fd3a4 [Daoyuan] bagel unpersist old processed rdd
* SPARK-2310. Support arbitrary Spark properties on the command line with spark-submit (Sandy Ryza, 2014-07-23; 5 files, -4/+32)

  The PR allows invocations like:

  ```
  spark-submit --class org.MyClass --spark.shuffle.spill false myjar.jar
  ```

  Author: Sandy Ryza <sandy@cloudera.com>

  Closes #1253 from sryza/sandy-spark-2310 and squashes the following commits:

  1dc9855 [Sandy Ryza] More doc and cleanup
  00edfb9 [Sandy Ryza] Review comments
  91b244a [Sandy Ryza] Change format to --conf PROP=VALUE
  8fabe77 [Sandy Ryza] SPARK-2310. Support arbitrary Spark properties on the command line with spark-submit
* [SPARK-2658][SQL] Add rule for true = 1. (Michael Armbrust, 2014-07-23; 3 files, -1/+24)

  Author: Michael Armbrust <michael@databricks.com>

  Closes #1556 from marmbrus/fixBooleanEqualsOne and squashes the following commits:

  ad8edd4 [Michael Armbrust] Add rule for true = 1 and false = 0.
* SPARK-2662: Fix NPE for JsonProtocol (GuoQiang Li, 2014-07-23; 1 file, -1/+2)

  Author: GuoQiang Li <witgo@qq.com>

  Closes #1511 from witgo/JsonProtocol and squashes the following commits:

  2b6227f [GuoQiang Li] Fix NPE for JsonProtocol
* Replace RoutingTableMessage with pair (Ankur Dave, 2014-07-23; 4 files, -30/+36)

  RoutingTableMessage was used to construct routing tables to enable joining VertexRDDs with partitioned edges. It stored three elements: the destination vertex ID, the source edge partition, and a byte specifying the position in which the edge partition referenced the vertex, to enable join elimination. However, this was incompatible with sort-based shuffle (SPARK-2045). It was also slightly wasteful, because partition IDs are usually much smaller than 2^32, though this was mitigated by a custom serializer that used variable-length encoding.

  This commit replaces RoutingTableMessage with a pair of (VertexId, Int) where the Int encodes both the source partition ID (in the lower 30 bits) and the position (in the top 2 bits).

  Author: Ankur Dave <ankurdave@gmail.com>

  Closes #1553 from ankurdave/remove-RoutingTableMessage and squashes the following commits:

  697e17b [Ankur Dave] Replace RoutingTableMessage with pair
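  The 30/2-bit packing described above, sketched as standalone Scala helpers (constants and names are illustrative, not copied from the GraphX source):

  ```scala
  object RoutingPacking {
    private val PartitionMask = (1 << 30) - 1  // lower 30 bits hold the partition ID

    // position uses the top 2 bits: e.g. 1 = referenced as src, 2 = as dst, 3 = both
    def pack(partitionId: Int, position: Byte): Int = {
      require(partitionId >= 0 && partitionId <= PartitionMask, "partition ID needs <= 30 bits")
      (position.toInt << 30) | partitionId
    }

    def partitionId(packed: Int): Int = packed & PartitionMask
    def position(packed: Int): Byte = ((packed >>> 30) & 0x3).toByte
  }

  // Round-trips: RoutingPacking.partitionId(RoutingPacking.pack(7, 3)) == 7
  ```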
* [SPARK-2484][SQL] Build should not run hivecompatibility tests by default. (witgo, 2014-07-23; 2 files, -0/+30)

  Author: witgo <witgo@qq.com>

  Closes #1403 from witgo/hive_compatibility and squashes the following commits:

  4e5ecdb [witgo] The default does not run hive compatibility tests
* [SPARK-2549] Functions defined inside of other functions trigger failures (Prashant Sharma, 2014-07-23; 2 files, -106/+98)

  Author: Prashant Sharma <prashant.s@imaginea.com>

  Closes #1510 from ScrapCodes/SPARK-2549/fun-in-fun and squashes the following commits:

  9458bc5 [Prashant Sharma] Tested by removing an inner function from excludes.
  bc03b1c [Prashant Sharma] SPARK-2549 Functions defined inside of other functions trigger failures
* [SPARK-2102][SQL][CORE] Add option for kryo registration required and use a resource pool in Spark SQL for Kryo instances. (Ian O Connell, 2014-07-23; 3 files, -17/+50)

  Author: Ian O Connell <ioconnell@twitter.com>

  Closes #1377 from ianoc/feature/SPARK-2102 and squashes the following commits:

  5498566 [Ian O Connell] Docs update suggested by Patrick
  20e8555 [Ian O Connell] Slight style change
  f92c294 [Ian O Connell] Add docs for new KryoSerializer option
  f3735c8 [Ian O Connell] Add using a kryo resource pool for the SqlSerializer
  4e5c342 [Ian O Connell] Register the SparkConf for kryo, it gets swept into serialization
  665805a [Ian O Connell] Add a spark.kryo.registrationRequired option for configuring the Kryo Serializer
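  A minimal sketch of the resource-pool idea (the class below is illustrative, not the actual Spark SQL pool): Kryo instances are expensive to construct, so a serializer borrows one, uses it, and returns it instead of building a fresh instance per operation.

  ```scala
  import java.util.concurrent.ConcurrentLinkedQueue
  import com.esotericsoftware.kryo.Kryo

  class KryoResourcePool(newInstance: () => Kryo) {
    private val pool = new ConcurrentLinkedQueue[Kryo]()

    def borrow(): Kryo = Option(pool.poll()).getOrElse(newInstance())
    def release(kryo: Kryo): Unit = pool.offer(kryo)

    // Convenience wrapper that guarantees the instance goes back into the pool.
    def withKryo[T](work: Kryo => T): T = {
      val kryo = borrow()
      try work(kryo) finally release(kryo)
    }
  }

  // e.g. new KryoResourcePool(() => new Kryo()).withKryo { k => /* serialize with k */ }
  ```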
* [SPARK-2569][SQL] Fix shipping of TEMPORARY hive UDFs. (Michael Armbrust, 2014-07-23; 3 files, -235/+261)

  Instead of shipping just the name and then looking up the info on the workers, we now ship the whole class name. Also, I refactored the file, as it was getting pretty large, to move the type conversion code out into its own file.

  Author: Michael Armbrust <michael@databricks.com>

  Closes #1552 from marmbrus/fixTempUdfs and squashes the following commits:

  b695904 [Michael Armbrust] Make add jar execute with Hive. Ship the whole function class name since sometimes we cannot look up temporary functions on the workers.
* SPARK-2226: [SQL] transform HAVING clauses with aggregate expressions that aren't in the aggregation list (William Benton, 2014-07-23; 12 files, -1/+869)

  This change adds an analyzer rule to:

  1. find expressions in `HAVING` clause filters that depend on unresolved attributes,
  2. push these expressions down to the underlying aggregates, and then
  3. project them away above the filter.

  It also enables the `HAVING` queries in the Hive compatibility suite.

  Author: William Benton <willb@redhat.com>

  Closes #1497 from willb/spark-2226 and squashes the following commits:

  92c9a93 [William Benton] Removed unnecessary import
  f1d4f34 [William Benton] Cleanups missed in prior commit
  0e1624f [William Benton] Incorporated suggestions from @marmbrus; thanks!
  541d4ee [William Benton] Cleanups from review
  5a12647 [William Benton] Explanatory comments and stylistic cleanups.
  c7f2b2c [William Benton] Whitelist HAVING queries.
  29a26e3 [William Benton] Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226)
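  For illustration, a query of the shape this rule handles (1.0-era Spark SQL API; table and column names are made up): `max(value)` appears only in the HAVING clause, not in the aggregation list, so the analyzer must push it into the underlying Aggregate and project it away above the Filter.

  ```scala
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("having-demo"))
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  case class Record(key: String, value: Int)
  import sqlContext._  // brings in createSchemaRDD and sql(...)
  sc.parallelize(Seq(Record("a", 1), Record("b", -2))).registerAsTable("src")

  // max(value) is referenced only in HAVING; the new analyzer rule resolves it anyway.
  sql("SELECT key FROM src GROUP BY key HAVING max(value) > 0").collect().foreach(println)
  ```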
* SPARK-2277: clear host->rack info properly (Rui Li, 2014-07-23; 1 file, -0/+2)

  Hi mridulm, I just thought of this issue from [#1212](https://github.com/apache/spark/pull/1212): I added FakeRackUtil to hold the host -> rack mapping. It should be cleaned up after use so that it won't mess up test cases others may add later. Really sorry about this.

  Author: Rui Li <rui.li@intel.com>

  Closes #1454 from lirui-intel/SPARK-2277-fix-UT and squashes the following commits:

  f8ea25c [Rui Li] SPARK-2277: clear host->rack info properly
* [SPARK-2588][SQL] Add some more DSLs. (Takuya UESHIN, 2014-07-23; 3 files, -33/+70)

  Author: Takuya UESHIN <ueshin@happy-camper.st>

  Closes #1491 from ueshin/issues/SPARK-2588 and squashes the following commits:

  43d0a46 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2588
  1023ea0 [Takuya UESHIN] Modify tests to use DSLs.
  2310bf1 [Takuya UESHIN] Add some more DSLs.
* [CORE] SPARK-2640: In "local[N]", free cores of the only executor should be touched by "spark.task.cpus" for every finish/start-up of tasks. (woshilaiceshide, 2014-07-23; 1 file, -2/+2)

  Make Spark's "local[N]" better. In our company, we use "local[N]" in production. It works excellently. It's our best choice.

  Author: woshilaiceshide <woshilaiceshide@qq.com>

  Closes #1544 from woshilaiceshide/localX and squashes the following commits:

  6c85154 [woshilaiceshide] [CORE] SPARK-2640: In "local[N]", free cores of the only executor should be touched by "spark.task.cpus" for every finish/start-up of tasks.
* [SPARK-2609] Log thread ID when spilling ExternalAppendOnlyMap (Andrew Or, 2014-07-23; 1 file, -3/+3)

  It's useful to know whether one thread is constantly spilling or multiple threads are spilling relatively infrequently. Right now everything looks a little jumbled and we can't tell which lines belong to the same thread. For instance:

  ```
  06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (194 times so far)
  06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (198 times so far)
  06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (198 times so far)
  06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 10 MB to disk (197 times so far)
  06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 9 MB to disk (45 times so far)
  06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 23 MB to disk (198 times so far)
  06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 38 MB to disk (25 times so far)
  06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 161 MB to disk (25 times so far)
  06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 0 MB to disk (199 times so far)
  06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (166 times so far)
  06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (199 times so far)
  06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (200 times so far)
  ```

  Author: Andrew Or <andrewor14@gmail.com>

  Closes #1517 from andrewor14/external-log and squashes the following commits:

  90e48bb [Andrew Or] Log thread ID when spilling
* [SPARK-2617] Correct doc and usages of preservesPartitioning (Xiangrui Meng, 2014-07-23; 8 files, -15/+37)

  The name `preservesPartitioning` is ambiguous: 1) preserves the indices of partitions, 2) preserves the partitioner. The latter is correct, and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, this is already part of the API and we cannot change it. We should be clear in the doc and fix wrong usages.

  This PR:

  1. adds notes in `mapPartitions*`,
  2. makes `RDD.sample` preserve the partitioner,
  3. changes `preservesPartitioning` to false in `RDD.zip`, because the keys of the first RDD are no longer the keys of the zipped RDD,
  4. fixes some wrong usages in MLlib.

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #1526 from mengxr/preserve-partitioner and squashes the following commits:

  b361e65 [Xiangrui Meng] update doc based on pwendell's comments
  3b1ba19 [Xiangrui Meng] update doc
  357575c [Xiangrui Meng] fix unit test
  20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
  d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning; fix wrong usage of preservesPartitioning; make sample preserve partitioning
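  A small, self-contained illustration of the correct meaning ("preserves the partitioner"): `mapPartitions` with `preservesPartitioning = true` is only safe when the keys are left untouched.

  ```scala
  import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
  import org.apache.spark.SparkContext._  // pair-RDD implicits in Spark 1.x

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("partitioner-demo"))
  val pairs = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(new HashPartitioner(2))

  // The keys are untouched, so we may promise that the partitioner still holds:
  val upper = pairs.mapPartitions(
    iter => iter.map { case (k, v) => (k, v.toUpperCase) },
    preservesPartitioning = true)

  assert(upper.partitioner == pairs.partitioner)  // Some(HashPartitioner(2))
  ```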
* Remove GraphX MessageToPartition for compatibility with sort-based shuffle (Ankur Dave, 2014-07-22; 5 files, -228/+2)

  MessageToPartition was used in `Graph#partitionBy`. Unlike a Tuple2, it marked the key as transient to avoid sending it over the network. However, it was incompatible with sort-based shuffle (SPARK-2045) and represented only a minor optimization: for partitionBy, it improved performance by 6.3% (30.4 s to 28.5 s) and reduced communication by 5.6% (114.2 MB to 107.8 MB).

  Author: Ankur Dave <ankurdave@gmail.com>

  Closes #1537 from ankurdave/remove-MessageToPartition and squashes the following commits:

  f9d0054 [Ankur Dave] Remove MessageToPartition
  ab71364 [Ankur Dave] Remove unused VertexBroadcastMsg
* [YARN] SPARK-2577: File upload to viewfs is broken due to mount point resolution (Gera Shegalov, 2014-07-22; 1 file, -1/+2)

  Opting for option 2 defined in SPARK-2577, i.e., retrieve and pass the correct file system object to addResource.

  Author: Gera Shegalov <gera@twitter.com>

  Closes #1483 from gerashegalov/master and squashes the following commits:

  90c9087 [Gera Shegalov] [YARN] SPARK-2577: File upload to viewfs is broken due to mount point resolution
* [YARN][SPARK-2606]: In some cases, the Spark UI pages display incorrectly (GuoQiang Li, 2014-07-22; 1 file, -1/+1)

  The issue is caused by #1112.

  Author: GuoQiang Li <witgo@qq.com>

  Closes #1501 from witgo/webui_style and squashes the following commits:

  4b34998 [GuoQiang Li] In some cases, pages display incorrectly in the WebUI
* GraphX example (CrazyJvm, 2014-07-22; 2 files, -5/+5)

  Fix the GraphX examples.

  Author: CrazyJvm <crazyjvm@gmail.com>

  Closes #1523 from CrazyJvm/graphx-example and squashes the following commits:

  663457a [CrazyJvm] outDegrees does not take parameters
  7cfff1d [CrazyJvm] fix example for joinVertices
* [SPARK-2615] [SQL] Add Equal Sign "==" Support for HiveQl (Cheng Hao, 2014-07-22; 21 files, -0/+45)

  Currently, "==" in a HiveQL expression causes an exception to be thrown; this patch fixes that.

  Author: Cheng Hao <hao.cheng@intel.com>

  Closes #1522 from chenghao-intel/equal and squashes the following commits:

  f62a0ff [Cheng Hao] Add == Support for HiveQl
* SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage (Aaron Davidson, 2014-07-22; 7 files, -32/+1222)

  ### Why and what?

  Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead.

  This patch adds a Sorter API, intended for in-memory sorts, which simply ports the Android Timsort implementation (available under Apache v2) and abstracts the interface in a way which introduces no more than 1 virtual function invocation of overhead at each abstraction point. Please compare our port of the Android Timsort sort with the original implementation: http://www.diffchecker.com/wiwrykcl

  ### Memory implications

  An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a [compressed OOPS](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default.

  Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes). This results in a worst-case sorting overhead of 24N + 2N = 26N bytes (for Java 7).

  The Sorter does not require allocating any tuples, but since it uses Timsort, it may copy up to half the entire array in the worst case. This results in a worst-case sorting overhead of 4N bytes. Thus, we have reduced the worst-case overhead of the sort by roughly 22 bytes times the number of elements.

  ### Performance implications

  As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than to improve performance. However, because it implements Timsort, it also brings a substantial performance boost over our prior implementation. Here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()" while the new implementation is "KV-array using Sorter".

  | Test | First run (JDK6) | Average of 10 (JDK6) | First run (JDK7) | Average of 10 (JDK7) |
  |------|------------------|----------------------|------------------|----------------------|
  | primitive Arrays.sort() | 3216 ms | 1190 ms | 2724 ms | 131 ms (!!) |
  | Arrays.sort() | 18564 ms | 2006 ms | 13201 ms | 878 ms |
  | Tuple-sort using Arrays.sort() | 31813 ms | 3550 ms | 20990 ms | 1919 ms |
  | **KV-array using Sorter** | | | **15020 ms** | **834 ms** |

  The results show that this Sorter performs exactly as expected (after the first run) -- it is as fast as the Java 7 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6 or 7. In short, this patch should significantly improve performance for users running either Java 6 or 7.

  Author: Aaron Davidson <aaron@databricks.com>

  Closes #1502 from aarondav/sort and squashes the following commits:

  652d936 [Aaron Davidson] Update license, move Sorter to java src
  a7b5b1c [Aaron Davidson] fix licenses
  5c0efaf [Aaron Davidson] Update tmpLength
  ec395c8 [Aaron Davidson] Ignore benchmark (again) and fix docs
  034bf10 [Aaron Davidson] Change to Apache v2 Timsort
  b97296c [Aaron Davidson] Don't try to run benchmark on Jenkins + private[spark]
  6307338 [Aaron Davidson] SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
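  To make the "abstract interface" idea concrete, here is a minimal sketch (illustrative trait and names, not the actual Sorter API; insertion sort stands in for the real Timsort port): the sort algorithm only sees `getKey` and `swap`, so it can reorder an interleaved [key, value, key, value, ...] array directly, without materializing any Tuple2.

  ```scala
  // The algorithm is written against this tiny interface instead of Array[Tuple2].
  trait KVFormat[Buf] {
    def getKey(buf: Buf, i: Int): Float        // key of logical element i
    def swap(buf: Buf, i: Int, j: Int): Unit   // swap logical elements i and j
  }

  // Logical element i lives at physical slots 2*i (key) and 2*i + 1 (value).
  object InterleavedKV extends KVFormat[Array[AnyRef]] {
    def getKey(buf: Array[AnyRef], i: Int): Float =
      buf(2 * i).asInstanceOf[Float]
    def swap(buf: Array[AnyRef], i: Int, j: Int): Unit = {
      val k = buf(2 * i); val v = buf(2 * i + 1)
      buf(2 * i) = buf(2 * j); buf(2 * i + 1) = buf(2 * j + 1)
      buf(2 * j) = k; buf(2 * j + 1) = v
    }
  }

  // Insertion sort over the abstract format (the real patch ports Timsort instead).
  def sort[Buf](buf: Buf, n: Int, fmt: KVFormat[Buf]): Unit = {
    var i = 1
    while (i < n) {
      var j = i
      while (j > 0 && fmt.getKey(buf, j - 1) > fmt.getKey(buf, j)) {
        fmt.swap(buf, j, j - 1)
        j -= 1
      }
      i += 1
    }
  }
  ```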
* [MLLIB] make Mima ignore updateFeatures (private) in ALS (Xiangrui Meng, 2014-07-22; 1 file, -1/+3)

  Fix Mima issues in #1521.

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #1533 from mengxr/mima-als and squashes the following commits:

  78386e1 [Xiangrui Meng] make Mima ignore updateFeatures (private) in ALS
* [SPARK-2612] [mllib] Fix data skew in ALS (peng.zhang, 2014-07-22; 1 file, -6/+5)

  Author: peng.zhang <peng.zhang@xiaomi.com>

  Closes #1521 from renozhang/fix-als and squashes the following commits:

  b5727a4 [peng.zhang] Remove unneeded argument
  1a4f7a0 [peng.zhang] Fix data skew in ALS
* [SPARK-2452] Create a new val id for each line instead of using lineId. (Prashant Sharma, 2014-07-22; 2 files, -2/+17)

  Author: Prashant Sharma <prashant@apache.org>

  Closes #1441 from ScrapCodes/SPARK-2452/multi-statement and squashes the following commits:

  26c5c72 [Prashant Sharma] Added a test case.
  7e8d28d [Prashant Sharma] SPARK-2452: create a new val id for each line instead of using lineId, because line ids can be the same sometimes.
* [SPARK-2470] PEP8 fixes to PySpark (Nicholas Chammas, 2014-07-21; 18 files, -97/+127)

  This pull request aims to resolve all outstanding PEP8 violations in PySpark.

  Author: Nicholas Chammas <nicholas.chammas@gmail.com>
  Author: nchammas <nicholas.chammas@gmail.com>

  Closes #1505 from nchammas/master and squashes the following commits:

  98171af [Nicholas Chammas] [SPARK-2470] revert PEP 8 fixes to cloudpickle
  cba7768 [Nicholas Chammas] [SPARK-2470] wrap expression list in parentheses
  e178dbe [Nicholas Chammas] [SPARK-2470] style - change position of line break
  9127d2b [Nicholas Chammas] [SPARK-2470] wrap expression lists in parentheses
  22132a4 [Nicholas Chammas] [SPARK-2470] wrap conditionals in parentheses
  24639bc [Nicholas Chammas] [SPARK-2470] fix whitespace for doctest
  7d557b7 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to tests.py
  8f8e4c0 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to storagelevel.py
  b3b96cf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to statcounter.py
  d644477 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to worker.py
  aa3a7b6 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to sql.py
  1916859 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to shell.py
  95d1d95 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to serializers.py
  a0fec2e [Nicholas Chammas] [SPARK-2470] PEP8 fixes to mllib
  c85e1e5 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to join.py
  d14f2f1 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to __init__.py
  81fcb20 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to resultiterable.py
  1bde265 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to java_gateway.py
  7fc849c [Nicholas Chammas] [SPARK-2470] PEP8 fixes to daemon.py
  ca2d28b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to context.py
  f4e0039 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to conf.py
  a6d5e4b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to cloudpickle.py
  f0a7ebf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to rddsampler.py
  4dd148f [nchammas] Merge pull request #5 from apache/master
  f7e4581 [Nicholas Chammas] unrelated pep8 fix
  a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently
  de7292a [nchammas] Merge pull request #4 from apache/master
  2e4fe00 [nchammas] Merge pull request #3 from apache/master
  89fde08 [nchammas] Merge pull request #2 from apache/master
  69f6e22 [Nicholas Chammas] PEP8 fixes
  2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
  6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
  69da6cf [nchammas] Merge pull request #1 from apache/master
* [SPARK-2086] Improve output of toDebugString to make shuffle boundaries more clear (Gregory Owen, 2014-07-21; 2 files, -4/+56)

  Changes RDD.toDebugString() to show hierarchy and shuffle transformations more clearly.

  New output:

  ```
  (3) FlatMappedValuesRDD[325] at apply at Transformer.scala:22
   |  MappedValuesRDD[324] at apply at Transformer.scala:22
   |  CoGroupedRDD[323] at apply at Transformer.scala:22
   +-(5) MappedRDD[320] at apply at Transformer.scala:22
   |  |  MappedRDD[319] at apply at Transformer.scala:22
   |  |  MappedValuesRDD[318] at apply at Transformer.scala:22
   |  |  MapPartitionsRDD[317] at apply at Transformer.scala:22
   |  |  ShuffledRDD[316] at apply at Transformer.scala:22
   |  +-(10) MappedRDD[315] at apply at Transformer.scala:22
   |  |  ParallelCollectionRDD[314] at apply at Transformer.scala:22
   +-(100) MappedRDD[322] at apply at Transformer.scala:22
   |  ParallelCollectionRDD[321] at apply at Transformer.scala:22
  ```

  Author: Gregory Owen <greowen@gmail.com>

  Closes #1364 from GregOwen/to-debug-string and squashes the following commits:

  08f5c78 [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
  1603f7b [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
* [SPARK-2561][SQL] Fix apply schema (Michael Armbrust, 2014-07-21; 2 files, -1/+7)

  We need to use the analyzed attributes, otherwise we end up with a tree that will never resolve.

  Author: Michael Armbrust <michael@databricks.com>

  Closes #1470 from marmbrus/fixApplySchema and squashes the following commits:

  f968195 [Michael Armbrust] Use analyzed attributes when applying the schema.
  4969015 [Michael Armbrust] Add test case.
* [SPARK-2434][MLlib]: Warning messages that point users to original MLlib implementations added to Examples (Burak, 2014-07-21; 11 files, -1/+141)

  Warning messages that refer users to the original MLlib implementations of some popular example machine learning algorithms were added both in the comments and the code. The following examples have been modified:

  Scala:
  - LocalALS
  - LocalFileLR
  - LocalKMeans
  - LocalLP
  - SparkALS
  - SparkHdfsLR
  - SparkKMeans
  - SparkLR

  Python:
  - kmeans.py
  - als.py
  - logistic_regression.py

  Author: Burak <brkyvz@gmail.com>

  Closes #1515 from brkyvz/SPARK-2434 and squashes the following commits:

  7505da9 [Burak] [SPARK-2434][MLlib]: Warning messages added, scalastyle errors fixed, and added missing punctuation
  b96b522 [Burak] [SPARK-2434][MLlib]: Warning messages added and scalastyle errors fixed
  4762f39 [Burak] [SPARK-2434]: Warning messages added
  17d3d83 [Burak] SPARK-2434: Added warning messages to the naive implementations of the example algorithms
  2cb5301 [Burak] SPARK-2434: Warning messages redirecting to original implementations added.
* Fix flaky HiveQuerySuite test (Aaron Davidson, 2014-07-21; 1 file, -23/+22)

  The result may not be returned in the expected order, so relax that constraint.

  Author: Aaron Davidson <aaron@databricks.com>

  Closes #1514 from aarondav/flakey and squashes the following commits:

  e5af823 [Aaron Davidson] Fix flakey HiveQuerySuite test
* [SPARK-2494] [PySpark] make hash of None consistent across machines (Davies Liu, 2014-07-21; 1 file, -3/+32)

  In CPython, the hash of None differs across machines, which causes wrong results during shuffle. This PR fixes that.

  Author: Davies Liu <davies.liu@gmail.com>

  Closes #1371 from davies/hash_of_none and squashes the following commits:

  d01745f [Davies Liu] add comments, remove outdated unit tests
  5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
  b7118aa [Davies Liu] use __builtin__ instead of __builtins__
  839e417 [Davies Liu] hijack hash to make hash of None consistent across machines
* SPARK-1707. Remove unnecessary 3 second sleep in YarnClusterScheduler (Sandy Ryza, 2014-07-21; 6 files, -99/+11)

  Author: Sandy Ryza <sandy@cloudera.com>

  Closes #634 from sryza/sandy-spark-1707 and squashes the following commits:

  2f6e358 [Sandy Ryza] Default min registered executors ratio to .8 for YARN
  354c630 [Sandy Ryza] Remove outdated comments
  c744ef3 [Sandy Ryza] Take out waitForInitialAllocations
  2a4329b [Sandy Ryza] SPARK-1707. Remove unnecessary 3 second sleep in YarnClusterScheduler
* [SPARK-2190][SQL] Specialized ColumnType for Timestamp (Cheng Lian, 2014-07-21; 133 files, -77/+287)

  JIRA issue: [SPARK-2190](https://issues.apache.org/jira/browse/SPARK-2190)

  Added a specialized in-memory column type for `Timestamp`. Whitelisted all timestamp related Hive tests except `timestamp_udf`, which is timezone sensitive.

  Author: Cheng Lian <lian.cs.zju@gmail.com>

  Closes #1440 from liancheng/timestamp-column-type and squashes the following commits:

  e682175 [Cheng Lian] Enabled more timezone sensitive Hive tests.
  53a358f [Cheng Lian] Fixed failed test suites
  01b592d [Cheng Lian] Fixed SimpleDateFormat thread safety issue
  2a59343 [Cheng Lian] Removed timezone sensitive Hive timestamp tests
  45dd05d [Cheng Lian] Added Timestamp specific in-memory columnar representation
* [SPARK-1945][MLLIB] Documentation Improvements for Spark 1.0 (Michael Giannakopoulos, 2014-07-20; 5 files, -8/+465)

  Standalone application examples written in Java are added to the 'mllib-linear-methods.md' file. This commit is related to the issue [Add full Java Examples in MLlib docs](https://issues.apache.org/jira/browse/SPARK-1945). Also, I changed the name of the sigmoid function from 'logit' to 'f', because the logit function is the inverse of the sigmoid.

  Thanks, Michael

  Author: Michael Giannakopoulos <miccagiann@gmail.com>

  Closes #1311 from miccagiann/master and squashes the following commits:

  8ffe5ab [Michael Giannakopoulos] Update code so as to comply with code standards.
  f7ad5cc [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master'
  38d92c7 [Michael Giannakopoulos] Adding PCA, SVD and LBFGS examples in Java. Performing minor updates in the already committed examples so as to eradicate the call of 'productElement' function whenever possible.
  cc0a089 [Michael Giannakopoulos] Modified Java examples so as to comply with coding standards.
  b1141b2 [Michael Giannakopoulos] Added Java examples for Clustering and Collaborative Filtering [mllib-clustering.md & mllib-collaborative-filtering.md].
  837f7a8 [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master'
  15f0eb4 [Michael Giannakopoulos] Java examples included in 'mllib-linear-methods.md' file.
* Improve scheduler delay tooltip. (Kay Ousterhout, 2014-07-20; 1 file, -3/+3)

  As a result of shivaram's experience debugging long scheduler delay, I think we should improve the tooltip to point people in the right direction if scheduler delay is large.

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  Closes #1488 from kayousterhout/better_tooltips and squashes the following commits:

  22176fd [Kay Ousterhout] Improve scheduler delay tooltip.