spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	Merge remote-tracking branch 'apache/master' into error-handling	Tathagata Das	2014-01-11	30	-174/+1347
\|\
\| *	Merge pull request #389 from rxin/clone-writables	Reynold Xin	2014-01-11	3	-41/+71
\| \|\	Minor update for clone writables and more documentation.
\| \| *	Renamed cloneKeyValues to cloneRecords; updated docs.	Reynold Xin	2014-01-11	3	-44/+45
\| \| \|
\| \| *	Minor update for clone writables and more documentation.	Reynold Xin	2014-01-11	3	-12/+41
\| \| \|
\| * \|	Merge pull request #388 from pwendell/master	Reynold Xin	2014-01-11	1	-1/+1
\| \|\ \	Fix UI bug introduced in #244. The 'duration' field was incorrectly renamed to 'task time' in the table that lists stages.
\| \| * \|	Fix UI bug introduced in #244.	Patrick Wendell	2014-01-11	1	-1/+1
\| \| \| \|	The 'duration' field was incorrectly renamed to 'task time' in the table that lists stages.
\| * \| \|	Merge pull request #393 from pwendell/revert-381	Patrick Wendell	2014-01-11	2	-2/+2
\| \|\ \ \	Revert PR 381. This PR missed a bunch of test cases that require "spark.cleaner.ttl". I think it is what is causing test failures on Jenkins right now (though it's a bit hard to tell because the DNS for cs.berkeley.edu is down). I'm submitting this to see if it fixes Jenkins. I did try just patching various tests, but it was taking a really long time because there are a bunch of them, so for now I'm just seeing if a revert works.
\| \| * \| \|	Revert "Fix default TTL for metadata cleaner"	Patrick Wendell	2014-01-11	1	-1/+1
\| \| \| \| \|	This reverts commit 669ba4caa95014f4511f842206c3e506f1a41a7a.
\| \| * \| \|	Revert "Fix one unit test that was not setting spark.cleaner.ttl"	Patrick Wendell	2014-01-11	1	-1/+1
\| \|/ / /	This reverts commit 942c80b34c4642de3b0237761bc1aaeb8cbdd546.
\| * \| \|	Merge pull request #387 from jerryshao/conf-fix	Reynold Xin	2014-01-11	1	-7/+8
\| \|\ \ \	Fix configure didn't work small problem in ALS
\| \| * \|	Fix configure didn't work small problem in ALS	jerryshao	2014-01-11	1	-7/+8
\| \| \| \|
\| * \| \|	Merge pull request #359 from ScrapCodes/clone-writables	Reynold Xin	2014-01-11	4	-49/+106
\| \|\ \ \	We clone Hadoop keys and values by default and reuse objects if asked to. We try to clone the most common types of writables directly and call WritableUtils.clone otherwise. The intention is to optimize: for example, NullWritable needs no cloning, and for Long, Int and String, creating a new object with the value set should be faster than copying the object. There is another way to do this PR, where we ask separately for keys and values whether to clone them, but I could not think of a use case for it except when either of them is a NullWritable, which I have already worked around. So I thought that would be unnecessary.
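The clone-versus-reuse trade-off described above can be illustrated with a minimal Python sketch. It is not Spark's or Hadoop's code: `MutableRecord`, `read_records` and the `clone_records` flag are hypothetical names standing in for a reader that recycles one mutable object per record.

```python
import copy


class MutableRecord:
    """Hypothetical stand-in for a reusable, mutable Hadoop Writable."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value


def read_records(raw_values, clone_records=True):
    """Yield one record per raw value from a reader that reuses a single object.

    With clone_records=True each yielded record is an independent copy.
    With clone_records=False the same object is yielded every time, which
    is cheaper but unsafe if the caller holds on to the records.
    """
    shared = MutableRecord()
    for raw in raw_values:
        shared.set(raw)
        yield copy.copy(shared) if clone_records else shared


# Collect all records first, then read their values.
cloned = [r.value for r in list(read_records([1, 2, 3], clone_records=True))]
reused = [r.value for r in list(read_records([1, 2, 3], clone_records=False))]

print(cloned)  # [1, 2, 3]
print(reused)  # [3, 3, 3]: every reference points at the same object
```

Reuse is only safe when each record is consumed before the next one is read, which is why cloning is the safer default.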
\| \| * \| \|	Fixes corresponding to Reynolds feedback comments	Prashant Sharma	2014-01-09	4	-32/+43
\| \| \| \| \|
\| \| * \| \|	we clone hadoop key and values by default and reuse if specified.	Prashant Sharma	2014-01-08	4	-41/+87
\| \| \| \| \|
\| * \| \| \|	Merge pull request #373 from jerryshao/kafka-upgrade	Patrick Wendell	2014-01-11	2	-11/+11
\| \|\ \ \ \	Upgrade Kafka dependency to 0.8.0 release version
\| \| * \| \| \|	Upgrade Kafka dependency to 0.8.0 release version	jerryshao	2014-01-10	2	-11/+11
\| \| \| \| \| \|
\| * \| \| \| \|	Merge pull request #376 from prabeesh/master	Reynold Xin	2014-01-10	1	-1/+1
\| \|\ \ \ \ \	Change clientId to random clientId. The client identifier should be unique across all clients connecting to the same server. A convenience method is provided to generate a random client id that should satisfy this criterion: generateClientId(). It returns a randomly generated client identifier based on the current user's login name and the system time. As the client identifier is used by the server to identify a client when it reconnects, the client must use the same identifier between connections if durable subscriptions are to be used.
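As a rough illustration of the idea, an identifier built from the login name and the system time could look like the Python sketch below. This is not the MQTT client library's implementation: `generate_client_id`, its parameters and the fallback name `"unknown"` are assumptions made for the example.

```python
import os
import time


def generate_client_id(user=None, now_ms=None):
    """Build a client id from a login name and a millisecond timestamp.

    Both arguments are optional so that the function can be tested
    deterministically; by default they come from the environment and clock.
    """
    if user is None:
        # Hypothetical fallback chain; real libraries may resolve the user differently.
        user = os.environ.get("USER") or os.environ.get("USERNAME") or "unknown"
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return f"{user}{now_ms}"


example_id = generate_client_id(user="alice", now_ms=1389398400000)
print(example_id)  # alice1389398400000
```

Two clients started by the same user in the same millisecond would still collide, so this scheme only makes collisions unlikely. A client that relies on durable subscriptions would need to persist and reuse its id instead of regenerating it.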
\| \| * \| \| \| \|	Change clientId to random clientId	Prabeesh K	2014-01-10	1	-1/+1
\| \| \| \| \| \| \|	Returns a randomly generated client identifier based on the current user's login name and the system time.
\| * \| \| \| \| \|	Merge pull request #386 from pwendell/typo-fix	Reynold Xin	2014-01-10	1	-1/+1
\| \|\ \ \ \ \ \	Small typo fix
\| \| * \| \| \| \|	Small typo fix	Patrick Wendell	2014-01-10	1	-1/+1
\| \| \| \|_\|_\|/ \| \| \|/\| \| \|
\| * \| \| \| \|	Merge pull request #381 from mateiz/default-ttl	Matei Zaharia	2014-01-10	2	-2/+2
\| \|\ \ \ \ \	Fix default TTL for metadata cleaner. It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.
\| \| * \| \| \| \|	Fix one unit test that was not setting spark.cleaner.ttl	Matei Zaharia	2014-01-10	1	-1/+1
\| \| \| \| \| \| \|
\| \| * \| \| \| \|	Fix default TTL for metadata cleaner	Matei Zaharia	2014-01-10	1	-1/+1
\| \| \| \| \| \| \|	It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.
\| * \| \| \| \| \|	Merge pull request #382 from RongGu/master	Patrick Wendell	2014-01-10	1	-1/+1
\| \|\ \ \ \ \ \	Fix a type error in comment lines
\| \| * \| \| \| \| \|	fix a type error in comment lines	RongGu	2014-01-11	1	-1/+1
\| \| \|/ / / / /
\| * \| \| \| \| \|	Merge pull request #385 from shivaram/add-i2-instances	Patrick Wendell	2014-01-10	1	-2/+10
\| \|\ \ \ \ \ \	Add i2 instance types to Spark EC2. Using data from http://aws.amazon.com/amazon-linux-ami/instance-type-matrix/ and http://www.ec2instances.info/
\| \| * \| \| \| \| \|	Add i2 instance types to Spark EC2.	Shivaram Venkataraman	2014-01-10	1	-2/+10
\| \| \|/ / / / /
\| * \| \| \| \| \|	Merge pull request #383 from tdas/driver-test	Patrick Wendell	2014-01-10	15	-205/+600
\| \|\ \ \ \ \ \	API for automatic driver recovery for streaming programs and other bug fixes
\| \| \| \| \| \| \| \|	1. Added Scala and Java API for automatically loading a checkpoint if it exists in the provided checkpoint directory. Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext. Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)` returns a JavaStreamingContext. See RecoverableNetworkWordCount for an example of how to use it.
\| \| \| \| \| \| \| \|	2. Refactored streaming.Checkpoint*** code to fix bugs and make the DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files. It also ensures that spark.driver.* and spark.hostPort are cleared from SparkConf before being written to the checkpoint.
\| \| \| \| \| \| \| \|	3. Fixed a bug in the cleanup of checkpointed RDDs created by DStream. Specifically, this fix ensures that the files of checkpointed RDDs are not prematurely cleaned up, thus ensuring reliable recovery.
\| \| \| \| \| \| \| \|	4. TimeStampedHashMap is upgraded to optionally update the timestamp on map.get(key). This allows clearing of data based on access time (i.e., clearing records that were last accessed before a threshold timestamp).
\| \| \| \| \| \| \| \|	5. Added caching for file modification time in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the mod times to find new files can take seconds if there are 1000s of files. This cache is automatically cleared.
\| \| \| \| \| \| \| \|	This PR is not entirely final as I may make some minor additions: a Java example, and adding StreamingContext.getOrCreate to the unit tests. Edit: Java example to be added later, unit test added.
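The access-time behaviour described for TimeStampedHashMap can be sketched in Python as follows. This is an illustration of the idea only, not Spark's Scala class: `TimeStampedMap`, `update_timestamp_on_get`, `clear_old_values` and the injectable `clock` are names assumed for the example.

```python
import time


class TimeStampedMap:
    """Map that records a timestamp per entry and can drop stale entries."""

    def __init__(self, update_timestamp_on_get=False, clock=time.time):
        self._data = {}  # key -> (value, timestamp)
        self._update_on_get = update_timestamp_on_get
        self._clock = clock

    def put(self, key, value):
        self._data[key] = (value, self._clock())

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, _ = entry
        if self._update_on_get:
            # Refresh the timestamp so clearing is based on last access.
            self._data[key] = (value, self._clock())
        return value

    def clear_old_values(self, threshold):
        """Remove every entry whose timestamp is older than threshold."""
        stale = [k for k, (_, ts) in self._data.items() if ts < threshold]
        for k in stale:
            del self._data[k]

    def keys(self):
        return sorted(self._data)


# A fake clock that ticks 0, 1, 2, ... keeps the example deterministic.
ticks = iter(range(100))
mod_times = TimeStampedMap(update_timestamp_on_get=True, clock=lambda: next(ticks))
mod_times.put("a.txt", 111)   # stored at tick 0
mod_times.put("b.txt", 222)   # stored at tick 1
mod_times.get("a.txt")        # access refreshes a.txt to tick 2
mod_times.clear_old_values(threshold=2)
remaining = mod_times.keys()
print(remaining)  # ['a.txt']
```

With `update_timestamp_on_get=False` both entries would have been cleared, since clearing would then be based on insertion time.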
\| * \ \ \ \ \ \	Merge pull request #377 from andrewor14/master	Patrick Wendell	2014-01-10	17	-93/+1118
\| \|\ \ \ \ \ \ \	External Sorting for Aggregator and CoGroupedRDDs (Revisited)
\| \| \| \| \| \| \| \| \|	(This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving)
\| \| \| \| \| \| \| \| \|	The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.
\| \| \| \| \| \| \| \| \|	The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order. Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
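The spill-then-merge approach can be sketched in a few lines of Python. This is a simplified illustration, not the ExternalAppendOnlyMap implementation: `SpillingCombinerMap` is a hypothetical name, the spill trigger is a plain key count instead of a memory estimate, and keys are assumed to be directly comparable.

```python
import heapq
import pickle
import tempfile
from itertools import groupby
from operator import itemgetter


class SpillingCombinerMap:
    """Combine values per key, spilling sorted runs to disk when full."""

    def __init__(self, merge_value, max_in_memory_keys):
        self._merge = merge_value
        self._limit = max_in_memory_keys  # assumption: count keys, not bytes
        self._current = {}
        self._spills = []

    def insert(self, key, value):
        if key in self._current:
            self._current[key] = self._merge(self._current[key], value)
            return
        if len(self._current) >= self._limit:
            self._spill()
        self._current[key] = value

    def _spill(self):
        """Write the in-memory map to a temp file in sorted key order."""
        f = tempfile.TemporaryFile()
        for pair in sorted(self._current.items(), key=itemgetter(0)):
            pickle.dump(pair, f)
        f.seek(0)
        self._spills.append(f)
        self._current = {}

    @staticmethod
    def _read(f):
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

    def items(self):
        """Merge all sorted runs and combine equal keys (one-shot iterator)."""
        runs = [self._read(f) for f in self._spills]
        runs.append(iter(sorted(self._current.items(), key=itemgetter(0))))
        merged = heapq.merge(*runs, key=itemgetter(0))
        for key, group in groupby(merged, key=itemgetter(0)):
            values = [v for _, v in group]
            combined = values[0]
            for v in values[1:]:
                combined = self._merge(combined, v)
            yield key, combined


counts = SpillingCombinerMap(merge_value=lambda a, b: a + b, max_in_memory_keys=2)
for word in ["b", "a", "c", "a", "b", "d", "a"]:
    counts.insert(word, 1)
result = list(counts.items())
print(result)  # [('a', 3), ('b', 2), ('c', 1), ('d', 1)]
```

Because every run is written in sorted order, the final merge only needs to hold one record per run in memory at a time. When no spill happens, the class behaves like a plain in-memory map plus one final sort.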
\| \| * \| \| \| \| \| \|	Update documentation for externalSorting	Andrew Or	2014-01-10	1	-3/+2
\| \| \| \| \| \| \| \| \|
\| \| * \| \| \| \| \| \|	Address Patrick's and Reynold's comments	Andrew Or	2014-01-10	5	-47/+73
\| \| \| \| \| \| \| \| \|	Aside from trivial formatting changes, use nulls instead of Options for DiskMapIterator, and add documentation for spark.shuffle.externalSorting and spark.shuffle.memoryFraction. Also, set spark.shuffle.memoryFraction to 0.3 and spark.storage.memoryFraction to 0.6.
\| \| * \| \| \| \| \| \|	Fix wonky imports from merge	Andrew Or	2014-01-09	1	-8/+1
\| \| \| \| \| \| \| \| \|
\| \| * \| \| \| \| \| \|	Defensively allocate memory from global pool	Andrew Or	2014-01-09	5	-47/+80
\| \| \| \| \| \| \| \| \|	This is an alternative to the existing approach, which evenly distributes the collective shuffle memory among all running tasks. In the new approach, each thread requests a chunk of memory whenever its map is about to multiplicatively grow. If there is sufficient memory in the global pool, the thread allocates it and grows its map. Otherwise, it spills.
\| \| \| \| \| \| \| \| \|	A danger with the previous approach is that a new task may quickly fill up its map before old tasks finish spilling, potentially causing an OOM. This approach prevents this scenario as it favors existing tasks over new tasks; any thread that may step over the boundary of other threads defensively backs off and starts spilling.
\| \| \| \| \| \| \| \| \|	Testing through spark-perf reveals: (1) When no spills have occurred, the performance of external sorting using this memory management approach is essentially the same as without external sorting. (2) When one or more spills have occurred, the performance of external sorting is a small multiple (3x) worse
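The request-or-spill protocol can be sketched in Python as below. This is an illustration under stated assumptions, not Spark's code: `ShuffleMemoryPool`, `TaskMap`, the growth factor of 2 and the byte sizes are all hypothetical.

```python
import threading


class ShuffleMemoryPool:
    """Global pool of shuffle memory shared by all running tasks."""

    def __init__(self, capacity_bytes):
        self._capacity = capacity_bytes
        self._used = 0
        self._lock = threading.Lock()

    def try_acquire(self, num_bytes):
        """Reserve num_bytes if the pool has room; never blocks."""
        with self._lock:
            if self._used + num_bytes > self._capacity:
                return False
            self._used += num_bytes
            return True

    def release(self, num_bytes):
        with self._lock:
            self._used = max(0, self._used - num_bytes)


class TaskMap:
    """Per-task map that asks the pool before growing multiplicatively."""

    GROWTH_FACTOR = 2  # assumption for this sketch

    def __init__(self, pool, initial_bytes):
        self._pool = pool
        self._initial = initial_bytes
        self.reserved = 0
        self.spill_count = 0

    def grow_or_spill(self):
        target = self.reserved * self.GROWTH_FACTOR if self.reserved else self._initial
        extra = target - self.reserved
        if self._pool.try_acquire(extra):
            self.reserved = target
            return "grew"
        # Not enough room: back off, give the memory back, and spill.
        self._pool.release(self.reserved)
        self.reserved = 0
        self.spill_count += 1
        return "spilled"


pool = ShuffleMemoryPool(capacity_bytes=100)
old_task = TaskMap(pool, initial_bytes=20)
new_task = TaskMap(pool, initial_bytes=20)

outcomes = [old_task.grow_or_spill() for _ in range(3)]  # 20 -> 40 -> 80 bytes
outcomes.append(new_task.grow_or_spill())                # 20 bytes still fit
outcomes.append(new_task.grow_or_spill())                # 40 would overflow the pool
print(outcomes)  # ['grew', 'grew', 'grew', 'grew', 'spilled']
```

In this run the older task keeps its 80 bytes while the newer task is the one that backs off, which mirrors the stated preference for existing tasks over new ones.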
\| \| * \| \| \| \| \| \|	Merge github.com:apache/incubator-spark	Andrew Or	2014-01-09	293	-2974/+5557
\| \| \|\ \ \ \ \ \ \	Conflicts: core/src/main/scala/org/apache/spark/SparkEnv.scala, streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
\| \| * \| \| \| \| \| \| \|	Get SparkConf from SparkEnv, rather than creating new ones	Andrew Or	2014-01-07	3	-6/+6
\| \| \| \| \| \| \| \| \| \|
\| \| * \| \| \| \| \| \| \|	Use AtomicInteger for numRunningTasks	Andrew Or	2014-01-04	1	-12/+7
\| \| \| \| \| \| \| \| \| \|
\| \| * \| \| \| \| \| \| \|	Address Mark's comments	Andrew Or	2014-01-04	3	-18/+13
\| \| \| \| \| \| \| \| \| \|
\| \| * \| \| \| \| \| \| \|	Assign spill threshold as a fraction of maximum memory	Andrew Or	2014-01-04	5	-33/+81
\| \| \| \| \| \| \| \| \| \|	Further, divide this threshold by the number of tasks running concurrently. Note that this does not guard against the following scenario: a new task quickly fills up its share of the memory before old tasks finish spilling their contents, in which case the total memory used by such maps may exceed what was specified. Currently, spark.shuffle.safetyFraction mitigates the effect of this.
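The arithmetic behind a per-task threshold can be sketched as follows. The way the safety fraction is combined with the memory fraction here is an assumption, as are the function name and the sample values; only the idea of taking a fraction of maximum memory and dividing by the number of running tasks comes from the commit message.

```python
def per_task_spill_threshold(max_memory_bytes, memory_fraction,
                             safety_fraction, num_running_tasks):
    """Return the number of bytes one task's map may use before spilling.

    Assumed formula: max memory * memory fraction * safety fraction,
    split evenly among the tasks that are currently running.
    """
    if num_running_tasks < 1:
        raise ValueError("num_running_tasks must be at least 1")
    shuffle_bytes = max_memory_bytes * memory_fraction * safety_fraction
    return int(shuffle_bytes // num_running_tasks)


# Hypothetical numbers: 1 GiB heap, 0.3 memory fraction, 0.8 safety fraction.
one_task = per_task_spill_threshold(1 << 30, 0.3, 0.8, num_running_tasks=1)
four_tasks = per_task_spill_threshold(1 << 30, 0.3, 0.8, num_running_tasks=4)
print(one_task, four_tasks)
```

The threshold shrinks as more tasks run concurrently. A task that started when fewer tasks were running may already hold more than its new share, which is the gap the safety fraction is meant to soften.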
\| \| * \| \| \| \| \| \| \|	Remove unnecessary ClassTag's	Andrew Or	2014-01-03	2	-7/+4
\| \| \| \| \| \| \| \| \| \|
\| \| * \| \| \| \| \| \| \|	Refactor using SparkConf	Andrew Or	2014-01-03	4	-19/+21
\| \| \| \| \| \| \| \| \| \|
\| \| * \| \| \| \| \| \| \|	Merge remote-tracking branch 'spark/master'	Andrew Or	2014-01-02	182	-1764/+3172
\| \| \|\ \ \ \ \ \ \ \	Conflicts: core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala
\| \| * \| \| \| \| \| \| \| \|	TempBlockId takes UUID and is explicitly non-serializable	Aaron Davidson	2014-01-02	2	-5/+6
\| \| \| \| \| \| \| \| \| \| \|
\| \| * \| \| \| \| \| \| \| \|	Simplify ExternalAppendOnlyMap on the assumption that the mergeCombiners function is specified	Andrew Or	2014-01-01	3	-135/+53
\| \| \| \| \| \| \| \| \| \| \|
\| \| * \| \| \| \| \| \| \| \|	Merge branch 'master' of github.com:andrewor14/incubator-spark	Andrew Or	2013-12-31	4	-9/+9
\| \| \|\ \ \ \ \ \ \ \ \
\| \| \| * \| \| \| \| \| \| \| \|	Rename IntermediateBlockId to TempBlockId	Aaron Davidson	2013-12-31	4	-9/+9
\| \| \| \| \| \| \| \| \| \| \| \|
\| \| * \| \| \| \| \| \| \| \| \|	Address Patrick's and Reynold's comments	Andrew Or	2013-12-31	1	-49/+71
\| \| \|/ / / / / / / / /
\| \| * \| \| \| \| \| \| \| \|	Merge branch 'master' of github.com:andrewor14/incubator-spark	Andrew Or	2013-12-31	3	-97/+71
\| \| \|\ \ \ \ \ \ \ \ \
\| \| \| * \| \| \| \| \| \| \| \|	Add new line at end of file	Aaron Davidson	2013-12-30	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|
\| \| \| * \| \| \| \| \| \| \| \|	Refactor SamplingSizeTracker into SizeTrackingAppendOnlyMap	Aaron Davidson	2013-12-30	3	-97/+71
\| \| \| \| \| \| \| \| \| \| \| \|
\| \| * \| \| \| \| \| \| \| \| \|	Add support and test for null keys in ExternalAppendOnlyMap	Andrew Or	2013-12-31	4	-32/+139
\| \| \| \| \| \| \| \| \| \| \| \|	Also add safeguard against use of destructively sorted AppendOnlyMap