Commit message | Author | Age | Files | Lines
* Merge branch 'master' into scala-2.10 (Raymond Liu, 2013-11-13, 259 files changed, -5146/+10776)
|\
| * Merge pull request #148 from squito/include_appId (Reynold Xin, 2013-11-07, 3 files changed, -2/+22)
| |\
| | |     Include appId in executor cmd line args
| | |
| | |     Add the appId back into the executor cmd line args. I also made a pretty lame regression test, just to make sure it doesn't get dropped in the future. Not sure it will run on the build server, though, because `ExecutorRunner.buildCommandSeq()` expects to be able to run the scripts in `bin`.
| | * fix formatting (Imran Rashid, 2013-11-07, 1 file changed, -3/+5)
| | |
| | * very basic regression test to make sure appId doesn't get dropped in future (Imran Rashid, 2013-11-07, 1 file changed, -0/+18)
| | |
| | * include the appId in the cmd line arguments to Executors (Imran Rashid, 2013-11-07, 2 files changed, -2/+2)
| | |
| * | Merge pull request #23 from jerryshao/multi-user (Reynold Xin, 2013-11-06, 5 files changed, -379/+417)
| |\ \
| | |/
| |/|     Add Spark multi-user support for standalone mode and Mesos
| | |
| | |     This PR adds multi-user support for Spark in both standalone mode and Mesos (coarse- and fine-grained) mode. A user can specify the name of the user who submits the app through the environment variable `SPARK_USER`, or use the default one. Executors will communicate with Hadoop using the specified user name. I also fixed a bug in JobLogger that occurred when a different user wrote the job log to a folder without the right file permissions. I separated the previous [PR750](https://github.com/mesos/spark/pull/750) into two PRs; in this PR I only solve the multi-user support problem. I will try to solve the security auth problem in a subsequent PR, because security auth is a complicated problem, especially for a long-running app like Shark Server (both the Kerberos TGT and the HDFS delegation token should be renewed or re-created throughout the app's run time).
| | * Add Spark multi-user support for standalone mode and Mesos (jerryshao, 2013-11-07, 5 files changed, -379/+417)
| |/
| * Merge pull request #144 from liancheng/runjob-clean (Reynold Xin, 2013-11-06, 1 file changed, -2/+1)
| |\
| | |     Removed unused return value in SparkContext.runJob
| | |
| | |     The return type of this `runJob` version is `Unit`:
| | |
| | |         def runJob[T, U: ClassManifest](
| | |             rdd: RDD[T],
| | |             func: (TaskContext, Iterator[T]) => U,
| | |             partitions: Seq[Int],
| | |             allowLocal: Boolean,
| | |             resultHandler: (Int, U) => Unit) {
| | |           ...
| | |         }
| | |
| | |     It's obviously unnecessary to "return" `result`.
| | * Removed unused return value in SparkContext.runJob (Lian, Cheng, 2013-11-06, 1 file changed, -2/+1)
| | |
| * Merge pull request #145 from aarondav/sls-fix (Reynold Xin, 2013-11-06, 1 file changed, -1/+1)
| |\
| | |     Attempt to fix SparkListenerSuite breakage
| | |
| | |     Could not reproduce locally, but this test could have been flaky if the build machine was too fast, due to a typo. (Index 0 is intentionally slowed down to ensure the total time is >= 1 ms.) This should be merged into branch-0.8 as well.
| | * Attempt to fix SparkListenerSuite breakage (Aaron Davidson, 2013-11-06, 1 file changed, -1/+1)
| |/
| |     Could not reproduce locally, but this test could have been flaky if the build machine was too fast.
| * Merge pull request #143 from rxin/scheduler-hang (Reynold Xin, 2013-11-05, 1 file changed, -3/+11)
| |\
| | |     Ignore a task status update if the executor doesn't exist anymore.
| | |
| | |     Otherwise, if the scheduler receives a task status update message after the executor has been removed, the scheduler would hang. It is pretty hard to add unit tests for this right now because it is hard to mock the cluster scheduler. We should do that once @kayousterhout finishes merging the local scheduler and the cluster scheduler.
| | * Ignore a task status update if the executor doesn't exist anymore. (Reynold Xin, 2013-11-05, 1 file changed, -3/+11)
| |/
| * Merge pull request #142 from liancheng/dagscheduler-pattern-matching (Reynold Xin, 2013-11-05, 1 file changed, -7/+6)
| |\
| | |     Using case class deep matching to simplify code in DAGScheduler.processEvent
| | |
| | |     Since all `XxxEvent`s pushed into `DAGScheduler.eventQueue` are case classes, deep pattern matching is more convenient for extracting the event object's components.
| | * Using compact case class pattern matching syntax to simplify code in DAGScheduler.processEvent (Lian, Cheng, 2013-11-05, 1 file changed, -7/+6)
| |/
| * Merge pull request #139 from aarondav/shuffle-next (Reynold Xin, 2013-11-04, 3 files changed, -36/+4)
| |\
| | |     Never store shuffle blocks in BlockManager
| | |
| | |     After the BlockId refactor (PR #114), it became very clear that ShuffleBlocks are of no use within BlockManager (they had a no-arg constructor!). This patch completely eliminates them, saving us around 100-150 bytes per shuffle block. The total, system-wide overhead per shuffle block is now a flat 8 bytes, excluding state saved by the MapOutputTracker. Note: this should *not* be merged directly into 0.8.0 -- see #138.
| | * Never store shuffle blocks in BlockManager (Aaron Davidson, 2013-11-04, 3 files changed, -36/+4)
| | |
| | |     After the BlockId refactor (PR #114), it became very clear that ShuffleBlocks are of no use within BlockManager (they had a no-arg constructor!). This patch completely eliminates them, saving us around 100-150 bytes per shuffle block. The total, system-wide overhead per shuffle block is now a flat 8 bytes, excluding state saved by the MapOutputTracker.
| * | Merge pull request #128 from shimingfei/joblogger-doc (Reynold Xin, 2013-11-04, 2 files changed, -26/+119)
| |/
| |     Add javadoc to JobLogger, and some small fixes, against SPARK-941
| |
| |     Add javadoc to JobLogger, output more info for RDD, and modify recordStageDepGraph to avoid outputting duplicate stage dependency information.
| |
| |     (cherry picked from commit 518cf22eb2436d019e4f7087a38080ad4a20df58)
| |     Signed-off-by: Reynold Xin <rxin@apache.org>
| * Merge pull request #130 from aarondav/shuffle (Reynold Xin, 2013-11-04, 11 files changed, -110/+333)
| |\
| | |     Memory-optimized shuffle file consolidation
| | |
| | |     Reduces the overhead of each shuffle block for consolidation from >300 bytes to 8 bytes (1 primitive Long). Verified via profiler testing with 1 million shuffle blocks: net overhead was ~8,400,000 bytes. Despite the memory-optimized implementation incurring extra CPU overhead, the runtime of the shuffle phase in this test was only around 2% slower, while the reduce phase was 40% faster, compared to not using any shuffle file consolidation.
| | |
| | |     This is accomplished by replacing the map from ShuffleBlockId to FileSegment (i.e., block id to where it's located), which had high overhead due to being a gigantic, timestamped, concurrent map, with a more space-efficient structure. Namely, the following are introduced (I have omitted the word "Shuffle" from some names for clarity):
| | |
| | |     **ShuffleFile** - there is one ShuffleFile per consolidated shuffle file on disk. We store an array of offsets into the physical shuffle file for each ShuffleMapTask that wrote into the file. This is sufficient to reconstruct FileSegments for mappers that are in the file.
| | |
| | |     **FileGroup** - contains a set of ShuffleFiles, one per reducer, that a MapTask can use to write its output. There is one FileGroup created per _concurrent_ MapTask. The FileGroup contains an array of the mapIds that have been written to all files in the group. The positions of elements in this array map directly onto the positions in each ShuffleFile's offsets array.
| | |
| | |     In order to locate the FileSegment associated with a BlockId, we have another structure which maps each reducer to the set of ShuffleFiles that were created for it. (There will be as many ShuffleFiles per reducer as there are FileGroups.) To look up a given ShuffleBlockId (shuffleId, reducerId, mapId), we thus search through all ShuffleFiles associated with that reducer.
| | |
| | |     As a time optimization, we ensure that FileGroups are only reused for MapTasks with monotonically increasing mapIds. This allows us to perform a binary search to locate a mapId inside a group, and also enables potential future optimizations (based on the usual monotonic access order).
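[Editor's note] The lookup scheme PR #130 describes (per-mapper offsets into a consolidated file, with binary search enabled by monotonically increasing mapIds) can be sketched roughly as follows. This is an illustrative toy, not Spark's actual ShuffleBlockManager: the names `ConsolidatedFile`, `recordWrite`, and the `FileSegment` fields are stand-ins for this sketch.

```scala
import java.util.Arrays

// Where a single map task's output lives inside a consolidated file.
case class FileSegment(file: String, offset: Long, length: Long)

class ConsolidatedFile(val path: String) {
  // Parallel arrays: mapIds(i) wrote the bytes starting at offsets(i).
  // mapIds stay sorted because tasks are appended in increasing order.
  private var mapIds = Array.empty[Int]
  private var offsets = Array.empty[Long]
  private var end = 0L // current end of the physical file

  def recordWrite(mapId: Int, numBytes: Long): Unit = {
    require(mapIds.isEmpty || mapId > mapIds.last,
      "mapIds must be monotonically increasing")
    mapIds = mapIds :+ mapId
    offsets = offsets :+ end
    end += numBytes
  }

  // Binary search is valid only because of the monotonicity invariant above.
  def getSegment(mapId: Int): Option[FileSegment] = {
    val i = Arrays.binarySearch(mapIds, mapId)
    if (i < 0) None
    else {
      val next = if (i + 1 < offsets.length) offsets(i + 1) else end
      Some(FileSegment(path, offsets(i), next - offsets(i)))
    }
  }
}
```

The per-block cost here is one Long offset (plus the mapId), which mirrors the "flat 8 bytes per shuffle block" figure quoted in the PR description.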
| | * Minor cleanup in ShuffleBlockManager (Aaron Davidson, 2013-11-04, 1 file changed, -4/+4)
| | |
| | * Refactor ShuffleBlockManager to reduce public interface (Aaron Davidson, 2013-11-04, 3 files changed, -178/+123)
| | |
| | |     - ShuffleBlocks has been removed and replaced by ShuffleWriterGroup.
| | |     - ShuffleWriterGroup no longer contains a reference to a ShuffleFileGroup.
| | |     - ShuffleFile has been removed and its contents are now within ShuffleFileGroup.
| | |     - ShuffleBlockManager.forShuffle has been replaced by a more stateful forMapTask.
| | * Add javadoc and remove unused code (Aaron Davidson, 2013-11-03, 2 files changed, -1/+1)
| | |
| | * Clean up test files properly (Aaron Davidson, 2013-11-03, 1 file changed, -5/+9)
| | |
| | |     For some reason, even calling java.nio.Files.createTempDirectory().getFile.deleteOnExit() does not delete the directory on exit. Guava's analogous function seems to work, however.
| | * use OpenHashMap, remove monotonicity requirement, fix failure bug (Aaron Davidson, 2013-11-03, 4 files changed, -41/+26)
| | |
| | * Address Reynold's comments (Aaron Davidson, 2013-11-03, 1 file changed, -12/+16)
| | |
| | * Fix test breakage (Aaron Davidson, 2013-11-03, 1 file changed, -1/+1)
| | |
| | * Add documentation and address other comments (Aaron Davidson, 2013-11-03, 2 files changed, -26/+35)
| | |
| | * Fix weird bug with specialized PrimitiveVector (Aaron Davidson, 2013-11-03, 1 file changed, -1/+5)
| | |
| | * Address minor comments (Aaron Davidson, 2013-11-03, 3 files changed, -8/+9)
| | |
| | * Memory-optimized shuffle file consolidation (Aaron Davidson, 2013-11-03, 8 files changed, -77/+348)
| |/
| |     The overhead of each shuffle block for consolidation has been reduced from >300 bytes to 8 bytes (1 primitive Long). Verified via profiler testing with 1 million shuffle blocks: net overhead was ~8,400,000 bytes. Despite the memory-optimized implementation incurring extra CPU overhead, the runtime of the shuffle phase in this test was only around 2% slower, while the reduce phase was 40% faster, compared to not using any shuffle file consolidation.
| * Merge pull request #70 from rxin/hash1 (Reynold Xin, 2013-11-03, 9 files changed, -7/+1108)
| |\
| | |     Fast, memory-efficient hash set and hash table implementations optimized for primitive data types.
| | |
| | |     This pull request adds two hash table implementations optimized for primitive data types. For primitive types, the new hash tables are much faster than the current Spark AppendOnlyMap (3X faster; note that the current AppendOnlyMap is already much better than the Java map) while using much less space (1/4 of the space).
| | |
| | |     Details: This PR first adds an open hash set implementation (OpenHashSet) optimized for primitive types (using Scala's specialization feature). This OpenHashSet is designed to serve as a building block for more advanced structures. It is currently used to build the following two hash tables, but can be used in the future to build multi-valued hash tables as well (GraphX has this use case). Note that there are some peculiarities in the code for working around some Scala compiler bugs.
| | |
| | |     Building on top of OpenHashSet, this PR adds two different hash table implementations:
| | |
| | |     1. OpenHashMap: for nullable keys, with optional specialization for primitive values
| | |     2. PrimitiveKeyOpenHashMap: for primitive keys that are not nullable, with optional specialization for primitive values
| | |
| | |     I tested the update speed of these two implementations using the changeValue function (which is what Aggregator and cogroup would use). Runtime relative to AppendOnlyMap for inserting 10 million items:
| | |
| | |     Int to Int: ~30%
| | |     java.lang.Integer to java.lang.Integer: ~100%
| | |     Int to java.lang.Integer: ~50%
| | |     java.lang.Integer to Int: ~85%
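[Editor's note] The core idea behind an open hash set for primitives, as described in PR #70 above, is storing keys directly in a flat primitive array and resolving collisions by probing, avoiding per-entry boxing and node objects. The following is a minimal illustrative sketch of that probing idea, not Spark's actual OpenHashSet (which uses a bitset, power-of-two growth, and `@specialized` type parameters); `TinyOpenHashSet` and its members are names invented for this sketch.

```scala
// A toy open-addressing set for Int keys with linear probing.
// For simplicity, 0 marks an empty slot, so this toy cannot store the key 0
// (a real implementation tracks that case separately) and it never grows.
class TinyOpenHashSet(capacity: Int = 64) {
  private val data = new Array[Int](capacity) // flat primitive storage, no boxing
  private var count = 0

  // Walk forward from the hash slot until we hit the key or an empty slot.
  private def probe(k: Int): Int = {
    var pos = (k.hashCode & Int.MaxValue) % capacity
    while (data(pos) != 0 && data(pos) != k) pos = (pos + 1) % capacity
    pos
  }

  def add(k: Int): Unit = {
    require(k != 0 && count < capacity - 1)
    val pos = probe(k)
    if (data(pos) == 0) { data(pos) = k; count += 1 }
  }

  def contains(k: Int): Boolean = k != 0 && data(probe(k)) == k

  def size: Int = count
}
```

The space win the PR cites (roughly 1/4 of AppendOnlyMap for primitives) comes from exactly this layout: an `Array[Int]` costs 4 bytes per slot, versus a boxed `java.lang.Integer` plus object header and reference per entry.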
| | * Code review feedback. (Reynold Xin, 2013-11-03, 7 files changed, -25/+100)
| | |
| | * Fixed a bug that uses twice the amount of memory for the primitive arrays due to a Scala compiler bug. Also addressed Matei's code review comment. (Reynold Xin, 2013-11-02, 9 files changed, -30/+38)
| | * Merge branch 'master' into hash1 (Reynold Xin, 2013-11-02, 147 files changed, -3776/+4261)
| | |\ | | |/ | |/|
| * | Merge pull request #133 from Mistobaan/link_fix (Reynold Xin, 2013-11-02, 1 file changed, -1/+1)
| |\ \
| | | |     update default github
| | * | update default github (Fabrizio (Misto) Milo, 2013-11-01, 1 file changed, -1/+1)
| | | |
| * | | Merge pull request #134 from rxin/readme (Reynold Xin, 2013-11-02, 1 file changed, -1/+1)
| |\ \ \
| | | | |     Fixed a typo in Hadoop version in README.
| | * | | Fixed a typo in Hadoop version in README. (Reynold Xin, 2013-11-02, 1 file changed, -1/+1)
| |/ / /
| * | Merge pull request #132 from Mistobaan/doc_fix (Reynold Xin, 2013-11-01, 1 file changed, -1/+1)
| |\ \ \
| | |/ /
| |/| |     fix persistent-hdfs
| | * | fix persistent-hdfs (Fabrizio (Misto) Milo, 2013-11-01, 1 file changed, -1/+1)
| |/ /
| * | Merge pull request #129 from velvia/2013-11/document-local-uris (Matei Zaharia, 2013-11-01, 2 files changed, -2/+15)
| |\ \
| | | |     Document & finish support for local: URIs
| | | |
| | | |     Documents all the supported URI schemes for addJar / addFile on the Cluster Overview page, and adds support for the local: URI scheme to addFile.
| | * | Add local: URI support to addFile as well (Evan Chan, 2013-11-01, 1 file changed, -1/+2)
| | | |
| | * | Document all the URIs for addJar/addFile (Evan Chan, 2013-11-01, 1 file changed, -1/+13)
| |/ /
| * | Merge pull request #117 from stephenh/avoid_concurrent_modification_exception (Matei Zaharia, 2013-10-30, 2 files changed, -3/+12)
| |\ \
| | | |     Handle ConcurrentModificationExceptions in SparkContext init.
| | | |
| | | |     System.getProperties.toMap will fail fast when concurrently modified, and it seems like some other thread started by SparkContext does a System.setProperty during its initialization. Handle this by simply looping on ConcurrentModificationException, which seems the safest approach, since the non-fail-fast methods (Hashtable.entrySet) have undefined behavior under concurrent modification.
| | * | Avoid match errors when filtering for spark.hadoop settings. (Stephen Haberman, 2013-10-30, 1 file changed, -2/+4)
| | | |
| | * | Use Properties.clone() instead. (Stephen Haberman, 2013-10-29, 1 file changed, -5/+2)
| | | |
| | * | Handle ConcurrentModificationExceptions in SparkContext init. (Stephen Haberman, 2013-10-27, 2 files changed, -3/+13)
| | | |
| | | |     System.getProperties.toMap will fail fast when concurrently modified, and it seems like some other thread started by SparkContext does a System.setProperty during its initialization. Handle this by simply looping on ConcurrentModificationException, which seems the safest approach, since the non-fail-fast methods (Hashtable.entrySet) have undefined behavior under concurrent modification.
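[Editor's note] The "loop on ConcurrentModificationException" approach described in PR #117 can be sketched as follows. This is an illustrative sketch, not the actual patch; `snapshotSystemProperties` is a name invented here, and it uses the modern `scala.jdk.CollectionConverters` import (the 2013 code used older converters).

```scala
import java.util.ConcurrentModificationException
import scala.jdk.CollectionConverters._

// Retry until a full pass over the system properties completes without
// another thread's System.setProperty tripping the fail-fast iterator.
def snapshotSystemProperties(): Map[String, String] = {
  while (true) {
    try {
      // Iterating the underlying Hashtable can throw
      // ConcurrentModificationException mid-iteration.
      return System.getProperties.asScala.toMap
    } catch {
      case _: ConcurrentModificationException => // contended; loop and retry
    }
  }
  throw new IllegalStateException("unreachable")
}
```

The retry is safe because each attempt starts iteration from scratch, whereas the non-fail-fast views would silently yield an inconsistent mix of old and new entries. (The later commits in this PR switch to `Properties.clone()`, which sidesteps the race by copying under the Hashtable's own synchronization.)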
| * | | Merge pull request #126 from kayousterhout/local_fix (Matei Zaharia, 2013-10-30, 1 file changed, -1/+1)
| |\ \ \
| | | | |     Fixed incorrect log message in local scheduler
| | | | |
| | | | |     This change is especially relevant at the moment, because some users are seeing this failure, and the log message is misleading/incorrect (because for the tests, the max failures is set to 0, not 4).
| | * | | Fixed incorrect log message in local scheduler (Kay Ousterhout, 2013-10-30, 1 file changed, -1/+1)
| | | | |
| * | | | Merge pull request #124 from tgravescs/sparkHadoopUtilFix (Matei Zaharia, 2013-10-30, 8 files changed, -38/+43)
| |\ \ \ \
| | | | | |     Pull SparkHadoopUtil out of SparkEnv (jira SPARK-886)
| | | | | |
| | | | | |     Having the logic to initialize the correct SparkHadoopUtil in SparkEnv prevents it from being used until after the SparkContext is initialized. This causes issues like https://spark-project.atlassian.net/browse/SPARK-886. It also makes it hard to use in singleton objects. For instance, I want to use it in the security code.