path: root/core
Commit message | Author | Date | Files | Lines
* Merge pull request #228 from pwendell/master | Patrick Wendell | 2013-12-05 | 2 | -3/+13
      Document missing configs and set shuffle consolidation to false.
  * Forcing shuffle consolidation in DiskBlockManagerSuite | Patrick Wendell | 2013-12-05 | 1 | -2/+12
  * Document missing configs and set shuffle consolidation to false. | Patrick Wendell | 2013-12-04 | 1 | -1/+1
* Merge pull request #199 from harveyfeng/yarn-2.2 | Matei Zaharia | 2013-12-04 | 2 | -8/+4
      Hadoop 2.2 migration. Includes support for the YARN API stabilized in the Hadoop 2.2 release, and a few style patches. Short description for each set of commits:
      a98f5a0 - "Misc style changes in the 'yarn' package"
      a67ebf4 - "A few more style fixes in the 'yarn' package"
      Both of these are some minor style changes, such as fixing lines over 100 chars, to the existing YARN code.
      ab8652f - "Add a 'new-yarn' directory ..."
      Copies everything from `SPARK_HOME/yarn` to `SPARK_HOME/new-yarn`. No actual code changes here.
      4f1c3fa - "Hadoop 2.2 YARN API migration ..."
      API patches to code in the `SPARK_HOME/new-yarn` directory. There are a few more small style changes mixed in, too. Based on @colorant's Hadoop 2.2 support for the scala-2.10 branch in #141.
      a1a1c62 - "Add optional Hadoop 2.2 settings in sbt build ..."
      If Spark should be built against Hadoop 2.2, then: a) the `org.apache.spark.deploy.yarn` package will be compiled from the `new-yarn` directory; b) Protobuf v2.5 will be used as a Spark dependency, since Hadoop 2.2 depends on it. Also, Spark will be built against a version of Akka v2.0.5 that's built against Protobuf 2.5, named `akka-2.0.5-protobuf-2.5`. The patched Akka is here: https://github.com/harveyfeng/akka/tree/2.0.5-protobuf-2.5, and was published to local Ivy during testing.
      There's also a new boolean environment variable, `SPARK_IS_NEW_HADOOP`, that users can manually set if their `SPARK_HADOOP_VERSION` specification does not start with `2.2`, which is how the build file tries to detect a 2.2 version. Not sure if this is necessary or done in the best way, though...
  * Fix pom.xml for maven build | Raymond Liu | 2013-12-03 | 1 | -7/+3
  * Merge remote-tracking branch 'origin/master' into yarn-2.2 | Harvey Feng | 2013-11-26 | 23 | -190/+984
      Conflicts:
          yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
  * Hadoop 2.2 YARN API migration for `SPARK_HOME/new-yarn` | Harvey Feng | 2013-11-23 | 1 | -1/+1
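The a1a1c62 build change in the #199 entry above hinges on detecting a Hadoop 2.2 build. Below is a hedged sketch, in plain Scala rather than the actual SparkBuild.scala, of that switch: a SPARK_HADOOP_VERSION starting with 2.2 (or a manual SPARK_IS_NEW_HADOOP override) selects the new-yarn sources and the newer Protobuf. The default version strings and val names are illustrative assumptions, not the real build definitions.

      // Hedged sketch of the build-time switch described above; not SparkBuild.scala.
      object HadoopBuildSwitchSketch {
        // Version requested by the user, e.g. "2.2.0"; the fallback is illustrative.
        val hadoopVersion: String = sys.env.getOrElse("SPARK_HADOOP_VERSION", "1.0.4")

        // Treat the build as "new Hadoop" if the version starts with 2.2,
        // or if the user forces it via SPARK_IS_NEW_HADOOP.
        val isNewHadoop: Boolean =
          sys.env.get("SPARK_IS_NEW_HADOOP").map(_.toBoolean)
            .getOrElse(hadoopVersion.startsWith("2.2"))

        // Pick the yarn source directory and the Protobuf dependency accordingly
        // (the old Protobuf version here is an assumption for the sketch).
        val yarnSourceDir: String = if (isNewHadoop) "new-yarn" else "yarn"
        val protobufVersion: String = if (isNewHadoop) "2.5.0" else "2.4.1"

        def main(args: Array[String]): Unit = {
          println(s"yarn dir: $yarnSourceDir, protobuf: $protobufVersion")
        }
      }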
* Merge pull request #227 from pwendell/master | Patrick Wendell | 2013-12-04 | 1 | -16/+13
      Fix small bug in web UI and minor clean-up. There was a bug where sorting order didn't work correctly for write time metrics. I also cleaned up some earlier code that fixed the same issue for read and write bytes.
  * Fix small bug in web UI and minor clean-up. | Patrick Wendell | 2013-12-04 | 1 | -16/+13
      There was a bug where sorting order didn't work correctly for write time metrics. I also cleaned up some earlier code that fixed the same issue for read and write bytes.
* Add missing space after "Serialized" in StorageLevel | Andrew Ash | 2013-12-04 | 1 | -1/+1
      Current code creates outputs like:
          scala> res0.getStorageLevel.description
          res2: String = Serialized1x Replicated
* Merge pull request #223 from rxin/transient | Matei Zaharia | 2013-12-04 | 1 | -5/+5
      Mark partitioner, name, and generator field in RDD as @transient. As part of the effort to reduce serialized task size.
  * Marked doCheckpointCalled as transient. | Reynold Xin | 2013-12-03 | 1 | -2/+2
  * Mark partitioner, name, and generator field in RDD as @transient. | Reynold Xin | 2013-12-02 | 1 | -3/+3
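For context on the #223 entries above, here is a minimal sketch of why marking driver-side bookkeeping fields @transient shrinks serialized tasks; the class and field names are illustrative, not Spark's actual RDD fields.

      import java.io.{ByteArrayOutputStream, ObjectOutputStream}

      // Fields marked @transient are skipped by Java serialization, so they do not
      // travel with every serialized task closure that captures the object.
      class TrackedData(val id: Int) extends Serializable {
        @transient var name: String = "driver-side label"          // not serialized
        @transient var generator: String = "driver-side metadata"  // not serialized
        val payload: Array[Byte] = new Array[Byte](16)             // serialized
      }

      object TransientDemo {
        def serializedSize(obj: AnyRef): Int = {
          val bytes = new ByteArrayOutputStream()
          val out = new ObjectOutputStream(bytes)
          out.writeObject(obj)
          out.close()
          bytes.size()
        }

        def main(args: Array[String]): Unit = {
          // The transient fields contribute nothing to the serialized form.
          println(s"serialized size: ${serializedSize(new TrackedData(1))} bytes")
        }
      }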
* Merge pull request #217 from aarondav/mesos-urls | Reynold Xin | 2013-12-02 | 3 | -115/+260
      Re-enable zk:// urls for Mesos SparkContexts. This was broken in PR #71 when we explicitly disallowed anything that didn't fit a mesos:// url. Although it is not really clear that a zk:// url should match Mesos, it is what the docs say and it is necessary for backwards compatibility.
      Additionally added a unit test for the creation of all types of TaskSchedulers. Since YARN and Mesos are not necessarily available in the system, they are allowed to pass as long as the YARN/Mesos code paths are exercised.
  * Add spaces between tests | Aaron Davidson | 2013-11-29 | 1 | -0/+5
  * Add unit test for SparkContext scheduler creation | Aaron Davidson | 2013-11-28 | 3 | -116/+255
      Since YARN and Mesos are not necessarily available in the system, they are allowed to pass as long as the YARN/Mesos code paths are exercised.
  * Re-enable zk:// urls for Mesos SparkContexts | Aaron Davidson | 2013-11-28 | 1 | -5/+6
      This was broken in PR #71 when we explicitly disallowed anything that didn't fit a mesos:// url. Although it is not really clear that a zk:// url should match Mesos, it is what the docs say and it is necessary for backwards compatibility.
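As a usage illustration of the zk:// form that #217 re-enables, this sketch constructs a SparkContext against a ZooKeeper-tracked Mesos master. The quorum hosts, port, and znode path are placeholders, and the two-argument SparkContext constructor is assumed from the API of that era.

      import org.apache.spark.SparkContext

      object ZkMesosUrlSketch {
        def main(args: Array[String]): Unit = {
          // zk:// points Spark at the ZooKeeper ensemble that tracks the Mesos masters.
          val sc = new SparkContext("zk://zk1:2181,zk2:2181/mesos", "zk-url-demo")
          println(sc.parallelize(1 to 100).sum())
          sc.stop()
        }
      }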
* Merge pull request #219 from sundeepn/schedulerexception | Reynold Xin | 2013-12-01 | 1 | -1/+11
      Scheduler quits when newStage fails. The current scheduler thread does not handle exceptions from newStage while launching new jobs. The thread fails on any exception that gets triggered at that level, leaving the cluster hanging with no scheduler.
  * Log exception in scheduler in addition to passing it to the caller. | Sundeep Narravula | 2013-12-01 | 1 | -2/+4
      Code styling changes.
  * Scheduler quits when createStage fails. | Sundeep Narravula | 2013-11-30 | 1 | -1/+9
      The current scheduler thread does not handle exceptions from createStage while launching new jobs. The thread fails on any exception that gets triggered at that level, leaving the cluster hanging with no scheduler.
* More comments | Lian, Cheng | 2013-11-29 | 1 | -0/+3
* Updated some inline comments in DAGScheduler | Lian, Cheng | 2013-11-29 | 1 | -5/+26
* Bugfix: SPARK-965 & SPARK-966 | Lian, Cheng | 2013-11-28 | 3 | -25/+40
      SPARK-965: https://spark-project.atlassian.net/browse/SPARK-965
      SPARK-966: https://spark-project.atlassian.net/browse/SPARK-966
      - Add back DAGScheduler.start(); eventProcessActor is created and started here. Notice that the function is only called by SparkContext.
      - Cancel the scheduled stage resubmission task when stopping eventProcessActor.
      - Add a new DAGSchedulerEvent, ResubmitFailedStages. This event message is sent by the scheduled stage resubmission task to eventProcessActor. In this way, DAGScheduler.resubmitFailedStages is guaranteed to be executed from the same thread that runs DAGScheduler.processEvent. Please refer to the discussion in SPARK-966 for details.
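The SPARK-966 fix above relies on a single event-processing thread handling both job events and the resubmission trigger. A minimal sketch of that threading pattern, using a plain queue and thread instead of Spark's actual eventProcessActor, with hypothetical event names:

      import java.util.concurrent.LinkedBlockingQueue

      // Stand-ins for DAGSchedulerEvent; only the threading pattern is the point.
      sealed trait SchedulerEvent
      case object ResubmitFailedStages extends SchedulerEvent
      case class JobSubmitted(jobId: Int) extends SchedulerEvent

      object EventLoopSketch {
        private val eventQueue = new LinkedBlockingQueue[SchedulerEvent]()

        // Every event, including the periodic resubmission, is handled on this one
        // thread, so event processing never races with stage resubmission.
        private val eventThread = new Thread("event-loop") {
          override def run(): Unit = {
            while (true) {
              eventQueue.take() match {
                case ResubmitFailedStages => println("resubmitting failed stages")
                case JobSubmitted(id)     => println(s"processing job $id")
              }
            }
          }
        }

        def post(event: SchedulerEvent): Unit = eventQueue.put(event)

        def main(args: Array[String]): Unit = {
          eventThread.setDaemon(true)
          eventThread.start()
          post(JobSubmitted(0))
          post(ResubmitFailedStages)  // a timer task posts an event instead of calling directly
          Thread.sleep(200)           // give the daemon thread time to drain the queue
        }
      }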
* Merge pull request #210 from haitaoyao/http-timeout | Matei Zaharia | 2013-11-27 | 1 | -2/+8
      add http timeout for httpbroadcast. While pulling task bytecode from the HttpBroadcast server, there's no timeout value set. This may cause spark executor code to hang while other tasks in the same executor process wait for the lock. I have encountered the issue in my cluster. Here's the stacktrace I captured: https://gist.github.com/haitaoyao/7655830
      So add a timeout value to ensure the task fails fast.
  * add http timeout for httpbroadcast | haitao.yao | 2013-11-26 | 1 | -2/+8
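A minimal sketch of the kind of change #210 describes: putting explicit connect and read timeouts on the HTTP connection used to fetch broadcast data, so a stuck server fails the fetch fast instead of hanging the executor thread. The URL, port, and 60-second value are illustrative, not Spark's actual settings.

      import java.net.{HttpURLConnection, URL}

      object HttpTimeoutSketch {
        // Fetch bytes over HTTP with explicit timeouts so the call fails fast
        // instead of blocking the calling thread indefinitely.
        def fetchWithTimeout(urlString: String, timeoutMs: Int): Array[Byte] = {
          val conn = new URL(urlString).openConnection().asInstanceOf[HttpURLConnection]
          conn.setConnectTimeout(timeoutMs)  // fail if the TCP connect stalls
          conn.setReadTimeout(timeoutMs)     // fail if the server stops sending data
          val in = conn.getInputStream
          try {
            Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
          } finally {
            in.close()
          }
        }

        def main(args: Array[String]): Unit = {
          // Hypothetical broadcast server URL; 60 seconds is an illustrative timeout.
          val bytes = fetchWithTimeout("http://driver-host:33000/broadcast_0", 60000)
          println(s"fetched ${bytes.length} bytes")
        }
      }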
* Merge pull request #146 from JoshRosen/pyspark-custom-serializers | Matei Zaharia | 2013-11-26 | 1 | -104/+45
      Custom Serializers for PySpark. This pull request adds support for custom serializers to PySpark. For now, all Python-transformed (or parallelize()d) RDDs are serialized with the same serializer that's specified when creating SparkContext. PySpark includes `PickleSerDe` and `MarshalSerDe` classes for using Python's `pickle` and `marshal` serializers. It's pretty easy to add support for other serializers, although I still need to add instructions on this.
      A few notable changes:
      - The Scala `PythonRDD` class no longer manipulates Pickled objects; data from `textFile` is written to Python as MUTF-8 strings. The Python code performs the appropriate bookkeeping to track which deserializer should be used when reading an underlying JavaRDD. This mechanism could also be used to support other data exchange formats, such as MsgPack.
      - Several magic numbers were refactored into constants.
      - Batching is implemented by wrapping / decorating an unbatched SerDe.
  * Send PySpark commands as bytes instead of strings. | Josh Rosen | 2013-11-10 | 1 | -20/+4
  * Add custom serializer support to PySpark. | Josh Rosen | 2013-11-10 | 1 | -22/+1
      For now, this only adds MarshalSerializer, but it lays the groundwork for supporting other custom serializers. Many of these mechanisms can also be used to support deserialization of different data formats sent by Java, such as data encoded by MsgPack. This also fixes a bug in SparkContext.union().
  * Remove Pickle-wrapping of Java objects in PySpark. | Josh Rosen | 2013-11-03 | 1 | -67/+39
      If we support custom serializers, the Python worker will know what type of input to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and strings.
  * Replace magic lengths with constants in PySpark. | Josh Rosen | 2013-11-03 | 1 | -10/+16
      Write the length of the accumulators section up-front rather than terminating it with a negative length. I find this easier to read.
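To illustrate the framing change described in the commit above, here is a small sketch, in Scala rather than the actual PySpark protocol code, of writing a section with its record count up-front instead of ending it with a negative sentinel; the object and method names are made up for the example.

      import java.io.{ByteArrayOutputStream, DataOutputStream}

      object LengthPrefixSketch {
        // Write the count up-front so the reader knows exactly how many records
        // follow, rather than terminating the section with a negative length.
        def writeSection(out: DataOutputStream, items: Seq[Array[Byte]]): Unit = {
          out.writeInt(items.length)
          items.foreach { bytes =>
            out.writeInt(bytes.length)  // per-record length prefix
            out.write(bytes)
          }
        }

        def main(args: Array[String]): Unit = {
          val buffer = new ByteArrayOutputStream()
          writeSection(new DataOutputStream(buffer),
            Seq("a".getBytes("UTF-8"), "bc".getBytes("UTF-8")))
          println(s"framed ${buffer.size()} bytes")
        }
      }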
* Merge pull request #207 from henrydavidge/master | Matei Zaharia | 2013-11-26 | 4 | -0/+19
      Log a warning if a task's serialized size is very big. As per Reynold's instructions, we now create a warning-level log entry if a task's serialized size is too big. "Too big" is currently defined as 100kb. This warning message is generated at most once for each stage.
  * Emit warning when task size > 100KB | hhd | 2013-11-26 | 4 | -0/+19
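A minimal sketch of the warn-once-per-stage pattern that #207 describes, not Spark's actual DAGScheduler code; the threshold constant and names are illustrative, and println stands in for the logging trait.

      import scala.collection.mutable

      object TaskSizeWarning {
        private val WarnThresholdBytes = 100 * 1024
        private val warnedStages = mutable.HashSet[Int]()

        // Warn at most once per stage when a serialized task exceeds the threshold.
        def maybeWarn(stageId: Int, serializedTask: Array[Byte]): Unit = {
          if (serializedTask.length > WarnThresholdBytes && warnedStages.add(stageId)) {
            println(s"WARN: stage $stageId contains a task of very large size " +
              s"(${serializedTask.length / 1024} KB); the recommended maximum is " +
              s"${WarnThresholdBytes / 1024} KB.")
          }
        }

        def main(args: Array[String]): Unit = {
          maybeWarn(0, new Array[Byte](200 * 1024))  // warns
          maybeWarn(0, new Array[Byte](300 * 1024))  // suppressed: stage 0 already warned
        }
      }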
* [SPARK-963] Wait for SparkListenerBus eventQueue to be empty before checking jobLogger state | Mark Hamstra | 2013-11-26 | 1 | -1/+6
* Merge pull request #209 from pwendell/better-docs | Reynold Xin | 2013-11-26 | 1 | -10/+13
      Improve docs for shuffle instrumentation
  * Improve docs for shuffle instrumentation | Patrick Wendell | 2013-11-25 | 1 | -10/+13
* Merge pull request #86 from holdenk/master | Matei Zaharia | 2013-11-26 | 4 | -0/+451
      Add histogram functionality to DoubleRDDFunctions. This pull request adds histogram functionality to the DoubleRDDFunctions.
  * Fix the test | Holden Karau | 2013-11-25 | 2 | -5/+5
  * Add spaces | Holden Karau | 2013-11-18 | 1 | -0/+14
  * Remove explicit boxing | Holden Karau | 2013-11-18 | 1 | -2/+2
  * Remove extraneous type declarations | Holden Karau | 2013-10-21 | 1 | -2/+2
  * Remove extraneous type definitions from inside of tests | Holden Karau | 2013-10-21 | 1 | -86/+86
  * CR feedback | Holden Karau | 2013-10-21 | 3 | -101/+125
  * Add tests for the Java implementation. | Holden Karau | 2013-10-20 | 1 | -0/+14
  * Initial commit of adding histogram functionality to the DoubleRDDFunctions. | Holden Karau | 2013-10-19 | 3 | -0/+399
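A short usage sketch for the histogram API that #86 adds, assuming the two shapes that landed in DoubleRDDFunctions: histogram(numBuckets) returning evenly spaced bucket boundaries plus counts, and histogram(buckets) returning counts for caller-supplied boundaries. The local master and sample values are illustrative.

      import org.apache.spark.SparkContext
      import org.apache.spark.SparkContext._  // implicit conversion to DoubleRDDFunctions

      object HistogramSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local", "histogram-sketch")
          val data = sc.parallelize(Seq(1.0, 2.5, 3.0, 4.75, 9.0))

          // Evenly spaced buckets: returns the bucket boundaries and a count per bucket.
          val (boundaries, counts) = data.histogram(2)
          println(boundaries.mkString(", "))  // e.g. 1.0, 5.0, 9.0
          println(counts.mkString(", "))      // e.g. 4, 1

          // Caller-supplied boundaries: returns only the counts.
          println(data.histogram(Array(0.0, 5.0, 10.0)).mkString(", "))  // e.g. 4, 1

          sc.stop()
        }
      }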
* Merge pull request #204 from rxin/hash | Matei Zaharia | 2013-11-25 | 4 | -54/+103
      OpenHashSet fixes. Incorporated ideas from pull request #200.
      - Use Murmur Hash 3 finalization step to scramble the bits of HashCode instead of the simpler version in java.util.HashMap; the latter one had trouble with ranges of consecutive integers. Murmur Hash 3 is used by fastutil.
      - Don't check keys for equality when re-inserting due to growing the table; the keys will already be unique.
      - Remember the grow threshold instead of recomputing it on each insert.
      Also added unit tests for size estimation for specialized hash sets and maps.
  * Incorporated ideas from pull request #200. | Reynold Xin | 2013-11-25 | 1 | -50/+57
      - Use Murmur Hash 3 finalization step to scramble the bits of HashCode instead of the simpler version in java.util.HashMap; the latter one had trouble with ranges of consecutive integers. Murmur Hash 3 is used by fastutil.
      - Don't check keys for equality when re-inserting due to growing the table; the keys will already be unique.
      - Remember the grow threshold instead of recomputing it on each insert.
  * Added unit tests for size estimation for specialized hash sets and maps. | Reynold Xin | 2013-11-25 | 3 | -4/+46
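For reference, the Murmur Hash 3 finalization ("fmix") step mentioned in the #204 entries scrambles the raw hashCode so runs of consecutive integers spread across the table. The sketch below is the standard 32-bit fmix sequence, not a copy of Spark's OpenHashSet code.

      object MurmurFinalizerSketch {
        // Standard 32-bit MurmurHash3 finalization: mixes high bits into low bits
        // so consecutive inputs do not land in consecutive hash table slots.
        def fmix(hashCode: Int): Int = {
          var h = hashCode
          h ^= h >>> 16
          h *= 0x85ebca6b
          h ^= h >>> 13
          h *= 0xc2b2ae35
          h ^= h >>> 16
          h
        }

        def main(args: Array[String]): Unit = {
          // Consecutive integers produce widely scattered finalized hashes.
          (1 to 5).foreach(i => println(s"$i -> ${fmix(i)}"))
        }
      }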
* Merge pull request #201 from rxin/mappartitions | Matei Zaharia | 2013-11-25 | 4 | -70/+22
      Use the proper partition index in mapPartitionsWithIndex. mapPartitionsWithIndex uses TaskContext.partitionId as the partition index. TaskContext.partitionId used to be identical to the partition index in an RDD. However, pull request #186 introduced a scenario (with partition pruning) in which the two can be different. This pull request uses the right partition index in all mapPartitionsWithIndex related calls.
      Also removed the extra MapPartitionsWithContextRDD and put all the mapPartitions related functionality in MapPartitionsRDD.
  * Consolidated both mapPartitions related RDDs into a single MapPartitionsRDD. | Reynold Xin | 2013-11-24 | 4 | -70/+22
      Also changed the semantics of the index parameter in mapPartitionsWithIndex from the partition index of the output partition to the partition index in the current RDD.
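A short usage sketch of mapPartitionsWithIndex, the call whose index semantics #201 pins down; after this change the index passed to the closure is the partition's index in the current RDD. The local master, partition count, and sample data are illustrative.

      import org.apache.spark.SparkContext

      object MapPartitionsWithIndexSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local", "map-partitions-with-index")

          // Tag each element with the index of the partition it came from.
          val tagged = sc.parallelize(1 to 6, numSlices = 3).mapPartitionsWithIndex {
            (partitionIndex, elements) => elements.map(x => (partitionIndex, x))
          }
          tagged.collect().foreach(println)  // e.g. (0,1), (0,2), (1,3), (1,4), (2,5), (2,6)

          sc.stop()
        }
      }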
* Merge pull request #101 from colorant/yarn-client-scheduler | Matei Zaharia | 2013-11-25 | 1 | -0/+25
      For SPARK-527, support spark-shell when running on YARN (synced to trunk and resubmitted here).
      In the current YARN mode, the application runs in the Application Master as a user program, so the whole SparkContext lives remotely. That approach cannot support applications that involve local interaction and need to run where they are launched.
      So in this pull request I have added a YarnClientClusterScheduler and backend. With this scheduler, the user application is launched locally, while the executors are launched by YARN on remote nodes through a thin AM that only launches executors and monitors the Driver Actor status, so that when the client app is done it can finish the YARN application as well.
      This enables spark-shell to run on YARN. It also enables other Spark applications to run their SparkContext locally with the master URL "yarn-client". Thus, e.g., SparkPi can print its result on the local console instead of in the log of the remote machine where the AM is running. Docs are also updated to show how to use this yarn-client mode.
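A minimal sketch of the yarn-client usage described in #101: the driver, and therefore the SparkContext, runs locally while executors are launched by YARN. The app name, workload, and two-argument SparkContext constructor are assumptions for illustration.

      import org.apache.spark.SparkContext

      object YarnClientModeSketch {
        def main(args: Array[String]): Unit = {
          // "yarn-client" keeps the driver local; YARN supplies the executors.
          val sc = new SparkContext("yarn-client", "yarn-client-demo")
          // The result prints on the local console rather than in the remote AM's log.
          println(sc.parallelize(1 to 1000).map(_ * 2).sum())
          sc.stop()
        }
      }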