path: root/core
Commit message    Author    Age    Files    Lines
* Merge branch 'apache-master' into transform    Tathagata Das    2013-10-25    6    -0/+87
|\
| * Spacing fix    Patrick Wendell    2013-10-24    1    -2/+2
| |
| * Adding Java versions and associated tests    Patrick Wendell    2013-10-24    5    -1/+55
| |
| * Adding tests    Patrick Wendell    2013-10-24    1    -13/+0
| |
| * Always use a shuffle    Patrick Wendell    2013-10-24    1    -15/+7
| |
| * Add a `repartition` operator.    Patrick Wendell    2013-10-24    2    -0/+54
| |
| |     This patch adds an operator called repartition with more straightforward semantics
| |     than the current `coalesce` operator. There are a few use cases where this operator
| |     is useful:
| |
| |     1. If a user wants to increase the number of partitions in the RDD. This is more
| |        common now with streaming, e.g. a user is ingesting data on one node but wants
| |        to add more partitions to ensure parallelism of subsequent operations across
| |        threads or the cluster. Right now they have to call
| |        rdd.coalesce(numSplits, shuffle=true) - that's super confusing.
| |     2. If a user has input data where the number of partitions is not known, e.g.
| |        sc.textFile("some file").coalesce(50).... This is semantically vague (am I
| |        growing or shrinking this RDD?) and may not work correctly if the base RDD has
| |        fewer than 50 partitions.
| |
| |     The new operator forces a shuffle every time, so it will always produce exactly
| |     the requested number of partitions. It also throws an exception rather than
| |     silently not working if a bad input is passed. I am currently adding streaming
| |     tests (this requires refactoring some of the test suite to allow testing at
| |     partition granularity), so this is not ready for merge yet. But feedback is welcome.
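
    A rough usage sketch of the two routes described above, assuming a live SparkContext
    named sc and a made-up input path (this is illustration, not code from the patch):

        // Before this patch: growing the partition count means knowing about the shuffle flag.
        val grown = sc.textFile("some file").coalesce(50, shuffle = true)

        // With this patch: repartition always shuffles and always yields exactly 50 partitions.
        val repartitioned = sc.textFile("some file").repartition(50)

        // Shrinking without a shuffle remains coalesce's job.
        val fewer = repartitioned.coalesce(10)
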
* | Merge branch 'apache-master' into transform    Tathagata Das    2013-10-24    13    -8/+164
|\|
| * Merge pull request #93 from kayousterhout/ui_new_state    Matei Zaharia    2013-10-23    10    -8/+130
| |\
| | |     Show "GETTING_RESULTS" state in UI.
| | |
| | |     This commit adds a set of calls using the SparkListener interface that indicate
| | |     when a task is remotely fetching results, so that we can display this
| | |     (potentially time-consuming) phase of execution to users through the UI.
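
    A minimal sketch of how such SparkListener calls can be consumed. The callback and
    event names below (onTaskGettingResult, SparkListenerTaskGettingResult) and the
    sc.addSparkListener wiring follow later Spark releases and are assumptions here, not
    necessarily the exact API added by this pull request:

        import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskGettingResult}

        // Logs whenever a task enters the "getting result" phase surfaced by this change.
        class ResultFetchLogger extends SparkListener {
          override def onTaskGettingResult(event: SparkListenerTaskGettingResult) {
            println("Task " + event.taskInfo.taskId + " is fetching its result remotely")
          }
        }

        // Hypothetical wiring in user code:
        // sc.addSparkListener(new ResultFetchLogger())
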
| | * Clear akka frame size property in tests    Kay Ousterhout    2013-10-23    1    -2/+6
| | |
| | * Fixed broken tests    Kay Ousterhout    2013-10-23    1    -4/+3
| | |
| | * Merge remote-tracking branch 'upstream/master' into ui_new_state    Kay Ousterhout    2013-10-23    44    -668/+854
| | |\
| | | |     Conflicts:
| | | |       core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
| | * | Shorten GETTING_RESULT to GET_RESULT    Kay Ousterhout    2013-10-22    1    -1/+1
| | | |
| | * | Show "GETTING_RESULTS" state in UI.Kay Ousterhout2013-10-2110-5/+124
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds a set of calls using the SparkListener interface that indicate when a task is remotely fetching results, so that we can display this (potentially time-consuming) phase of execution to users through the UI.
| * | | Add unpersist() to JavaDoubleRDD and JavaPairRDD.    Josh Rosen    2013-10-23    3    -0/+34
| | |/
| |/|
| | |     Also add support for new optional `blocking` argument.
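
    For context, a small sketch of the Scala-side call these Java wrappers mirror,
    assuming the RDD.unpersist(blocking) overload described above (the dataset is made up):

        val cached = sc.parallelize(1 to 1000).map(_ * 2).cache()
        cached.count()                       // materialize the cached blocks
        cached.unpersist(blocking = false)   // optional flag: return without waiting for block removal
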
* | | Added JavaStreamingContext.transform    Tathagata Das    2013-10-24    1    -0/+11
| | |
* | | Removed Function3.call() based on Josh's comment.    Tathagata Das    2013-10-23    1    -2/+0
| | |
* | | Merge branch 'apache-master' into transform    Tathagata Das    2013-10-22    57    -3105/+1565
|\| |
| * | Remove redundant Java Function call() definitions    Josh Rosen    2013-10-22    8    -32/+4
| | |
| | |     This should fix SPARK-902, an issue where some Java API Function classes could
| | |     cause AbstractMethodErrors when user code is compiled using the Eclipse compiler.
| | |     Thanks to @MartinWeindel for diagnosing this problem. (This PR subsumes / closes #30)
| * | Formatting cleanup    Patrick Wendell    2013-10-22    1    -3/+5
| | |
| * | Minor clean-up in review    Patrick Wendell    2013-10-22    3    -8/+3
| | |
| * | Response to code review and adding some more tests    Patrick Wendell    2013-10-22    10    -74/+107
| | |
| * | Fix for Spark-870.    Patrick Wendell    2013-10-22    7    -13/+17
| | |
| | |     This patch fixes a bug where the Spark UI didn't display the correct number of
| | |     total tasks if the number of tasks in a Stage doesn't equal the number of RDD
| | |     partitions. It also cleans up the listener API a bit by embedding this
| | |     information in the StageInfo class rather than passing it separately.
| * | SPARK-940: Do not directly pass Stage objects to SparkListener.    Patrick Wendell    2013-10-22    9    -59/+75
| | |
| * | Merge pull request #82 from JoshRosen/map-output-tracker-refactoring    Matei Zaharia    2013-10-22    5    -98/+109
| |\ \
| | | |     Split MapOutputTracker into Master/Worker classes
| | | |
| | | |     Previously, MapOutputTracker contained fields and methods that were only
| | | |     applicable to the master or worker instances. This commit introduces a
| | | |     MasterMapOutputTracker class to prevent the master-specific methods from being
| | | |     accessed on workers. I also renamed a few methods and made others
| | | |     protected/private.
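
    An illustrative-only sketch of the split described above, with simplified members
    rather than the real tracker API: shared lookup logic stays in the base class, while
    registration is reachable only through the master-side subclass.

        import scala.collection.mutable

        class MapOutputTracker {
          protected val mapStatuses = new mutable.HashMap[Int, Array[String]]

          // Both workers and the master need to look up shuffle output locations.
          def getLocations(shuffleId: Int): Array[String] =
            mapStatuses.getOrElse(shuffleId, Array.empty[String])
        }

        class MasterMapOutputTracker extends MapOutputTracker {
          // Master-only: workers never hold this type, so they cannot register outputs.
          def registerMapOutputs(shuffleId: Int, locations: Array[String]) {
            mapStatuses(shuffleId) = locations
          }
        }
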
| | * | Unwrap a long line that actually fits.    Josh Rosen    2013-10-20    1    -2/+1
| | | |
| | * | Fix test failures in local mode due to updateEpoch    Josh Rosen    2013-10-20    1    -4/+1
| | | |
| | * | Split MapOutputTracker into Master/Worker classes.    Josh Rosen    2013-10-19    5    -98/+113
| | | |
| | | |     Previously, MapOutputTracker contained fields and methods that were only
| | | |     applicable to the master or worker instances. This commit introduces a
| | | |     MasterMapOutputTracker class to prevent the master-specific methods from being
| | | |     accessed on workers. I also renamed a few methods and made others
| | | |     protected/private.
| * | | Merge ShufflePerfTester patch into shuffle block consolidation    Aaron Davidson    2013-10-21    16    -446/+491
| |\ \ \
| | * \ \ Merge pull request #95 from aarondav/perftest    Reynold Xin    2013-10-21    1    -0/+0
| | |\ \ \
| | | | | |     Minor: Put StoragePerfTester in org/apache/
| | | * | | Put StoragePerfTester in org/apache/    Aaron Davidson    2013-10-21    1    -0/+0
| | | | | |
| | * | | | Fix mesos urls    Aaron Davidson    2013-10-21    1    -2/+2
| | |/ / /
| | | | |     This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71
| | | | |     Previously, we explicitly removed the mesos:// part; with PR 71, this no
| | | | |     longer occurred.
| | * | | Merge pull request #88 from rxin/clean    Patrick Wendell    2013-10-21    7    -101/+113
| | |\ \ \
| | | |_|/
| | |/| |
| | | | |     Made the following traits/interfaces/classes non-public:
| | | | |
| | | | |     Made the following traits/interfaces/classes non-public:
| | | | |     SparkHadoopWriter
| | | | |     SparkHadoopMapRedUtil
| | | | |     SparkHadoopMapReduceUtil
| | | | |     SparkHadoopUtil
| | | | |     PythonAccumulatorParam
| | | | |     BlockManagerSlaveActor
| | | * | Made JobLogger public again and some minor cleanup.    Reynold Xin    2013-10-20    1    -68/+54
| | | | |
| | | * | Made the following traits/interfaces/classes non-public:    Reynold Xin    2013-10-20    7    -39/+65
| | | | |
| | | | |     SparkHadoopWriter
| | | | |     SparkHadoopMapRedUtil
| | | | |     SparkHadoopMapReduceUtil
| | | | |     SparkHadoopUtil
| | | | |     PythonAccumulatorParam
| | | | |     JobLogger
| | | | |     BlockManagerSlaveActor
| | * | | Merge pull request #41 from pwendell/shuffle-benchmark    Patrick Wendell    2013-10-20    6    -4/+141
| | |\ \ \
| | | | | |     Provide Instrumentation for Shuffle Write Performance
| | | | | |
| | | | | |     Shuffle write performance can have a major impact on the performance of jobs.
| | | | | |     This patch adds a few pieces of instrumentation related to shuffle writes.
| | | | | |     They are:
| | | | | |
| | | | | |     1. A listing of the time spent performing blocking writes for each task. This
| | | | | |        is implemented by keeping track of the aggregate delay seen by many
| | | | | |        individual writes.
| | | | | |     2. An undocumented option `spark.shuffle.sync` which forces shuffle data to
| | | | | |        sync to disk. This is necessary for measuring shuffle performance in the
| | | | | |        absence of the OS buffer cache.
| | | | | |     3. An internal utility which micro-benchmarks write throughput for simulated
| | | | | |        shuffle outputs.
| | | | | |
| | | | | |     I'm going to do some performance testing on this to see whether these small
| | | | | |     timing calls add overhead. From a feature perspective, however, I consider
| | | | | |     this complete. Any feedback is appreciated.
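
    A small sketch of turning on the undocumented spark.shuffle.sync option named in
    point 2 above, using the system-property configuration style of this era; the master
    URL and application name are placeholders:

        // Force shuffle writes to sync to disk so the OS buffer cache doesn't mask write cost.
        System.setProperty("spark.shuffle.sync", "true")
        val sc = new org.apache.spark.SparkContext("local[4]", "shuffle-sync-benchmark")
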
| | | * | | Making the timing block more narrow for the sync    Patrick Wendell    2013-10-07    1    -2/+2
| | | | | |
| | | * | | Minor cleanup    Patrick Wendell    2013-10-07    3    -66/+45
| | | | | |
| | | * | | Perf benchmark    Patrick Wendell    2013-10-07    1    -0/+89
| | | | | |
| | | * | | Trying new approach with writes    Patrick Wendell    2013-10-07    2    -7/+7
| | | | | |
| | | * | | Adding option to force sync to the filesystem    Patrick Wendell    2013-10-07    1    -3/+17
| | | | | |
| | | * | | Track and report write throughput for shuffle tasks.    Patrick Wendell    2013-10-07    5    -2/+57
| | | | | |
| | * | | | Merge pull request #89 from rxin/executor    Reynold Xin    2013-10-20    1    -20/+23
| | |\ \ \ \
| | | | | | |     Don't setup the uncaught exception handler in local mode.
| | | | | | |
| | | | | | |     This avoids unit test failures for Spark streaming.
| | | | | | |
| | | | | | |     java.util.concurrent.RejectedExecutionException: Task
| | | | | | |       org.apache.spark.streaming.JobManager$JobHandler@38cf728d rejected from
| | | | | | |       java.util.concurrent.ThreadPoolExecutor@3b69a41e[Terminated, pool size = 0,
| | | | | | |       active threads = 0, queued tasks = 0, completed tasks = 14]
| | | | | | |       at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
| | | | | | |       at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
| | | | | | |       at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
| | | | | | |       at org.apache.spark.streaming.JobManager.runJob(JobManager.scala:54)
| | | | | | |       at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108)
| | | | | | |       at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108)
| | | | | | |       at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
| | | | | | |       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
| | | | | | |       at org.apache.spark.streaming.Scheduler.generateJobs(Scheduler.scala:108)
| | | | | | |       at org.apache.spark.streaming.Scheduler$$anonfun$1.apply$mcVJ$sp(Scheduler.scala:41)
| | | | | | |       at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:66)
| | | | | | |       at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:34)
| | | * | | | Don't setup the uncaught exception handler in local mode.    Reynold Xin    2013-10-20    1    -20/+23
| | | | |/ /
| | | |/| |
| | | | | |     This avoids unit test failures for Spark streaming.
| | * | | | Merge pull request #75 from JoshRosen/block-manager-cleanup    Matei Zaharia    2013-10-20    1    -292/+166
| | |\ \ \ \
| | | | | | |     Code de-duplication in BlockManager
| | | | | | |
| | | | | | |     The BlockManager has a few methods that duplicate most of their code. This
| | | | | | |     pull request extracts the duplicated code into private doPut(), doGetLocal(),
| | | | | | |     and doGetRemote() methods that unify the storing/reading of bytes or objects.
| | | | | | |     I believe that I preserved the logic of the original code, but I'd appreciate
| | | | | | |     some help in reviewing this.
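
    An illustrative-only sketch of the de-duplication pattern described above, using
    simplified types rather than the real BlockManager internals: the shared lookup lives
    in one private helper and the public byte/object variants become thin wrappers.

        import scala.collection.mutable

        class TinyBlockStore {
          private val blocks = new mutable.HashMap[String, Array[Byte]]

          def putBytes(blockId: String, bytes: Array[Byte]) { blocks(blockId) = bytes }

          // Shared path: fetch once, then decide how the caller wants the result.
          private def doGetLocal(blockId: String, asBytes: Boolean): Option[Any] =
            blocks.get(blockId).map { bytes =>
              if (asBytes) bytes else new String(bytes, "UTF-8") // stand-in for deserialization
            }

          def getLocal(blockId: String): Option[Any] = doGetLocal(blockId, asBytes = false)

          def getLocalBytes(blockId: String): Option[Array[Byte]] =
            doGetLocal(blockId, asBytes = true).map(_.asInstanceOf[Array[Byte]])
        }
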
| | | * | | | Minor cleanup based on @aarondav's code review.    Josh Rosen    2013-10-20    1    -11/+5
| | | | | | |
| | | * | | | De-duplicate code in dropOld[Non]BroadcastBlocks.    Josh Rosen    2013-10-19    1    -20/+6
| | | | | | |
| | | * | | | Code de-duplication in put() and putBytes().    Josh Rosen    2013-10-19    1    -140/+89
| | | | | | |
| | | * | | | De-duplication in getRemote() and getRemoteBytes().    Josh Rosen    2013-10-19    1    -32/+18
| | | | | | |
| | | * | | | De-duplication in getLocal() and getLocalBytes().    Josh Rosen    2013-10-19    1    -100/+59
| | | | |_|/
| | | |/| |
| | * | | | Merge pull request #84 from rxin/kill1    Reynold Xin    2013-10-20    1    -21/+38
| | |\ \ \ \
| | | |_|/ /
| | |/| | |
| | | | | |     Added documentation for setJobGroup. Also some minor cleanup in SparkContext.
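
    A short usage sketch of the setJobGroup API this PR documents; the group id and
    description are made up, and the cancellation call in the comment assumes the job-
    killing support this branch (rxin/kill1) was building toward.

        // Tag everything submitted from this thread so related jobs can be found (and
        // cancelled) as a group.
        sc.setJobGroup("nightly-etl", "Nightly ETL aggregation")
        val total = sc.parallelize(1 to 1000000).map(_.toLong).sum()
        // Later, from another thread: sc.cancelJobGroup("nightly-etl")
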