path: root/core
Commit message    Author    Age    Files    Lines
* Merge branch 'apache-master' into transform    Tathagata Das    2013-10-25    6    -0/+87
|\
| * Spacing fix    Patrick Wendell    2013-10-24    1    -2/+2
| |
| * Adding Java versions and associated tests    Patrick Wendell    2013-10-24    5    -1/+55
| |
| * Adding tests    Patrick Wendell    2013-10-24    1    -13/+0
| |
| * Always use a shuffle    Patrick Wendell    2013-10-24    1    -15/+7
| |
| * Add a `repartition` operator.    Patrick Wendell    2013-10-24    2    -0/+54
| |
| |     This patch adds an operator called repartition with more straightforward semantics
| |     than the current `coalesce` operator. There are a few use cases where this operator
| |     is useful:
| |
| |     1. If a user wants to increase the number of partitions in the RDD. This is more
| |        common now with streaming, e.g. a user is ingesting data on one node but wants
| |        to add more partitions to ensure parallelism of subsequent operations across
| |        threads or the cluster. Right now they have to call
| |        rdd.coalesce(numSplits, shuffle=true) - that's super confusing.
| |     2. If a user has input data where the number of partitions is not known, e.g.
| |        sc.textFile("some file").coalesce(50).... This is semantically vague (am I
| |        growing or shrinking this RDD?) and may not work correctly if the base RDD has
| |        fewer than 50 partitions.
| |
| |     The new operator forces a shuffle every time, so it will always produce exactly
| |     the requested number of partitions. It also throws an exception rather than
| |     silently not working if a bad input is passed. I am currently adding streaming
| |     tests (this requires refactoring some of the test suite to allow testing at
| |     partition granularity), so this is not ready for merge yet. But feedback is welcome.
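
    A rough usage sketch of the two routes described above, assuming a live SparkContext
    named sc and a made-up input path (this is illustration, not code from the patch):

        // Before this patch: growing the partition count means knowing about the shuffle flag.
        val grown = sc.textFile("some file").coalesce(50, shuffle = true)

        // With this patch: repartition always shuffles and always yields exactly 50 partitions.
        val repartitioned = sc.textFile("some file").repartition(50)

        // Shrinking without a shuffle remains coalesce's job.
        val fewer = repartitioned.coalesce(10)
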
* | Merge branch 'apache-master' into transform    Tathagata Das    2013-10-24    13    -8/+164
|\|
| * Merge pull request #93 from kayousterhout/ui_new_state    Matei Zaharia    2013-10-23    10    -8/+130
| |\
| | |     Show "GETTING_RESULTS" state in UI.
| | |
| | |     This commit adds a set of calls using the SparkListener interface that indicate
| | |     when a task is remotely fetching results, so that we can display this
| | |     (potentially time-consuming) phase of execution to users through the UI.
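
    A minimal sketch of how such SparkListener calls can be consumed. The callback and
    event names below (onTaskGettingResult, SparkListenerTaskGettingResult) and the
    sc.addSparkListener wiring follow later Spark releases and are assumptions here, not
    necessarily the exact API added by this pull request:

        import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskGettingResult}

        // Logs whenever a task enters the "getting result" phase surfaced by this change.
        class ResultFetchLogger extends SparkListener {
          override def onTaskGettingResult(event: SparkListenerTaskGettingResult) {
            println("Task " + event.taskInfo.taskId + " is fetching its result remotely")
          }
        }

        // Hypothetical wiring in user code:
        // sc.addSparkListener(new ResultFetchLogger())
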
| | * Clear akka frame size property in tests    Kay Ousterhout    2013-10-23    1    -2/+6
| | |
| | * Fixed broken tests    Kay Ousterhout    2013-10-23    1    -4/+3
| | |
| | * Merge remote-tracking branch 'upstream/master' into ui_new_state    Kay Ousterhout    2013-10-23    44    -668/+854
| | |\
| | | |     Conflicts:
| | | |       core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
| | * | Shorten GETTING_RESULT to GET_RESULT    Kay Ousterhout    2013-10-22    1    -1/+1
| | | |
| | * | Show "GETTING_RESULTS" state in UI.Kay Ousterhout2013-10-2110-5/+124
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds a set of calls using the SparkListener interface that indicate when a task is remotely fetching results, so that we can display this (potentially time-consuming) phase of execution to users through the UI.
| * | | Add unpersist() to JavaDoubleRDD and JavaPairRDD.    Josh Rosen    2013-10-23    3    -0/+34
| | |/
| |/|
| | |     Also add support for new optional `blocking` argument.
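
    For context, a small sketch of the Scala-side call these Java wrappers mirror,
    assuming the RDD.unpersist(blocking) overload described above (the dataset is made up):

        val cached = sc.parallelize(1 to 1000).map(_ * 2).cache()
        cached.count()                       // materialize the cached blocks
        cached.unpersist(blocking = false)   // optional flag: return without waiting for block removal
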
* | | Added JavaStreamingContext.transform    Tathagata Das    2013-10-24    1    -0/+11
| | |
* | | Removed Function3.call() based on Josh's comment.    Tathagata Das    2013-10-23    1    -2/+0
| | |
* | | Merge branch 'apache-master' into transform    Tathagata Das    2013-10-22    57    -3105/+1565
|\| |
| * | Remove redundant Java Function call() definitions    Josh Rosen    2013-10-22    8    -32/+4
| | |
| | |     This should fix SPARK-902, an issue where some Java API Function classes could
| | |     cause AbstractMethodErrors when user code is compiled using the Eclipse compiler.
| | |     Thanks to @MartinWeindel for diagnosing this problem. (This PR subsumes / closes #30)
| * | Formatting cleanup    Patrick Wendell    2013-10-22    1    -3/+5
| | |
| * | Minor clean-up in review    Patrick Wendell    2013-10-22    3    -8/+3
| | |
| * | Response to code review and adding some more tests    Patrick Wendell    2013-10-22    10    -74/+107
| | |
| * | Fix for Spark-870.    Patrick Wendell    2013-10-22    7    -13/+17
| | |
| | |     This patch fixes a bug where the Spark UI didn't display the correct number of
| | |     total tasks if the number of tasks in a Stage doesn't equal the number of RDD
| | |     partitions. It also cleans up the listener API a bit by embedding this
| | |     information in the StageInfo class rather than passing it separately.
| * | SPARK-940: Do not directly pass Stage objects to SparkListener.    Patrick Wendell    2013-10-22    9    -59/+75
| | |
| * | Merge pull request #82 from JoshRosen/map-output-tracker-refactoring    Matei Zaharia    2013-10-22    5    -98/+109
| |\ \
| | | |     Split MapOutputTracker into Master/Worker classes
| | | |
| | | |     Previously, MapOutputTracker contained fields and methods that were only
| | | |     applicable to the master or worker instances. This commit introduces a
| | | |     MasterMapOutputTracker class to prevent the master-specific methods from being
| | | |     accessed on workers. I also renamed a few methods and made others
| | | |     protected/private.
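
    An illustrative-only sketch of the split described above, with simplified members
    rather than the real tracker API: shared lookup logic stays in the base class, while
    registration is reachable only through the master-side subclass.

        import scala.collection.mutable

        class MapOutputTracker {
          protected val mapStatuses = new mutable.HashMap[Int, Array[String]]

          // Both workers and the master need to look up shuffle output locations.
          def getLocations(shuffleId: Int): Array[String] =
            mapStatuses.getOrElse(shuffleId, Array.empty[String])
        }

        class MasterMapOutputTracker extends MapOutputTracker {
          // Master-only: workers never hold this type, so they cannot register outputs.
          def registerMapOutputs(shuffleId: Int, locations: Array[String]) {
            mapStatuses(shuffleId) = locations
          }
        }
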
| | * | Unwrap a long line that actually fits.    Josh Rosen    2013-10-20    1    -2/+1
| | | |
| | * | Fix test failures in local mode due to updateEpoch    Josh Rosen    2013-10-20    1    -4/+1
| | | |
| | * | Split MapOutputTracker into Master/Worker classes.    Josh Rosen    2013-10-19    5    -98/+113
| | | |
| | | |     Previously, MapOutputTracker contained fields and methods that were only
| | | |     applicable to the master or worker instances. This commit introduces a
| | | |     MasterMapOutputTracker class to prevent the master-specific methods from being
| | | |     accessed on workers. I also renamed a few methods and made others
| | | |     protected/private.
| * | | Merge ShufflePerfTester patch into shuffle block consolidation    Aaron Davidson    2013-10-21    16    -446/+491
| |\ \ \
| | * \ \ Merge pull request #95 from aarondav/perftest    Reynold Xin    2013-10-21    1    -0/+0
| | |\ \ \
| | | | | |     Minor: Put StoragePerfTester in org/apache/
| | | * | | Put StoragePerfTester in org/apache/    Aaron Davidson    2013-10-21    1    -0/+0
| | | | | |
| | * | | | Fix mesos urls    Aaron Davidson    2013-10-21    1    -2/+2
| | |/ / /
| | | | |     This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71
| | | | |     Previously, we explicitly removed the mesos:// part; with PR 71, this no
| | | | |     longer occurred.
| | * | | Merge pull request #88 from rxin/clean    Patrick Wendell    2013-10-21    7    -101/+113
| | |\ \ \
| | | |_|/
| | |/| |
| | | | |     Made the following traits/interfaces/classes non-public:
| | | | |
| | | | |     Made the following traits/interfaces/classes non-public:
| | | | |     SparkHadoopWriter
| | | | |     SparkHadoopMapRedUtil
| | | | |     SparkHadoopMapReduceUtil
| | | | |     SparkHadoopUtil
| | | | |     PythonAccumulatorParam
| | | | |     BlockManagerSlaveActor
| | | * | Made JobLogger public again and some minor cleanup.    Reynold Xin    2013-10-20    1    -68/+54
| | | | |
| | | * | Made the following traits/interfaces/classes non-public:    Reynold Xin    2013-10-20    7    -39/+65
| | | | |
| | | | |     SparkHadoopWriter
| | | | |     SparkHadoopMapRedUtil
| | | | |     SparkHadoopMapReduceUtil
| | | | |     SparkHadoopUtil
| | | | |     PythonAccumulatorParam
| | | | |     JobLogger
| | | | |     BlockManagerSlaveActor
| | * | | Merge pull request #41 from pwendell/shuffle-benchmark    Patrick Wendell    2013-10-20    6    -4/+141
| | |\ \ \
| | | | | |     Provide Instrumentation for Shuffle Write Performance
| | | | | |
| | | | | |     Shuffle write performance can have a major impact on the performance of jobs.
| | | | | |     This patch adds a few pieces of instrumentation related to shuffle writes.
| | | | | |     They are:
| | | | | |
| | | | | |     1. A listing of the time spent performing blocking writes for each task. This
| | | | | |        is implemented by keeping track of the aggregate delay seen by many
| | | | | |        individual writes.
| | | | | |     2. An undocumented option `spark.shuffle.sync` which forces shuffle data to
| | | | | |        sync to disk. This is necessary for measuring shuffle performance in the
| | | | | |        absence of the OS buffer cache.
| | | | | |     3. An internal utility which micro-benchmarks write throughput for simulated
| | | | | |        shuffle outputs.
| | | | | |
| | | | | |     I'm going to do some performance testing on this to see whether these small
| | | | | |     timing calls add overhead. From a feature perspective, however, I consider
| | | | | |     this complete. Any feedback is appreciated.
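
    A small sketch of turning on the undocumented spark.shuffle.sync option named in
    point 2 above, using the system-property configuration style of this era; the master
    URL and application name are placeholders:

        // Force shuffle writes to sync to disk so the OS buffer cache doesn't mask write cost.
        System.setProperty("spark.shuffle.sync", "true")
        val sc = new org.apache.spark.SparkContext("local[4]", "shuffle-sync-benchmark")
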
| | | * | | Making the timing block more narrow for the sync    Patrick Wendell    2013-10-07    1    -2/+2
| | | | | |
| | | * | | Minor cleanup    Patrick Wendell    2013-10-07    3    -66/+45
| | | | | |
| | | * | | Perf benchmark    Patrick Wendell    2013-10-07    1    -0/+89
| | | | | |
| | | * | | Trying new approach with writes    Patrick Wendell    2013-10-07    2    -7/+7
| | | | | |
| | | * | | Adding option to force sync to the filesystem    Patrick Wendell    2013-10-07    1    -3/+17
| | | | | |
| | | * | | Track and report write throughput for shuffle tasks.    Patrick Wendell    2013-10-07    5    -2/+57
| | | | | |
| | * | | | Merge pull request #89 from rxin/executor    Reynold Xin    2013-10-20    1    -20/+23
| | |\ \ \ \
| | | | | | |     Don't setup the uncaught exception handler in local mode.
| | | | | | |
| | | | | | |     This avoids unit test failures for Spark streaming.
| | | | | | |
| | | | | | |     java.util.concurrent.RejectedExecutionException: Task
| | | | | | |       org.apache.spark.streaming.JobManager$JobHandler@38cf728d rejected from
| | | | | | |       java.util.concurrent.ThreadPoolExecutor@3b69a41e[Terminated, pool size = 0,
| | | | | | |       active threads = 0, queued tasks = 0, completed tasks = 14]
| | | | | | |       at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
| | | | | | |       at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
| | | | | | |       at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
| | | | | | |       at org.apache.spark.streaming.JobManager.runJob(JobManager.scala:54)
| | | | | | |       at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108)
| | | | | | |       at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108)
| | | | | | |       at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
| | | | | | |       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
| | | | | | |       at org.apache.spark.streaming.Scheduler.generateJobs(Scheduler.scala:108)
| | | | | | |       at org.apache.spark.streaming.Scheduler$$anonfun$1.apply$mcVJ$sp(Scheduler.scala:41)
| | | | | | |       at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:66)
| | | | | | |       at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:34)
| | | * | | | Don't setup the uncaught exception handler in local mode.    Reynold Xin    2013-10-20    1    -20/+23
| | | | |/ /
| | | |/| |
| | | | | |     This avoids unit test failures for Spark streaming.
| | * | | | Merge pull request #75 from JoshRosen/block-manager-cleanup    Matei Zaharia    2013-10-20    1    -292/+166
| | |\ \ \ \
| | | | | | |     Code de-duplication in BlockManager
| | | | | | |
| | | | | | |     The BlockManager has a few methods that duplicate most of their code. This
| | | | | | |     pull request extracts the duplicated code into private doPut(), doGetLocal(),
| | | | | | |     and doGetRemote() methods that unify the storing/reading of bytes or objects.
| | | | | | |     I believe that I preserved the logic of the original code, but I'd appreciate
| | | | | | |     some help in reviewing this.
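
    An illustrative-only sketch of the de-duplication pattern described above, using
    simplified types rather than the real BlockManager internals: the shared lookup lives
    in one private helper and the public byte/object variants become thin wrappers.

        import scala.collection.mutable

        class TinyBlockStore {
          private val blocks = new mutable.HashMap[String, Array[Byte]]

          def putBytes(blockId: String, bytes: Array[Byte]) { blocks(blockId) = bytes }

          // Shared path: fetch once, then decide how the caller wants the result.
          private def doGetLocal(blockId: String, asBytes: Boolean): Option[Any] =
            blocks.get(blockId).map { bytes =>
              if (asBytes) bytes else new String(bytes, "UTF-8") // stand-in for deserialization
            }

          def getLocal(blockId: String): Option[Any] = doGetLocal(blockId, asBytes = false)

          def getLocalBytes(blockId: String): Option[Array[Byte]] =
            doGetLocal(blockId, asBytes = true).map(_.asInstanceOf[Array[Byte]])
        }
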
| | | * | | | Minor cleanup based on @aarondav's code review.    Josh Rosen    2013-10-20    1    -11/+5
| | | | | | |
| | | * | | | De-duplicate code in dropOld[Non]BroadcastBlocks.    Josh Rosen    2013-10-19    1    -20/+6
| | | | | | |
| | | * | | | Code de-duplication in put() and putBytes().    Josh Rosen    2013-10-19    1    -140/+89
| | | | | | |
| | | * | | | De-duplication in getRemote() and getRemoteBytes().    Josh Rosen    2013-10-19    1    -32/+18
| | | | | | |
| | | * | | | De-duplication in getLocal() and getLocalBytes().    Josh Rosen    2013-10-19    1    -100/+59
| | | | |_|/
| | | |/| |
| | * | | | Merge pull request #84 from rxin/kill1    Reynold Xin    2013-10-20    1    -21/+38
| | |\ \ \ \
| | | |_|/ /
| | |/| | |
| | | | | |     Added documentation for setJobGroup. Also some minor cleanup in SparkContext.
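
    A short usage sketch of the setJobGroup API this PR documents; the group id and
    description are made up, and the cancellation call in the comment assumes the job-
    killing support this branch (rxin/kill1) was building toward.

        // Tag everything submitted from this thread so related jobs can be found (and
        // cancelled) as a group.
        sc.setJobGroup("nightly-etl", "Nightly ETL aggregation")
        val total = sc.parallelize(1 to 1000000).map(_.toLong).sum()
        // Later, from another thread: sc.cancelJobGroup("nightly-etl")
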