path: root/core
Commit message  (Author, Age, Files, Lines)
* Merge pull request #320 from kayousterhout/erroneous_failed_msg  (Reynold Xin, 2014-01-02, 2 files, -12/+15)
|\
| |     Remove erroneous FAILED state for killed tasks. Currently, when tasks are killed, the Executor first sends a status update for the task with a "KILLED" state, and then sends a second status update with a "FAILED" state saying that the task failed due to an exception. The second FAILED state is misleading/unnecessary, and occurs due to a NonLocalReturnControl exception that gets thrown due to the way we kill tasks. This commit eliminates that problem. I'm not at all sure that this is the best way to fix this problem, so alternate suggestions welcome. @rxin guessing you're the right person to look at this.
| * Remove erroneous FAILED state for killed tasks.  (Kay Ousterhout, 2014-01-02, 2 files, -12/+15)
| |     Currently, when tasks are killed, the Executor first sends a status update for the task with a "KILLED" state, and then sends a second status update with a "FAILED" state saying that the task failed due to an exception. The second FAILED state is misleading/unnecessary, and occurs due to a NonLocalReturnControl exception that gets thrown due to the way we kill tasks. This commit eliminates that problem.
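The NonLocalReturnControl mentioned above is the control throwable scalac uses to implement a non-local `return` from inside a closure; when a task is interrupted it can escape the task body. A minimal, Spark-free sketch (hypothetical demo code, not the executor's actual handler) of recognising it by type instead of treating it as an ordinary failure:

```scala
import scala.runtime.NonLocalReturnControl

object NonLocalReturnDemo {
  // Simulate what scalac generates for a non-local `return` inside a
  // closure: it throws NonLocalReturnControl carrying the return value.
  private val marker = new AnyRef
  private def interruptedTaskBody(): Int =
    throw new NonLocalReturnControl[Int](marker, 0)

  // An executor-style runner must special-case the control throwable
  // rather than reporting it as an ordinary task failure.
  def run(): String =
    try {
      interruptedTaskBody()
      "finished"
    } catch {
      case _: NonLocalReturnControl[_] => "killed (not FAILED)"
      case _: Exception                => "FAILED"
    }
}
```

Note that NonLocalReturnControl extends ControlThrowable, not Exception, so a plain `catch { case e: Exception => ... }` would not even see it; the explicit case above is what makes the intent visible.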
* | Merge pull request #297 from tdas/window-improvement  (Patrick Wendell, 2014-01-02, 4 files, -157/+343)
|\ \
| | |   Improvements to DStream window ops and refactoring of Spark's CheckpointSuite
| | |   - Added a new RDD - PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each will be unified to a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD will be the most common preferred location of the corresponding partitions of the parent RDDs. For example, location of partition 0 of the unified RDD will be where most of partition 0 of the parent RDDs are located.
| | |   - Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both these operations work by doing per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads.
| | |   - Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD. I can go into greater detail if necessary.
| | |   - Added mapSideCombine option to combineByKeyAndWindow.
| * | Added Apache boilerplate and class docs to PartitionerAwareUnionRDD.  (Tathagata Das, 2013-12-26, 1 file, -3/+33)
| | |
| * | Merge branch 'apache-master' into window-improvement  (Tathagata Das, 2013-12-26, 28 files, -1522/+975)
| |\ \
| * \ \ Merge branch 'master' into window-improvement  (Tathagata Das, 2013-12-26, 30 files, -113/+395)
| |\ \ \
| * | | | Fixed bug in PartitionAwareUnionRDD  (Tathagata Das, 2013-12-26, 1 file, -6/+9)
| | | | |
| * | | | Added tests for PartitionerAwareUnionRDD in the CheckpointSuite. Refactored CheckpointSuite to make the tests simpler and more reliable. Added missing test for ZippedRDD.  (Tathagata Das, 2013-12-20, 3 files, -170/+231)
| | | | |
| * | | | Merge branch 'scheduler-update' into window-improvement  (Tathagata Das, 2013-12-19, 149 files, -1263/+2721)
| |\ \ \ \
| | | | |     Conflicts:
| | | | |         streaming/src/main/scala/org/apache/spark/streaming/dstream/WindowedDStream.scala
| * | | | | Added partitioner aware union, modified DStream.window.  (Tathagata Das, 2013-11-21, 2 files, -0/+92)
| | | | | |
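The partitioner-preserving union described in PR #297 can be illustrated without Spark. The toy types below are hypothetical (not Spark's RDD API); they model just the two properties the PR describes: partition i of the result concatenates partition i of every parent, and its preferred location is the most common location among the parents' i-th partitions.

```scala
object PartitionerAwareUnionSketch {
  // Model an "RDD" as its p partitions plus a preferred host per partition.
  case class ToyRdd[T](partitions: Vector[Vector[T]], hosts: Vector[String])

  // Union m co-partitioned parents into one result with the same p
  // partitions, avoiding any reshuffle of the data.
  def unionPreservingPartitions[T](rdds: Seq[ToyRdd[T]]): ToyRdd[T] = {
    val p = rdds.head.partitions.length
    require(rdds.forall(_.partitions.length == p), "parents must be co-partitioned")
    // Partition i of the union = concatenation of every parent's partition i.
    val parts = Vector.tabulate(p)(i => rdds.flatMap(_.partitions(i)).toVector)
    // Preferred host of partition i = most common host among the parents.
    val hosts = Vector.tabulate(p) { i =>
      rdds.map(_.hosts(i)).groupBy(identity).maxBy(_._2.size)._1
    }
    ToyRdd(parts, hosts)
  }
}
```

Because every parent's partition i already lives under the same partitioner, the union needs no shuffle, which is exactly why the windowed reduceByKey/groupByKey paths described above get faster.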
* | | | | | Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/incubator-spark  (Matei Zaharia, 2014-01-02, 2 files, -6/+1)
|\ \ \ \ \ \
| * | | | | | Removed redundant TaskSetManager.error() function.  (Kay Ousterhout, 2014-01-02, 2 files, -6/+1)
| | | | | |     This function was leftover from a while ago, and now just passes all calls through to the abort() function, so this commit deletes it.
* | | | | | Merge pull request #311 from tmyklebu/master  (Matei Zaharia, 2014-01-02, 3 files, -4/+38)
|\ \ \ \ \ \
| |     SPARK-991: Report information gleaned from a Python stacktrace in the UI
| |     Scala:
| |     - Added setCallSite/clearCallSite to SparkContext and JavaSparkContext. These functions mutate a LocalProperty called "externalCallSite."
| |     - Add a wrapper, getCallSite, that checks for an externalCallSite and, if none is found, calls the usual Utils.formatSparkCallSite.
| |     - Change everything that calls Utils.formatSparkCallSite to call getCallSite instead. Except getCallSite.
| |     - Add setCallSite/clearCallSite wrappers to JavaSparkContext.
| |     Python:
| |     - Add a gruesome hack to rdd.py that inspects the traceback and guesses what you want to see in the UI.
| |     - Add a RAII wrapper around said gruesome hack that calls setCallSite/clearCallSite as appropriate.
| |     - Wire said RAII wrapper up around three calls into the Scala code.
| |     I'm not sure that I hit all the spots with the RAII wrapper. I'm also not sure that my gruesome hack does exactly what we want. One could also approach this change by refactoring runJob/submitJob/runApproximateJob to take a call site, then threading that parameter through everything that needs to know it. One might object to the pointless-looking wrappers in JavaSparkContext. Unfortunately, I can't directly access the SparkContext from Python---or, if I can, I don't know how---so I need to wrap everything that matters in JavaSparkContext.
| |     Conflicts:
| |         core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala
| * | | | | Factor call site reporting out to SparkContext.  (Tor Myklebust, 2013-12-28, 3 files, -4/+38)
| | | | | |
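The setCallSite/clearCallSite/getCallSite trio described in SPARK-991 can be sketched as a thread-local override with a loan-pattern ("RAII") wrapper. This is a hypothetical stand-in, not Spark's actual SparkContext code; the names mirror the PR description:

```scala
object CallSiteSketch {
  // Stand-in for the "externalCallSite" local property: a per-thread
  // override consulted before falling back to the default formatter.
  private val externalCallSite = new ThreadLocal[String]

  def setCallSite(site: String): Unit = externalCallSite.set(site)
  def clearCallSite(): Unit = externalCallSite.remove()

  // Mirrors the described getCallSite: use the external call site if one
  // was set, otherwise compute the usual (e.g. Scala-side) call site.
  def getCallSite(default: => String): String =
    Option(externalCallSite.get).getOrElse(default)

  // Loan-pattern wrapper: the call site is set only for the duration of
  // `body` and is always cleared, even if `body` throws.
  def withCallSite[T](site: String)(body: => T): T = {
    setCallSite(site)
    try body finally clearCallSite()
  }
}
```

The try/finally is what the "RAII wrapper" in the Python layer corresponds to: the external call site never leaks past the job submission it was set for.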
* | | | | | Fixed two uses of conf.get with no default value in Mesos  (Matei Zaharia, 2014-01-01, 2 files, -2/+2)
| | | | | |
* | | | | | Miscellaneous fixes from code review.  (Matei Zaharia, 2014-01-01, 44 files, -174/+195)
| | | | | |     Also replaced SparkConf.getOrElse with just a "get" that takes a default value, and added getInt, getLong, etc. to make code that uses this simpler later on.
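The get-with-default pattern replacing getOrElse, plus the typed getInt/getLong helpers, can be sketched as follows (ToyConf is a hypothetical stand-in, not Spark's SparkConf):

```scala
class ToyConf {
  private val settings = scala.collection.mutable.Map[String, String]()

  def set(key: String, value: String): this.type = { settings(key) = value; this }

  // No-default variant: missing keys are an error.
  def get(key: String): String =
    settings.getOrElse(key, throw new NoSuchElementException(key))

  // Overload taking a default value, replacing the old getOrElse.
  def get(key: String, defaultValue: String): String =
    settings.getOrElse(key, defaultValue)

  // Typed helpers layered on top, so call sites avoid manual parsing.
  def getInt(key: String, defaultValue: Int): Int =
    settings.get(key).map(_.toInt).getOrElse(defaultValue)
  def getLong(key: String, defaultValue: Long): Long =
    settings.get(key).map(_.toLong).getOrElse(defaultValue)
}
```

The overload keeps `conf.get(key, default)` as terse as `getOrElse` was, while the typed variants remove the `.toInt`/`.toLong` boilerplate from callers.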
* | | | | | Merge remote-tracking branch 'apache/master' into conf2  (Matei Zaharia, 2014-01-01, 11 files, -21/+45)
|\ \ \ \ \ \
| | | | | |     Conflicts:
| | | | | |         core/src/main/scala/org/apache/spark/SparkContext.scala
| | | | | |         core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala
| | | | | |         core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
| * \ \ \ \ \ Merge remote-tracking branch 'apache-github/master' into log4j-fix-2  (Patrick Wendell, 2014-01-01, 27 files, -311/+570)
| |\ \ \ \ \ \
| | | | | | |     Conflicts:
| | | | | | |         streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
| * | | | | | | Adding outer checkout when initializing logging  (Patrick Wendell, 2013-12-31, 1 file, -3/+5)
| | | | | | | |
| * | | | | | | Tiny typo fix  (Patrick Wendell, 2013-12-31, 1 file, -2/+2)
| | | | | | | |
| * | | | | | | Removing use in test  (Patrick Wendell, 2013-12-31, 1 file, -2/+0)
| | | | | | | |
| * | | | | | | Minor fixes  (Patrick Wendell, 2013-12-30, 2 files, -3/+3)
| | | | | | | |
| * | | | | | | Removing initLogging entirely  (Patrick Wendell, 2013-12-30, 9 files, -17/+21)
| | | | | | | |
| * | | | | | | Response to Shivaram's review  (Patrick Wendell, 2013-12-30, 1 file, -1/+1)
| | | | | | | |
| * | | | | | | SPARK-1008: Logging improvements  (Patrick Wendell, 2013-12-29, 2 files, -3/+23)
| | | | | | |     1. Adds a default log4j file that gets loaded if users haven't specified a log4j file.
| | | | | | |     2. Isolates use of the tools assembly jar. I found this produced SLF4J warnings after building with SBT (and I've seen similar warnings on the mailing list).
* | | | | | | Merge remote-tracking branch 'apache/master' into conf2  (Matei Zaharia, 2014-01-01, 10 files, -210/+450)
|\ \ \ \ \ \ \
| | | | | |     Conflicts:
| | | | | |         project/SparkBuild.scala
| * | | | | | restore core/pom.xml file modification  (liguoqiang, 2014-01-01, 1 file, -1351/+235)
| | | | | | |
| * | | | | | Merge pull request #73 from falaki/ApproximateDistinctCount  (Reynold Xin, 2013-12-31, 10 files, -232/+1588)
| |\ \ \ \ \ \
| | |     Approximate distinct count
| | |     Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximately count the distinct number of elements and the distinct number of values per key, respectively. Both functions use HyperLogLog from stream-lib for counting. Both functions take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.
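The accuracy-vs-memory trade-off behind countApproxDistinct() can be shown with a much simpler cousin of HyperLogLog. The sketch below is *not* the stream-lib HyperLogLog the PR uses; it is a "linear counting" sketch with the same shape of API: duplicates are absorbed by hashing into a fixed bitmap, and a larger bitmap buys a tighter estimate.

```scala
import scala.util.hashing.MurmurHash3

// Linear counting: hash each item into one of numBits slots; estimate
// the distinct count from the fraction of slots still empty.
class LinearCounter(numBits: Int) {
  private val bits = new java.util.BitSet(numBits)

  def add(item: Any): Unit = {
    val h = MurmurHash3.stringHash(item.toString)
    bits.set(math.floorMod(h, numBits))  // duplicates hit the same slot
  }

  // n ≈ -m * ln(zeroFraction); exact when no slots collide, and a good
  // approximation while the bitmap is lightly loaded.
  def estimate: Double = {
    val zeros = numBits - bits.cardinality()
    if (zeros == 0) Double.PositiveInfinity // bitmap saturated: use more bits
    else -numBits * math.log(zeros.toDouble / numBits)
  }
}
```

As with the relative-accuracy parameter of countApproxDistinct(), the memory spent (here, `numBits`) directly controls the expected error of the estimate; HyperLogLog just achieves far better accuracy per byte.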
| | * | | | | | Made the code more compact and readable  (Hossein Falaki, 2013-12-31, 3 files, -23/+8)
| | | | | | | |
| | * | | | | | minor improvements  (Hossein Falaki, 2013-12-31, 2 files, -4/+5)
| | | | | | | |
| | * | | | | | Added Java unit tests for countApproxDistinct and countApproxDistinctByKey  (Hossein Falaki, 2013-12-30, 1 file, -0/+32)
| | | | | | | |
| | * | | | | | Added Java API for countApproxDistinct  (Hossein Falaki, 2013-12-30, 1 file, -0/+11)
| | | | | | | |
| | * | | | | | Added Java API for countApproxDistinctByKey  (Hossein Falaki, 2013-12-30, 1 file, -0/+36)
| | | | | | | |
| | * | | | | | Renamed countDistinct and countDistinctByKey methods to include Approx  (Hossein Falaki, 2013-12-30, 5 files, -15/+15)
| | | | | | | |
| | * | | | | | Using origin version  (Hossein Falaki, 2013-12-30, 194 files, -4956/+8379)
| | |\ \ \ \ \ \ | | | | |_|_|/ / | | | |/| | | |
| | * | | | | | Removed superfluous abs call from test cases.  (Hossein Falaki, 2013-12-10, 1 file, -2/+2)
| | | | | | | |
| | * | | | | | Made SerializableHyperLogLog Externalizable and added Kryo tests  (Hossein Falaki, 2013-10-18, 2 files, -5/+10)
| | | | | | | |
| | * | | | | | Added stream-lib dependency to Maven build  (Hossein Falaki, 2013-10-18, 1 file, -0/+4)
| | | | | | | |
| | * | | | | | Improved code style.  (Hossein Falaki, 2013-10-17, 4 files, -15/+19)
| | | | | | | |
| | * | | | | | Fixed document typo  (Hossein Falaki, 2013-10-17, 2 files, -4/+4)
| | | | | | | |
| | * | | | | | Added countDistinctByKey to PairRDDFunctions that counts the approximate number of unique values for each key in the RDD.  (Hossein Falaki, 2013-10-17, 2 files, -0/+81)
| | | | | | | |
| | * | | | | | Added a countDistinct method to RDD that takes an accuracy parameter and returns the (approximate) number of distinct elements in the RDD.  (Hossein Falaki, 2013-10-17, 2 files, -1/+38)
| | | | | | | |
| | * | | | | | Added a serializable wrapper for HyperLogLog  (Hossein Falaki, 2013-10-17, 1 file, -0/+44)
| | | | | | | |
* | | | | | | | Merge remote-tracking branch 'apache/master' into conf2  (Matei Zaharia, 2013-12-31, 19 files, -107/+120)
|\| | | | | | |
| | | | | |     Conflicts:
| | | | | |         core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
| | | | | |         streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
| | | | | |         streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
| * | | | | | | Merge pull request #238 from ngbinh/upgradeNetty  (Patrick Wendell, 2013-12-31, 6 files, -42/+58)
| |\ \ \ \ \ \ \
| | |     upgrade Netty from 4.0.0.Beta2 to 4.0.13.Final; the changes are listed at https://github.com/netty/netty/wiki/New-and-noteworthy
| | * | | | | | | Fix failed unit tests  (Binh Nguyen, 2013-12-27, 3 files, -13/+24)
| | | | | | | |     Also clean up a bit.
| | * | | | | | | Fix imports order  (Binh Nguyen, 2013-12-24, 3 files, -5/+2)
| | | | | | | | |
| | * | | | | | | Remove import * and fix some formatting  (Binh Nguyen, 2013-12-24, 2 files, -7/+4)
| | | | | | | | |
| | * | | | | | | upgrade Netty from 4.0.0.Beta2 to 4.0.13.Final  (Binh Nguyen, 2013-12-24, 5 files, -29/+40)
| | | |/ / / / / | | |/| | | | |
| * | | | | | | Merge pull request #289 from tdas/filestream-fix  (Patrick Wendell, 2013-12-31, 5 files, -45/+44)
| |\ \ \ \ \ \ \
| | |     Bug fixes for file input stream and checkpointing
| | |     - Fixed bugs in the file input stream that led the stream to fail due to transient HDFS errors (e.g., listing files while a background thread is deleting them caused errors).
| | |     - Updated Spark's CheckpointRDD and Streaming's CheckpointWriter to use SparkContext.hadoopConfiguration, to allow checkpoints to be written to any HDFS-compatible store requiring special configuration.
| | |     - Changed the API of SparkContext.setCheckpointDir() - eliminated the unnecessary 'useExisting' parameter. Now SparkContext will always create a unique subdirectory within the user-specified checkpoint directory. This is to ensure that previous checkpoint files are not accidentally overwritten.
| | |     - Fixed bug where setting the checkpoint directory to a relative local path caused the checkpointing to fail.
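The unique-subdirectory behaviour described for setCheckpointDir() can be sketched in a few lines. This is a hypothetical illustration, not Spark's implementation (which works against a Hadoop FileSystem rather than java.nio): each call creates a fresh, randomly named subdirectory under the user-supplied directory, so a second application pointed at the same path can never overwrite earlier checkpoint files.

```scala
import java.nio.file.{Files, Paths}
import java.util.UUID

object CheckpointDirSketch {
  // Always create a fresh, uniquely named subdirectory under the
  // user-supplied checkpoint directory, and return its path.
  def setCheckpointDir(userDir: String): String = {
    val unique = Paths.get(userDir, UUID.randomUUID().toString)
    Files.createDirectories(unique)
    unique.toString
  }
}
```

Dropping the old 'useExisting' flag removes the one code path where two jobs could legitimately share (and therefore clobber) a checkpoint directory.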