External spilling - generalize batching logic
The existing implementation is a hack specific to Kryo and only works with LZF compression. Introducing an intermediate batch-level stream handles pre-fetching and other arbitrary behavior of higher-level streams in a more general way (see the sketch below).
Author: Andrew Or <andrewor14@gmail.com>
== Merge branch commits ==
commit 3ddeb7ef89a0af2b685fb5d071aa0f71c975cc82
Author: Andrew Or <andrewor14@gmail.com>
Date: Wed Feb 5 12:09:32 2014 -0800
Also privatize fields
commit 090544a87a0767effd0c835a53952f72fc8d24f0
Author: Andrew Or <andrewor14@gmail.com>
Date: Wed Feb 5 10:58:23 2014 -0800
Privatize methods
commit 13920c918efe22e66a1760b14beceb17a61fd8cc
Author: Andrew Or <andrewor14@gmail.com>
Date: Tue Feb 4 16:34:15 2014 -0800
Update docs
commit bd5a1d7350467ed3dc19c2de9b2c9f531f0e6aa3
Author: Andrew Or <andrewor14@gmail.com>
Date: Tue Feb 4 13:44:24 2014 -0800
Typo: phyiscal -> physical
commit 287ef44e593ad72f7434b759be3170d9ee2723d2
Author: Andrew Or <andrewor14@gmail.com>
Date: Tue Feb 4 13:38:32 2014 -0800
Avoid reading the entire batch into memory; also simplify streaming logic
Additionally, address formatting comments.
commit 3df700509955f7074821e9aab1e74cb53c58b5a5
Merge: a531d2e 164489d
Author: Andrew Or <andrewor14@gmail.com>
Date: Mon Feb 3 18:27:49 2014 -0800
Merge branch 'master' of github.com:andrewor14/incubator-spark
commit a531d2e347acdcecf2d0ab72cd4f965ab5e145d8
Author: Andrew Or <andrewor14@gmail.com>
Date: Mon Feb 3 18:18:04 2014 -0800
Relax assumptions on compressors and serializers when batching
This commit introduces an intermediate layer of an input stream on the batch level.
This guards against interference from higher level streams (i.e. compression and
deserialization streams), especially pre-fetching, without specifically targeting
particular libraries (Kryo) and forcing shuffle spill compression to use LZF.
commit 164489d6f176bdecfa9dabec2dfce5504d1ee8af
Author: Andrew Or <andrewor14@gmail.com>
Date: Mon Feb 3 18:18:04 2014 -0800
Relax assumptions on compressors and serializers when batching
This commit introduces an intermediate layer of an input stream on the batch level.
This guards against interference from higher level streams (i.e. compression and
deserialization streams), especially pre-fetching, without specifically targeting
particular libraries (Kryo) and forcing shuffle spill compression to use LZF.
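A minimal sketch of the batch-level stream idea described above (class and member names here are assumptions for illustration, not Spark's actual implementation): each spilled batch gets its own bounded input stream, so a compression or deserialization stream layered on top can pre-fetch only within its own batch.

```scala
import java.io.{FileInputStream, InputStream}

// Hypothetical batch-bounded reader. `batchSizes` is assumed to have been
// recorded when the batches were written; streams must be consumed in order.
class BatchedSpillReader(file: java.io.File, batchSizes: Seq[Long]) {
  private val underlying = new FileInputStream(file)

  // One bounded stream per batch: a higher-level stream wrapped around it
  // cannot read past the batch boundary, however aggressively it buffers.
  def batchStreams: Iterator[InputStream] = batchSizes.iterator.map { size =>
    new InputStream {
      private var remaining = size
      override def read(): Int =
        if (remaining <= 0) -1
        else { remaining -= 1; underlying.read() }
    }
  }
}
```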
Added spark.shuffle.file.buffer.kb to configuration doc.
Author: Reynold Xin <rxin@apache.org>
== Merge branch commits ==
commit 0eea1d761ff772ff89be234e1e28035d54e5a7de
Author: Reynold Xin <rxin@apache.org>
Date: Wed Jan 29 14:40:48 2014 -0800
Added spark.shuffle.file.buffer.kb to configuration doc.
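For reference, a hedged sketch of setting this property in application code; the value shown is only an example, not a recommendation.

```scala
import org.apache.spark.SparkConf

// spark.shuffle.file.buffer.kb sizes the in-memory buffer (in kilobytes)
// used for each shuffle file output stream.
val conf = new SparkConf().set("spark.shuffle.file.buffer.kb", "100")
```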
Updated Spark Streaming Programming Guide
Here is the updated version of the Spark Streaming Programming Guide. This is still a work in progress, but the major changes are in place, so feedback is most welcome.
In general, I have tried to make the guide easier to understand even if the reader does not know much about Spark. The updated website is hosted here:
http://www.eecs.berkeley.edu/~tdas/spark_docs/streaming-programming-guide.html
The major changes are:
- Overview illustrates the use cases of Spark Streaming - various input sources and various output sources
- An example right after the overview to quickly give an idea of what a Spark Streaming program looks like
- Made the Java API and examples first-class citizens like Scala by using tabs to show both Scala and Java examples (similar to the AMPCamp tutorial's code tabs)
- Highlighted the DStream operations updateStateByKey and transform because of their powerful nature (see the sketch after this list)
- Updated driver node failure recovery text to highlight automatic recovery in Spark standalone mode
- Added information about linking to and using external input sources like Kafka and Flume
- In general, reorganized the sections to better separate the basic sections from the more advanced ones like Tuning and Recovery.
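To illustrate why updateStateByKey is highlighted, here is a minimal sketch in the spirit of the guide's examples (the socket source, host, port, and checkpoint directory are illustrative assumptions): it maintains a running word count across all batches.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object RunningWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("RunningWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("checkpoint")  // updateStateByKey requires checkpointing

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))

    // Merge this batch's counts into the running state kept per key.
    val runningCounts = pairs.updateStateByKey[Int] { (newValues, state) =>
      Some(newValues.sum + state.getOrElse(0))
    }

    runningCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```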
Todos:
- Links to the docs of external Kafka, Flume, etc.
- Illustrate window operation with figure as well as example.
Author: Tathagata Das <tathagata.das1565@gmail.com>
== Merge branch commits ==
commit 18ff10556570b39d672beeb0a32075215cfcc944
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Tue Jan 28 21:49:30 2014 -0800
Fixed a lot of broken links.
commit 34a5a6008dac2e107624c7ff0db0824ee5bae45f
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Tue Jan 28 18:02:28 2014 -0800
Updated github url to use SPARK_GITHUB_URL variable.
commit f338a60ae8069e0a382d2cb170227e5757cc0b7a
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Mon Jan 27 22:42:42 2014 -0800
More updates based on Patrick and Harvey's comments.
commit 89a81ff25726bf6d26163e0dd938290a79582c0f
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Mon Jan 27 13:08:34 2014 -0800
Updated docs based on Patrick's PR comments.
commit d5b6196b532b5746e019b959a79ea0cc013a8fc3
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Sun Jan 26 20:15:58 2014 -0800
Added spark.streaming.unpersist config and info on StreamingListener interface.
commit e3dcb46ab83d7071f611d9b5008ba6bc16c9f951
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Sun Jan 26 18:41:12 2014 -0800
Fixed docs on StreamingContext.getOrCreate.
commit 6c29524639463f11eec721e4d17a9d7159f2944b
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Thu Jan 23 18:49:39 2014 -0800
Added example and figure for window operations, and links to Kafka and Flume API docs.
commit f06b964a51bb3b21cde2ff8bdea7d9785f6ce3a9
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Wed Jan 22 22:49:12 2014 -0800
Fixed missing endhighlight tag in the MLlib guide.
commit 036a7d46187ea3f2a0fb8349ef78f10d6c0b43a9
Merge: eab351d a1cd185
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Wed Jan 22 22:17:42 2014 -0800
Merge remote-tracking branch 'apache/master' into docs-update
commit eab351d05c0baef1d4b549e1581310087158d78d
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Wed Jan 22 22:17:15 2014 -0800
Update Spark Streaming Programming Guide.
Allow files added through SparkContext.addFile() to be overwritten
This is useful for the cases when a file needs to be refreshed and downloaded by the executors periodically. For example, a possible use case is: the driver periodically renews a Hadoop delegation token and writes it to a token file. The token file needs to be downloaded by the executors whenever it gets renewed. However, the current implementation throws an exception when the target file exists and its contents do not match those of the new source. This PR adds an option to allow files to be overwritten to support use cases similar to the above.
Signed-off-by: Yinan Li <liyinan926@gmail.com>
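A hedged sketch of the usage this enables; the property name spark.files.overwrite follows the description above, and the token-file path is illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Allow executors to overwrite a previously fetched copy of an added file
// when its contents change, instead of throwing on the mismatch.
val conf = new SparkConf()
  .setAppName("TokenRefreshExample")
  .set("spark.files.overwrite", "true")
val sc = new SparkContext(conf)

// The driver periodically renews a delegation token and rewrites this file;
// executors pick up the new contents the next time they fetch it.
sc.addFile("/path/to/delegation.token")
```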
This is useful for cases where a file needs to be refreshed and downloaded by the executors periodically.
Signed-off-by: Yinan Li <liyinan926@gmail.com>
It's the task count across the cluster, not per worker, per machine, per core, or anything else.
Remove Typesafe Config usage and conf files to fix nested property names
With Typesafe Config we had the subtle problem of no longer allowing
nested property names, which are used for a few of our properties (for
example, a key like spark.speculation that is also a prefix of
spark.speculation.interval):
http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html
This PR targets branch 0.9 but should be merged into master too.
(cherry picked from commit 34e911ce9a9f91f3259189861779032069257852)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slightly renames config options.
4. Uses Spark's buffer size for reads in addition to writes.
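A minimal sketch of the batching in item 2, with plain JDK serialization standing in for Spark's pluggable serializer (all names here are assumptions): each batch is serialized independently and its on-disk length recorded, so the read side can bound each batch's stream instead of buffering the whole file.

```scala
import java.io.{ByteArrayOutputStream, FileOutputStream, ObjectOutputStream}

// Write records in fixed-size batches; return the byte length of each
// batch so a reader can later open one bounded stream per batch.
def spillInBatches(records: Iterator[AnyRef],
                   file: java.io.File,
                   batchSize: Int): Seq[Long] = {
  val out = new FileOutputStream(file)
  val batchSizes = collection.mutable.ArrayBuffer.empty[Long]
  try {
    records.grouped(batchSize).foreach { batch =>
      val buffer = new ByteArrayOutputStream()
      val objects = new ObjectOutputStream(buffer)
      batch.foreach(objects.writeObject)
      objects.close()
      batchSizes += buffer.size().toLong  // on-disk length of this batch
      buffer.writeTo(out)
    }
  } finally {
    out.close()
  }
  batchSizes.toSeq
}
```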
External Sorting for Aggregator and CoGroupedRDDs (Revisited)
(This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / GitHub was misbehaving.)
The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.
The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order.
Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
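A minimal sketch of the spill-on-threshold idea, simplified in two ways: Spark tracks estimated bytes rather than entry counts, and sorts by key hash code rather than requiring a full key ordering; the merge of sorted runs on read is elided here.

```scala
import scala.collection.mutable

// Aggregate in memory; once the map grows past the threshold, sort the
// entries and hand them off as one sorted run to be merged later.
class SpillableMap[K: Ordering, V](mergeValue: (V, V) => V,
                                   maxEntries: Int,
                                   writeSortedRun: Seq[(K, V)] => Unit) {
  private val inMemory = mutable.Map.empty[K, V]

  def insert(key: K, value: V): Unit = {
    inMemory(key) = inMemory.get(key) match {
      case Some(existing) => mergeValue(existing, value)
      case None           => value
    }
    if (inMemory.size >= maxEntries) {
      writeSortedRun(inMemory.toSeq.sortBy(_._1))  // sorted, so runs merge cheaply
      inMemory.clear()
    }
  }
}
```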
Aside from trivial formatting changes, use nulls instead of Options for
DiskMapIterator, and add documentation for spark.shuffle.externalSorting
and spark.shuffle.memoryFraction.
Also, set spark.shuffle.memoryFraction to 0.3 and spark.storage.memoryFraction to 0.6.
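For reference, a hedged sketch of these settings made explicit in application code:

```scala
import org.apache.spark.SparkConf

// Fractions of the executor heap reserved for shuffle spill buffers and
// for cached blocks, matching the values described in this change.
val conf = new SparkConf()
  .set("spark.shuffle.memoryFraction", "0.3")
  .set("spark.storage.memoryFraction", "0.6")
```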
Bump this to being enabled for 0.9.0.
This is a very esoteric option and it's out of sync with the style we use.
So it seems fitting to fix it for 0.9.0.
Also documents the spark.deploy.spreadOut option.
It controls the count of cores across the cluster, not on a per-machine basis.
Conflicts:
core/pom.xml
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
pom.xml
project/SparkBuild.scala
streaming/pom.xml
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
Clarify compression property.
Clarifies that this governs compression of internal data, not input
data or output data.
- Add job scheduling docs
- Rename some fair scheduler properties
- Organize intro page better
- Link to Apache wiki for "contributing to Spark"
* RDD, *RDDFunctions -> org.apache.spark.rdd
* Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
* JavaSerializer, KryoSerializer -> org.apache.spark.serializer
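Illustrative imports reflecting the moves listed above, assuming (as the arrows suggest) that the classes previously lived at the package root:

```scala
// import org.apache.spark.RDD                       // old location
import org.apache.spark.rdd.RDD                      // new location
// import org.apache.spark.SizeEstimator             // old location
import org.apache.spark.util.SizeEstimator           // new location
// import org.apache.spark.KryoSerializer            // old location
import org.apache.spark.serializer.KryoSerializer    // new location
```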
and new Python stuff
- When a resourceOffers() call has multiple offers, force the TaskSets
to consider them in increasing order of locality levels so that they
get a chance to launch tasks locally across all offers (see the sketch
after this list)
- Simplify ClusterScheduler.prioritizeContainers
- Add docs on the new configuration options
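A minimal sketch of the ordering described in the first bullet (all types are simplified stand-ins, not Spark's actual scheduler classes): the outer loop walks locality levels from most to least local, and the inner loop walks every offer, so each offer can launch local tasks before locality is relaxed.

```scala
object LocalityOrderedScheduling {
  sealed trait Locality
  case object ProcessLocal extends Locality
  case object NodeLocal    extends Locality
  case object RackLocal    extends Locality
  case object AnyLocality  extends Locality

  case class Task(id: Int)
  case class WorkerOffer(host: String, var freeCores: Int)

  trait TaskSet {
    // Returns a task to launch on this offer at the given locality, if any.
    def resourceOffer(offer: WorkerOffer, maxLocality: Locality): Option[Task]
  }

  def makeOffers(offers: Seq[WorkerOffer], taskSet: TaskSet): Seq[(Task, WorkerOffer)] = {
    val launched = Seq.newBuilder[(Task, WorkerOffer)]
    for (locality <- Seq(ProcessLocal, NodeLocal, RackLocal, AnyLocality)) {
      for (offer <- offers if offer.freeCores > 0) {
        taskSet.resourceOffer(offer, locality).foreach { task =>
          offer.freeCores -= 1  // consume a core on this offer
          launched += ((task, offer))
        }
      }
    }
    launched.result()
  }
}
```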
Conflicts:
docs/configuration.md