path: root/python/pyspark/context.py
Commit history (each entry: commit message, author, date, files changed, lines -/+)
* Fix Python code after change of getOrElse (Matei Zaharia, 2014-01-01, 1 file, -6/+8)
* Merge remote-tracking branch 'apache/master' into conf2 (Matei Zaharia, 2013-12-31, 1 file, -7/+2)
    Conflicts:
      core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
      streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
      streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
  * Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's
    comments on PR 289. (Tathagata Das, 2013-12-24, 1 file, -7/+2)
* Updated docs for SparkConf and handled review comments (Matei Zaharia, 2013-12-30, 1 file, -12/+12)
* Properly show Spark properties on web UI, and change app name property (Matei Zaharia, 2013-12-29, 1 file, -2/+2)
* Fix some Python docs and make sure to unset SPARK_TESTING in Python tests so we
  don't get the test spark.conf on the classpath. (Matei Zaharia, 2013-12-29, 1 file, -1/+2)
* Add Python docs about SparkConf (Matei Zaharia, 2013-12-29, 1 file, -1/+2)
* Fix some other Python tests due to initializing JVM in a different way (Matei Zaharia, 2013-12-29, 1 file, -8/+15)
    The test in context.py created two different instances of the SparkContext class
    by copying "globals", so that some tests can have a global "sc" object and others
    can try initializing their own contexts. This led to two JVM gateways being
    created since SparkConf also looked at pyspark.context.SparkContext to get the JVM.
* Add SparkConf support in Python (Matei Zaharia, 2013-12-29, 1 file, -12/+28)
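  A minimal sketch of the SparkConf-style setup this commit enables in PySpark
  (the master URL, app name, and setting below are illustrative, not taken from
  the commit):

      from pyspark import SparkConf, SparkContext

      conf = (SparkConf()
              .setMaster("local")
              .setAppName("conf-example")
              .set("spark.executor.memory", "1g"))
      sc = SparkContext(conf=conf)
      print(conf.get("spark.app.name"))
      sc.stop()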
* Fix Python use of getLocalDir (Matei Zaharia, 2013-12-29, 1 file, -1/+1)
* Add collectPartition to JavaRDD interface. (Shivaram Venkataraman, 2013-12-18, 1 file, -3/+0)
    Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
* FramedSerializer: _dumps => dumps, _loads => loads. (Josh Rosen, 2013-11-10, 1 file, -1/+1)
* Add custom serializer support to PySpark. (Josh Rosen, 2013-11-10, 1 file, -16/+45)
    For now, this only adds MarshalSerializer, but it lays the groundwork for
    supporting other custom serializers. Many of these mechanisms can also be used
    to support deserialization of different data formats sent by Java, such as data
    encoded by MsgPack. This also fixes a bug in SparkContext.union().
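  A short sketch of how a custom serializer is typically passed to a PySpark
  context after this change (the app name and data are illustrative):

      from pyspark import SparkContext
      from pyspark.serializers import MarshalSerializer

      # MarshalSerializer is faster than the default pickle-based serializer,
      # but supports fewer Python types.
      sc = SparkContext("local", "serializer-example",
                        serializer=MarshalSerializer())
      print(sc.parallelize(range(10)).map(lambda x: x * 2).collect())
      sc.stop()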
* Remove Pickle-wrapping of Java objects in PySpark. (Josh Rosen, 2013-11-03, 1 file, -5/+5)
    If we support custom serializers, the Python worker will know what type of input
    to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and
    strings.
* Pass self to SparkContext._ensure_initialized. (Ewen Cheslack-Postava, 2013-10-22, 1 file, -1/+10)
    The constructor for SparkContext should pass in self so that we track the current
    context and produce errors if another one is created. Add a doctest to make sure
    creating multiple contexts triggers the exception.
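  A rough illustration of the behavior this commit guards against, assuming a local
  master; the exact exception type and message may differ between versions:

      from pyspark import SparkContext

      sc = SparkContext("local", "first-context")
      try:
          SparkContext("local", "second-context")  # a second active context is rejected
      except ValueError as err:
          print("rejected as expected:", err)
      finally:
          sc.stop()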
* Add classmethod to SparkContext to set system properties. (Ewen Cheslack-Postava, 2013-10-22, 1 file, -12/+29)
    Add a new classmethod to SparkContext to set system properties, as is possible in
    Scala/Java. Unlike the Java/Scala implementations, there's no access to System
    until the JVM bridge is created. Since SparkContext handles that, move the
    initialization of the JVM connection to a separate classmethod that can safely be
    called repeatedly as long as the same instance (or no instance) is provided.
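  A brief usage sketch of the classmethod described above (the property name and
  value are illustrative); it has to be called before any SparkContext is created:

      from pyspark import SparkContext

      SparkContext.setSystemProperty("spark.executor.memory", "2g")
      sc = SparkContext("local", "sysprop-example")
      sc.stop()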
* Whoopsy daisy (Aaron Davidson, 2013-09-08, 1 file, -1/+0)
* Export StorageLevel and refactor (Aaron Davidson, 2013-09-07, 1 file, -23/+12)
* Remove reflection, hard-code StorageLevels (Aaron Davidson, 2013-09-07, 1 file, -22/+24)
    The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise the
    shell would have to call a private method of SparkContext. Having StorageLevel
    available in sc also doesn't seem like the end of the world. There may be a
    better solution, though.

    As for creating the StorageLevel object itself, this seems to be the best way in
    Python 2 for creating singleton, enum-like objects:
    http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python
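  A generic sketch of the enum-like singleton pattern the commit message refers to
  (the class body and flag names here are illustrative, not the actual PySpark code):

      class StorageLevel(object):
          """Enum-like holder for storage flags; members are created once below."""

          def __init__(self, use_disk, use_memory, deserialized, replication=1):
              self.use_disk = use_disk
              self.use_memory = use_memory
              self.deserialized = deserialized
              self.replication = replication

      # Hard-coded singletons attached as class attributes, used like enum members.
      StorageLevel.MEMORY_ONLY = StorageLevel(False, True, True)
      StorageLevel.DISK_ONLY = StorageLevel(True, False, False)
      StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, True)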
* Memoize StorageLevels read from JVM (Aaron Davidson, 2013-09-06, 1 file, -2/+9)
* SPARK-660: Add StorageLevel support in Python (Aaron Davidson, 2013-09-05, 1 file, -0/+14)
    It uses reflection... I am not proud of that fact, but it at least ensures
    compatibility (sans refactoring of the StorageLevel stuff).
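  The user-facing side of this feature is passing a storage level when persisting
  an RDD; a minimal sketch (the storage level choice is illustrative):

      from pyspark import SparkContext, StorageLevel

      sc = SparkContext("local", "persist-example")
      rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
      rdd.persist(StorageLevel.MEMORY_AND_DISK)
      print(rdd.count())
      sc.stop()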
* Move some classes to more appropriate packages: (Matei Zaharia, 2013-09-01, 1 file, -2/+2)
    * RDD, *RDDFunctions -> org.apache.spark.rdd
    * Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
    * JavaSerializer, KryoSerializer -> org.apache.spark.serializer
* Initial work to rename package to org.apache.spark (Matei Zaharia, 2013-09-01, 1 file, -2/+2)
* Implementing SPARK-878 for PySpark: adding zip and egg files to the context and
  passing them down to workers, which add these to their sys.path
  (Andre Schumacher, 2013-08-16, 1 file, -3/+11)
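  A short sketch of shipping a dependency archive to workers as this commit enables
  (the archive path is a placeholder):

      from pyspark import SparkContext

      sc = SparkContext("local", "pyfile-example")
      # .py, .zip, and .egg paths are shipped to executors and added to sys.path
      # there, so tasks can import modules packaged inside the archive.
      sc.addPyFile("/path/to/deps.zip")   # placeholder path
      sc.stop()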
* SPARK-815. Python parallelize() should split lists before batching (Matei Zaharia, 2013-07-29, 1 file, -2/+9)
    One unfortunate consequence of this fix is that we materialize any collections
    that are given to us as generators, but this seems necessary to get reasonable
    behavior on small collections. We could add a batchSize parameter later to bypass
    auto-computation of batch size if this becomes a problem (e.g. if users really
    want to parallelize big generators nicely).
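  A rough, self-contained sketch of the idea (materialize the input, then pick a
  batch size from its length); the exact heuristic Spark uses is not reproduced here:

      def split_then_batch(data, num_slices):
          """Materialize generators, then choose a batch size per slice."""
          items = list(data)                        # generators get materialized
          batch_size = max(1, len(items) // num_slices)
          return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

      print(split_then_batch((x * x for x in range(10)), 3))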
* Add Apache license headers and LICENSE and NOTICE files (Matei Zaharia, 2013-07-16, 1 file, -0/+17)
* Fix reporting of PySpark doctest failures. (Josh Rosen, 2013-02-03, 1 file, -1/+3)
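  A common pattern for making doctest failures visible to a test runner, roughly
  the kind of change a fix like this involves (a generic sketch, not the commit's
  exact diff):

      import doctest
      import sys

      if __name__ == "__main__":
          (failure_count, _test_count) = doctest.testmod()
          if failure_count:
              sys.exit(-1)   # propagate a non-zero exit code so failures are not silent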
* Use spark.local.dir for PySpark temp files (SPARK-580). (Josh Rosen, 2013-02-01, 1 file, -4/+8)
* Do not launch JavaGateways on workers (SPARK-674). (Josh Rosen, 2013-02-01, 1 file, -10/+17)
    The problem was that the gateway was being initialized whenever the
    pyspark.context module was loaded. The fix uses lazy initialization that occurs
    only when SparkContext instances are actually constructed. I also made the
    gateway and jvm variables private. This change results in ~3-4x performance
    improvement when running the PySpark unit tests.
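  A generic sketch of the lazy-initialization pattern described above (the class
  and the gateway factory are stand-ins, not the real PySpark internals):

      class Context(object):
          _gateway = None   # shared across instances, created only on first use

          def __init__(self):
              Context._ensure_initialized()

          @classmethod
          def _ensure_initialized(cls):
              # Launching the gateway is expensive, so defer it from module import
              # time to the moment a context is actually constructed.
              if cls._gateway is None:
                  cls._gateway = cls._launch_gateway()
              return cls._gateway

          @staticmethod
          def _launch_gateway():
              return object()   # placeholder for the real JVM gateway launcher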
* Merge pull request #396 from JoshRosen/spark-653 (Matei Zaharia, 2013-01-24, 1 file, -10/+5)
    Make PySpark AccumulatorParam an abstract base class
  * Make AccumulatorParam an abstract base class. (Josh Rosen, 2013-01-21, 1 file, -10/+5)
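  A sketch of a custom AccumulatorParam subclass as the interface is typically used
  (the ListParam class and the data here are illustrative):

      from pyspark import SparkContext
      from pyspark.accumulators import AccumulatorParam

      class ListParam(AccumulatorParam):
          """Accumulate Python lists by concatenation."""
          def zero(self, value):
              return []
          def addInPlace(self, value1, value2):
              value1.extend(value2)
              return value1

      sc = SparkContext("local", "accum-param-example")
      acc = sc.accumulator([], ListParam())
      sc.parallelize([[1], [2, 3], [4]]).foreach(lambda xs: acc.add(xs))
      print(acc.value)   # e.g. [1, 2, 3, 4] (ordering may vary)
      sc.stop()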
* Allow PySpark's SparkFiles to be used from driver (Josh Rosen, 2013-01-23, 1 file, -6/+21)
    Fix minor documentation formatting issues.
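  A minimal sketch of resolving a distributed file via SparkFiles on the driver,
  which this commit allows (the file path is a placeholder):

      from pyspark import SparkContext, SparkFiles

      sc = SparkContext("local", "sparkfiles-example")
      sc.addFile("/path/to/lookup.txt")          # placeholder path
      # After this change, SparkFiles.get() also resolves paths on the driver,
      # not just inside tasks running on workers.
      print(SparkFiles.get("lookup.txt"))
      sc.stop()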
* Fix sys.path bug in PySpark SparkContext.addPyFile (Josh Rosen, 2013-01-22, 1 file, -2/+0)
* Don't download files to master's working directory. (Josh Rosen, 2013-01-21, 1 file, -4/+36)
    This should avoid exceptions caused by existing files with different contents.
    I also removed some unused code.
* Update checkpointing API docs in Python/Java. (Josh Rosen, 2013-01-20, 1 file, -4/+7)
* Add checkpointFile() and more tests to PySpark. (Josh Rosen, 2013-01-20, 1 file, -1/+5)
* Add RDD checkpointing to Python API. (Josh Rosen, 2013-01-20, 1 file, -0/+9)
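  A short usage sketch of the checkpointing API added here (the checkpoint
  directory is a placeholder):

      from pyspark import SparkContext

      sc = SparkContext("local", "checkpoint-example")
      sc.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder directory

      rdd = sc.parallelize(range(100)).map(lambda x: x * x)
      rdd.checkpoint()          # mark the RDD to be checkpointed
      rdd.count()               # an action forces computation and the checkpoint
      print(rdd.isCheckpointed())
      sc.stop()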
* Added accumulators to PySpark (Matei Zaharia, 2013-01-20, 1 file, -0/+38)
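  A minimal sketch of the accumulator API this commit introduces (names and data
  are illustrative):

      from pyspark import SparkContext

      sc = SparkContext("local", "accumulator-example")
      counter = sc.accumulator(0)                  # numeric accumulator, starts at 0
      sc.parallelize(range(10)).foreach(lambda x: counter.add(x))
      print(counter.value)                         # 45 once the action has finished
      sc.stop()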
* Change PYSPARK_PYTHON_EXEC to PYSPARK_PYTHON. (Josh Rosen, 2013-01-10, 1 file, -1/+1)
* Change PySpark RDD.take() to not call iterator(). (Josh Rosen, 2013-01-03, 1 file, -0/+1)
* Rename top-level 'pyspark' directory to 'python' (Josh Rosen, 2013-01-01, 1 file, -0/+158)