path: root/python/pyspark
Commit message | Author | Date | Files | Lines
...
| * | | | Make collectPartitions take an array of partitions | Shivaram Venkataraman | 2013-12-19 | 1 | -1/+6
| | | | |     Change the implementation to use runJob instead of PartitionPruningRDD. Also update the unit tests and the python take implementation to use the new interface.
| * | | | Add collectPartition to JavaRDD interface. | Shivaram Venkataraman | 2013-12-18 | 2 | -4/+1
| |/ / /      Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
* / / / Add toString to Java RDD, and __repr__ to Python RDD | Nick Pentreath | 2013-12-19 | 1 | -0/+3
|/ / /
* | | Merge branch 'master' into akka-bug-fix | Prashant Sharma | 2013-12-11 | 2 | -1/+19
|\ \ \      Conflicts:
| | | |         core/pom.xml
| | | |         core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
| | | |         pom.xml
| | | |         project/SparkBuild.scala
| | | |         streaming/pom.xml
| | | |         yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
| * | | Fix UnicodeEncodeError in PySpark saveAsTextFile(). | Josh Rosen | 2013-11-28 | 2 | -1/+19
| | | |     Fixes SPARK-970.
* | | | Merge branch 'master' into wip-scala-2.10 | Prashant Sharma | 2013-11-27 | 6 | -141/+381
|\| | |     Conflicts:
| | | |         core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
| | | |         core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
| | | |         core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala
| | | |         core/src/main/scala/org/apache/spark/rdd/RDD.scala
| | | |         python/pyspark/rdd.py
| * | | Removed unused basestring case from dump_stream. | Josh Rosen | 2013-11-26 | 1 | -2/+0
| | | |
| * | | FramedSerializer: _dumps => dumps, _loads => loads. | Josh Rosen | 2013-11-10 | 4 | -18/+18
| | | |
| * | | Send PySpark commands as bytes instead of strings. | Josh Rosen | 2013-11-10 | 3 | -16/+13
| | | |
| * | | Add custom serializer support to PySpark. | Josh Rosen | 2013-11-10 | 6 | -147/+360
| | | |     For now, this only adds MarshalSerializer, but it lays the groundwork for supporting other custom serializers. Many of these mechanisms can also be used to support deserialization of different data formats sent by Java, such as data encoded by MsgPack. This also fixes a bug in SparkContext.union().
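For context, a minimal usage sketch of selecting a non-default serializer through the serializer keyword this change introduces (the app name and data are illustrative):

    from pyspark import SparkContext
    from pyspark.serializers import MarshalSerializer

    # MarshalSerializer trades generality for speed: it handles only the types
    # the marshal module supports, unlike the default pickle-based serializer.
    sc = SparkContext("local", "serializer-demo", serializer=MarshalSerializer())
    print(sc.parallelize(range(10)).map(lambda x: x * 2).collect())
    sc.stop()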
| * | | Remove Pickle-wrapping of Java objects in PySpark. | Josh Rosen | 2013-11-03 | 4 | -14/+39
| | | |     If we support custom serializers, the Python worker will know what type of input to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and strings.
| * | | Replace magic lengths with constants in PySpark. | Josh Rosen | 2013-11-03 | 2 | -6/+13
| | | |     Write the length of the accumulators section up-front rather than terminating it with a negative length. I find this easier to read.
* | | | Merge branch 'master' into scala-2.10 | Raymond Liu | 2013-11-13 | 2 | -13/+50
|\| | |
| * | | Pass self to SparkContext._ensure_initialized. | Ewen Cheslack-Postava | 2013-10-22 | 1 | -1/+10
| | | |     The constructor for SparkContext should pass in self so that we track the current context and produce errors if another one is created. Add a doctest to make sure creating multiple contexts triggers the exception.
| * | | Add classmethod to SparkContext to set system properties. | Ewen Cheslack-Postava | 2013-10-22 | 1 | -12/+29
| | | |     Add a new classmethod to SparkContext to set system properties, as is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.
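As a hedged usage sketch of the classmethod described above (the property name and value are illustrative), the call has to happen before a SparkContext is created:

    from pyspark import SparkContext

    # Being a classmethod, this works before any SparkContext exists; it brings
    # up the JVM/Py4J connection on demand.
    SparkContext.setSystemProperty("spark.executor.memory", "2g")

    sc = SparkContext("local", "sysprops-demo")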
| * | | Add an add() method to pyspark accumulators. | Ewen Cheslack-Postava | 2013-10-19 | 1 | -1/+12
| | | |     Add a regular method for adding a term to accumulators in pyspark. Currently, if you have a non-global accumulator, adding to it is awkward. The += operator can't be used for non-global accumulators captured via closure because it involves an assignment. The only way to do it is using __iadd__ directly. Adding this method lets you write code like this:
| | | |
| | | |         def main():
| | | |             sc = SparkContext()
| | | |             accum = sc.accumulator(0)
| | | |             rdd = sc.parallelize([1, 2, 3])
| | | |             def f(x):
| | | |                 accum.add(x)
| | | |             rdd.foreach(f)
| | | |             print accum.value
| | | |
| | | |     where using accum += x instead would have caused UnboundLocalError exceptions in workers. Currently it would have to be written as accum.__iadd__(x).
* | | | Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10 | Prashant Sharma | 2013-10-10 | 1 | -7/+53
|\| | |
| * | | Fix PySpark docs and an overly long line of code after fdbae41e | Matei Zaharia | 2013-10-09 | 1 | -8/+8
| | | |
| * | | SPARK-705: implement sortByKey() in PySpark | Andre Schumacher | 2013-10-07 | 1 | -1/+47
| | | |
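A short usage sketch of the sortByKey() API added here (the example data is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local", "sortbykey-demo")
    pairs = sc.parallelize([("b", 2), ("c", 3), ("a", 1)])

    # sortByKey() returns a new RDD of (key, value) pairs ordered by key.
    print(pairs.sortByKey().collect())                        # [('a', 1), ('b', 2), ('c', 3)]
    print(pairs.sortByKey(ascending=False).keys().collect())  # ['c', 'b', 'a']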
* | | | Merge branch 'master' into wip-merge-master | Prashant Sharma | 2013-10-08 | 2 | -4/+10
|\| | |     Conflicts:
| | | |         bagel/pom.xml
| | | |         core/pom.xml
| | | |         core/src/test/scala/org/apache/spark/ui/UISuite.scala
| | | |         examples/pom.xml
| | | |         mllib/pom.xml
| | | |         pom.xml
| | | |         project/SparkBuild.scala
| | | |         repl/pom.xml
| | | |         streaming/pom.xml
| | | |         tools/pom.xml
| | | |
| | | |     In Scala 2.10 a shorter representation is used for naming artifacts, so the artifacts were switched to the shorter Scala version and it was made a property in the pom.
| * | | Fixing SPARK-602: PythonPartitioner | Andre Schumacher | 2013-10-04 | 2 | -4/+10
| |/ /      Currently PythonPartitioner determines partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required e.g. for sorting via PySpark.
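For illustration, a hedged sketch of how a Python-side partition function relates to the behavior described above; the parity-based partitioner and data are made up for the example:

    from pyspark import SparkContext

    sc = SparkContext("local", "partitioner-demo")
    pairs = sc.parallelize([(k, k * k) for k in range(8)])

    # partitionFunc computes a partition ID from the key in Python; the fix above
    # concerns the JVM-side PythonPartitioner honoring that ID rather than
    # re-hashing a serialized form of the key.
    by_parity = pairs.partitionBy(2, partitionFunc=lambda k: k % 2)
    print(by_parity.glom().collect())   # one list of even-keyed pairs, one of odd-keyed pairs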
* | | Merge branch 'master' into scala-2.10 | Prashant Sharma | 2013-10-01 | 1 | -1/+1
|\| |     Conflicts:
| | |         core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala
| | |         docs/_config.yml
| | |         project/SparkBuild.scala
| | |         repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
| * | Update build version in master | Patrick Wendell | 2013-09-24 | 1 | -1/+1
| |/
* | Merge branch 'master' of git://github.com/mesos/spark into scala-2.10 | Prashant Sharma | 2013-09-15 | 5 | -1/+78
|\|     Conflicts:
| |         core/src/main/scala/org/apache/spark/SparkContext.scala
| |         project/SparkBuild.scala
| * Whoopsy daisy | Aaron Davidson | 2013-09-08 | 1 | -1/+0
| |
| * Export StorageLevel and refactor | Aaron Davidson | 2013-09-07 | 5 | -26/+62
| |
| * Remove reflection, hard-code StorageLevels | Aaron Davidson | 2013-09-07 | 2 | -24/+26
| |     The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise the shell would have to call a private method of SparkContext. Having StorageLevel available in sc also doesn't seem like the end of the world. There may be a better solution, though. As for creating the StorageLevel object itself, this seems to be the best way in Python 2 for creating singleton, enum-like objects: http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python
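A minimal sketch of the singleton, enum-like pattern referenced above; class and member names are illustrative, not PySpark's actual definitions:

    class _StorageFlags(object):
        # Simple value object; instances are created once, below.
        def __init__(self, use_disk, use_memory, deserialized, replication=1):
            self.use_disk = use_disk
            self.use_memory = use_memory
            self.deserialized = deserialized
            self.replication = replication

    class StorageLevels(object):
        # Enum-like namespace: each level is created exactly once at import time.
        MEMORY_ONLY = _StorageFlags(False, True, True)
        MEMORY_ONLY_2 = _StorageFlags(False, True, True, replication=2)
        DISK_ONLY = _StorageFlags(True, False, False)

    # Identity comparison behaves like an enum because each level is a singleton.
    assert StorageLevels.MEMORY_ONLY is StorageLevels.MEMORY_ONLY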
| * Memoize StorageLevels read from JVM | Aaron Davidson | 2013-09-06 | 1 | -2/+9
| |
| * SPARK-660: Add StorageLevel support in Python | Aaron Davidson | 2013-09-05 | 3 | -1/+34
| |     It uses reflection... I am not proud of that fact, but it at least ensures compatibility (sans refactoring of the StorageLevel stuff).
* | Merged with master | Prashant Sharma | 2013-09-06 | 14 | -44/+692
|\|
| * Add missing license headers found with RAT | Matei Zaharia | 2013-09-02 | 1 | -1/+18
| |
| * Further fixes to get PySpark to work on Windows | Matei Zaharia | 2013-09-02 | 1 | -5/+12
| |
| * Allow PySpark to launch worker.py directly on Windows | Matei Zaharia | 2013-09-01 | 1 | -4/+7
| |
| * Move some classes to more appropriate packages: | Matei Zaharia | 2013-09-01 | 1 | -2/+2
| |     * RDD, *RDDFunctions -> org.apache.spark.rdd
| |     * Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
| |     * JavaSerializer, KryoSerializer -> org.apache.spark.serializer
| * Add banner to PySpark and make wordcount output nicer | Matei Zaharia | 2013-09-01 | 1 | -0/+13
| |
| * Initial work to rename package to org.apache.spark | Matei Zaharia | 2013-09-01 | 3 | -5/+5
| |
| *   Merge pull request #861 from AndreSchumacher/pyspark_sampling_function | Matei Zaharia | 2013-08-31 | 2 | -7/+167
| |\      Pyspark sampling function
| | * RDD sample() and takeSample() prototypes for PySpark | Andre Schumacher | 2013-08-28 | 2 | -7/+167
| | |
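A short usage sketch of the two sampling APIs (seed and sizes are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local", "sampling-demo")
    rdd = sc.parallelize(range(100))

    # sample() is a lazy transformation keeping roughly `fraction` of the data;
    # takeSample() is an action returning exactly `num` elements as a list.
    approx = rdd.sample(False, 0.1, 42)   # withReplacement, fraction, seed
    exact = rdd.takeSample(False, 5, 42)  # withReplacement, num, seed
    print(approx.count())
    print(exact)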
| * | Merge pull request #870 from JoshRosen/spark-885 | Matei Zaharia | 2013-08-31 | 1 | -1/+5
| |\ \      Don't send SIGINT / ctrl-c to Py4J gateway subprocess
| | * | Don't send SIGINT to Py4J gateway subprocess. | Josh Rosen | 2013-08-28 | 1 | -1/+5
| | |/      This addresses SPARK-885, a usability issue where PySpark's Java gateway process would be killed if the user hit ctrl-c. Note that SIGINT still won't cancel the running s[...]. This fix is based on http://stackoverflow.com/questions/5045771
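For background, a hedged sketch of the general technique for shielding a child process from an interactive Ctrl-C; this is not the literal java_gateway code, and the command and names below are placeholders:

    import os
    from subprocess import PIPE, Popen

    def launch_child_ignoring_ctrl_c(cmd):
        # preexec_fn runs in the child right before exec. Moving the child into
        # its own process group keeps the terminal's SIGINT (Ctrl-C) from being
        # delivered to it, so only the Python shell sees the interrupt.
        return Popen(cmd, stdin=PIPE, stdout=PIPE, preexec_fn=os.setpgrp)

    # Illustrative usage (placeholder command, not Spark's real launcher):
    # gateway_proc = launch_child_ignoring_ctrl_c(["java", "-cp", "...", "SomeGatewayServer"])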
| * | Merge pull request #869 from AndreSchumacher/subtract | Matei Zaharia | 2013-08-30 | 1 | -0/+37
| |\ \      PySpark: implementing subtractByKey(), subtract() and keyBy()
| | * | PySpark: implementing subtractByKey(), subtract() and keyBy() | Andre Schumacher | 2013-08-28 | 1 | -0/+37
| | |/
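A brief usage sketch of the three APIs named in this commit (example data is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local", "subtract-demo")
    a = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
    b = sc.parallelize([("b", 99)])

    # subtractByKey keeps pairs whose key does not appear in the other RDD.
    print(a.subtractByKey(b).collect())                                             # [('a', 1), ('c', 3)]
    # subtract removes elements that appear (as whole values) in the other RDD.
    print(sc.parallelize([1, 2, 3, 4]).subtract(sc.parallelize([2, 4])).collect())
    # keyBy(f) turns each element x into the pair (f(x), x).
    print(sc.parallelize(["fig", "apple"]).keyBy(len).collect())                    # [(3, 'fig'), (5, 'apple')]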
| * / Change build and run instructions to use assemblies | Matei Zaharia | 2013-08-29 | 1 | -1/+1
| |/      This commit makes Spark invocation saner by using an assembly JAR to find all of Spark's dependencies instead of adding all the JARs in lib_managed. It also packages the examples into an assembly and uses that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script with two better-named scripts: "run-examples" for examples, and "spark-class" for Spark internal classes (e.g. REPL, master, etc). This is also designed to minimize the confusion people have in trying to use "run" to run their own classes; it's not meant to do that, but now at least if they look at it, they can modify run-examples to do a decent job for them.
| * Implementing SPARK-838: Add DoubleRDDFunctions methods to PySpark | Andre Schumacher | 2013-08-21 | 2 | -1/+168
| |
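A short sketch of the numeric RDD helpers this commit brings to PySpark (the values are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local", "doublerdd-demo")
    nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])

    # Numeric summaries over an RDD of numbers.
    print(nums.sum())        # 10.0
    print(nums.mean())       # 2.5
    print(nums.variance())   # 1.25
    print(nums.stdev())      # ~1.118
    print(nums.stats())      # count, mean, stdev, max, min in one pass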
| * Implementing SPARK-878 for PySpark: adding zip and egg files to context and passing it down to workers which add these to their sys.path | Andre Schumacher | 2013-08-16 | 4 | -5/+37
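A hedged sketch of how dependencies reach the workers' sys.path, using the public pyFiles argument and addPyFile(); the archive paths in the comments are placeholders:

    from pyspark import SparkContext

    # Files listed in pyFiles (or added later via addPyFile) are shipped to the
    # executors and appended to each worker's sys.path.
    sc = SparkContext("local", "pyfiles-demo", pyFiles=[])  # e.g. pyFiles=["deps/helpers.egg"]
    # sc.addPyFile("deps/extra_module.zip")  # workers could then `import extra_module` in tasks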
| * Fix PySpark unit tests on Python 2.6. | Josh Rosen | 2013-08-14 | 1 | -5/+8
| |
| *   Merge pull request #813 from AndreSchumacher/add_files_pyspark | Matei Zaharia | 2013-08-12 | 1 | -1/+6
| |\      Implementing SPARK-865: Add the equivalent of ADD_JARS to PySpark
| | * Implementing SPARK-865: Add the equivalent of ADD_JARS to PySpark | Andre Schumacher | 2013-08-12 | 1 | -1/+6
| | |     Now ADD_FILES uses a comma as file name separator.
| * | Do not inherit master's PYTHONPATH on workers. | Josh Rosen | 2013-07-29 | 1 | -3/+2
| | |     This fixes SPARK-832, an issue where PySpark would not work when the master and workers used different SPARK_HOME paths. This change may potentially break code that relied on the master's PYTHONPATH being used on workers. To have custom PYTHONPATH additions used on the workers, users should set a custom PYTHONPATH in spark-env.sh rather than setting it in the shell.
| * | SPARK-815. Python parallelize() should split lists before batching | Matei Zaharia | 2013-07-29 | 1 | -2/+9
| | |     One unfortunate consequence of this fix is that we materialize any collections that are given to us as generators, but this seems necessary to get reasonable behavior on small collections. We could add a batchSize parameter later to bypass auto-computation of batch size if this becomes a problem (e.g. if users really want to parallelize big generators nicely).
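A small sketch of the behavior being tuned here, using the public parallelize() API (partition count and data are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "parallelize-demo")

    # The list is split across the requested number of partitions before its
    # elements are batched for serialization; glom() shows the resulting layout.
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 3)
    print(rdd.glom().collect())   # e.g. [[1, 2], [3, 4], [5, 6]]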