path: root/python/pyspark
Commit message | Author | Date | Files | Lines
...
| * | | | Make collectPartitions take an array of partitions | Shivaram Venkataraman | 2013-12-19 | 1 | -1/+6
| | | | |     Change the implementation to use runJob instead of PartitionPruningRDD. Also update the unit tests and the python take implementation to use the new interface.
| * | | | Add collectPartition to JavaRDD interface. | Shivaram Venkataraman | 2013-12-18 | 2 | -4/+1
| |/ / /      Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
* / / / Add toString to Java RDD, and __repr__ to Python RDD | Nick Pentreath | 2013-12-19 | 1 | -0/+3
|/ / /
* | | Merge branch 'master' into akka-bug-fix | Prashant Sharma | 2013-12-11 | 2 | -1/+19
|\ \ \      Conflicts:
| | | |         core/pom.xml
| | | |         core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
| | | |         pom.xml
| | | |         project/SparkBuild.scala
| | | |         streaming/pom.xml
| | | |         yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
| * | | Fix UnicodeEncodeError in PySpark saveAsTextFile(). | Josh Rosen | 2013-11-28 | 2 | -1/+19
| | | |     Fixes SPARK-970.
* | | | Merge branch 'master' into wip-scala-2.10 | Prashant Sharma | 2013-11-27 | 6 | -141/+381
|\| | |     Conflicts:
| | | |         core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
| | | |         core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
| | | |         core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala
| | | |         core/src/main/scala/org/apache/spark/rdd/RDD.scala
| | | |         python/pyspark/rdd.py
| * | | Removed unused basestring case from dump_stream. | Josh Rosen | 2013-11-26 | 1 | -2/+0
| | | |
| * | | FramedSerializer: _dumps => dumps, _loads => loads. | Josh Rosen | 2013-11-10 | 4 | -18/+18
| | | |
| * | | Send PySpark commands as bytes instead of strings. | Josh Rosen | 2013-11-10 | 3 | -16/+13
| | | |
| * | | Add custom serializer support to PySpark. | Josh Rosen | 2013-11-10 | 6 | -147/+360
| | | |     For now, this only adds MarshalSerializer, but it lays the groundwork for supporting other custom serializers. Many of these mechanisms can also be used to support deserialization of different data formats sent by Java, such as data encoded by MsgPack. This also fixes a bug in SparkContext.union().
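For context, a minimal usage sketch of selecting a non-default serializer through the serializer keyword this change introduces (the app name and data are illustrative):

    from pyspark import SparkContext
    from pyspark.serializers import MarshalSerializer

    # MarshalSerializer trades generality for speed: it handles only the types
    # the marshal module supports, unlike the default pickle-based serializer.
    sc = SparkContext("local", "serializer-demo", serializer=MarshalSerializer())
    print(sc.parallelize(range(10)).map(lambda x: x * 2).collect())
    sc.stop()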
| * | | Remove Pickle-wrapping of Java objects in PySpark. | Josh Rosen | 2013-11-03 | 4 | -14/+39
| | | |     If we support custom serializers, the Python worker will know what type of input to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and strings.
| * | | Replace magic lengths with constants in PySpark. | Josh Rosen | 2013-11-03 | 2 | -6/+13
| | | |     Write the length of the accumulators section up-front rather than terminating it with a negative length. I find this easier to read.
* | | | Merge branch 'master' into scala-2.10 | Raymond Liu | 2013-11-13 | 2 | -13/+50
|\| | |
| * | | Pass self to SparkContext._ensure_initialized. | Ewen Cheslack-Postava | 2013-10-22 | 1 | -1/+10
| | | |     The constructor for SparkContext should pass in self so that we track the current context and produce errors if another one is created. Add a doctest to make sure creating multiple contexts triggers the exception.
| * | | Add classmethod to SparkContext to set system properties. | Ewen Cheslack-Postava | 2013-10-22 | 1 | -12/+29
| | | |     Add a new classmethod to SparkContext to set system properties, as is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.
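As a hedged usage sketch of the classmethod described above (the property name and value are illustrative), the call has to happen before a SparkContext is created:

    from pyspark import SparkContext

    # Being a classmethod, this works before any SparkContext exists; it brings
    # up the JVM/Py4J connection on demand.
    SparkContext.setSystemProperty("spark.executor.memory", "2g")

    sc = SparkContext("local", "sysprops-demo")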
| * | | Add an add() method to pyspark accumulators. | Ewen Cheslack-Postava | 2013-10-19 | 1 | -1/+12
| | | |     Add a regular method for adding a term to accumulators in pyspark. Currently, if you have a non-global accumulator, adding to it is awkward. The += operator can't be used for non-global accumulators captured via closure because it involves an assignment. The only way to do it is using __iadd__ directly. Adding this method lets you write code like this:
| | | |
| | | |         def main():
| | | |             sc = SparkContext()
| | | |             accum = sc.accumulator(0)
| | | |             rdd = sc.parallelize([1, 2, 3])
| | | |             def f(x):
| | | |                 accum.add(x)
| | | |             rdd.foreach(f)
| | | |             print accum.value
| | | |
| | | |     where using accum += x instead would have caused UnboundLocalError exceptions in workers. Currently it would have to be written as accum.__iadd__(x).
* | | | Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10 | Prashant Sharma | 2013-10-10 | 1 | -7/+53
|\| | |
| * | | Fix PySpark docs and an overly long line of code after fdbae41e | Matei Zaharia | 2013-10-09 | 1 | -8/+8
| | | |
| * | | SPARK-705: implement sortByKey() in PySpark | Andre Schumacher | 2013-10-07 | 1 | -1/+47
| | | |
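A short usage sketch of the sortByKey() API added here (the example data is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local", "sortbykey-demo")
    pairs = sc.parallelize([("b", 2), ("c", 3), ("a", 1)])

    # sortByKey() returns a new RDD of (key, value) pairs ordered by key.
    print(pairs.sortByKey().collect())                        # [('a', 1), ('b', 2), ('c', 3)]
    print(pairs.sortByKey(ascending=False).keys().collect())  # ['c', 'b', 'a']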
* | | | Merge branch 'master' into wip-merge-master | Prashant Sharma | 2013-10-08 | 2 | -4/+10
|\| | |     Conflicts:
| | | |         bagel/pom.xml
| | | |         core/pom.xml
| | | |         core/src/test/scala/org/apache/spark/ui/UISuite.scala
| | | |         examples/pom.xml
| | | |         mllib/pom.xml
| | | |         pom.xml
| | | |         project/SparkBuild.scala
| | | |         repl/pom.xml
| | | |         streaming/pom.xml
| | | |         tools/pom.xml
| | | |
| | | |     In Scala 2.10 a shorter representation is used for naming artifacts, so the artifacts were switched to the shorter Scala version and it was made a property in the pom.
| * | | Fixing SPARK-602: PythonPartitioner | Andre Schumacher | 2013-10-04 | 2 | -4/+10
| |/ /      Currently PythonPartitioner determines partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required e.g. for sorting via PySpark.
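For illustration, a hedged sketch of how a Python-side partition function relates to the behavior described above; the parity-based partitioner and data are made up for the example:

    from pyspark import SparkContext

    sc = SparkContext("local", "partitioner-demo")
    pairs = sc.parallelize([(k, k * k) for k in range(8)])

    # partitionFunc computes a partition ID from the key in Python; the fix above
    # concerns the JVM-side PythonPartitioner honoring that ID rather than
    # re-hashing a serialized form of the key.
    by_parity = pairs.partitionBy(2, partitionFunc=lambda k: k % 2)
    print(by_parity.glom().collect())   # one list of even-keyed pairs, one of odd-keyed pairs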
* | | Merge branch 'master' into scala-2.10 | Prashant Sharma | 2013-10-01 | 1 | -1/+1
|\| |     Conflicts:
| | |         core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala
| | |         docs/_config.yml
| | |         project/SparkBuild.scala
| | |         repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
| * | Update build version in master | Patrick Wendell | 2013-09-24 | 1 | -1/+1
| |/
* | Merge branch 'master' of git://github.com/mesos/spark into scala-2.10 | Prashant Sharma | 2013-09-15 | 5 | -1/+78
|\|     Conflicts:
| |         core/src/main/scala/org/apache/spark/SparkContext.scala
| |         project/SparkBuild.scala
| * Whoopsy daisy | Aaron Davidson | 2013-09-08 | 1 | -1/+0
| |
| * Export StorageLevel and refactor | Aaron Davidson | 2013-09-07 | 5 | -26/+62
| |
| * Remove reflection, hard-code StorageLevels | Aaron Davidson | 2013-09-07 | 2 | -24/+26
| |     The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise the shell would have to call a private method of SparkContext. Having StorageLevel available in sc also doesn't seem like the end of the world. There may be a better solution, though. As for creating the StorageLevel object itself, this seems to be the best way in Python 2 for creating singleton, enum-like objects: http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python
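A minimal sketch of the singleton, enum-like pattern referenced above; class and member names are illustrative, not PySpark's actual definitions:

    class _StorageFlags(object):
        # Simple value object; instances are created once, below.
        def __init__(self, use_disk, use_memory, deserialized, replication=1):
            self.use_disk = use_disk
            self.use_memory = use_memory
            self.deserialized = deserialized
            self.replication = replication

    class StorageLevels(object):
        # Enum-like namespace: each level is created exactly once at import time.
        MEMORY_ONLY = _StorageFlags(False, True, True)
        MEMORY_ONLY_2 = _StorageFlags(False, True, True, replication=2)
        DISK_ONLY = _StorageFlags(True, False, False)

    # Identity comparison behaves like an enum because each level is a singleton.
    assert StorageLevels.MEMORY_ONLY is StorageLevels.MEMORY_ONLY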
| * Memoize StorageLevels read from JVM | Aaron Davidson | 2013-09-06 | 1 | -2/+9
| |
| * SPARK-660: Add StorageLevel support in Python | Aaron Davidson | 2013-09-05 | 3 | -1/+34
| |     It uses reflection... I am not proud of that fact, but it at least ensures compatibility (sans refactoring of the StorageLevel stuff).
* | Merged with master | Prashant Sharma | 2013-09-06 | 14 | -44/+692
|\|
| * Add missing license headers found with RAT | Matei Zaharia | 2013-09-02 | 1 | -1/+18
| |
| * Further fixes to get PySpark to work on Windows | Matei Zaharia | 2013-09-02 | 1 | -5/+12
| |
| * Allow PySpark to launch worker.py directly on Windows | Matei Zaharia | 2013-09-01 | 1 | -4/+7
| |
| * Move some classes to more appropriate packages: | Matei Zaharia | 2013-09-01 | 1 | -2/+2
| |     * RDD, *RDDFunctions -> org.apache.spark.rdd
| |     * Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
| |     * JavaSerializer, KryoSerializer -> org.apache.spark.serializer
| * Add banner to PySpark and make wordcount output nicer | Matei Zaharia | 2013-09-01 | 1 | -0/+13
| |
| * Initial work to rename package to org.apache.spark | Matei Zaharia | 2013-09-01 | 3 | -5/+5
| |
| *   Merge pull request #861 from AndreSchumacher/pyspark_sampling_function | Matei Zaharia | 2013-08-31 | 2 | -7/+167
| |\      Pyspark sampling function
| | * RDD sample() and takeSample() prototypes for PySpark | Andre Schumacher | 2013-08-28 | 2 | -7/+167
| | |
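A short usage sketch of the two sampling APIs (seed and sizes are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local", "sampling-demo")
    rdd = sc.parallelize(range(100))

    # sample() is a lazy transformation keeping roughly `fraction` of the data;
    # takeSample() is an action returning exactly `num` elements as a list.
    approx = rdd.sample(False, 0.1, 42)   # withReplacement, fraction, seed
    exact = rdd.takeSample(False, 5, 42)  # withReplacement, num, seed
    print(approx.count())
    print(exact)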
| * | Merge pull request #870 from JoshRosen/spark-885 | Matei Zaharia | 2013-08-31 | 1 | -1/+5
| |\ \      Don't send SIGINT / ctrl-c to Py4J gateway subprocess
| | * | Don't send SIGINT to Py4J gateway subprocess. | Josh Rosen | 2013-08-28 | 1 | -1/+5
| | |/      This addresses SPARK-885, a usability issue where PySpark's Java gateway process would be killed if the user hit ctrl-c. Note that SIGINT still won't cancel the running s[...]. This fix is based on http://stackoverflow.com/questions/5045771
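For background, a hedged sketch of the general technique for shielding a child process from an interactive Ctrl-C; this is not the literal java_gateway code, and the command and names below are placeholders:

    import os
    from subprocess import PIPE, Popen

    def launch_child_ignoring_ctrl_c(cmd):
        # preexec_fn runs in the child right before exec. Moving the child into
        # its own process group keeps the terminal's SIGINT (Ctrl-C) from being
        # delivered to it, so only the Python shell sees the interrupt.
        return Popen(cmd, stdin=PIPE, stdout=PIPE, preexec_fn=os.setpgrp)

    # Illustrative usage (placeholder command, not Spark's real launcher):
    # gateway_proc = launch_child_ignoring_ctrl_c(["java", "-cp", "...", "SomeGatewayServer"])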
| * | Merge pull request #869 from AndreSchumacher/subtract | Matei Zaharia | 2013-08-30 | 1 | -0/+37
| |\ \      PySpark: implementing subtractByKey(), subtract() and keyBy()
| | * | PySpark: implementing subtractByKey(), subtract() and keyBy() | Andre Schumacher | 2013-08-28 | 1 | -0/+37
| | |/
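A brief usage sketch of the three APIs named in this commit (example data is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local", "subtract-demo")
    a = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
    b = sc.parallelize([("b", 99)])

    # subtractByKey keeps pairs whose key does not appear in the other RDD.
    print(a.subtractByKey(b).collect())                                             # [('a', 1), ('c', 3)]
    # subtract removes elements that appear (as whole values) in the other RDD.
    print(sc.parallelize([1, 2, 3, 4]).subtract(sc.parallelize([2, 4])).collect())
    # keyBy(f) turns each element x into the pair (f(x), x).
    print(sc.parallelize(["fig", "apple"]).keyBy(len).collect())                    # [(3, 'fig'), (5, 'apple')]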
| * / Change build and run instructions to use assemblies | Matei Zaharia | 2013-08-29 | 1 | -1/+1
| |/      This commit makes Spark invocation saner by using an assembly JAR to find all of Spark's dependencies instead of adding all the JARs in lib_managed. It also packages the examples into an assembly and uses that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script with two better-named scripts: "run-examples" for examples, and "spark-class" for Spark internal classes (e.g. REPL, master, etc). This is also designed to minimize the confusion people have in trying to use "run" to run their own classes; it's not meant to do that, but now at least if they look at it, they can modify run-examples to do a decent job for them.
| * Implementing SPARK-838: Add DoubleRDDFunctions methods to PySpark | Andre Schumacher | 2013-08-21 | 2 | -1/+168
| |
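A short sketch of the numeric RDD helpers this commit brings to PySpark (the values are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local", "doublerdd-demo")
    nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])

    # Numeric summaries over an RDD of numbers.
    print(nums.sum())        # 10.0
    print(nums.mean())       # 2.5
    print(nums.variance())   # 1.25
    print(nums.stdev())      # ~1.118
    print(nums.stats())      # count, mean, stdev, max, min in one pass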
| * Implementing SPARK-878 for PySpark: adding zip and egg files to context and passing it down to workers which add these to their sys.path | Andre Schumacher | 2013-08-16 | 4 | -5/+37
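A hedged sketch of how dependencies reach the workers' sys.path, using the public pyFiles argument and addPyFile(); the archive paths in the comments are placeholders:

    from pyspark import SparkContext

    # Files listed in pyFiles (or added later via addPyFile) are shipped to the
    # executors and appended to each worker's sys.path.
    sc = SparkContext("local", "pyfiles-demo", pyFiles=[])  # e.g. pyFiles=["deps/helpers.egg"]
    # sc.addPyFile("deps/extra_module.zip")  # workers could then `import extra_module` in tasks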
| * Fix PySpark unit tests on Python 2.6. | Josh Rosen | 2013-08-14 | 1 | -5/+8
| |
| *   Merge pull request #813 from AndreSchumacher/add_files_pyspark | Matei Zaharia | 2013-08-12 | 1 | -1/+6
| |\      Implementing SPARK-865: Add the equivalent of ADD_JARS to PySpark
| | * Implementing SPARK-865: Add the equivalent of ADD_JARS to PySpark | Andre Schumacher | 2013-08-12 | 1 | -1/+6
| | |     Now ADD_FILES uses a comma as file name separator.
| * | Do not inherit master's PYTHONPATH on workers. | Josh Rosen | 2013-07-29 | 1 | -3/+2
| | |     This fixes SPARK-832, an issue where PySpark would not work when the master and workers used different SPARK_HOME paths. This change may potentially break code that relied on the master's PYTHONPATH being used on workers. To have custom PYTHONPATH additions used on the workers, users should set a custom PYTHONPATH in spark-env.sh rather than setting it in the shell.
| * | SPARK-815. Python parallelize() should split lists before batching | Matei Zaharia | 2013-07-29 | 1 | -2/+9
| | |     One unfortunate consequence of this fix is that we materialize any collections that are given to us as generators, but this seems necessary to get reasonable behavior on small collections. We could add a batchSize parameter later to bypass auto-computation of batch size if this becomes a problem (e.g. if users really want to parallelize big generators nicely).
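A small sketch of the behavior being tuned here, using the public parallelize() API (partition count and data are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "parallelize-demo")

    # The list is split across the requested number of partitions before its
    # elements are batched for serialization; glom() shows the resulting layout.
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 3)
    print(rdd.glom().collect())   # e.g. [[1, 2], [3, 4], [5, 6]]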