path: root/python/pyspark
Commit message | Author | Age | Files | Lines
* Add missing license headers found with RAT | Matei Zaharia | 2013-09-02 | 1 | -1/+18
|
* Further fixes to get PySpark to work on Windows | Matei Zaharia | 2013-09-02 | 1 | -5/+12
|
* Allow PySpark to launch worker.py directly on Windows | Matei Zaharia | 2013-09-01 | 1 | -4/+7
|
* Move some classes to more appropriate packages: | Matei Zaharia | 2013-09-01 | 1 | -2/+2
|
|   * RDD, *RDDFunctions -> org.apache.spark.rdd
|   * Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
|   * JavaSerializer, KryoSerializer -> org.apache.spark.serializer
* Add banner to PySpark and make wordcount output nicer | Matei Zaharia | 2013-09-01 | 1 | -0/+13
|
* Initial work to rename package to org.apache.spark | Matei Zaharia | 2013-09-01 | 3 | -5/+5
|
* Merge pull request #861 from AndreSchumacher/pyspark_sampling_function | Matei Zaharia | 2013-08-31 | 2 | -7/+167
|\
| |   Pyspark sampling function
| * RDD sample() and takeSample() prototypes for PySpark | Andre Schumacher | 2013-08-28 | 2 | -7/+167
| |
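For reference, a usage sketch of the two new methods in the pyspark shell (sc predefined); the exact signatures here (withReplacement, fraction/num, seed) follow later PySpark documentation and are an assumption for this era:

    rdd = sc.parallelize(range(100))

    # sample() is lazy: it returns a new RDD holding roughly
    # fraction * count elements, with or without replacement.
    subset = rdd.sample(False, 0.1, 42)

    # takeSample() is an action: it collects exactly num sampled
    # elements to the driver as a Python list.
    picked = rdd.takeSample(False, 10, 42)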
* | Merge pull request #870 from JoshRosen/spark-885 | Matei Zaharia | 2013-08-31 | 1 | -1/+5
|\ \
| | |   Don't send SIGINT / ctrl-c to Py4J gateway subprocess
| * | Don't send SIGINT to Py4J gateway subprocess. | Josh Rosen | 2013-08-28 | 1 | -1/+5
| |/
| |   This addresses SPARK-885, a usability issue where PySpark's Java gateway process would be killed if the user hit ctrl-c. Note that SIGINT still won't cancel the running s
| |   This fix is based on http://stackoverflow.com/questions/5045771
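The standard way to shield a child process from a terminal-delivered SIGINT is to put it in its own session or process group; a minimal sketch of that general approach (POSIX only), not the exact code in this commit, with a hypothetical command line:

    import os
    import subprocess

    # Start the gateway in a new session so a ctrl-c (SIGINT) aimed at the
    # interactive shell's foreground process group never reaches the child.
    # preexec_fn runs in the child between fork() and exec().
    gateway = subprocess.Popen(
        ["java", "-jar", "py4j-gateway.jar"],  # hypothetical command line
        stdout=subprocess.PIPE,
        preexec_fn=os.setsid,
    )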
* | Merge pull request #869 from AndreSchumacher/subtract | Matei Zaharia | 2013-08-30 | 1 | -0/+37
|\ \
| | |   PySpark: implementing subtractByKey(), subtract() and keyBy()
| * | PySpark: implementing subtractByKey(), subtract() and keyBy() | Andre Schumacher | 2013-08-28 | 1 | -0/+37
| |/
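Usage sketch of the three new methods in the pyspark shell (sc predefined); collect() order is not guaranteed:

    pairs = sc.parallelize([("a", 1), ("b", 4), ("c", 7)])
    other = sc.parallelize([("b", 0)])

    # subtractByKey keeps only pairs whose key is absent from `other`.
    pairs.subtractByKey(other).collect()      # [('a', 1), ('c', 7)]

    # subtract removes exact element matches.
    sc.parallelize([1, 2, 3, 4]).subtract(sc.parallelize([3, 4])).collect()
    # [1, 2]

    # keyBy(f) turns each element x into the pair (f(x), x).
    sc.parallelize(range(4)).keyBy(lambda x: x % 2).collect()
    # [(0, 0), (1, 1), (0, 2), (1, 3)]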
* / Change build and run instructions to use assemblies | Matei Zaharia | 2013-08-29 | 1 | -1/+1
|/
|   This commit makes Spark invocation saner by using an assembly JAR to find all of Spark's dependencies instead of adding all the JARs in lib_managed. It also packages the examples into an assembly and uses that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script with two better-named scripts: "run-examples" for examples, and "spark-class" for Spark internal classes (e.g. REPL, master, etc). This is also designed to minimize the confusion people have in trying to use "run" to run their own classes; it's not meant to do that, but now at least if they look at it, they can modify run-examples to do a decent job for them. As part of this, Bagel's examples are also now properly moved to the examples package instead of bagel.
* Implementing SPARK-838: Add DoubleRDDFunctions methods to PySpark | Andre Schumacher | 2013-08-21 | 2 | -1/+168
|
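Usage sketch of the numeric-RDD methods this adds, in the pyspark shell (sc predefined); the exact method set is taken from later PySpark docs:

    nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])
    nums.sum()        # 10.0
    nums.mean()       # 2.5
    nums.variance()   # 1.25 (population variance)
    nums.stdev()      # ~1.118
    nums.stats()      # one-pass summary: count, mean, stdev, max, min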
* Implementing SPARK-878 for PySpark: adding zip and egg files to context and passing it down to workers which add these to their sys.path | Andre Schumacher | 2013-08-16 | 4 | -5/+37
|
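Usage sketch in the pyspark shell (sc predefined); the egg path and the my_dep module are hypothetical:

    # Ship a zip/egg of pure-Python dependencies to every worker; workers
    # append the shipped file to their sys.path before running tasks.
    sc.addPyFile("/path/to/dependencies.egg")   # path is illustrative

    def transform(x):
        import my_dep           # hypothetical module packaged in the egg,
        return my_dep.scale(x)  # importable on workers via the shipped file

    sc.parallelize(range(10)).map(transform).collect()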
* Fix PySpark unit tests on Python 2.6. | Josh Rosen | 2013-08-14 | 1 | -5/+8
|
* Merge pull request #813 from AndreSchumacher/add_files_pyspark | Matei Zaharia | 2013-08-12 | 1 | -1/+6
|\
| |   Implementing SPARK-865: Add the equivalent of ADD_JARS to PySpark
| * Implementing SPARK-865: Add the equivalent of ADD_JARS to PySpark | Andre Schumacher | 2013-08-12 | 1 | -1/+6
| |   Now ADD_FILES uses a comma as file name separator.
* | Do not inherit master's PYTHONPATH on workers. | Josh Rosen | 2013-07-29 | 1 | -3/+2
| |   This fixes SPARK-832, an issue where PySpark would not work when the master and workers used different SPARK_HOME paths. This change may potentially break code that relied on the master's PYTHONPATH being used on workers. To have custom PYTHONPATH additions used on the workers, users should set a custom PYTHONPATH in spark-env.sh rather than setting it in the shell.
* | SPARK-815. Python parallelize() should split lists before batching | Matei Zaharia | 2013-07-29 | 1 | -2/+9
| |   One unfortunate consequence of this fix is that we materialize any collections that are given to us as generators, but this seems necessary to get reasonable behavior on small collections. We could add a batchSize parameter later to bypass auto-computation of batch size if this becomes a problem (e.g. if users really want to parallelize big generators nicely)
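A sketch of the slicing idea under assumed names (this is not the commit's code): split the materialized collection into numSlices contiguous chunks first, and only then pickle each chunk as one batch.

    def slice_collection(collection, num_slices):
        # Materialize first: generators have no len(), which is exactly
        # the trade-off the commit message describes.
        items = list(collection)
        n = len(items)
        for i in range(num_slices):
            # Contiguous, near-equal chunks; each chunk becomes one
            # partition and is serialized as a single batch.
            yield items[(i * n) // num_slices:((i + 1) * n) // num_slices]

    list(slice_collection(range(5), 2))   # [[0, 1], [2, 3, 4]]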
* | Use None instead of empty string as it's slightly smaller/faster | Matei Zaharia | 2013-07-29 | 1 | -1/+1
| |
* | Optimize Python foreach() to not return as many objects | Matei Zaharia | 2013-07-29 | 1 | -1/+5
| |
* | Optimize Python take() to not compute entire first partition | Matei Zaharia | 2013-07-29 | 1 | -6/+9
| |
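The principle behind the optimization, as a self-contained sketch with assumed names: pull partitions one at a time, and only as many elements from each as are still needed.

    from itertools import islice

    def take(partitions, num):
        # `partitions` yields one iterator per partition; only as many
        # partitions (and elements within each) as needed get evaluated.
        items = []
        for part in partitions:
            needed = num - len(items)
            if needed <= 0:
                break
            items.extend(islice(part, needed))
        return items

    take(iter([iter([1, 2]), iter([3, 4, 5])]), 3)   # [1, 2, 3]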
* | Add Apache license headers and LICENSE and NOTICE files | Matei Zaharia | 2013-07-16 | 11 | -0/+187
| |
* | Fixed PySpark perf regression by not using socket.makefile(), and improved debuggability by letting "print" statements show up in the executor's stderr | root | 2013-07-01 | 1 | -18/+24
| |
| |   Conflicts:
| |       core/src/main/scala/spark/api/python/PythonRDD.scala
* | Fix reporting of PySpark exceptions | Jey Kottalam | 2013-06-21 | 2 | -5/+19
| |
* | PySpark daemon: fix deadlock, improve error handling | Jey Kottalam | 2013-06-21 | 1 | -17/+50
| |
* | Add tests and fixes for Python daemon shutdown | Jey Kottalam | 2013-06-21 | 3 | -22/+69
| |
* | Prefork Python worker processes | Jey Kottalam | 2013-06-21 | 2 | -32/+138
| |
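A generic shape of the preforking pattern on POSIX, with names invented for illustration; the real daemon's protocol and error handling are more involved. The point is that fork()ed workers inherit the warm interpreter, so per-task Python startup cost disappears.

    import os
    import socket

    def prefork_server(handler, num_workers=4):
        listener = socket.socket()
        listener.bind(("127.0.0.1", 0))
        listener.listen(num_workers)
        for _ in range(num_workers):
            if os.fork() == 0:                   # child: long-lived worker
                while True:
                    conn, _ = listener.accept()  # kernel balances accepts
                    handler(conn)
                    conn.close()
        return listener.getsockname()[1]         # parent: port for clients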
* | Add Python timing instrumentation | Jey Kottalam | 2013-06-21 | 2 | -1/+19
| |
* | Fix Python saveAsTextFile doctest to not expect order to be preserved | Jey Kottalam | 2013-04-02 | 1 | -1/+1
| |
* | Change numSplits to numPartitions in PySpark. | Josh Rosen | 2013-02-24 | 2 | -38/+38
| |
* | Add commutative requirement for 'reduce' to Python docstring. | Mark Hamstra | 2013-02-09 | 1 | -2/+2
|/
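Why commutativity matters: partition results may be combined in any order, so associativity alone is not enough. In the pyspark shell (sc predefined):

    # Addition is commutative and associative, so this is always 10:
    sc.parallelize([1, 2, 3, 4], 2).reduce(lambda a, b: a + b)

    # String concatenation is associative but NOT commutative, so the
    # result can vary depending on how partial results are combined:
    sc.parallelize(["a", "b", "c"], 3).reduce(lambda a, b: a + b)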
* Remove unnecessary doctest __main__ methods. | Josh Rosen | 2013-02-03 | 2 | -18/+0
|
* Fetch fewer objects in PySpark's take() method. | Josh Rosen | 2013-02-03 | 1 | -0/+4
|
* Fix reporting of PySpark doctest failures. | Josh Rosen | 2013-02-03 | 2 | -2/+6
|
* Use spark.local.dir for PySpark temp files (SPARK-580). | Josh Rosen | 2013-02-01 | 2 | -10/+9
|
* Do not launch JavaGateways on workers (SPARK-674). | Josh Rosen | 2013-02-01 | 4 | -18/+25
|   The problem was that the gateway was being initialized whenever the pyspark.context module was loaded. The fix uses lazy initialization that occurs only when SparkContext instances are actually constructed. I also made the gateway and jvm variables private. This change results in ~3-4x performance improvement when running the PySpark unit tests.
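The fix is ordinary lazy initialization; a generic sketch with a hypothetical launch_gateway helper, not the module's actual code:

    class SparkContext(object):
        _gateway = None      # shared gateway, created on first use

        def __init__(self, master, app_name):
            # Importing pyspark.context no longer launches anything;
            # the cost is paid only when a context is constructed.
            if SparkContext._gateway is None:
                SparkContext._gateway = launch_gateway()  # hypothetical helper
            self._jvm = SparkContext._gateway.jvm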
* Fix stdout redirection in PySpark. | Josh Rosen | 2013-02-01 | 2 | -2/+12
|
* SPARK-673: Capture and re-throw Python exceptions | Patrick Wendell | 2013-01-31 | 1 | -2/+8
|   This patch alters the Python <-> executor protocol to pass on exception data when they occur in user Python code.
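Schematically, such a protocol change wraps user code on the worker in try/except and ships the formatted traceback back over the stream; the framing helpers and constant below are invented for illustration, not this patch's code:

    import traceback

    def run_task(func, iterator, out):
        try:
            for result in func(iterator):
                write_object(out, result)        # hypothetical framing helper
        except Exception:
            # Send a failure marker, then the traceback text; the executor
            # re-throws it so the driver sees the Python stack trace.
            write_marker(out, PYTHON_EXCEPTION)  # hypothetical constant
            write_object(out, traceback.format_exc())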
* Merge pull request #430 from pwendell/pyspark-guide | Matei Zaharia | 2013-01-30 | 1 | -0/+1
|\
| |   Minor improvements to PySpark docs
| * Make module help available in python shell. | Patrick Wendell | 2013-01-30 | 1 | -0/+1
| |   Also, adds a line in doc explaining how to use.
* | Replace old 'master' term with 'driver'. | Stephen Haberman | 2013-01-25 | 1 | -1/+1
| |
* | Merge pull request #396 from JoshRosen/spark-653 | Matei Zaharia | 2013-01-24 | 2 | -14/+29
|\ \
| | |   Make PySpark AccumulatorParam an abstract base class
| * | Remove use of abc.ABCMeta due to cloudpickle issue. | Josh Rosen | 2013-01-23 | 1 | -7/+4
| | |   cloudpickle runs into issues while pickling subclasses of AccumulatorParam, which may be related to this Python issue: http://bugs.python.org/issue7689
| | |   This seems hard to fix and the ABCMeta wasn't necessary, so I removed it.
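Without ABCMeta, AccumulatorParam subclasses are plain classes that simply supply zero() and addInPlace(); those two method names are PySpark's documented interface. A vector accumulator in the pyspark shell (sc predefined):

    class VectorAccumulatorParam(object):
        def zero(self, value):
            # Starting value, shaped like `value`.
            return [0.0] * len(value)

        def addInPlace(self, v1, v2):
            for i in range(len(v1)):
                v1[i] += v2[i]
            return v1

    acc = sc.accumulator([0.0, 0.0], VectorAccumulatorParam())
    sc.parallelize([[1.0, 2.0], [3.0, 4.0]]).foreach(lambda v: acc.add(v))
    acc.value   # [4.0, 6.0]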
| * | Make AccumulatorParam an abstract base class. | Josh Rosen | 2013-01-21 | 2 | -13/+31
| | |
* | | Allow PySpark's SparkFiles to be used from driver | Josh Rosen | 2013-01-23 | 4 | -9/+62
| | |   Fix minor documentation formatting issues.
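Usage sketch in the pyspark shell (sc predefined); the file path is illustrative:

    from pyspark import SparkFiles

    sc.addFile("/path/to/lookup.txt")   # path is illustrative

    # Now resolvable on the driver, not only inside tasks:
    with open(SparkFiles.get("lookup.txt")) as f:
        header = f.readline()

    # The same call works inside tasks on the workers:
    def first_byte(_):
        with open(SparkFiles.get("lookup.txt")) as f:
            return f.read(1)

    sc.parallelize([0]).map(first_byte).collect()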
* | | Fix sys.path bug in PySpark SparkContext.addPyFile | Josh Rosen | 2013-01-22 | 3 | -7/+34
| | |
* | | Don't download files to master's working directory. | Josh Rosen | 2013-01-21 | 4 | -5/+67
|/ /
| |   This should avoid exceptions caused by existing files with different contents. I also removed some unused code.
* | Merge pull request #389 from JoshRosen/python_rdd_checkpointing | Matei Zaharia | 2013-01-20 | 3 | -2/+112
|\ \
| | |   Add checkpointing to the Python API
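Usage sketch of the Python checkpointing API in the pyspark shell (sc predefined); the checkpoint directory is illustrative:

    sc.setCheckpointDir("/tmp/spark-checkpoints")   # directory is illustrative

    rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
    rdd.checkpoint()        # only marks the RDD; nothing runs yet
    rdd.count()             # the first action materializes the checkpoint
    rdd.isCheckpointed()    # True once the checkpoint job has completed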