path: root/pyspark
Commit message (author, date; files changed, lines -/+)
* Adding IPYTHON environment variable support for launching pyspark using ipython shell (Nick Pentreath, 2013-02-07; 1 file, -1/+6)
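    A sketch of the launch behavior this adds, in Python for illustration (the real launcher is a shell script; only the IPYTHON=1 switch comes from the commit, the rest is an assumption):

        import os
        import subprocess

        # Choose the REPL: IPYTHON=1 selects ipython, otherwise plain python.
        repl = "ipython" if os.environ.get("IPYTHON") == "1" else "python"
        subprocess.call([repl])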
* Warn users if they run pyspark or spark-shell without compiling Spark (Matei Zaharia, 2013-01-17; 1 file, -0/+7)
* Add `pyspark` script to replace the other scripts. (Josh Rosen, 2013-01-01; 1 file, -0/+32)
    Expand the PySpark programming guide.
* Rename top-level 'pyspark' directory to 'python' (Josh Rosen, 2013-01-01; 23 files, -2473/+0)
* Minor documentation and style fixes for PySpark. (Josh Rosen, 2013-01-01; 6 files, -13/+31)
* Launch with `scala` by default in run-pyspark (Josh Rosen, 2012-12-31; 1 file, -0/+5)
* Port LR example to PySpark using numpy. (Josh Rosen, 2012-12-29; 1 file, -0/+57)
    This version of the example crashes after the first iteration with
    "OverflowError: math range error" because Python's math.exp() behaves
    differently than Scala's; see SPARK-646.
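    For context, the divergence is easy to reproduce: CPython's math.exp()
    raises on overflow where Scala's returns Infinity. A minimal
    illustration, with a generic clamping guard (not the fix adopted for
    SPARK-646):

        import math

        try:
            math.exp(1000)          # Scala's math.exp(1000.0) yields Infinity
        except OverflowError as err:
            print(err)              # "math range error"

        def safe_logistic(x):
            # Clamp the exponent so exp() stays within double range (~exp(709)).
            if x > 709:
                x = 709
            elif x < -709:
                x = -709
            return 1.0 / (1.0 + math.exp(-x))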
* Add test for pyspark.RDD.saveAsTextFile(). (Josh Rosen, 2012-12-29; 1 file, -1/+8)
* Update PySpark for compatibility with TaskContext. (Josh Rosen, 2012-12-29; 1 file, -1/+2)
* Use batching in pyspark parallelize(); fix cartesian() (Josh Rosen, 2012-12-29; 3 files, -27/+31)
* Fix bug in pyspark.serializers.batch; add .gitignore. (Josh Rosen, 2012-12-29; 3 files, -2/+6)
* Add documentation for Python API. (Josh Rosen, 2012-12-28; 7 files, -42/+6)
* Fix bug (introduced by batching) in PySpark take() (Josh Rosen, 2012-12-28; 3 files, -14/+21)
* Mark api.python classes as private; echo Java output to stderr. (Josh Rosen, 2012-12-28; 1 file, -1/+2)
* Simplify PySpark installation. (Josh Rosen, 2012-12-27; 11 files, -47/+72)
    - Bundle Py4J binaries, since it's hard to install
    - Uses Spark's `run` script to launch the Py4J gateway, inheriting the
      settings in spark-env.sh
    With these changes, (hopefully) nothing more than running `sbt/sbt package`
    will be necessary to run PySpark.
* Use addFile() to ship code to cluster in PySpark. (Josh Rosen, 2012-12-27; 2 files, -10/+74)
    Add options to pyspark.SparkContext constructor.
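    A hedged usage sketch of the constructor options this adds (parameter
    names follow the 2012-era PySpark API; `helpers.py` is a hypothetical
    local dependency shipped to workers via addFile()):

        from pyspark import SparkContext

        # pyFiles lists local .py modules to ship to the cluster.
        sc = SparkContext("local", "ExampleJob", pyFiles=["helpers.py"])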
* Add epydoc API documentation for PySpark. (Josh Rosen, 2012-12-27; 3 files, -14/+224)
* Add IPython support to pyspark-shell. (Josh Rosen, 2012-12-27; 3 files, -8/+21)
    Suggested by / based on code from @MLnick
* Add support for batched serialization of Python objects in PySpark. (Josh Rosen, 2012-12-26; 3 files, -20/+74)
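    The idea, sketched (not the exact pyspark.serializers code): group
    elements into fixed-size lists so each pickle call and each Py4J
    transfer covers many objects instead of one:

        from itertools import islice

        def batched(iterator, batch_size):
            # Yield lists of up to batch_size items; one pickle per list.
            iterator = iter(iterator)
            while True:
                batch = list(islice(iterator, batch_size))
                if not batch:
                    return
                yield batch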
* Use filesystem to collect RDDs in PySpark. (Josh Rosen, 2012-12-24; 4 files, -21/+42)
    Passing large volumes of data through Py4J seems to be slow. It appears
    to be faster to write the data to the local filesystem and read it back
    from Python.
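    A sketch of the reader half of this approach, assuming the JVM side has
    already pickled objects one after another into a local temp file:

        import pickle

        def read_from_file(path):
            # Stream pickled objects back from the file the JVM wrote.
            with open(path, "rb") as f:
                while True:
                    try:
                        yield pickle.load(f)
                    except EOFError:
                        return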
* Reduce object overhead in PySpark shuffle and collect (Josh Rosen, 2012-12-24; 1 file, -5/+14)
* Fix PySpark hash partitioning bug. (Josh Rosen, 2012-10-28; 1 file, -3/+9)
    A Java array's hashCode is based on its object identity, not its
    elements, so this was causing serialized keys to be hashed incorrectly.
    This commit adds a PySpark-specific workaround and adds more tests.
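    The actual workaround lives on the JVM side, but the requirement can be
    stated in Python terms (a sketch of the idea, not the commit's code):
    the partition must be derived from the key's value, never from the
    identity of the byte array holding its serialized form:

        def partition_for(key, num_partitions):
            # Hash the Python key itself; two equal keys always agree,
            # unlike the identity-based hashCode of a Java byte[].
            return hash(key) % num_partitions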
* Bump required Py4J version and add test for large broadcast variables. (Josh Rosen, 2012-10-28; 3 files, -2/+4)
* Remove PYTHONPATH from SparkContext's executorEnvs. (Josh Rosen, 2012-10-22; 1 file, -2/+6)
    It makes more sense to pass it in the dictionary of environment
    variables that is used to construct PythonRDD.
* Add PySpark README and run scripts. (Josh Rosen, 2012-10-20; 6 files, -3/+124)
* Update Python API for v0.6.0 compatibility. (Josh Rosen, 2012-10-19; 5 files, -19/+30)
* Fix Python 2.6 compatibility in Python API. (Josh Rosen, 2012-09-17; 1 file, -6/+11)
* Fix minor bugs in Python API examples. (Josh Rosen, 2012-08-27; 2 files, -5/+5)
* Add pipe(), saveAsTextFile(), sc.union() to Python API. (Josh Rosen, 2012-08-27; 2 files, -8/+31)
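    Hedged usage of the three additions (assumes an existing SparkContext
    `sc`; the output path is illustrative):

        nums = sc.parallelize(["1", "2", "3"])
        lens = nums.pipe("wc -c")             # pipe elements through a shell command
        nums.saveAsTextFile("/tmp/nums-out")  # one part-file per partition
        merged = sc.union([nums, sc.parallelize(["4", "5"])])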
* Simplify Python worker; pipeline the map step of partitionBy(). (Josh Rosen, 2012-08-27; 4 files, -100/+52)
* Use local combiners in Python API combineByKey(). (Josh Rosen, 2012-08-27; 2 files, -25/+24)
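    A sketch of the map-side combining this enables: fold all values for a
    key into a single combiner within each partition before anything is
    shuffled:

        def combine_locally(iterator, create_combiner, merge_value):
            # One combiner per distinct key per partition.
            combiners = {}
            for key, value in iterator:
                if key in combiners:
                    combiners[key] = merge_value(combiners[key], value)
                else:
                    combiners[key] = create_combiner(value)
            return iter(combiners.items())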
* Add countByKey(), reduceByKeyLocally() to Python API (Josh Rosen, 2012-08-27; 1 file, -13/+39)
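    Illustrative usage (assumes an existing SparkContext `sc`; both return
    plain dictionaries to the driver rather than distributed RDDs):

        pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
        pairs.countByKey()                            # {'a': 2, 'b': 1}
        pairs.reduceByKeyLocally(lambda x, y: x + y)  # {'a': 2, 'b': 1}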
* Add mapPartitions(), glom(), countByValue() to Python API. (Josh Rosen, 2012-08-27; 1 file, -4/+28)
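    Illustrative usage of the three new operations (assumes an existing
    SparkContext `sc`):

        rdd = sc.parallelize([1, 2, 3, 4], 2)
        rdd.mapPartitions(lambda it: [sum(it)]).collect()  # [3, 7]
        rdd.glom().collect()                               # [[1, 2], [3, 4]]
        rdd.countByValue()                                 # {1: 1, 2: 1, 3: 1, 4: 1}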
* Add broadcast variables to Python API. (Josh Rosen, 2012-08-27; 4 files, -12/+84)
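    Illustrative usage (assumes an existing SparkContext `sc`): a read-only
    value is shipped to the workers once and read through .value:

        lookup = sc.broadcast({"a": 1, "b": 2})
        sc.parallelize(["a", "b", "a"]).map(lambda k: lookup.value[k]).collect()
        # [1, 2, 1]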
* Implement fold() in Python API. (Josh Rosen, 2012-08-27; 1 file, -1/+19)
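    Illustrative usage (assumes an existing SparkContext `sc`); like
    Scala's fold, it takes a zero value and an associative operator:

        from operator import add

        sc.parallelize([1, 2, 3, 4]).fold(0, add)  # 10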
* Refactor Python MappedRDD to use iterator pipelines. (Josh Rosen, 2012-08-24; 2 files, -97/+41)
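    The technique, sketched: each transformation is a function from
    iterator to iterator, so a chain of maps and filters composes into one
    lazy pass instead of materializing a list per step:

        def pipeline(*stages):
            # Compose iterator -> iterator functions left to right.
            def run(iterator):
                for stage in stages:
                    iterator = stage(iterator)
                return iterator
            return run

        doubled_evens = pipeline(
            lambda it: (x for x in it if x % 2 == 0),
            lambda it: (x * 2 for x in it),
        )
        print(list(doubled_evens(range(6))))  # [0, 4, 8]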
* Fix options parsing in Python pi example. (Josh Rosen, 2012-08-24; 1 file, -1/+1)
* Use numpy in Python k-means example. (Josh Rosen, 2012-08-22; 3 files, -26/+14)
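    The kind of simplification numpy brings here, sketched (not the
    example's exact code): the distance-to-every-center loop becomes one
    vectorized expression:

        import numpy as np

        def closest_center(point, centers):
            # Squared Euclidean distance to every center at once;
            # argmin picks the nearest.
            return int(np.argmin(((centers - point) ** 2).sum(axis=1)))

        centers = np.array([[0.0, 0.0], [10.0, 10.0]])
        closest_center(np.array([9.0, 8.0]), centers)  # 1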
* Use only cPickle for serialization in Python API. (Josh Rosen, 2012-08-21; 6 files, -560/+233)
    Objects serialized with JSON can be compared for equality, but JSON can
    be slow to serialize and only supports a limited range of data types.
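    The data-type limitation is easy to see (shown with Python 3's pickle,
    which subsumes the cPickle module this commit standardizes on):

        import json
        import pickle

        data = ("a tuple", {1, 2, 3})   # tuples and sets survive pickle...
        assert pickle.loads(pickle.dumps(data)) == data

        try:
            json.dumps(data)            # ...but JSON has no set type
        except TypeError as err:
            print(err)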
* Bundle cloudpickle with pyspark. (Josh Rosen, 2012-08-19; 4 files, -5/+976)
* Add Python API. (Josh Rosen, 2012-08-18; 12 files, -0/+1170)