Commit log (each entry: message, then author, date; files changed, lines -/+):
* More doc improvements + better warnings when you haven't built Spark (Matei Zaharia, 2013-08-30; 1 file, -1/+1)
* Don't use SPARK_LAUNCH_WITH_SCALA in pyspark (Matei Zaharia, 2013-08-29; 1 file, -5/+0)
* Find assembly correctly in pyspark (Matei Zaharia, 2013-08-29; 1 file, -1/+3)
* Fix PySpark for assembly run and include it in dist (Matei Zaharia, 2013-08-29; 1 file, -4/+8)
* Two fixes to IPython support (Matei Zaharia, 2013-07-28; 1 file, -3/+7):
  - Don't attempt to run worker processes with ipython (that can cause some crashes as ipython prints things to standard out)
  - Allow passing some IPYTHON_OPTS to launch things like the notebook
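The dispatch described by these IPython commits can be sketched as a small helper. This is an illustrative Python sketch (the `driver_command` name is made up here, and the real logic lives in the pyspark shell script), showing the idea: only the interactive driver may use ipython, with any IPYTHON_OPTS passed through, while workers always run plain python.

```python
def driver_command(env):
    """Hypothetical sketch of the launcher's dispatch: use ipython for the
    interactive driver when IPYTHON=1, passing through IPYTHON_OPTS;
    workers always get plain python, since ipython's extra output on
    standard out can break the worker protocol."""
    if env.get("IPYTHON") == "1":
        opts = env.get("IPYTHON_OPTS", "").split()
        return ["ipython"] + opts
    return ["python"]

def worker_command(env):
    # Workers are always plain python, regardless of IPYTHON.
    return ["python"]
```

For example, with `IPYTHON=1 IPYTHON_OPTS=notebook` the driver command becomes `ipython notebook`, while workers still launch as `python`.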
* Add Apache license headers and LICENSE and NOTICE files (Matei Zaharia, 2013-07-16; 1 file, -0/+17)
* Adding IPYTHON environment variable support for launching pyspark using ipython shell (Nick Pentreath, 2013-02-07; 1 file, -1/+6)
* Warn users if they run pyspark or spark-shell without compiling Spark (Matei Zaharia, 2013-01-17; 1 file, -0/+7)
* Add `pyspark` script to replace the other scripts (Josh Rosen, 2013-01-01; 1 file, -0/+32):
  - Expand the PySpark programming guide
* Rename top-level 'pyspark' directory to 'python' (Josh Rosen, 2013-01-01; 23 files, -2473/+0)
* Minor documentation and style fixes for PySpark (Josh Rosen, 2013-01-01; 6 files, -13/+31)
* Launch with `scala` by default in run-pyspark (Josh Rosen, 2012-12-31; 1 file, -0/+5)
* Port LR example to PySpark using numpy (Josh Rosen, 2012-12-29; 1 file, -0/+57):
  - This version of the example crashes after the first iteration with "OverflowError: math range error" because Python's math.exp() behaves differently than Scala's; see SPARK-646
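The crash noted in that commit comes from a real platform difference: for large arguments, Python's math.exp() raises OverflowError where Scala's math.exp returns positive infinity. A minimal sketch of a guard that mimics the Scala behaviour (the `safe_exp` name is hypothetical, not from the commit):

```python
import math

def safe_exp(x):
    """Exponential that saturates to +infinity instead of raising
    OverflowError for large x, matching what Scala's math.exp returns."""
    try:
        return math.exp(x)
    except OverflowError:
        return float("inf")
```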
* Add test for pyspark.RDD.saveAsTextFile() (Josh Rosen, 2012-12-29; 1 file, -1/+8)
* Update PySpark for compatibility with TaskContext (Josh Rosen, 2012-12-29; 1 file, -1/+2)
* Use batching in pyspark parallelize(); fix cartesian() (Josh Rosen, 2012-12-29; 3 files, -27/+31)
* Fix bug in pyspark.serializers.batch; add .gitignore (Josh Rosen, 2012-12-29; 3 files, -2/+6)
* Add documentation for Python API (Josh Rosen, 2012-12-28; 7 files, -42/+6)
* Fix bug (introduced by batching) in PySpark take() (Josh Rosen, 2012-12-28; 3 files, -14/+21)
* Mark api.python classes as private; echo Java output to stderr (Josh Rosen, 2012-12-28; 1 file, -1/+2)
* Simplify PySpark installation (Josh Rosen, 2012-12-27; 11 files, -47/+72):
  - Bundle Py4J binaries, since it's hard to install
  - Use Spark's `run` script to launch the Py4J gateway, inheriting the settings in spark-env.sh
  - With these changes, (hopefully) nothing more than running `sbt/sbt package` will be necessary to run PySpark
* Use addFile() to ship code to cluster in PySpark (Josh Rosen, 2012-12-27; 2 files, -10/+74):
  - Add options to pyspark.SparkContext constructor
* Add epydoc API documentation for PySpark (Josh Rosen, 2012-12-27; 3 files, -14/+224)
* Add IPython support to pyspark-shell (Josh Rosen, 2012-12-27; 3 files, -8/+21):
  - Suggested by / based on code from @MLnick
* Add support for batched serialization of Python objects in PySpark (Josh Rosen, 2012-12-26; 3 files, -20/+74)
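The batching idea behind that commit can be sketched as a small generator; this is an illustrative helper, not the actual pyspark.serializers code:

```python
def batched(iterator, batch_size):
    """Group an iterator's items into lists of up to batch_size, so that
    many small objects can be serialized as one larger unit, cutting the
    per-object serialization overhead."""
    batch = []
    for item in iterator:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly short, batch
        yield batch
```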
* Use filesystem to collect RDDs in PySpark (Josh Rosen, 2012-12-24; 4 files, -21/+42):
  - Passing large volumes of data through Py4J seems to be slow; it appears to be faster to write the data to the local filesystem and read it back from Python
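The technique can be sketched with the standard library; `collect_via_file` is a hypothetical name, but it mirrors the idea of staging pickled records on the local filesystem instead of streaming each object through the Py4J bridge:

```python
import os
import pickle
import tempfile

def collect_via_file(records):
    """Write pickled records to a temp file, then read them back.
    Illustrative sketch of filesystem-based collection."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            for record in records:
                pickle.dump(record, f)
        out = []
        with open(path, "rb") as f:
            while True:
                try:
                    out.append(pickle.load(f))
                except EOFError:  # reached end of the staged data
                    break
        return out
    finally:
        os.remove(path)
```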
* Reduce object overhead in PySpark shuffle and collect (Josh Rosen, 2012-12-24; 1 file, -5/+14)
* Fix PySpark hash partitioning bug (Josh Rosen, 2012-10-28; 1 file, -3/+9):
  - A Java array's hashCode is based on its object identity, not its elements, so this was causing serialized keys to be hashed incorrectly; this commit adds a PySpark-specific workaround and adds more tests
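Because a JVM byte[]'s hashCode is identity-based, two equal serialized keys need not hash alike on the Java side. A Python-side sketch of a content-based workaround (the `partition_for` name is hypothetical; this is not the actual PySpark fix):

```python
import pickle

def partition_for(key, num_partitions):
    """Choose a partition by hashing the *contents* of the serialized
    key, so equal keys always land in the same partition within a run."""
    serialized = pickle.dumps(key)
    return hash(serialized) % num_partitions
```

Hashing the serialized bytes by value, rather than relying on the array object's identity, is what makes equal keys collide deterministically.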
* Bump required Py4J version and add test for large broadcast variables (Josh Rosen, 2012-10-28; 3 files, -2/+4)
* Remove PYTHONPATH from SparkContext's executorEnvs (Josh Rosen, 2012-10-22; 1 file, -2/+6):
  - It makes more sense to pass it in the dictionary of environment variables that is used to construct PythonRDD
* Add PySpark README and run scripts (Josh Rosen, 2012-10-20; 6 files, -3/+124)
* Update Python API for v0.6.0 compatibility (Josh Rosen, 2012-10-19; 5 files, -19/+30)
* Fix Python 2.6 compatibility in Python API (Josh Rosen, 2012-09-17; 1 file, -6/+11)
* Fix minor bugs in Python API examples (Josh Rosen, 2012-08-27; 2 files, -5/+5)
* Add pipe(), saveAsTextFile(), sc.union() to Python API (Josh Rosen, 2012-08-27; 2 files, -8/+31)
* Simplify Python worker; pipeline the map step of partitionBy() (Josh Rosen, 2012-08-27; 4 files, -100/+52)
* Use local combiners in Python API combineByKey() (Josh Rosen, 2012-08-27; 2 files, -25/+24)
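The local-combiner idea in that combineByKey() change can be sketched as a two-phase aggregation. This is an illustrative sketch of the technique, not the PySpark implementation:

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Phase 1: combine within each partition, so only one combiner per
    distinct key per partition needs to cross the shuffle.
    Phase 2: merge the per-partition combiners into the final result."""
    per_partition = []
    for partition in partitions:
        local = {}
        for key, value in partition:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        per_partition.append(local)
    result = {}
    for local in per_partition:
        for key, combiner in local.items():
            if key in result:
                result[key] = merge_combiners(result[key], combiner)
            else:
                result[key] = combiner
    return result
```

For summing, `create_combiner` is the identity and both merge functions are addition; the saving is that repeated keys within a partition are collapsed before anything is shipped.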
* Add countByKey(), reduceByKeyLocally() to Python API (Josh Rosen, 2012-08-27; 1 file, -13/+39)
* Add mapPartitions(), glom(), countByValue() to Python API (Josh Rosen, 2012-08-27; 1 file, -4/+28)
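The relationship between these operations can be sketched in a few lines; this is a loose illustration of the semantics, not the API's code:

```python
def map_partitions(partitions, f):
    """Apply f once per partition, to that partition's iterator, and
    materialize each partition's output. glom() is the special case
    f = list, which turns each partition into a single list element."""
    return [list(f(iter(partition))) for partition in partitions]
```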
* Add broadcast variables to Python API (Josh Rosen, 2012-08-27; 4 files, -12/+84)
* Implement fold() in Python API (Josh Rosen, 2012-08-27; 1 file, -1/+19)
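The shape an RDD-style fold() needs can be sketched as a two-level fold, since partitions are reduced independently before their results are combined. A minimal sketch, assuming the zero value is neutral for the operator:

```python
def fold(partitions, zero_value, op):
    """Fold each partition starting from its own zero value, then fold
    the per-partition results into a final answer. Sketch of the
    structure only, not the Python API's implementation."""
    partial = []
    for partition in partitions:
        acc = zero_value
        for item in partition:
            acc = op(acc, item)
        partial.append(acc)
    total = zero_value
    for p in partial:
        total = op(total, p)
    return total
```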
* Refactor Python MappedRDD to use iterator pipelines (Josh Rosen, 2012-08-24; 2 files, -97/+41)
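The iterator-pipeline idea can be sketched as function composition over iterators, so that chained transformations run in one lazy pass per partition instead of materializing an intermediate list per stage. An illustrative sketch, not the MappedRDD code:

```python
def pipeline(*stages):
    """Compose iterator-to-iterator stages into a single pass.
    Each stage takes an iterator and returns a new (lazy) iterator."""
    def run(iterator):
        for stage in stages:
            iterator = stage(iterator)
        return iterator
    return run
```

For example, `pipeline(double, increment)` threads each element through both stages without ever building the full doubled list.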
* Fix options parsing in Python pi example (Josh Rosen, 2012-08-24; 1 file, -1/+1)
* Use numpy in Python k-means example (Josh Rosen, 2012-08-22; 3 files, -26/+14)
* Use only cPickle for serialization in Python API (Josh Rosen, 2012-08-21; 6 files, -560/+233):
  - Objects serialized with JSON can be compared for equality, but JSON can be slow to serialize and only supports a limited range of data types
* Bundle cloudpickle with pyspark (Josh Rosen, 2012-08-19; 4 files, -5/+976)
* Add Python API (Josh Rosen, 2012-08-18; 12 files, -0/+1170)