spark - Mirror of Apache Spark

	Commit message	Author	Age	Files	Lines
*	Merge pull request #311 from tmyklebu/master	Matei Zaharia	2014-01-02	1	-11/+55
\|\	SPARK-991: Report information gleaned from a Python stacktrace in the UI
\| \|	Scala:
\| \|	- Added setCallSite/clearCallSite to SparkContext and JavaSparkContext. These functions mutate a LocalProperty called "externalCallSite."
\| \|	- Add a wrapper, getCallSite, that checks for an externalCallSite and, if none is found, calls the usual Utils.formatSparkCallSite.
\| \|	- Change everything that calls Utils.formatSparkCallSite to call getCallSite instead. Except getCallSite.
\| \|	- Add wrappers to setCallSite/clearCallSite wrappers to JavaSparkContext.
\| \|	Python:
\| \|	- Add a gruesome hack to rdd.py that inspects the traceback and guesses what you want to see in the UI.
\| \|	- Add a RAII wrapper around said gruesome hack that calls setCallSite/clearCallSite as appropriate.
\| \|	- Wire said RAII wrapper up around three calls into the Scala code.
\| \|	I'm not sure that I hit all the spots with the RAII wrapper. I'm also not sure that my gruesome hack does exactly what we want.
\| \|	One could also approach this change by refactoring runJob/submitJob/runApproximateJob to take a call site, then threading that parameter through everything that needs to know it.
\| \|	One might object to the pointless-looking wrappers in JavaSparkContext. Unfortunately, I can't directly access the SparkContext from Python---or, if I can, I don't know how---so I need to wrap everything that matters in JavaSparkContext.
\| \|	Conflicts: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala
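The "RAII wrapper" idea in the SPARK-991 message can be sketched as a Python context manager. This is a minimal illustration, not the code in rdd.py: `format_call_site`, `CallSiteScope`, and the two callbacks are hypothetical names, and the callbacks merely stand in for whatever ends up calling setCallSite/clearCallSite on the JVM side.

```python
import traceback


def format_call_site(skip_frames=0):
    """Describe the caller's frame, or a frame `skip_frames` levels further out."""
    stack = traceback.extract_stack()
    # stack[-1] is format_call_site itself and stack[-2] is its direct caller.
    frame = stack[-2 - skip_frames]
    return "%s at %s:%s" % (frame.name, frame.filename, frame.lineno)


class CallSiteScope(object):
    """Set a call site on entry and always clear it on exit (hypothetical helper)."""

    def __init__(self, set_call_site, clear_call_site):
        # Assumption: these callbacks forward to setCallSite/clearCallSite.
        self._set_call_site = set_call_site
        self._clear_call_site = clear_call_site

    def __enter__(self):
        # Skip the __enter__ frame so the reported site is the `with` statement's function.
        self._set_call_site(format_call_site(skip_frames=1))
        return self

    def __exit__(self, exc_type, exc_value, tb):
        self._clear_call_site()
        return False  # never swallow exceptions raised inside the block
```

Wrapping a call into the Scala code in `with CallSiteScope(set_cb, clear_cb):` would then guarantee the call site is cleared even if the job raises.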
\| *	Make Python function/line appear in the UI.	Tor Myklebust	2013-12-28	1	-11/+55
\| \|
* \|	Fix Python code after change of getOrElse	Matei Zaharia	2014-01-01	2	-7/+14
\| \|
* \|	Miscellaneous fixes from code review.	Matei Zaharia	2014-01-01	1	-8/+4
\| \|	Also replaced SparkConf.getOrElse with just a "get" that takes a default value, and added getInt, getLong, etc to make code that uses this simpler later on.
* \|	Merge remote-tracking branch 'apache/master' into conf2	Matei Zaharia	2013-12-31	2	-9/+4
\|\ \	Conflicts:
\| \| \|	core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
\| \| \|	streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
\| \| \|	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
\| * \	Merge pull request #289 from tdas/filestream-fix	Patrick Wendell	2013-12-31	2	-9/+4
\| \|\ \	Bug fixes for file input stream and checkpointing
\| \| \|/
\| \|/\|
\| \| \|	- Fixed bugs in the file input stream that led the stream to fail due to transient HDFS errors (for example, listing files while a background thread is deleting them caused errors).
\| \| \|	- Updated Spark's CheckpointRDD and Streaming's CheckpointWriter to use SparkContext.hadoopConfiguration, to allow checkpoints to be written to any HDFS compatible store requiring special configuration.
\| \| \|	- Changed the API of SparkContext.setCheckpointDir(): eliminated the unnecessary 'useExisting' parameter. Now SparkContext will always create a unique subdirectory within the user specified checkpoint directory. This is to ensure that previous checkpoint files are not accidentally overwritten.
\| \| \|	- Fixed bug where setting checkpoint directory as a relative local path caused the checkpointing to fail.
\| \| *	Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289.	Tathagata Das	2013-12-24	2	-9/+4
\| \| \|
* \| \|	Updated docs for SparkConf and handled review comments	Matei Zaharia	2013-12-30	2	-17/+31
\| \| \|
* \| \|	Properly show Spark properties on web UI, and change app name property	Matei Zaharia	2013-12-29	2	-3/+3
\| \| \|
* \| \|	Fix some Python docs and make sure to unset SPARK_TESTING in Python	Matei Zaharia	2013-12-29	6	-22/+37
\| \| \|	tests so we don't get the test spark.conf on the classpath.
* \| \|	Merge remote-tracking branch 'origin/master' into conf2	Matei Zaharia	2013-12-29	9	-2/+599
\|\\|	Conflicts:
\| \| \|	core/src/main/scala/org/apache/spark/SparkContext.scala
\| \| \|	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
\| \| \|	core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
\| \| \|	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
\| \| \|	core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala
\| \| \|	core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala
\| \| \|	core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala
\| \| \|	core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
\| \| \|	new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
\| \| \|	streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
\| \| \|	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
\| \| \|	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
\| \| \|	streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala
\| \| \|	streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
\| \| \|	streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
\| \| \|	streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
\| \| \|	streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala
\| * \|	Merge pull request #283 from tmyklebu/master	Matei Zaharia	2013-12-26	8	-1/+598
\| \|\ \	Python bindings for mllib
\| \| \| \|	This pull request contains Python bindings for the regression, clustering, classification, and recommendation tools in mllib.
\| \| \| \|	For each 'train' frontend exposed, there is a Scala stub in PythonMLLibAPI.scala and a Python stub in mllib.py. The Python stub serialises the input RDD and any vector/matrix arguments into a mutually-understood format and calls the Scala stub. The Scala stub deserialises the RDD and the vector/matrix arguments, calls the appropriate 'train' function, serialises the resulting model, and returns the serialised model.
\| \| \| \|	ALSModel is slightly different since a MatrixFactorizationModel has RDDs inside. The Scala stub returns a handle to a Scala MatrixFactorizationModel; prediction is done by calling the Scala predict method.
\| \| \| \|	I have tested these bindings on an x86_64 machine running Linux. There is a risk that these bindings may fail on some choose-your-own-endian platform if Python's endian differs from java.nio.ByteBuffer's idea of the native byte order.
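The endianness risk mentioned in the pull request above can be illustrated with a small sketch. The wire format here (an 8-byte length followed by the doubles) is an assumption made for illustration; the log does not show the format PythonMLLibAPI actually uses, and both function names are hypothetical. The point is that fixing the byte order explicitly, rather than relying on the native one, keeps both sides in agreement.

```python
import struct


def serialize_double_vector(values, byte_order="<"):
    """Pack a vector as an 8-byte signed length followed by 8-byte doubles.

    `byte_order` is a struct prefix: "<" little-endian, ">" big-endian.
    """
    fmt = "%sq%dd" % (byte_order, len(values))
    return struct.pack(fmt, len(values), *values)


def deserialize_double_vector(data, byte_order="<"):
    """Inverse of serialize_double_vector; rejects data of the wrong size."""
    (length,) = struct.unpack_from(byte_order + "q", data, 0)
    expected_size = 8 + 8 * length
    if length < 0 or len(data) != expected_size:
        # A header read with the wrong byte order usually lands here.
        raise ValueError("length header does not match payload size")
    return list(struct.unpack_from("%s%dd" % (byte_order, length), data, 8))
```

Reading little-endian bytes as big-endian turns a length of 3 into a huge number, so the size check fails loudly instead of returning garbage values.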
\| \| * \|	Remove commented code in __init__.py.	Tor Myklebust	2013-12-25	1	-8/+0
\| \| \| \|
\| \| * \|	Fix copypasta in __init__.py. Don't import anything directly into pyspark.mllib.	Tor Myklebust	2013-12-25	1	-26/+8
\| \| \| \|
\| \| * \|	Initial weights in Scala are ones; do that too. Also fix some errors.	Tor Myklebust	2013-12-25	1	-6/+6
\| \| \| \|
\| \| * \|	Split the mllib bindings into a whole bunch of modules and rename some things.	Tor Myklebust	2013-12-25	7	-183/+409
\| \| \| \|
\| \| * \|	Remove useless line from test stub.	Tor Myklebust	2013-12-24	1	-1/+0
\| \| \| \|
\| \| * \|	Python change for move of PythonMLLibAPI.	Tor Myklebust	2013-12-24	1	-1/+1
\| \| \| \|
\| \| * \|	Release JVM reference to the ALSModel when done.	Tor Myklebust	2013-12-22	1	-2/+2
\| \| \| \|
\| \| * \|	Python stubs for ALSModel.	Tor Myklebust	2013-12-21	2	-8/+56
\| \| \| \|
\| \| * \|	Un-semicolon mllib.py.	Tor Myklebust	2013-12-20	1	-11/+11
\| \| \| \|
\| \| * \|	Change some docstrings and add some others.	Tor Myklebust	2013-12-20	1	-1/+3
\| \| \| \|
\| \| * \|	Licence notice.	Tor Myklebust	2013-12-20	1	-0/+17
\| \| \| \|
\| \| * \|	Whitespace.	Tor Myklebust	2013-12-20	1	-1/+1
\| \| \| \|
\| \| * \|	Remove gigantic endian-specific test and exception tests.	Tor Myklebust	2013-12-20	1	-38/+3
\| \| \| \|
\| \| * \|	Tests for the Python side of the mllib bindings.	Tor Myklebust	2013-12-20	1	-52/+172
\| \| \| \|
\| \| * \|	Python stubs for classification and clustering.	Tor Myklebust	2013-12-20	2	-16/+96
\| \| \| \|
\| \| * \|	Python side of python bindings for linear, Lasso, and ridge regression	Tor Myklebust	2013-12-19	2	-15/+72
\| \| \| \|
\| \| * \|	Incorporate most of Josh's style suggestions. I don't want to deal with the type and length checking errors until we've got at least one working stub that we're all happy with.	Tor Myklebust	2013-12-19	2	-98/+91
\| \| \| \|
\| \| * \|	The rest of the Python side of those bindings.	Tor Myklebust	2013-12-19	3	-2/+4
\| \| \| \|
\| \| * \|	First cut at python mllib bindings. Only LinearRegression is supported.	Tor Myklebust	2013-12-19	1	-0/+114
\| \| \| \|
\| * \| \|	Typo: avaiable -> available	Andrew Ash	2013-12-24	1	-1/+1
\| \| \|/
\| \|/\|
* \| \|	Add Python docs about SparkConf	Matei Zaharia	2013-12-29	2	-1/+44
\| \| \|
* \| \|	Fix some other Python tests due to initializing JVM in a different way	Matei Zaharia	2013-12-29	3	-10/+19
\| \| \|	The test in context.py created two different instances of the SparkContext class by copying "globals", so that some tests can have a global "sc" object and others can try initializing their own contexts. This led to two JVM gateways being created since SparkConf also looked at pyspark.context.SparkContext to get the JVM.
* \| \|	Add SparkConf support in Python	Matei Zaharia	2013-12-29	4	-13/+146
\| \| \|
* \| \|	Fix Python use of getLocalDir	Matei Zaharia	2013-12-29	1	-1/+1
\|/ /
* \|	Merge pull request #276 from shivaram/collectPartition	Reynold Xin	2013-12-19	2	-4/+6
\|\ \	Add collectPartition to JavaRDD interface.
\| \| \|	This interface is useful for implementing `take` from other language frontends where the data is serialized. Also remove `takePartition` from PythonRDD and use `collectPartition` in rdd.py.
\| \| \|	Thanks @concretevitamin for the original change and tests.
\| * \|	Make collectPartitions take an array of partitions	Shivaram Venkataraman	2013-12-19	1	-1/+6
\| \| \|	Change the implementation to use runJob instead of PartitionPruningRDD. Also update the unit tests and the python take implementation to use the new interface.
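How a `take` can be built on an interface that collects an array of partitions might look roughly like the sketch below. It is an assumption-laden illustration, not the rdd.py code: `collect_partitions` is a hypothetical callable that accepts a list of partition indices and returns one list of rows per index.

```python
def take(num_partitions, n, collect_partitions):
    """Return the first `n` rows, fetching one partition at a time.

    `collect_partitions(indices)` is a hypothetical stand-in for the JVM call;
    it must return a list with one list of rows per requested index.
    """
    taken = []
    for index in range(num_partitions):
        if len(taken) >= n:
            break  # enough rows already, so later partitions are never fetched
        (rows,) = collect_partitions([index])
        taken.extend(rows[: n - len(taken)])
    return taken
```

Fetching partitions lazily means a small `n` touches only the first partition or two instead of collecting the whole dataset.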
\| * \|	Add collectPartition to JavaRDD interface.	Shivaram Venkataraman	2013-12-18	2	-4/+1
\| \|/	Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
* /	Add toString to Java RDD, and __repr__ to Python RDD	Nick Pentreath	2013-12-19	1	-0/+3
\|/
*	Merge branch 'master' into akka-bug-fix	Prashant Sharma	2013-12-11	3	-1/+36
\|\	Conflicts:
\| \|	core/pom.xml
\| \|	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
\| \|	pom.xml
\| \|	project/SparkBuild.scala
\| \|	streaming/pom.xml
\| \|	yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
\| *	License headers	Patrick Wendell	2013-12-09	1	-0/+17
\| \|
\| *	Fix UnicodeEncodeError in PySpark saveAsTextFile().	Josh Rosen	2013-11-28	2	-1/+19
\| \|	Fixes SPARK-970.
* \|	Merge branch 'master' into wip-scala-2.10	Prashant Sharma	2013-11-27	8	-142/+383
\|\\|	Conflicts:
\| \|	core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
\| \|	core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
\| \|	core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala
\| \|	core/src/main/scala/org/apache/spark/rdd/RDD.scala
\| \|	python/pyspark/rdd.py
\| *	Removed unused basestring case from dump_stream.	Josh Rosen	2013-11-26	1	-2/+0
\| \|
\| *	FramedSerializer: _dumps => dumps, _loads => loads.	Josh Rosen	2013-11-10	4	-18/+18
\| \|
\| *	Send PySpark commands as bytes instead of strings.	Josh Rosen	2013-11-10	3	-16/+13
\| \|
\| *	Add custom serializer support to PySpark.	Josh Rosen	2013-11-10	8	-148/+362
\| \|	For now, this only adds MarshalSerializer, but it lays the groundwork for supporting other custom serializers. Many of these mechanisms can also be used to support deserialization of different data formats sent by Java, such as data encoded by MsgPack. This also fixes a bug in SparkContext.union().
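The serializer structure named in these commits (FramedSerializer with dumps/loads, MarshalSerializer, dump_stream) could be sketched as below. This is an illustrative guess at the shape, not the PySpark source: the 4-byte big-endian length header and the `load_stream` method are assumptions.

```python
import marshal
import struct


class FramedSerializer(object):
    """Writes each object as a 4-byte length header followed by its payload."""

    def dumps(self, obj):
        raise NotImplementedError

    def loads(self, data):
        raise NotImplementedError

    def dump_stream(self, objects, stream):
        for obj in objects:
            payload = self.dumps(obj)
            stream.write(struct.pack("!i", len(payload)))
            stream.write(payload)

    def load_stream(self, stream):
        # Assumption: a missing or short header marks the end of the stream.
        while True:
            header = stream.read(4)
            if len(header) < 4:
                return
            (length,) = struct.unpack("!i", header)
            yield self.loads(stream.read(length))


class MarshalSerializer(FramedSerializer):
    """Uses the standard marshal module; it only handles built-in types."""

    def dumps(self, obj):
        return marshal.dumps(obj)

    def loads(self, data):
        return marshal.loads(data)
```

Because framing lives in the base class, a new format only has to supply `dumps` and `loads`.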
\| *	Remove Pickle-wrapping of Java objects in PySpark.	Josh Rosen	2013-11-03	4	-14/+39
\| \|	If we support custom serializers, the Python worker will know what type of input to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and strings.
\| *	Replace magic lengths with constants in PySpark.	Josh Rosen	2013-11-03	2	-6/+13
\| \|	Write the length of the accumulators section up-front rather than terminating it with a negative length. I find this easier to read.
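Writing a section's length up-front, as the last commit describes, can be sketched like this. The layout (a 4-byte record count, then length-prefixed records) and the function names are assumptions for illustration; the actual accumulator wire format is not shown in this log.

```python
import struct


def write_section(stream, records):
    """Write the record count first, then each record with its own length."""
    stream.write(struct.pack("!i", len(records)))
    for record in records:
        stream.write(struct.pack("!i", len(record)))
        stream.write(record)


def read_section(stream):
    """Read exactly one section; no negative-length terminator is needed."""
    (count,) = struct.unpack("!i", stream.read(4))
    records = []
    for _ in range(count):
        (length,) = struct.unpack("!i", stream.read(4))
        records.append(stream.read(length))
    return records
```

With the count known in advance, the reader loops a fixed number of times and leaves any following data in the stream untouched.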