path: root/docs
Commit log (most recent first). Each entry: commit message (author, date, files changed, lines removed/added).
* Improvements in example code for the programming guide, as well as adding serialization support for GraphImpl to address issues with failed closure capture. (Joseph E. Gonzalez, 2014-01-13, 1 file, -17/+22)
* Remove aggregateNeighbors (Ankur Dave, 2014-01-13, 1 file, -17/+0)
* Merge branch 'master' into graphx (Reynold Xin, 2014-01-13, 5 files, -15/+74)
  * Merge pull request #400 from tdas/dstream-move (Patrick Wendell, 2014-01-13, 1 file, -1/+1)
    Moved DStream and PairDStream to org.apache.spark.streaming.dstream. Similar to the package location of `org.apache.spark.rdd.RDD`, `DStream` has been moved from `org.apache.spark.streaming.DStream` to `org.apache.spark.streaming.dstream.DStream`. I know that the package name is a little long, but I think it's better to keep it consistent with Spark's structure.
    Also fixed persistence of windowed DStreams. The RDDs generated by a windowed DStream are essentially unions of underlying RDDs, and persisting these union RDDs would store numerous copies of the underlying data. Instead, setting the persistence level on the windowed DStream now sets the persistence level of the underlying DStream.
    * Merge remote-tracking branch 'apache/master' into dstream-move (Tathagata Das, 2014-01-12, 1 file, -2/+2)
      Conflicts:
        streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
    * Moved DStream, DStreamCheckpointData and PairDStream from org.apache.spark.streaming to org.apache.spark.streaming.dstream. (Tathagata Das, 2014-01-12, 1 file, -1/+1)
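A minimal sketch of what this move means for user code, using only the package paths named in the merge description above:

```scala
// Before PR #400 (sketch of the old location):
//   import org.apache.spark.streaming.DStream
// After the move, mirroring org.apache.spark.rdd.RDD:
import org.apache.spark.streaming.dstream.DStream
```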
  * Merge pull request #399 from pwendell/consolidate-off (Patrick Wendell, 2014-01-12, 1 file, -1/+1)
    Disable shuffle file consolidation by default. After running various performance tests for the 0.9 release, this still seems to have performance issues even on XFS. So let's keep this off-by-default for 0.9 and users can experiment with it depending on their disk configurations.
    * Disable shuffle file consolidation by default (Patrick Wendell, 2014-01-12, 1 file, -1/+1)
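For users who still want to experiment with consolidation, a minimal sketch of flipping the flag. The key name is an assumption based on the PR context; verify it against the configuration docs for your release.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: opt back in to shuffle file consolidation for benchmarking.
val conf = new SparkConf()
  .setAppName("consolidation-experiment")
  .set("spark.shuffle.consolidateFiles", "true")  // off by default as of this change
val sc = new SparkContext(conf)
```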
  * Rename DStream.foreach to DStream.foreachRDD (Patrick Wendell, 2014-01-12, 1 file, -2/+2)
    `foreachRDD` makes it clear that the granularity of this operator is per-RDD. As it stands, `foreach` is inconsistent with `map`, `filter`, and the other DStream operators, which get pushed down to individual records within each RDD.
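A minimal sketch of the renamed operator; the socket source, host/port, and one-second batch interval are illustrative assumptions.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "foreachRDD-example", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// foreachRDD operates once per batch RDD, unlike map/filter, which are
// pushed down to the individual records within each RDD.
lines.foreachRDD { rdd =>
  println("records in this batch: " + rdd.count())
}

ssc.start()
ssc.awaitTermination()
```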
  * Merge pull request #377 from andrewor14/master (Patrick Wendell, 2014-01-10, 1 file, -2/+21)
    External Sorting for Aggregator and CoGroupedRDDs (Revisited). (This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins/GitHub was misbehaving.)
    The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.
    The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order. Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
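To illustrate the spill-on-threshold idea described above, here is a toy sketch. It is NOT the real ExternalAppendOnlyMap: an in-memory buffer of sorted runs stands in for the sorted files the real implementation writes to disk and later merges, and a simple entry count stands in for memory estimation.

```scala
import scala.collection.mutable

// Toy sketch of spill-on-threshold: buffer in memory, emit a sorted run
// once the threshold is exceeded, keep runs sorted so they can be merged.
class SpillableMap[K, V](maxEntries: Int) {
  private var current = mutable.HashMap.empty[K, V]
  private val spilledRuns = mutable.ArrayBuffer.empty[Seq[(K, V)]]

  def update(key: K, value: V): Unit = {
    current(key) = value
    if (current.size > maxEntries) {
      // Sort the buffer (by key hash here) before "writing it out", so the
      // runs can later be merged back in sorted order.
      spilledRuns += current.toSeq.sortBy(_._1.hashCode)
      current = mutable.HashMap.empty[K, V]
    }
  }

  // All runs, each internally sorted, ready for a merge pass.
  def runs: Seq[Seq[(K, V)]] = spilledRuns.toSeq :+ current.toSeq.sortBy(_._1.hashCode)
}
```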
    * Update documentation for externalSorting (Andrew Or, 2014-01-10, 1 file, -3/+2)
    * Address Patrick's and Reynold's comments (Andrew Or, 2014-01-10, 1 file, -2/+22)
      Aside from trivial formatting changes, use nulls instead of Options for DiskMapIterator, and add documentation for spark.shuffle.externalSorting and spark.shuffle.memoryFraction. Also, set spark.shuffle.memoryFraction to 0.3 and spark.storage.memoryFraction to 0.6.
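A sketch pinning the settings discussed above to the values from this commit (normally you would set these only to deviate from the defaults). Key names are as given in the commit message; they may differ in later releases.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.externalSorting", "true")  // allow spilling during aggregations
  .set("spark.shuffle.memoryFraction", "0.3")    // in-memory map budget for shuffles
  .set("spark.storage.memoryFraction", "0.6")    // cache budget
```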
  * Merge pull request #371 from tgravescs/yarn_client_addjar_misc_fixes (Thomas Graves, 2014-01-10, 1 file, -2/+13)
    YARN client addJar and misc fixes: fix the addJar functionality in yarn-client mode, add support for the other options supported in yarn-standalone mode, set the application type on YARN in Hadoop 2.x, add documentation, and change the heartbeat interval to use the same code as yarn-standalone so it doesn't take so long to get containers and exit.
    * yarn-client addJar fix and misc other fixes (Thomas Graves, 2014-01-09, 1 file, -2/+13)
  * Merge pull request #378 from pwendell/consolidate_on (Patrick Wendell, 2014-01-09, 1 file, -1/+1)
    Enable shuffle consolidation by default. Bump this to being enabled for 0.9.0.
    * Enable shuffle consolidation by default. (Patrick Wendell, 2014-01-09, 1 file, -1/+1)
  * Merge pull request #353 from pwendell/ipython-simplify (Patrick Wendell, 2014-01-09, 1 file, -2/+3)
    Simplify and fix pyspark script. This patch removes compatibility for IPython < 1.0 but fixes the launch script and makes it much simpler. I tested this using the three commands in the PySpark documentation page:
    1. IPYTHON=1 ./pyspark
    2. IPYTHON_OPTS="notebook" ./pyspark
    3. IPYTHON_OPTS="notebook --pylab inline" ./pyspark
    There are two changes:
    - We rely on the PYTHONSTARTUP env var to start PySpark.
    - Removed the quotes around $IPYTHON_OPTS; having quotes gloms them together as a single argument passed to `exec`, which seemed to cause IPython to fail (it instead expects them as multiple arguments).
    * Simplify and fix pyspark script. (Patrick Wendell, 2014-01-07, 1 file, -2/+3)
  * Merge pull request #293 from pwendell/standalone-driver (Patrick Wendell, 2014-01-09, 1 file, -5/+33)
    SPARK-998: Support Launching Driver Inside of Standalone Mode. [NOTE: I need to bring the tests up to date with new changes, so for now they will fail.]
    This patch provides support for launching driver programs inside of a standalone cluster manager. It also supports monitoring and re-launching of driver programs, which is useful for long-running, recoverable applications such as Spark Streaming jobs. For those jobs, this patch allows a deployment mode which is resilient to the failure of any worker node, failure of a master node (provided a multi-master setup), and even failures of the application itself, provided they are recoverable on a restart. Driver information, such as the status and logs from a driver, is displayed in the UI.
    There are a few small TODOs here, but the code is generally feature-complete:
    - Bring tests up to date and add test coverage.
    - Restarting on failure should be optional and maybe off by default.
    - See if we can re-use akka connections to facilitate clients behind a firewall.
    A sensible place to start for review would be the `DriverClient` class, which presents users the ability to launch their driver program. I've also added an example program (`DriverSubmissionTest`) that allows you to test this locally and play around with killing workers, etc. Most of the code is devoted to persisting driver state in the cluster manager, exposing it in the UI, and dealing correctly with various types of failures.
    Instructions to test locally:
    - `sbt/sbt assembly/assembly examples/assembly`
    - Start a local version of the standalone cluster manager.
    - Launch the driver:
      ```
      ./spark-class org.apache.spark.deploy.client.DriverClient \
        -j -Dspark.test.property=something \
        -e SPARK_TEST_KEY=SOMEVALUE \
        launch spark://10.99.1.14:7077 \
        ../path-to-examples-assembly-jar \
        org.apache.spark.examples.DriverSubmissionTest 1000 some extra options --some-option-here -X 13
      ```
    - Go to the UI and make sure it started correctly; look at the output, etc.
    - Kill workers, the driver program, masters, etc.
    * Merge remote-tracking branch 'apache-github/master' into standalone-driver (Patrick Wendell, 2014-01-08, 14 files, -81/+348)
      Conflicts:
        core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala
        pom.xml
    * Fixes (Patrick Wendell, 2014-01-08, 1 file, -2/+3)
    * Some doc fixes (Patrick Wendell, 2014-01-06, 1 file, -3/+2)
    * Merge remote-tracking branch 'apache-github/master' into standalone-driver (Patrick Wendell, 2014-01-06, 23 files, -155/+228)
      Conflicts:
        core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala
        core/src/main/scala/org/apache/spark/deploy/client/TestClient.scala
        core/src/main/scala/org/apache/spark/deploy/master/Master.scala
        core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
        core/src/main/scala/org/apache/spark/scheduler/cluster/SparkDeploySchedulerBackend.scala
    * Documentation and adding supervise option (Patrick Wendell, 2013-12-29, 1 file, -5/+33)
  * Fixing config option "retained_stages" => "retainedStages". (Patrick Wendell, 2014-01-08, 1 file, -1/+1)
    This is a very esoteric option and it's out of sync with the style we use. So it seems fitting to fix it for 0.9.0.
* Merge pull request #2 from jegonzal/GraphXCCIssue (Ankur Dave, 2014-01-13, 1 file, -6/+27)
  Improving documentation and identifying a potential bug in the connected components (CC) calculation.
  * Improving documentation and identifying potential bug in CC calculation. (Joseph E. Gonzalez, 2014-01-13, 1 file, -6/+27)
* Add graph loader links to doc (Ankur Dave, 2014-01-13, 1 file, -0/+13)
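A minimal GraphLoader sketch for the links added here, assuming an edge-list file of whitespace-separated "srcId dstId" pairs; the local master and the file path are illustrative placeholders.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader

// Sketch: load a graph from an edge-list file.
val sc = new SparkContext("local", "graph-loader-example")
val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
println("edges loaded: " + graph.edges.count())
```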
* Fix mapReduceTriplets links in doc (Ankur Dave, 2014-01-13, 1 file, -4/+4)
* Tested and corrected all examples up to `mask` in the graphx-programming-guide. (Joseph E. Gonzalez, 2014-01-12, 1 file, -17/+20)
* Use GraphLoader for algorithms examples in doc (Ankur Dave, 2014-01-12, 1 file, -17/+19)
* Move algorithms to GraphOps (Ankur Dave, 2014-01-12, 1 file, -9/+3)
* Add TriangleCount example (Ankur Dave, 2014-01-12, 1 file, -4/+27)
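A sketch of a triangle-count call, reusing `sc` from the loader sketch above. Triangle counting expects edges in canonical orientation (srcId < dstId), hence the `true` flag; the flag's position and the partition strategy are assumptions to verify against your GraphX version.

```scala
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

val triGraph = GraphLoader
  .edgeListFile(sc, "graphx/data/followers.txt", true)  // canonicalOrientation
  .partitionBy(PartitionStrategy.RandomVertexCut)
val triCounts = triGraph.triangleCount().vertices       // vertex id -> triangle count
```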
* Documenting Pregel API (Joseph E. Gonzalez, 2014-01-12, 1 file, -1/+198)
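A hedged sketch of the Pregel operator being documented, using single-source shortest paths. It reuses `sc` from the loader sketch above; the source vertex id is an assumption, and edge attributes are cast to Double so they can serve as distances.

```scala
import org.apache.spark.graphx.GraphLoader

val distGraph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
  .mapEdges(e => e.attr.toDouble)
val sourceId = 1L
val initialGraph = distGraph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),  // vertex program
  triplet =>                                       // send messages along edges
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    },
  (a, b) => math.min(a, b)                         // merge incoming messages
)
```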
* Add connected components example to doc (Ankur Dave, 2014-01-12, 1 file, -1/+19)
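A minimal sketch of the connected components call, reusing `graph` from the loader sketch above; each vertex is labeled with the lowest vertex id in its component.

```scala
val cc = graph.connectedComponents().vertices
cc.take(5).foreach { case (id, component) =>
  println("vertex " + id + " is in component " + component)
}
```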
* Add PageRank example and data (Ankur Dave, 2014-01-12, 1 file, -1/+31)
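A sketch of the PageRank call, reusing `graph` from the loader sketch above; running until ranks converge within a tolerance, whose value here is illustrative.

```scala
val ranks = graph.pageRank(0.0001).vertices  // vertex id -> rank
ranks.take(5).foreach(println)
```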
* Link methods in programming guide; document VertexID (Ankur Dave, 2014-01-12, 1 file, -69/+86)
* Correcting typos in documentation. (Joseph E. Gonzalez, 2014-01-11, 1 file, -66/+79)
* Finished documenting join operators and revised some of the initial presentation. (Joseph E. Gonzalez, 2014-01-11, 2 files, -37/+82)
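A sketch of the join-operator pattern documented here, reusing `graph` from the loader sketch above: merge out-degrees back into the vertex attributes, defaulting to 0 for vertices with no outgoing edges. Using the degree as the new attribute is an illustrative choice.

```scala
val degreeGraph = graph.outerJoinVertices(graph.outDegrees) {
  (id, oldAttr, outDegOpt) => outDegOpt.getOrElse(0)
}
```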
* Remove GraphLab (Ankur Dave, 2014-01-11, 1 file, -7/+6)
* Finished documenting structural operators and starting join operators. (Joseph E. Gonzalez, 2014-01-11, 1 file, -18/+72)
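A sketch of a structural operator, reusing `graph` from the loader sketch above: `subgraph` restricts the graph to vertices satisfying a predicate (keeping even vertex ids here is purely illustrative), dropping edges that touch removed vertices.

```scala
val evenGraph = graph.subgraph(vpred = (id, attr) => id % 2 == 0)
```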
* Starting structural operator discussion. (Joseph E. Gonzalez, 2014-01-11, 1 file, -2/+30)
* Addressing comment about Graph Processing in docs. (Joseph E. Gonzalez, 2014-01-11, 1 file, -2/+2)
* More organizational changes and dropping the benchmark plot. (Joseph E. Gonzalez, 2014-01-11, 1 file, -12/+20)
* More edits. (Joseph E. Gonzalez, 2014-01-10, 4 files, -16/+215)
* Soften wording about GraphX superseding Bagel (Ankur Dave, 2014-01-10, 4 files, -6/+6)
* Generate GraphX docs (Ankur Dave, 2014-01-10, 1 file, -1/+1)
* Add back Bagel links to docs, but mark them superseded (Ankur Dave, 2014-01-10, 5 files, -14/+21)
* WIP: updating figures and cleaning up the initial skeleton for the GraphX programming guide. (Joseph E. Gonzalez, 2014-01-10, 12 files, -159/+134)
* Start fixing formatting of graphx-programming-guide (Ankur Dave, 2014-01-09, 1 file, -7/+6)