spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	Merge pull request #511 from JoshRosen/SPARK-1040	Reynold Xin	2014-01-25	6	-7/+25
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040) This fixes [SPARK-1040](https://spark-project.atlassian.net/browse/SPARK-1040), an issue where JavaPairRDD.collectAsMap() could sometimes fail with ClassCastException. I applied the same fix to the Spark Streaming Java APIs. The commit message describes the fix in more detail. I also increased the verbosity of JUnit test output under SBT to make it easier to verify that the Java tests are actually running.
\| *	Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040)	Josh Rosen	2014-01-25	4	-5/+22
\| \|	This fixes an issue where collectAsMap() could fail when called on a JavaPairRDD that was derived by transforming a non-JavaPairRDD. The root problem was that we were creating the JavaPairRDD's ClassTag by casting a ClassTag[AnyRef] to a ClassTag[Tuple2[K2, V2]]. To fix this, I cast a ClassTag[Tuple2[_, _]] instead, since this actually produces a ClassTag of the appropriate type because ClassTags don't capture type parameters:
scala> implicitly[ClassTag[Tuple2[_, _]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
res8: Boolean = true
scala> implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[Tuple2[Int, Int]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
res9: Boolean = false
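The ClassTag behaviour this commit relies on can be reproduced without Spark. The following is a minimal sketch, assuming Scala 2.x and only the standard library; the object name `ClassTagErasureDemo` and its field names are hypothetical and do not appear in the Spark sources. It shows that a ClassTag records only the erased runtime class, so casting the wildcard tuple tag is harmless while casting the AnyRef tag leaves a tag that still describes `Object`.

```scala
import scala.reflect.ClassTag

// Hypothetical demo object; not part of Spark.
object ClassTagErasureDemo {
  // Tag for a tuple with unknown type arguments.
  val wildcardTag: ClassTag[Tuple2[_, _]] = implicitly[ClassTag[Tuple2[_, _]]]

  // Tag for a tuple with concrete type arguments.
  val intPairTag: ClassTag[Tuple2[Int, Int]] = implicitly[ClassTag[Tuple2[Int, Int]]]

  // The approach described as the fix: cast the wildcard tuple tag.
  // Its runtime class is already Tuple2, so the cast changes nothing at runtime.
  val fixedTag: ClassTag[Tuple2[Int, Int]] =
    wildcardTag.asInstanceOf[ClassTag[Tuple2[Int, Int]]]

  // The approach described as the bug: cast the AnyRef tag.
  // The cast compiles, but the tag's runtime class is still Object.
  val anyRefAsPair: ClassTag[Tuple2[Int, Int]] =
    implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[Tuple2[Int, Int]]]

  def main(args: Array[String]): Unit = {
    println(s"wildcard == concrete: ${wildcardTag == intPairTag}")
    println(s"fixed == concrete:    ${fixedTag == intPairTag}")
    println(s"AnyRef cast == concrete: ${anyRefAsPair == intPairTag}")
    println(s"AnyRef cast runtime class: ${anyRefAsPair.runtimeClass.getName}")
  }
}
```

Code that later uses such a tag to allocate an array gets an array of whatever runtime class the tag carries, which is a plausible route to the ClassCastException reported in SPARK-1040.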
\| *	Increase JUnit test verbosity under SBT.	Josh Rosen	2014-01-25	2	-2/+3
\|/ \| \| \| \| \| \| \| \|	Upgrade junit-interface plugin from 0.9 to 0.10. I noticed that the JavaAPISuite tests didn't appear to display any output locally or under Jenkins, making it difficult to know whether they were running. This change increases the verbosity to more closely match the ScalaTest tests.
*	Merge pull request #505 from JoshRosen/SPARK-1026	Patrick Wendell	2014-01-23	2	-6/+23
\|\ \| \| \| \| \| \| \| \| \| \|	Deprecate mapPartitionsWithSplit in PySpark (SPARK-1026) This commit deprecates `mapPartitionsWithSplit` in PySpark (see [SPARK-1026](https://spark-project.atlassian.net/browse/SPARK-1026)) and removes the remaining references to it from the docs.
\| *	Deprecate mapPartitionsWithSplit in PySpark.	Josh Rosen	2014-01-23	2	-6/+23
\| \| \| \| \| \| \| \| \| \| \| \|	Also, replace the last reference to it in the docs. This fixes SPARK-1026.
* \|	Merge pull request #503 from pwendell/master	Patrick Wendell	2014-01-23	1	-1/+9
\|\ \ \| \|/ \|/\| \| \| \| \| \| \|	Fix bug on read-side of external sort when using Snappy. This case wasn't handled correctly and this patch fixes it.
\| *	Fix bug on read-side of external sort when using Snappy.	Patrick Wendell	2014-01-23	1	-1/+9
\| \| \| \| \| \| \| \|	This case wasn't handled correctly and this patch fixes it.
* \|	Minor fix	Patrick Wendell	2014-01-23	1	-0/+1
\| \|
* \|	Merge pull request #502 from pwendell/clone-1	Patrick Wendell	2014-01-23	5	-229/+137
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove Hadoop object cloning and warn users making Hadoop RDD's. The code introduced in #359 used Hadoop's WritableUtils.clone() to duplicate objects when reading from Hadoop files. Some users have reported exceptions when cloning data in various file formats, including Avro and another custom format. This patch removes that functionality to ensure stability for the 0.9 release. Instead, it puts a clear warning in the documentation that copying may be necessary for Hadoop data sets.
\| * \|	Minor changes after auditing diff from earlier version	Patrick Wendell	2014-01-23	3	-7/+1
\| \| \|
\| * \|	Response to Matei's review	Patrick Wendell	2014-01-23	2	-21/+22
\| \| \|
\| * \|	Remove Hadoop object cloning and warn users making Hadoop RDD's.	Patrick Wendell	2014-01-23	5	-221/+134
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The code introduced in #359 used Hadoop's WritableUtils.clone() to duplicate objects when reading from Hadoop files. Some users have reported exceptions when cloning data in various file formats, including Avro and another custom format. This patch removes that functionality to ensure stability for the 0.9 release. Instead, it puts a clear warning in the documentation that copying may be necessary for Hadoop data sets.
* \| \|	Merge pull request #501 from JoshRosen/cartesian-rdd-fixes	Patrick Wendell	2014-01-23	3	-22/+56
\|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix two bugs in PySpark cartesian(): SPARK-978 and SPARK-1034 This pull request fixes two bugs in PySpark's `cartesian()` method: - [SPARK-978](https://spark-project.atlassian.net/browse/SPARK-978): PySpark's cartesian method throws ClassCastException exception - [SPARK-1034](https://spark-project.atlassian.net/browse/SPARK-1034): Py4JException on PySpark Cartesian Result The JIRAs have more details describing the fixes.
\| * \| \|	Fix SPARK-978: ClassCastException in PySpark cartesian.	Josh Rosen	2014-01-23	2	-20/+48
\| \| \| \|
\| * \| \|	Fix SPARK-1034: Py4JException on PySpark Cartesian Result	Josh Rosen	2014-01-23	2	-2/+8
\| \| \| \|
* \| \| \|	Merge pull request #406 from eklavya/master	Josh Rosen	2014-01-23	1	-1/+39
\|\ \ \ \ \| \|/ / / \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Extending Java API coverage Hi, I have added three new methods to JavaRDD. Please review and merge.
\| * \| \|	fixed ClassTag in mapPartitions	eklavya	2014-01-23	1	-8/+9
\| \| \| \|
\| * \| \|	Modifications as suggested in PR feedback-	Saurabh Rawat	2014-01-14	2	-8/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- more variants of mapPartitions added to JavaRDDLike - move setGenerator to JavaRDDLike - clean up
\| * \| \|	Modifications as suggested in PR feedback-	Saurabh Rawat	2014-01-13	2	-17/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- mapPartitions, foreachPartition moved to JavaRDDLike - call scala rdd's setGenerator instead of setting directly in JavaRDD
\| * \| \|	Remove default param from mapPartitions	eklavya	2014-01-13	1	-1/+1
\| \| \| \|
\| * \| \|	Remove classtag from mapPartitions.	eklavya	2014-01-13	1	-1/+1
\| \| \| \|
\| * \| \|	Added foreachPartition method to JavaRDD.	eklavya	2014-01-13	1	-1/+8
\| \| \| \|
\| * \| \|	Added mapPartitions method to JavaRDD.	eklavya	2014-01-13	1	-1/+12
\| \| \| \|
\| * \| \|	Added setter method setGenerator to JavaRDD.	eklavya	2014-01-13	1	-0/+5
\| \| \| \|
* \| \| \|	Merge pull request #499 from jianpingjwang/dev1	Reynold Xin	2014-01-23	3	-37/+40
\|\ \ \ \ \| \|_\|/ / \|/\| \| \| \| \| \| \|	Replace commons-math with jblas in SVDPlusPlus
\| * \| \|	Add jblas dependency	Jianping J Wang	2014-01-23	1	-1/+1
\| \| \| \|
\| * \| \|	Add jblas dependency	Jianping J Wang	2014-01-23	1	-4/+3
\| \| \| \|
\| * \| \|	Replace commons-math with jblas	Jianping J Wang	2014-01-23	1	-32/+36
\| \| \| \|
* \| \| \|	Merge pull request #496 from pwendell/master	Patrick Wendell	2014-01-22	1	-1/+1
\|\ \ \ \ \| \| \|_\|/ \| \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix bug in worker clean-up in UI Introduced in d5a96fec (/cc @aarondav). This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers.
\| * \| \|	Fix bug in worker clean-up in UI	Patrick Wendell	2014-01-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduced in d5a96fec. This should be picked into 0.8 and 0.9 as well.
* \| \| \|	Merge pull request #447 from CodingCat/SPARK-1027	Patrick Wendell	2014-01-22	8	-27/+37
\|\ \ \ \ \| \|_\|/ / \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	fix for SPARK-1027 fix for SPARK-1027 (https://spark-project.atlassian.net/browse/SPARK-1027) FIXES 1. change sparkhome from String to Option(String) in ApplicationDesc 2. remove sparkhome parameter in LaunchExecutor message 3. adjust involved files
\| * \| \|	refactor sparkHome to val	CodingCat	2014-01-22	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	clean code
\| * \| \|	fix for SPARK-1027	CodingCat	2014-01-20	8	-17/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	change TestClient & Worker to Some("xxx") kill manager if it is started remove unnecessary .get when fetch "SPARK_HOME" values
\| * \| \|	executor creation failed should not make the worker restart	CodingCat	2014-01-20	1	-12/+20
\| \| \| \|
* \| \| \|	Merge pull request #495 from srowen/GraphXCommonsMathDependency	Patrick Wendell	2014-01-22	3	-2/+10
\|\ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix graphx Commons Math dependency `graphx` depends on Commons Math (2.x) in `SVDPlusPlus.scala`. However the module doesn't declare this dependency. It happens to work because it is included by Hadoop artifacts. But, I can tell you this isn't true as of a month or so ago. Building versus recent Hadoop would fail. (That's how we noticed.) The simple fix is to declare the dependency, as it should be. But it's also worth noting that `commons-math` is the old-ish 2.x line, while `commons-math3` is where newer 3.x releases are. Drop-in replacement, but different artifact and package name. Changing this only usage to `commons-math3` works, tests pass, and isn't surprising that it does, so is probably also worth changing. (A comment in some test code also references `commons-math3`, FWIW.) It does raise another question though: `mllib` looks like it uses the `jblas` `DoubleMatrix` for general purpose vector/matrix stuff. Should `graphx` really use Commons Math for this? Beyond the tiny scope here but worth asking.
\| * \| \| \|	Also add graphx commons-math3 dependency in sbt build	Sean Owen	2014-01-22	1	-1/+4
\| \| \| \| \|
\| * \| \| \|	Depend on Commons Math explicitly instead of accidentally getting it from ↵	Sean Owen	2014-01-22	2	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Hadoop (which stops working in 2.2.x) and also use the newer commons-math3
* \| \| \| \|	Merge pull request #492 from skicavs/master	Patrick Wendell	2014-01-22	1	-2/+2
\|\ \ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	fixed job name and usage information for the JavaSparkPi example
\| * \| \| \| \|	fixed job name and usage information for the JavaSparkPi example	Kevin Mader	2014-01-22	1	-2/+2
\| \| \| \| \| \|
* \| \| \| \| \|	Merge pull request #478 from sryza/sandy-spark-1033	Patrick Wendell	2014-01-22	2	-4/+4
\|\ \ \ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	SPARK-1033. Ask for cores in Yarn container requests Tested on a pseudo-distributed cluster against the Fair Scheduler and observed a worker taking more than a single core.
\| * \| \| \| \| \|	Incorporate Tom's comments - update doc and code to reflect that core ↵	Sandy Ryza	2014-01-21	2	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	requests may not always be honored
\| * \| \| \| \| \|	SPARK-1033. Ask for cores in Yarn container requests	Sandy Ryza	2014-01-20	2	-5/+6
\| \| \|_\|/ / / \| \|/\| \| \| \|
* \| \| \| \| \|	Merge pull request #493 from kayousterhout/double_add	Matei Zaharia	2014-01-22	1	-1/+1
\|\ \ \ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fixed bug where task set managers are added to queue twice @mateiz can you verify that this is a bug and wasn't intentional? (https://github.com/apache/incubator-spark/commit/90a04dab8d9a2a9a372cea7cdf46cc0fd0f2f76c#diff-7fa4f84a961750c374f2120ca70e96edR551) This bug leads to a small performance hit because task set managers will get offered each rejected resource offer twice, but doesn't lead to any incorrect functionality. Thanks to @hdc1112 for pointing this out.
\| * \| \| \| \| \|	Fixed bug where task set managers are added to queue twice	Kay Ousterhout	2014-01-22	1	-1/+1
\| \| \|_\|/ / / \| \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This bug leads to a small performance hit because task set managers will get offered each rejected resource offer twice, but doesn't lead to any incorrect functionality.
* \| \| \| \| \|	Merge pull request #315 from rezazadeh/sparsesvd	Matei Zaharia	2014-01-22	7	-0/+543
\|\ \ \ \ \ \ \| \|/ / / / / \|/\| \| \| \| \| \|	Sparse SVD
# Singular Value Decomposition
Given an m x n matrix A, compute matrices U, S, V such that A = U * S * V^T.
There is no restriction on m, but we require n^2 doubles to fit in memory. Further, n should be less than m.
The decomposition is computed by first computing A^T A = V S^2 V^T, computing the SVD locally on that (since n x n is small), from which we recover S and V. Then we compute U via easy matrix multiplication as U = A * V * S^-1.
Only singular vectors associated with the largest k singular values are returned. If there are k such values, then the dimensions of the return will be:
* S is k x k and diagonal, holding the singular values on the diagonal.
* U is m x k and satisfies U^T U = eye(k).
* V is n x k and satisfies V^T V = eye(k).
All input and output is expected in sparse matrix format, 0-indexed as tuples of the form ((i,j),value), all in RDDs.
# Testing
Tests included. They test:
- Decomposition promise (A = USV^T)
- For small matrices, output is compared to that of jblas
- Rank 1 matrix test included
- Full Rank matrix test included
- Middle-rank matrix forced via k included
# Example Usage
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.SVD
import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.linalg.MatrixEntry
// Load and parse the data file
val data = sc.textFile("mllib/data/als/test.data").map { line =>
  val parts = line.split(',')
  MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
}
val m = 4
val n = 4
// recover top 1 singular vector
val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1)
println("singular values = " + decomposed.S.data.toArray.mkString)
# Documentation
Added to docs/mllib-guide.md
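The two identities used in the description above follow directly from the orthonormality of the singular vectors. A short derivation sketch, assuming S is diagonal with nonzero entries (so that S^-1 exists) and writing I_k for eye(k); the symbol I_k is notation introduced here, not taken from the pull request:

```latex
\begin{align*}
A &= U S V^{T}, \qquad U^{T} U = I_k, \qquad V^{T} V = I_k, \\
A^{T} A &= (U S V^{T})^{T} (U S V^{T})
         = V S \, (U^{T} U) \, S V^{T}
         = V S^{2} V^{T}, \\
A V &= U S \, (V^{T} V) = U S
   \;\Longrightarrow\; U = A V S^{-1}.
\end{align*}
```

The first line holds exactly only when A has rank at most k; for a truncated decomposition it is an approximation, while the relation U = A V S^-1 still gives the retained left singular vectors.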
\| * \| \| \| \|	rename to MatrixSVD	Reza Zadeh	2014-01-17	1	-2/+2
\| \| \| \| \| \|
\| * \| \| \| \|	rename to MatrixSVD	Reza Zadeh	2014-01-17	2	-4/+4
\| \| \| \| \| \|
\| * \| \| \| \|	Merge remote-tracking branch 'upstream/master' into sparsesvd	Reza Zadeh	2014-01-17	146	-1799/+2613
\| \|\ \ \ \ \
\| * \| \| \| \| \|	make example 0-indexed	Reza Zadeh	2014-01-17	1	-1/+1
\| \| \| \| \| \| \|
\| * \| \| \| \| \|	0index docs	Reza Zadeh	2014-01-17	1	-1/+1
\| \| \| \| \| \| \|