| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| | |
Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040)
This fixes [SPARK-1040](https://spark-project.atlassian.net/browse/SPARK-1040), an issue where JavaPairRDD.collectAsMap() could sometimes fail with ClassCastException. I applied the same fix to the Spark Streaming Java APIs. The commit message describes the fix in more detail.
I also increased the verbosity of JUnit test output under SBT to make it easier to verify that the Java tests are actually running.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This fixes an issue where collectAsMap() could
fail when called on a JavaPairRDD that was derived
by transforming a non-JavaPairRDD.
The root problem was that we were creating the
JavaPairRDD's ClassTag by casting a
ClassTag[AnyRef] to a ClassTag[Tuple2[K2, V2]].
To fix this, I cast a ClassTag[Tuple2[_, _]]
instead, since this actually produces a ClassTag
of the appropriate type because ClassTags don't
capture type parameters:
scala> implicitly[ClassTag[Tuple2[_, _]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
res8: Boolean = true
scala> implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[Tuple2[Int, Int]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
res9: Boolean = false
|
|/
|
|
|
|
|
|
|
| |
Upgrade junit-interface plugin from 0.9 to 0.10.
I noticed that the JavaAPISuite tests didn't
appear to display any output locally or under
Jenkins, making it difficult to know whether they
were running. This change increases the verbosity
to more closely match the ScalaTest tests.
|
|\
| |
| |
| |
| |
| | |
Deprecate mapPartitionsWithSplit in PySpark (SPARK-1026)
This commit deprecates `mapPartitionsWithSplit` in PySpark (see [SPARK-1026](https://spark-project.atlassian.net/browse/SPARK-1026) and removes the remaining references to it from the docs.
|
| |
| |
| |
| |
| |
| | |
Also, replace the last reference to it in the docs.
This fixes SPARK-1026.
|
|\ \
| |/
|/|
| |
| |
| | |
Fix bug on read-side of external sort when using Snappy.
This case wasn't handled correctly and this patch fixes it.
|
| |
| |
| |
| | |
This case wasn't handled correctly and this patch fixes it.
|
| | |
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Remove Hadoop object cloning and warn users making Hadoop RDD's.
The code introduced in #359 used Hadoop's WritableUtils.clone() to
duplicate objects when reading from Hadoop files. Some users have
reported exceptions when cloning data in various file formats,
including Avro and another custom format.
This patch removes that functionality to ensure stability for the
0.9 release. Instead, it puts a clear warning in the documentation
that copying may be necessary for Hadoop data sets.
|
| | | |
|
| | | |
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The code introduced in #359 used Hadoop's WritableUtils.clone() to
duplicate objects when reading from Hadoop files. Some users have
reported exceptions when cloning data in verious file formats,
including Avro and another custom format.
This patch removes that functionality to ensure stability for the
0.9 release. Instead, it puts a clear warning in the documentation
that copying may be necessary for Hadoop data sets.
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Fix two bugs in PySpark cartesian(): SPARK-978 and SPARK-1034
This pull request fixes two bugs in PySpark's `cartesian()` method:
- [SPARK-978](https://spark-project.atlassian.net/browse/SPARK-978): PySpark's cartesian method throws ClassCastException exception
- [SPARK-1034](https://spark-project.atlassian.net/browse/SPARK-1034): Py4JException on PySpark Cartesian Result
The JIRAs have more details describing the fixes.
|
| | | | |
|
| | | | |
|
|\ \ \ \
| |/ / /
|/| | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Extending Java API coverage
Hi,
I have added three new methods to JavaRDD.
Please review and merge.
|
| | | | |
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
- more variants of mapPartitions added to JavaRDDLike
- move setGenerator to JavaRDDLike
- clean up
|
| | | |
| | | |
| | | |
| | | |
| | | | |
- mapPartitions, foreachPartition moved to JavaRDDLike
- call scala rdd's setGenerator instead of setting directly in JavaRDD
|
| | | | |
|
| | | | |
|
| | | | |
|
| | | | |
|
| | | | |
|
|\ \ \ \
| |_|/ /
|/| | |
| | | | |
Replace commons-math with jblas in SVDPlusPlus
|
| | | | |
|
| | | | |
|
| | | | |
|
|\ \ \ \
| | |_|/
| |/| |
| | | |
| | | |
| | | |
| | | |
| | | | |
Fix bug in worker clean-up in UI
Introduced in d5a96fec (/cc @aarondav).
This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers.
|
| | | |
| | | |
| | | |
| | | | |
Introduced in d5a96fec. This should be picked into 0.8 and 0.9 as well.
|
|\ \ \ \
| |_|/ /
|/| | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
fix for SPARK-1027
fix for SPARK-1027 (https://spark-project.atlassian.net/browse/SPARK-1027)
FIXES
1. change sparkhome from String to Option(String) in ApplicationDesc
2. remove sparkhome parameter in LaunchExecutor message
3. adjust involved files
|
| | | |
| | | |
| | | |
| | | | |
clean code
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
change TestClient & Worker to Some("xxx")
kill manager if it is started
remove unnecessary .get when fetch "SPARK_HOME" values
|
| | | | |
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Fix graphx Commons Math dependency
`graphx` depends on Commons Math (2.x) in `SVDPlusPlus.scala`. However the module doesn't declare this dependency. It happens to work because it is included by Hadoop artifacts. But, I can tell you this isn't true as of a month or so ago. Building versus recent Hadoop would fail. (That's how we noticed.)
The simple fix is to declare the dependency, as it should be. But it's also worth noting that `commons-math` is the old-ish 2.x line, while `commons-math3` is where newer 3.x releases are. Drop-in replacement, but different artifact and package name. Changing this only usage to `commons-math3` works, tests pass, and isn't surprising that it does, so is probably also worth changing. (A comment in some test code also references `commons-math3`, FWIW.)
It does raise another question though: `mllib` looks like it uses the `jblas` `DoubleMatrix` for general purpose vector/matrix stuff. Should `graphx` really use Commons Math for this? Beyond the tiny scope here but worth asking.
|
| | | | | |
|
| | | | |
| | | | |
| | | | |
| | | | | |
Hadoop (which stops working in 2.2.x) and also use the newer commons-math3
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | | |
fixed job name and usage information for the JavaSparkPi example
|
| | | | | | |
|
|\ \ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
SPARK-1033. Ask for cores in Yarn container requests
Tested on a pseudo-distributed cluster against the Fair Scheduler and observed a worker taking more than a single core.
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
requests may not always be honored
|
| | |_|/ / /
| |/| | | | |
|
|\ \ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Fixed bug where task set managers are added to queue twice
@mateiz can you verify that this is a bug and wasn't intentional? (https://github.com/apache/incubator-spark/commit/90a04dab8d9a2a9a372cea7cdf46cc0fd0f2f76c#diff-7fa4f84a961750c374f2120ca70e96edR551)
This bug leads to a small performance hit because task
set managers will get offered each rejected resource
offer twice, but doesn't lead to any incorrect functionality.
Thanks to @hdc1112 for pointing this out.
|
| | |_|/ / /
| |/| | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
This bug leads to a small performance hit because task
set managers will get offered each rejected resource
offer twice, but doesn't lead to any incorrect functionality.
|
|\ \ \ \ \ \
| |/ / / / /
|/| | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Sparse SVD
# Singular Value Decomposition
Given an *m x n* matrix *A*, compute matrices *U, S, V* such that
*A = U * S * V^T*
There is no restriction on m, but we require n^2 doubles to fit in memory.
Further, n should be less than m.
The decomposition is computed by first computing *A^TA = V S^2 V^T*,
computing svd locally on that (since n x n is small),
from which we recover S and V.
Then we compute U via easy matrix multiplication
as *U = A * V * S^-1*
Only singular vectors associated with the largest k singular values
If there are k such values, then the dimensions of the return will be:
* *S* is *k x k* and diagonal, holding the singular values on diagonal.
* *U* is *m x k* and satisfies U^T*U = eye(k).
* *V* is *n x k* and satisfies V^TV = eye(k).
All input and output is expected in sparse matrix format, 0-indexed
as tuples of the form ((i,j),value) all in RDDs.
# Testing
Tests included. They test:
- Decomposition promise (A = USV^T)
- For small matrices, output is compared to that of jblas
- Rank 1 matrix test included
- Full Rank matrix test included
- Middle-rank matrix forced via k included
# Example Usage
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.SVD
import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.linalg.MatrixyEntry
// Load and parse the data file
val data = sc.textFile("mllib/data/als/test.data").map { line =>
val parts = line.split(',')
MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
}
val m = 4
val n = 4
// recover top 1 singular vector
val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1)
println("singular values = " + decomposed.S.data.toArray.mkString)
# Documentation
Added to docs/mllib-guide.md
|
| | | | | | |
|
| | | | | | |
|
| |\ \ \ \ \ |
|
| | | | | | | |
|
| | | | | | | |
|