spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	SPARK 1084.1 (resubmitted)	Sean Owen	2014-02-27	15	-88/+154
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	(Ported from https://github.com/apache/incubator-spark/pull/637 ) Author: Sean Owen <sowen@cloudera.com> Closes #31 from srowen/SPARK-1084.1 and squashes the following commits: 6c4a32c [Sean Owen] Suppress warnings about legitimate unchecked array creations, or change code to avoid it f35b833 [Sean Owen] Fix two misc javadoc problems 254e8ef [Sean Owen] Fix one new style error introduced in scaladoc warning commit 5b2fce2 [Sean Owen] Fix scaladoc invocation warning, and enable javac warnings properly, with plugin config updates 007762b [Sean Owen] Remove dead scaladoc links b8ff8cb [Sean Owen] Replace deprecated Ant <tasks> with <target>
*	Show Master status on UI page	Raymond Liu	2014-02-26	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	For standalone HA mode, a status field is useful for identifying the current master; it is already included in the JSON output too. Author: Raymond Liu <raymond.liu@intel.com> Closes #24 from colorant/status and squashes the following commits: df630b3 [Raymond Liu] Show Master status on UI page
*	[SPARK-1089] fix the regression problem on ADD_JARS in 0.9	CodingCat	2014-02-26	1	-2/+7
	https://spark-project.atlassian.net/browse/SPARK-1089
	Copied from JIRA, reported by @ash211:
	"Using the ADD_JARS environment variable with spark-shell used to add the jar to both the shell and the various workers. Now it only adds to the workers and importing a custom class in the shell is broken. The workaround is to add custom jars to both ADD_JARS and SPARK_CLASSPATH. We should fix ADD_JARS so it works properly again. See various threads on the user list: https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201402.mbox/%3CCAJbo4neMLiTrnm1XbyqomWmp0m+EUcg4yE-txuRGSVKOb5KLeA@mail.gmail.com%3E (another one that doesn't appear in the archives yet titled "ADD_JARS not working on 0.9")"
	The cause of this bug is twofold:
	- In the current implementation of SparkILoop.scala, settings.classpath is not set properly when the process() method is invoked.
	- Scala 2.10 behaves oddly here (I personally think it is a bug): if we simply set the value of a PathSettings object (like settings.classpath), the isDefault is not set to true (this is a flag showing whether the variable has been modified), so the PathResolver loads the default CLASSPATH environment variable value to calculate the path (see https://github.com/scala/scala/blob/2.10.x/src/compiler/scala/tools/util/PathResolver.scala#L215). What we have to do is set this flag manually (https://github.com/CodingCat/incubator-spark/blob/e3991d97ddc33e77645e4559b13bf78b9e68239a/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L884).
	Author: CodingCat <zhunansjtu@gmail.com>
	Closes #13 from CodingCat/SPARK-1089 and squashes the following commits:
	8af81e7 [CodingCat] impose non-null settings
	9aa2125 [CodingCat] code cleaning
	ce36676 [CodingCat] code cleaning
	e045582 [CodingCat] fix the regression problem on ADD_JARS in 0.9
*	SPARK-1121 Only add avro if the build is for Hadoop 0.23.X and SPARK_YARN is set	Prashant Sharma	2014-02-26	3	-55/+39
\| \| \| \| \| \| \| \| \|	Author: Prashant Sharma <prashant.s@imaginea.com> Closes #6 from ScrapCodes/SPARK-1121/avro-dep-fix and squashes the following commits: 9b29e34 [Prashant Sharma] Review feedback on PR 46ed2ad [Prashant Sharma] SPARK-1121-Only add avro if the build is for Hadoop 0.23.X and SPARK_YARN is set
*	SPARK-1129: use a predefined seed when seed is zero in XORShiftRandom	Xiangrui Meng	2014-02-26	2	-3/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the seed is zero, XORShift generates all zeros, which would produce unexpected results. JIRA: https://spark-project.atlassian.net/browse/SPARK-1129 Author: Xiangrui Meng <meng@databricks.com> Closes #645 from mengxr/xor and squashes the following commits: 1b086ab [Xiangrui Meng] use MurmurHash3 to set seed in XORShiftRandom 45c6f16 [Xiangrui Meng] minor style change 51f4050 [Xiangrui Meng] use a predefined seed when seed is zero in XORShiftRandom
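The zero-seed problem in the SPARK-1129 commit above can be reproduced in a few lines. This is a minimal Python sketch, not Spark's Scala code: the shift triple (21, 35, 4), the `FALLBACK_SEED` constant and all helper names are assumptions made for illustration. Note that one of the squashed commits mentions setting the seed with MurmurHash3, so the merged patch may differ from the constant substitution shown here.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit wraparound on Python's unbounded ints

# Hypothetical non-zero constant substituted when the caller passes seed 0.
FALLBACK_SEED = 0x9E3779B97F4A7C15


def xorshift_step(state):
    """Advance a 64-bit xorshift state by one round (assumed shifts 21/35/4)."""
    state ^= (state << 21) & MASK64
    state ^= state >> 35
    state ^= (state << 4) & MASK64
    return state


def safe_seed(seed):
    """Replace the degenerate zero seed with a predefined non-zero one."""
    seed &= MASK64
    return FALLBACK_SEED if seed == 0 else seed


def first_states(seed, count):
    """Return the first `count` states produced from `seed`."""
    states = []
    state = seed
    for _ in range(count):
        state = xorshift_step(state)
        states.append(state)
    return states
```

Zero is a fixed point of every shift-and-xor round, so `first_states(0, n)` stays at zero forever, while any non-zero starting state can never reach zero because each round is invertible.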
*	Remove references to ClusterScheduler (SPARK-1140)	Kay Ousterhout	2014-02-26	8	-46/+47
\| \| \| \| \| \| \| \| \| \| \|	ClusterScheduler was renamed to TaskSchedulerImpl; this commit updates comments and tests accordingly. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #9 from kayousterhout/cluster_scheduler_death and squashes the following commits: d6fd119 [Kay Ousterhout] Remove references to ClusterScheduler.
*	Updated link for pyspark examples in docs	Jyotiska NK	2014-02-26	1	-1/+1
\| \| \| \| \| \| \| \|	Author: Jyotiska NK <jyotiska123@gmail.com> Closes #22 from jyotiska/pyspark_docs and squashes the following commits: 426136c [Jyotiska NK] Updated link for pyspark examples
*	Deprecated and added a few java api methods for corresponding scala api.	Prashant Sharma	2014-02-26	5	-6/+32
\| \| \| \| \| \| \| \| \| \| \| \|	PR [402](https://github.com/apache/incubator-spark/pull/402) from incubator repo. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #19 from ScrapCodes/java-api-completeness and squashes the following commits: 11d0c2b [Prashant Sharma] Integer -> java.lang.Integer 737819a [Prashant Sharma] SPARK-1095 add explicit return types to APIs. 3ddc8bb [Prashant Sharma] Deprected *With functions in scala and added a few missing Java APIs
*	Removed reference to incubation in README.md.	Reynold Xin	2014-02-26	1	-14/+3
\| \| \| \| \| \| \| \|	Author: Reynold Xin <rxin@apache.org> Closes #1 from rxin/readme and squashes the following commits: b3a77cd [Reynold Xin] Removed reference to incubation in README.md.
*	SPARK-1115: Catch depickling errors	Bouke van der Bijl	2014-02-26	1	-24/+24
\| \| \| \| \| \| \| \| \| \| \| \| \|	This surrounds the complete worker code in a try/except block so we catch any error that arrives. An example would be the depickling failing for some reason. @JoshRosen Author: Bouke van der Bijl <boukevanderbijl@gmail.com> Closes #644 from bouk/catch-depickling-errors and squashes the following commits: f0f67cc [Bouke van der Bijl] Lol indentation 0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
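The idea in the SPARK-1115 commit above, widening the try block so that deserialization failures are reported like any other worker error, might look roughly like this. It is a hedged sketch only: `run_worker` and its return convention are hypothetical and are not taken from PySpark's `worker.py`.

```python
import pickle
import traceback


def run_worker(payload, func):
    """Deserialize `payload`, apply `func`, and report failures instead of crashing.

    Returns ("ok", result) on success or ("error", traceback_text) on failure.
    """
    try:
        # Deserialization sits inside the try block, so a corrupt payload is
        # reported the same way as an exception raised by the user function.
        data = pickle.loads(payload)
        return ("ok", func(data))
    except Exception:
        return ("error", traceback.format_exc())
```

With the try block covering only `func(data)`, a bad payload would escape as an unhandled exception; covering the whole body gives the caller one uniform error path.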
*	SPARK-1135: fix broken anchors in docs	Matei Zaharia	2014-02-26	1	-28/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	A recent PR that added Java vs Scala tabs for streaming also inadvertently added some bad code to a document.ready handler, breaking our other handler that manages scrolling to anchors correctly with the floating top bar. As a result the section title ended up always being hidden below the top bar. This removes the unnecessary JavaScript code. Author: Matei Zaharia <matei@databricks.com> Closes #3 from mateiz/doc-links and squashes the following commits: e2a3488 [Matei Zaharia] SPARK-1135: fix broken anchors in docs
*	SPARK-1078: Replace lift-json with json4s-jackson.	William Benton	2014-02-26	9	-24/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The aim of the Json4s project is to provide a common API for Scala JSON libraries. It is Apache-licensed, easier for downstream distributions to package, and mostly API-compatible with lift-json. Furthermore, the Jackson-backed implementation parses faster than lift-json on all but the smallest inputs. Author: William Benton <willb@redhat.com> Closes #582 from willb/json4s and squashes the following commits: 7ca62c4 [William Benton] Replace lift-json with json4s-jackson.
*	SPARK-1053. Don't require SPARK_YARN_APP_JAR	Sandy Ryza	2014-02-26	4	-12/+7
\| \| \| \| \| \| \| \| \| \| \| \|	It looks like this just requires taking out the checks. I verified that, with the patch, I was able to run spark-shell through yarn without setting the environment variable. Author: Sandy Ryza <sandy@cloudera.com> Closes #553 from sryza/sandy-spark-1053 and squashes the following commits: b037676 [Sandy Ryza] SPARK-1053. Don't require SPARK_YARN_APP_JAR
*	For SPARK-1082, Use Curator for ZK interaction in standalone cluster	Raymond Liu	2014-02-24	9	-300/+99
\| \| \| \| \| \| \| \| \| \| \|	Author: Raymond Liu <raymond.liu@intel.com> Closes #611 from colorant/curator and squashes the following commits: 7556aa1 [Raymond Liu] Address review comments af92e1f [Raymond Liu] Fix coding style 964f3c2 [Raymond Liu] Ignore NodeExists exception 6df2966 [Raymond Liu] Rewrite zookeeper client code with curator
*	Graph primitives2	Semih Salihoglu	2014-02-24	2	-10/+183
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Hi guys, I'm following Joey and Ankur's suggestions to add collectEdges and pickRandomVertex. I'm also adding the tests for collectEdges and refactoring one method getCycleGraph in GraphOpsSuite.scala. Thank you, semih Author: Semih Salihoglu <semihsalihoglu@gmail.com> Closes #580 from semihsalihoglu/GraphPrimitives2 and squashes the following commits: 937d3ec [Semih Salihoglu] - Fixed the scalastyle errors. a69a152 [Semih Salihoglu] - Adding collectEdges and pickRandomVertices. - Adding tests for collectEdges. - Refactoring a getCycle utility function for GraphOpsSuite.scala. 41265a6 [Semih Salihoglu] - Adding collectEdges and pickRandomVertex. - Adding tests for collectEdges. - Recycling a getCycle utility test file.
*	Include reference to twitter/chill in tuning docs	Andrew Ash	2014-02-24	1	-3/+6
\| \| \| \| \| \| \| \|	Author: Andrew Ash <andrew@andrewash.com> Closes #647 from ash211/doc-tuning and squashes the following commits: b87de0a [Andrew Ash] Include reference to twitter/chill in tuning docs
*	For outputformats that are Configurable, call setConf before sending data to them. [SPARK-1108]	Bryn Keller	2014-02-24	2	-1/+80
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This allows us to use, e.g. HBase's TableOutputFormat with PairRDDFunctions.saveAsNewAPIHadoopFile, which otherwise would throw NullPointerException because the output table name hasn't been configured. Note this bug also affects branch-0.9. Author: Bryn Keller <bryn.keller@intel.com> Closes #638 from xoltar/SPARK-1108 and squashes the following commits: 7e94e7d [Bryn Keller] Import, comment, and format cleanup per code review 7cbcaa1 [Bryn Keller] For outputformats that are Configurable, call setConf before sending data to them. This allows us to use, e.g. HBase TableOutputFormat, which otherwise would throw NullPointerException because the output table name hasn't been configured
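The pattern behind the SPARK-1108 commit above is "configure the output format, if it accepts configuration, before writing to it". A rough Python analogue follows. Everything in it is hypothetical (`TableOutputFormat` here is a stand-in class, `set_conf` and the `output.table` key are invented names); it does not use the Hadoop or HBase APIs.

```python
class TableOutputFormat:
    """Hypothetical stand-in for an output format that must be configured first."""

    def __init__(self):
        self.table_name = None

    def set_conf(self, conf):
        # Pick up the destination table from the configuration.
        self.table_name = conf["output.table"]

    def write(self, record):
        if self.table_name is None:
            raise RuntimeError("output table name has not been configured")
        return (self.table_name, record)


def save_records(output_format, conf, records):
    """Write records, passing `conf` to the format first when it supports that."""
    set_conf = getattr(output_format, "set_conf", None)
    if callable(set_conf):
        set_conf(conf)
    return [output_format.write(record) for record in records]
```

Formats without a `set_conf` method are written to unchanged, which mirrors the commit's restriction to output formats that are Configurable.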
*	Merge pull request #641 from mateiz/spark-1124-master	Matei Zaharia	2014-02-24	2	-14/+30
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \|	SPARK-1124: Fix infinite retries of reduce stage when a map stage failed In the previous code, if you had a failing map stage and then tried to run reduce stages on it repeatedly, the first reduce stage would fail correctly, but the later ones would mistakenly believe that all map outputs are available and start failing infinitely with fetch failures from "null". See https://spark-project.atlassian.net/browse/SPARK-1124 for an example. This PR also cleans up code style slightly where there was a variable named "s" and some weird map manipulation.
\| *	Fix removal from shuffleToMapStage to search for a key-value pair with	Matei Zaharia	2014-02-24	1	-2/+2
\| \| \| \| \| \| \| \|	our stage instead of using our shuffleID.
\| *	SPARK-1124: Fix infinite retries of reduce stage when a map stage failed	Matei Zaharia	2014-02-23	2	-14/+30
\|/ \| \| \| \| \| \| \|	In the previous code, if you had a failing map stage and then tried to run reduce stages on it repeatedly, the first reduce stage would fail correctly, but the later ones would mistakenly believe that all map outputs are available and start failing infinitely with fetch failures from "null".
*	SPARK-1071: Tidy logging strategy and use of log4j	Sean Owen	2014-02-23	11	-69/+57
	Prompted by a recent thread on the mailing list, I tried and failed to see if Spark can be made independent of log4j. There are a few cases where control of the underlying logging is pretty useful, and to do that, you have to bind to a specific logger.
	Instead I propose some tidying that leaves Spark's use of log4j, but gets rid of warnings and should still enable downstream users to switch. The idea is to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J directly when logging, and where Spark needs to output info (REPL and tests), bind from SLF4J to log4j.
	This leaves the same behavior in Spark. It means that downstream users who want to use something other than log4j should:
	- Exclude dependencies on log4j, slf4j-log4j12 from Spark
	- Include dependency on log4j-over-slf4j
	- Include dependency on another logger X, and another slf4j-X
	- Recreate any log config that Spark does, that is needed, in the other logger's config
	That sounds about right. Here are the key changes:
	- Include the jcl-over-slf4j shim everywhere by depending on it in core.
	- Exclude dependencies on commons-logging from third-party libraries.
	- Include the jul-to-slf4j shim everywhere by depending on it in core.
	- Exclude slf4j-* dependencies from third-party libraries to prevent collision or warnings
	- Added missing slf4j-log4j12 binding to GraphX, Bagel module tests
	And minor/incidental changes:
	- Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a recommended update over 1.7.2
	- (Remove a duplicate HBase dependency declaration in SparkBuild.scala)
	- (Remove a duplicate mockito dependency declaration that was causing warnings and bugging me)
	Author: Sean Owen <sowen@cloudera.com>
	Closes #570 from srowen/SPARK-1071 and squashes the following commits:
	52eac9f [Sean Owen] Add slf4j-over-log4j12 dependency to core (non-test) and remove it from things that depend on core.
	77a7fa9 [Sean Owen] SPARK-1071: Tidy logging strategy and use of log4j
*	[SPARK-1041] remove dead code in start script, remind user to set that in spark-env.sh	CodingCat	2014-02-22	3	-18/+1
	The lines in start-master.sh and start-slave.sh no longer work in EC2. The host name has changed, e.g.
	    ubuntu@ip-172-31-36-93:~$ hostname
	    ip-172-31-36-93
	Also, the URL to fetch the public DNS name has changed, e.g.
	    ubuntu@ip-172-31-36-93:~$ wget -q -O - http://instance-data.ec2.internal/latest/meta-data/public-hostname
	    ubuntu@ip-172-31-36-93:~$ (returns nothing)
	Since we have the spark-ec2 project, we don't need such EC2-specific lines here; instead, the user only needs to set it in spark-env.sh.
	Author: CodingCat <zhunansjtu@gmail.com>
	Closes #588 from CodingCat/deadcode_in_sbin and squashes the following commits:
	e4236e0 [CodingCat] remove dead code in start script, remind user set that in spark-env.sh
*	Migrate Java code to Scala or move it to src/main/java	Punya Biswal	2014-02-22	11	-92/+56
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	These classes can't be migrated: StorageLevels: impossible to create static fields in Scala JavaSparkContextVarargsWorkaround: incompatible varargs JavaAPISuite: should test Java APIs in pure Java (for sanity) Author: Punya Biswal <pbiswal@palantir.com> Closes #605 from punya/move-java-sources and squashes the following commits: 25b00b2 [Punya Biswal] Remove redundant type param; reformat 853da46 [Punya Biswal] Use factory method rather than constructor e5d53d9 [Punya Biswal] Migrate Java code to Scala or move it to src/main/java
*	[SPARK-1055] fix the SCALA_VERSION and SPARK_VERSION in docker file	CodingCat	2014-02-22	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	As reported in https://spark-project.atlassian.net/browse/SPARK-1055 "The used Spark version in the .../base/Dockerfile is stale on 0.8.1 and should be updated to 0.9.x to match the release." Author: CodingCat <zhunansjtu@gmail.com> Author: Nan Zhu <CodingCat@users.noreply.github.com> Closes #634 from CodingCat/SPARK-1055 and squashes the following commits: cb7330e [Nan Zhu] Update Dockerfile adf8259 [CodingCat] fix the SCALA_VERSION and SPARK_VERSION in docker file
*	doctest updated for mapValues, flatMapValues in rdd.py	jyotiska	2014-02-22	1	-0/+10
\| \| \| \| \| \| \| \| \| \|	Updated doctests for mapValues and flatMapValues in rdd.py Author: jyotiska <jyotiska123@gmail.com> Closes #621 from jyotiska/python_spark and squashes the following commits: 716f7cd [jyotiska] doctest updated for mapValues, flatMapValues in rdd.py
*	Fixed minor typo in worker.py	jyotiska	2014-02-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Fixed minor typo in worker.py Author: jyotiska <jyotiska123@gmail.com> Closes #630 from jyotiska/pyspark_code and squashes the following commits: ee44201 [jyotiska] typo fixed in worker.py
*	SPARK-1117: update accumulator docs	Xiangrui Meng	2014-02-21	2	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \|	The current doc hints spark doesn't support accumulators of type `Long`, which is wrong. JIRA: https://spark-project.atlassian.net/browse/SPARK-1117 Author: Xiangrui Meng <meng@databricks.com> Closes #631 from mengxr/acc and squashes the following commits: 45ecd25 [Xiangrui Meng] update accumulator docs
*	[SPARK-1113] External spilling - fix Int.MaxValue hash code collision bug	Andrew Or	2014-02-21	2	-38/+102
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The original poster of this bug is @guojc, who opened a PR that preceded this one at https://github.com/apache/incubator-spark/pull/612. ExternalAppendOnlyMap uses key hash code to order the buffer streams from which spilled files are read back into memory. When a buffer stream is empty, the default hash code for that stream is equal to Int.MaxValue. This is, however, a perfectly legitimate candidate for a key hash code. When reading from a spilled map containing such a key, a hash collision may occur, in which case we attempt to read from an empty stream and throw NoSuchElementException. The fix is to maintain the invariant that empty buffer streams are never added back to the merge queue to be considered. This guarantees that we never read from an empty buffer stream, ever again. This PR also includes two new tests for hash collisions. Author: Andrew Or <andrewor14@gmail.com> Closes #624 from andrewor14/spilling-bug and squashes the following commits: 9e7263d [Andrew Or] Slightly optimize next() 2037ae2 [Andrew Or] Move a few comments around... cf95942 [Andrew Or] Remove default value of Int.MaxValue for minKeyHash c11f03b [Andrew Or] Fix Int.MaxValue hash collision bug in ExternalAppendOnlyMap 21c1a39 [Andrew Or] Add hash collision tests to ExternalAppendOnlyMapSuite
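The invariant described in the SPARK-1113 commit above (exhausted streams are never put back on the merge queue, so no sentinel hash is needed) can be sketched with a heap-based k-way merge. This is a simplified Python illustration, not the ExternalAppendOnlyMap code: `merge_by_hash` is a hypothetical name, and it only orders entries rather than also combining values for equal keys.

```python
import heapq

INT_MAX = 2**31 - 1  # a legitimate key hash, so it cannot also mean "stream is empty"


def merge_by_hash(streams):
    """Merge iterables of (key_hash, value) pairs, each already sorted by key_hash.

    Only streams that still hold an element are kept in the heap, so an
    exhausted stream never needs a placeholder hash such as INT_MAX.
    """
    heap = []
    for index, stream in enumerate(streams):
        iterator = iter(stream)
        head = next(iterator, None)
        if head is not None:  # empty streams never enter the queue
            heapq.heappush(heap, (head[0], index, head[1], iterator))

    merged = []
    while heap:
        key_hash, index, value, iterator = heapq.heappop(heap)
        merged.append((key_hash, value))
        head = next(iterator, None)
        if head is not None:  # re-add the stream only while it is non-empty
            heapq.heappush(heap, (head[0], index, head[1], iterator))
    return merged
```

Each stream contributes at most one heap entry at a time and `index` is unique per stream, so tuple comparison is always decided by `(key_hash, index)` and never falls through to the values or iterators.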
*	MLLIB-25: Implicit ALS runs out of memory for moderately large numbers of features	Sean Owen	2014-02-21	1	-2/+2
	There's a step in implicit ALS where the matrix `Yt * Y` is computed. It's computed as the sum of matrices; an f x f matrix is created for each of n user/item rows in a partition. In `ALS.scala:214`:
	```
	factors.flatMapValues{ case factorArray =>
	  factorArray.map{ vector =>
	    val x = new DoubleMatrix(vector)
	    x.mmul(x.transpose())
	  }
	}.reduceByKeyLocally((a, b) => a.addi(b))
	 .values
	 .reduce((a, b) => a.addi(b))
	```
	Completely correct, but there's a subtle but quite large memory problem here. map() is going to create all of these matrices in memory at once, when they don't need to ever all exist at the same time. For example, if a partition has n = 100000 rows, and f = 200, then this intermediate product requires 32GB of heap. The computation will never work unless you can cough up workers with (more than) that much heap.
	Fortunately there's a trivial change that fixes it; just add `.view` in there.
	Author: Sean Owen <sowen@cloudera.com>
	Closes #629 from srowen/ALSMatrixAllocationOptimization and squashes the following commits:
	062cda9 [Sean Owen] Update style per review comments
	e9a5d63 [Sean Owen] Avoid unnecessary out of memory situation by not simultaneously allocating lots of matrices
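The memory argument in the MLLIB-25 commit above carries over to any language with lazy sequences. The following Python sketch uses a generator as a loose analogue of Scala's `.view`. It is not the ALS code, all function names are hypothetical, and plain nested lists stand in for `DoubleMatrix`.

```python
def outer(vector):
    """Return the f x f outer product of `vector` with itself as nested lists."""
    return [[a * b for b in vector] for a in vector]


def sum_matrices(matrices, f):
    """Accumulate matrices in place into a single f x f total."""
    total = [[0.0] * f for _ in range(f)]
    for matrix in matrices:
        for i in range(f):
            for j in range(f):
                total[i][j] += matrix[i][j]
    return total


def gram_eager(rows):
    """Materialize every outer product first; peak memory grows with len(rows)."""
    products = [outer(v) for v in rows]  # all n matrices are alive at once
    return sum_matrices(products, len(rows[0]))


def gram_lazy(rows):
    """Produce outer products on demand; only a constant number are alive."""
    products = (outer(v) for v in rows)  # generator: nothing is built yet
    return sum_matrices(products, len(rows[0]))


def eager_bytes(n, f, bytes_per_double=8):
    """Heap needed to hold n dense f x f matrices of doubles simultaneously."""
    return n * f * f * bytes_per_double
```

Both variants return the same matrix. The 32GB figure in the commit message matches `eager_bytes(100000, 200)`: 100000 matrices of 200 x 200 doubles at 8 bytes each is 32,000,000,000 bytes.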
*	SPARK-1111: URL Validation Throws Error for HDFS URL's	Patrick Wendell	2014-02-21	2	-9/+42
\| \| \| \| \| \| \| \| \| \|	Fixes an error where HDFS URL's cause an exception. Should be merged into master and 0.9. Author: Patrick Wendell <pwendell@gmail.com> Closes #625 from pwendell/url-validation and squashes the following commits: d14bfe3 [Patrick Wendell] SPARK-1111: URL Validation Throws Error for HDFS URL's
*	SPARK-1114: Allow PySpark to use existing JVM and Gateway	Ahir Reddy	2014-02-20	2	-10/+22
\| \| \| \| \| \| \| \| \| \|	Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization. Author: Ahir Reddy <ahirreddy@gmail.com> Closes #622 from ahirreddy/pyspark-existing-jvm and squashes the following commits: a86f457 [Ahir Reddy] Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.
*	Super minor: Add require for mergeCombiners in combineByKey	Aaron Davidson	2014-02-20	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	We changed the behavior in 0.9.0 from requiring that mergeCombiners be null when mapSideCombine was false to requiring that mergeCombiners never be null, for external sorting. This patch adds a require() to make this behavior change explicitly messaged rather than resulting in a NPE. Author: Aaron Davidson <aaron@databricks.com> Closes #623 from aarondav/master and squashes the following commits: 520b80c [Aaron Davidson] Super minor: Add require for mergeCombiners in combineByKey
*	MLLIB-22. Support negative implicit input in ALS	Sean Owen	2014-02-19	3	-21/+52
	I'm back with another less trivial suggestion for ALS:
	In ALS for implicit feedback, input values are treated as weights on squared-errors in a loss function (or rather, the weight is a simple function of the input r, like c = 1 + alpha*r). The paper on which it's based assumes that the input is positive. Indeed, if the input is negative, it will create a negative weight on squared-errors, which causes things to go haywire. The optimization will try to make the error in a cell as large as possible, and the result is silently bogus.
	There is a good use case for negative input values though. Implicit feedback is usually collected from signals of positive interaction like a view or like or buy, but equally, can come from "not interested" signals. The natural representation is negative values.
	The algorithm can be extended quite simply to provide a sound interpretation of these values: negative values should encourage the factorization to come up with 0 for cells with large negative input values, just as much as positive values encourage it to come up with 1.
	The implications for the algorithm are simple:
	* the confidence function value must not be negative, and so can become 1 + alpha*|r|
	* the matrix P should have a value 1 where the input R is _positive_, not merely where it is non-zero. Actually, that's what the paper already says, it's just that we can't assume P = 1 when a cell in R is specified anymore, since it may be negative
	This in turn entails just a few lines of code change in `ALS.scala`:
	* `rs(i)` becomes `abs(rs(i))`
	* When constructing `userXy(us(i))`, it's implicitly only adding where P is 1. That had been true for any us(i) that is iterated over, before, since these are exactly the ones for which P is 1. But now P is zero where rs(i) <= 0, and should not be added
	I think it's a safe change because:
	* It doesn't change any existing behavior (unless you're using negative values, in which case results are already borked)
	* It's the simplest direct extension of the paper's algorithm
	* (I've used it to good effect in production FWIW)
	Tests included.
	I tweaked minor things en route:
	* `ALS.scala` javadoc writes "R = Xt*Y" when the paper and rest of code defines it as "R = X*Yt"
	* RMSE in the ALS tests uses a confidence-weighted mean, but the denominator is not actually sum of weights
	Excuse my Scala style; I'm sure it needs tweaks.
	Author: Sean Owen <sowen@cloudera.com>
	Closes #500 from srowen/ALSNegativeImplicitInput and squashes the following commits:
	cf902a9 [Sean Owen] Support negative implicit input in ALS
	953be1c [Sean Owen] Make weighted RMSE in ALS test actually weighted; adjust comment about R = X*Yt
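The two rules stated in the MLLIB-22 commit above (confidence uses the absolute value of the input, preference is 1 only for strictly positive input) can be written out per cell. This is an illustrative Python sketch with hypothetical function names. It leaves out regularization and the actual matrix solve, and it is not the code in `ALS.scala`.

```python
def confidence(r, alpha):
    """Confidence weight c = 1 + alpha * |r|; never negative for alpha >= 0."""
    return 1.0 + alpha * abs(r)


def preference(r):
    """Binary preference p: 1 only where the input is strictly positive."""
    return 1.0 if r > 0 else 0.0


def weighted_squared_error(r, prediction, alpha):
    """One cell's contribution to the loss: c * (p - prediction)^2."""
    return confidence(r, alpha) * (preference(r) - prediction) ** 2
```

With these definitions a strongly negative input such as `r = -3` gets the same weight as `r = 3`, but it pulls the prediction toward 0 instead of 1, which is the interpretation the commit message argues for.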
*	MLLIB-24: url of "Collaborative Filtering for Implicit Feedback Datasets" in ALS is invalid now	Chen Chao	2014-02-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The URL of "Collaborative Filtering for Implicit Feedback Datasets" is no longer valid. A new URL is provided: http://research.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf Author: Chen Chao <crazyjvm@gmail.com> Closes #619 from CrazyJvm/master and squashes the following commits: a0b54e4 [Chen Chao] change url to IEEE 9e0e9f0 [Chen Chao] correct spell mistale fcfab5d [Chen Chao] wrap line to to fit within 100 chars 590d56e [Chen Chao] url error
*	[SPARK-1105] fix site scala version error in docs	CodingCat	2014-02-19	8	-26/+27
\| \| \| \| \| \| \| \| \| \| \| \| \|	https://spark-project.atlassian.net/browse/SPARK-1105 fix site scala version error Author: CodingCat <zhunansjtu@gmail.com> Closes #618 from CodingCat/doc_version and squashes the following commits: 39bb8aa [CodingCat] more fixes 65bedb0 [CodingCat] fix site scala version error in doc
*	SPARK-1106: check key name and identity file before launch a cluster	Xiangrui Meng	2014-02-18	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \|	I launched an EC2 cluster without providing a key name and an identity file. The error showed up after two minutes. It would be good to check those options before launch, given the fact that EC2 billing rounds up to hours. JIRA: https://spark-project.atlassian.net/browse/SPARK-1106 Author: Xiangrui Meng <meng@databricks.com> Closes #617 from mengxr/ec2 and squashes the following commits: 2dfb316 [Xiangrui Meng] check key name and identity file before launch a cluster
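A fail-fast check of the kind described in the SPARK-1106 commit above could be as small as the following. This is a hedged sketch: `check_launch_options` and its messages are hypothetical, and the real `spark_ec2` script may structure its option handling differently.

```python
def check_launch_options(key_pair, identity_file):
    """Return a list of problems; an empty list means launching can proceed."""
    problems = []
    if not key_pair:
        problems.append("a key pair name is required to launch a cluster")
    if not identity_file:
        problems.append("an identity file is required to launch a cluster")
    return problems
```

Calling this before any instance is requested turns a failure that would otherwise surface minutes into the launch into an immediate error, before any billable instance time is incurred.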
*	Revert "[SPARK-1105] fix site scala version error in doc"	Patrick Wendell	2014-02-18	1	-1/+1
\| \| \| \|	This reverts commit d99773d5bba674cc1434c86435b6d9b3739314c8.
*	[SPARK-1105] fix site scala version error in doc	CodingCat	2014-02-18	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	https://spark-project.atlassian.net/browse/SPARK-1105 fix site scala version error Author: CodingCat <zhunansjtu@gmail.com> Closes #616 from CodingCat/doc_version and squashes the following commits: eafd99a [CodingCat] fix site scala version error in doc
*	Optimized imports	NirmalReddy	2014-02-18	246	-552/+446
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Optimized imports and arranged according to scala style guide @ https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports Author: NirmalReddy <nirmal.reddy@imaginea.com> Author: NirmalReddy <nirmal_reddy2000@yahoo.com> Closes #613 from NirmalReddy/opt-imports and squashes the following commits: 578b4f5 [NirmalReddy] imported java.lang.Double as JDouble a2cbcc5 [NirmalReddy] addressed the comments 776d664 [NirmalReddy] Optimized imports in core
*	SPARK-1098: Minor cleanup of ClassTag usage in Java API	Aaron Davidson	2014-02-17	4	-100/+108
\| \| \| \| \| \| \| \| \| \|	Our usage of fake ClassTags in this manner is probably not healthy, but I'm not sure if there's a better solution available, so I just cleaned up and documented the current one. Author: Aaron Davidson <aaron@databricks.com> Closes #604 from aarondav/master and squashes the following commits: b398e89 [Aaron Davidson] SPARK-1098: Minor cleanup of ClassTag usage in Java API
*	[SPARK-1090] improvement on spark_shell (help information, configure memory)	CodingCat	2014-02-17	2	-7/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	https://spark-project.atlassian.net/browse/SPARK-1090 spark-shell should print help information about its parameters and should allow the user to configure executor memory. There is no documentation about how to set --cores/-c in spark-shell, and users should also be able to set executor memory through command-line options. In this PR I also check the format of the options passed by the user. Author: CodingCat <zhunansjtu@gmail.com> Closes #599 from CodingCat/spark_shell_improve and squashes the following commits: de5aa38 [CodingCat] add parameter to set driver memory 915cbf8 [CodingCat] improvement on spark_shell (help information, configure memory)
*	Fix typos in Spark Streaming programming guide	Andrew Or	2014-02-17	1	-14/+13
\| \| \| \| \| \| \| \| \| \| \| \|	Author: Andrew Or <andrewor14@gmail.com> Closes #536 from andrewor14/streaming-typos and squashes the following commits: a05faa6 [Andrew Or] Fix broken link and wording bc2e4bc [Andrew Or] Merge github.com:apache/incubator-spark into streaming-typos d5515b4 [Andrew Or] TD's comments 767ef12 [Andrew Or] Fix broken links 8f4c731 [Andrew Or] Fix typos in programming guide
*	Worker registration logging fix	Andrew Ash	2014-02-17	1	-1/+1
\| \| \| \| \| \| \| \|	Author: Andrew Ash <andrew@andrewash.com> Closes #608 from ash211/patch-7 and squashes the following commits: bd85f2a [Andrew Ash] Worker registration logging fix
*	Add subtractByKey to the JavaPairRDD wrapper	Punya Biswal	2014-02-16	1	-0/+23
\| \| \| \| \| \| \| \| \|	Author: Punya Biswal <pbiswal@palantir.com> Closes #600 from punya/subtractByKey-java and squashes the following commits: e961913 [Punya Biswal] Hide implicit ClassTags from Java API c5d317b [Punya Biswal] Add subtractByKey to the JavaPairRDD wrapper
*	fix for https://spark-project.atlassian.net/browse/SPARK-1052	Bijay Bisht	2014-02-16	1	-7/+2
\| \| \| \| \| \| \| \| \| \| \| \|	Author: Bijay Bisht <bijay.bisht@gmail.com> Closes #568 from bijaybisht/SPARK-1052 and squashes the following commits: da70395 [Bijay Bisht] fix for https://spark-project.atlassian.net/browse/SPARK-1052 - comments incorporated fdb1d94 [Bijay Bisht] fix for https://spark-project.atlassian.net/browse/SPARK-1052 (cherry picked from commit e797c1abd9692f1b7ec290e4c83d31fd106e6b05) Signed-off-by: Aaron Davidson <aaron@databricks.com>
*	[SPARK-1092] print warning information if user use SPARK_MEM to regulate executor memory usage	CodingCat	2014-02-16	1	-0/+5
	https://spark-project.atlassian.net/browse/SPARK-1092?jql=project%20%3D%20SPARK
	Print warning information if the user sets SPARK_MEM to regulate the memory usage of executors.
	---- OUTDATED:
	Currently, users will usually set SPARK_MEM to control the memory usage of driver programs (in spark-class):
	    JAVA_OPTS="$OUR_JAVA_OPTS"
	    JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
	    JAVA_OPTS="$JAVA_OPTS -Xms$SPARK_MEM -Xmx$SPARK_MEM"
	If they didn't set spark.executor.memory, the value of this environment variable will also affect the memory usage of executors, because of the following lines in SparkContext:
	    private[spark] val executorMemory = conf.getOption("spark.executor.memory")
	      .orElse(Option(System.getenv("SPARK_MEM")))
	      .map(Utils.memoryStringToMb)
	      .getOrElse(512)
	Also, since SPARK_MEM has been (proposed to be) deprecated in SPARK-929 (https://spark-project.atlassian.net/browse/SPARK-929) and the corresponding PR (https://github.com/apache/incubator-spark/pull/104), we should remove this line.
	Author: CodingCat <zhunansjtu@gmail.com>
	Closes #602 from CodingCat/clean_spark_mem and squashes the following commits:
	302bb28 [CodingCat] print warning information if user use SPARK_MEM to regulate executor memory usage
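The lookup order quoted in the SPARK-1092 commit above (explicit `spark.executor.memory`, then `SPARK_MEM`, then 512) together with the proposed warning can be mirrored in Python as follows. This is a sketch under assumptions: `memory_string_to_mb` is a deliberately simplified parser that only accepts `m` and `g` suffixes and is not a port of `Utils.memoryStringToMb`, and the warning text is invented.

```python
import warnings

DEFAULT_EXECUTOR_MB = 512


def memory_string_to_mb(text):
    """Parse strings such as '512m' or '2g' into megabytes (simplified parser)."""
    lower = text.strip().lower()
    if lower.endswith("g"):
        return int(lower[:-1]) * 1024
    if lower.endswith("m"):
        return int(lower[:-1])
    raise ValueError("unsupported memory string: " + text)


def executor_memory_mb(conf, env):
    """Resolve executor memory: explicit setting, then SPARK_MEM, then default."""
    value = conf.get("spark.executor.memory")
    if value is None:
        value = env.get("SPARK_MEM")
        if value is not None:
            # The fallback is what the commit wants users to be told about.
            warnings.warn("SPARK_MEM is being used to size executors; "
                          "set spark.executor.memory instead")
    if value is None:
        return DEFAULT_EXECUTOR_MB
    return memory_string_to_mb(value)
```

Passing `conf` and `env` as plain dictionaries keeps the sketch testable without touching the real process environment.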
*	Typo: Standlone -> Standalone	Andrew Ash	2014-02-14	3	-5/+5
\| \| \| \| \| \| \| \| \| \|	Author: Andrew Ash <andrew@andrewash.com> Closes #601 from ash211/typo and squashes the following commits: 9cd43ac [Andrew Ash] Change docs references to metrics.properties, not metrics.conf 3813ff1 [Andrew Ash] Typo: mulitcast -> multicast 873bd2f [Andrew Ash] Typo: Standlone -> Standalone
*	Merge pull request #598 from shivaram/master.	Shivaram Venkataraman	2014-02-13	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	Update spark_ec2 to use 0.9.0 by default Backports change from branch-0.9 Author: Shivaram Venkataraman <shivaram@eecs.berkeley.edu> Closes #598 and squashes the following commits: f6d3ed0 [Shivaram Venkataraman] Update spark_ec2 to use 0.9.0 by default Backports change from branch-0.9
*	Add c3 instance types to Spark EC2	Christian Lundgren	2014-02-13	1	-2/+12
\| \| \| \| \| \| \| \| \| \| \| \| \|	The number of disks for the c3 instance types taken from here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#StorageOnInstanceTypes Author: Christian Lundgren <christian.lundgren@gameanalytics.com> Closes #595 from chrisavl/branch-0.9 and squashes the following commits: c8af5f9 [Christian Lundgren] Add c3 instance types to Spark EC2 (cherry picked from commit 19b4bb2b444f1dbc4592bf3d58b17652e0ae6d6b) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
*	Ported hadoopClient jar for < 1.0.1 fix	Bijay Bisht	2014-02-12	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	#522 got messed up after I rewrote the branch hadoop_jar_name, so I created a new one. Author: Bijay Bisht <bijay.bisht@gmail.com> Closes #584 from bijaybisht/hadoop_jar_name_on_0.9.0 and squashes the following commits: 1b6fb3c [Bijay Bisht] Ported hadoopClient jar for < 1.0.1 fix (cherry picked from commit 8093de1bb319e86dcf0d6d8d97b043a2bc1aa8f2) Signed-off-by: Patrick Wendell <pwendell@gmail.com>