spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-5321] Support for transposing local matrices	Burak Yavuz	2015-01-27	6	-282/+436
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Support for transposing local matrices added. The `.transpose` function creates a new object re-using the backing array(s) but switches `numRows` and `numCols`. Operations check the flag `.isTransposed` to see whether the indexing in `values` should be modified. This PR will pave the way for transposing `BlockMatrix`. Author: Burak Yavuz <brkyvz@gmail.com> Closes #4109 from brkyvz/SPARK-5321 and squashes the following commits: 87ab83c [Burak Yavuz] fixed scalastyle caf4438 [Burak Yavuz] addressed code review v3 c524770 [Burak Yavuz] address code review comments 2 77481e8 [Burak Yavuz] fixed MiMa f1c1742 [Burak Yavuz] small refactoring ccccdec [Burak Yavuz] fixed failed test dd45c88 [Burak Yavuz] addressed code review a01bd5f [Burak Yavuz] [SPARK-5321] Fixed MiMa issues 2a63593 [Burak Yavuz] [SPARK-5321] fixed bug causing failed gemm test c55f29a [Burak Yavuz] [SPARK-5321] Support for transposing local matrices cleaned up c408c05 [Burak Yavuz] [SPARK-5321] Support for transposing local matrices added
*	[SPARK-5419][Mllib] Fix the logic in Vectors.sqdist	Liang-Chi Hsieh	2015-01-27	1	-7/+12
\| \| \| \| \| \| \| \| \| \| \|	The current implementation in Vectors.sqdist is not efficient because of allocating temp arrays. There is also a bug in the code `v1.indices.length / v1.size < 0.5`. This pr fixes the bug and refactors sqdist without allocating new arrays. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4217 from viirya/fix_sqdist and squashes the following commits: e8b0b3d [Liang-Chi Hsieh] For review comments. 314c424 [Liang-Chi Hsieh] Fix sqdist bug.
*	[SPARK-3726] [MLlib] Allow sampling_rate not equal to 1.0 in RandomForests	MechCoder	2015-01-26	3	-11/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I've added support for sampling_rate not equal to 1.0 . I have two major questions. 1. A Scala style test is failing, since the number of parameters now exceed 10. 2. I would like suggestions to understand how to test this. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #4073 from MechCoder/spark-3726 and squashes the following commits: 8012fb2 [MechCoder] Add test in Strategy e0e0d9c [MechCoder] TST: Add better test d1df1b2 [MechCoder] Add test to verify subsampling behavior a7bfc70 [MechCoder] [SPARK-3726] Allow sampling_rate not equal to 1.0
*	[SPARK-5119] java.lang.ArrayIndexOutOfBoundsException on trying to train...	lewuathe	2015-01-26	3	-0/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	... decision tree model Labels loaded from libsvm files are mapped to 0.0 if they are negative labels because they should be nonnegative value. Author: lewuathe <lewuathe@me.com> Closes #3975 from Lewuathe/map-negative-label-to-positive and squashes the following commits: 12d1d59 [lewuathe] [SPARK-5119] Fix code styles 6d9a18a [lewuathe] [SPARK-5119] Organize test codes 62a150c [lewuathe] [SPARK-5119] Modify Impurities throw exceptions with negatie labels 3336c21 [lewuathe] [SPARK-5119] java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model
*	[SPARK-5052] Add common/base classes to fix guava methods signatures.	Elmer Garduno	2015-01-26	2	-0/+4
\| \| \| \| \| \| \| \| \| \|	Fixes problems with incorrect method signatures related to shaded classes. For discussion see the jira issue. Author: Elmer Garduno <elmerg@google.com> Closes #3874 from elmer-garduno/fix_guava_signatures and squashes the following commits: aa5d8e0 [Elmer Garduno] Unshade common/base[Function\|Supplier] classes to fix guava methods signatures.
*	SPARK-960 [CORE] [TEST] JobCancellationSuite "two jobs sharing the same ↵	Sean Owen	2015-01-26	1	-8/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	stage" is broken This reenables and fixes this test, after addressing two issues: - The Semaphore that was intended to be shared locally was being serialized and copied; it's now a static member in the companion object as in other tests - Later changes to Spark means that cancelling the first task will not cancel the shared stage and therefore the second task should succeed Author: Sean Owen <sowen@cloudera.com> Closes #4180 from srowen/SPARK-960 and squashes the following commits: 43da66f [Sean Owen] Fix 'two jobs sharing the same stage' test and reenable it: truly share a Semaphore locally as intended, and update expectation of failure in non-cancelled task
*	Fix command spaces issue in make-distribution.sh	David Y. Ross	2015-01-26	1	-4/+7
\| \| \| \| \| \| \| \| \| \| \| \|	Storing command in variables is tricky in bash, use an array to handle all issues with spaces, quoting, etc. See: http://mywiki.wooledge.org/BashFAQ/050 Author: David Y. Ross <dyross@gmail.com> Closes #4126 from dyross/dyr-fix-make-distribution and squashes the following commits: 4ce522b [David Y. Ross] Fix command spaces issue in make-distribution.sh
*	SPARK-4147 [CORE] Reduce log4j dependency	Sean Owen	2015-01-26	1	-9/+11
\| \| \| \| \| \| \| \| \| \|	Defer use of log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework. The only change is to push one half of the check in the original `if` condition inside. This is a trivial change, may or may not actually solve a problem, but I think it's all that makes sense to do for SPARK-4147. Author: Sean Owen <sowen@cloudera.com> Closes #4190 from srowen/SPARK-4147 and squashes the following commits: 4e99942 [Sean Owen] Defer use of log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework.
*	[SPARK-5339][BUILD] build/mvn doesn't work because of invalid URL for ↵	Kousuke Saruta	2015-01-26	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	maven's tgz. build/mvn will automatically download tarball of maven. But currently, the URL is invalid. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #4124 from sarutak/SPARK-5339 and squashes the following commits: 6e96121 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5339 0e012d1 [Kousuke Saruta] Updated Maven version to 3.2.5 ca26499 [Kousuke Saruta] Fixed URL of the tarball of Maven
*	[SPARK-5355] use j.u.c.ConcurrentHashMap instead of TrieMap	Davies Liu	2015-01-26	3	-21/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	j.u.c.ConcurrentHashMap is more battle tested. cc rxin JoshRosen pwendell Author: Davies Liu <davies@databricks.com> Closes #4208 from davies/safe-conf and squashes the following commits: c2182dc [Davies Liu] address comments, fix tests 3a1d821 [Davies Liu] fix test da14ced [Davies Liu] Merge branch 'master' of github.com:apache/spark into safe-conf ae4d305 [Davies Liu] change to j.u.c.ConcurrentMap f8fa1cf [Davies Liu] change to TrieMap a1d769a [Davies Liu] make SparkConf thread-safe
*	[SPARK-5384][mllib] Vectors.sqdist returns inconsistent results for ↵	Yuhao Yang	2015-01-25	1	-5/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	sparse/dense vectors when the vectors have different lengths JIRA issue: https://issues.apache.org/jira/browse/SPARK-5384 Currently `Vectors.sqdist` return inconsistent result for sparse/dense vectors when the vectors have different lengths, please refer to JIRA for sample PR scope: Unify the sqdist logic for dense/sparse vectors and fix the inconsistency, also remove the possible sparse to dense conversion in the original code. For reviewers: Maybe we should first discuss what's the correct behavior. 1. Vectors for sqdist must have the same length, like in breeze? 2. If they can have different lengths, what's the correct result for sqdist? (should the extra part get into calculation?) I'll update PR with more optimization and additional ut afterwards. Thanks. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #4183 from hhbyyh/fixDouble and squashes the following commits: 1f17328 [Yuhao Yang] limit PR scope to size constraints only 54cbf97 [Yuhao Yang] fix Vectors.sqdist inconsistence
*	[SPARK-5268] don't stop CoarseGrainedExecutorBackend for irrelevant ↵	CodingCat	2015-01-25	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	DisassociatedEvent https://issues.apache.org/jira/browse/SPARK-5268 In CoarseGrainedExecutorBackend, we subscribe DisassociatedEvent in executor backend actor and exit the program upon receive such event... let's consider the following case The user may develop an Akka-based program which starts the actor with Spark's actor system and communicate with an external actor system (e.g. an Akka-based receiver in spark streaming which communicates with an external system) If the external actor system fails or disassociates with the actor within spark's system with purpose, we may receive DisassociatedEvent and the executor is restarted. This is not the expected behavior..... ---- This is a simple fix to check the event before making the quit decision Author: CodingCat <zhunansjtu@gmail.com> Closes #4063 from CodingCat/SPARK-5268 and squashes the following commits: 4d7d48e [CodingCat] simplify the log 18c36f4 [CodingCat] more descriptive log f299e0b [CodingCat] clean log 1632e79 [CodingCat] check whether DisassociatedEvent is relevant before quit
*	SPARK-4430 [STREAMING] [TEST] Apache RAT Checks fail spuriously on test files	Sean Owen	2015-01-25	1	-7/+2
\| \| \| \| \| \| \| \| \| \|	Another trivial one. The RAT failure was due to temp files from `FailureSuite` not being cleaned up. This just makes the cleanup more reliable by using the standard temp dir mechanism. Author: Sean Owen <sowen@cloudera.com> Closes #4189 from srowen/SPARK-4430 and squashes the following commits: 9ea63ff [Sean Owen] Properly acquire a temp directory to ensure it is cleaned up at shutdown, which helps avoid a RAT check failure
*	[SPARK-5326] Show fetch wait time as optional metric in the UI	Kay Ousterhout	2015-01-25	4	-5/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With this change, here's what the UI looks like: ![image](https://cloud.githubusercontent.com/assets/1108612/5809994/1ec8a904-9ff4-11e4-8f24-6a59a1a858f7.png) If you want to locally test this, you need to spin up multiple executors, because the shuffle read metrics are only shown for data read remotely. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #4110 from kayousterhout/SPARK-5326 and squashes the following commits: 610051e [Kay Ousterhout] Josh style comments 5feaa28 [Kay Ousterhout] What is the difference here?? aa129cb [Kay Ousterhout] Removed inadvertent change 721c742 [Kay Ousterhout] Improved tooltip f3a7111 [Kay Ousterhout] Style fix 679b4e9 [Kay Ousterhout] [SPARK-5326] Show fetch wait time as optional metric in the UI
*	[SPARK-5344][WebUI] HistoryServer cannot recognize that inprogress file was ↵	Kousuke Saruta	2015-01-25	2	-1/+26
\| \| \| \| \| \| \| \| \| \| \| \| \|	renamed to completed file `FsHistoryProvider` tries to update application status but if `checkForLogs` is called before `.inprogress` file is renamed to completed file, the file is not recognized as completed. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #4132 from sarutak/SPARK-5344 and squashes the following commits: 9658008 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5344 d2c72b6 [Kousuke Saruta] Fixed update issue of FsHistoryProvider
*	SPARK-4506 [DOCS] Addendum: Update more docs to reflect that standalone ↵	Sean Owen	2015-01-25	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	works in cluster mode This is a trivial addendum to SPARK-4506, which was already resolved. noted by Asim Jalis in SPARK-4506. Author: Sean Owen <sowen@cloudera.com> Closes #4160 from srowen/SPARK-4506 and squashes the following commits: 5f5f7df [Sean Owen] Update more docs to reflect that standalone works in cluster mode
*	SPARK-5382: Use SPARK_CONF_DIR in spark-class if it is defined	Jacek Lewandowski	2015-01-25	1	-2/+3
\| \| \| \| \| \| \| \|	Author: Jacek Lewandowski <lewandowski.jacek@gmail.com> Closes #4179 from jacek-lewandowski/SPARK-5382-1.3 and squashes the following commits: 55d7791 [Jacek Lewandowski] SPARK-5382: Use SPARK_CONF_DIR in spark-class if it is defined
*	SPARK-3782 [CORE] Direct use of log4j in AkkaUtils interferes with certain ↵	Sean Owen	2015-01-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	logging configurations Although the underlying issue can I think be solved by having user code use slf4j 1.7.6+, it might be helpful and consistent to update Spark's slf4j too. I see no reason to believe it would be incompatible with other 1.7.x releases: http://www.slf4j.org/news.html Lots of different version of slf4j are in use in the wild and anecdotally I have never seen an issue mixing them. Author: Sean Owen <sowen@cloudera.com> Closes #4184 from srowen/SPARK-3782 and squashes the following commits: 5608d28 [Sean Owen] Update slf4j to 1.7.10
*	SPARK-3852 [DOCS] Document spark.driver.extra* configs	Sean Owen	2015-01-25	1	-0/+21
\| \| \| \| \| \| \| \| \| \|	As per the JIRA. I copied the `spark.executor.extra` text, but removed info that appears to be specific to the `executor` config and not `driver`. Author: Sean Owen <sowen@cloudera.com> Closes #4185 from srowen/SPARK-3852 and squashes the following commits: f60a8a1 [Sean Owen] Document spark.driver.extra configs
*	[SPARK-5402] log executor ID at executor-construction time	Ryan Williams	2015-01-25	1	-4/+7
\| \| \| \| \| \| \| \| \| \|	also rename "slaveHostname" to "executorHostname" Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #4195 from ryan-williams/exec and squashes the following commits: e60a7bb [Ryan Williams] log executor ID at executor-construction time
*	[SPARK-5401] set executor ID before creating MetricsSystem	Ryan Williams	2015-01-25	2	-2/+6
\| \| \| \| \| \| \| \|	Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #4194 from ryan-williams/metrics and squashes the following commits: 7c5a33f [Ryan Williams] set executor ID before creating MetricsSystem
*	Add comment about defaultMinPartitions	Idan Zalzberg	2015-01-25	1	-1/+5
\| \| \| \| \| \| \| \| \| \|	Added a comment about using math.min for choosing default partition count Author: Idan Zalzberg <idanzalz@gmail.com> Closes #4102 from idanz/patch-2 and squashes the following commits: 50e9d58 [Idan Zalzberg] Update SparkContext.scala
*	Closes #4157	Reynold Xin	2015-01-25	0	-0/+0
\|
*	[SPARK-5214][Test] Add a test to demonstrate EventLoop can be stopped in the ↵	zsxwing	2015-01-24	1	-4/+22
\| \| \| \| \| \| \| \| \| \| \|	event thread Author: zsxwing <zsxwing@gmail.com> Closes #4174 from zsxwing/SPARK-5214-unittest and squashes the following commits: 443e564 [zsxwing] Change the check interval to 5ms 7aaa2d7 [zsxwing] Add a test to demonstrate EventLoop can be stopped in the event thread
*	[SPARK-5058] Part 2. Typos and broken URL	Jongyoul Lee	2015-01-23	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	- Also fixed java link Author: Jongyoul Lee <jongyoul@gmail.com> Closes #4172 from jongyoul/SPARK-FIXDOC and squashes the following commits: 6be03e5 [Jongyoul Lee] [SPARK-5058] Part 2. Typos and broken URL - Also fixed java link
*	[SPARK-5351][GraphX] Do not use Partitioner.defaultPartitioner as a ↵	Takeshi Yamamuro	2015-01-23	2	-2/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	partitioner of EdgeRDDImp... If the value of 'spark.default.parallelism' does not match the number of partitoins in EdgePartition(EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72; object GraphTest extends Logging { def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = { graph.aggregateMessages( ctx => { ctx.sendToSrc(1) ctx.sendToDst(2) }, _ + _) } } val g = GraphLoader.edgeListFile(sc, "graph.txt") val rdd = GraphTest.run(g) java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191) ... Author: Takeshi Yamamuro <linguin.m.s@gmail.com> Closes #4136 from maropu/EdgePartitionBugFix and squashes the following commits: 0cd8942 [Ankur Dave] Use more concise getOrElse aad4a2c [Ankur Dave] Add unit test for non-default number of edge partitions 0a2f32b [Takeshi Yamamuro] Do not use Partitioner.defaultPartitioner as a partitioner of EdgeRDDImpl
*	[SPARK-5063] More helpful error messages for several invalid operations	Josh Rosen	2015-01-23	6	-14/+138
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds more helpful error messages for invalid programs that define nested RDDs, broadcast RDDs, perform actions inside of transformations (e.g. calling `count()` from inside of `map()`), and call certain methods on stopped SparkContexts. Currently, these invalid programs lead to confusing NullPointerExceptions at runtime and have been a major source of questions on the mailing list and StackOverflow. In a few cases, I chose to log warnings instead of throwing exceptions in order to avoid any chance that this patch breaks programs that worked "by accident" in earlier Spark releases (e.g. programs that define nested RDDs but never run any jobs with them). In SparkContext, the new `assertNotStopped()` method is used to check whether methods are being invoked on a stopped SparkContext. In some cases, user programs will not crash in spite of calling methods on stopped SparkContexts, so I've only added `assertNotStopped()` calls to methods that always throw exceptions when called on stopped contexts (e.g. by dereferencing a null `dagScheduler` pointer). Author: Josh Rosen <joshrosen@databricks.com> Closes #3884 from JoshRosen/SPARK-5063 and squashes the following commits: a38774b [Josh Rosen] Fix spelling typo a943e00 [Josh Rosen] Convert two exceptions into warnings in order to avoid breaking user programs in some edge-cases. 2d0d7f7 [Josh Rosen] Fix test to reflect 1.2.1 compatibility 3f0ea0c [Josh Rosen] Revert two unintentional formatting changes 8e5da69 [Josh Rosen] Remove assertNotStopped() calls for methods that were sometimes safe to call on stopped SC's in Spark 1.2 8cff41a [Josh Rosen] IllegalStateException fix 6ef68d0 [Josh Rosen] Fix Python line length issues. 9f6a0b8 [Josh Rosen] Add improved error messages to PySpark. 13afd0f [Josh Rosen] SparkException -> IllegalStateException 8d404f3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-5063 b39e041 [Josh Rosen] Fix BroadcastSuite test which broadcasted an RDD 99cc09f [Josh Rosen] Guard against calling methods on stopped SparkContexts. 34833e8 [Josh Rosen] Add more descriptive error message. 57cc8a1 [Josh Rosen] Add error message when directly broadcasting RDD. 15b2e6b [Josh Rosen] [SPARK-5063] Useful error messages for nested RDDs and actions inside of transformations
*	[SPARK-3541][MLLIB] New ALS implementation with improved storage	Xiangrui Meng	2015-01-22	5	-2/+1584
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR adds a new ALS implementation to `spark.ml` using the pipeline API, which should be able to scale to billions of ratings. Compared with the ALS under `spark.mllib`, the new implementation 1. uses the same algorithm, 2. uses float type for ratings, 3. uses primitive arrays to avoid GC, 4. sorts and compresses ratings on each block so that we can solve least squares subproblems one by one using only one normal equation instance. The following figure shows performance comparison on copies of the Amazon Reviews dataset using a 16-node (m3.2xlarge) EC2 cluster (the same setup as in http://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html): ![als-wip](https://cloud.githubusercontent.com/assets/829644/5659447/4c4ff8e0-96c7-11e4-87a9-73c1c63d07f3.png) I keep the `spark.mllib`'s ALS untouched for easy comparison. If the new implementation works well, I'm going to match the features of the ALS under `spark.mllib` and then make it a wrapper of the new implementation, in a separate PR. TODO: - [X] Add unit tests for implicit preferences. Author: Xiangrui Meng <meng@databricks.com> Closes #3720 from mengxr/SPARK-3541 and squashes the following commits: 1b9e852 [Xiangrui Meng] fix compile 5129be9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3541 dd0d0e8 [Xiangrui Meng] simplify test code c627de3 [Xiangrui Meng] add tests for implicit feedback b84f41c [Xiangrui Meng] address comments a76da7b [Xiangrui Meng] update ALS tests 2a8deb3 [Xiangrui Meng] add some ALS tests 857e876 [Xiangrui Meng] add tests for rating block and encoded block d3c1ac4 [Xiangrui Meng] rename some classes for better code readability add more doc and comments 213d163 [Xiangrui Meng] org imports 771baf3 [Xiangrui Meng] chol doc update ca9ad9d [Xiangrui Meng] add unit tests for chol b4fd17c [Xiangrui Meng] add unit tests for NormalEquation d0f99d3 [Xiangrui Meng] add tests for LocalIndexEncoder 80b8e61 [Xiangrui Meng] fix imports 4937fd4 [Xiangrui Meng] update ALS example 56c253c [Xiangrui Meng] rename product to item bce8692 [Xiangrui Meng] doc for parameters and project the output columns 3f2d81a [Xiangrui Meng] add doc 1efaecf [Xiangrui Meng] add example code 8ae86b5 [Xiangrui Meng] add a working copy of the new ALS implementation
*	[SPARK-5315][Streaming] Fix reduceByWindow Java API not work bug	jerryshao	2015-01-22	3	-2/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	`reduceByWindow` for Java API is actually not Java compatible, change to make it Java compatible. Current solution is to deprecate the old one and add a new API, but since old API actually is not correct, so is keeping the old one meaningful? just to keep the binary compatible? Also even adding new API still need to add to Mima exclusion, I'm not sure to change the API, or deprecate the old API and add a new one, which is the best solution? Author: jerryshao <saisai.shao@intel.com> Closes #4104 from jerryshao/SPARK-5315 and squashes the following commits: 5bc8987 [jerryshao] Address the comment c7aa1b4 [jerryshao] Deprecate the old one to keep binary compatible 8e9dc67 [jerryshao] Fix JavaDStream reduceByWindow signature error
*	[SPARK-5233][Streaming] Fix error replaying of WAL introduced bug	jerryshao	2015-01-22	4	-20/+32
\| \| \| \| \| \| \| \| \| \| \| \| \|	Because of lacking of `BlockAllocationEvent` in WAL recovery, the dangled event will mix into the new batch, which will lead to the wrong result. Details can be seen in [SPARK-5233](https://issues.apache.org/jira/browse/SPARK-5233). Author: jerryshao <saisai.shao@intel.com> Closes #4032 from jerryshao/SPARK-5233 and squashes the following commits: f0b0c0b [jerryshao] Further address the comments a237c75 [jerryshao] Address the comments e356258 [jerryshao] Fix bug in unit test 558bdc3 [jerryshao] Correctly replay the WAL log when recovering from failure
*	SPARK-5370. [YARN] Remove some unnecessary synchronization in YarnAlloca...	Sandy Ryza	2015-01-22	1	-13/+10
\| \| \| \| \| \| \| \| \| \|	...tor Author: Sandy Ryza <sandy@cloudera.com> Closes #4164 from sryza/sandy-spark-5370 and squashes the following commits: 0c8d736 [Sandy Ryza] SPARK-5370. [YARN] Remove some unnecessary synchronization in YarnAllocator
*	[SPARK-5365][MLlib] Refactor KMeans to reduce redundant data	Liang-Chi Hsieh	2015-01-22	1	-4/+5
\| \| \| \| \| \| \| \| \| \|	If a point is selected as new centers for many runs, it would collect many redundant data. This pr refactors it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4159 from viirya/small_refactor_kmeans and squashes the following commits: 25487e6 [Liang-Chi Hsieh] Refactor codes to reduce redundant data.
*	[SPARK-5147][Streaming] Delete the received data WAL log periodically	Tathagata Das	2015-01-21	9	-50/+172
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a refactored fix based on jerryshao 's PR #4037 This enabled deletion of old WAL files containing the received block data. Improvements over #4037 - Respecting the rememberDuration of all receiver streams. In #4037, if there were two receiver streams with multiple remember durations, the deletion would have delete based on the shortest remember duration, thus deleting data prematurely for the receiver stream with longer remember duration. - Added unit test to test creation of receiver WAL, automatic deletion, and respecting of remember duration. jerryshao I am going to merge this ASAP to make it 1.2.1 Thanks for the initial draft of this PR. Made my job much easier. Author: Tathagata Das <tathagata.das1565@gmail.com> Author: jerryshao <saisai.shao@intel.com> Closes #4149 from tdas/SPARK-5147 and squashes the following commits: 730798b [Tathagata Das] Added comments. c4cf067 [Tathagata Das] Minor fixes 2579b27 [Tathagata Das] Refactored the fix to make sure that the cleanup respects the remember duration of all the receiver streams 2736fd1 [jerryshao] Delete the old WAL log periodically
*	[SPARK-5317]Set BoostingStrategy.defaultParams With Enumeration ↵	Basin	2015-01-21	2	-11/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Algo.Classification or Algo.Regression JIRA Issue: https://issues.apache.org/jira/browse/SPARK-5317 When setting the BoostingStrategy.defaultParams("Classification"), It's more straightforward to set it with the Enumeration Algo.Classification, just like BoostingStragety.defaultParams(Algo.Classification). I overload the method BoostingStragety.defaultParams(). Author: Basin <jpsachilles@gmail.com> Closes #4103 from Peishen-Jia/stragetyAlgo and squashes the following commits: 87bab1c [Basin] Docs and Code documentations updated. 3b72875 [Basin] defaultParams(algoStr: String) call defaultParams(algo: Algo). 7c1e6ee [Basin] Doc of Java updated. algo -> algoStr instead. d5c8a2e [Basin] Merge branch 'stragetyAlgo' of github.com:Peishen-Jia/spark into stragetyAlgo 65f96ce [Basin] mllib-ensembles doc modified. e04a5aa [Basin] boostingstrategy.defaultParam string algo to enumeration. 68cf544 [Basin] mllib-ensembles doc modified. a4aea51 [Basin] boostingstrategy.defaultParam string algo to enumeration.
*	[SPARK-3424][MLLIB] cache point distances during k-means\|\| init	Xiangrui Meng	2015-01-21	1	-15/+50
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR ports the following feature implemented in #2634 by derrickburns: * During k-means\|\| initialization, we should cache costs (squared distances) previously computed. It also contains the following optimization: * aggregate sumCosts directly * ran multiple (#runs) k-means++ in parallel I compared the performance locally on mnist-digit. Before this patch: ![before](https://cloud.githubusercontent.com/assets/829644/5845647/93080862-a172-11e4-9a35-044ec711afc4.png) with this patch: ![after](https://cloud.githubusercontent.com/assets/829644/5845653/a47c29e8-a172-11e4-8e9f-08db57fe3502.png) It is clear that each k-means\|\| iteration takes about the same amount of time with this patch. Authors: Derrick Burns <derrickburns@gmail.com> Xiangrui Meng <meng@databricks.com> Closes #4144 from mengxr/SPARK-3424-kmeans-parallel and squashes the following commits: 0a875ec [Xiangrui Meng] address comments 4341bb8 [Xiangrui Meng] do not re-compute point distances during k-means\|\|
*	[SPARK-5202] [SQL] Add hql variable substitution support	Cheng Hao	2015-01-21	2	-2/+22
\| \| \| \| \| \| \| \| \| \| \| \| \|	https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution This is a block issue for the CLI user, it impacts the existed hql scripts from Hive. Author: Cheng Hao <hao.cheng@intel.com> Closes #4003 from chenghao-intel/substitution and squashes the following commits: bb41fd6 [Cheng Hao] revert the removed the implicit conversion af7c31a [Cheng Hao] add hql variable substitution support
*	[SPARK-5355] make SparkConf thread-safe	Davies Liu	2015-01-21	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	The SparkConf is not thread-safe, but is accessed by many threads. The getAll() could return parts of the configs if another thread is access it. This PR changes SparkConf.settings to a thread-safe TrieMap. Author: Davies Liu <davies@databricks.com> Closes #4143 from davies/safe-conf and squashes the following commits: f8fa1cf [Davies Liu] change to TrieMap a1d769a [Davies Liu] make SparkConf thread-safe
*	[SPARK-4984][CORE][WEBUI] Adding a pop-up containing the full job ↵	wangfei	2015-01-21	3	-3/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	description when it is very long In some case the job description will be very long, such as a long sql. refer to #3718 This PR add a pop-up for job description when it is long. ![image](https://cloud.githubusercontent.com/assets/7018048/5847400/c757cbbc-a207-11e4-891f-528821c2e68d.png) ![image](https://cloud.githubusercontent.com/assets/7018048/5847409/d434b2b4-a207-11e4-8813-03a74b43d766.png) Author: wangfei <wangfei1@huawei.com> Closes #3819 from scwf/popup-descrip-ui and squashes the following commits: ba02b83 [wangfei] address comments a7c5e7b [wangfei] spot that it's been truncated fbf6162 [wangfei] Merge branch 'master' into popup-descrip-ui 0bca96d [wangfei] remove no use val 4b55c3b [wangfei] fix style issue 353c6f4 [wangfei] pop up the description of job with a styled read-only text form field
*	[SQL] [Minor] Remove deprecated parquet tests	Cheng Lian	2015-01-21	3	-1289/+212
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR removes the deprecated `ParquetQuerySuite`, renamed `ParquetQuerySuite2` to `ParquetQuerySuite`, and refactored changes introduced in #4115 to `ParquetFilterSuite` . It is a follow-up of #3644. Notice that test cases in the old `ParquetQuerySuite` have already been well covered by other test suites introduced in #3644. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4116) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4116 from liancheng/remove-deprecated-parquet-tests and squashes the following commits: f73b8f9 [Cheng Lian] Removes deprecated Parquet test suite
*	Revert "[SPARK-5244] [SQL] add coalesce() in sql parser"	Josh Rosen	2015-01-21	2	-11/+0
\| \| \| \|	This reverts commit 812d3679f5f97df7b667cbc3365a49866ebc02d5.
*	[SPARK-5009] [SQL] Long keyword support in SQL Parsers	Cheng Hao	2015-01-21	8	-81/+128
\| \| \| \| \| \| \| \| \| \| \|	* The `SqlLexical.allCaseVersions` will cause `StackOverflowException` if the key word is too long, the patch will fix that by normalizing all of the keywords in `SqlLexical`. * And make a unified SparkSQLParser for sharing the common code. Author: Cheng Hao <hao.cheng@intel.com> Closes #3926 from chenghao-intel/long_keyword and squashes the following commits: 686660f [Cheng Hao] Support Long Keyword and Refactor the SQLParsers
*	[SPARK-5244] [SQL] add coalesce() in sql parser	Daoyuan Wang	2015-01-21	2	-0/+11
\| \| \| \| \| \| \| \|	Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4040 from adrian-wang/coalesce and squashes the following commits: 0ac8e8f [Daoyuan Wang] add coalesce() in sql parser
*	[SPARK-5064][GraphX] Add numEdges upperbound validation for R-MAT graph ↵	Kenji Kikushima	2015-01-21	2	-0/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	generator to prevent infinite loop I looked into GraphGenerators#chooseCell, and found that chooseCell can't generate more edges than pow(2, (2 * (log2(numVertices)-1))) to make a Power-law graph. (Ex. numVertices:4 upperbound:4, numVertices:8 upperbound:16, numVertices:16 upperbound:64) If we request more edges over the upperbound, rmatGraph fall into infinite loop. So, how about adding an argument validation? Author: Kenji Kikushima <kikushima.kenji@lab.ntt.co.jp> Closes #3950 from kj-ki/SPARK-5064 and squashes the following commits: 4ee18c7 [Ankur Dave] Reword error message and add unit test d760bc7 [Kenji Kikushima] Add numEdges upperbound validation for R-MAT graph generator to prevent infinite loop.
*	[SPARK-4749] [mllib]: Allow initializing KMeans clusters using a seed	nate.crosswhite	2015-01-21	5	-12/+84
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This implements the functionality for SPARK-4749 and provides units tests in Scala and PySpark Author: nate.crosswhite <nate.crosswhite@stresearch.com> Author: nxwhite-str <nxwhite-str@users.noreply.github.com> Author: Xiangrui Meng <meng@databricks.com> Closes #3610 from nxwhite-str/master and squashes the following commits: a2ebbd3 [nxwhite-str] Merge pull request #1 from mengxr/SPARK-4749-kmeans-seed 7668124 [Xiangrui Meng] minor updates f8d5928 [nate.crosswhite] Addressing PR issues 277d367 [nate.crosswhite] Merge remote-tracking branch 'upstream/master' 9156a57 [nate.crosswhite] Merge remote-tracking branch 'upstream/master' 5d087b4 [nate.crosswhite] Adding KMeans train with seed and Scala unit test 616d111 [nate.crosswhite] Merge remote-tracking branch 'upstream/master' 35c1884 [nate.crosswhite] Add kmeans initial seed to pyspark API
*	[MLlib] [SPARK-5301] Missing conversions and operations on IndexedRowMatrix ↵	Reza Zadeh	2015-01-21	4	-0/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	and CoordinateMatrix * Transpose is missing from CoordinateMatrix (this is cheap to compute, so it should be there) * IndexedRowMatrix should be convertable to CoordinateMatrix (conversion added) Tests for both added. Author: Reza Zadeh <reza@databricks.com> Closes #4089 from rezazadeh/matutils and squashes the following commits: ec5238b [Reza Zadeh] Array -> Iterator to avoid temp array 3ce0b5d [Reza Zadeh] Array -> Iterator bbc907a [Reza Zadeh] Use 'i' for index, and zipWithIndex cb10ae5 [Reza Zadeh] remove unnecessary import a7ae048 [Reza Zadeh] Missing linear algebra utilities
*	SPARK-1714. Take advantage of AMRMClient APIs to simplify logic in YarnA...	Sandy Ryza	2015-01-21	7	-550/+389
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...llocator The goal of this PR is to simplify YarnAllocator as much as possible and get it up to the level of code quality we see in the rest of Spark. In service of this, it does a few things: * Uses AMRMClient APIs for matching containers to requests. * Adds calls to AMRMClient.removeContainerRequest so that, when we use a container, we don't end up requesting it again. * Removes YarnAllocator's host->rack cache. YARN's RackResolver already does this caching, so this is redundant. * Adds tests for basic YarnAllocator functionality. * Breaks up the allocateResources method, which was previously nearly 300 lines. * A little bit of stylistic cleanup. * Fixes a bug that causes three times the requests to be filed when preferred host locations are given. The patch is lossy. In particular, it loses the logic for trying to avoid containers bunching up on nodes. As I understand it, the logic that's gone is: * If, in a single response from the RM, we receive a set of containers on a node, and prefer some number of containers on that node greater than 0 but less than the number we received, give back the delta between what we preferred and what we received. This seems like a weird way to avoid bunching E.g. it does nothing to avoid bunching when we don't request containers on particular nodes. Author: Sandy Ryza <sandy@cloudera.com> Closes #3765 from sryza/sandy-spark-1714 and squashes the following commits: 32a5942 [Sandy Ryza] Muffle RackResolver logs 74f56dd [Sandy Ryza] Fix a couple comments and simplify requestTotalExecutors 60ea4bd [Sandy Ryza] Fix scalastyle ca35b53 [Sandy Ryza] Simplify further e9cf8a6 [Sandy Ryza] Fix YarnClusterSuite 257acf3 [Sandy Ryza] Remove locality stuff and more cleanup 59a3c5e [Sandy Ryza] Take out rack stuff 5f72fd5 [Sandy Ryza] Further documentation and cleanup 89edd68 [Sandy Ryza] SPARK-1714. Take advantage of AMRMClient APIs to simplify logic in YarnAllocator
*	[SPARK-5336][YARN]spark.executor.cores must not be less than spark.task.cpus	WangTao	2015-01-21	3	-5/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	https://issues.apache.org/jira/browse/SPARK-5336 Author: WangTao <barneystinson@aliyun.com> Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #4123 from WangTaoTheTonic/SPARK-5336 and squashes the following commits: 6c9676a [WangTao] Update ClientArguments.scala 9632d3a [WangTaoTheTonic] minor comment fix d03d6fa [WangTaoTheTonic] import ordering should be alphabetical' 3112af9 [WangTao] spark.executor.cores must not be less than spark.task.cpus
*	[SPARK-5297][Streaming] Fix Java file stream type erasure problem	jerryshao	2015-01-20	3	-15/+112
\| \| \| \| \| \| \| \| \| \| \|	Current Java file stream doesn't support custom key/value type because of loss of type information, details can be seen in [SPARK-5297](https://issues.apache.org/jira/browse/SPARK-5297). Fix this problem by getting correct `ClassTag` from `Class[_]`. Author: jerryshao <saisai.shao@intel.com> Closes #4101 from jerryshao/SPARK-5297 and squashes the following commits: e022ca3 [jerryshao] Add Mima exclusion ecd61b8 [jerryshao] Fix Java fileInputStream type erasure problem
*	[HOTFIX] Update pom.xml to pull MapR's Hadoop version 2.4.1.	Kannan Rajah	2015-01-20	1	-3/+3
\| \| \| \| \| \| \| \|	Author: Kannan Rajah <rkannan82@gmail.com> Closes #4108 from rkannan82/master and squashes the following commits: eca095b [Kannan Rajah] Update pom.xml to pull MapR's Hadoop version 2.4.1.
*	[SPARK-5275] [Streaming] include python source code	Davies Liu	2015-01-20	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \|	Include the python source code into assembly jar. cc mengxr pwendell Author: Davies Liu <davies@databricks.com> Closes #4128 from davies/build_streaming2 and squashes the following commits: 546af4c [Davies Liu] fix indent 48859b2 [Davies Liu] include python source code