Commit message | Author | Date | Files | Lines
...
* [MINOR][ML] Rename weights to coefficients for examples/DeveloperApiExample | Yanbo Liang | 2015-12-15 | 2 | -19/+19
| | | | | | | | | | Rename ```weights``` to ```coefficients``` for examples/DeveloperApiExample. cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #10280 from yanboliang/spark-coefficients.
* [STREAMING][MINOR] Fix typo in function name of StateImpl | jerryshao | 2015-12-15 | 3 | -3/+3
| | | | | | | | cc tdas zsxwing, please review. Thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #10305 from jerryshao/fix-typo-state-impl.
* [SPARK-12332][TRIVIAL][TEST] Fix minor typo in ResetSystemProperties | Holden Karau | 2015-12-15 | 1 | -1/+1
| | | | | | | | Fix a minor typo (unbalanced bracket) in ResetSystemProperties. Author: Holden Karau <holden@us.ibm.com> Closes #10303 from holdenk/SPARK-12332-trivial-typo-in-ResetSystemProperties-comment.
* [SPARK-12288] [SQL] Support UnsafeRow in Coalesce/Except/Intersect. | gatorsmile | 2015-12-14 | 2 | -1/+46
| | | | | | | | | | Support UnsafeRow for the Coalesce/Except/Intersect. Could you review if my code changes are ok? davies Thank you! Author: gatorsmile <gatorsmile@gmail.com> Closes #10285 from gatorsmile/unsafeSupportCIE.
* [SPARK-12188][SQL][FOLLOW-UP] Code refactoring and comment correction in Dataset APIs | gatorsmile | 2015-12-14 | 1 | -1/+1
| | | | | | | | | | marmbrus This PR is to address your comment. Thanks for your review! Author: gatorsmile <gatorsmile@gmail.com> Closes #10214 from gatorsmile/followup12188.
* [SPARK-12274][SQL] WrapOption should not have type constraint for child | Wenchen Fan | 2015-12-14 | 1 | -4/+1
| | | | | | | | I think it was a mistake, and we had not caught it until https://github.com/apache/spark/pull/10260, which began to check whether the `fromRowExpression` is resolved. Author: Wenchen Fan <wenchen@databricks.com> Closes #10263 from cloud-fan/encoder.
* [SPARK-12327] Disable commented code lintr temporarily | Shivaram Venkataraman | 2015-12-14 | 1 | -1/+1
| | | | | | | | cc yhuai felixcheung shaneknapp Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #10300 from shivaram/comment-lintr-disable.
* [SPARK-12016] [MLLIB] [PYSPARK] Wrap Word2VecModel when loading it in pyspark | Liang-Chi Hsieh | 2015-12-14 | 3 | -34/+67
| | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-12016 We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10100 from viirya/fix-load-py-wordvecmodel.
* [MINOR][DOC] Fix broken word2vec link | BenFradet | 2015-12-14 | 1 | -1/+1
| | | | | | | | Follow-up of [SPARK-12199](https://issues.apache.org/jira/browse/SPARK-12199) and #10193 where a broken link has been left as is. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10282 from BenFradet/SPARK-12199.
* [SPARK-12275][SQL] No plan for BroadcastHint in some condition | yucai | 2015-12-13 | 2 | -1/+8
| | | | | | | | | | When SparkStrategies.BasicOperators's "case BroadcastHint(child) => apply(child)" is hit, it only recursively invokes BasicOperators.apply with this "child". That gives the other strategies no chance to process this plan, which probably leads to the "No plan" issue, so we use planLater to go through all strategies. https://issues.apache.org/jira/browse/SPARK-12275 Author: yucai <yucai.yu@intel.com> Closes #10265 from yucai/broadcast_hint.
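For context, a minimal sketch of the kind of query that produces a BroadcastHint node, assuming the user-facing `broadcast` function from `org.apache.spark.sql.functions`; the DataFrame and column names are made up.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Illustrative only: broadcast() wraps its argument in a BroadcastHint plan node,
// the node that the planner now resolves via planLater.
def hintedJoin(large: DataFrame, small: DataFrame): DataFrame =
  large.join(broadcast(small), "id")
```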
* [SPARK-12213][SQL] use multiple partitions for single distinct query | Davies Liu | 2015-12-13 | 10 | -990/+422
| | | | | | | | | | | | | | | | | | | | | | | | | Currently, we could generate different plans for a query with a single distinct aggregation (depending on spark.sql.specializeSingleDistinctAggPlanning): one works better on low-cardinality columns, the other (the default) works better for high-cardinality columns. This PR changes the planner to generate a single plan (three aggregations and two exchanges) that works well in both cases, so we can safely remove the flag `spark.sql.specializeSingleDistinctAggPlanning` (introduced in 1.6). For a query like `SELECT COUNT(DISTINCT a) FROM table` the plan will be ``` AGG-4 (count distinct) Shuffle to a single reducer Partial-AGG-3 (count distinct, no grouping) Partial-AGG-2 (grouping on a) Shuffle by a Partial-AGG-1 (grouping on a) ``` This PR also includes a large refactoring of aggregation (removing 500+ lines of code). cc yhuai nongli marmbrus Author: Davies Liu <davies@databricks.com> Closes #10228 from davies/single_distinct.
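A minimal sketch of the query shape this change targets, written against the SQL and DataFrame APIs; the table and column names come from the example above and are illustrative only.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.countDistinct

// Single-distinct aggregation, the case that now always gets the
// three-aggregation / two-exchange plan shown above.
def singleDistinctSql(sqlContext: SQLContext): DataFrame =
  sqlContext.sql("SELECT COUNT(DISTINCT a) FROM table")

// Equivalent DataFrame form.
def singleDistinctDf(df: DataFrame): DataFrame =
  df.agg(countDistinct("a"))
```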
* [SPARK-12281][CORE] Fix a race condition when reporting ExecutorState in the shutdown hook | Shixiong Zhu | 2015-12-13 | 3 | -3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | shutdown hook 1. Make sure workers and masters exit so that no worker or master will still be running when triggering the shutdown hook. 2. Set ExecutorState to FAILED if it's still RUNNING when executing the shutdown hook. This should fix the potential exceptions when exiting a local cluster ``` java.lang.AssertionError: assertion failed: executor 4 state transfer from RUNNING to RUNNING is illegal at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown. at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246) at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191) at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180) at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` Author: Shixiong Zhu <shixiong@databricks.com> Closes #10269 from zsxwing/executor-state.
* [SPARK-12267][CORE] Store the remote RpcEnv address to send the correct disconnection message | Shixiong Zhu | 2015-12-12 | 4 | -1/+65
| | | | | | | | Author: Shixiong Zhu <shixiong@databricks.com> Closes #10261 from zsxwing/SPARK-12267.
* [SPARK-12199][DOC] Follow-up: Refine example code in ml-features.md | Xusen Yin | 2015-12-12 | 4 | -15/+15
| | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-12199 Follow-up PR of SPARK-11551. Fix some errors in ml-features.md mengxr Author: Xusen Yin <yinxusen@gmail.com> Closes #10193 from yinxusen/SPARK-12199.
* [SPARK-11193] Use Java ConcurrentHashMap instead of SynchronizedMap trait in order to avoid ClassCastException due to KryoSerializer in KinesisReceiver | Jean-Baptiste Onofré | 2015-12-12 | 1 | -8/+8
| | | | | | | | Author: Jean-Baptiste Onofré <jbonofre@apache.org> Closes #10203 from jbonofre/SPARK-11193.
* [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases | gatorsmile | 2015-12-11 | 2 | -6/+15
| | | | | | | | | | | The existing sample functions miss the parameter `seed`; however, the corresponding function interface in `generics` has such a parameter. Thus, although the caller can pass a 'seed', we are not using the value. This could cause SparkR unit tests to fail. For example, I hit it in another PR: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull Author: gatorsmile <gatorsmile@gmail.com> Closes #10160 from gatorsmile/sampleR.
* [SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions | Ankur Dave | 2015-12-11 | 2 | -3/+3
| | | | | | | | Modifies the String overload to call the Column overload and ensures this is called in a test. Author: Ankur Dave <ankurdave@gmail.com> Closes #10271 from ankurdave/SPARK-12298.
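A sketch of the two overloads involved, with made-up column names; per the fix, the String overload now delegates to the Column overload instead of looping.

```scala
import org.apache.spark.sql.DataFrame

// Both forms should now produce the same result instead of the String
// overload recursing into itself.
def sortExamples(df: DataFrame): (DataFrame, DataFrame) = {
  val byName   = df.sortWithinPartitions("key", "value")         // String overload
  val byColumn = df.sortWithinPartitions(df("key"), df("value")) // Column overload
  (byName, byColumn)
}
```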
* [SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to dataframe_example.py | Yanbo Liang | 2015-12-11 | 2 | -26/+38
| | | | | | | | | | | | | | Since ```Dataset``` has a new meaning in Spark 1.6, we should rename the example to avoid confusion. #9873 finished the work for the Scala example; here we focus on the Python one. Move dataset_example.py to ```examples/ml``` and rename it to ```dataframe_example.py```. BTW, fix minor missing issues of #9873. cc mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9957 from yanboliang/SPARK-11978.
* [SPARK-12217][ML] Document invalid handling for StringIndexer | BenFradet | 2015-12-11 | 1 | -0/+36
| | | | | | | | | | Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation. I wonder if I should also add a snippet to the code example, input welcome. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10257 from BenFradet/SPARK-12217.
* [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue | Mike Dusenberry | 2015-12-11 | 1 | -1/+1
| | | | | | | | | | | | | | Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.
* [SPARK-12273][STREAMING] Make Spark Streaming web UI list Receivers in order | proflin | 2015-12-11 | 1 | -2/+3
| | | | | | | | | | Currently the Streaming web UI does NOT list Receivers in order; however, it seems more convenient for the users if Receivers are listed in order. ![spark-12273](https://cloud.githubusercontent.com/assets/15843379/11736602/0bb7f7a8-a00b-11e5-8e86-96ba9297fb12.png) Author: proflin <proflin.me@gmail.com> Closes #10264 from proflin/Spark-12273.
* [SPARK-11964][DOCS][ML] Add in Pipeline Import/Export Documentation | anabranch | 2015-12-11 | 1 | -0/+13
| | | | | | | | | Adding in Pipeline Import and Export Documentation. Author: anabranch <wac.chambers@gmail.com> Author: Bill Chambers <wchambers@ischool.berkeley.edu> Closes #10179 from anabranch/master.
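A minimal sketch of the import/export flow being documented, assuming the ML persistence API added in 1.6 (`save`/`load`); the path argument is just an example.

```scala
import org.apache.spark.ml.PipelineModel

// Persist a fitted pipeline and load it back later, e.g. in another application.
def roundTrip(model: PipelineModel, path: String): PipelineModel = {
  model.save(path)
  PipelineModel.load(path)
}
```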
* [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files | Yanbo Liang | 2015-12-11 | 5 | -116/+138
| | | | | | | | | | | | | | | | | * ```jsonFile``` should support multiple input files, such as: ```R jsonFile(sqlContext, c("path1", "path2")) # character vector as arguments jsonFile(sqlContext, "path1,path2") ``` * Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed in Spark 2.0, so we mark ```jsonFile``` deprecated and use ```read.json``` on the SparkR side. * Replace all ```jsonFile``` with ```read.json``` in test_sparkSQL.R, but still keep the jsonFile test case. * If this PR is accepted, we should also make almost the same change for ```parquetFile```. cc felixcheung sun-rui shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #10145 from yanboliang/spark-12146.
* [SPARK-12258] [SQL] passing null into ScalaUDF (follow-up) | Davies Liu | 2015-12-11 | 2 | -16/+23
| | | | | | | | This is a follow-up PR for #10259 Author: Davies Liu <davies@databricks.com> Closes #10266 from davies/null_udf2.
* [SPARK-10991][ML] logistic regression training summary handle empty prediction col | Holden Karau | 2015-12-11 | 2 | -2/+29
| | | | | | | | | | | LogisticRegression training summary should still function if the predictionCol is set to an empty string or otherwise unset (related to https://issues.apache.org/jira/browse/SPARK-9718 ) Author: Holden Karau <holden@pigscanfly.ca> Author: Holden Karau <holden@us.ibm.com> Closes #9037 from holdenk/SPARK-10991-LogisticRegressionTrainingSummary-handle-empty-prediction-col.
* [SPARK-12258][SQL] passing null into ScalaUDF | Davies Liu | 2015-12-10 | 2 | -6/+10
| | | | | | | | | | Check nullability and pass null values into ScalaUDF. Closes #10249 Author: Davies Liu <davies@databricks.com> Closes #10259 from davies/udf_null.
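A minimal sketch of the scenario, assuming a hypothetical UDF over a nullable String column; with this fix, a null input reaches the Scala function rather than being mishandled. The UDF name and logic are made up.

```scala
import org.apache.spark.sql.SQLContext

// Hypothetical UDF over a nullable String column; it must tolerate null input.
def registerNullTolerantUdf(sqlContext: SQLContext): Unit = {
  sqlContext.udf.register("strLen", (s: String) => if (s == null) 0 else s.length)
}
```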
* [STREAMING][DOC][MINOR] Update the description of direct Kafka stream doc | jerryshao | 2015-12-10 | 1 | -1/+1
| | | | | | | | | | With the merge of [SPARK-8337](https://issues.apache.org/jira/browse/SPARK-8337), the Python API now has the same functionality as Scala/Java, so here we change the description to make it more precise. zsxwing tdas, please review, thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #10246 from jerryshao/direct-kafka-doc-update.
* [SPARK-12155][SPARK-12253] Fix executor OOM in unified memory management | Andrew Or | 2015-12-10 | 4 | -31/+114
| | | | | | | | | | | | | | | | | | **Problem.** In unified memory management, acquiring execution memory may lead to eviction of storage memory. However, the space freed from evicting cached blocks is distributed among all active tasks. Thus, an incorrect upper bound on the execution memory per task can cause the acquisition to fail, leading to OOM's and premature spills. **Example.** Suppose total memory is 1000B, cached blocks occupy 900B, `spark.memory.storageFraction` is 0.4, and there are two active tasks. In this case, the cap on task execution memory is 100B / 2 = 50B. If task A tries to acquire 200B, it will evict 100B of storage but can only acquire 50B because of the incorrect cap. For another example, see this [regression test](https://github.com/andrewor14/spark/blob/fix-oom/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala#L233) that I stole from JoshRosen. **Solution.** Fix the cap on task execution memory. It should take into account the space that could have been freed by storage in addition to the current amount of memory available to execution. In the example above, the correct cap should have been 600B / 2 = 300B. This patch also guards against the race condition (SPARK-12253): (1) Existing tasks collectively occupy all execution memory (2) New task comes in and blocks while existing tasks spill (3) After tasks finish spilling, another task jumps in and puts in a large block, stealing the freed memory (4) New task still cannot acquire memory and goes back to sleep Author: Andrew Or <andrew@databricks.com> Closes #10240 from andrewor14/fix-oom.
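Restating the arithmetic from the example above as a tiny snippet; the numbers come straight from the commit message, and the variable names are illustrative rather than the actual implementation.

```scala
// 1000B total, 900B cached, storageFraction = 0.4, two active tasks.
val maxMemory       = 1000L
val storageUsed     = 900L
val storageFraction = 0.4
val numActiveTasks  = 2

val unevictableStorage = (maxMemory * storageFraction).toLong // 400B guaranteed to storage
val evictableStorage   = storageUsed - unevictableStorage     // 500B execution may reclaim
val freeMemory         = maxMemory - storageUsed              // 100B currently free

val buggyCapPerTask = freeMemory / numActiveTasks                      // 100B / 2 = 50B
val fixedCapPerTask = (freeMemory + evictableStorage) / numActiveTasks // 600B / 2 = 300B
```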
* [SPARK-12251] Document and improve off-heap memory configurations | Josh Rosen | 2015-12-10 | 15 | -22/+65
| | | | | | | | | | | | | This patch adds documentation for Spark configurations that affect off-heap memory and makes some naming and validation improvements for those configs. - Change `spark.memory.offHeapSize` to `spark.memory.offHeap.size`. This is fine because this configuration has not shipped in any Spark release yet (it's new in Spark 1.6). - Deprecated `spark.unsafe.offHeap` in favor of a new `spark.memory.offHeap.enabled` configuration. The motivation behind this change is to gather all memory-related configurations under the same prefix. - Add a check which prevents users from setting `spark.memory.offHeap.enabled=true` when `spark.memory.offHeap.size == 0`. After SPARK-11389 (#9344), which was committed in Spark 1.6, Spark enforces a hard limit on the amount of off-heap memory that it will allocate to tasks. As a result, enabling off-heap execution memory without setting `spark.memory.offHeap.size` will lead to immediate OOMs. The new configuration validation makes this scenario easier to diagnose, helping to avoid user confusion. - Document these configurations on the configuration page. Author: Josh Rosen <joshrosen@databricks.com> Closes #10237 from JoshRosen/SPARK-12251.
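A sketch of how the renamed settings fit together after this change: enabling off-heap memory now requires a positive `spark.memory.offHeap.size`. The 1 GB value is an arbitrary example.

```scala
import org.apache.spark.SparkConf

// Off-heap execution memory must be given an explicit, non-zero size.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", (1024L * 1024 * 1024).toString) // bytes
```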
* [SPARK-11713] [PYSPARK] [STREAMING] Initial RDD updateStateByKey for PySpark | Bryan Cutler | 2015-12-10 | 4 | -5/+47
| | | | | | | | Adding the ability to define an initial state RDD for use with updateStateByKey in PySpark. Added a unit test and changed the stateful_network_wordcount example to use an initial RDD. Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #10082 from BryanCutler/initial-rdd-updateStateByKey-SPARK-11713.
* [SPARK-11563][CORE][REPL] Use RpcEnv to transfer REPL-generated classes. | Marcelo Vanzin | 2015-12-10 | 15 | -98/+183
| | | | | | | | | | | | | | | This avoids bringing up yet another HTTP server on the driver, and instead reuses the file server already managed by the driver's RpcEnv. As a bonus, the repl now inherits the security features of the network library. There's also a small change to create the directory for storing classes under the root temp dir for the application (instead of directly under java.io.tmpdir). Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9923 from vanzin/SPARK-11563.
* [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation. | Timothy Hunter | 2015-12-10 | 31 | -1793/+149
| | | | | | | | | | | | Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark). It also removes some files that I forgot to delete with #10207 Author: Timothy Hunter <timhunter@databricks.com> Closes #10234 from thunterdb/12212.
* [SPARK-12228][SQL] Try to run execution hive's derby in memory. | Yin Huai | 2015-12-10 | 4 | -5/+9
| | | | | | | | | | This PR tries to make execution hive's derby run in memory, since it is a fake metastore and every time we create a HiveContext, we will switch to a new one. It is possible that it can reduce the flakiness of our tests that need to create HiveContext (e.g. HiveSparkSubmitSuite). I will test it more. https://issues.apache.org/jira/browse/SPARK-12228 Author: Yin Huai <yhuai@databricks.com> Closes #10204 from yhuai/derbyInMemory.
* [SPARK-12250][SQL] Allow users to define a UDAF without providing details of its inputSchema | Yin Huai | 2015-12-10 | 2 | -5/+64
| | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-12250 Author: Yin Huai <yhuai@databricks.com> Closes #10236 from yhuai/SPARK-12250.
* [SPARK-12234][SPARKR] Fix ```subset``` function error when only set ```select``` argument | Yanbo Liang | 2015-12-10 | 2 | -2/+11
| | | | | | | | | | | | Fix the ```subset``` function error when only the ```select``` argument is set. Please refer to the [JIRA](https://issues.apache.org/jira/browse/SPARK-12234) about the error and how to reproduce it. cc sun-rui felixcheung shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #10217 from yanboliang/spark-12234.
* [SPARK-11602][MLLIB] Refine visibility for 1.6 scala API audit | Yuhao Yang | 2015-12-10 | 4 | -5/+5
| | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-11602 Made a pass on the API change of 1.6. Open the PR for efficient discussion. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9939 from hhbyyh/auditScala.
* [SPARK-12198][SPARKR] SparkR support read.parquet and deprecate parquetFile | Yanbo Liang | 2015-12-10 | 3 | -6/+22
| | | | | | | | SparkR support ```read.parquet``` and deprecate ```parquetFile```. This change is similar with #10145 for ```jsonFile```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10191 from yanboliang/spark-12198.
* [SPARK-11832][CORE] Process arguments in spark-shell for Scala 2.11 | Jakob Odersky | 2015-12-10 | 2 | -13/+27
| | | | | | | | Process arguments passed to the spark-shell. Fixes running the spark-shell from within a build environment. Author: Jakob Odersky <jodersky@gmail.com> Closes #9824 from jodersky/shell-2.11.
* [SPARK-12242][SQL] Add DataFrame.transform method | Reynold Xin | 2015-12-10 | 2 | -1/+14
| | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #10226 from rxin/df-transform.
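A minimal sketch of the new method: it lets DataFrame => DataFrame helpers be chained fluently. The helper functions and column names are made up.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def dropNulls(df: DataFrame): DataFrame = df.na.drop()
def tagSource(src: String)(df: DataFrame): DataFrame = df.withColumn("source", lit(src))

// transform(f) applies f to the DataFrame, so custom steps chain like built-in methods.
def pipeline(df: DataFrame): DataFrame =
  df.transform(dropNulls).transform(tagSource("ingest"))
```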
* [SPARK-11530][MLLIB] Return eigenvalues with PCA model | Sean Owen | 2015-12-10 | 7 | -25/+67
| | | | | | | | | | Add `computePrincipalComponentsAndVariance` to also compute PCA's explained variance. CC mengxr Author: Sean Owen <sowen@cloudera.com> Closes #9736 from srowen/SPARK-11530.
* [SPARK-12136][STREAMING] rddToFileName does not properly handle prefix and suffix parameters | bomeng | 2015-12-10 | 1 | -6/+7
| | | | | | | | | | | | | The original code does not properly handle the case where the prefix is null but the suffix is not null: the suffix should be used but is not. The fix uses StringBuilder to construct the proper file name. Author: bomeng <bmeng@us.ibm.com> Author: Bo Meng <mengbo@bos-macbook-pro.usca.ibm.com> Closes #10185 from bomeng/SPARK-12136.
* [SPARK-12252][SPARK-12131][SQL] refactor MapObjects to make it less hacky | Wenchen Fan | 2015-12-10 | 4 | -47/+35
| | | | | | | | | | | | In https://github.com/apache/spark/pull/10133 we found that we should ensure the children of `TreeNode` are all accessible in the `productIterator`, or the behavior will be very confusing. In this PR, I try to fix this problem by exposing the `loopVar`. This also fixes SPARK-12131, which is caused by the hacky `MapObjects`. Author: Wenchen Fan <wenchen@databricks.com> Closes #10239 from cloud-fan/map-objects.
* [SPARK-12244][SPARK-12245][STREAMING] Rename trackStateByKey to mapWithState and change tracking function signature | Tathagata Das | 2015-12-09 | 13 | -382/+389
| | | | | | | | | | | | | | | | | | | SPARK-12244: Based on feedback from early users and personal experience attempting to explain it, the name trackStateByKey had two problems: "trackState" is a completely new term which really does not give any intuition on what the operation is, and the resultant data stream of objects returned by the function is called the "emitted" data in the docs, for lack of a better name. "mapWithState" makes sense because the API is like a mapping function (Key, Value) => T with State as an additional parameter, and the resultant data stream is "mapped data". So both problems are solved. SPARK-12245: From initial experience, not having the key in the function makes it hard to return mapped results, as the whole information of the record is not there. Basically the user is restricted to doing something like mapValue() instead of map(). So the key is added as a parameter. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #10224 from tdas/rename.
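A minimal sketch of the renamed API, mirroring the stateful word-count pattern; note that the key is now a parameter of the mapping function. The DStream type and names are illustrative.

```scala
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Keeps a running count per word and emits (word, updated count) records.
def runningCounts(words: DStream[(String, Int)]): DStream[(String, Int)] = {
  val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
    val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(sum)
    (word, sum) // the "mapped" record sent downstream
  }
  words.mapWithState(StateSpec.function(mappingFunc))
}
```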
* [SPARK-11796] Fix httpclient and httpcore dependency issues related to docker-client | Mark Grover | 2015-12-09 | 3 | -2/+50
| | | | | | | | | | This commit fixes dependency issues which prevented the Docker-based JDBC integration tests from running in the Maven build. Author: Mark Grover <mgrover@cloudera.com> Closes #9876 from markgrover/master_docker.
* [SPARK-11678][SQL][DOCS] Document basePath in the programming guide. | Yin Huai | 2015-12-09 | 1 | -0/+7
| | | | | | | | | | | | | This PR adds document for `basePath`, which is a new parameter used by `HadoopFsRelation`. The compiled doc is shown below. ![image](https://cloud.githubusercontent.com/assets/2072857/11673132/1ba01192-9dcb-11e5-98d9-ac0b4e92e98c.png) JIRA: https://issues.apache.org/jira/browse/SPARK-11678 Author: Yin Huai <yhuai@databricks.com> Closes #10211 from yhuai/basePathDoc.
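A sketch of the option being documented, assuming a made-up partitioned layout: the reader is pointed at one partition directory while partition discovery stays rooted at basePath.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Reads only the given partition directory, but because basePath points at the
// table root, year/month remain available as partition columns.
def readDecember(sqlContext: SQLContext): DataFrame =
  sqlContext.read
    .option("basePath", "/data/events")
    .parquet("/data/events/year=2015/month=12")
```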
* [SPARK-12165][ADDENDUM] Fix outdated comments on unroll test | Andrew Or | 2015-12-09 | 1 | -4/+9
| | | | | | | | JoshRosen Author: Andrew Or <andrew@databricks.com> Closes #10229 from andrewor14/unroll-test-comments.
* [SPARK-12211][DOC][GRAPHX] Fix version number in graphx doc for migration from 1.1 | Andrew Ray | 2015-12-09 | 1 | -1/+1
| | | | | | | | | | The "Migrating from Spark 1.1" section added to the GraphX doc in 1.2.0 (see https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#migrating-from-spark-11) uses {{site.SPARK_VERSION}} as the version where changes were introduced; it should be just 1.2. Author: Andrew Ray <ray.andrew@gmail.com> Closes #10206 from aray/graphx-doc-1.1-migration.
* [SPARK-11551][DOC] Replace example code in ml-features.md using include_example | Xusen Yin | 2015-12-09 | 52 | -1061/+2820
| | | | | | | | | PR on behalf of somideshmukh, thanks! Author: Xusen Yin <yinxusen@gmail.com> Author: somideshmukh <somilde@us.ibm.com> Closes #10219 from yinxusen/SPARK-11551.
* [SPARK-11824][WEBUI] WebUI does not render descriptions with 'bad' HTML, throws console error | Sean Owen | 2015-12-09 | 1 | -1/+0
| | | | | | | | | | | | | | Don't warn when a description isn't valid HTML, since it may legitimately be something like "SELECT ... where foo <= 1". The tests for this code indicate that it's normal to handle strings like this, which don't contain HTML, as plain strings rather than markup; hence logging every such instance as a warning is too noisy, since it's not a problem. This is an issue for stages whose names contain SQL like the above. CC tdas as author of this bit of code. Author: Sean Owen <sowen@cloudera.com> Closes #10159 from srowen/SPARK-11824.
* [SPARK-12165][SPARK-12189] Fix bugs in eviction of storage memory by execution | Josh Rosen | 2015-12-09 | 8 | -204/+230
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a bug in the eviction of storage memory by execution. ## The bug: In general, execution should be able to evict storage memory when the total storage memory usage is greater than `maxMemory * spark.memory.storageFraction`. Due to a bug, however, Spark might wind up evicting no storage memory in certain cases where the storage memory usage was between `maxMemory * spark.memory.storageFraction` and `maxMemory`. For example, here is a regression test which illustrates the bug: ```scala val maxMemory = 1000L val taskAttemptId = 0L val (mm, ms) = makeThings(maxMemory) // Since we used the default storage fraction (0.5), we should be able to allocate 500 bytes // of storage memory which are immune to eviction by execution memory pressure. // Acquire enough storage memory to exceed the storage region size assert(mm.acquireStorageMemory(dummyBlock, 750L, evictedBlocks)) assertEvictBlocksToFreeSpaceNotCalled(ms) assert(mm.executionMemoryUsed === 0L) assert(mm.storageMemoryUsed === 750L) // At this point, storage is using 250 more bytes of memory than it is guaranteed, so execution // should be able to reclaim up to 250 bytes of storage memory. // Therefore, execution should now be able to require up to 500 bytes of memory: assert(mm.acquireExecutionMemory(500L, taskAttemptId, MemoryMode.ON_HEAP) === 500L) // <--- fails by only returning 250L assert(mm.storageMemoryUsed === 500L) assert(mm.executionMemoryUsed === 500L) assertEvictBlocksToFreeSpaceCalled(ms, 250L) ``` The problem relates to the control flow / interaction between `StorageMemoryPool.shrinkPoolToReclaimSpace()` and `MemoryStore.ensureFreeSpace()`. While trying to allocate the 500 bytes of execution memory, the `UnifiedMemoryManager` discovers that it will need to reclaim 250 bytes of memory from storage, so it calls `StorageMemoryPool.shrinkPoolToReclaimSpace(250L)`. This method, in turn, calls `MemoryStore.ensureFreeSpace(250L)`. However, `ensureFreeSpace()` first checks whether the requested space is less than `maxStorageMemory - storageMemoryUsed`, which will be true if there is any free execution memory because it turns out that `MemoryStore.maxStorageMemory = (maxMemory - onHeapExecutionMemoryPool.memoryUsed)` when the `UnifiedMemoryManager` is used. The control flow here is somewhat confusing (it grew to be messy / confusing over time / as a result of the merging / refactoring of several components). In the pre-Spark 1.6 code, `ensureFreeSpace` was called directly by the `MemoryStore` itself, whereas in 1.6 it's involved in a confusing control flow where `MemoryStore` calls `MemoryManager.acquireStorageMemory`, which then calls back into `MemoryStore.ensureFreeSpace`, which, in turn, calls `MemoryManager.freeStorageMemory`. ## The solution: The solution implemented in this patch is to remove the confusing circular control flow between `MemoryManager` and `MemoryStore`, making the storage memory acquisition process much more linear / straightforward. The key changes: - Remove a layer of inheritance which made the memory manager code harder to understand (53841174760a24a0df3eb1562af1f33dbe340eb9). - Move some bounds checks earlier in the call chain (13ba7ada77f87ef1ec362aec35c89a924e6987cb). - Refactor `ensureFreeSpace()` so that the part which evicts blocks can be called independently from the part which checks whether there is enough free space to avoid eviction (7c68ca09cb1b12f157400866983f753ac863380e). 
- Realize that this lets us remove a layer of overloads from `ensureFreeSpace` (eec4f6c87423d5e482b710e098486b3bbc4daf06). - Realize that `ensureFreeSpace()` can simply be replaced with an `evictBlocksToFreeSpace()` method which is called [after we've already figured out](https://github.com/apache/spark/blob/2dc842aea82c8895125d46a00aa43dfb0d121de9/core/src/main/scala/org/apache/spark/memory/StorageMemoryPool.scala#L88) how much memory needs to be reclaimed via eviction; (2dc842aea82c8895125d46a00aa43dfb0d121de9). Along the way, I fixed some problems with the mocks in `MemoryManagerSuite`: the old mocks would [unconditionally](https://github.com/apache/spark/blob/80a824d36eec9d9a9f092ee1741453851218ec73/core/src/test/scala/org/apache/spark/memory/MemoryManagerSuite.scala#L84) report that a block had been evicted even if there was enough space in the storage pool such that eviction would be avoided. I also fixed a problem where `StorageMemoryPool._memoryUsed` might become negative due to freed memory being double-counted when execution evicts storage. The problem was that `StorageMemoryPool.shrinkPoolToFreeSpace` would [decrement `_memoryUsed`](https://github.com/apache/spark/commit/7c68ca09cb1b12f157400866983f753ac863380e#diff-935c68a9803be144ed7bafdd2f756a0fL133) even though `StorageMemoryPool.freeMemory` had already decremented it as each evicted block was freed. See SPARK-12189 for details. Author: Josh Rosen <joshrosen@databricks.com> Author: Andrew Or <andrew@databricks.com> Closes #10170 from JoshRosen/SPARK-12165.