spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	Preparing Spark release v1.2.0-rc2 (tag: v1.2.0)	Patrick Wendell	2014-12-10	29	-29/+29
*	Revert "Preparing Spark release v1.2.0-rc2"	Patrick Wendell	2014-12-10	29	-29/+29
	This reverts commit 2b72c569a674cccf79ebbe8d067b8dbaaf78007f.
*	Revert "Preparing development version 1.2.1-SNAPSHOT"	Patrick Wendell	2014-12-10	29	-29/+29
	This reverts commit bc05df8a23ba7ad485f6844f28f96551b13ba461.
*	[Minor] Use <sup> tag for help icon in web UI page header	Josh Rosen	2014-12-09	1	-1/+3
	This small commit makes the `(?)` web UI help link into a superscript, which should address feedback that the current design makes it look like an error occurred or like information is missing.
	Before: ![image](https://cloud.githubusercontent.com/assets/50748/5370611/a3ed0034-7fd9-11e4-870f-05bd9faad5b9.png)
	After: ![image](https://cloud.githubusercontent.com/assets/50748/5370602/6c5ca8d6-7fd9-11e4-8d1a-568d71290aa7.png)
	Author: Josh Rosen <joshrosen@databricks.com>
	Closes #3659 from JoshRosen/webui-help-sup and squashes the following commits:
	bd72899 [Josh Rosen] Use <sup> tag for help icon in web UI page header.
	(cherry picked from commit f79c1cfc997c1a7ddee480ca3d46f5341b69d3b7)
	Signed-off-by: Josh Rosen <joshrosen@databricks.com>
*	Config updates for the new shuffle transport.	Reynold Xin	2014-12-09	3	-6/+6
	Author: Reynold Xin <rxin@databricks.com>
	Closes #3657 from rxin/conf-update and squashes the following commits:
	7370eab [Reynold Xin] Config updates for the new shuffle transport.
	(cherry picked from commit 9bd9334f588dbb44d01554f9f4ca68a153a48993)
	Signed-off-by: Aaron Davidson <aaron@databricks.com>
*	[SPARK-4740] Create multiple concurrent connections between two peer nodes in Netty.	Reynold Xin	2014-12-09	3	-46/+180
	It's been reported that when the number of disks is large and the number of nodes is small, Netty network throughput is low compared with NIO. We suspect the problem is that only a small number of disks are utilized to serve shuffle files at any given point, due to connection reuse.
	This patch adds a new config parameter to specify the number of concurrent connections between two peer nodes, defaulting to 2.
	Author: Reynold Xin <rxin@databricks.com>
	Closes #3625 from rxin/SPARK-4740 and squashes the following commits:
	ad4241a [Reynold Xin] Updated javadoc.
	f33c72b [Reynold Xin] Code review feedback.
	0fefabb [Reynold Xin] Use double check in synchronization.
	41dfcb2 [Reynold Xin] Added test case.
	9076b4a [Reynold Xin] Fixed two NPEs.
	3e1306c [Reynold Xin] Minor style fix.
	4f21673 [Reynold Xin] [SPARK-4740] Create multiple concurrent connections between two peer nodes in Netty.
	(cherry picked from commit 2b9b72682e587909a84d3ace214c22cec830eeaf)
	Signed-off-by: Reynold Xin <rxin@databricks.com>
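The idea behind SPARK-4740 (several connections per peer, created lazily with a double check around the lock) can be sketched as follows. This is only an illustration in Python, not the Netty code from the patch: `ClientFactory`, `ClientPool`, and the `connect` callable are hypothetical names, and picking a random slot is an assumption about how load is spread.

```python
import random
import threading


class ClientPool:
    """Lazily filled slots for the connections to a single peer."""

    def __init__(self, num_connections):
        self.clients = [None] * num_connections
        self.locks = [threading.Lock() for _ in range(num_connections)]


class ClientFactory:
    """Hypothetical factory that keeps up to N connections per peer."""

    def __init__(self, connect, num_connections_per_peer=2):
        self._connect = connect  # assumed callable: (host, port) -> client
        self._n = num_connections_per_peer
        self._pools = {}
        self._pools_lock = threading.Lock()

    def get_client(self, host, port):
        key = (host, port)
        pool = self._pools.get(key)
        if pool is None:
            with self._pools_lock:
                pool = self._pools.setdefault(key, ClientPool(self._n))

        # Pick one of the N slots at random so requests spread across them.
        idx = random.randrange(self._n)

        # First check, without the lock: the common case once connected.
        client = pool.clients[idx]
        if client is not None:
            return client

        # Second check, under the slot's lock: only one thread connects.
        with pool.locks[idx]:
            if pool.clients[idx] is None:
                pool.clients[idx] = self._connect(host, port)
            return pool.clients[idx]
```

With the default of two slots, repeated calls for the same peer reuse at most two connections instead of one.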
*	SPARK-4805 [CORE] BlockTransferMessage.toByteArray() trips assertion	Sean Owen	2014-12-09	1	-1/+2
	Allocate enough room for the type byte as well as the message, to avoid tripping the assertion about the capacity of the buffer.
	Author: Sean Owen <sowen@cloudera.com>
	Closes #3650 from srowen/SPARK-4805 and squashes the following commits:
	9e1d502 [Sean Owen] Allocate enough room for type byte as well as message, to avoid tripping assertion about capacity of the buffer
	(cherry picked from commit d8f84f26e388055ca7459810e001d05ab60af15b)
	Signed-off-by: Aaron Davidson <aaron@databricks.com>
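A minimal sketch of the sizing issue in SPARK-4805, assuming a wire layout of one type byte followed by the encoded message. The function name and layout are illustrative, not the actual `BlockTransferMessage` implementation.

```python
import struct


def to_byte_array(type_id, payload):
    """Encode a message as one type byte followed by the payload bytes."""
    # Reserve 1 extra byte for the type tag; sizing the buffer to
    # len(payload) alone would leave no room for it.
    buf = bytearray(1 + len(payload))
    struct.pack_into("B", buf, 0, type_id)
    buf[1:] = payload
    # Capacity check analogous to the assertion mentioned in the commit.
    assert len(buf) == 1 + len(payload)
    return bytes(buf)
```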
*	SPARK-4567. Make SparkJobInfo and SparkStageInfo serializable	Sandy Ryza	2014-12-09	2	-2/+6
	Author: Sandy Ryza <sandy@cloudera.com>
	Closes #3426 from sryza/sandy-spark-4567 and squashes the following commits:
	cb4b8d2 [Sandy Ryza] SPARK-4567. Make SparkJobInfo and SparkStageInfo serializable
	(cherry picked from commit 5e4c06f8e54265a4024857f5978ec54c936aeea2)
	Signed-off-by: Josh Rosen <joshrosen@databricks.com>
*	[SPARK-4765] Make GC time always shown in UI.	Kay Ousterhout	2014-12-09	3	-13/+5
	This commit removes the GC time for each task from the set of optional, additional metrics, and instead always shows it for each task.
	cc pwendell
	Author: Kay Ousterhout <kayousterhout@gmail.com>
	Closes #3622 from kayousterhout/gc_time and squashes the following commits:
	15ac242 [Kay Ousterhout] Make TaskDetailsClassNames private[spark]
	e71d893 [Kay Ousterhout] [SPARK-4765] Make GC time always shown in UI.
	(cherry picked from commit 1f5110630c1abb13a357b463c805a39772923b82)
	Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com>
*	[SPARK-4785][SQL] Initialize Hive UDFs on the driver and serialize them with a wrapper	Cheng Hao	2014-12-09	5	-50/+173
	Unlike in Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. GenericUDF doesn't implement Serializable). Hive 0.13.1 solves this issue with a Kryo or XML serializer. Several utility ser/de methods are provided in class o.a.h.h.q.e.Utilities for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer used here is created in Hive. The Spark Kryo serializer wasn't used because there's no available SparkConf instance.
	Author: Cheng Hao <hao.cheng@intel.com>
	Author: Cheng Lian <lian@databricks.com>
	Closes #3640 from chenghao-intel/udf_serde and squashes the following commits:
	8e13756 [Cheng Hao] Update the comment
	74466a3 [Cheng Hao] refactor as feedbacks
	396c0e1 [Cheng Hao] avoid Simple UDF to be serialized
	e9c3212 [Cheng Hao] update the comment
	19cbd46 [Cheng Hao] support udf instance ser/de after initialization
	(cherry picked from commit 383c5555c9f26c080bc9e3a463aab21dd5b3797f)
	Signed-off-by: Michael Armbrust <michael@databricks.com>
*	[SPARK-4769] [SQL] CTAS does not work when reading from temporary tables	Cheng Hao	2014-12-08	4	-16/+49
	This is a code refactor and follow-up for #2570.
	Author: Cheng Hao <hao.cheng@intel.com>
	Closes #3336 from chenghao-intel/createtbl and squashes the following commits:
	3563142 [Cheng Hao] remove the unused variable
	e215187 [Cheng Hao] eliminate the compiling warning
	4f97f14 [Cheng Hao] fix bug in unittest
	5d58812 [Cheng Hao] revert the API changes
	b85b620 [Cheng Hao] fix the regression of temp tabl not found in CTAS
	(cherry picked from commit 51b1fe1426ffecac6c4644523633ea1562ff9a4e)
	Signed-off-by: Michael Armbrust <michael@databricks.com>
*	SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN	Sandy Ryza	2014-12-08	1	-2/+2
	Author: Sandy Ryza <sandy@cloudera.com>
	Closes #3624 from sryza/sandy-spark-4770 and squashes the following commits:
	bd81a3a [Sandy Ryza] SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN
	(cherry picked from commit cda94d15ea2a70ed3f0651ba2766b1e2f80308c1)
	Signed-off-by: Josh Rosen <joshrosen@databricks.com>
*	[SPARK-4774] [SQL] Makes HiveFromSpark more portable	Kostas Sakellis	2014-12-08	1	-2/+11
	HiveFromSpark read the kv1.txt file from SPARK_HOME/examples/src/main/resources/kv1.txt, which assumed you had a source tree checked out. Now we copy the kv1.txt file to a temporary file and delete it when the JVM shuts down. This allows us to run this example outside of a Spark source tree.
	Author: Kostas Sakellis <kostas@cloudera.com>
	Closes #3628 from ksakellis/kostas-spark-4774 and squashes the following commits:
	6770f83 [Kostas Sakellis] [SPARK-4774] [SQL] Makes HiveFromSpark more portable
	(cherry picked from commit d6a972b3e4dc35a2d95df47d256462b325f4bda6)
	Signed-off-by: Michael Armbrust <michael@databricks.com>
*	[SPARK-4620] Add unpersist in Graph and GraphImpl	Takeshi Yamamuro	2014-12-07	2	-0/+12
	Add an interface to uncache both vertices and edges of Graph/GraphImpl. This interface is useful when iterative graph operations build a new graph in each iteration, and the vertices and edges of previous iterations are no longer needed for following iterations.
	Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
	This patch had conflicts when merged, resolved by
	Committer: Ankur Dave <ankurdave@gmail.com>
	Closes #3476 from maropu/UnpersistInGraphSpike and squashes the following commits:
	77a006a [Takeshi Yamamuro] Add unpersist in Graph and GraphImpl
	(cherry picked from commit 8817fc7fe8785d7b11138ca744f22f7e70f1f0a0)
	Signed-off-by: Ankur Dave <ankurdave@gmail.com>
*	[SPARK-4646] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark	Takeshi Yamamuro	2014-12-07	2	-5/+64
	This patch just replaces a native quick sorter with Sorter(TimSort) in Spark. It showed performance gains of ~8% in my quick experiments.
	Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
	Closes #3507 from maropu/TimSortInEdgePartitionBuilderSpike and squashes the following commits:
	8d4e5d2 [Takeshi Yamamuro] Remove a wildcard import
	3527e00 [Takeshi Yamamuro] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
	(cherry picked from commit 2e6b736b0e6e5920d0523533c87832a53211db42)
	Signed-off-by: Ankur Dave <ankurdave@gmail.com>
*	[SPARK-3623][GraphX] GraphX should support the checkpoint operation	GuoQiang Li	2014-12-06	3	-0/+34
	Author: GuoQiang Li <witgo@qq.com>
	Closes #2631 from witgo/SPARK-3623 and squashes the following commits:
	a70c500 [GuoQiang Li] Remove java related
	4d1e249 [GuoQiang Li] Add comments
	e682724 [GuoQiang Li] Graph should support the checkpoint operation
	(cherry picked from commit e895e0cbecbbec1b412ff21321e57826d2d0a982)
	Signed-off-by: Ankur Dave <ankurdave@gmail.com>
*	Streaming doc : do you mean inadvertently?	CrazyJvm	2014-12-05	1	-1/+1
	Author: CrazyJvm <crazyjvm@gmail.com>
	Closes #3620 from CrazyJvm/streaming-foreachRDD and squashes the following commits:
	b72886b [CrazyJvm] do you mean inadvertently?
	(cherry picked from commit 6eb1b6f6204ea3c8083af3fb9cd990d9f3dac89d)
	Signed-off-by: Reynold Xin <rxin@databricks.com>
*	[SPARK-4761][SQL] Enables Kryo by default in Spark SQL Thrift server	Cheng Lian	2014-12-05	1	-2/+12
	Enables Kryo and disables reference tracking by default in Spark SQL Thrift server. Configurations explicitly defined by users in `spark-defaults.conf` are respected (the Thrift server is started by `spark-submit`, which handles configuration properties properly).
	Author: Cheng Lian <lian@databricks.com>
	Closes #3621 from liancheng/kryo-by-default and squashes the following commits:
	70c2775 [Cheng Lian] Enables Kryo by default in Spark SQL Thrift server
	(cherry picked from commit 6f61e1f961826a6c9e98a66d10b271b7e3c7dd55)
	Signed-off-by: Patrick Wendell <pwendell@gmail.com>
*	[SPARK-4753][SQL] Use catalyst for partition pruning in newParquet.	Michael Armbrust	2014-12-04	1	-30/+28
	Author: Michael Armbrust <michael@databricks.com>
	Closes #3613 from marmbrus/parquetPartitionPruning and squashes the following commits:
	4f138f8 [Michael Armbrust] Use catalyst for partition pruning in newParquet.
	(cherry picked from commit f5801e813f3c2573ebaf1af839341489ddd3ec78)
	Signed-off-by: Patrick Wendell <pwendell@gmail.com>
*	Revert "SPARK-2624 add datanucleus jars to the container in yarn-cluster"	Andrew Or	2014-12-04	3	-157/+0
	This reverts commit a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53.
*	Revert "[HOT FIX] [YARN] Check whether `/lib` exists before listing its files"	Andrew Or	2014-12-04	1	-15/+12
	This reverts commit 38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce.
*	[SPARK-4464] Description about configuration options need to be modified in docs.	Masayoshi TSUZUKI	2014-12-04	1	-2/+10
	Added description about -h and -host.
	Modified description about -i and -ip which are now deprecated.
	Added description about --properties-file.
	Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
	Closes #3329 from tsudukim/feature/SPARK-4464 and squashes the following commits:
	6c07caf [Masayoshi TSUZUKI] [SPARK-4464] Description about configuration options need to be modified in docs.
	(cherry picked from commit ca379039f701e423fa07933db4e063cb85d0236a)
	Signed-off-by: Josh Rosen <joshrosen@databricks.com>
*	Fix typo in Spark SQL docs.	Andy Konwinski	2014-12-04	1	-1/+1
	Author: Andy Konwinski <andykonwinski@gmail.com>
	Closes #3611 from andyk/patch-3 and squashes the following commits:
	7bab333 [Andy Konwinski] Fix typo in Spark SQL docs.
	(cherry picked from commit 15cf3b0125fe238dea2ce13e703034ba7cef477f)
	Signed-off-by: Josh Rosen <joshrosen@databricks.com>
*	[SPARK-4421] Wrong link in spark-standalone.html	Masayoshi TSUZUKI	2014-12-04	1	-1/+1
	Modified the link of building Spark.
	Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
	Closes #3279 from tsudukim/feature/SPARK-4421 and squashes the following commits:
	56e31c1 [Masayoshi TSUZUKI] Modified the link of building Spark.
	(cherry picked from commit ddfc09c36381a0880dfa6778be2ca0bc7d80febf)
	Signed-off-by: Josh Rosen <joshrosen@databricks.com>
*	[SPARK-4652][DOCS] Add docs about spark-git-repo option	lewuathe	2014-12-04	1	-0/+5
	There may be cases where a work-in-progress (WIP) version of Spark needs to be run on an EC2 cluster. To make it easier to set up this type of cluster, add a description of the --spark-git-repo option to the EC2 documentation.
	Author: lewuathe <lewuathe@me.com>
	Author: Josh Rosen <joshrosen@databricks.com>
	Closes #3513 from Lewuathe/doc-for-development-spark-cluster and squashes the following commits:
	6dae8ee [lewuathe] Wrap consistent with other descriptions
	cfaf9be [lewuathe] Add docs about spark-git-repo option
	(Editing / cleanup by Josh Rosen)
	(cherry picked from commit ab8177da2defab1ecd8bc0cd5a21f07be5b8d2c5)
	Signed-off-by: Josh Rosen <joshrosen@databricks.com>
*	[SPARK-4459] Change groupBy type parameter from K to U	Saldanha	2014-12-04	2	-7/+51
	Please see https://issues.apache.org/jira/browse/SPARK-4459
	Author: Saldanha <saldaal1@phusca-l24858.wlan.na.novartis.net>
	Closes #3327 from alokito/master and squashes the following commits:
	54b1095 [Saldanha] [SPARK-4459] changed type parameter for keyBy from K to U
	d5f73c3 [Saldanha] [SPARK-4459] added keyBy test
	316ad77 [Saldanha] SPARK-4459 changed type parameter for groupBy from K to U.
	62ddd4b [Saldanha] SPARK-4459 added failing unit test
	(cherry picked from commit 743a889d2778f797aabc3b1e8146e7aa32b62a48)
	Signed-off-by: Josh Rosen <joshrosen@databricks.com>
*	[SPARK-4745] Fix get_existing_cluster() function with multiple security groups	alexdebrie	2014-12-04	1	-2/+2
	The current get_existing_cluster() function would only find an instance that belongs to a cluster if the instance's security groups == cluster_name + "-master" (or "-slaves"). This fix allows for multiple security groups by checking whether the cluster_name + "-master" security group is in the list of groups for a particular instance.
	Author: alexdebrie <alexdebrie1@gmail.com>
	Closes #3596 from alexdebrie/master and squashes the following commits:
	9d51232 [alexdebrie] Fix get_existing_cluster() function with multiple security groups
	(cherry picked from commit 794f3aec24acb578e258532ad0590554d07958ba)
	Signed-off-by: Josh Rosen <joshrosen@databricks.com>
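The change from an equality test to a membership test described in SPARK-4745 can be sketched as below. The instance representation (a dict with a `groups` list) and the function name are assumptions for illustration; the real script works with EC2 instance objects.

```python
def find_cluster_instances(instances, cluster_name):
    """Split instances into masters and slaves by security-group membership."""
    master_group = cluster_name + "-master"
    slave_group = cluster_name + "-slaves"
    # Membership ("in") still matches when an instance carries extra
    # security groups; comparing the whole list with "==" would not.
    masters = [i for i in instances if master_group in i["groups"]]
    slaves = [i for i in instances if slave_group in i["groups"]]
    return masters, slaves
```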
*	Preparing development version 1.2.1-SNAPSHOT	Patrick Wendell	2014-12-04	29	-29/+29
*	Preparing Spark release v1.2.0-rc2	Patrick Wendell	2014-12-04	29	-29/+29
*	[HOTFIX] Fixing two issues with the release script.	Patrick Wendell	2014-12-04	1	-11/+20
	1. The version replacement was still producing some false changes.
	2. Uploads to the staging repo specifically.
	Author: Patrick Wendell <pwendell@gmail.com>
	Closes #3608 from pwendell/release-script and squashes the following commits:
	3c63294 [Patrick Wendell] Fixing two issues with the release script:
	(cherry picked from commit 8dae26f83818ee0f5ce8e5b083625170d2e901c5)
	Signed-off-by: Patrick Wendell <pwendell@gmail.com>
*	[SPARK-4253] Ignore spark.driver.host in yarn-cluster and standalone-cluster modes	WangTaoTheTonic	2014-12-04	2	-1/+6
	In yarn-cluster and standalone-cluster modes, we don't know where the driver will run until it is launched. If the `spark.driver.host` property is set on the submitting machine and propagated to the driver through SparkConf, then this will lead to errors when the driver launches. This patch fixes this issue by dropping the `spark.driver.host` property in SparkSubmit when running in a cluster deploy mode.
	Author: WangTaoTheTonic <barneystinson@aliyun.com>
	Author: WangTao <barneystinson@aliyun.com>
	Closes #3112 from WangTaoTheTonic/SPARK4253 and squashes the following commits:
	ed1a25c [WangTaoTheTonic] revert unrelated formatting issue
	02c4e49 [WangTao] add comment
	32a3f3f [WangTaoTheTonic] ingore it in SparkSubmit instead of SparkContext
	667cf24 [WangTaoTheTonic] document fix
	ff8d5f7 [WangTaoTheTonic] also ignore it in standalone cluster mode
	2286e6b [WangTao] ignore spark.driver.host in yarn-cluster mode
	(cherry picked from commit 8106b1e36b2c2b9f5dc5d7252540e48cc3fc96d5)
	Signed-off-by: Josh Rosen <joshrosen@databricks.com>
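The behaviour described for SPARK-4253 amounts to filtering one property out of the submitted configuration. A minimal sketch, assuming the configuration is a plain dict and the deploy mode is the string `"client"` or `"cluster"`; `prepare_submit_conf` is a hypothetical name, not a SparkSubmit method.

```python
def prepare_submit_conf(conf, deploy_mode):
    """Return a copy of conf suitable for the given deploy mode."""
    result = dict(conf)
    if deploy_mode == "cluster":
        # The driver's host is unknown until it launches on the cluster,
        # so a value from the submitting machine must not be propagated.
        result.pop("spark.driver.host", None)
    return result
```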
*	Revert "Preparing Spark release v1.2.0-rc1"	Patrick Wendell	2014-12-04	29	-29/+29
	This reverts commit 1056e9ec13203d0c51564265e94d77a054498fdb.
*	Revert "Preparing development version 1.2.1-SNAPSHOT"	Patrick Wendell	2014-12-04	29	-30/+30
	This reverts commit 00316cc87983b844f6603f351a8f0b84fe1f6035.
*	Revert "HOTFIX: Rolling back incorrect version change"	Patrick Wendell	2014-12-04	1	-1/+1
	This reverts commit 3a4609eada2ee0bfbcce0f4127b6a5363ae528e5.
*	[SPARK-4683][SQL] Add a beeline.cmd to run on Windows	Cheng Lian	2014-12-04	1	-0/+21
	Tested locally with a Win7 VM. Connected to a Spark SQL Thrift server instance running on Mac OS X with the following command line:
	```
	bin\beeline.cmd -u jdbc:hive2://10.0.2.2:10000 -n lian
	```
	Author: Cheng Lian <lian@databricks.com>
	Closes #3599 from liancheng/beeline.cmd and squashes the following commits:
	79092e7 [Cheng Lian] Windows script for BeeLine
	(cherry picked from commit 28c7acacef974fdabd2b9ecc20d0d6cf6c58728f)
	Signed-off-by: Patrick Wendell <pwendell@gmail.com>
*	[FIX][DOC] Fix broken links in ml-guide.md	Xiangrui Meng	2014-12-04	4	-7/+5
	and some minor changes in ScalaDoc.
	Author: Xiangrui Meng <meng@databricks.com>
	Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:
	c559768 [Xiangrui Meng] minor code update
	ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
	0b5c182 [Xiangrui Meng] fix links in ml-guide
	(cherry picked from commit 7e758d709286e73d2c878d4a2d2b4606386142c7)
	Signed-off-by: Xiangrui Meng <meng@databricks.com>
*	[SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes	Joseph K. Bradley	2014-12-04	17	-24/+1205
	Documentation:
	* Added ml-guide.md, linked from mllib-guide.md
	* Updated mllib-guide.md with small section pointing to ml-guide.md
	Examples:
	* CrossValidatorExample
	* SimpleParamsExample
	* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)
	Bug fixes:
	* PipelineModel: did not use ParamMaps correctly
	* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)
	CC: mengxr shivaram etrain
	Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.
	Author: Joseph K. Bradley <joseph@databricks.com>
	Author: jkbradley <joseph.kurata.bradley@gmail.com>
	Author: Xiangrui Meng <meng@databricks.com>
	Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:
	d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml
	c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
	99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
	ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
	3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
	41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
	(cherry picked from commit 469a6e5f3bdd5593b3254bc916be8236e7c6cb74)
	Signed-off-by: Xiangrui Meng <meng@databricks.com>
*	[docs] Fix outdated comment in tuning guide	Joseph K. Bradley	2014-12-04	1	-2/+1
	When you use the SPARK_JAVA_OPTS env variable, Spark complains:
	```
	SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ').
	This is deprecated in Spark 1.0+.
	Please instead use:
	 - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
	 - ./spark-submit with --driver-java-options to set -X options for a driver
	 - spark.executor.extraJavaOptions to set -X options for executors
	 - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
	```
	This updates the docs to redirect the user to the relevant part of the configuration docs.
	CC: mengxr but please CC someone else as needed
	Author: Joseph K. Bradley <joseph@databricks.com>
	Closes #3592 from jkbradley/tuning-doc and squashes the following commits:
	0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide
	(cherry picked from commit 529439bd506949f272a2b6f099ea549b097428f3)
	Signed-off-by: Reynold Xin <rxin@databricks.com>
*	[SQL] Minor: Avoid calling Seq#size in a loop	Aaron Davidson	2014-12-04	1	-3/+3
	Just found this instance while doing some jstack-based profiling of a Spark SQL job. It is very unlikely that this is causing much of a perf issue anywhere, but it is unnecessarily suboptimal.
	Author: Aaron Davidson <aaron@databricks.com>
	Closes #3593 from aarondav/seq-opt and squashes the following commits:
	962cdfc [Aaron Davidson] [SQL] Minor: Avoid calling Seq#size in a loop
	(cherry picked from commit c6c7165e7ecf1690027d6bd4e0620012cd0d2310)
	Signed-off-by: Reynold Xin <rxin@databricks.com>
*	[SPARK-4685] Include all spark.ml and spark.mllib packages in JavaDoc's MLlib group	lewuathe	2014-12-04	1	-1/+4
	This is #3554 from Lewuathe except that I put both `spark.ml` and `spark.mllib` in the group `MLlib`.
	Closes #3554
	jkbradley
	Author: lewuathe <lewuathe@me.com>
	Author: Xiangrui Meng <meng@databricks.com>
	Closes #3598 from mengxr/Lewuathe-modify-javadoc-setting and squashes the following commits:
	184609a [Xiangrui Meng] merge spark.ml and spark.mllib into the same group in javadoc
	f7535e6 [lewuathe] [SPARK-4685] Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
	(cherry picked from commit 20bfea4ab7c0923e8d3f039d0c5098669db4d5b0)
	Signed-off-by: Xiangrui Meng <meng@databricks.com>
*	[Release] Correctly translate contributors name in release notes	Andrew Or	2014-12-03	4	-56/+230
	This commit involves three main changes:
	(1) It separates the translation of contributor names from the generation of the contributors list. This is largely motivated by the Github API limit; even if we exceed this limit, we should at least be able to proceed manually as before. This is why the translation logic is abstracted into its own script translate-contributors.py.
	(2) When we look for candidate replacements for invalid author names, we should look for the assignees of the associated JIRAs too. As a result, the intermediate file must keep track of these.
	(3) This provides an interactive mode with which the user can sit at the terminal and manually pick the candidate replacement that he/she thinks makes the most sense. As before, there is a non-interactive mode that picks the first candidate that the script considers "valid."
	TODO: We should have a known_contributors file that stores known mappings so we don't have to go through all of this translation every time. This is also valuable because some contributors simply cannot be automatically translated.
	Conflicts: .gitignore
*	[SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix	Joseph K. Bradley	2014-12-04	19	-182/+1140
	Major changes:
	* Added programming guide sections for tree ensembles
	* Added examples for tree ensembles
	* Updated DecisionTree programming guide with more info on parameters
	* API change: Standardized the tree parameter for the number of classes (for classification)
	Minor changes:
	* Updated decision tree documentation
	* Updated existing tree and tree ensemble examples
	* Use train/test split, and compute test error instead of training error.
	* Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)
	Note: I know this is a lot of lines, but most is covered by:
	* Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
	* New examples (which were copied from the programming guide)
	* The "numClasses" renaming
	I have run all examples and relevant unit tests.
	CC: mengxr manishamde codedeft
	Author: Joseph K. Bradley <joseph@databricks.com>
	Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
	Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:
	70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
	d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
	8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
	6fab846 [Joseph K. Bradley] small fixes based on review
	b9f8576 [Joseph K. Bradley] updated decision tree doc
	375204c [Joseph K. Bradley] fixed python style
	2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide.
	706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
	c76c823 [Joseph K. Bradley] added migration guide for mllib
	abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
	07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
	cdfdfbc [Joseph K. Bradley] added examples for GBT
	6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them.
	ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples
	(cherry picked from commit 657a88835d8bf22488b53d50f75281d7dc32442e)
	Signed-off-by: Xiangrui Meng <meng@databricks.com>
*	[SPARK-4711] [mllib] [docs] Programming guide advice on choosing optimizer	Joseph K. Bradley	2014-12-04	2	-9/+18
	I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).
	CC: mengxr
	Author: Joseph K. Bradley <joseph@databricks.com>
	Closes #3569 from jkbradley/lr-doc and squashes the following commits:
	654aeb5 [Joseph K. Bradley] updated section header for mllib-optimization
	5035ad0 [Joseph K. Bradley] updated based on review
	94f6dec [Joseph K. Bradley] Updated linear methods and optimization docs with quick advice on choosing an optimization method
	(cherry picked from commit 27ab0b8a03b711e8d86b6167df833f012205ccc7)
	Signed-off-by: Xiangrui Meng <meng@databricks.com>
*	[SPARK-4085] Propagate FetchFailedException when Spark fails to read local shuffle file.	Reynold Xin	2014-12-03	3	-13/+40
	cc aarondav kayousterhout pwendell
	This should go into 1.2?
	Author: Reynold Xin <rxin@databricks.com>
	Closes #3579 from rxin/SPARK-4085 and squashes the following commits:
	255b4fd [Reynold Xin] Updated test.
	f9814d9 [Reynold Xin] Code review feedback.
	2afaf35 [Reynold Xin] [SPARK-4085] Propagate FetchFailedException when Spark fails to read local shuffle file.
	(cherry picked from commit 1826372d0a1bc80db9015106dd5d2d155ada33f5)
	Signed-off-by: Patrick Wendell <pwendell@gmail.com>
*	[SPARK-4498][core] Don't transition ExecutorInfo to RUNNING until Driver adds Executor	Mark Hamstra	2014-12-03	2	-2/+1
	The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master. Otherwise, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting.
	Author: Mark Hamstra <markhamstra@gmail.com>
	Closes #3550 from markhamstra/SPARK-4498 and squashes the following commits:
	8f543b1 [Mark Hamstra] Don't transition ExecutorInfo to RUNNING until Executor is added by Driver
*	[SPARK-4552][SQL] Avoid exception when reading empty parquet data through Hive	Michael Armbrust	2014-12-03	3	-45/+62
	This is a very small fix that catches one specific exception and returns an empty table. #3441 will address this in a more principled way.
	Author: Michael Armbrust <michael@databricks.com>
	Closes #3586 from marmbrus/fixEmptyParquet and squashes the following commits:
	2781d9f [Michael Armbrust] Handle empty lists for newParquet
	04dd376 [Michael Armbrust] Avoid exception when reading empty parquet data through Hive
	(cherry picked from commit 513ef82e85661552e596d0b483b645ac24e86d4d)
	Signed-off-by: Michael Armbrust <michael@databricks.com>
*	[HOT FIX] [YARN] Check whether `/lib` exists before listing its files	Andrew Or	2014-12-03	1	-12/+15
	This is caused by a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53
	Author: Andrew Or <andrew@databricks.com>
	Closes #3589 from andrewor14/yarn-hot-fix and squashes the following commits:
	a4fad5f [Andrew Or] Check whether lib directory exists before listing its files
	(cherry picked from commit 90ec643e9af4c8bbb9000edca08c07afb17939c7)
	Signed-off-by: Andrew Or <andrew@databricks.com>
*	[SPARK-4642] Add description about spark.yarn.queue to running-on-YARN document.	Masayoshi TSUZUKI	2014-12-03	1	-1/+8
	Added descriptions about these parameters.
	- spark.yarn.queue
	Modified description about the default value of this parameter.
	- spark.yarn.submit.file.replication
	Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
	Closes #3500 from tsudukim/feature/SPARK-4642 and squashes the following commits:
	ce99655 [Masayoshi TSUZUKI] better gramatically.
	21cf624 [Masayoshi TSUZUKI] Removed intentionally undocumented properties.
	88cac9b [Masayoshi TSUZUKI] [SPARK-4642] Documents about running-on-YARN needs update
	(cherry picked from commit 692f49378f7d384d5c9c5ab7451a1c1e66f91c50)
	Signed-off-by: Andrew Or <andrew@databricks.com>
*	[SPARK-4715][Core] Make sure tryToAcquire won't return a negative value	zsxwing	2014-12-03	2	-3/+19
	ShuffleMemoryManager.tryToAcquire may return a negative value. The unit test demonstrates this bug. It will output `0 did not equal -200 granted is negative`.
	Author: zsxwing <zsxwing@gmail.com>
	Closes #3575 from zsxwing/SPARK-4715 and squashes the following commits:
	a193ae6 [zsxwing] Make sure tryToAcquire won't return a negative value
	(cherry picked from commit edd3cd477c9d6016bd977c2fa692fdeff5a6e198)
	Signed-off-by: Andrew Or <andrew@databricks.com>
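One way a grant can go negative, and how clamping avoids it, is sketched below. This assumes a fair-share policy in which each active thread may hold at most `max_memory / num_active_threads`; the function and the numbers in the usage note are illustrative and not taken from ShuffleMemoryManager itself.

```python
def max_to_grant(requested, max_memory, num_active_threads, current):
    """Bytes that may be granted to a thread already holding `current`."""
    fair_share = max_memory // num_active_threads
    # If another thread became active after this one acquired memory,
    # fair_share can drop below `current`, making the difference negative.
    # Clamp at zero so the caller is never "granted" a negative amount.
    return max(0, min(requested, fair_share - current))
```

For instance, with 1000 bytes shared by 2 threads, a thread that already holds 700 and asks for 100 more gets 0 here; without the `max(0, ...)` clamp the same arithmetic would yield -200.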
*	[SPARK-4701] Typo in sbt/sbt	Masayoshi TSUZUKI	2014-12-03	1	-2/+2
	Fixed a typo.
	Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
	Closes #3560 from tsudukim/feature/SPARK-4701 and squashes the following commits:
	ed2a3f1 [Masayoshi TSUZUKI] Another whitespace position error.
	1af3a35 [Masayoshi TSUZUKI] [SPARK-4701] Typo in sbt/sbt
	(cherry picked from commit 96786e3ee53a13a57463b74bec0e77b172f719a3)
	Signed-off-by: Andrew Or <andrew@databricks.com>