Commit message | Author | Date | Files | Lines
* Preparing development version 1.4.0-SNAPSHOT | Patrick Wendell | 2015-05-23 | 30 files | -30/+30

* Preparing Spark release v1.4.0-rc2 | Patrick Wendell | 2015-05-23 | 30 files | -30/+30

* [SPARK-7287] [HOTFIX] Disable o.a.s.deploy.SparkSubmitSuite --packages | Patrick Wendell | 2015-05-23 | 1 file | -1/+2

* Preparing development version 1.4.1-SNAPSHOT | Patrick Wendell | 2015-05-23 | 30 files | -30/+30

* Preparing Spark release v1.4.0-rc2-test | Patrick Wendell | 2015-05-23 | 30 files | -30/+30

* Preparing development version 1.4.1-SNAPSHOT | Patrick Wendell | 2015-05-23 | 30 files | -30/+30

* Preparing Spark release 1.4.0-rc2-test | Patrick Wendell | 2015-05-23 | 30 files | -30/+30

* [HOTFIX] Copy SparkR lib if it exists in make-distribution | Shivaram Venkataraman | 2015-05-23 | 1 file | -2/+5
    This is to fix an issue reported in #6373 where the `cp` would fail if `-Psparkr` was not used in the build.
    cc dragos pwendell
    Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    Closes #6379 from shivaram/make-distribution-hotfix and squashes the following commits:
      08eb7e4 [Shivaram Venkataraman] Copy SparkR lib if it exists in make-distribution
    (cherry picked from commit b231baa24857ea83c8062dd4e033db4e35bf457d)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

* [SPARK-7654] [SQL] Move insertInto into reader/writer interface. | Yin Huai | 2015-05-23 | 14 files | -89/+116
    This one continues the work of https://github.com/apache/spark/pull/6216.
    Author: Yin Huai <yhuai@databricks.com>
    Author: Reynold Xin <rxin@databricks.com>
    Closes #6366 from yhuai/insert and squashes the following commits:
      3d717fb [Yin Huai] Use insertInto to handle the case when the table exists and Append is used for saveAsTable.
      56d2540 [Yin Huai] Add PreWriteCheck to HiveContext's analyzer.
      c636e35 [Yin Huai] Remove unnecessary empty lines.
      cf83837 [Yin Huai] Move insertInto to write. Also, remove the partition columns from InsertIntoHadoopFsRelation.
      0841a54 [Reynold Xin] Removed experimental tag for deprecated methods.
      33ed8ef [Reynold Xin] [SPARK-7654][SQL] Move insertInto into reader/writer interface.
    (cherry picked from commit 2b7e63585d61be2dab78b70af3867cda3983d5b1)
    Signed-off-by: Yin Huai <yhuai@databricks.com>

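A minimal sketch of the call site after this relocation, assuming the 1.4 spark-shell where `sqlContext` is predefined; the table names are hypothetical:

```scala
import org.apache.spark.sql.SaveMode

// Sketch only: hypothetical tables. With insertInto moved onto the writer
// interface, inserting into an existing table goes through df.write.
val logs = sqlContext.table("raw_logs")   // hypothetical source table
logs.write
  .mode(SaveMode.Append)                  // Append into an existing table
  .insertInto("archived_logs")            // hypothetical target, reached via df.write
```
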
* [SPARK-7840] add insertInto() to Writer | Davies Liu | 2015-05-23 | 2 files | -8/+16
    Add tests later.
    Author: Davies Liu <davies@databricks.com>
    Closes #6375 from davies/insertInto and squashes the following commits:
      826423e [Davies Liu] add insertInto() to Writer
    (cherry picked from commit be47af1bdba469f84775c2b5936f8cb956c7c02b)
    Signed-off-by: Davies Liu <davies@databricks.com>

* [SPARK-7322, SPARK-7836, SPARK-7822][SQL] DataFrame window function related updates | Davies Liu | 2015-05-23 | 10 files | -174/+464
    1. ntile should take an integer as a parameter.
    2. Added Python API (based on #6364).
    3. Update documentation of various DataFrame Python functions.
    Author: Davies Liu <davies@databricks.com>
    Author: Reynold Xin <rxin@databricks.com>
    Closes #6374 from rxin/window-final and squashes the following commits:
      69004c7 [Reynold Xin] Style fix.
      288cea9 [Reynold Xin] Update documentation.
      7cb8985 [Reynold Xin] Merge pull request #6364 from davies/window
      66092b4 [Davies Liu] update docs
      ed73cb4 [Reynold Xin] [SPARK-7322][SQL] Improve DataFrame window function documentation.
      ef55132 [Davies Liu] Merge branch 'master' of github.com:apache/spark into window4
      8936ade [Davies Liu] fix maxint in python 3
      2649358 [Davies Liu] update docs
      778e2c0 [Davies Liu] SPARK-7836 and SPARK-7822: Python API of window functions
    (cherry picked from commit efe3bfdf496aa6206ace2697e31dd4c0c3c824fb)
    Signed-off-by: Yin Huai <yhuai@databricks.com>

* [SPARK-7777][Streaming] Handle the case when there is no block in a batch | zsxwing | 2015-05-23 | 2 files | -18/+60
    In the old implementation, if a batch has no block, `areWALRecordHandlesPresent` will be `true` and it will return `WriteAheadLogBackedBlockRDD`. This PR handles this case by returning `WriteAheadLogBackedBlockRDD` or `BlockRDD` according to the configuration.
    Author: zsxwing <zsxwing@gmail.com>
    Closes #6372 from zsxwing/SPARK-7777 and squashes the following commits:
      788f895 [zsxwing] Handle the case when there is no block in a batch
    (cherry picked from commit ad0badba1450295982738934da2cc121cde18213)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

* [SPARK-6811] Copy SparkR lib in make-distribution.sh | Shivaram Venkataraman | 2015-05-23 | 6 files | -2/+43
    This change also removes native libraries from SparkR to make sure our distribution works across platforms.
    Tested by building on Mac, running on Amazon Linux (CentOS), Windows VM and vice-versa (built on Linux, run on Mac). I will also test this with YARN soon and update this PR.
    Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    Closes #6373 from shivaram/sparkr-binary and squashes the following commits:
      ae41b5c [Shivaram Venkataraman] Remove native libraries from SparkR. Also include the built SparkR package in make-distribution.sh
    (cherry picked from commit a40bca0111de45763c3ef4270afb2185c16b8f95)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

* [SPARK-6806] [SPARKR] [DOCS] Fill in SparkR examples in programming guide | Davies Liu | 2015-05-23 | 14 files | -323/+706
    sqlCtx -> sqlContext
    You can check the docs by:
    ```
    $ cd docs
    $ SKIP_SCALADOC=1 jekyll serve
    ```
    cc shivaram
    Author: Davies Liu <davies@databricks.com>
    Closes #5442 from davies/r_docs and squashes the following commits:
      7a12ec6 [Davies Liu] remove rdd in R docs
      8496b26 [Davies Liu] remove the docs related to RDD
      e23b9d6 [Davies Liu] delete R docs for RDD API
      222e4ff [Davies Liu] Merge branch 'master' into r_docs
      89684ce [Davies Liu] Merge branch 'r_docs' of github.com:davies/spark into r_docs
      f0a10e1 [Davies Liu] address comments from @shivaram
      f61de71 [Davies Liu] Update pairRDD.R
      3ef7cf3 [Davies Liu] use + instead of function(a,b) a+b
      2f10a77 [Davies Liu] address comments from @cafreeman
      9c2a062 [Davies Liu] mention R api together with Python API
      23f751a [Davies Liu] Fill in SparkR examples in programming guide
    (cherry picked from commit 7af3818c6b2bf35bfa531ab7cc3a4a714385015e)
    Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

* [SPARK-7838] [STREAMING] Set scope for kinesis stream | Tathagata Das | 2015-05-22 | 2 files | -4/+7
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Closes #6369 from tdas/SPARK-7838 and squashes the following commits:
      87d1c7f [Tathagata Das] Addressed comment
      37775d8 [Tathagata Das] set scope for kinesis stream
    (cherry picked from commit baa89838cca96fa091c9e5ce62be01e1a265d820)
    Signed-off-by: Andrew Or <andrew@databricks.com>

* [MINOR] Add SparkR to create-release script | Shivaram Venkataraman | 2015-05-22 | 1 file | -8/+8
    Enables the SparkR profiles for all the binary builds we create.
    cc pwendell
    Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    Closes #6371 from shivaram/sparkr-create-release and squashes the following commits:
      ca5a0b2 [Shivaram Venkataraman] Add -Psparkr to create-release.sh
    (cherry picked from commit 017b3404a50bd4b04ed73c5a69acb7b19a929822)
    Signed-off-by: Patrick Wendell <patrick@databricks.com>

* [SPARK-7830] [DOCS] [MLLIB] Adding logistic regression to the list of Multiclass Classification Supported Methods documentation | Mike Dusenberry | 2015-05-22 | 1 file | -2/+2
    Added logistic regression to the list of Multiclass Classification Supported Methods in the MLlib Classification and Regression documentation, as it was missing.
    Author: Mike Dusenberry <dusenberrymw@gmail.com>
    Closes #6357 from dusenberrymw/Add_LR_To_List_Of_Multiclass_Classification_Methods and squashes the following commits:
      7918650 [Mike Dusenberry] Updating broken link due to the "Binary Classification" section on the Linear Methods page being renamed to "Classification".
      3005dc2 [Mike Dusenberry] Adding logistic regression to the list of Multiclass Classification Supported Methods in the MLlib Classification and Regression documentation, as it was missing.
    (cherry picked from commit 63a5ce75eac48a297751ac505d70ce4d47daf903)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

* [SPARK-7224] [SPARK-7306] mock repository generator for --packages tests without nio.Path | Burak Yavuz | 2015-05-22 | 5 files | -100/+404
    The previous PR for SPARK-7224 (#5790) broke JDK 6, because it used java.nio.Path, which is in JDK 7 but not in 6. This PR uses Guava's `Files` to handle directory creation, etc.
    The description from the previous PR:
    > This patch contains an `IvyTestUtils` file, which dynamically generates jars and pom files to test the `--packages` feature without having to rely on the internet, and Maven Central.
    cc pwendell
    I also ran the flaky test about 20 times locally; it didn't fail a single time, but I think it may fail like once every 100 builds? I still haven't figured out the cause, but the test before it, `--jars`, was also failing after we turned off the `--packages` test in `SparkSubmitSuite`. It may be related to the launch of SparkSubmit.
    Author: Burak Yavuz <brkyvz@gmail.com>
    Closes #5892 from brkyvz/maven-utils and squashes the following commits:
      e9b1903 [Burak Yavuz] fix merge conflict
      68214e0 [Burak Yavuz] remove ignore for test(neglect spark dependencies)
      e632381 [Burak Yavuz] fix ignore
      9ef1408 [Burak Yavuz] re-enable --packages test
      22eea62 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into maven-utils
      05cd0de [Burak Yavuz] added mock repository generator
    (cherry picked from commit 8014e1f6bb871d9fd4db74106eb4425d0c1e9dd6)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>

* [SPARK-7788] Made KinesisReceiver.onStart() non-blocking | Tathagata Das | 2015-05-22 | 1 file | -5/+25
    KinesisReceiver calls worker.run(), which is a blocking call (a while loop) per the source code of the kinesis-client library: https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in an infinite loop while calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true), perhaps because the ReceiverTracker is never able to register the receiver (its receiverInfo field is an empty map), causing it to be stuck in an infinite loop while waiting for the running flag to be set to false.
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Closes #6348 from tdas/SPARK-7788 and squashes the following commits:
      2584683 [Tathagata Das] Added receiver id in thread name
      6cf1cd4 [Tathagata Das] Made KinesisReceiver.onStart non-blocking
    (cherry picked from commit 1c388a9985999e043fa002618a357bc8f0a8b65a)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

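A self-contained sketch of the fix pattern described above, not the actual KinesisReceiver code: a blocking run loop is pushed onto its own thread so that the caller (here, what would be `onStart()`) returns immediately.

```scala
// Stand-in for the KCL Worker, whose run() blocks in a while loop.
class BlockingWorker {
  @volatile private var running = true
  def run(): Unit = while (running) Thread.sleep(100) // blocks, like Worker.run()
  def shutdown(): Unit = running = false
}

val worker = new BlockingWorker
val workerThread = new Thread("Kinesis Receiver Worker Thread") {
  override def run(): Unit = worker.run()   // the blocking loop lives here
}
workerThread.setDaemon(true)
workerThread.start()                        // the caller is not blocked

// later, during shutdown (what would be onStop()):
worker.shutdown()
workerThread.join()
```
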
* [SPARK-7771] [SPARK-7779] Dynamic allocation: lower default timeouts further | Andrew Or | 2015-05-22 | 2 files | -10/+20
    The default add time of 5s is still too slow for small jobs. Also, the current default remove time of 10 minutes seems rather high. This patch lowers both and rephrases a few log messages.
    Author: Andrew Or <andrew@databricks.com>
    Closes #6301 from andrewor14/da-minor and squashes the following commits:
      6d614a6 [Andrew Or] Lower log level
      2811492 [Andrew Or] Log information when requests are canceled
      5fcd3eb [Andrew Or] Fix tests
      3320710 [Andrew Or] Lower timeouts + rephrase a few log messages
    (cherry picked from commit 3d8760d76eae41dcaab8e9aeda19619f3d5f1596)
    Signed-off-by: Andrew Or <andrew@databricks.com>

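For anyone who prefers the old behavior, the timeouts this patch lowers remain user-overridable; a sketch with illustrative values (not the old or new defaults):

```scala
import org.apache.spark.SparkConf

// Illustrative values only.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "5s") // wait before adding executors
  .set("spark.dynamicAllocation.executorIdleTimeout", "600s")   // wait before removing an idle executor
```
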
* [SPARK-7834] [SQL] Better window error messages | Michael Armbrust | 2015-05-22 | 2 files | -0/+18
    Author: Michael Armbrust <michael@databricks.com>
    Closes #6363 from marmbrus/windowErrors and squashes the following commits:
      516b02d [Michael Armbrust] [SPARK-7834] [SQL] Better window error messages
    (cherry picked from commit 3c1305107a2d6d2de862e8b41dbad0e85585b1ef)
    Signed-off-by: Michael Armbrust <michael@databricks.com>

* [SPARK-7760] add /json back into master & worker pages; add test | Imran Rashid | 2015-05-22 | 3 files | -3/+37
    Author: Imran Rashid <irashid@cloudera.com>
    Closes #6284 from squito/SPARK-7760 and squashes the following commits:
      5e02d8a [Imran Rashid] style; increase timeout
      9987399 [Imran Rashid] comment
      8c7ed63 [Imran Rashid] add /json back into master & worker pages; add test
    (cherry picked from commit 821254fb945c3e19540eb57fff1f656737ef484b)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>

* [SPARK-7270] [SQL] Consider dynamic partition when inserting into hive table | Liang-Chi Hsieh | 2015-05-22 | 2 files | -5/+33
    JIRA: https://issues.apache.org/jira/browse/SPARK-7270
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Closes #5864 from viirya/dyn_partition_insert and squashes the following commits:
      b5627df [Liang-Chi Hsieh] For comments.
      3b21e4b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into dyn_partition_insert
      8a4352d [Liang-Chi Hsieh] Consider dynamic partition when inserting into hive table.
    (cherry picked from commit 126d7235de649ea5619dee6ad3a70970ee90df93)
    Signed-off-by: Michael Armbrust <michael@databricks.com>

* [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL. | Santiago M. Mola | 2015-05-22 | 1 file | -0/+4
    Author: Santiago M. Mola <santi@mola.io>
    Closes #6327 from smola/feature/catalyst-dsl-set-ops and squashes the following commits:
      11db778 [Santiago M. Mola] [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL.
    (cherry picked from commit e4aef91fe70d6c9765d530b913a9d79103fc27ce)
    Signed-off-by: Michael Armbrust <michael@databricks.com>

* [SPARK-7758] [SQL] Override more configs to avoid failure when connecting to PostgreSQL | WangTaoTheTonic | 2015-05-22 | 2 files | -4/+16
    https://issues.apache.org/jira/browse/SPARK-7758
    When initializing `executionHive`, we only mask `javax.jdo.option.ConnectionURL` to override the metastore location. However, other properties that relate to the actual Hive metastore data source are not masked. For example, when using Spark SQL with a PostgreSQL-backed Hive metastore, `executionHive` actually tries to use settings read from `hive-site.xml`, which talk about PostgreSQL, to connect to the temporary Derby metastore, thus causing an error. To fix this, we need to mask all metastore data source properties. Specifically, according to the code of the [Hive `ObjectStore.getDataSourceProps()` method][1], all properties whose name mentions "jdo" and "datanucleus" must be included.
    [1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288
    Tested using PostgreSQL as the metastore; it worked fine.
    Author: WangTaoTheTonic <wangtao111@huawei.com>
    Closes #6314 from WangTaoTheTonic/SPARK-7758 and squashes the following commits:
      ca7ae7c [WangTaoTheTonic] add comments
      86caf2c [WangTaoTheTonic] delete unused import
      e4f0feb [WangTaoTheTonic] block more data source related property
      92a81fa [WangTaoTheTonic] fix style check
      e3e683d [WangTaoTheTonic] override more configs to avoid failure connecting to PostgreSQL
    (cherry picked from commit 31d5d463e76b6611c854c6cf27059fec8198adc9)
    Signed-off-by: Michael Armbrust <michael@databricks.com>

* [SPARK-7766] KryoSerializerInstance reuse is unsafe when auto-reset is disabled | Josh Rosen | 2015-05-22 | 2 files | -0/+35
    SPARK-3386 / #5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization. This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled.
    Author: Josh Rosen <joshrosen@databricks.com>
    Closes #6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits:
      e19726d [Josh Rosen] Add fix for SPARK-7766.
      71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug
    (cherry picked from commit eac00691da93a94e6cff5ae0f8952e5724e78094)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>

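A sketch of the configuration that triggers the bug described above: auto-reset can only be turned off through a custom registrator, and combined with reference tracking (on by default) that was the unsafe pair.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}

// Auto-reset is only disabled via a custom registrator.
class NoAutoResetRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = kryo.setAutoReset(false)
}

val conf = new SparkConf()
  .set("spark.serializer", classOf[KryoSerializer].getName)
  .set("spark.kryo.registrator", classOf[NoAutoResetRegistrator].getName)
  .set("spark.kryo.referenceTracking", "true") // the default, shown for emphasis
```
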
* [SPARK-7574] [ML] [DOC] User guide for OneVsRest | Ram Sriharsha | 2015-05-22 | 3 files | -1/+281
    Including the Iris dataset (after shuffling and relabeling 3 -> 0 to conform to 0 -> numClasses-1 labeling). Could not find an existing dataset in data/mllib for multiclass classification.
    Author: Ram Sriharsha <rsriharsha@hw11853.local>
    Closes #6296 from harsha2010/SPARK-7574 and squashes the following commits:
      645427c [Ram Sriharsha] cleanup
      46c41b1 [Ram Sriharsha] cleanup
      2f76295 [Ram Sriharsha] Code Review Fixes
      ebdf103 [Ram Sriharsha] Java Example
      c026613 [Ram Sriharsha] Code Review fixes
      4b7d1a6 [Ram Sriharsha] minor cleanup
      13bed9c [Ram Sriharsha] add wikipedia link
      bb9dbfa [Ram Sriharsha] Clean up naming
      6f90db1 [Ram Sriharsha] [SPARK-7574][ml][doc] User guide for OneVsRest
    (cherry picked from commit 509d55ab416359fab0525189458e2ea96379cf14)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

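The rough shape of the API the new guide documents, assuming the 1.4 spark.ml package; `training` and `test` are placeholder DataFrames with "label" and "features" columns (e.g. the relabeled Iris data):

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// Wrap any binary classifier to get a multiclass one.
val classifier = new LogisticRegression().setMaxIter(10).setTol(1e-6)
val ovr = new OneVsRest().setClassifier(classifier) // trains one model per class
val ovrModel = ovr.fit(training)
val predictions = ovrModel.transform(test)
```
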
* Revert "[BUILD] Always run SQL tests in master build." | Patrick Wendell | 2015-05-22 | 2 files | -26/+17
    This reverts commit 2be72c99aa51ba46f851348af9fbb4a39923a45a.

* [SPARK-7404] [ML] Add RegressionEvaluator to spark.ml | Ram Sriharsha | 2015-05-22 | 2 files | -0/+155
    Author: Ram Sriharsha <rsriharsha@hw11853.local>
    Closes #6344 from harsha2010/SPARK-7404 and squashes the following commits:
      16b9d77 [Ram Sriharsha] consistent naming
      7f100b6 [Ram Sriharsha] cleanup
      c46044d [Ram Sriharsha] Merge with Master + Code Review Fixes
      188fa0a [Ram Sriharsha] Merge branch 'master' into SPARK-7404
      f5b6a4c [Ram Sriharsha] cleanup doc
      97beca5 [Ram Sriharsha] update test to use R packages
      32dd310 [Ram Sriharsha] fix indentation
      f93b812 [Ram Sriharsha] fix test
      1b6ebb3 [Ram Sriharsha] [SPARK-7404][ml] Add RegressionEvaluator to spark.ml
    (cherry picked from commit f490b3b4c706c92aa65d000b9d885f4d160a5f39)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>

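A minimal sketch of the new evaluator; `predictions` is a placeholder DataFrame holding "label" and "prediction" columns:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")              // mse, r2, and mae are also supported
val rmse = evaluator.evaluate(predictions)
```
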
* [SPARK-6743] [SQL] Fix empty projections of cached data | Michael Armbrust | 2015-05-22 | 4 files | -3/+20
    Author: Michael Armbrust <michael@databricks.com>
    Closes #6165 from marmbrus/wrongColumn and squashes the following commits:
      4fad158 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into wrongColumn
      aad7eab [Michael Armbrust] rxins comments
      f1e8df1 [Michael Armbrust] [SPARK-6743][SQL] Fix empty projections of cached data
    (cherry picked from commit 3b68cb0430067059e9c7b9a86dbea4865e29bf78)
    Signed-off-by: Michael Armbrust <michael@databricks.com>

* [MINOR] [SQL] Ignores Thrift server UISeleniumSuite | Cheng Lian | 2015-05-22 | 1 file | -11/+8
    This Selenium test case has been flaky for a while and led to frequent Jenkins build failures. Let's disable it temporarily until we figure out a proper solution.
    Author: Cheng Lian <lian@databricks.com>
    Closes #6345 from liancheng/ignore-selenium-test and squashes the following commits:
      09996fe [Cheng Lian] Ignores Thrift server UISeleniumSuite
    (cherry picked from commit 4e5220c3171b6a2f4970409bd16be2db930df65d)
    Signed-off-by: Cheng Lian <lian@databricks.com>

* [SPARK-7322][SQL] Window functions in DataFrame | Cheng Hao | 2015-05-22 | 13 files | -7/+807
    This closes #6104.
    Author: Cheng Hao <hao.cheng@intel.com>
    Author: Reynold Xin <rxin@databricks.com>
    Closes #6343 from rxin/window-df and squashes the following commits:
      026d587 [Reynold Xin] Address code review feedback.
      dc448fe [Reynold Xin] Fixed Hive tests.
      9794d9d [Reynold Xin] Moved Java test package.
      9331605 [Reynold Xin] Refactored API.
      3313e2a [Reynold Xin] Merge pull request #6104 from chenghao-intel/df_window
      d625a64 [Cheng Hao] Update the dataframe window API as suggested
      c141fb1 [Cheng Hao] hide all of the properties of the WindowFunctionDefinition
      3b1865f [Cheng Hao] scaladoc typos
      f3fd2d0 [Cheng Hao] polish the unit test
      6847825 [Cheng Hao] Add additional analytics functions
      57e3bc0 [Cheng Hao] typos
      24a08ec [Cheng Hao] scaladoc
      28222ed [Cheng Hao] fix bug of range/row Frame
      1d91865 [Cheng Hao] style issue
      53f89f2 [Cheng Hao] remove the over from functions.scala
      964c013 [Cheng Hao] add more unit tests and window functions
      64e18a7 [Cheng Hao] Add Window Function support for DataFrame
    (cherry picked from commit f6f2eeb17910b5d446dfd61839e37dd698d0860f)
    Signed-off-by: Reynold Xin <rxin@databricks.com>

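A sketch of the new DataFrame window API in its 1.4 form; `sales` is a placeholder DataFrame with "dept", "name", and "revenue" columns:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// A window partitioned by department, ordered by descending revenue.
val byDept = Window.partitionBy("dept").orderBy(col("revenue").desc)

val ranked = sales.select(
  col("dept"), col("name"), col("revenue"),
  rowNumber().over(byDept).as("rank"),               // position within each department
  avg(col("revenue")).over(byDept).as("running_avg") // aggregate over the window frame
)
```
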
* [SPARK-7578] [ML] [DOC] User guide for spark.ml Normalizer, IDF, StandardScaler | Joseph K. Bradley | 2015-05-21 | 4 files | -32/+351
    Added user guide sections with code examples. Also added small Java unit tests to test the Java example in the guide.
    CC: mengxr
    Author: Joseph K. Bradley <joseph@databricks.com>
    Closes #6127 from jkbradley/feature-guide-2 and squashes the following commits:
      cd47f4b [Joseph K. Bradley] Updated based on code review
      f16bcec [Joseph K. Bradley] Fixed merge issues and update Python examples print calls for Python 3
      0a862f9 [Joseph K. Bradley] Added Normalizer, StandardScaler to ml-features doc, plus small Java unit tests
      a21c2d6 [Joseph K. Bradley] Updated ml-features.md with IDF
    (cherry picked from commit 2728c3df6690c2fcd4af3bd1c604c98ef6d509a5)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>

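One of the three transformers the new guide sections cover, sketched under the assumption that the 1.4 spark.ml StandardScaler exposes these setters; `dataset` is a placeholder DataFrame with a vector-typed "features" column:

```scala
import org.apache.spark.ml.feature.StandardScaler

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)   // scale each feature to unit standard deviation
  .setWithMean(false) // centering would densify sparse vectors
val scalerModel = scaler.fit(dataset) // an Estimator: it first learns column statistics
val scaled = scalerModel.transform(dataset)
```
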
* [SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4 | Xiangrui Meng | 2015-05-21 | 16 files | -80/+84
    Some changes to the pipeline APIs:
    1. Estimator/Transformer/ doesn't need to extend Params since PipelineStage already does.
    2. Move Evaluator to ml.evaluation.
    3. Mention larger metric values are better.
    4. PipelineModel doc. "compiled" -> "fitted"
    5. Hide object PolynomialExpansion.
    6. Hide object VectorAssembler.
    7. Word2Vec.minCount (and other) -> group param
    8. ParamValidators -> DeveloperApi
    9. Hide MetadataUtils/SchemaUtils.
    jkbradley
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #6322 from mengxr/SPARK-7535.0 and squashes the following commits:
      9e9c7da [Xiangrui Meng] move JavaEvaluator to ml.evaluation as well
      e179480 [Xiangrui Meng] move Evaluation to ml.evaluation in PySpark
      08ef61f [Xiangrui Meng] update pipeline APIs
    (cherry picked from commit 8f11c6116bf8c7246682cbb2d6f27bf0f1531c6d)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>

* [DOCS] [MLLIB] Fixing broken link in MLlib Linear Methods documentation. | Mike Dusenberry | 2015-05-21 | 1 file | -2/+1
    Just a small change: fixed a broken link in the MLlib Linear Methods documentation by removing a newline character between the link title and link address.
    Author: Mike Dusenberry <dusenberrymw@gmail.com>
    Closes #6340 from dusenberrymw/Fix_MLlib_Linear_Methods_link and squashes the following commits:
      0a57818 [Mike Dusenberry] Fixing broken link in MLlib Linear Methods documentation.
    (cherry picked from commit e4136ea6c457bc74cee312aa14974498ab4633eb)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>

* [SPARK-7219] [MLLIB] Output feature attributes in HashingTF | Xiangrui Meng | 2015-05-21 | 3 files | -8/+101
    This PR updates `HashingTF` to output ML attributes that tell the number of features in the output column. We need to expand `UnaryTransformer` to support output metadata. A `def outputMetadata: Metadata` is not sufficient because the metadata may also depend on the input data. Though this is not true for `HashingTF`, I think it is reasonable to update `UnaryTransformer` in a separate PR. `checkParams` is added to verify common requirements for params. I will send a separate PR to use it in other test suites.
    jkbradley
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #6308 from mengxr/SPARK-7219 and squashes the following commits:
      9bd2922 [Xiangrui Meng] address comments
      e82a68a [Xiangrui Meng] remove sqlContext from test suite
      995535b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7219
      2194703 [Xiangrui Meng] add test for attributes
      178ae23 [Xiangrui Meng] update HashingTF with tests
      91a6106 [Xiangrui Meng] WIP
    (cherry picked from commit 85b96372cf0fd055f89fc639f45c1f2cb02a378f)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

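A sketch of the transformer whose output column now carries the feature-count attribute; `docs` is a placeholder DataFrame with a string "text" column:

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1 << 18) // this count is what the new ML attributes record
val featurized = hashingTF.transform(tokenizer.transform(docs))
```
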
* [SPARK-7794] [MLLIB] update RegexTokenizer default settings | Xiangrui Meng | 2015-05-21 | 3 files | -46/+44
    The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`.
    jkbradley
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #6330 from mengxr/SPARK-7794 and squashes the following commits:
      5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
    (cherry picked from commit f5db4b416c922db7a8f1b0c098b4f08647106231)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>

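A sketch spelling out the new defaults explicitly (the setters are redundant after this change, shown for clarity); `docs` is a placeholder DataFrame with a string "text" column:

```scala
import org.apache.spark.ml.feature.RegexTokenizer

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setGaps(true)      // the pattern matches gaps between tokens...
  .setPattern("\\s+") // ...i.e. runs of whitespace
val tokens = tokenizer.transform(docs)
```
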
* [SPARK-7776] [STREAMING] Added shutdown hook to StreamingContext | Tathagata Das | 2015-05-21 | 1 file | -1/+17
    A shutdown hook to stop the SparkContext was added recently. This results in ugly errors when a streaming application is terminated by ctrl-C.
    ```
    Exception in thread "Thread-27" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:736)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:735)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
        at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:735)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1468)
        at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
        at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1403)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1642)
        at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:559)
        at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2266)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2236)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2236)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2236)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1764)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2236)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2236)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2236)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2236)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2218)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
    ```
    This is because Spark's shutdown hook stops the context, and the streaming jobs fail in the middle. The correct solution is to stop the streaming context before the Spark context. This PR adds a shutdown hook to do so, with a priority higher than the SparkContext's shutdown hook priority.
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Closes #6307 from tdas/SPARK-7776 and squashes the following commits:
      e3d5475 [Tathagata Das] Added conf to specify graceful shutdown
      4c18652 [Tathagata Das] Added shutdown hook to StreamingContext.
    (cherry picked from commit d68ea24d60ce1aa55b06a8c107f42544d696eb41)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

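The squashed commits mention a conf for graceful shutdown but do not name the key; assuming it is `spark.streaming.stopGracefullyOnShutdown`, a sketch of opting in:

```scala
import org.apache.spark.SparkConf

// Assumed key: with it set, the new hook stops the StreamingContext
// gracefully before the SparkContext's own shutdown hook fires.
val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
```
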
* [SPARK-7783] [SQL] [PySpark] add DataFrame.rollup/cube in Python | Davies Liu | 2015-05-21 | 1 file | -2/+46
    Author: Davies Liu <davies@databricks.com>
    Closes #6311 from davies/rollup and squashes the following commits:
      0261db1 [Davies Liu] use @since
      a51ca6b [Davies Liu] Merge branch 'master' of github.com:apache/spark into rollup
      8ad5af4 [Davies Liu] Update dataframe.py
      ade3841 [Davies Liu] add DataFrame.rollup/cube in Python
    (cherry picked from commit 17791a58159b3e4619d0367f54a4c5332342658b)
    Signed-off-by: Reynold Xin <rxin@databricks.com>

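The Python methods mirror the existing Scala DataFrame API; a Scala sketch of the same semantics, where `sales` is a placeholder DataFrame with "dept", "year", and "revenue" columns:

```scala
import org.apache.spark.sql.functions._

val rolled = sales.rollup("dept", "year").agg(sum("revenue")) // hierarchical subtotals plus grand total
val cubed  = sales.cube("dept", "year").agg(sum("revenue"))   // subtotals for every column combination
```
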
* [SPARK-7737] [SQL] Use leaf dirs having data files to discover partitions. | Yin Huai | 2015-05-22 | 2 files | -6/+11
    https://issues.apache.org/jira/browse/SPARK-7737
    cc liancheng
    Author: Yin Huai <yhuai@databricks.com>
    Closes #6329 from yhuai/spark-7737 and squashes the following commits:
      7e0dfc7 [Yin Huai] Use leaf dirs having data files to discover partitions.
    (cherry picked from commit 347b50106bd1bcd40049f1ca29cefbb0baf53413)
    Signed-off-by: Cheng Lian <lian@databricks.com>

* [BUILD] Always run SQL tests in master build. | Yin Huai | 2015-05-21 | 2 files | -17/+26
    It seems our master build does not run HiveCompatibilitySuite (because _RUN_SQL_TESTS is not set). This PR introduces a property `AMP_JENKINS_PRB` to differentiate a PR build from a regular build. If a build is a regular one, we always set _RUN_SQL_TESTS to true.
    cc JoshRosen nchammas
    Author: Yin Huai <yhuai@databricks.com>
    Closes #5955 from yhuai/runSQLTests and squashes the following commits:
      3d399bc [Yin Huai] Always run SQL tests in master build.
    (cherry picked from commit 147b6be3b6d464dfc14836c08e690ab021a600de)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>

* [SPARK-7800] isDefined should not be marked too early in putNewKey | Liang-Chi Hsieh | 2015-05-21 | 1 file | -1/+0
    JIRA: https://issues.apache.org/jira/browse/SPARK-7800
    `isDefined` is marked as true twice in `Location.putNewKey`. The first one is unnecessary and will cause problems because it happens too early, before some assert checks. E.g., if an attempt with an incorrect `keyLengthBytes` marks `isDefined` as true, the location cannot be used later.
    ping JoshRosen
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Closes #6324 from viirya/dup_isdefined and squashes the following commits:
      cbfe03b [Liang-Chi Hsieh] isDefined should not be marked too early in putNewKey.
    (cherry picked from commit 5a3c04bb92e21bd221a75c4ae13a71f7d4716b44)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>

* [SPARK-7718] [SQL] Speed up partitioning by avoiding closure cleaning | Andrew Or | 2015-05-21 | 4 files | -55/+83
    According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning.
    Author: Andrew Or <andrew@databricks.com>
    Closes #6256 from andrewor14/sql-partition-speed-up and squashes the following commits:
      a82b451 [Andrew Or] Fix style
      10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures
      17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up
      523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too
      f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
    (cherry picked from commit 5287eec5a6948c0c6e0baaebf35f512324c0679a)
    Signed-off-by: Yin Huai <yhuai@databricks.com>

* [SPARK-7711] Add a startTime property to match the corresponding one in Scala | Holden Karau | 2015-05-21 | 2 files | -0/+9
    Author: Holden Karau <holden@pigscanfly.ca>
    Closes #6275 from holdenk/SPARK-771-startTime-is-missing-from-pyspark and squashes the following commits:
      06662dc [Holden Karau] add missing blank line for style checks
      7a87410 [Holden Karau] add back missing newline
      7a7876b [Holden Karau] Add a startTime property to match the corresponding one in the Scala SparkContext
    (cherry picked from commit 6b18cdc1b1284b1d48d637d06a1e64829aeb6202)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>

* [SPARK-7478] [SQL] Added SQLContext.getOrCreate | Tathagata Das | 2015-05-21 | 2 files | -1/+95
    Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like:
    1. In REPL/notebook environments, rerunning the line `val sqlContext = new SQLContext` multiple times created different contexts while overriding the reference to the previous context, leading to issues like registered temp tables going missing.
    2. In Streaming, creating a SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. To get around this problem I had to suggest creating a singleton instance: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
    This can be solved by `SQLContext.getOrCreate`, which gets or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf.
    rxin marmbrus
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Closes #6006 from tdas/SPARK-7478 and squashes the following commits:
      25f4da9 [Tathagata Das] Addressed comments.
      79fe069 [Tathagata Das] Added comments.
      c66ca76 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
      48adb14 [Tathagata Das] Removed HiveContext.getOrCreate
      bf8cf50 [Tathagata Das] Fix more bugs
      dec5594 [Tathagata Das] Fixed bug
      b4e9721 [Tathagata Das] Remove unnecessary import
      4ef513b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
      d3ea8e4 [Tathagata Das] Added HiveContext
      83bc950 [Tathagata Das] Updated tests
      f82ae81 [Tathagata Das] Fixed test
      bc72868 [Tathagata Das] Added SQLContext.getOrCreate
    (cherry picked from commit 3d0cccc85850ca9c79f3e5ff7395bd04d212b063)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

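A minimal sketch of the new API, assuming the 1.4 spark-shell where `sc` is predefined: repeated calls return the same lazily instantiated instance, which is what keeps temp table registrations and checkpoint recovery working.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = SQLContext.getOrCreate(sc)
assert(SQLContext.getOrCreate(sc) eq sqlContext) // the same singleton comes back
```
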
* [SPARK-7763] [SPARK-7616] [SQL] Persists partition columns into metastore | Yin Huai | 2015-05-21 | 12 files | -56/+211
    Author: Yin Huai <yhuai@databricks.com>
    Author: Cheng Lian <lian@databricks.com>
    Closes #6285 from liancheng/spark-7763 and squashes the following commits:
      bb2829d [Yin Huai] Fix hashCode.
      d677f7d [Cheng Lian] Fixes Scala style issue
      44b283f [Cheng Lian] Adds test case for SPARK-7616
      6733276 [Yin Huai] Fix a bug that potentially causes https://issues.apache.org/jira/browse/SPARK-7616.
      6cabf3c [Yin Huai] Update unit test.
      7e02910 [Yin Huai] Use metastore partition columns and do not hijack maybePartitionSpec.
      e9a03ec [Cheng Lian] Persists partition columns into metastore
    (cherry picked from commit 30f3f556f7161a49baf145c0cbba8c088b512a6a)
    Signed-off-by: Yin Huai <yhuai@databricks.com>

* [SPARK-7722] [STREAMING] Added Kinesis to style checker | Tathagata Das | 2015-05-21 | 2 files | -5/+5
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Closes #6325 from tdas/SPARK-7722 and squashes the following commits:
      9ab35b2 [Tathagata Das] Fixed styles in Kinesis
    (cherry picked from commit 311fab6f1b00db1a581d77be5196dd045f93d83d)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

* [SPARK-7498] [MLLIB] add varargs back to setDefault | Xiangrui Meng | 2015-05-21 | 2 files | -4/+4
    We removed `varargs` due to Java compilation issues. That was a false alarm because I didn't run `build/sbt clean`. So this PR reverts the changes.
    jkbradley
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #6320 from mengxr/SPARK-7498 and squashes the following commits:
      74a7259 [Xiangrui Meng] add varargs back to setDefault
    (cherry picked from commit cdc7c055c931c4c931a11b510de473455f3256da)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>

* [SPARK-7585] [ML] [DOC] VectorIndexer user guide section | Joseph K. Bradley | 2015-05-21 | 3 files | -1/+96
    Added a VectorIndexer section to the ML user guide. Also added a javaCategoryMaps() method and a Java unit test for it.
    CC: mengxr
    Author: Joseph K. Bradley <joseph@databricks.com>
    Closes #6255 from jkbradley/vector-indexer-guide and squashes the following commits:
      dbb8c4c [Joseph K. Bradley] simplified VectorIndexerModel.javaCategoryMaps
      f692084 [Joseph K. Bradley] Added VectorIndexer section to ML user guide. Also added javaCategoryMaps() method and Java unit test for it.
    (cherry picked from commit 6d75ed7e5ccf6c58143de4608115f9a2b3ff6cf4)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>

* [SQL] [TEST] udf_java_method failed due to jdk versions | scwf | 2015-05-21 | 3 files | -7/+29
    java.lang.Math.exp(1.0) has different results between JDK versions, so do not use createQueryTest; write a separate test for it.
    ```
    jdk version    result
    1.7.0_11       2.7182818284590455
    1.7.0_05       2.7182818284590455
    1.7.0_71       2.718281828459045
    ```
    Author: scwf <wangfei1@huawei.com>
    Closes #6274 from scwf/java_method and squashes the following commits:
      3dd2516 [scwf] address comments
      5fa1459 [scwf] style
      df46445 [scwf] fix test error
      fcb6d22 [scwf] fix udf_java_method
    (cherry picked from commit f6c486aa4b0d3a50b53c110fd63d226fffeb87f7)
    Signed-off-by: Michael Armbrust <michael@databricks.com>