Commit message (Author, Date; files changed, lines -deleted/+added)
...
* [SPARK-16154][MLLIB] Update spark.ml and spark.mllib package docs (Xiangrui Meng, 2016-06-23; 5 files, -9/+72)

## What changes were proposed in this pull request?
Since we decided to switch spark.mllib package into maintenance mode in 2.0, it would be nice to update the package docs to reflect this change.

## How was this patch tested?
Manually checked generated APIs.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13859 from mengxr/SPARK-16154.
* [SPARK-16138] Try to cancel executor requests only if we have at least 1 (Peter Ableda, 2016-06-23; 1 file, -1/+1)

## What changes were proposed in this pull request?
Add an additional check to the if statement.

## How was this patch tested?
I built and deployed to an internal cluster to observe the behaviour. After the change, the invalid logging is gone:
```
16/06/22 08:46:36 INFO yarn.YarnAllocator: Driver requested a total number of 1 executor(s).
16/06/22 08:46:36 INFO yarn.YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 1 executors.
16/06/22 08:46:36 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
16/06/22 08:47:36 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1.
```

Author: Peter Ableda <abledapeter@gmail.com>

Closes #13850 from peterableda/patch-2.
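A minimal sketch of the kind of guard this change adds; the method and field names below are hypothetical, not Spark's actual `YarnAllocator` code:

```scala
// Hypothetical sketch: only attempt to cancel pending container requests
// when there is at least one outstanding request to cancel.
def syncExecutorRequests(
    targetNumExecutors: Int,
    numPendingAllocate: Int,
    numExecutorsRunning: Int): Unit = {
  val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning
  if (missing > 0) {
    println(s"Requesting $missing new executor container(s).")
  } else if (missing < 0 && numPendingAllocate > 0) {
    // The added "at least 1" check: without it, a cancel is logged even
    // when nothing is pending, producing the spurious messages above.
    val numToCancel = math.min(numPendingAllocate, -missing)
    println(s"Canceling requests for $numToCancel executor container(s).")
  }
}
```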
* [SPARK-15660][CORE] Update RDD `variance/stdev` description and add popVariance/popStdev (Dongjoon Hyun, 2016-06-23; 5 files, -8/+58)

## What changes were proposed in this pull request?
In SPARK-11490, `variance/stdev` were redefined as the **sample** `variance/stdev` instead of the population ones. This PR updates the other old documentation to prevent users from misunderstanding. It updates the following Scala/Java API docs:
- http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.api.java.JavaDoubleRDD
- http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.rdd.DoubleRDDFunctions
- http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.util.StatCounter
- http://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/api/java/JavaDoubleRDD.html
- http://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html
- http://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/util/StatCounter.html

Also, this PR explicitly adds the `popVariance` and `popStdev` functions.

## How was this patch tested?
Pass the updated Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13403 from dongjoon-hyun/SPARK-15660.
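A small usage sketch of the distinction, assuming a running `SparkContext` named `sc`; the semantics stated in the comments follow this commit's own description:

```scala
// Per this commit: variance()/stdev() are the *sample* statistics
// (divide by n - 1), while the newly added popVariance()/popStdev()
// are the *population* statistics (divide by n).
val rdd = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
rdd.variance()     // sample variance
rdd.stdev()        // sample standard deviation
rdd.popVariance()  // population variance, added by this PR
rdd.popStdev()     // population standard deviation, added by this PR
```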
* [SPARK-16162] Remove dead code OrcTableScan. (Brian Cho, 2016-06-22; 1 file, -66/+1)

## What changes were proposed in this pull request?
SPARK-14535 removed all calls to class OrcTableScan. This removes the dead code.

## How was this patch tested?
Existing unit tests.

Author: Brian Cho <bcho@fb.com>

Closes #13869 from dafrista/clean-up-orctablescan.
* [SQL][MINOR] Fix minor formatting issues in SHOW CREATE TABLE output (Cheng Lian, 2016-06-22; 1 file, -2/+2)

## What changes were proposed in this pull request?
This PR fixes two minor formatting issues appearing in `SHOW CREATE TABLE` output.

Before:
```
CREATE EXTERNAL TABLE ...
...
WITH SERDEPROPERTIES ('serialization.format' = '1'
)
...
TBLPROPERTIES ('avro.schema.url' = '/tmp/avro/test.avsc',
  'transient_lastDdlTime' = '1466638180')
```

After:
```
CREATE EXTERNAL TABLE ...
...
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
...
TBLPROPERTIES (
  'avro.schema.url' = '/tmp/avro/test.avsc',
  'transient_lastDdlTime' = '1466638180'
)
```

## How was this patch tested?
Manually tested.

Author: Cheng Lian <lian@databricks.com>

Closes #13864 from liancheng/show-create-table-format-fix.
* [SPARK-15230][SQL] distinct() does not handle column name with dot properly (bomeng, 2016-06-23; 2 files, -1/+12)

## What changes were proposed in this pull request?
When a table is created with a column name containing a dot, distinct() will fail to run. For example:
```scala
val rowRDD = sparkContext.parallelize(Seq(Row(1), Row(1), Row(2)))
val schema = StructType(Array(StructField("column.with.dot", IntegerType, nullable = false)))
val df = spark.createDataFrame(rowRDD, schema)
```
Running the following poses no problem:
```scala
df.select(new Column("`column.with.dot`"))
```
But running the query with an additional distinct() causes an exception:
```scala
df.select(new Column("`column.with.dot`")).distinct()
```
The issue is that distinct() will try to resolve the column name, but the column name in the schema does not have backticks around it. So the solution is to add the backticks before passing the column name to resolve().

## How was this patch tested?
Added a new test case.

Author: bomeng <bmeng@us.ibm.com>

Closes #13140 from bomeng/SPARK-15230.
* [SPARK-16159][SQL] Move RDD creation logic from FileSourceStrategy.apply (Reynold Xin, 2016-06-22; 2 files, -112/+154)

## What changes were proposed in this pull request?
We embed partitioning logic in FileSourceStrategy.apply, making the function very long. This is a small refactoring to move it into its own functions. Eventually we would be able to move the partitioning functions into a physical operator, rather than doing it in physical planning.

## How was this patch tested?
This is a simple code move.

Author: Reynold Xin <rxin@databricks.com>

Closes #13862 from rxin/SPARK-16159.
* [SPARK-16024][SQL][TEST] Verify Column Comment for Data Source Tables (gatorsmile, 2016-06-23; 3 files, -3/+34)

#### What changes were proposed in this pull request?
This PR is to improve test coverage. It verifies whether the `Comment` of a `Column` can be appropriately handled. The test cases verify the related parts in the parser, both the SQL and DataFrameWriter interfaces, and both the Hive Metastore catalog and the in-memory catalog.

#### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13764 from gatorsmile/dataSourceComment.
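For orientation, a hypothetical example of the behavior under test (assuming a `SparkSession` named `spark`; the table name is illustrative):

```scala
// A column comment on a data source table, specified in SQL; the tests
// verify it survives the parser and both catalog implementations.
spark.sql("CREATE TABLE t (id INT COMMENT 'primary identifier') USING parquet")
spark.sql("DESCRIBE t").show()  // the comment column should read 'primary identifier'
```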
* [SPARK-15956][SQL] When unwrapping ORC avoid pattern matching at runtime (Brian Cho, 2016-06-22; 5 files, -150/+314)

## What changes were proposed in this pull request?
Extend the returning of unwrapper functions from primitive types to all types. This PR is based on https://github.com/apache/spark/pull/13676; it only fixes a bug with the Scala 2.10 compilation. All credit should go to dafrista.

## How was this patch tested?
The patch should pass all unit tests. Reading ORC files with non-primitive types with this change reduced the read time by ~15%.

Author: Brian Cho <bcho@fb.com>
Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13854 from hvanhovell/SPARK-15956-scala210.
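A hedged sketch of the general technique, with illustrative types only (not Spark's actual ORC internals): resolve the per-type conversion once, up front, so the per-value hot loop calls a precomputed function instead of pattern matching on every row.

```scala
// Illustrative sketch: build the unwrapper once per column type.
sealed trait ColType
case object IntCol extends ColType
case object StringCol extends ColType

def unwrapperFor(ct: ColType): Any => Any = ct match {
  case IntCol    => v => v.asInstanceOf[java.lang.Integer].intValue()
  case StringCol => v => v.toString
}

// Hot loop: no per-value pattern matching.
val unwrap = unwrapperFor(IntCol)
val values: Seq[Any] = Seq(1, 2, 3)
val unwrapped = values.map(unwrap)
```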
* [SPARK-16131] initialize internal logger lazily in Scala preferred way (Prajwal Tuladhar, 2016-06-22; 2 files, -12/+4)

## What changes were proposed in this pull request?
Initialize the logger instance lazily, in the Scala-preferred way.

## How was this patch tested?
By running `./build/mvn clean test` locally.

Author: Prajwal Tuladhar <praj@infynyxx.com>

Closes #13842 from infynyxx/spark_internal_logger.
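A minimal sketch of the idiom, assuming an slf4j logger; this mirrors the pattern rather than reproducing Spark's `Logging` trait:

```scala
import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  // A lazy val is initialized on first access, with thread-safe
  // initialization built into the language; @transient keeps the
  // non-serializable logger out of serialized closures.
  @transient protected lazy val log: Logger =
    LoggerFactory.getLogger(this.getClass.getName)
}
```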
* [SPARK-16155][DOC] remove package grouping in Java docs (Xiangrui Meng, 2016-06-22; 1 file, -20/+0)

## What changes were proposed in this pull request?
In 1.4 and earlier releases, we had package grouping in the generated Java API docs; see http://spark.apache.org/docs/1.4.0/api/java/index.html. However, this disappeared in 1.5.0: http://spark.apache.org/docs/1.5.0/api/java/index.html. Rather than fixing it, I'd suggest removing grouping, because it might take some time to fix and it is a manual process to update the grouping in `SparkBuild.scala`. I didn't find anyone complaining about missing groups since 1.5.0 on Google.

Manually checked the generated Java API docs and confirmed that they are the same as in master.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13856 from mengxr/SPARK-16155.
* [SPARK-16153][MLLIB] switch to multi-line doc to avoid a genjavadoc bug (Xiangrui Meng, 2016-06-22; 1 file, -1/+3)

## What changes were proposed in this pull request?
We recently deprecated setLabelCol in ChiSqSelectorModel (#13823):
~~~scala
/** @group setParam */
@Since("1.6.0")
@deprecated("labelCol is not used by ChiSqSelectorModel.", "2.0.0")
def setLabelCol(value: String): this.type = set(labelCol, value)
~~~
This unfortunately hit a genjavadoc bug and broke doc generation. This is the generated Java code:
~~~java
/** group setParam */
public org.apache.spark.ml.feature.ChiSqSelectorModel setOutputCol (java.lang.String value) { throw new RuntimeException(); }
 * *
 * deprecated labelCol is not used by ChiSqSelectorModel. Since 2.0.0.
 */
public org.apache.spark.ml.feature.ChiSqSelectorModel setLabelCol (java.lang.String value) { throw new RuntimeException(); }
~~~
Switching to multiline is a workaround.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13855 from mengxr/SPARK-16153.
* [SPARK-16078][SQL] from_utc_timestamp/to_utc_timestamp should not depend on local timezone (Davies Liu, 2016-06-22; 3 files, -36/+73)

## What changes were proposed in this pull request?
Currently, we use the local timezone to parse or format a timestamp (TimestampType), then use a Long as the microseconds since the epoch in UTC. In from_utc_timestamp() and to_utc_timestamp(), we did not consider the local timezone, so they could return different results under different local timezones.

This PR does the conversion based on human time (in the local timezone), so it should return the same result in whatever timezone. But because the mapping from absolute timestamp to human time is not exactly one-to-one, it will still return a wrong result in some timezones (also at the beginning or end of DST). This PR is a best-effort fix. In the long term, we should make TimestampType timezone-aware to fix this completely.

## How was this patch tested?
Tested these functions in all timezones.

Author: Davies Liu <davies@databricks.com>

Closes #13784 from davies/convert_tz.
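For reference, a small usage sketch of the two functions (real Spark SQL functions; the values are illustrative, assuming a `SparkSession` named `spark`):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{col, from_utc_timestamp, to_utc_timestamp}

val df = Seq("2016-06-22 10:00:00").toDF("ts")
// Interprets ts as UTC and renders it in PST (UTC-7 during DST):
df.select(from_utc_timestamp(col("ts"), "PST")).show()  // 2016-06-22 03:00:00
// Interprets ts as PST and renders it in UTC:
df.select(to_utc_timestamp(col("ts"), "PST")).show()    // 2016-06-22 17:00:00
```

After this fix, both results should be independent of the JVM's local timezone.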
* [SPARK-15672][R][DOC] R programming guide update (Kai Jiang, 2016-06-22; 2 files, -1/+78)

## What changes were proposed in this pull request?
Guide for:
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions

## How was this patch tested?
Built locally.
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">

Author: Kai Jiang <jiangkai@gmail.com>

Closes #13660 from vectorijk/spark-15672-R-guide-update.
* [SPARK-16003] SerializationDebugger runs into infinite loop (Eric Liang, 2016-06-22; 2 files, -6/+16)

## What changes were proposed in this pull request?
This fixes SerializationDebugger to not recurse forever when `writeReplace` returns an object of the same class, which is the case for at least the `SQLMetrics` class. See also the OpenJDK unit tests on the behavior of recursive `writeReplace()`: https://github.com/openjdk-mirror/jdk7u-jdk/blob/f4d80957e89a19a29bb9f9807d2a28351ed7f7df/test/java/io/Serializable/nestedReplace/NestedReplace.java

cc davies cloud-fan

## How was this patch tested?
Unit tests for SerializationDebugger.

Author: Eric Liang <ekl@databricks.com>

Closes #13814 from ericl/spark-16003.
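A minimal sketch of the class shape that triggers the recursion (a hypothetical class, mirroring the linked NestedReplace test, not Spark's `SQLMetrics`):

```scala
// A serializable class whose writeReplace returns another instance of the
// same class. Java serialization stops after one replacement, but a naive
// debugger that follows writeReplace unconditionally recurses forever.
class Nested(val depth: Int) extends Serializable {
  private def writeReplace(): Object = new Nested(depth + 1)
}
```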
* [SPARK-15956][SQL] Revert "[] When unwrapping ORC avoid pattern matching… (Herman van Hovell, 2016-06-22; 5 files, -314/+150)

This reverts commit 0a9c02759515c41de37db6381750bc3a316c860c. It breaks the 2.10 build; I'll fix this in a different PR.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13853 from hvanhovell/SPARK-15956-revert.
* [SPARK-16120][STREAMING] getCurrentLogFiles in ReceiverSuite WAL generating and cleaning case uses external variable instead of the passed parameter (Ahmed Mahran, 2016-06-22; 1 file, -1/+1)

## What changes were proposed in this pull request?
In `ReceiverSuite.scala`, in the test case "write ahead log - generating and cleaning", the inner method `getCurrentLogFiles` uses the external variable `logDirectory1` instead of the passed parameter `logDirectory`. This PR fixes this by using the passed method argument instead of the variable from the outer scope.

## How was this patch tested?
The unit test was re-run and the output logs were checked for the correct paths used.

tdas

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #13825 from ahmed-mahran/b-receiver-suite-wal-gen-cln.
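The bug class in miniature, as a hedged sketch with a hypothetical helper (not the actual suite code):

```scala
def listFiles(dir: String): Seq[String] =
  Option(new java.io.File(dir).list()).toSeq.flatten.map(dir + "/" + _)

val logDirectory1 = "/tmp/wal-1"

// The inner helper must use its own parameter, not the similarly named
// variable captured from the enclosing scope; otherwise it silently
// checks the wrong directory no matter what argument is passed.
def getCurrentLogFiles(logDirectory: String): Seq[String] = {
  listFiles(logDirectory)  // the bug was effectively: listFiles(logDirectory1)
}
```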
* [SPARK-15956][SQL] When unwrapping ORC avoid pattern matching at runtime (Brian Cho, 2016-06-22; 5 files, -150/+314)

## What changes were proposed in this pull request?
Extend the returning of unwrapper functions from primitive types to all types.

## How was this patch tested?
The patch should pass all unit tests. Reading ORC files with non-primitive types with this change reduced the read time by ~15%.

===

The GitHub diff is very noisy. Attaching the screenshots below for improved readability:
![screen shot 2016-06-14 at 5 33 16 pm](https://cloud.githubusercontent.com/assets/1514239/16064580/4d6f7a98-3257-11e6-9172-65e4baff948b.png)
![screen shot 2016-06-14 at 5 33 28 pm](https://cloud.githubusercontent.com/assets/1514239/16064587/5ae6c244-3257-11e6-8460-69eee70de219.png)

Author: Brian Cho <bcho@fb.com>

Closes #13676 from dafrista/improve-orc-master.
* [MINOR][MLLIB] DefaultParamsReadable/Writable should be DeveloperApi (Xiangrui Meng, 2016-06-22; 1 file, -8/+5)

## What changes were proposed in this pull request?
`DefaultParamsReadable/Writable` are not user-facing; only developers who implement `Transformer/Estimator` would use them. So this PR changes the annotation to `DeveloperApi`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13828 from mengxr/default-readable-should-be-developer-api.
* [SPARK-16127][ML][PYSPARK] Audit @Since annotations related to ml.linalg (Nick Pentreath, 2016-06-22; 14 files, -37/+41)

[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals were not updated accordingly to be `2.0.0`. This PR updates them.

## How was this patch tested?
Existing unit tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.
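A hypothetical illustration of the convention being audited (the class and val are invented for the example; `Since` and `ml.linalg` are real):

```scala
import org.apache.spark.annotation.Since
import org.apache.spark.ml.linalg.{Vector, Vectors}

// A public val whose type moved to the new ml.linalg package in 2.0
// should carry a 2.0.0 Since annotation, not the version that carried
// the old mllib.linalg type.
class ExampleModel {
  @Since("2.0.0")
  val coefficients: Vector = Vectors.dense(0.5, -0.5)
}
```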
* [SPARK-16107][R] group glm methods in documentation (Junyang Qian, 2016-06-22; 1 file, -44/+36)

## What changes were proposed in this pull request?
This groups the GLM methods (spark.glm, summary, print, predict and write.ml) in the documentation. The example code was updated.

## How was this patch tested?
N/A

![screen shot 2016-06-21 at 2 31 37 pm](https://cloud.githubusercontent.com/assets/15318264/16247077/f6eafc04-37bc-11e6-89a8-7898ff3e4078.png)
![screen shot 2016-06-21 at 2 31 45 pm](https://cloud.githubusercontent.com/assets/15318264/16247078/f6eb1c16-37bc-11e6-940a-2b595b10617c.png)

Author: Junyang Qian <junyangq@databricks.com>
Author: Junyang Qian <junyangq@Junyangs-MacBook-Pro.local>

Closes #13820 from junyangq/SPARK-16107.
* [SPARK-15783][CORE] Fix Flakiness in BlacklistIntegrationSuite (Imran Rashid, 2016-06-22; 4 files, -26/+76)

## What changes were proposed in this pull request?
Several changes here -- the first two were causing failures in BlacklistIntegrationSuite:

1. The testing framework didn't include the reviveOffers thread, so the test which involved delay scheduling might never submit offers late enough for the delay scheduling to kick in. So added in the periodic revive offers, just like the real scheduler.
2. `assertEmptyDataStructures` would occasionally fail, because it appeared there was still an active job. This is because in DAGScheduler, the jobWaiter is notified of the job completion before the data structures are cleaned up. Most of the time the test code that is waiting on the jobWaiter won't become active until after the data structures are cleared, but occasionally the race goes the other way, and the assertions fail.
3. `DAGSchedulerSuite` was not stopping all the inner parts it was setting up, so each test was leaking a number of threads. So we stop those parts too.
4. Turns out that `assertMapOutputAvailable` is not terribly useful in this framework -- most of the places I was trying to use it suffer from some race.
5. When there is an exception in the backend, try to improve the error msg a little bit. Before, the exception was printed to the console, but the test would fail with a timeout and the logs wouldn't show anything.

## How was this patch tested?
I ran all the tests in `BlacklistIntegrationSuite` 5k times and everything in `DAGSchedulerSuite` 1k times on my laptop. Also I ran a full Jenkins build with `BlacklistIntegrationSuite` 500 times and `DAGSchedulerSuite` 50 times, see https://github.com/apache/spark/pull/13548. (I tried more times but Jenkins timed out.)

To check for more leaked threads, I added some code to dump the list of all threads at the end of each test in DAGSchedulerSuite, which is how I discovered the mapOutputTracker and eventLoop were leaking threads. (I removed that code from the final PR; it was just part of the testing.) And I'll run Jenkins on this a couple of times to do one more check.

Author: Imran Rashid <irashid@cloudera.com>

Closes #13565 from squito/blacklist_extra_tests.
* [SPARK-16097][SQL] Encoders.tuple should handle null object correctly (Wenchen Fan, 2016-06-22; 2 files, -13/+42)

## What changes were proposed in this pull request?
Although the top-level input object cannot be null, when we use `Encoders.tuple` to combine 2 encoders, their input objects are not top-level anymore and can be null. We should handle this case.

## How was this patch tested?
New test in DatasetSuite.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13807 from cloud-fan/bug.
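A hedged sketch of the situation, assuming a `SparkSession` named `spark`: the tuple itself is never null, but one of its components can be.

```scala
import org.apache.spark.sql.Encoders

// Combine two encoders; the String component may be null at runtime.
val enc = Encoders.tuple(Encoders.scalaInt, Encoders.STRING)
val ds = spark.createDataset(Seq((1, "a"), (2, null: String)))(enc)
ds.collect()  // must round-trip (2, null) without a NullPointerException
```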
* [SPARK-16121] ListingFileCatalog does not list in parallel anymore (Yin Huai, 2016-06-22; 3 files, -9/+101)

## What changes were proposed in this pull request?
It seems the fix for SPARK-14959 breaks parallel partitioning discovery. This PR fixes the problem.

## How was this patch tested?
Tested manually. (This PR also adds a proper test for SPARK-14959.)

Author: Yin Huai <yhuai@databricks.com>

Closes #13830 from yhuai/SPARK-16121.
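An illustrative sketch only of the parallel-listing idea (using local `java.io` listing for self-containment, not Spark's Hadoop-based `ListingFileCatalog`): distribute the expensive directory listing as a Spark job instead of looping over paths on the driver.

```scala
// The paths are illustrative; assuming a SparkSession named `spark`.
val paths = Seq("/data/table/part=1", "/data/table/part=2")
val leafFiles = spark.sparkContext
  .parallelize(paths, paths.length)
  .flatMap { dir =>
    // List each directory on an executor rather than serially on the driver.
    Option(new java.io.File(dir).listFiles()).toSeq.flatten.map(_.getPath)
  }
  .collect()
```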
* [SPARK-15162][SPARK-15164][PYSPARK][DOCS][ML] update some pydocs (Holden Karau, 2016-06-22; 3 files, -5/+39)

## What changes were proposed in this pull request?
Mark ml.classification algorithms as experimental to match the Scala algorithms, update the PyDoc for thresholds on `LogisticRegression` to have the same level of info as Scala, and enable MathJax for PyDoc.

## How was this patch tested?
Built docs locally & ran PySpark SQL tests.

Author: Holden Karau <holden@us.ibm.com>

Closes #12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.
* [SPARK-15644][MLLIB][SQL] Replace SQLContext with SparkSession in MLlib (gatorsmile, 2016-06-21; 31 files, -81/+100)

#### What changes were proposed in this pull request?
This PR is to use the latest `SparkSession` to replace the existing `SQLContext` in `MLlib`. `SQLContext` is removed from `MLlib`. Also fixes a test case issue in `BroadcastJoinSuite`. BTW, `SQLContext` is not being used in the `MLlib` test suites.

#### How was this patch tested?
Existing test cases.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #13380 from gatorsmile/sqlContextML.
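For orientation, the public-API shape of the migration (a generic sketch; the internal MLlib call sites differ):

```scala
import org.apache.spark.sql.SparkSession

// The 2.0 entry point, replacing SQLContext-based construction:
val spark = SparkSession.builder()
  .appName("mllib-example")
  .getOrCreate()

// A SQLContext remains reachable for legacy code paths:
val sqlContext = spark.sqlContext
```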
* [SPARK-16104] [SQL] Do not create CSV writer object for every flush when writing (hyukjinkwon, 2016-06-21; 2 files, -11/+10)

## What changes were proposed in this pull request?
This PR lets the `CsvWriter` object be reused rather than created each time. This approach was taken from the JSON data source. The original `CsvWriter` was created for each row, which was improved in https://github.com/apache/spark/pull/13229. However, a `CsvWriter` object is still created for each `flush()` in `LineCsvWriter`, and it does not have to close and re-create the object for every flush. This follows the original logic as-is, but `CsvWriter` is reused by resetting the `CharArrayWriter`.

## How was this patch tested?
Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13809 from HyukjinKwon/write-perf.
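An illustrative sketch of the reuse pattern (not Spark's `LineCsvWriter`): keep one writer over a resettable buffer instead of re-creating the writer on every flush.

```scala
import java.io.CharArrayWriter

val buffer = new CharArrayWriter()

def writeRow(fields: Seq[String]): Unit =
  buffer.write(fields.mkString(",") + "\n")

def flush(): String = {
  val out = buffer.toString
  buffer.reset()  // reuse the same buffer; no new writer object per flush
  out
}

writeRow(Seq("a", "1"))
writeRow(Seq("b", "2"))
println(flush())
```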
* [MINOR][MLLIB] deprecate setLabelCol in ChiSqSelectorModel (Xiangrui Meng, 2016-06-21; 1 file, -0/+1)

## What changes were proposed in this pull request?
Deprecate `labelCol`, which is not used by ChiSqSelectorModel.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13823 from mengxr/deprecate-setLabelCol-in-ChiSqSelectorModel.
* [SQL][DOC] SQL programming guide add deprecated methods in 2.0.0 (Felix Cheung, 2016-06-22; 1 file, -1/+5)

## What changes were proposed in this pull request?
Doc changes.

## How was this patch tested?
Manual.

liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13827 from felixcheung/sqldocdeprecate.
* [SPARK-16118][MLLIB] add getDropLast to OneHotEncoder (Xiangrui Meng, 2016-06-21; 2 files, -1/+7)

## What changes were proposed in this pull request?
We forgot the getter of `dropLast` in `OneHotEncoder`.

## How was this patch tested?
Unit test.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13821 from mengxr/SPARK-16118.
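A small usage example with the new getter (real `OneHotEncoder` API; the column names are illustrative):

```scala
import org.apache.spark.ml.feature.OneHotEncoder

// dropLast controls whether the last category is dropped to avoid a
// linearly dependent output column; this PR adds the matching getter.
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(false)
println(encoder.getDropLast)  // false
```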
* [SPARK-16117][MLLIB] hide LibSVMFileFormat and move its doc to LibSVMDataSource (Xiangrui Meng, 2016-06-21; 2 files, -38/+59)

## What changes were proposed in this pull request?
LibSVMFileFormat implements the data source for the LIBSVM format. However, users do not really need to call its APIs to use it, so we should hide it in the public API docs. The main issue is that we still need to put the documentation and example code somewhere. The proposal is to have a dummy class to hold the documentation, as a workaround to https://issues.scala-lang.org/browse/SI-8124.

## How was this patch tested?
Manually checked the generated API doc and tested loading LIBSVM data.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13819 from mengxr/SPARK-16117.
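How users actually reach this code path, going through the data source API rather than `LibSVMFileFormat` itself (the path is illustrative; assuming a `SparkSession` named `spark`):

```scala
val df = spark.read
  .format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")
df.printSchema()  // label: double, features: vector
```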
* [SPARK-16096][SPARKR] add union and deprecate unionAll (Felix Cheung, 2016-06-21; 5 files, -13/+47)

## What changes were proposed in this pull request?
Add union and deprecate unionAll; separate the roxygen2 doc for rbind (since their usage and parameter lists are quite different).

`explode` is also deprecated - but it seems its replacement is a combination of calls; not sure if we should deprecate it in SparkR yet.

## How was this patch tested?
Unit tests, manual checks for R doc.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13805 from felixcheung/runion.
* [MINOR][MLLIB] move setCheckpointInterval to non-expert setters (Xiangrui Meng, 2016-06-21; 1 file, -1/+1)

## What changes were proposed in this pull request?
The `checkpointInterval` is a non-expert param. This PR moves its setter to the non-expert group.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13813 from mengxr/checkpoint-non-expert.
* [SPARK-16002][SQL] Sleep when no new data arrives to avoid 100% CPU usage (Shixiong Zhu, 2016-06-21; 7 files, -7/+42)

## What changes were proposed in this pull request?
Add a configuration to allow people to set a minimum polling delay when no new data arrives (default is 10ms). This PR also cleans up some INFO logs.

## How was this patch tested?
Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13718 from zsxwing/SPARK-16002.
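A hedged usage sketch; the exact configuration key below is an assumption, since the commit text only states that a minimum polling delay with a 10ms default was added:

```scala
// Assumed key name, for illustration only -- verify against SQLConf
// before relying on it.
spark.conf.set("spark.sql.streaming.pollingDelay", "10ms")
```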
* [SPARK-16037][SQL] Follow-up: add DataFrameWriter.insertInto() test cases for by-position resolution (Cheng Lian, 2016-06-21; 1 file, -0/+48)

## What changes were proposed in this pull request?
This PR migrates some test cases introduced in #12313 as a follow-up of #13754 and #13766. These test cases cover `DataFrameWriter.insertInto()`, while the former two only cover SQL `INSERT` statements. Note that the `testPartitionedTable` utility method tests both Hive SerDe tables and data source tables.

## How was this patch tested?
N/A

Author: Cheng Lian <lian@databricks.com>

Closes #13810 from liancheng/spark-16037-follow-up-tests.
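The behavior under test, in miniature (the table name is illustrative and must already exist with a compatible schema; assuming a `SparkSession` named `spark`):

```scala
import spark.implicits._

// insertInto resolves columns by position, not by name.
val df = Seq((1, "a"), (2, "b")).toDF("x", "y")
df.write.insertInto("target_table")  // x -> 1st column, y -> 2nd column
```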
* [SPARK-15741][PYSPARK][ML] Pyspark cleanup of set default seed to None (Bryan Cutler, 2016-06-21; 4 files, -7/+7)

## What changes were proposed in this pull request?
Several places set the seed Param default value to None, which translates to a zero value on the Scala side. This is unnecessary because a default fixed value already exists, and if a test depends on a zero-valued seed then it should explicitly set it to zero instead of relying on this translation. These cases can be safely removed except for the ALS doctest, which has been changed to set the seed value to zero.

## How was this patch tested?
Ran PySpark tests locally.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13672 from BryanCutler/pyspark-cleanup-setDefault-seed-SPARK-15741.
* [SPARK-16109][SPARKR][DOC] R more doc fixes (Felix Cheung, 2016-06-21; 5 files, -23/+40)

## What changes were proposed in this pull request?
Found these issues while reviewing for SPARK-16090.

## How was this patch tested?
roxygen2 doc gen, checked output html.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13803 from felixcheung/rdocrd.
* [SPARK-16086] [SQL] [PYSPARK] create Row without any fields (Davies Liu, 2016-06-21; 2 files, -6/+12)

## What changes were proposed in this pull request?
This PR allows us to create a Row without any fields.

## How was this patch tested?
Added a test for the empty row and a udf without arguments.

Author: Davies Liu <davies@databricks.com>

Closes #13812 from davies/no_argus.
* [SPARK-16080][YARN] Set correct link name for conf archive in executors. (Marcelo Vanzin, 2016-06-21; 2 files, -4/+18)

This makes sure the files are in the executor's classpath, as they're expected to be. Also update the unit test to make sure the files are there as expected.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13792 from vanzin/SPARK-16080.
* [SPARK-13792][SQL] Addendum: Fix Python API (Reynold Xin, 2016-06-21; 1 file, -21/+33)

## What changes were proposed in this pull request?
This is a follow-up to https://github.com/apache/spark/pull/13795 to properly set CSV options in the Python API. As part of this, I also make the Python option setting for both CSV and JSON more robust against positional errors.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #13800 from rxin/SPARK-13792-2.
* [SPARK-15177][.1][R] make SparkR model params and default values consistent with MLlib (Xiangrui Meng, 2016-06-21; 4 files, -46/+44)

## What changes were proposed in this pull request?
This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.

Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0

## How was this patch tested?
Existing unit tests.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13801 from mengxr/SPARK-15177.1.
* [SPARK-16084][SQL] Minor comments update for "DESCRIBE" table (bomeng, 2016-06-21; 1 file, -3/+3)

## What changes were proposed in this pull request?
1. FORMATTED is actually supported, but partition is not supported;
2. Remove the parentheses, as they are not necessary, just like anywhere else.

## How was this patch tested?
Minor issue; I do not think it needs a test case.

Author: bomeng <bmeng@us.ibm.com>

Closes #13791 from bomeng/SPARK-16084.
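For reference, the documented forms in question (the table name is illustrative; assuming a `SparkSession` named `spark`):

```scala
spark.sql("DESCRIBE my_table").show()
spark.sql("DESCRIBE FORMATTED my_table").show()  // FORMATTED is supported
```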
* [SPARK-16045][ML][DOC] Spark 2.0 ML.feature: doc update for stopwords and binarizer (Yuhao Yang, 2016-06-21; 1 file, -6/+10)

## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16045

2.0 audit: update the documentation for StopWordsRemover and Binarizer.

## How was this patch tested?
Manual review of the doc.

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #13375 from hhbyyh/stopdoc.
* [SPARK-10258][DOC][ML] Add @Since annotations to ml.feature (Nick Pentreath, 2016-06-21; 28 files, -68/+362)

This PR adds missing `Since` annotations to the `ml.feature` package. Closes #8505.

## How was this patch tested?
Existing tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13641 from MLnick/add-since-annotations.
* Revert "[SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)"Xiangrui Meng2016-06-212-8/+6
| | | | This reverts commit a46553cbacf0e4012df89fe55385dec5beaa680a.
* [SPARK-15319][SPARKR][DOCS] Fix SparkR doc layout for corr and other DataFrame stats functions (Felix Cheung, 2016-06-21; 2 files, -23/+17)

## What changes were proposed in this pull request?
Doc only changes. Please see screenshots.

Before: http://spark.apache.org/docs/latest/api/R/statfunctions.html
![image](https://cloud.githubusercontent.com/assets/8969467/15264110/cd458826-1924-11e6-85bd-8ee2e2e1a85f.png)

After:
![image](https://cloud.githubusercontent.com/assets/8969467/16218452/b9e89f08-3732-11e6-969d-a3a1796e7ad0.png)
(Please ignore the style differences - this is due to not having the css in my local copy.)

This is still a bit weird. As discussed in SPARK-15237, I think the better approach is to separate out the DataFrame stats functions instead of putting everything on one page. At least now it is clearer which description is on which function.

## How was this patch tested?
Build doc.

Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13109 from felixcheung/rstatdoc.
* [SPARKR][DOCS] R code doc cleanup (Felix Cheung, 2016-06-20; 8 files, -84/+70)

## What changes were proposed in this pull request?
I ran a full pass from A to Z and fixed the obvious duplications, improper grouping etc. There are still more doc issues to be cleaned up.

## How was this patch tested?
Manual tests.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13798 from felixcheung/rdocseealso.
* [SPARK-15894][SQL][DOC] Update docs for controlling #partitions (Takeshi YAMAMURO, 2016-06-21; 1 file, -0/+17)

## What changes were proposed in this pull request?
Update the docs for two parameters, `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`, in Other Configuration Options.

## How was this patch tested?
N/A

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13797 from maropu/SPARK-15894-2.
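For reference, how these two options are set; the values shown are the defaults the 2.0 docs list for them (assuming a `SparkSession` named `spark`):

```scala
// maxPartitionBytes caps how many bytes go into a single read partition;
// openCostInBytes is the estimated cost of opening a file, used when
// packing small files together. Defaults: 128 MB and 4 MB.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)
```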
* [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R (Felix Cheung, 2016-06-21; 2 files, -19/+17)

## What changes were proposed in this pull request?
Update the doc as per the discussion in PR #13592.

## How was this patch tested?
Manual.

shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13799 from felixcheung/rsqlprogrammingguide.
* [SPARK-16025][CORE] Document OFF_HEAP storage level in 2.0 (Eric Liang, 2016-06-20; 1 file, -0/+5)

This has changed from 1.6: it now stores memory off-heap using Spark's off-heap support instead of in Tachyon.

Author: Eric Liang <ekl@databricks.com>

Closes #13744 from ericl/spark-16025.
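A usage sketch of the documented storage level (real 2.0 API; Spark's off-heap memory must be enabled for it to take effect):

```scala
import org.apache.spark.storage.StorageLevel

// OFF_HEAP now uses Spark's own off-heap memory, which requires:
//   spark.memory.offHeap.enabled=true
//   spark.memory.offHeap.size=<bytes>
val rdd = spark.sparkContext.parallelize(1 to 1000)
rdd.persist(StorageLevel.OFF_HEAP)
rdd.count()
```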