spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-17095] [Documentation] [Latex and Scala doc do not play nicely]	Jagadeesan	2016-08-23	4	-13/+24
\| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? In Latex, it is common to find "}}}" when closing several expressions at once. [SPARK-16822](https://issues.apache.org/jira/browse/SPARK-16822) added Mathjax to render Latex equations in scaladoc. However, when scala doc sees "}}}" or "{{{" it treats it as a special character for code block. This results in some very strange output. Author: Jagadeesan <as2@us.ibm.com> Closes #14688 from jagadeesanas2/SPARK-17095.
*	[SPARK-17199] Use CatalystConf.resolver for case-sensitivity comparison	Jacek Laskowski	2016-08-23	4	-27/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Use `CatalystConf.resolver` consistently for case-sensitivity comparison (removed dups). ## How was this patch tested? Local build. Waiting for Jenkins to ensure clean build and test. Author: Jacek Laskowski <jacek@japila.pl> Closes #14771 from jaceklaskowski/17199-catalystconf-resolver.
*	[SPARK-17188][SQL] Moves class QuantileSummaries to project catalyst for ↵	Sean Zhong	2016-08-23	3	-251/+267
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	implementing percentile_approx ## What changes were proposed in this pull request? This is a sub-task of [SPARK-16283](https://issues.apache.org/jira/browse/SPARK-16283) (Implement percentile_approx SQL function), which moves class QuantileSummaries to project catalyst so that it can be reused when implementing aggregation function `percentile_approx`. ## How was this patch tested? This PR only does class relocation, class implementation is not changed. Author: Sean Zhong <seanzhong@databricks.com> Closes #14754 from clockfly/move_QuantileSummaries_to_catalyst.
*	[SPARKR][MINOR] Update R DESCRIPTION file	Felix Cheung	2016-08-22	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Update DESCRIPTION ## How was this patch tested? Run install and CRAN tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14764 from felixcheung/rpackagedescription.
*	[SPARK-17182][SQL] Mark Collect as non-deterministic	Cheng Lian	2016-08-23	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR marks the abstract class `Collect` as non-deterministic since the results of `CollectList` and `CollectSet` depend on the actual order of input rows. ## How was this patch tested? Existing test cases should be enough. Author: Cheng Lian <lian@databricks.com> Closes #14749 from liancheng/spark-17182-non-deterministic-collect.
*	[SPARK-16577][SPARKR] Add CRAN documentation checks to run-tests.sh	Shivaram Venkataraman	2016-08-22	2	-6/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? This change adds CRAN documentation checks to be run as a part of `R/run-tests.sh` . As this script is also used by Jenkins this means that we will get documentation checks on every PR going forward. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #14759 from shivaram/sparkr-cran-jenkins.
*	[SPARK-17090][FOLLOW-UP][ML] Add expert param support to SharedParamsCodeGen	hqzizania	2016-08-22	2	-6/+12
\| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add expert param support to SharedParamsCodeGen where aggregationDepth a expert param is added. Author: hqzizania <hqzizania@gmail.com> Closes #14738 from hqzizania/SPARK-17090-minor.
*	[SPARK-17144][SQL] Removal of useless CreateHiveTableAsSelectLogicalPlan	gatorsmile	2016-08-23	1	-18/+1
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `CreateHiveTableAsSelectLogicalPlan` is a dead code after refactoring. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #14707 from gatorsmile/removeCreateHiveTable.
*	[SPARK-16550][SPARK-17042][CORE] Certain classes fail to deserialize in ↵	Eric Liang	2016-08-22	4	-58/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	block manager replication ## What changes were proposed in this pull request? This is a straightforward clone of JoshRosen 's original patch. I have follow-up changes to fix block replication for repl-defined classes as well, but those appear to be flaking tests so I'm going to leave that for SPARK-17042 ## How was this patch tested? End-to-end test in ReplSuite (also more tests in DistributedSuite from the original patch). Author: Eric Liang <ekl@databricks.com> Closes #14311 from ericl/spark-16550.
*	[SPARK-16508][SPARKR] doc updates and more CRAN check fixes	Felix Cheung	2016-08-22	12	-114/+119
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? replace ``` ` ``` in code doc with `\code{thing}` remove added `...` for drop(DataFrame) fix remaining CRAN check warnings ## How was this patch tested? create doc with knitr junyangq Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14734 from felixcheung/rdoccleanup.
*	[SPARK-17162] Range does not support SQL generation	Eric Liang	2016-08-22	8	-18/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The range operator previously didn't support SQL generation, which made it not possible to use in views. ## How was this patch tested? Unit tests. cc hvanhovell Author: Eric Liang <ekl@databricks.com> Closes #14724 from ericl/spark-17162.
*	[MINOR][SQL] Fix some typos in comments and test hints	Sean Zhong	2016-08-22	3	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Fix some typos in comments and test hints ## How was this patch tested? N/A. Author: Sean Zhong <seanzhong@databricks.com> Closes #14755 from clockfly/fix_minor_typo.
*	[SPARKR][MINOR] Add Xiangrui and Felix to maintainers	Shivaram Venkataraman	2016-08-22	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This change adds Xiangrui Meng and Felix Cheung to the maintainers field in the package description. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #14758 from shivaram/sparkr-maintainers.
*	[SPARK-17173][SPARKR] R MLlib refactor, cleanup, reformat, fix deprecation ↵	Felix Cheung	2016-08-22	2	-117/+98
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	in test ## What changes were proposed in this pull request? refactor, cleanup, reformat, fix deprecation in test ## How was this patch tested? unit tests, manual tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14735 from felixcheung/rmllibutil.
*	[SPARK-16320][DOC] Document G1 heap region's effect on spark 2.0 vs 1.6	Sean Owen	2016-08-22	1	-19/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Collect GC discussion in one section, and documenting findings about G1 GC heap region size. ## How was this patch tested? Jekyll doc build Author: Sean Owen <sowen@cloudera.com> Closes #14732 from srowen/SPARK-16320.
*	[SPARKR][MINOR] Fix Cache Folder Path in Windows	Junyang Qian	2016-08-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR tries to fix the scheme of local cache folder in Windows. The name of the environment variable should be `LOCALAPPDATA` rather than `%LOCALAPPDATA%`. ## How was this patch tested? Manual test in Windows 7. Author: Junyang Qian <junyangq@databricks.com> Closes #14743 from junyangq/SPARKR-FixWindowsInstall.
*	[SPARK-15113][PYSPARK][ML] Add missing num features num classes	Holden Karau	2016-08-22	4	-11/+66
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Expiremental to match Scala doc. ## How was this patch tested? Extended doctests Author: Holden Karau <holden@us.ibm.com> Closes #12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.
*	[SPARK-17085][STREAMING][DOCUMENTATION AND ACTUAL CODE DIFFERS - UNSUPPORTED ↵	Jagadeesan	2016-08-22	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	OPERATIONS] Changes in Spark Stuctured Streaming doc in this link https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations Author: Jagadeesan <as2@us.ibm.com> Closes #14715 from jagadeesanas2/SPARK-17085.
*	[SPARK-17115][SQL] decrease the threshold when split expressions	Davies Liu	2016-08-22	3	-5/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? In 2.0, we change the threshold of splitting expressions from 16K to 64K, which cause very bad performance on wide table, because the generated method can't be JIT compiled by default (above the limit of 8K bytecode). This PR will decrease it to 1K, based on the benchmark results for a wide table with 400 columns of LongType. It also fix a bug around splitting expression in whole-stage codegen (it should not split them). ## How was this patch tested? Added benchmark suite. Author: Davies Liu <davies@databricks.com> Closes #14692 from davies/split_exprs.
*	[SPARK-16968] Document additional options in jdbc Writer	GraceH	2016-08-22	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) This is the document for previous JDBC Writer options. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Unit test has been added in previous PR. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: GraceH <jhuang1@paypal.com> Closes #14683 from GraceH/jdbc_options.
*	[SPARK-17127] Make unaligned access in unsafe available for AArch64	Richael	2016-08-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## # What changes were proposed in this pull request? From the spark of version 2.0.0 , when MemoryMode.OFF_HEAP is set , whether the architecture supports unaligned access or not is checked. If the check doesn't pass, exception is raised. We know that AArch64 also supports unaligned access , but now only i386, x86, amd64, and X86_64 are included. I think we should include aarch64 when performing the check. ## How was this patch tested? Unit test suite Author: Richael <Richael.Zhuang@arm.com> Closes #14700 from yimuxi/zym_change_unsafe.
*	[SPARK-16498][SQL] move hive hack for data source table into HiveExternalCatalog	Wenchen Fan	2016-08-21	24	-653/+536
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Spark SQL doesn't have its own meta store yet, and use hive's currently. However, hive's meta store has some limitations(e.g. columns can't be too many, not case-preserving, bad decimal type support, etc.), so we have some hacks to successfully store data source table metadata into hive meta store, i.e. put all the information in table properties. This PR moves these hacks to `HiveExternalCatalog`, tries to isolate hive specific logic in one place. changes overview: 1. before this PR: we need to put metadata(schema, partition columns, etc.) of data source tables to table properties before saving it to external catalog, even the external catalog doesn't use hive metastore(e.g. `InMemoryCatalog`) after this PR: the table properties tricks are only in `HiveExternalCatalog`, the caller side doesn't need to take care of it anymore. 2. before this PR: because the table properties tricks are done outside of external catalog, so we also need to revert these tricks when we read the table metadata from external catalog and use it. e.g. in `DescribeTableCommand` we will read schema and partition columns from table properties. after this PR: The table metadata read from external catalog is exactly the same with what we saved to it. bonus: now we can create data source table using `SessionCatalog`, if schema is specified. breaks: `schemaStringLengthThreshold` is not configurable anymore. `hive.default.rcfile.serde` is not configurable anymore. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #14155 from cloud-fan/catalog-table.
*	[SPARK-17098][SQL] Fix `NullPropagation` optimizer to handle `COUNT(NULL) ↵	Dongjoon Hyun	2016-08-21	3	-0/+49
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	OVER` correctly ## What changes were proposed in this pull request? Currently, `NullPropagation` optimizer replaces `COUNT` on null literals in a bottom-up fashion. During that, `WindowExpression` is not covered properly. This PR adds the missing propagation logic. Before ```scala scala> sql("SELECT COUNT(1 + NULL) OVER ()").show java.lang.UnsupportedOperationException: Cannot evaluate expression: cast(0 as bigint) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ``` After ```scala scala> sql("SELECT COUNT(1 + NULL) OVER ()").show +----------------------------------------------------------------------------------------------+ \|count((1 + CAST(NULL AS INT))) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)\| +----------------------------------------------------------------------------------------------+ \| 0\| +----------------------------------------------------------------------------------------------+ ``` ## How was this patch tested? Pass the Jenkins test with a new test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14689 from dongjoon-hyun/SPARK-17098.
*	[MINOR][R] add SparkR.Rcheck/ and SparkR_*.tar.gz to R/.gitignore	Xiangrui Meng	2016-08-21	1	-0/+2
\| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Ignore temp files generated by `check-cran.sh`. Author: Xiangrui Meng <meng@databricks.com> Closes #14740 from mengxr/R-gitignore.
*	[SPARK-17002][CORE] Document that spark.ssl.protocol. is required for SSL	wm624@hotmail.com	2016-08-21	2	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `spark.ssl.enabled`=true, but failing to set `spark.ssl.protocol` will fail and throw meaningless exception. `spark.ssl.protocol` is required when `spark.ssl.enabled`. Improvement: require `spark.ssl.protocol` when initializing SSLContext, otherwise throws an exception to indicate that. Remove the OrElse("default"). Document this requirement in configure.md ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Manual tests: Build document and check document Configure `spark.ssl.enabled` only, it throws exception below: 6/08/16 16:04:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mwang); groups with view permissions: Set(); users with modify permissions: Set(mwang); groups with modify permissions: Set() Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: spark.ssl.protocol is required when enabling SSL connections. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:285) at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1026) at org.apache.spark.deploy.master.Master$.main(Master.scala:1011) at org.apache.spark.deploy.master.Master.main(Master.scala) Configure `spark.ssl.protocol` and `spark.ssl.protocol` It works fine. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #14674 from wangmiao1981/ssl.
*	[SPARK-16961][FOLLOW-UP][SPARKR] More robust test case for ↵	Yanbo Liang	2016-08-21	1	-22/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	spark.gaussianMixture. ## What changes were proposed in this pull request? #14551 fixed off-by-one bug in ```randomizeInPlace``` and some test failure caused by this fix. But for SparkR ```spark.gaussianMixture``` test case, the fix is inappropriate. It only changed the output result of native R which should be compared by SparkR, however, it did not change the R code in annotation which is used for reproducing the result in native R. It will confuse users who can not reproduce the same result in native R. This PR sends a more robust test case which can produce same result between SparkR and native R. ## How was this patch tested? Unit test update. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14730 from yanboliang/spark-16961-followup.
*	[SPARK-17090][ML] Make tree aggregation level in linear/logistic regression ↵	hqzizania	2016-08-20	5	-17/+74
\| \| \| \| \| \| \| \| \| \| \| \|	configurable ## What changes were proposed in this pull request? Linear/logistic regression use treeAggregate with default depth (always = 2) for collecting coefficient gradient updates to the driver. For high dimensional problems, this can cause OOM error on the driver. This patch makes it configurable to avoid this problem if users' input data has many features. It adds a HasTreeDepth API in `sharedParams.scala`, and extends it to both Linear regression and logistic regression in .ml Author: hqzizania <hqzizania@gmail.com> Closes #14717 from hqzizania/SPARK-17090.
*	[SPARK-12666][CORE] SparkSubmit packages fix for when 'default' conf doesn't ↵	Bryan Cutler	2016-08-20	1	-4/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	exist in dependent module ## What changes were proposed in this pull request? Adding a "(runtime)" to the dependency configuration will set a fallback configuration to be used if the requested one is not found. E.g. with the setting "default(runtime)", Ivy will look for the conf "default" in the module ivy file and if not found will look for the conf "runtime". This can help with the case when using "sbt publishLocal" which does not write a "default" conf in the published ivy.xml file. ## How was this patch tested? used spark-submit with --packages option for a package published locally with no default conf, and a package resolved from Maven central. Author: Bryan Cutler <cutlerb@gmail.com> Closes #13428 from BryanCutler/fallback-package-conf-SPARK-12666.
*	[SPARK-17124][SQL] RelationalGroupedDataset.agg should preserve order and ↵	petermaxlee	2016-08-21	2	-2/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	allow multiple aggregates per column ## What changes were proposed in this pull request? This patch fixes a longstanding issue with one of the RelationalGroupedDataset.agg function. Even though the signature accepts vararg of pairs, the underlying implementation turns the seq into a map, and thus not order preserving nor allowing multiple aggregates per column. This change also allows users to use this function to run multiple different aggregations for a single column, e.g. ``` agg("age" -> "max", "age" -> "count") ``` ## How was this patch tested? Added a test case in DataFrameAggregateSuite. Author: petermaxlee <petermaxlee@gmail.com> Closes #14697 from petermaxlee/SPARK-17124.
*	[SPARK-17104][SQL] LogicalRelation.newInstance should follow the semantics ↵	Liang-Chi Hsieh	2016-08-20	2	-2/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	of MultiInstanceRelation ## What changes were proposed in this pull request? Currently `LogicalRelation.newInstance()` simply creates another `LogicalRelation` object with the same parameters. However, the `newInstance()` method inherited from `MultiInstanceRelation` should return a copy of object with unique expression ids. Current `LogicalRelation.newInstance()` can cause failure when doing self-join. ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #14682 from viirya/fix-localrelation.
*	[SPARKR][EXAMPLE] change example APP name	wm624@hotmail.com	2016-08-20	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) For R SQL example, appname is "MyApp". While examples in scala, Java and python, the appName is "x Spark SQL basic example". I made the R example consistent with other examples. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Manual test (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: wm624@hotmail.com <wm624@hotmail.com> Closes #14703 from wangmiao1981/example.
*	[SPARK-16508][SPARKR] Fix CRAN undocumented/duplicated arguments warnings.	Junyang Qian	2016-08-20	11	-267/+419
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR tries to fix all the remaining "undocumented/duplicated arguments" warnings given by CRAN-check. One left is doc for R `stats::glm` exported in SparkR. To mute that warning, we have to also provide document for all arguments of that non-SparkR function. Some previous conversation is in #14558. ## How was this patch tested? R unit test and `check-cran.sh` script (with no-test). Author: Junyang Qian <junyangq@databricks.com> Closes #14705 from junyangq/SPARK-16508-master.
*	[SPARK-15018][PYSPARK][ML] Improve handling of PySpark Pipeline when used ↵	Bryan Cutler	2016-08-19	2	-8/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	without stages ## What changes were proposed in this pull request? When fitting a PySpark Pipeline without the `stages` param set, a confusing NoneType error is raised as attempts to iterate over the pipeline stages. A pipeline with no stages should act as an identity transform, however the `stages` param still needs to be set to an empty list. This change improves the error output when the `stages` param is not set and adds a better description of what the API expects as input. Also minor cleanup of related code. ## How was this patch tested? Added new unit tests to verify an empty Pipeline acts as an identity transformer Author: Bryan Cutler <cutlerb@gmail.com> Closes #12790 from BryanCutler/pipeline-identity-SPARK-15018.
*	[SPARK-17150][SQL] Support SQL generation for inline tables	petermaxlee	2016-08-20	4	-2/+30
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch adds support for SQL generation for inline tables. With this, it would be possible to create a view that depends on inline tables. ## How was this patch tested? Added a test case in LogicalPlanToSQLSuite. Author: petermaxlee <petermaxlee@gmail.com> Closes #14709 from petermaxlee/SPARK-17150.
*	[SPARK-17158][SQL] Change error message for out of range numeric literals	Srinath Shankar	2016-08-19	3	-17/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Modifies error message for numeric literals to Numeric literal <literal> does not fit in range [min, max] for type <T> ## How was this patch tested? Fixed up the error messages for literals.sql in SqlQueryTestSuite and re-ran via sbt. Also fixed up error messages in ExpressionParserSuite Author: Srinath Shankar <srinath@databricks.com> Closes #14721 from srinathshankar/sc4296.
*	[SPARK-17149][SQL] array.sql for testing array related functions	petermaxlee	2016-08-19	7	-33/+248
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch creates array.sql in SQLQueryTestSuite for testing array related functions, including: - indexing - array creation - size - array_contains - sort_array ## How was this patch tested? The patch itself is about adding tests. Author: petermaxlee <petermaxlee@gmail.com> Closes #14708 from petermaxlee/SPARK-17149.
*	[SPARK-16443][SPARKR] Alternating Least Squares (ALS) wrapper	Junyang Qian	2016-08-19	6	-5/+322
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add Alternating Least Squares wrapper in SparkR. Unit tests have been updated. ## How was this patch tested? SparkR unit tests. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) ![screen shot 2016-07-27 at 3 50 31 pm](https://cloud.githubusercontent.com/assets/15318264/17195347/f7a6352a-5411-11e6-8e21-61a48070192a.png) ![screen shot 2016-07-27 at 3 50 46 pm](https://cloud.githubusercontent.com/assets/15318264/17195348/f7a7d452-5411-11e6-845f-6d292283bc28.png) Author: Junyang Qian <junyangq@databricks.com> Closes #14384 from junyangq/SPARK-16443.
*	[SPARK-17113] [SHUFFLE] Job failure due to Executor OOM in offheap mode	Sital Kedia	2016-08-19	2	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR fixes executor OOM in offheap mode due to bug in Cooperative Memory Management for UnsafeExternSorter. UnsafeExternalSorter was checking if memory page is being used by upstream by comparing the base object address of the current page with the base object address of upstream. However, in case of offheap memory allocation, the base object addresses are always null, so there was no spilling happening and eventually the operator would OOM. Following is the stack trace this issue addresses - java.lang.OutOfMemoryError: Unable to acquire 1220 bytes of memory, got 0 at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:341) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:362) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170) ## How was this patch tested? Tested by running the failing job. Author: Sital Kedia <skedia@fb.com> Closes #14693 from sitalkedia/fix_offheap_oom.
*	[SPARK-11227][CORE] UnknownHostException can be thrown when NameNode HA is ↵	Kousuke Saruta	2016-08-19	1	-1/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	enabled. ## What changes were proposed in this pull request? If the following conditions are satisfied, executors don't load properties in `hdfs-site.xml` and UnknownHostException can be thrown. (1) NameNode HA is enabled (2) spark.eventLogging is disabled or logging path is NOT on HDFS (3) Using Standalone or Mesos for the cluster manager (4) There are no code to load `HdfsCondition` class in the driver regardless of directly or indirectly. (5) The tasks access to HDFS (There might be some more conditions...) For example, following code causes UnknownHostException when the conditions above are satisfied. ``` sc.textFile("<path on HDFS>").collect ``` ``` java.lang.IllegalArgumentException: java.net.UnknownHostException: hacluster at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310) at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176) at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678) at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:170) at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:438) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411) at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:986) at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:986) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) at scala.Option.map(Option.scala:146) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:177) at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:213) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.net.UnknownHostException: hacluster ``` But following code doesn't cause the Exception because `textFile` method loads `HdfsConfiguration` indirectly. ``` sc.textFile("<path on HDFS>").collect ``` When a job includes some operations which access to HDFS, the object of `org.apache.hadoop.Configuration` is wrapped by `SerializableConfiguration`, serialized and broadcasted from driver to executors and each executor deserialize the object with `loadDefaults` false so HDFS related properties should be set before broadcasted. ## How was this patch tested? Tested manually on my standalone cluster. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #13738 from sarutak/SPARK-11227.
*	[SPARK-16673][WEB UI] New Executor Page removed conditional for Logs and ↵	Alex Bozarth	2016-08-19	2	-11/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Thread Dump columns ## What changes were proposed in this pull request? When #13670 switched `ExecutorsPage` to use JQuery DataTables it incidentally removed the conditional for the Logs and Thread Dump columns. I reimplemented the conditional display of the Logs and Thread dump columns as it was before the switch. ## How was this patch tested? Manually tested and dev/run-tests ![both](https://cloud.githubusercontent.com/assets/13952758/17186879/da8dd1a8-53eb-11e6-8b0c-d0ff0156a9a7.png) ![dump](https://cloud.githubusercontent.com/assets/13952758/17186881/dab08a04-53eb-11e6-8b1c-50ffd0bf2ae8.png) ![logs](https://cloud.githubusercontent.com/assets/13952758/17186880/dab04d00-53eb-11e6-8754-68dd64d6d9f4.png) Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #14382 from ajbozarth/spark16673.
*	[SPARK-16994][SQL] Whitelist operators for predicate pushdown	Reynold Xin	2016-08-19	4	-7/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch changes predicate pushdown optimization rule (PushDownPredicate) from using a blacklist to a whitelist. That is to say, operators must be explicitly allowed. This approach is more future-proof: previously it was possible for us to introduce a new operator and then render the optimization rule incorrect. This also fixes the bug that previously we allowed pushing filter beneath limit, which was incorrect. That is to say, before this patch, the optimizer would rewrite ``` select * from (select * from range(10) limit 5) where id > 3 to select * from range(10) where id > 3 limit 5 ``` ## How was this patch tested? - a unit test case in FilterPushdownSuite - an end-to-end test in limit.sql Author: Reynold Xin <rxin@databricks.com> Closes #14713 from rxin/SPARK-16994.
*	[SPARK-16965][MLLIB][PYSPARK] Fix bound checking for SparseVector.	Jeff Zhang	2016-08-19	3	-15/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? 1. In scala, add negative low bound checking and put all the low/upper bound checking in one place 2. In python, add low/upper bound checking of indices. ## How was this patch tested? unit test added Author: Jeff Zhang <zjffdu@apache.org> Closes #14555 from zjffdu/SPARK-16965.
*	[SPARK-17141][ML] MinMaxScaler should remain NaN value.	Yanbo Liang	2016-08-19	2	-2/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? In the existing code, ```MinMaxScaler``` handle ```NaN``` value indeterminately. * If a column has identity value, that is ```max == min```, ```MinMaxScalerModel``` transformation will output ```0.5``` for all rows even the original value is ```NaN```. * Otherwise, it will remain ```NaN``` after transformation. I think we should unify the behavior by remaining ```NaN``` value at any condition, since we don't know how to transform a ```NaN``` value. In Python sklearn, it will throw exception when there is ```NaN``` in the dataset. ## How was this patch tested? Unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14716 from yanboliang/spark-17141.
*	[SPARK-16961][CORE] Fixed off-by-one error that biased randomizeInPlace	Nick Lavers	2016-08-19	6	-15/+50
\| \| \| \| \| \| \| \| \| \| \| \| \|	JIRA issue link: https://issues.apache.org/jira/browse/SPARK-16961 Changed one line of Utils.randomizeInPlace to allow elements to stay in place. Created a unit test that runs a Pearson's chi squared test to determine whether the output diverges significantly from a uniform distribution. Author: Nick Lavers <nick.lavers@videoamp.com> Closes #14551 from nicklavers/SPARK-16961-randomizeInPlace.
*	[SPARK-7159][ML] Add multiclass logistic regression to Spark ML	sethah	2016-08-18	4	-88/+2062
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch adds a new estimator/transformer `MultinomialLogisticRegression` to spark ML. JIRA: [SPARK-7159](https://issues.apache.org/jira/browse/SPARK-7159) ## How was this patch tested? Added new test suite `MultinomialLogisticRegressionSuite`. ## Approach ### Do not use a "pivot" class in the algorithm formulation Many implementations of multinomial logistic regression treat the problem as K - 1 independent binary logistic regression models where K is the number of possible outcomes in the output variable. In this case, one outcome is chosen as a "pivot" and the other K - 1 outcomes are regressed against the pivot. This is somewhat undesirable since the coefficients returned will be different for different choices of pivot variables. An alternative approach to the problem models class conditional probabilites using the softmax function and will return uniquely identifiable coefficients (assuming regularization is applied). This second approach is used in R's glmnet and was also recommended by dbtsai. ### Separate multinomial logistic regression and binary logistic regression The initial design makes multinomial logistic regression a separate estimator/transformer than the existing LogisticRegression estimator/transformer. An alternative design would be to merge them into one. Arguments for: * The multinomial case without pivot is distinctly different than the current binary case since the binary case uses a pivot class. * The current logistic regression model in ML uses a vector of coefficients and a scalar intercept. In the multinomial case, we require a matrix of coefficients and a vector of intercepts. There are potential workarounds for this issue if we were to merge the two estimators, but none are particularly elegant. Arguments against: * It may be inconvenient for users to have to switch the estimator class when transitioning between binary and multiclass (although the new multinomial estimator can be used for two class outcomes). * Some portions of the code are repeated. This is a major design point and warrants more discussion. ### Mean centering When no regularization is applied, the coefficients will not be uniquely identifiable. This is not hard to show and is discussed in further detail [here](https://core.ac.uk/download/files/153/6287975.pdf). R's glmnet deals with this by choosing the minimum l2 regularized solution (i.e. mean centering). Additionally, the intercepts are never regularized so they are always mean centered. This is the approach taken in this PR as well. ### Feature scaling In current ML logistic regression, the features are always standardized when running the optimization algorithm. They are always returned to the user in the original feature space, however. This same approach is maintained in this patch as well, but the implementation details are different. In ML logistic regression, the unregularized feature values are divided by the column standard deviation in every gradient update iteration. In contrast, MLlib transforms the entire input dataset to the scaled space _before_ optimizaton. In ML, this means that `numFeatures * numClasses` extra scalar divisions are required in every iteration. Performance testing shows that this has significant (4x in some cases) slow downs in each iteration. This can be avoided by transforming the input to the scaled space ala MLlib once, before iteration begins. This does add some overhead initially, but can make significant time savings in some cases. One issue with this approach is that if the input data is already cached, there may not be enough memory to cache the transformed data, which would make the algorithm _much_ slower. The tradeoffs here merit more discussion. ### Specifying and inferring the number of outcome classes The estimator checks the dataframe label column for metadata which specifies the number of values. If they are not specified, the length of the `histogram` variable is used, which is essentially the maximum value found in the column. The assumption then, is that the labels are zero-indexed when they are provided to the algorithm. ## Performance Below are some performance tests I have run so far. I am happy to add more cases or trials if we deem them necessary. Test cluster: 4 bare metal nodes, 128 GB RAM each, 48 cores each Notes: * Time in units of seconds * Metric is classification accuracy \| algo \| elasticNetParam \| fitIntercept \| metric \| maxIter \| numPoints \| numClasses \| numFeatures \| time \| standardization \| regParam \| \|--------\|-------------------\|----------------\|----------\|-----------\|-------------\|--------------\|---------------\|---------\|-------------------\|------------\| \| ml \| 0 \| true \| 0.746415 \| 30 \| 100000 \| 3 \| 100000 \| 327.923 \| true \| 0 \| \| mllib \| 0 \| true \| 0.743785 \| 30 \| 100000 \| 3 \| 100000 \| 390.217 \| true \| 0 \| \| algo \| elasticNetParam \| fitIntercept \| metric \| maxIter \| numPoints \| numClasses \| numFeatures \| time \| standardization \| regParam \| \|--------\|-------------------\|----------------\|----------\|-----------\|-------------\|--------------\|---------------\|---------\|-------------------\|------------\| \| ml \| 0 \| true \| 0.973238 \| 30 \| 2000000 \| 3 \| 10000 \| 385.476 \| true \| 0 \| \| mllib \| 0 \| true \| 0.949828 \| 30 \| 2000000 \| 3 \| 10000 \| 550.403 \| true \| 0 \| \| algo \| elasticNetParam \| fitIntercept \| metric \| maxIter \| numPoints \| numClasses \| numFeatures \| time \| standardization \| regParam \| \|--------\|-------------------\|----------------\|----------\|-----------\|-------------\|--------------\|---------------\|---------\|-------------------\|------------\| \| mllib \| 0 \| true \| 0.864358 \| 30 \| 2000000 \| 3 \| 10000 \| 543.359 \| true \| 0.1 \| \| ml \| 0 \| true \| 0.867418 \| 30 \| 2000000 \| 3 \| 10000 \| 401.955 \| true \| 0.1 \| \| algo \| elasticNetParam \| fitIntercept \| metric \| maxIter \| numPoints \| numClasses \| numFeatures \| time \| standardization \| regParam \| \|--------\|-------------------\|----------------\|----------\|-----------\|-------------\|--------------\|---------------\|---------\|-------------------\|------------\| \| ml \| 1 \| true \| 0.807449 \| 30 \| 2000000 \| 3 \| 10000 \| 334.892 \| true \| 0.05 \| \| algo \| elasticNetParam \| fitIntercept \| metric \| maxIter \| numPoints \| numClasses \| numFeatures \| time \| standardization \| regParam \| \|--------\|-------------------\|----------------\|----------\|-----------\|-------------\|--------------\|---------------\|---------\|-------------------\|------------\| \| ml \| 0 \| true \| 0.602006 \| 30 \| 2000000 \| 500 \| 100 \| 112.319 \| true \| 0 \| \| mllib \| 0 \| true \| 0.567226 \| 30 \| 2000000 \| 500 \| 100 \| 263.768 \| true \| 0 \|e \| 0.567226 \| 30 \| 2000000 \| 500 \| 100 \| 263.768 \| true \| 0 \| ## References Friedman, et al. ["Regularization Paths for Generalized Linear Models via Coordinate Descent"](https://core.ac.uk/download/files/153/6287975.pdf) [http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html](http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html) ## Follow up items * Consider using level 2 BLAS routines in the gradient computations - [SPARK-17134](https://issues.apache.org/jira/browse/SPARK-17134) * Add model summary for MLOR - [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139) * Add initial model to MLOR and add test for intercept priors - [SPARK-17140](https://issues.apache.org/jira/browse/SPARK-17140) * Python API - [SPARK-17138](https://issues.apache.org/jira/browse/SPARK-17138) * Consider changing the tree aggregation level for MLOR/BLOR or making it user configurable to avoid memory problems with high dimensional data - [SPARK-17090](https://issues.apache.org/jira/browse/SPARK-17090) * Refactor helper classes out of `LogisticRegression.scala` - [SPARK-17135](https://issues.apache.org/jira/browse/SPARK-17135) * Design optimizer interface for added flexibility in ML algos - [SPARK-17136](https://issues.apache.org/jira/browse/SPARK-17136) * Support compressing the coefficients and intercepts for MLOR models - [SPARK-17137](https://issues.apache.org/jira/browse/SPARK-17137) Author: sethah <seth.hendrickson16@gmail.com> Closes #13796 from sethah/SPARK-7159_M.
*	HOTFIX: compilation broken due to protected ctor.	Reynold Xin	2016-08-18	1	-2/+1
\|
*	[SPARK-16947][SQL] Support type coercion and foldable expression for inline ↵	petermaxlee	2016-08-19	9	-46/+452
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tables ## What changes were proposed in this pull request? This patch improves inline table support with the following: 1. Support type coercion. 2. Support using foldable expressions. Previously only literals were supported. 3. Improve error message handling. 4. Improve test coverage. ## How was this patch tested? Added a new unit test suite ResolveInlineTablesSuite and a new file-based end-to-end test inline-table.sql. Author: petermaxlee <petermaxlee@gmail.com> Closes #14676 from petermaxlee/SPARK-16947.
*	[SPARK-16447][ML][SPARKR] LDA wrapper in SparkR	Xusen Yin	2016-08-18	7	-2/+490
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add LDA Wrapper in SparkR with the following interfaces: - spark.lda(data, ...) - spark.posterior(object, newData, ...) - spark.perplexity(object, ...) - summary(object) - write.ml(object) - read.ml(path) ## How was this patch tested? Test with SparkR unit test. Author: Xusen Yin <yinxusen@gmail.com> Closes #14229 from yinxusen/SPARK-16447.
*	[SPARK-17117][SQL] 1 / NULL should not fail analysis	petermaxlee	2016-08-18	4	-23/+89
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch fixes the problem described in SPARK-17117, i.e. "SELECT 1 / NULL" throws an analysis exception: ``` org.apache.spark.sql.AnalysisException: cannot resolve '(1 / NULL)' due to data type mismatch: differing types in '(1 / NULL)' (int and null). ``` The problem is that division type coercion did not take null type into account. ## How was this patch tested? A unit test for the type coercion, and a few end-to-end test cases using SQLQueryTestSuite. Author: petermaxlee <petermaxlee@gmail.com> Closes #14695 from petermaxlee/SPARK-17117.
*	[SPARK-17069] Expose spark.range() as table-valued function in SQL	Eric Liang	2016-08-18	8	-1/+267
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This adds analyzer rules for resolving table-valued functions, and adds one builtin implementation for range(). The arguments for range() are the same as those of `spark.range()`. ## How was this patch tested? Unit tests. cc hvanhovell Author: Eric Liang <ekl@databricks.com> Closes #14656 from ericl/sc-4309.