aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* [SPARKR][MINOR] Fix doc for show methodJunyang Qian2016-08-241-2/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? The original doc of `show` put methods for multiple classes together but the text only talks about `SparkDataFrame`. This PR tries to fix this problem. ## How was this patch tested? Manual test. Author: Junyang Qian <junyangq@databricks.com> Closes #14776 from junyangq/SPARK-FixShowDoc.
* [MINOR][DOC] Fix wrong ml.feature.Normalizer document.Yanbo Liang2016-08-241-1/+1
| | | | | | | | | | | | | ## What changes were proposed in this pull request? The ```ml.feature.Normalizer``` examples illustrate L1 norm rather than L2, we should correct corresponding document. ![image](https://cloud.githubusercontent.com/assets/1962026/17928637/85aec284-69b0-11e6-9b13-d465ee560581.png) ## How was this patch tested? Doc change, no test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14787 from yanboliang/normalizer.
* [SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer ↵VinceShieh2016-08-242-1/+25
| | | | | | | | | | | | | | | | | | | | | | when some quantiles are duplicated ## What changes were proposed in this pull request? In cases when QuantileDiscretizerSuite is called upon a numeric array with duplicated elements, we will take the unique elements generated from approxQuantiles as input for Bucketizer. ## How was this patch tested? An unit test is added in QuantileDiscretizerSuite QuantileDiscretizer.fit will throw an illegal exception when calling setSplits on a list of splits with duplicated elements. Bucketizer.setSplits should only accept either a numeric vector of two or more unique cut points, although that may produce less number of buckets than requested. Signed-off-by: VinceShieh <vincent.xieintel.com> Author: VinceShieh <vincent.xie@intel.com> Closes #14747 from VinceShieh/SPARK-17086.
* [MINOR][BUILD] Fix Java CheckStyle ErrorWeiqing Yang2016-08-242-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? As Spark 2.0.1 will be released soon (mentioned in the spark dev mailing list), besides the critical bugs, it's better to fix the code style errors before the release. Before: ``` ./dev/lint-java Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525] (sizes) LineLength: Line is longer than 100 characters (found 119). [ERROR] src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103). ``` After: ``` ./dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. ``` ## How was this patch tested? Manual. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #14768 from Sherry302/fixjavastyle.
* [SPARK-17186][SQL] remove catalog table type INDEXWenchen Fan2016-08-235-10/+6
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Actually Spark SQL doesn't support index, the catalog table type `INDEX` is from Hive. However, most operations in Spark SQL can't handle index table, e.g. create table, alter table, etc. Logically index table should be invisible to end users, and Hive also generates special table name for index table to avoid users accessing it directly. Hive has special SQL syntax to create/show/drop index tables. At Spark SQL side, although we can describe index table directly, but the result is unreadable, we should use the dedicated SQL syntax to do it(e.g. `SHOW INDEX ON tbl`). Spark SQL can also read index table directly, but the result is always empty.(Can hive read index table directly?) This PR remove the table type `INDEX`, to make it clear that Spark SQL doesn't support index currently. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #14752 from cloud-fan/minor2.
* [MINOR][SQL] Remove implemented functions from comments of ↵Weiqing Yang2016-08-231-4/+2
| | | | | | | | | | | | | | 'HiveSessionCatalog.scala' ## What changes were proposed in this pull request? This PR removes implemented functions from comments of `HiveSessionCatalog.scala`: `java_method`, `posexplode`, `str_to_map`. ## How was this patch tested? Manual. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #14769 from Sherry302/cleanComment.
* [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`Tejas Patil2016-08-231-1/+21
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Jira: https://issues.apache.org/jira/browse/SPARK-16862 `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k buffer to read data off disk. This PR makes it configurable to improve on disk reads. I have made the default value to be 1 MB as with that value I observed improved performance. ## How was this patch tested? I am relying on the existing unit tests. ## Performance After deploying this change to prod and setting the config to 1 mb, there was a 12% reduction in the CPU time and 19.5% reduction in CPU reservation time. Author: Tejas Patil <tejasp@fb.com> Closes #14726 from tejasapatil/spill_buffer_2.
* [SPARK-17194] Use single quotes when generating SQL for string literalsJosh Rosen2016-08-2320-22/+23
| | | | | | | | When Spark emits SQL for a string literal, it should wrap the string in single quotes, not double quotes. Databases which adhere more strictly to the ANSI SQL standards, such as Postgres, allow only single-quotes to be used for denoting string literals (see http://stackoverflow.com/a/1992331/590203). Author: Josh Rosen <joshrosen@databricks.com> Closes #14763 from JoshRosen/SPARK-17194.
* [TRIVIAL] Typo FixZheng RuiFeng2016-08-231-1/+1
| | | | | | | | | | | | ## What changes were proposed in this pull request? Fix a typo ## How was this patch tested? no tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #14772 from zhengruifeng/minor_numClasses.
* [MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in ↵hyukjinkwon2016-08-231-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | structured streaming programming guides ## What changes were proposed in this pull request? This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`). This will be a actual problem when users copy and paste the examples. This would not work. This seems only happening in `structured-streaming-programming-guide.md`. ## How was this patch tested? Manually built. This will change some examples to be correctly marked down as below: ![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png) to ![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png) Author: hyukjinkwon <gurwls223@gmail.com> Closes #14770 from HyukjinKwon/minor-quotes.
* [SPARKR][MINOR] Remove reference link for common Windows environment variablesJunyang Qian2016-08-231-3/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? The PR removes reference link in the doc for environment variables for common Windows folders. The cran check gave code 503: service unavailable on the original link. ## How was this patch tested? Manual check. Author: Junyang Qian <junyangq@databricks.com> Closes #14767 from junyangq/SPARKR-RemoveLink.
* [SPARK-13286] [SQL] add the next expression of SQLException as causeDavies Liu2016-08-231-2/+13
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Some JDBC driver (for example PostgreSQL) does not use the underlying exception as cause, but have another APIs (getNextException) to access that, so it it's included in the error logging, making us hard to find the root cause, especially in batch mode. This PR will pull out the next exception and add it as cause (if it's different) or suppressed (if there is another different cause). ## How was this patch tested? Can't reproduce this on the default JDBC driver, so did not add a regression test. Author: Davies Liu <davies@databricks.com> Closes #14722 from davies/keep_cause.
* [SPARK-17095] [Documentation] [Latex and Scala doc do not play nicely]Jagadeesan2016-08-234-13/+24
| | | | | | | | | | ## What changes were proposed in this pull request? In Latex, it is common to find "}}}" when closing several expressions at once. [SPARK-16822](https://issues.apache.org/jira/browse/SPARK-16822) added Mathjax to render Latex equations in scaladoc. However, when scala doc sees "}}}" or "{{{" it treats it as a special character for code block. This results in some very strange output. Author: Jagadeesan <as2@us.ibm.com> Closes #14688 from jagadeesanas2/SPARK-17095.
* [SPARK-17199] Use CatalystConf.resolver for case-sensitivity comparisonJacek Laskowski2016-08-234-27/+5
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Use `CatalystConf.resolver` consistently for case-sensitivity comparison (removed dups). ## How was this patch tested? Local build. Waiting for Jenkins to ensure clean build and test. Author: Jacek Laskowski <jacek@japila.pl> Closes #14771 from jaceklaskowski/17199-catalystconf-resolver.
* [SPARK-17188][SQL] Moves class QuantileSummaries to project catalyst for ↵Sean Zhong2016-08-233-251/+267
| | | | | | | | | | | | | | | | implementing percentile_approx ## What changes were proposed in this pull request? This is a sub-task of [SPARK-16283](https://issues.apache.org/jira/browse/SPARK-16283) (Implement percentile_approx SQL function), which moves class QuantileSummaries to project catalyst so that it can be reused when implementing aggregation function `percentile_approx`. ## How was this patch tested? This PR only does class relocation, class implementation is not changed. Author: Sean Zhong <seanzhong@databricks.com> Closes #14754 from clockfly/move_QuantileSummaries_to_catalyst.
* [SPARKR][MINOR] Update R DESCRIPTION fileFelix Cheung2016-08-221-4/+9
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Update DESCRIPTION ## How was this patch tested? Run install and CRAN tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14764 from felixcheung/rpackagedescription.
* [SPARK-17182][SQL] Mark Collect as non-deterministicCheng Lian2016-08-231-0/+4
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR marks the abstract class `Collect` as non-deterministic since the results of `CollectList` and `CollectSet` depend on the actual order of input rows. ## How was this patch tested? Existing test cases should be enough. Author: Cheng Lian <lian@databricks.com> Closes #14749 from liancheng/spark-17182-non-deterministic-collect.
* [SPARK-16577][SPARKR] Add CRAN documentation checks to run-tests.shShivaram Venkataraman2016-08-222-6/+39
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? This change adds CRAN documentation checks to be run as a part of `R/run-tests.sh` . As this script is also used by Jenkins this means that we will get documentation checks on every PR going forward. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #14759 from shivaram/sparkr-cran-jenkins.
* [SPARK-17090][FOLLOW-UP][ML] Add expert param support to SharedParamsCodeGenhqzizania2016-08-222-6/+12
| | | | | | | | | | ## What changes were proposed in this pull request? Add expert param support to SharedParamsCodeGen where aggregationDepth a expert param is added. Author: hqzizania <hqzizania@gmail.com> Closes #14738 from hqzizania/SPARK-17090-minor.
* [SPARK-17144][SQL] Removal of useless CreateHiveTableAsSelectLogicalPlangatorsmile2016-08-231-18/+1
| | | | | | | | | | | | ## What changes were proposed in this pull request? `CreateHiveTableAsSelectLogicalPlan` is a dead code after refactoring. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #14707 from gatorsmile/removeCreateHiveTable.
* [SPARK-16550][SPARK-17042][CORE] Certain classes fail to deserialize in ↵Eric Liang2016-08-224-58/+60
| | | | | | | | | | | | | | | | block manager replication ## What changes were proposed in this pull request? This is a straightforward clone of JoshRosen 's original patch. I have follow-up changes to fix block replication for repl-defined classes as well, but those appear to be flaking tests so I'm going to leave that for SPARK-17042 ## How was this patch tested? End-to-end test in ReplSuite (also more tests in DistributedSuite from the original patch). Author: Eric Liang <ekl@databricks.com> Closes #14311 from ericl/spark-16550.
* [SPARK-16508][SPARKR] doc updates and more CRAN check fixesFelix Cheung2016-08-2212-114/+119
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? replace ``` ` ``` in code doc with `\code{thing}` remove added `...` for drop(DataFrame) fix remaining CRAN check warnings ## How was this patch tested? create doc with knitr junyangq Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14734 from felixcheung/rdoccleanup.
* [SPARK-17162] Range does not support SQL generationEric Liang2016-08-228-18/+44
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The range operator previously didn't support SQL generation, which made it not possible to use in views. ## How was this patch tested? Unit tests. cc hvanhovell Author: Eric Liang <ekl@databricks.com> Closes #14724 from ericl/spark-17162.
* [MINOR][SQL] Fix some typos in comments and test hintsSean Zhong2016-08-223-7/+7
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix some typos in comments and test hints ## How was this patch tested? N/A. Author: Sean Zhong <seanzhong@databricks.com> Closes #14755 from clockfly/fix_minor_typo.
* [SPARKR][MINOR] Add Xiangrui and Felix to maintainersShivaram Venkataraman2016-08-221-0/+2
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This change adds Xiangrui Meng and Felix Cheung to the maintainers field in the package description. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #14758 from shivaram/sparkr-maintainers.
* [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reformat, fix deprecation ↵Felix Cheung2016-08-222-117/+98
| | | | | | | | | | | | | | | | in test ## What changes were proposed in this pull request? refactor, cleanup, reformat, fix deprecation in test ## How was this patch tested? unit tests, manual tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14735 from felixcheung/rmllibutil.
* [SPARK-16320][DOC] Document G1 heap region's effect on spark 2.0 vs 1.6Sean Owen2016-08-221-19/+17
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Collect GC discussion in one section, and documenting findings about G1 GC heap region size. ## How was this patch tested? Jekyll doc build Author: Sean Owen <sowen@cloudera.com> Closes #14732 from srowen/SPARK-16320.
* [SPARKR][MINOR] Fix Cache Folder Path in WindowsJunyang Qian2016-08-221-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR tries to fix the scheme of local cache folder in Windows. The name of the environment variable should be `LOCALAPPDATA` rather than `%LOCALAPPDATA%`. ## How was this patch tested? Manual test in Windows 7. Author: Junyang Qian <junyangq@databricks.com> Closes #14743 from junyangq/SPARKR-FixWindowsInstall.
* [SPARK-15113][PYSPARK][ML] Add missing num features num classesHolden Karau2016-08-224-11/+66
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Expiremental to match Scala doc. ## How was this patch tested? Extended doctests Author: Holden Karau <holden@us.ibm.com> Closes #12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.
* [SPARK-17085][STREAMING][DOCUMENTATION AND ACTUAL CODE DIFFERS - UNSUPPORTED ↵Jagadeesan2016-08-221-2/+2
| | | | | | | | | | | OPERATIONS] Changes in Spark Stuctured Streaming doc in this link https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations Author: Jagadeesan <as2@us.ibm.com> Closes #14715 from jagadeesanas2/SPARK-17085.
* [SPARK-17115][SQL] decrease the threshold when split expressionsDavies Liu2016-08-223-5/+59
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In 2.0, we change the threshold of splitting expressions from 16K to 64K, which cause very bad performance on wide table, because the generated method can't be JIT compiled by default (above the limit of 8K bytecode). This PR will decrease it to 1K, based on the benchmark results for a wide table with 400 columns of LongType. It also fix a bug around splitting expression in whole-stage codegen (it should not split them). ## How was this patch tested? Added benchmark suite. Author: Davies Liu <davies@databricks.com> Closes #14692 from davies/split_exprs.
* [SPARK-16968] Document additional options in jdbc WriterGraceH2016-08-221-0/+14
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) This is the document for previous JDBC Writer options. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Unit test has been added in previous PR. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: GraceH <jhuang1@paypal.com> Closes #14683 from GraceH/jdbc_options.
* [SPARK-17127] Make unaligned access in unsafe available for AArch64Richael2016-08-221-1/+1
| | | | | | | | | | | | | | | | | | ## # What changes were proposed in this pull request? From the spark of version 2.0.0 , when MemoryMode.OFF_HEAP is set , whether the architecture supports unaligned access or not is checked. If the check doesn't pass, exception is raised. We know that AArch64 also supports unaligned access , but now only i386, x86, amd64, and X86_64 are included. I think we should include aarch64 when performing the check. ## How was this patch tested? Unit test suite Author: Richael <Richael.Zhuang@arm.com> Closes #14700 from yimuxi/zym_change_unsafe.
* [SPARK-16498][SQL] move hive hack for data source table into HiveExternalCatalogWenchen Fan2016-08-2124-653/+536
| | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Spark SQL doesn't have its own meta store yet, and use hive's currently. However, hive's meta store has some limitations(e.g. columns can't be too many, not case-preserving, bad decimal type support, etc.), so we have some hacks to successfully store data source table metadata into hive meta store, i.e. put all the information in table properties. This PR moves these hacks to `HiveExternalCatalog`, tries to isolate hive specific logic in one place. changes overview: 1. **before this PR**: we need to put metadata(schema, partition columns, etc.) of data source tables to table properties before saving it to external catalog, even the external catalog doesn't use hive metastore(e.g. `InMemoryCatalog`) **after this PR**: the table properties tricks are only in `HiveExternalCatalog`, the caller side doesn't need to take care of it anymore. 2. **before this PR**: because the table properties tricks are done outside of external catalog, so we also need to revert these tricks when we read the table metadata from external catalog and use it. e.g. in `DescribeTableCommand` we will read schema and partition columns from table properties. **after this PR**: The table metadata read from external catalog is exactly the same with what we saved to it. bonus: now we can create data source table using `SessionCatalog`, if schema is specified. breaks: `schemaStringLengthThreshold` is not configurable anymore. `hive.default.rcfile.serde` is not configurable anymore. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #14155 from cloud-fan/catalog-table.
* [SPARK-17098][SQL] Fix `NullPropagation` optimizer to handle `COUNT(NULL) ↵Dongjoon Hyun2016-08-213-0/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OVER` correctly ## What changes were proposed in this pull request? Currently, `NullPropagation` optimizer replaces `COUNT` on null literals in a bottom-up fashion. During that, `WindowExpression` is not covered properly. This PR adds the missing propagation logic. **Before** ```scala scala> sql("SELECT COUNT(1 + NULL) OVER ()").show java.lang.UnsupportedOperationException: Cannot evaluate expression: cast(0 as bigint) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ``` **After** ```scala scala> sql("SELECT COUNT(1 + NULL) OVER ()").show +----------------------------------------------------------------------------------------------+ |count((1 + CAST(NULL AS INT))) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)| +----------------------------------------------------------------------------------------------+ | 0| +----------------------------------------------------------------------------------------------+ ``` ## How was this patch tested? Pass the Jenkins test with a new test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14689 from dongjoon-hyun/SPARK-17098.
* [MINOR][R] add SparkR.Rcheck/ and SparkR_*.tar.gz to R/.gitignoreXiangrui Meng2016-08-211-0/+2
| | | | | | | | | | ## What changes were proposed in this pull request? Ignore temp files generated by `check-cran.sh`. Author: Xiangrui Meng <meng@databricks.com> Closes #14740 from mengxr/R-gitignore.
* [SPARK-17002][CORE] Document that spark.ssl.protocol. is required for SSLwm624@hotmail.com2016-08-212-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? `spark.ssl.enabled`=true, but failing to set `spark.ssl.protocol` will fail and throw meaningless exception. `spark.ssl.protocol` is required when `spark.ssl.enabled`. Improvement: require `spark.ssl.protocol` when initializing SSLContext, otherwise throws an exception to indicate that. Remove the OrElse("default"). Document this requirement in configure.md ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Manual tests: Build document and check document Configure `spark.ssl.enabled` only, it throws exception below: 6/08/16 16:04:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mwang); groups with view permissions: Set(); users with modify permissions: Set(mwang); groups with modify permissions: Set() Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: spark.ssl.protocol is required when enabling SSL connections. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:285) at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1026) at org.apache.spark.deploy.master.Master$.main(Master.scala:1011) at org.apache.spark.deploy.master.Master.main(Master.scala) Configure `spark.ssl.protocol` and `spark.ssl.protocol` It works fine. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #14674 from wangmiao1981/ssl.
* [SPARK-16961][FOLLOW-UP][SPARKR] More robust test case for ↵Yanbo Liang2016-08-211-22/+25
| | | | | | | | | | | | | | | spark.gaussianMixture. ## What changes were proposed in this pull request? #14551 fixed off-by-one bug in ```randomizeInPlace``` and some test failure caused by this fix. But for SparkR ```spark.gaussianMixture``` test case, the fix is inappropriate. It only changed the output result of native R which should be compared by SparkR, however, it did not change the R code in annotation which is used for reproducing the result in native R. It will confuse users who can not reproduce the same result in native R. This PR sends a more robust test case which can produce same result between SparkR and native R. ## How was this patch tested? Unit test update. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14730 from yanboliang/spark-16961-followup.
* [SPARK-17090][ML] Make tree aggregation level in linear/logistic regression ↵hqzizania2016-08-205-17/+74
| | | | | | | | | | | | configurable ## What changes were proposed in this pull request? Linear/logistic regression use treeAggregate with default depth (always = 2) for collecting coefficient gradient updates to the driver. For high dimensional problems, this can cause OOM error on the driver. This patch makes it configurable to avoid this problem if users' input data has many features. It adds a HasTreeDepth API in `sharedParams.scala`, and extends it to both Linear regression and logistic regression in .ml Author: hqzizania <hqzizania@gmail.com> Closes #14717 from hqzizania/SPARK-17090.
* [SPARK-12666][CORE] SparkSubmit packages fix for when 'default' conf doesn't ↵Bryan Cutler2016-08-201-4/+7
| | | | | | | | | | | | | | | exist in dependent module ## What changes were proposed in this pull request? Adding a "(runtime)" to the dependency configuration will set a fallback configuration to be used if the requested one is not found. E.g. with the setting "default(runtime)", Ivy will look for the conf "default" in the module ivy file and if not found will look for the conf "runtime". This can help with the case when using "sbt publishLocal" which does not write a "default" conf in the published ivy.xml file. ## How was this patch tested? used spark-submit with --packages option for a package published locally with no default conf, and a package resolved from Maven central. Author: Bryan Cutler <cutlerb@gmail.com> Closes #13428 from BryanCutler/fallback-package-conf-SPARK-12666.
* [SPARK-17124][SQL] RelationalGroupedDataset.agg should preserve order and ↵petermaxlee2016-08-212-2/+14
| | | | | | | | | | | | | | | | | | | allow multiple aggregates per column ## What changes were proposed in this pull request? This patch fixes a longstanding issue with one of the RelationalGroupedDataset.agg function. Even though the signature accepts vararg of pairs, the underlying implementation turns the seq into a map, and thus not order preserving nor allowing multiple aggregates per column. This change also allows users to use this function to run multiple different aggregations for a single column, e.g. ``` agg("age" -> "max", "age" -> "count") ``` ## How was this patch tested? Added a test case in DataFrameAggregateSuite. Author: petermaxlee <petermaxlee@gmail.com> Closes #14697 from petermaxlee/SPARK-17124.
* [SPARK-17104][SQL] LogicalRelation.newInstance should follow the semantics ↵Liang-Chi Hsieh2016-08-202-2/+16
| | | | | | | | | | | | | | | | of MultiInstanceRelation ## What changes were proposed in this pull request? Currently `LogicalRelation.newInstance()` simply creates another `LogicalRelation` object with the same parameters. However, the `newInstance()` method inherited from `MultiInstanceRelation` should return a copy of object with unique expression ids. Current `LogicalRelation.newInstance()` can cause failure when doing self-join. ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #14682 from viirya/fix-localrelation.
* [SPARKR][EXAMPLE] change example APP namewm624@hotmail.com2016-08-201-1/+1
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) For R SQL example, appname is "MyApp". While examples in scala, Java and python, the appName is "x Spark SQL basic example". I made the R example consistent with other examples. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Manual test (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: wm624@hotmail.com <wm624@hotmail.com> Closes #14703 from wangmiao1981/example.
* [SPARK-16508][SPARKR] Fix CRAN undocumented/duplicated arguments warnings.Junyang Qian2016-08-2011-267/+419
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR tries to fix all the remaining "undocumented/duplicated arguments" warnings given by CRAN-check. One left is doc for R `stats::glm` exported in SparkR. To mute that warning, we have to also provide document for all arguments of that non-SparkR function. Some previous conversation is in #14558. ## How was this patch tested? R unit test and `check-cran.sh` script (with no-test). Author: Junyang Qian <junyangq@databricks.com> Closes #14705 from junyangq/SPARK-16508-master.
* [SPARK-15018][PYSPARK][ML] Improve handling of PySpark Pipeline when used ↵Bryan Cutler2016-08-192-8/+14
| | | | | | | | | | | | | | | without stages ## What changes were proposed in this pull request? When fitting a PySpark Pipeline without the `stages` param set, a confusing NoneType error is raised as attempts to iterate over the pipeline stages. A pipeline with no stages should act as an identity transform, however the `stages` param still needs to be set to an empty list. This change improves the error output when the `stages` param is not set and adds a better description of what the API expects as input. Also minor cleanup of related code. ## How was this patch tested? Added new unit tests to verify an empty Pipeline acts as an identity transformer Author: Bryan Cutler <cutlerb@gmail.com> Closes #12790 from BryanCutler/pipeline-identity-SPARK-15018.
* [SPARK-17150][SQL] Support SQL generation for inline tablespetermaxlee2016-08-204-2/+30
| | | | | | | | | | | | ## What changes were proposed in this pull request? This patch adds support for SQL generation for inline tables. With this, it would be possible to create a view that depends on inline tables. ## How was this patch tested? Added a test case in LogicalPlanToSQLSuite. Author: petermaxlee <petermaxlee@gmail.com> Closes #14709 from petermaxlee/SPARK-17150.
* [SPARK-17158][SQL] Change error message for out of range numeric literalsSrinath Shankar2016-08-193-17/+27
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Modifies error message for numeric literals to Numeric literal <literal> does not fit in range [min, max] for type <T> ## How was this patch tested? Fixed up the error messages for literals.sql in SqlQueryTestSuite and re-ran via sbt. Also fixed up error messages in ExpressionParserSuite Author: Srinath Shankar <srinath@databricks.com> Closes #14721 from srinathshankar/sc4296.
* [SPARK-17149][SQL] array.sql for testing array related functionspetermaxlee2016-08-197-33/+248
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch creates array.sql in SQLQueryTestSuite for testing array related functions, including: - indexing - array creation - size - array_contains - sort_array ## How was this patch tested? The patch itself is about adding tests. Author: petermaxlee <petermaxlee@gmail.com> Closes #14708 from petermaxlee/SPARK-17149.
* [SPARK-16443][SPARKR] Alternating Least Squares (ALS) wrapperJunyang Qian2016-08-196-5/+322
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add Alternating Least Squares wrapper in SparkR. Unit tests have been updated. ## How was this patch tested? SparkR unit tests. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) ![screen shot 2016-07-27 at 3 50 31 pm](https://cloud.githubusercontent.com/assets/15318264/17195347/f7a6352a-5411-11e6-8e21-61a48070192a.png) ![screen shot 2016-07-27 at 3 50 46 pm](https://cloud.githubusercontent.com/assets/15318264/17195348/f7a7d452-5411-11e6-845f-6d292283bc28.png) Author: Junyang Qian <junyangq@databricks.com> Closes #14384 from junyangq/SPARK-16443.
* [SPARK-17113] [SHUFFLE] Job failure due to Executor OOM in offheap modeSital Kedia2016-08-192-1/+8
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes executor OOM in offheap mode due to bug in Cooperative Memory Management for UnsafeExternSorter. UnsafeExternalSorter was checking if memory page is being used by upstream by comparing the base object address of the current page with the base object address of upstream. However, in case of offheap memory allocation, the base object addresses are always null, so there was no spilling happening and eventually the operator would OOM. Following is the stack trace this issue addresses - java.lang.OutOfMemoryError: Unable to acquire 1220 bytes of memory, got 0 at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:341) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:362) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170) ## How was this patch tested? Tested by running the failing job. Author: Sital Kedia <skedia@fb.com> Closes #14693 from sitalkedia/fix_offheap_oom.