* [SPARK-13096][TEST] Fix flaky verifyPeakExecutionMemorySet
  Andrew Or, 2016-01-29 (1 file changed, -0/+2)
  Previously we would assert things before all events are guaranteed to have been processed. To fix this, just block until all events are actually processed, i.e. until the listener queue is empty.
  https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/79/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling/
  Author: Andrew Or <andrew@databricks.com>
  Closes #10990 from andrewor14/accum-suite-less-flaky.
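  A minimal sketch of the pattern this fix relies on; the helper below and the exact `waitUntilEmpty` signature are assumptions of this sketch (it presumes a test-visible listener bus), not the patch itself:
  ```scala
  import org.apache.spark.SparkContext

  // Sketch only: drain the listener bus before asserting on listener-derived state.
  def runAndAssert(sc: SparkContext)(assertion: => Unit): Unit = {
    sc.parallelize(1 to 100, 4).count()    // work that posts the events under test
    sc.listenerBus.waitUntilEmpty(10000L)  // block until every queued event has been handled
    assertion                              // only now is it safe to assert on listener results
  }
  ```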
* [SPARK-13076][SQL] Rename ClientInterface -> HiveClient
  Reynold Xin, 2016-01-29 (12 files changed, -42/+41)
  And ClientWrapper -> HiveClientImpl. I have some followup pull requests to introduce a new internal catalog, and I think this new naming better reflects the functionality of the two classes.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #10981 from rxin/SPARK-13076.
* [SPARK-13055] SQLHistoryListener throws ClassCastException
  Andrew Or, 2016-01-29 (13 files changed, -45/+133)
  This is an existing issue uncovered recently by #10835. The exception occurs because the `SQLHistoryListener` gets all sorts of accumulators, not just the ones that represent SQL metrics. For example, the listener gets `internal.metrics.shuffleRead.remoteBlocksFetched`, which is an Int, and then proceeds to cast the Int to a Long, which fails.
  The fix is to mark accumulators representing SQL metrics using some internal metadata. Then we can identify which ones are SQL metrics and only process those in the `SQLHistoryListener`.
  Author: Andrew Or <andrew@databricks.com>
  Closes #10971 from andrewor14/fix-sql-history.
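  For illustration only (not the listener code itself), the failure mode boils down to a boxed `Integer` being cast to `Long`:
  ```scala
  import scala.util.Try

  // A boxed Int arriving as Any cannot be cast to Long on the JVM.
  val update: Any = 42                         // e.g. a shuffleRead.remoteBlocksFetched value
  val asLong = Try(update.asInstanceOf[Long])  // Failure: java.lang.Integer cannot be cast to java.lang.Long
  ```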
* [SPARK-12818] Polishes spark-sketch module
  Cheng Lian, 2016-01-29 (6 files changed, -83/+110)
  Fixes various minor code and Javadoc styling issues.
  Author: Cheng Lian <lian@databricks.com>
  Closes #10985 from liancheng/sketch-polishing.
* [SPARK-12656] [SQL] Implement Intersect with Left-semi Join
  gatorsmile, 2016-01-29 (11 files changed, -122/+211)
  Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
  After a search, I found that at least one mainstream RDBMS does the same: in its query explain output, Intersect is replaced by a Left-semi Join. Left-semi Join could also help outer-join elimination in the Optimizer, as shown in the PR: https://github.com/apache/spark/pull/10566
  Author: gatorsmile <gatorsmile@gmail.com>
  Author: xiaoli <lixiao1983@gmail.com>
  Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
  Closes #10630 from gatorsmile/IntersectBySemiJoin.
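  A sketch of the equivalence described in the entry above, expressed with the DataFrame API rather than the actual optimizer rule (the `SparkSession` setup is illustrative):
  ```scala
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").appName("intersect-as-semijoin").getOrCreate()
  import spark.implicits._

  val a = Seq(1, 2, 2, 3).toDF("x")
  val b = Seq(2, 3, 4).toDF("x")

  // INTERSECT is a DISTINCT over a left-semi join on null-safe equality of all columns.
  val viaIntersect = a.intersect(b)
  val joinCond = a.columns.map(c => a(c) <=> b(c)).reduce(_ && _)
  val viaSemiJoin = a.join(b, joinCond, "leftsemi").distinct()
  // Both return the rows with x in {2, 3}.
  ```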
* [SPARK-13072] [SQL] simplify and improve murmur3 hash expression codegen
  Wenchen Fan, 2016-01-29 (1 file changed, -86/+69)
  Simplify the generated code of the hash expression (remove several unnecessary local variables) and avoid null checks where possible.
  Generated code comparison for `hash(int, double, string, array<string>)`:
  **before:**
  ```
  public UnsafeRow apply(InternalRow i) {
    /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */
    int value1 = 42;
    /* input[0, int] */
    int value3 = i.getInt(0);
    if (!false) {
      value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1);
    }
    /* input[1, double] */
    double value5 = i.getDouble(1);
    if (!false) {
      value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1);
    }
    /* input[2, string] */
    boolean isNull6 = i.isNullAt(2);
    UTF8String value7 = isNull6 ? null : (i.getUTF8String(2));
    if (!isNull6) {
      value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1);
    }
    /* input[3, array<int>] */
    boolean isNull8 = i.isNullAt(3);
    ArrayData value9 = isNull8 ? null : (i.getArray(3));
    if (!isNull8) {
      int result10 = value1;
      for (int index11 = 0; index11 < value9.numElements(); index11++) {
        if (!value9.isNullAt(index11)) {
          final int element12 = value9.getInt(index11);
          result10 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element12, result10);
        }
      }
      value1 = result10;
    }
  }
  ```
  **after:**
  ```
  public UnsafeRow apply(InternalRow i) {
    /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */
    int value1 = 42;
    /* input[0, int] */
    int value3 = i.getInt(0);
    value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1);
    /* input[1, double] */
    double value5 = i.getDouble(1);
    value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1);
    /* input[2, string] */
    boolean isNull6 = i.isNullAt(2);
    UTF8String value7 = isNull6 ? null : (i.getUTF8String(2));
    if (!isNull6) {
      value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1);
    }
    /* input[3, array<int>] */
    boolean isNull8 = i.isNullAt(3);
    ArrayData value9 = isNull8 ? null : (i.getArray(3));
    if (!isNull8) {
      for (int index10 = 0; index10 < value9.numElements(); index10++) {
        final int element11 = value9.getInt(index10);
        value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element11, value1);
      }
    }
    rowWriter14.write(0, value1);
    return result12;
  }
  ```
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10974 from cloud-fan/codegen.
* [SPARK-10873] Support column sort and search for History Server.
  zhuol, 2016-01-29 (28 files changed, -202/+1721)
  [SPARK-10873] Support column sort and search for History Server using jQuery DataTables and the REST API. Before this commit, the history server generated hard-coded HTML and could not support search; also, sorting was disabled if any application had more than one attempt. Supporting search and sort (over all applications rather than the 20 entries on the current page) in any case will greatly improve the user experience.
  1. Create historypage-template.html for displaying application information in DataTables.
  2. historypage.js uses jQuery to access the data from the /api/v1/applications REST API, and uses DataTables to display each application's information. For applications that have more than one attempt, RowsGroup is used to merge such entries while still supporting sort and search.
  3. "duration" and "lastUpdated" REST API fields are added to each application's "attempts".
  4. External javascript and css files for DataTables, RowsGroup and jQuery plugins are added with licenses clarified.
  Snapshots of how it looks now:
  History page view: ![historypage](https://cloud.githubusercontent.com/assets/11683054/12184383/89bad774-b55a-11e5-84e4-b0276172976f.png)
  Search: ![search](https://cloud.githubusercontent.com/assets/11683054/12184385/8d3b94b0-b55a-11e5-869a-cc0ef0a4242a.png)
  Sort by started time: ![sort-by-started-time](https://cloud.githubusercontent.com/assets/11683054/12184387/8f757c3c-b55a-11e5-98c8-577936366566.png)
  Author: zhuol <zhuol@yahoo-inc.com>
  Closes #10648 from zhuoliu/10873.
* [SPARK-13032][ML][PYSPARK] PySpark support model export/import and take LinearRegression as example
  Yanbo Liang, 2016-01-29 (5 files changed, -29/+236)
  - Implement `MLWriter`/`MLWritable`/`MLReader`/`MLReadable` for PySpark.
  - Make `LinearRegression` support `save`/`load` as an example.
  After this is merged, the work for other transformers/estimators will be easy, then we can list and distribute the tasks to the community.
  cc mengxr jkbradley
  Author: Yanbo Liang <ybliang8@gmail.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #10469 from yanboliang/spark-11939.
* [SPARK-13031][SQL] cleanup codegen and improve test coverage
  Davies Liu, 2016-01-29 (11 files changed, -205/+350)
  1. Enable whole stage codegen during tests even if only one operator supports it.
  2. Split doProduce() into two APIs: upstream() and doProduce().
  3. Generate a prefix for the fresh names of each operator.
  4. Pass UnsafeRow to the parent directly (avoid getters and creating UnsafeRow again).
  5. Fix bugs and tests.
  This PR re-opens #10944 and fixes the bug.
  Author: Davies Liu <davies@databricks.com>
  Closes #10977 from davies/gen_refactor.
* [SPARK-13050][BUILD] Scalatest tags fail build with the addition of the sketch module
  Alex Bozarth, 2016-01-28 (1 file changed, -0/+7)
  A dependency on the spark test tags was left out of the sketch module pom file causing builds to fail when test tags were used. This dependency is found in the pom file for every other module in spark.
  Author: Alex Bozarth <ajbozart@us.ibm.com>
  Closes #10954 from ajbozarth/spark13050.
* [SPARK-13067] [SQL] workaround for a weird scala reflection problem
  Wenchen Fan, 2016-01-28 (2 files changed, -6/+23)
  A simple workaround to avoid getting parameter types when converting a logical plan to JSON.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10970 from cloud-fan/reflection.
* [SPARK-12968][SQL] Implement command to set current database
  Liang-Chi Hsieh, 2016-01-28 (9 files changed, -3/+50)
  JIRA: https://issues.apache.org/jira/browse/SPARK-12968
  Implement command to set current database.
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #10916 from viirya/ddl-use-database.
* Revert "[SPARK-13031] [SQL] cleanup codegen and improve test coverage"
  Davies Liu, 2016-01-28 (9 files changed, -334/+202)
  This reverts commit cc18a7199240bf3b03410c1ba6704fe7ce6ae38e.
* [SPARK-11955][SQL] Mark optional fields in merging schema for safely pushing down filters in Parquet
  Liang-Chi Hsieh, 2016-01-28 (6 files changed, -29/+117)
  JIRA: https://issues.apache.org/jira/browse/SPARK-11955
  Currently we simply skip pushing down filters in Parquet if schema merging is enabled. However, we can actually mark particular fields in the merged schema so that filters can safely be pushed down for them.
  Author: Liang-Chi Hsieh <viirya@appier.com>
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #9940 from viirya/safe-pushdown-parquet-filters.
* [SPARK-12749][SQL] add json option to parse floating-point types as DecimalType
  Brandon Bradley, 2016-01-28 (5 files changed, -2/+40)
  I tried to add this via the `USE_BIG_DECIMAL_FOR_FLOATS` option from Jackson with no success. Added test for non-complex types. Should I add a test for complex types?
  Author: Brandon Bradley <bradleytastic@gmail.com>
  Closes #10936 from blbradley/spark-12749.
* [SPARK-12401][SQL] Add integration tests for postgres enum types
  Takeshi YAMAMURO, 2016-01-28 (1 file changed, -6/+9)
  We can handle postgresql-specific enum types as strings in JDBC, so we should just add tests and close the corresponding JIRA ticket.
  Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
  Closes #10596 from maropu/AddTestsInIntegration.
* [SPARK-9835][ML] Implement IterativelyReweightedLeastSquares solver
  Yanbo Liang, 2016-01-28 (3 files changed, -1/+314)
  Implement the `IterativelyReweightedLeastSquares` solver for GLMs. I consider it a solver rather than an estimator; it is only used internally, so I keep it `private[ml]`.
  There are two limitations in the current implementation compared with R:
  - It cannot support a `Tuple` as the response for the `Binomial` family, such as the following code:
  ```
  glm( cbind(using, notUsing) ~ age + education + wantsMore , family = binomial)
  ```
  - It does not support `offset`.
  Because `RFormula` does not support `Tuple` as a label or the `offset` keyword, I simplified the implementation. Adding support for these two features is not very hard; I can do it in a follow-up PR if necessary. Meanwhile, we can also add an R-like statistics summary for IRLS.
  The implementation refers to R, [statsmodels](https://github.com/statsmodels/statsmodels) and [sparkGLM](https://github.com/AlteryxLabs/sparkGLM).
  Please focus on the main structure and pass over minor issues/docs that I will update later. Any comments and opinions will be appreciated.
  cc mengxr jkbradley
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #10639 from yanboliang/spark-9835.
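  For reference, the textbook IRLS iteration such a solver performs for a GLM with link g and variance function V (standard notation, not necessarily the exact parameterization used in this PR):
  ```latex
  \begin{align*}
  \eta_i &= x_i^\top \beta^{(t)}, \qquad \mu_i = g^{-1}(\eta_i) \\
  z_i &= \eta_i + (y_i - \mu_i)\, g'(\mu_i), \qquad w_i = \frac{1}{g'(\mu_i)^2\, V(\mu_i)} \\
  \beta^{(t+1)} &= \left(X^\top W X\right)^{-1} X^\top W z, \qquad W = \operatorname{diag}(w_1, \dots, w_n)
  \end{align*}
  ```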
* [SPARK-13031] [SQL] cleanup codegen and improve test coverage
  Davies Liu, 2016-01-28 (9 files changed, -202/+334)
  1. Enable whole stage codegen during tests even if only one operator supports it.
  2. Split doProduce() into two APIs: upstream() and doProduce().
  3. Generate a prefix for the fresh names of each operator.
  4. Pass UnsafeRow to the parent directly (avoid getters and creating UnsafeRow again).
  5. Fix bugs and tests.
  Author: Davies Liu <davies@databricks.com>
  Closes #10944 from davies/gen_refactor.
* [SPARK-12926][SQL] SQLContext to display warning message when non-sql configs are being set
  Tejas Patil, 2016-01-28 (1 file changed, -3/+11)
  Users unknowingly try to set core Spark configs in SQLContext but later realise that it didn't work, e.g. sqlContext.sql("SET spark.shuffle.memoryFraction=0.4"). This PR adds a warning message when such operations are done.
  Author: Tejas Patil <tejasp@fb.com>
  Closes #10849 from tejasapatil/SPARK-12926.
* [SPARK-12818][SQL] Specialized integral and string types for Count-min Sketch
  Cheng Lian, 2016-01-28 (3 files changed, -35/+99)
  This PR is a follow-up of #10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`.
  Author: Cheng Lian <lian@databricks.com>
  Closes #10968 from liancheng/cms-specialized.
* Provide same info as in spark-submit --help
  James Lohse, 2016-01-28 (1 file changed, -2/+3)
  This is stated for --packages and --repositories. Without stating it for --jars, people expect a standard java classpath to work, with expansion and using a different delimiter than a comma. Currently this is only stated in the --help for spark-submit: "Comma-separated list of local jars to include on the driver and executor classpaths."
  Author: James Lohse <jimlohse@users.noreply.github.com>
  Closes #10890 from jimlohse/patch-1.
* [SPARK-13045] [SQL] Remove ColumnVector.Struct in favor of ColumnarBatch.Row
  Nong Li, 2016-01-27 (3 files changed, -120/+32)
  These two classes became identical as the implementation progressed.
  Author: Nong Li <nong@databricks.com>
  Closes #10952 from nongli/spark-13045.
* [HOTFIX] Fix Scala 2.11 compilation
  Andrew Or, 2016-01-27 (1 file changed, -1/+2)
  by explicitly marking annotated parameters as vals (SI-8813). Caused by #10835.
  Author: Andrew Or <andrew@databricks.com>
  Closes #10955 from andrewor14/fix-scala211.
* [SPARK-12865][SPARK-12866][SQL] Migrate SparkSQLParser/ExtendedHiveQlParser commands to new Parser
  Herman van Hovell, 2016-01-27 (16 files changed, -226/+161)
  This PR moves all the functionality provided by the SparkSQLParser/ExtendedHiveQlParser to the new Parser hierarchy (SparkQl/HiveQl). This also improves the current SET command parsing: the current implementation swallows `set role ...` and `set autocommit ...` commands, this PR respects these commands (and passes them on to Hive).
  This PR and https://github.com/apache/spark/pull/10723 end the use of Parser-Combinator parsers for SQL parsing. As a result we can also remove the `AbstractSQLParser` in Catalyst.
  The PR is marked WIP as long as it doesn't pass all tests.
  cc rxin viirya winningsix (this touches https://github.com/apache/spark/pull/10144)
  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #10905 from hvanhovell/SPARK-12866.
* [SPARK-12938][SQL] DataFrame API for Bloom filter
  Wenchen Fan, 2016-01-27 (7 files changed, -93/+306)
  This PR integrates the Bloom filter from spark-sketch into DataFrame. This version resorts to RDD.aggregate for building the filter. A more performant UDAF version can be built in future follow-up PRs.
  This PR also adds two specialized `put` versions (`putBinary` and `putLong`) to `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10937 from cloud-fan/bloom-filter.
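  A usage sketch of the resulting DataFrame-level API (method name and signature as found in later Spark releases; treat the details as assumptions rather than the exact surface introduced by this PR):
  ```scala
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").appName("bloom-filter-sketch").getOrCreate()
  import spark.implicits._

  val ids = (1L to 100000L).toDF("id")
  // Build a Bloom filter over "id": ~100k expected items, 3% target false-positive rate.
  val bf = ids.stat.bloomFilter("id", 100000L, 0.03)

  bf.mightContain(42L)  // true: every inserted item is reported as present
  bf.mightContain(-1L)  // usually false; false positives are possible, false negatives are not
  ```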
* [SPARK-13021][CORE] Fail fast when custom RDDs violate RDD.partition's API contract
  Josh Rosen, 2016-01-27 (2 files changed, -0/+25)
  Spark's `Partition` and `RDD.partitions` APIs have a contract which requires custom implementations of `RDD.partitions` to ensure that for all `x`, `rdd.partitions(x).index == x`; in other words, the `index` reported by a partition needs to match its position in the partitions array. If a custom RDD implementation violates this contract, then Spark has the potential to become stuck in an infinite recomputation loop when recomputing a subset of an RDD's partitions, since the tasks that are actually run will not correspond to the missing output partitions that triggered the recomputation. Here's a link to a notebook which demonstrates this problem: https://rawgit.com/JoshRosen/e520fb9a64c1c97ec985/raw/5e8a5aa8d2a18910a1607f0aa4190104adda3424/Violating%2520RDD.partitions%2520contract.html
  In order to guard against this infinite loop behavior, this patch modifies Spark so that it fails fast and refuses to compute RDDs whose `partitions` violate the API contract.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10932 from JoshRosen/SPARK-13021.
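  A minimal sketch of the kind of fail-fast check described above (illustrative; not the exact code added by this patch):
  ```scala
  import org.apache.spark.Partition

  // Every Partition must report an index equal to its position in the partitions array;
  // failing fast here avoids the infinite recomputation loop described above.
  def checkPartitionIndices(rddId: Int, partitions: Array[Partition]): Unit = {
    partitions.zipWithIndex.foreach { case (partition, expected) =>
      require(partition.index == expected,
        s"partitions($expected).index == ${partition.index} violates the RDD.partitions " +
          s"contract for RDD $rddId")
    }
  }
  ```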
* [SPARK-12895][SPARK-12896] Migrate TaskMetrics to accumulators
  Andrew Or, 2016-01-27 (70 files changed, -1141/+3012)
  The high level idea is that instead of having the executors send both accumulator updates and TaskMetrics, we should have them send only accumulator updates. This eliminates the need to maintain both code paths since one can be implemented in terms of the other. This effort is split into two parts:
  **SPARK-12895: Implement TaskMetrics using accumulators.** TaskMetrics is basically just a bunch of accumulable fields. This patch makes TaskMetrics a syntactic wrapper around a collection of accumulators so we don't need to send TaskMetrics from the executors to the driver.
  **SPARK-12896: Send only accumulator updates to the driver.** Now that TaskMetrics are expressed in terms of accumulators, we can capture all TaskMetrics values if we just send accumulator updates from the executors to the driver. This completes the parent issue SPARK-10620.
  While an effort has been made to preserve as much of the public API as possible, there were a few known breaking DeveloperApi changes that would be very awkward to maintain. I will gather the full list shortly and post it here.
  Note: This was once part of #10717. This patch is split out into its own patch from there to make it easier for others to review. Other smaller pieces have already been merged into master.
  Author: Andrew Or <andrew@databricks.com>
  Closes #10835 from andrewor14/task-metrics-use-accums.
* [SPARK-10847][SQL][PYSPARK] Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure
  Jason Lee, 2016-01-27 (2 files changed, -1/+13)
  The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works.
  Author: Jason Lee <cjlee@us.ibm.com>
  Closes #8969 from jasoncl/SPARK-10847.
* [SPARK-13023][PROJECT INFRA] Fix handling of root module in modules_to_test()
  Josh Rosen, 2016-01-27 (1 file changed, -5/+5)
  There's a minor bug in how we handle the `root` module in the `modules_to_test()` function in `dev/run-tests.py`: since `root` now depends on `build` (since every test needs to run on any build test), we now need to check for the presence of root in `modules_to_test` instead of `changed_modules`.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10933 from JoshRosen/build-module-fix.
* [SPARK-1680][DOCS] Explain environment variables for running on YARN in cluster mode
  Andrew, 2016-01-27 (1 file changed, -0/+2)
  JIRA 1680 added a property called spark.yarn.appMasterEnv. This PR draws users' attention to this special case by adding an explanation in configuration.html#environment-variables
  Author: Andrew <weiner.andrew.j@gmail.com>
  Closes #10869 from weineran/branch-yarn-docs.
* [SPARK-12983][CORE][DOC] Correct metrics.properties.template
  BenFradet, 2016-01-27 (1 file changed, -34/+37)
  There are some typos or plain unintelligible sentences in the metrics template.
  Author: BenFradet <benjamin.fradet@gmail.com>
  Closes #10902 from BenFradet/SPARK-12983.
* [SPARK-12780] Inconsistency returning value of ML python models' properties
  Xusen Yin, 2016-01-26 (1 file changed, -2/+3)
  https://issues.apache.org/jira/browse/SPARK-12780
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #10724 from yinxusen/SPARK-12780.
* [SPARK-12967][NETTY] Avoid NettyRpc error message during sparkContext shutdown
  Nishkam Ravi, 2016-01-26 (4 files changed, -5/+32)
  If there's an RPC issue while sparkContext is alive but stopped (which would happen only when executing SparkContext.stop), log a warning instead. This is a common occurrence. vanzin
  Author: Nishkam Ravi <nishkamravi@gmail.com>
  Author: nishkamravi2 <nishkamravi@gmail.com>
  Closes #10881 from nishkamravi2/master_netty.
* [SPARK-12728][SQL] Integrates SQL generation with native view
  Cheng Lian, 2016-01-26 (6 files changed, -95/+200)
  This PR is a follow-up of PR #10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical.
  In this PR, a new SQL option `spark.sql.nativeView.canonical` is added. When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from view definition logical plans. If we fail to map the plan to SQL, we fall back to the original native view approach.
  One important issue this PR fixes is that we can now use CTEs when defining a view. Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`. However, the HiveQL parser doesn't allow a CTE appearing as a subquery. Namely, something like this is disallowed:
  ```sql
  SELECT n
  FROM (
    WITH w AS (SELECT 1 AS n)
    SELECT * FROM w
  ) v
  ```
  This PR fixes this issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during the analysis phase, thus there won't be CTE expressions in the generated SQL query string).
  Author: Cheng Lian <lian@databricks.com>
  Author: Yin Huai <yhuai@databricks.com>
  Closes #10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.
* [SPARK-12935][SQL] DataFrame API for Count-Min Sketch
  Cheng Lian, 2016-01-26 (7 files changed, -37/+205)
  This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.
  Author: Cheng Lian <lian@databricks.com>
  Closes #10911 from liancheng/cms-df-api.
* [SPARK-12903][SPARKR] Add covar_samp and covar_pop for SparkR
  Yanbo Liang, 2016-01-26 (5 files changed, -2/+73)
  Add `covar_samp` and `covar_pop` for SparkR. Should we also provide a `cov` alias for `covar_samp`? There is already a `cov` implementation in stats.R which masks `stats::cov`, but adding the alias may introduce a breaking API change.
  cc sun-rui felixcheung shivaram
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #10829 from yanboliang/spark-12903.
* [SPARK-7780][MLLIB] intercept in LogisticRegressionWithLBFGS should not be regularized
  Holden Karau, 2016-01-26 (6 files changed, -28/+179)
  The intercept in Logistic Regression represents a prior on categories which should not be regularized. In MLlib, the regularization is handled through Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the implementation in ML from MLlib since the majority of users are still using the MLlib API.
  Note that both of them do feature scaling to improve convergence; the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.
  Previously partially reviewed at https://github.com/apache/spark/pull/6386#issuecomment-168781424 re-opening for dbtsai to review.
  Author: Holden Karau <holden@us.ibm.com>
  Author: Holden Karau <holden@pigscanfly.ca>
  Closes #10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.
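  For concreteness, the objective described above is ordinary L2-regularized logistic regression with the intercept excluded from the penalty (notation is editorial, not from the patch):
  ```latex
  \min_{\beta_0,\;\beta}\;\; -\frac{1}{n}\sum_{i=1}^{n}\log p\left(y_i \mid x_i;\,\beta_0,\beta\right)
  \;+\;\lambda\sum_{j=1}^{p}\beta_j^{2},
  \qquad \text{with the intercept } \beta_0 \text{ excluded from the penalty.}
  ```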
* [SPARK-12854][SQL] Implement complex types support in ColumnarBatch
  Nong Li, 2016-01-26 (16 files changed, -90/+1671)
  This patch adds support for complex types in ColumnarBatch. ColumnarBatch supports structs and arrays. There is a simple mapping between the richer Catalyst types and these two. Strings are treated as an array of bytes.
  ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist of just leaf nodes. Structs represent an internal node with one child for each field. Arrays are internal nodes with one child. Structs just contain nullability. Arrays contain offsets and lengths into the child array. This structure is able to handle arbitrary nesting. It has the key property that we maintain a columnar layout throughout and that primitive types are only stored in the leaf nodes and contiguous across rows. For example, if the schema is
  ```
  array<array<int>>
  ```
  there are three columns in the schema. The internal nodes each have one child. The leaf node contains all the int data stored consecutively.
  As part of this, this patch adds append APIs in addition to the put APIs (e.g. putLong(rowid, v) vs appendLong(v)). These APIs are necessary when the batch contains variable length elements. The vectors are not fixed length and will grow as necessary. This should make the usage a lot simpler for the writer.
  Author: Nong Li <nong@databricks.com>
  Closes #10820 from nongli/spark-12854.
* [SPARK-11622][MLLIB] Make LibSVMRelation extend HadoopFsRelation and add LibSVMOutputWriter
  Jeff Zhang, 2016-01-26 (3 files changed, -16/+113)
  The behavior of LibSVMRelation is not changed except for adding LibSVMOutputWriter:
  - Partitioning is still not supported
  - Multiple input paths are not supported
  Author: Jeff Zhang <zjffdu@apache.org>
  Closes #9595 from zjffdu/SPARK-11622.
* [SPARK-12614][CORE] Don't throw non fatal exception from ask
  Shixiong Zhu, 2016-01-26 (1 file changed, -25/+29)
  Right now RpcEndpointRef.ask may throw an exception in some corner cases, such as calling ask after stopping RpcEnv. It's better to avoid throwing exceptions from RpcEndpointRef.ask; we can send the exception to the future returned by `ask` instead.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #10568 from zsxwing/send-ask-fail.
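  A sketch of the pattern only (not the actual `RpcEndpointRef` code): route non-fatal failures into the returned `Future` instead of throwing. The success path, which completes the promise when the reply arrives, is omitted here:
  ```scala
  import scala.concurrent.{Future, Promise}
  import scala.util.control.NonFatal

  def ask(send: () => Unit): Future[Any] = {
    val promise = Promise[Any]()
    try {
      send()  // e.g. enqueue the message; may fail if the RpcEnv has been stopped
    } catch {
      case NonFatal(e) => promise.tryFailure(e)  // surface the error through the Future
    }
    promise.future  // the reply handler (not shown) would complete this on success
  }
  ```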
* [SPARK-10509][PYSPARK] Reduce excessive param boiler plate code
  Holden Karau, 2016-01-26 (12 files changed, -317/+43)
  The current Python ML params require cut-and-pasting the param setup and description between the class & `__init__` methods. Remove this possible source of errors & simplify the use of custom params by adding a `_copy_new_parent` method to Param so as to avoid cut and pasting (and cut and pasting at different indentation levels, urgh).
  Author: Holden Karau <holden@us.ibm.com>
  Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
* [SPARK-12993][PYSPARK] Remove usage of ADD_FILES in pyspark
  Jeff Zhang, 2016-01-26 (1 file changed, -10/+1)
  The environment variable ADD_FILES was created for adding python files on the spark context to be distributed to executors (SPARK-865); this is deprecated now. Users are encouraged to use --py-files for adding python files.
  Author: Jeff Zhang <zjffdu@apache.org>
  Closes #10913 from zjffdu/SPARK-12993.
* [SQL] Minor Scaladoc format fix
  Cheng Lian, 2016-01-26 (1 file changed, -4/+4)
  Otherwise the `^` character is always marked as error in IntelliJ since it represents an unclosed superscript markup tag.
  Author: Cheng Lian <lian@databricks.com>
  Closes #10926 from liancheng/agg-doc-fix.
* [SPARK-8725][PROJECT-INFRA] Test modules in topologically-sorted order in dev/run-tests
  Josh Rosen, 2016-01-26 (4 files changed, -18/+162)
  This patch improves our `dev/run-tests` script to test modules in a topologically-sorted order based on modules' dependencies. This will help to ensure that bugs in upstream projects are not misattributed to downstream projects because those projects' tests were the first ones to exhibit the failure. Topological sorting is also useful for shortening the feedback loop when testing pull requests: if I make a change in SQL then the SQL tests should run before MLlib, not after.
  In addition, this patch also updates our test module definitions to split `sql` into `catalyst`, `sql`, and `hive` in order to allow more tests to be skipped when changing only `hive/` files.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10885 from JoshRosen/SPARK-8725.
* [SPARK-12952] EMLDAOptimizer initialize() should return EMLDAOptimizer rather than its parent class
  Xusen Yin, 2016-01-26 (1 file changed, -1/+3)
  https://issues.apache.org/jira/browse/SPARK-12952
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #10863 from yinxusen/SPARK-12952.
* [SPARK-11923][ML] Python API for ml.feature.ChiSqSelector
  Xusen Yin, 2016-01-26 (1 file changed, -1/+97)
  https://issues.apache.org/jira/browse/SPARK-11923
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #10186 from yinxusen/SPARK-11923.
* [SPARK-7799][STREAMING][DOCUMENT] Add the linking and deploying instructions for streaming-akka project
  Shixiong Zhu, 2016-01-26 (1 file changed, -37/+44)
  Since `actorStream` is an external project, we should add the linking and deploying instructions for it. A follow up PR of #10744
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #10856 from zsxwing/akka-link-instruction.
* [SPARK-12682][SQL] Add support for (optionally) not storing tables in hive metadata format
  Sameer Agarwal, 2016-01-26 (2 files changed, -0/+39)
  This PR adds a new table option (`skip_hive_metadata`) that'd allow the user to skip storing the table metadata in hive metadata format. While this could be useful in general, the specific use-case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024) which in turn prevents such tables from being queried in SparkSQL.
  Author: Sameer Agarwal <sameer@databricks.com>
  Closes #10826 from sameeragarwal/skip-hive-metadata.
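  A hypothetical usage sketch: only the option name `skip_hive_metadata` comes from the commit message above; how it is passed through (writer options here, rather than a DDL `OPTIONS` clause) is an assumption of this sketch:
  ```scala
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").appName("skip-hive-metadata")
    .enableHiveSupport().getOrCreate()
  import spark.implicits._

  val wide = Seq((1, "a"), (2, "b")).toDF("id", "c0")  // stand-in for a very wide DataFrame
  wide.write
    .format("parquet")
    .option("skip_hive_metadata", "true")  // skip storing the schema in Hive-compatible metadata
    .saveAsTable("wide_table")
  ```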
* [SPARK-10911] Executors should System.exit on clean shutdown.
  zhuol, 2016-01-26 (1 file changed, -0/+1)
  Call System.exit explicitly to make sure non-daemon user threads terminate. Without this, user applications might live forever if the cluster manager does not appropriately kill them. E.g., YARN had this bug: HADOOP-12441.
  Author: zhuol <zhuol@yahoo-inc.com>
  Closes #9946 from zhuoliu/10911.
* [SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
  Sean Owen, 2016-01-26 (30 files changed, -110/+146)
  Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.
  CC rxin pwendell for API change; tdas since it also touches streaming.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #10413 from srowen/SPARK-3369.