* [SPARK-9316] [SPARKR] Add support for filtering using `[` (synonym for filter / select)
  felixcheung | 2015-08-25 | 2 files changed, -1/+48

  Add support for:

  ```
  df[df$name == "Smith", c(1,2)]
  df[df$age %in% c(19, 30), 1:2]
  ```

  shivaram

  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #8394 from felixcheung/rsubset.
* [SPARK-10236] [MLLIB] update since versions in mllib.feature
  Xiangrui Meng | 2015-08-25 | 8 files changed, -16/+21

  Same as #8421 but for `mllib.feature`. cc dbtsai

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8449 from mengxr/SPARK-10236.feature and squashes the following commits:
  0e8d658 [Xiangrui Meng] remove unnecessary comment
  ad70b03 [Xiangrui Meng] update since versions in mllib.feature
* [SPARK-10235] [MLLIB] update since versions in mllib.regression
  Xiangrui Meng | 2015-08-25 | 8 files changed, -29/+47

  Same as #8421 but for `mllib.regression`. cc freeman-lab dbtsai

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8426 from mengxr/SPARK-10235 and squashes the following commits:
  6cd28e4 [Xiangrui Meng] update since versions in mllib.regression
* [SPARK-10243] [MLLIB] update since versions in mllib.tree
  Xiangrui Meng | 2015-08-25 | 12 files changed, -44/+57

  Same as #8421 but for `mllib.tree`. cc jkbradley

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8442 from mengxr/SPARK-10236.
* [SPARK-10234] [MLLIB] update since versions in mllib.clustering
  Xiangrui Meng | 2015-08-25 | 7 files changed, -23/+44

  Same as #8421 but for `mllib.clustering`. cc feynmanliang yu-iskw

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8435 from mengxr/SPARK-10234.
* [SPARK-10240] [SPARK-10242] [MLLIB] update since versions in mllib.random and mllib.stat
  Xiangrui Meng | 2015-08-25 | 4 files changed, -25/+117

  Same as #8421 but for `mllib.stat` and `mllib.random`. cc feynmanliang

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8439 from mengxr/SPARK-10242.
* [SPARK-10238] [MLLIB] update since versions in mllib.linalg
  Xiangrui Meng | 2015-08-25 | 8 files changed, -31/+64

  Same as #8421 but for `mllib.linalg`. cc dbtsai

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8440 from mengxr/SPARK-10238 and squashes the following commits:
  b38437e [Xiangrui Meng] update since versions in mllib.linalg
* [SPARK-10233] [MLLIB] update since versions in mllib.evaluation
  Xiangrui Meng | 2015-08-25 | 4 files changed, -7/+27

  Same as #8421 but for `mllib.evaluation`. cc avulanov

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8423 from mengxr/SPARK-10233.
* [SPARK-9888] [MLLIB] User guide for new LDA features
  Feynman Liang | 2015-08-25 | 3 files changed, -20/+117

  * Adds two new sections to LDA's user guide, one for each optimizer/model
  * Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparam optimization)
  * Cleans up a TODO and sets a default parameter in LDA code

  jkbradley hhbyyh

  Author: Feynman Liang <fliang@databricks.com>
  Closes #8254 from feynmanliang/SPARK-9888.
* [SPARK-10215] [SQL] Fix precision of division (follow the rule in Hive)
  Davies Liu | 2015-08-25 | 4 files changed, -13/+39

  Follow the rule in Hive for decimal division; see
  https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDivide.java#L113

  cc chenghao-intel

  Author: Davies Liu <davies@databricks.com>
  Closes #8415 from davies/decimal_div2.
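  For reference, a hedged sketch of the Hive division rule linked above: for `e1(p1, s1) / e2(p2, s2)` the result scale is `max(6, s1 + p2 + 1)` and the result precision is `p1 - s1 + s2 + scale`. The capping of precision at the 38-digit maximum is omitted here; check GenericUDFOPDivide for the exact behavior.

  ```scala
  object DecimalDivisionRule {
    // Result (precision, scale) of DECIMAL(p1, s1) / DECIMAL(p2, s2),
    // following the Hive rule; 38-digit overflow handling is omitted.
    def resultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
      val scale = math.max(6, s1 + p2 + 1)
      val precision = p1 - s1 + s2 + scale
      (precision, scale)
    }

    def main(args: Array[String]): Unit = {
      // DECIMAL(10, 2) / DECIMAL(5, 3) -> (19, 8) under this rule
      println(resultType(10, 2, 5, 3))
    }
  }
  ```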
* [SPARK-10245] [SQL] Fix decimal literals with precision < scale
  Davies Liu | 2015-08-25 | 3 files changed, -6/+19

  In `BigDecimal` or `java.math.BigDecimal`, the precision can be smaller than the scale; for example, `BigDecimal("0.001")` has precision 1 and scale 3. But `DecimalType` requires that the precision be at least the scale, so we should use the maximum of precision and scale when inferring the schema from a decimal literal.

  Author: Davies Liu <davies@databricks.com>
  Closes #8428 from davies/smaller_decimal.
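  A self-contained illustration of the mismatch described above, using the plain JDK `BigDecimal`; the `max`-based inference mirrors the described fix, and the `DecimalType(3, 3)` outcome is an assumption about the result, not quoted from the patch:

  ```scala
  import java.math.BigDecimal

  object DecimalLiteralInference {
    def main(args: Array[String]): Unit = {
      val d = new BigDecimal("0.001")
      println(d.precision) // 1: a single significant digit
      println(d.scale)     // 3: three digits after the decimal point
      // The inferred precision must be able to cover the scale:
      val inferredPrecision = math.max(d.precision, d.scale)
      println(inferredPrecision) // 3, i.e. something like DecimalType(3, 3)
    }
  }
  ```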
* [SPARK-10239] [SPARK-10244] [MLLIB] update since versions in mllib.pmml and mllib.util
  Xiangrui Meng | 2015-08-25 | 9 files changed, -11/+41

  Same as #8421 but for `mllib.pmml` and `mllib.util`. cc dbtsai

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8430 from mengxr/SPARK-10239 and squashes the following commits:
  a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util
* [SPARK-9797] [MLLIB] [DOC] StreamingLinearRegressionWithSGD.setConvergenceTol default value
  Feynman Liang | 2015-08-25 | 1 file changed, -1/+1

  Adds the default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc.

  Author: Feynman Liang <fliang@databricks.com>
  Closes #8424 from feynmanliang/SPARK-9797.
* [SPARK-10237] [MLLIB] update since versions in mllib.fpm
  Xiangrui Meng | 2015-08-25 | 3 files changed, -7/+32

  Same as #8421 but for `mllib.fpm`. cc feynmanliang

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8429 from mengxr/SPARK-10237.
* [SPARK-9800] Adds docs for GradientDescent$.runMiniBatchSGD alias
  Feynman Liang | 2015-08-25 | 1 file changed, -1/+4

  * Adds doc for the alias of runMiniBatchSGD, documenting the default value for convergenceTol
  * Cleans up a note in code

  Author: Feynman Liang <fliang@databricks.com>
  Closes #8425 from feynmanliang/SPARK-9800.
* [SPARK-10048] [SPARKR] Support arbitrary nested Java array in serde.
  Sun Rui | 2015-08-25 | 8 files changed, -127/+216

  This PR:
  1. supports transferring arbitrary nested arrays from the JVM to the R side in SerDe;
  2. based on 1, improves the collect() implementation. It can now collect data of complex types from a DataFrame.

  Author: Sun Rui <rui.sun@intel.com>
  Closes #8276 from sun-rui/SPARK-10048.
* [SPARK-10231] [MLLIB] update @Since annotation for mllib.classification
  Xiangrui Meng | 2015-08-25 | 5 files changed, -21/+58

  Update `Since` annotation in `mllib.classification`:
  1. add version to classes, objects, constructors, and public variables declared in constructors
  2. correct some versions
  3. remove `Since` on `toString`

  MechCoder dbtsai

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8421 from mengxr/SPARK-10231 and squashes the following commits:
  b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification
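  To make the placement rules concrete, a hedged model of the annotation pattern (Spark's real annotation is `org.apache.spark.annotation.Since`; the class and members below are illustrative, not from the patch):

  ```scala
  // Stand-in for org.apache.spark.annotation.Since
  class Since(version: String) extends scala.annotation.StaticAnnotation

  @Since("1.0.0")                                 // version the class first appeared in
  class ExampleModel @Since("1.3.0") (            // version of this constructor
      @Since("1.0.0") val weights: Array[Double], // public constructor vals are annotated too
      @Since("1.0.0") val intercept: Double) {
    // No @Since on toString: Object overrides carry no API version.
    override def toString: String = s"ExampleModel(intercept=$intercept)"
  }
  ```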
* [SPARK-10230] [MLLIB] Rename optimizeAlpha to optimizeDocConcentration
  Feynman Liang | 2015-08-25 | 2 files changed, -9/+9

  See the [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770). CC jkbradley

  Author: Feynman Liang <fliang@databricks.com>
  Closes #8422 from feynmanliang/SPARK-10230.
* [SPARK-8531] [ML] Update ML user guide for MinMaxScaler
  Yuhao Yang | 2015-08-25 | 1 file changed, -0/+71

  jira: https://issues.apache.org/jira/browse/SPARK-8531
  Update the ML user guide for MinMaxScaler.

  Author: Yuhao Yang <hhbyyh@gmail.com>
  Author: unknown <yuhaoyan@yuhaoyan-MOBL1.ccr.corp.intel.com>
  Closes #7211 from hhbyyh/minmaxdoc.
* [SPARK-10198] [SQL] Turn off partition verification by default
  Michael Armbrust | 2015-08-25 | 2 files changed, -31/+35

  Author: Michael Armbrust <michael@databricks.com>
  Closes #8404 from marmbrus/turnOffPartitionVerification.
* [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters
  Sean Owen | 2015-08-25 | 171 files changed, -880/+863

  Replace `JavaConversions` implicits with `JavaConverters`.

  Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.

  Author: Sean Owen <sowen@cloudera.com>
  Closes #8033 from srowen/SPARK-9613.
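  The mechanical shape of this migration, as a hedged, self-contained sketch: `JavaConverters` makes every Java/Scala collection crossing an explicit `.asScala`/`.asJava` call instead of an invisible implicit conversion.

  ```scala
  import scala.collection.JavaConverters._

  object ConvertersDemo {
    def main(args: Array[String]): Unit = {
      val javaList = new java.util.ArrayList[String]()
      javaList.add("spark")

      // Before: `import scala.collection.JavaConversions._` converted silently.
      // After: the crossing is explicit and greppable.
      val scalaBuffer: scala.collection.mutable.Buffer[String] = javaList.asScala
      val backToJava: java.util.List[String] = scalaBuffer.asJava

      println(scalaBuffer.head) // spark
      println(backToJava.size)  // 1
    }
  }
  ```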
* Fixed a typo in DAGScheduler.
  ehnalis | 2015-08-25 | 1 file changed, -7/+20

  Author: ehnalis <zoltan.zvara@gmail.com>
  Closes #8308 from ehnalis/master.
* [DOC] add missing parameters in SparkContext.scala for scala doc
  Zhang, Liye | 2015-08-25 | 1 file changed, -1/+14

  Author: Zhang, Liye <liye.zhang@intel.com>
  Closes #8412 from liyezhang556520/minorDoc.
* [SPARK-10197] [SQL] Add null check in wrapperFor (inside HiveInspectors).
  Yin Huai | 2015-08-25 | 2 files changed, -5/+53

  https://issues.apache.org/jira/browse/SPARK-10197

  Author: Yin Huai <yhuai@databricks.com>
  Closes #8407 from yhuai/ORCSPARK-10197.
* [SPARK-10195] [SQL] Data sources Filter should not expose internal types
  Josh Rosen | 2015-08-25 | 4 files changed, -41/+54

  Spark SQL's data sources API exposes Catalyst's internal types through its Filter interfaces. This is a problem because types like UTF8String are not stable developer APIs and should not be exposed to third parties. This issue caused incompatibilities when upgrading our `spark-redshift` library to work against Spark 1.5.0. To avoid these issues in the future we should only expose public types through these Filter objects. This patch accomplishes that by using CatalystTypeConverters to add the appropriate conversions.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #8403 from JoshRosen/datasources-internal-vs-external-types.
* [SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive
  Davies Liu | 2015-08-25 | 3 files changed, -8/+14

  We misunderstood the Julian days and nanoseconds-of-day in Parquet timestamps (TimestampType) from Hive/Impala: the two fields overlap, so they can't simply be added together. To avoid confusing rounding during the conversion, we use `2440588` as the Julian Day of the Unix epoch (which is actually 2440587.5, since Julian days begin at noon).

  Author: Davies Liu <davies@databricks.com>
  Author: Cheng Lian <lian@databricks.com>
  Closes #8400 from davies/timestamp_parquet.
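  A hedged sketch of the resulting arithmetic (mirroring the idea, not Spark's exact code): Parquet's INT96 timestamps carry a Julian day number plus nanoseconds since that day's midnight, so anchoring the epoch at day 2440588 lets the two parts be summed without a half-day correction.

  ```scala
  object JulianTimestamp {
    // Julian Day number whose midnight is the Unix epoch; the true epoch
    // instant is JD 2440587.5 because Julian days start at noon.
    val JulianDayOfEpoch = 2440588L
    val MicrosPerDay = 86400L * 1000 * 1000

    // (julianDay, nanosSinceMidnight) -> microseconds since the Unix epoch
    def toEpochMicros(julianDay: Int, nanosOfDay: Long): Long =
      (julianDay - JulianDayOfEpoch) * MicrosPerDay + nanosOfDay / 1000

    def main(args: Array[String]): Unit = {
      println(toEpochMicros(2440588, 0L)) // 0: the epoch itself
    }
  }
  ```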
* [SPARK-10210] [STREAMING] Filter out non-existent blocks before creating BlockRDD
  Tathagata Das | 2015-08-25 | 3 files changed, -2/+166

  When the write ahead log is not enabled, a recovered streaming driver still tries to run jobs using pre-failure block ids, and fails because those blocks no longer exist in memory (and cannot be recovered, as the receiver WAL is not enabled). This occurs because the driver-side WAL of ReceivedBlockTracker recovers that past block information, and ReceiverInputDStream creates BlockRDDs even if those blocks do not exist.

  The solution in this PR is to filter out block ids that do not exist before creating the BlockRDD. In addition, it adds unit tests to verify other logic in ReceiverInputDStream.

  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #8405 from tdas/SPARK-10210.
* [SPARK-6196] [BUILD] Remove MapR profiles in favor of hadoop-provided
  Sean Owen | 2015-08-25 | 1 file changed, -38/+0

  Follow up to https://github.com/apache/spark/pull/7047. pwendell mentioned that MapR should use `hadoop-provided` now, and indeed the new build script does not produce `mapr3`/`mapr4` artifacts anymore. Hence the action seems to be to remove the profiles, which are now not used. CC trystanleftwich

  Author: Sean Owen <sowen@cloudera.com>
  Closes #8338 from srowen/SPARK-6196.
* [SPARK-10214] [SPARKR] [DOCS] Improve SparkR Column, DataFrame API docs
  Yu ISHIKAWA | 2015-08-25 | 3 files changed, -34/+109

  cc: shivaram

  ## Summary
  - Add name tags to each method in DataFrame.R and column.R
  - Replace `rdname column` with `rdname {each_func}`; e.g. for the alias method: `rdname column` => `rdname alias`

  ## Generated PDF File
  https://drive.google.com/file/d/0B9biIZIU47lLNHN2aFpnQXlSeGs/view?usp=sharing

  ## JIRA
  [[SPARK-10214] Improve SparkR Column, DataFrame API docs - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10214)

  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8414 from yu-iskw/SPARK-10214.
* [SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns
  Josh Rosen | 2015-08-25 | 6 files changed, -32/+48

  This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions.

  I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #7631 from JoshRosen/SPARK-9293.
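  The invariant the new rule enforces, as a hedged stand-alone sketch (the names are illustrative, not Catalyst's):

  ```scala
  object SetOperationCheck {
    // Minimal model: a relation is just its list of output column names.
    final case class Relation(output: Seq[String])

    // Analysis-time invariant: UNION / INTERSECT / EXCEPT need equal arity.
    def checkSetOperation(op: String, left: Relation, right: Relation): Unit =
      require(
        left.output.length == right.output.length,
        s"$op can only be performed on tables with the same number of columns, " +
          s"but got ${left.output.length} and ${right.output.length}")

    def main(args: Array[String]): Unit = {
      checkSetOperation("UNION", Relation(Seq("a", "b")), Relation(Seq("x", "y"))) // ok
      // checkSetOperation("UNION", Relation(Seq("a")), Relation(Seq("x", "y")))   // would throw
    }
  }
  ```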
* [SPARK-10136] [SQL] A more robust fix for SPARK-10136
  Cheng Lian | 2015-08-25 | 1 file changed, -10/+8

  PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires the audience to be pretty familiar with the parquet-format spec, especially the details of the `LIST` backwards-compatibility rules. Let me have a try at an explanation here.

  The structure of the problematic Parquet schema generated by parquet-avro is something like this:

  ```
  message m {
    <repetition> group f (LIST) {         // Level 1
      repeated group array (LIST) {       // Level 2
        repeated <primitive-type> array;  // Level 3
      }
    }
  }
  ```

  (The schema generated by parquet-thrift is structurally similar; just replace the `array` at level 2 with `f_tuple`, and the one at level 3 with `f_tuple_tuple`.)

  This structure consists of two nested legacy 2-level `LIST`-like structures:

  1. The repeated group type at level 2 is the element type of the outer array defined at level 1. This group should map to a `CatalystArrayConverter.ElementConverter` when building converters.
  2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2. This type should also map to a `CatalystArrayConverter.ElementConverter`.

  The root cause of SPARK-10136 is that the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to the parquet-format spec, the repeated primitive at level 3 is left as a so-called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, so a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it.

  According to the parquet-format spec, an unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated types to appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341, though.)

  As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec:

  > If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required.

  (The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.)

  This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2]. This PR delivers a more robust fix by adding this rule to the latter method.

  Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3].

  [1]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305
  [2]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463
  [3]: https://issues.apache.org/jira/browse/PARQUET-364

  Author: Cheng Lian <lian@databricks.com>
  Closes #8361 from liancheng/spark-10136/proper-version.
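  A hedged, self-contained rendering of the backwards-compatibility rule quoted above, over a toy schema model (the real checks live in `CatalystSchemaConverter.isElementType` and `CatalystRowConverter.isElementType`; names here are illustrative):

  ```scala
  object ListCompatRule {
    // Toy model of a Parquet type: primitives have no children.
    sealed trait PType { def name: String }
    final case class Primitive(name: String) extends PType
    final case class Group(name: String, fields: Seq[PType]) extends PType

    // Per the parquet-format LIST rules: the repeated field under a
    // LIST-annotated group IS the element type when it is a primitive,
    // a multi-field group, or a single-field group named "array"
    // (parquet-avro) or "<listName>_tuple" (parquet-thrift). Otherwise
    // the repeated group is a legacy 2-level wrapper around the element.
    def isElementType(repeated: PType, listName: String): Boolean = repeated match {
      case Primitive(_)                                            => true
      case Group(_, fields) if fields.size > 1                     => true
      case Group(n, _) if n == "array" || n == listName + "_tuple" => true
      case _                                                       => false
    }

    def main(args: Array[String]): Unit = {
      // The level-2 group from the problematic schema: one field, named "array".
      val level2 = Group("array", Seq(Primitive("array")))
      println(isElementType(level2, "f")) // true: it is the element, not a wrapper
    }
  }
  ```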
* [SPARK-10196] [SQL] Correctly saving decimals in internal rows to JSON.
  Yin Huai | 2015-08-24 | 2 files changed, -1/+28

  https://issues.apache.org/jira/browse/SPARK-10196

  Author: Yin Huai <yhuai@databricks.com>
  Closes #8408 from yhuai/DecimalJsonSPARK-10196.
* [SPARK-10137] [STREAMING] Avoid restarting receivers if scheduleReceivers returns balanced results
  zsxwing | 2015-08-24 | 3 files changed, -57/+120

  This PR fixes the following cases for `ReceiverSchedulingPolicy`.

  1) Assume there are 4 executors (host1, host2, host3, host4) and 5 receivers (r1, r2, r3, r4, r5). Then `ReceiverSchedulingPolicy.scheduleReceivers` will return (r1 -> host1, r2 -> host2, r3 -> host3, r4 -> host4, r5 -> host1). Assume r1 starts first on `host1`, as `scheduleReceivers` suggested, and tries to register with the ReceiverTracker. But the previous `ReceiverSchedulingPolicy.rescheduleReceiver` will return (host2, host3, host4) according to the current executor weights (host1 -> 1.0, host2 -> 0.5, host3 -> 0.5, host4 -> 0.5), so the ReceiverTracker will reject `r1`. This is unexpected, since r1 is starting exactly where `scheduleReceivers` suggested. This case is fixed by ignoring the information of the receiver that is rescheduling in `receiverTrackingInfoMap`.

  2) Assume there are 3 executors (host1, host2, host3), each executor has 3 cores, and there are 3 receivers (r1, r2, r3). Assume r1 is running on host1. Now r2 is restarting: the previous `ReceiverSchedulingPolicy.rescheduleReceiver` will always return (host1, host2, host3), so it's possible that r2 will be scheduled to host1 by the TaskScheduler. r3 is similar. In the end it's possible that all 3 receivers run on host1 while host2 and host3 are idle. This issue is fixed by returning only executors that have the minimum weight, rather than returning at least 3 executors.

  Author: zsxwing <zsxwing@gmail.com>
  Closes #8340 from zsxwing/fix-receiver-scheduling.
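  The second fix, sketched as a hedged stand-alone function (illustrative names, not the real `ReceiverSchedulingPolicy` code): return only the executors carrying the minimum weight, rather than padding the candidate list.

  ```scala
  object LeastLoadedExecutors {
    // Weight = how many receivers an executor is already expected to host
    // (fractional for receivers that are not yet pinned to one executor).
    def leastLoaded(weights: Map[String, Double]): Seq[String] =
      if (weights.isEmpty) Seq.empty
      else {
        val min = weights.values.min
        weights.collect { case (host, w) if w == min => host }.toSeq
      }

    def main(args: Array[String]): Unit = {
      // host1 already runs a receiver, so only the idle hosts come back.
      println(leastLoaded(Map("host1" -> 1.0, "host2" -> 0.0, "host3" -> 0.0)))
      // List(host2, host3)
    }
  }
  ```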
* [SPARK-9786] [STREAMING] [KAFKA] fix backpressure so it works with default maxRatePerPartition setting of 0
  cody koeninger | 2015-08-24 | 1 file changed, -2/+7

  Author: cody koeninger <cody@koeninger.org>
  Closes #8413 from koeninger/backpressure-testing-master.
* [SPARK-10178] [SQL] HiveComparisionTest should print out dependent tables
  Michael Armbrust | 2015-08-24 | 1 file changed, -0/+36

  In `HiveComparisionTest`s it is possible to fail a query of the form `SELECT * FROM dest1`, where `dest1` is the query that is actually computing the incorrect results. To aid debugging, this patch improves the harness to also print these query plans and their results.

  Author: Michael Armbrust <michael@databricks.com>
  Closes #8388 from marmbrus/generatedTables.
* [SPARK-10121] [SQL] Thrift server always use the latest class loader provided by the conf of executionHive's state
  Yin Huai | 2015-08-25 | 2 files changed, -0/+60

  https://issues.apache.org/jira/browse/SPARK-10121

  Looks like the problem is that if we add a jar through another thread, the thread handling the JDBC session will not get the latest classloader.

  Author: Yin Huai <yhuai@databricks.com>
  Closes #8368 from yhuai/SPARK-10121.
* [SQL] [MINOR] [DOC] Clarify docs for inferring DataFrame from RDD of Products
  Feynman Liang | 2015-08-24 | 2 files changed, -2/+2

  * Makes the `SQLImplicits.rddToDataFrameHolder` scaladoc consistent with `SQLContext.createDataFrame[A <: Product](rdd: RDD[A])`, since the former is essentially a wrapper for the latter
  * Clarifies the `createDataFrame[A <: Product]` scaladoc to apply to any `RDD[Product]`, not just case classes

  Author: Feynman Liang <fliang@databricks.com>
  Closes #8406 from feynmanliang/sql-doc-fixes.
* [SPARK-10118] [SPARKR] [DOCS] Improve SparkR API docs for 1.5 release
  Yu ISHIKAWA | 2015-08-24 | 4 files changed, -227/+1595

  cc: shivaram

  ## Summary
  - Modify the `rdname` of expression functions, e.g. for `ascii`: `rdname functions` => `rdname ascii`
  - Replace the dynamic function definitions with static ones for the sake of their documentation

  ## Generated PDF File
  https://drive.google.com/file/d/0B9biIZIU47lLX2t6ZjRoRnBTSEU/view?usp=sharing

  ## JIRA
  [[SPARK-10118] Improve SparkR API docs for 1.5 release - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10118)

  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8386 from yu-iskw/SPARK-10118.
* [SPARK-10165] [SQL] Await child resolution in ResolveFunctions
  Michael Armbrust | 2015-08-24 | 2 files changed, -44/+77

  Currently, we eagerly attempt to resolve functions, even before their children are resolved. However, this is not valid in cases where we need to know the types of the input arguments (i.e. when resolving Hive UDFs). As a fix, this PR delays function resolution until the function's children are resolved.

  This change also necessitates a change to the way we resolve aggregate expressions that are not in aggregate operators (e.g., in `HAVING` or `ORDER BY` clauses). Specifically, we can no longer assume that these misplaced functions will be resolved, which is what allowed us to differentiate aggregate functions from normal functions. To compensate, we now attempt to resolve these unresolved expressions in the context of the aggregate operator before checking to see if any aggregate expressions are present.

  Author: Michael Armbrust <michael@databricks.com>
  Closes #8371 from marmbrus/hiveUDFResolution.
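  A toy model of the ordering constraint described above (hedged; Catalyst's real `ResolveFunctions` rule operates on logical plans, not this simplified AST):

  ```scala
  object ResolutionOrder {
    sealed trait Expr { def resolved: Boolean }
    final case class Literal(value: Any) extends Expr { val resolved = true }
    final case class UnresolvedAttr(name: String) extends Expr { val resolved = false }
    final case class Func(name: String, children: Seq[Expr]) extends Expr {
      val resolved = false
    }

    // The fix: only attempt to look up / type-check a function once every
    // child is resolved, since the lookup may need the children's types.
    def readyToResolve(f: Func): Boolean = f.children.forall(_.resolved)

    def main(args: Array[String]): Unit = {
      println(readyToResolve(Func("upper", Seq(Literal("a")))))          // true
      println(readyToResolve(Func("upper", Seq(UnresolvedAttr("col"))))) // false
    }
  }
  ```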
* [SPARK-10190] Fix NPE in CatalystTypeConverters Decimal toScala converter
  Josh Rosen | 2015-08-24 | 2 files changed, -2/+7

  This adds a missing null check to the Decimal `toScala` converter in `CatalystTypeConverters`, fixing an NPE.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #8401 from JoshRosen/SPARK-10190.
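  The shape of the fix, as a hedged generic sketch (the actual change adds the guard inside the Decimal converter in `CatalystTypeConverters`; everything below is illustrative):

  ```scala
  object NullSafeConverter {
    // A Catalyst-to-Scala converter must pass null through rather than
    // dereferencing it; SPARK-10190 was a missing check of this form.
    def toScala[A <: AnyRef, B >: Null](convert: A => B)(catalystValue: A): B =
      if (catalystValue == null) null else convert(catalystValue)

    def main(args: Array[String]): Unit = {
      val toBigDecimal =
        toScala((s: String) => BigDecimal(new java.math.BigDecimal(s))) _
      println(toBigDecimal("1.23")) // 1.23
      println(toBigDecimal(null))   // null, instead of an NPE
    }
  }
  ```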
* [SPARK-10061] [DOC] ML ensemble docs
  Joseph K. Bradley | 2015-08-24 | 2 files changed, -51/+976

  User guide for spark.ml GBTs and Random Forests. The examples are copied from the decision tree guide and modified to run. I caught some issues I had somehow missed in the tree guide as well. I have run all examples, including the Java ones. (Of course, I thought I had previously as well...)

  CC: mengxr manishamde yanboliang

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #8369 from jkbradley/ml-ensemble-docs.
* [SPARK-9758] [TEST] [SQL] Compilation issue for hive test / wrong package?
  Sean Owen | 2015-08-24 | 10 files changed, -9/+6

  Move `test.org.apache.spark.sql.hive` package tests to the apparently intended `org.apache.spark.sql.hive`, as they don't intend to test behavior from outside org.apache.spark.*.

  Alternate take, per the discussion at https://github.com/apache/spark/pull/8051. I think this is what vanzin and I had in mind, but also CC rxin to cross-check, as this does indeed depend on whether these tests were accidentally in this package or not. Testing from a `test.org.apache.spark` package is legitimate but didn't seem to be the intent here.

  Author: Sean Owen <sowen@cloudera.com>
  Closes #8307 from srowen/SPARK-9758.
* [SPARK-8580] [SQL] Refactors ParquetHiveCompatibilitySuite and adds more test cases
  Cheng Lian | 2015-08-24 | 1 file changed, -39/+93

  This PR refactors `ParquetHiveCompatibilitySuite` so that it's easier to add new test cases. Hit two bugs, SPARK-10177 and HIVE-11625, while working on this; added test cases for them and marked them as ignored for now. SPARK-10177 will be addressed in a separate PR.

  Author: Cheng Lian <lian@databricks.com>
  Closes #8392 from liancheng/spark-8580/parquet-hive-compat-tests.
* [SPARK-10144] [UI] Actually show peak execution memory by default
  Andrew Or | 2015-08-24 | 2 files changed, -6/+8

  The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. The result is that the memory is not displayed by default.

  Author: Andrew Or <andrew@databricks.com>
  Closes #8345 from andrewor14/show-memory-default.
* [SPARK-7710] [SPARK-7998] [DOCS] Docs for DataFrameStatFunctions
  Burak Yavuz | 2015-08-24 | 2 files changed, -1/+102

  This PR contains examples of how to use some of the stat functions available for DataFrames under `df.stat`. rxin

  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #8378 from brkyvz/update-sql-docs.
* [SPARK-9791] [PACKAGE] Change private class to private[package] class to prevent unnecessary classes from showing up in the docs
  Tathagata Das | 2015-08-24 | 13 files changed, -54/+28

  In addition, some random cleanup of import ordering.

  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #8387 from tdas/SPARK-9791 and squashes the following commits:
  67f3ee9 [Tathagata Das] Change private class to private[package] class to prevent them from showing up in the docs
* [SPARK-10168] [STREAMING] Fix the issue that maven publishes wrong artifact jars
  zsxwing | 2015-08-24 | 5 files changed, -25/+26

  This PR removed the `outputFile` configuration from pom.xml and updated `tests.py` to search jars for both the sbt build and the maven build. I ran `mvn -Pkinesis-asl -DskipTests clean install` locally and verified the jars in my local repository were correct. I also checked Python tests for the maven build, and it passed all tests.

  Author: zsxwing <zsxwing@gmail.com>
  Closes #8373 from zsxwing/SPARK-10168 and squashes the following commits:
  e0b5818 [zsxwing] Fix the sbt build
  c697627 [zsxwing] Add the jar pathes to the exception message
  be1d8a5 [zsxwing] Fix the issue that maven publishes wrong artifact jars
* [SPARK-10142] [STREAMING] Made python checkpoint recovery handle non-local checkpoint paths and existing SparkContexts
  Tathagata Das | 2015-08-23 | 3 files changed, -16/+58

  The current code only checks for checkpoint files in the local filesystem, and always tries to create a new Python SparkContext (even if one already exists). The solution is to do the following:
  1. Use the same code path as Java to check whether a valid checkpoint exists
  2. Create a new Python SparkContext only if there is no active one

  There is no test for this path, as it's hard to test with distributed filesystem paths in a local unit test. I am going to test it with a distributed file system manually to verify that this patch works.

  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #8366 from tdas/SPARK-10142 and squashes the following commits:
  3afa666 [Tathagata Das] Added tests
  2dd4ae5 [Tathagata Das] Added the check to not create a context if one already exists
  9bf151b [Tathagata Das] Made python checkpoint recovery use java to find the checkpoint files
* [SPARK-10164] [MLLIB] Fixed GMM distributed decomposition bug
  Joseph K. Bradley | 2015-08-23 | 2 files changed, -9/+35

  GaussianMixture now distributes matrix decompositions for certain problem sizes. The distributed computation actually failed, but this was not covered by unit tests. This PR adds a unit test which checks it; the test failed previously and works with this fix.

  CC: mengxr

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #8370 from jkbradley/gmm-fix.
* [SPARK-10148] [STREAMING] Display active and inactive receiver numbers in Streaming page
  zsxwing | 2015-08-23 | 2 files changed, -0/+14

  Added the active and inactive receiver numbers to the summary section of the Streaming page.

  Screenshot: https://cloud.githubusercontent.com/assets/1000778/9402437/ff2806a2-480f-11e5-8f8e-efdf8e5d514d.png

  Author: zsxwing <zsxwing@gmail.com>
  Closes #8351 from zsxwing/receiver-number.