spark - Mirror of Apache Spark

	Commit message	Author	Age	Files	Lines
*	[SPARK-9782] [YARN] Support YARN application tags via SparkConf	Dennis Huo	2015-08-18	3	-0/+65
	Add a new test case in yarn/ClientSuite which checks how the various SparkConf and ClientArguments propagate into the ApplicationSubmissionContext. Author: Dennis Huo <dhuo@google.com> Closes #8072 from dennishuo/dhuo-yarn-application-tags.
*	[SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation	Michael Armbrust	2015-08-18	3	-11/+22
	Turns out that inner classes of inner objects are referenced directly, and thus moving them will break binary compatibility. Author: Michael Armbrust <michael@databricks.com> Closes #8281 from marmbrus/binaryCompat.
*	[SPARK-9574] [STREAMING] Remove unnecessary contents of spark-streaming-XXX-assembly jars	zsxwing	2015-08-18	5	-1/+249
	Removed contents already included in Spark assembly jar from spark-streaming-XXX-assembly jars. Author: zsxwing <zsxwing@gmail.com> Closes #8069 from zsxwing/SPARK-9574.
*	[SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import	Piotr Migdal	2015-08-18	1	-2/+0
	See https://issues.apache.org/jira/browse/SPARK-10085 Author: Piotr Migdal <pmigdal@gmail.com> Closes #8284 from stared/spark-10085.
*	[SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide	Yanbo Liang	2015-08-18	1	-0/+28
	Add Python example for mllib LDAModel user guide. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8227 from yanboliang/spark-10032.
*	[SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide	Yanbo Liang	2015-08-18	1	-0/+35
	Add Python examples for mllib IsotonicRegression user guide. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8225 from yanboliang/spark-10029.
*	[SPARK-9900] [MLLIB] User guide for Association Rules	Feynman Liang	2015-08-18	3	-15/+118
	Updates FPM user guide to include Association Rules. Author: Feynman Liang <fliang@databricks.com> Closes #8207 from feynmanliang/SPARK-9900-arules.
*	[SPARK-7736] [CORE] Fix a race introduced in PythonRunner.	Marcelo Vanzin	2015-08-18	1	-1/+7
	The fix for SPARK-7736 introduced a race where a port value of "-1" could be passed down to the pyspark process, causing it to fail to connect back to the JVM. This change adds code to fix that race. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8258 from vanzin/SPARK-7736.
*	[SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel	Yuhao Yang	2015-08-18	4	-155/+402
	JIRA: https://issues.apache.org/jira/browse/SPARK-9028 Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency. I changed the meaning of minCount to be a filter across the corpus. This aligns with Word2Vec and the similar parameter in scikit-learn. Author: Yuhao Yang <hhbyyh@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #7388 from hhbyyh/cvEstimator.
*	[SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple parameters functions	Yuu ISHIKAWA	2015-08-18	1	-3/+47
	JIRA: [[SPARK-10007] Update `NAMESPACE` file in SparkR for simple parameters functions - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10007) Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8277 from yu-iskw/SPARK-10007.
*	[SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4J	Cheng Lian	2015-08-18	5	-43/+47
	Parquet hard-codes a JUL logger which always writes to stdout. This PR redirects it via the SLF4J JUL bridge handler, so that we can control Parquet logs via `log4j.properties`. This solution is inspired by https://github.com/Parquet/parquet-mr/issues/390#issuecomment-46064909. Author: Cheng Lian <lian@databricks.com> Closes #8196 from liancheng/spark-8118/redirect-parquet-jul.
*	[MINOR] fix the comments in IndexShuffleBlockResolver	CodingCat	2015-08-18	1	-1/+1
	This might be a typo introduced at the very beginning, or a leftover from some renaming. The method accessing the index file is now called `getBlockData` (not `getBlockLocation` as indicated in the comments). Author: CodingCat <zhunansjtu@gmail.com> Closes #8238 from CodingCat/minor_1.
*	[SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public	Yanbo Liang	2015-08-17	1	-2/+2
	Fix the issue that `layers` and `weights` should be public variables of `MultilayerPerceptronClassificationModel`. Users currently cannot get `layers` and `weights` from a `MultilayerPerceptronClassificationModel`. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8263 from yanboliang/mlp-public.
*	[SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary in ArrayData	Davies Liu	2015-08-17	2	-4/+29
	In Java, the type of an array of arrays is slightly different from that of an array of other elements. cc cloud-fan Author: Davies Liu <davies@databricks.com> Closes #8250 from davies/array_binary.
*	[MINOR] Format the comment of `translate` at `functions.scala`	Yu ISHIKAWA	2015-08-17	1	-8/+9
	Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8265 from yu-iskw/minor-translate-comment.
*	[SPARK-7808] [ML] add package doc for ml.feature	Xiangrui Meng	2015-08-17	1	-0/+89
	This PR adds a short description of the `ml.feature` package with a code example. The Java package doc will come in a separate PR. cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8260 from mengxr/SPARK-7808.
*	[SPARK-10059] [YARN] Explicitly add JSP dependencies for tests.	Marcelo Vanzin	2015-08-17	1	-3/+19
	Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8251 from vanzin/SPARK-10059.
*	[SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test	jose.cambronero	2015-08-17	1	-4/+47
	Added doc examples for Python. Author: jose.cambronero <jose.cambronero@cloudera.com> Closes #8154 from josepablocam/spark_9902.
*	[SPARK-7707] User guide and example code for KernelDensity	Sandy Ryza	2015-08-17	1	-0/+77
	Author: Sandy Ryza <sandy@cloudera.com> Closes #8230 from sryza/sandy-spark-7707.
*	[SPARK-9898] [MLLIB] Prefix Span user guide	Feynman Liang	2015-08-17	2	-0/+97
	Adds user guide for `PrefixSpan`, including Scala and Java example code. cc mengxr zhangjiajin Author: Feynman Liang <fliang@databricks.com> Closes #8253 from feynmanliang/SPARK-9898.
*	SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression	Prayag Chandran	2015-08-17	9	-12/+168
	Added @since tags to mllib.regression. Author: Prayag Chandran <prayagchandran@gmail.com> Closes #7518 from prayagchandran/sinceTags and squashes the following commits: fa4dda2 [Prayag Chandran] Re-formatting; 6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags; 1a0365f [Prayag Chandran] Reformating and adding a few more tags; 89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
*	[SPARK-9768] [PYSPARK] [ML] Add Python API and user guide for ml.feature.ElementwiseProduct	Yanbo Liang	2015-08-17	2	-9/+81
	Add Python API, user guide and example for ml.feature.ElementwiseProduct. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8061 from yanboliang/SPARK-9768.
*	[SPARK-9974] [BUILD] [SQL] Makes sure com.twitter:parquet-hadoop-bundle:1.6.0 is in SBT assembly jar	Cheng Lian	2015-08-17	1	-1/+1
	PR #7967 enables Spark SQL to persist Parquet tables in Hive compatible format when possible. One consequence is that we have to set input/output classes to `MapredParquetInputFormat`/`MapredParquetOutputFormat`, which rely on com.twitter:parquet-hadoop:1.6.0 bundled with Hive 1.2.1. When loading such a table in Spark SQL, `o.a.h.h.ql.metadata.Table` first loads these input/output format classes, and thus classes in com.twitter:parquet-hadoop:1.6.0. However, the scope of this dependency is defined as "runtime", so it is not packaged into the Spark assembly jar. This results in a `ClassNotFoundException`. This issue can be worked around by asking users to add parquet-hadoop 1.6.0 via the `--driver-class-path` option. However, considering that the Maven build is immune to this problem, I feel it can be confusing and inconvenient for users. So this PR fixes this issue by changing the scope of parquet-hadoop 1.6.0 to "compile". Author: Cheng Lian <lian@databricks.com> Closes #8198 from liancheng/spark-9974/bundle-parquet-1.6.0.
*	[SPARK-8920] [MLLIB] Add @since tags to mllib.linalg	Sameer Abhyankar	2015-08-17	8	-17/+227
	Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.Samavihome> Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.local> Closes #7729 from sabhyankar/branch_8920.
*	[SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listing	Feynman Liang	2015-08-17	1	-13/+13
	cc mengxr jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8255 from feynmanliang/SPARK-10068.
*	[SPARK-9592] [SQL] Fix Last function implemented based on AggregateExpression1.	Yin Huai	2015-08-17	2	-2/+22
	https://issues.apache.org/jira/browse/SPARK-9592 #8113 has the fundamental fix. But if we want to minimize the number of changed lines, we can go with this one and then merge #8113 in 1.6. Author: Yin Huai <yhuai@databricks.com> Closes #8172 from yhuai/lastFix and squashes the following commits: b28c42a [Yin Huai] Regression test; af87086 [Yin Huai] Fix last.
*	[SPARK-9526] [SQL] Utilize randomized tests to reveal potential bugs in sql expressions	Yijie Shen	2015-08-17	10	-6/+410
	JIRA: https://issues.apache.org/jira/browse/SPARK-9526 This PR is a follow-up of #7830, aiming at utilizing randomized tests to reveal more potential bugs in SQL expressions. Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7855 from yjshen/property_check.
*	[SPARK-10036] [SQL] Load JDBC driver in DataFrameReader.jdbc and DataFrameWriter.jdbc	zsxwing	2015-08-17	4	-7/+20
	This PR uses `JDBCRDD.getConnector` to load the JDBC driver before creating a connection in `DataFrameReader.jdbc` and `DataFrameWriter.jdbc`. Author: zsxwing <zsxwing@gmail.com> Closes #8232 from zsxwing/SPARK-10036 and squashes the following commits: adf75de [zsxwing] Add extraOptions to the connection properties; 57f59d4 [zsxwing] Load JDBC driver in DataFrameReader.jdbc and DataFrameWriter.jdbc
*	[SPARK-9950] [SQL] Wrong Analysis Error for grouping/aggregating on struct fields	Wenchen Fan	2015-08-17	1	-0/+5
	This issue has been fixed by https://github.com/apache/spark/pull/8215; this PR adds a regression test for it. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8222 from cloud-fan/minor and squashes the following commits: 0bbfb1c [Wenchen Fan] fix style...; 7e2d8d9 [Wenchen Fan] add test
*	[SPARK-7736] [CORE] [YARN] Make pyspark fail YARN app on failure.	Marcelo Vanzin	2015-08-17	4	-8/+40
	The YARN backend doesn't like it when user code calls `System.exit`, since it cannot know the exit status and thus cannot set an appropriate final status for the application. So, for pyspark, avoid that call and instead throw an exception with the exit code. SparkSubmit handles that exception and exits with the given exit code, while YARN uses the exit code as the failure code for the Spark app. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7751 from vanzin/SPARK-9416.
*	[SPARK-9924] [WEB UI] Don't schedule checkForLogs while some of them are already running.	Rohit Agarwal	2015-08-17	1	-7/+21
	Author: Rohit Agarwal <rohita@qubole.com> Closes #8153 from mindprince/SPARK-9924.
*	[SPARK-7837] [SQL] Avoids double closing output writers when commitTask() fails	Cheng Lian	2015-08-18	2	-6/+61
	When inserting data into a `HadoopFsRelation`, if `commitTask()` of the writer container fails, `abortTask()` will be invoked. However, both `commitTask()` and `abortTask()` try to close the output writer(s). The problem is that closing the underlying writers may not be an idempotent operation. E.g., `ParquetRecordWriter.close()` throws an NPE when called twice. Author: Cheng Lian <lian@databricks.com> Closes #8236 from liancheng/spark-7837/double-closing.
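A minimal Python sketch of the general idea, guarding a non-idempotent `close()` so it reaches the underlying writer at most once. This is an illustration only and not the code from the patch; `GuardedWriter` and `FragileWriter` are hypothetical names introduced here.

```python
class FragileWriter:
    """Hypothetical writer whose close() fails if called a second time."""

    def __init__(self):
        self.records = []
        self.close_calls = 0

    def write(self, record):
        self.records.append(record)

    def close(self):
        self.close_calls += 1
        if self.close_calls > 1:
            raise RuntimeError("close() called twice")


class GuardedWriter:
    """Wraps a writer so close() is forwarded to it at most once."""

    def __init__(self, underlying):
        self._underlying = underlying
        self._closed = False

    def write(self, record):
        self._underlying.write(record)

    def close(self):
        if self._closed:
            return  # a later abort path calling close() again is a no-op
        # Mark as closed first so that a failing close() is not retried.
        self._closed = True
        self._underlying.close()
```

With such a guard, a commit path and an abort path can both call `close()` without the second call reaching the fragile writer.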
*	[SPARK-9959] [MLLIB] Association Rules Java Compatibility	Feynman Liang	2015-08-17	1	-2/+28
	cc mengxr Author: Feynman Liang <fliang@databricks.com> Closes #8206 from feynmanliang/SPARK-9959-arules-java.
*	[SPARK-9199] [CORE] Upgrade Tachyon version from 0.7.0 -> 0.7.1.	Calvin Jia	2015-08-17	2	-2/+2
	Updates the tachyon-client version to the latest release. The main difference between 0.7.0 and 0.7.1 on the client side is support for running Tachyon on the local file system by default. No new non-Tachyon dependencies are added, and no code changes are required since the client API has not changed. Author: Calvin Jia <jia.calvin@gmail.com> Closes #8235 from calvinjia/spark-9199-master.
*	[SPARK-9871] [SPARKR] Add expression functions into SparkR which have a variable parameter	Yu ISHIKAWA	2015-08-16	4	-0/+75
	Summary: add the `lit` function; add the `concat`, `greatest`, and `least` functions. I think we need to improve the `collect` function in order to implement the `struct` function, since `collect` doesn't work with arguments that include a nested `list` variable. It seems that a list against `struct` still has `jobj` classes, so it would be better to solve this problem in another issue. JIRA: [[SPARK-9871] Add expression functions into SparkR which have a variable parameter - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9871) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8194 from yu-iskw/SPARK-9856.
*	[SPARK-10005] [SQL] Fixes schema merging for nested structs	Cheng Lian	2015-08-16	4	-22/+112
	In case of schema merging, we only handled first level fields when converting Parquet groups to `InternalRow`s. Nested struct fields are not properly handled. For example, the schema of a Parquet file to be read can be:

```
message individual {
  required group f1 {
    optional binary f11 (utf8);
  }
}
```

	while the global schema is:

```
message global {
  required group f1 {
    optional binary f11 (utf8);
    optional int32 f12;
  }
}
```

	This PR fixes this issue by padding missing fields when creating actual converters. Author: Cheng Lian <lian@databricks.com> Closes #8228 from liancheng/spark-10005/nested-schema-merging.
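The padding idea can be sketched in Python with plain dictionaries standing in for schemas and records. This is only an illustration of the concept, not the converter code from the patch; `pad_record` and the dictionary encoding of schemas (leaf fields map to `None`, groups map to nested dictionaries) are assumptions made for this example.

```python
# Global (merged) schema from the example above: group f1 with leaves f11, f12.
GLOBAL_SCHEMA = {"f1": {"f11": None, "f12": None}}


def pad_record(record, schema):
    """Return a copy of `record` that has every field named in `schema`.

    Fields missing from the record are filled with None, recursively for
    nested groups, so the result always matches the global schema's shape.
    """
    padded = {}
    for name, sub_schema in schema.items():
        value = record.get(name)
        if isinstance(sub_schema, dict) and value is not None:
            # Nested group: pad its fields against the nested schema.
            padded[name] = pad_record(value, sub_schema)
        else:
            # Leaf field, or a group that is absent altogether.
            padded[name] = value
    return padded


# A record read from a file written with the narrower "individual" schema.
row = pad_record({"f1": {"f11": "a"}}, GLOBAL_SCHEMA)
```

Here `row["f1"]` carries both `f11` and `f12`, with `f12` set to `None` because the file never stored it.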
*	[SPARK-10008] Ensure shuffle locality doesn't take precedence over narrow deps	Matei Zaharia	2015-08-16	2	-19/+44
	The shuffle locality patch made the DAGScheduler aware of shuffle data, but for RDDs that have both narrow and shuffle dependencies, it can cause them to place tasks based on the shuffle dependency instead of the narrow one. This case is common in iterative join-based algorithms like PageRank and ALS, where one RDD is hash-partitioned and one isn't. Author: Matei Zaharia <matei@databricks.com> Closes #8220 from mateiz/shuffle-loc-fix.
*	[SPARK-8844] [SPARKR] head/collect is broken in SparkR.	Sun Rui	2015-08-16	2	-6/+30
	This is a WIP patch for SPARK-8844 for collecting reviews. This bug is about reading an empty DataFrame. In `readCol()`, `lapply(1:numRows, function(x) {` does not take into consideration the case where numRows = 0. Will add a unit test case. Author: Sun Rui <rui.sun@intel.com> Closes #7419 from sun-rui/SPARK-8844.
*	[SPARK-9973] [SQL] Correct in-memory columnar buffer size	Kun Xu	2015-08-16	1	-2/+1
	The `initialSize` argument of `ColumnBuilder.initialize()` should be the number of rows rather than bytes. However, `InMemoryColumnarTableScan` passes in a byte size, which makes Spark SQL allocate more memory than necessary when building in-memory columnar buffers. Author: Kun Xu <viper_kun@163.com> Closes #8189 from viper-kun/errorSize.
*	[SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _eventually for ml streaming pyspark tests	Joseph K. Bradley	2015-08-15	1	-48/+129
	Recently, PySpark ML streaming tests have been flaky, most likely because of the batches not being processed in time. Proposal: Replace the use of _ssc_wait (which waits for a fixed amount of time) with a method which waits up to a fixed amount of time but can terminate early based on a termination condition method. With this, we can extend the waiting period (to make tests less flaky) but also stop early when possible (making tests faster on average, which I verified locally). CC: mengxr tdas freeman-lab Author: Joseph K. Bradley <joseph@databricks.com> Closes #8087 from jkbradley/streaming-ml-tests.
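A minimal sketch of the polling pattern described above: wait up to a timeout, but return as soon as a condition holds. The function name, parameters, and return convention below are assumptions for illustration and may differ from the `_eventually` helper added by the patch.

```python
import time


def eventually(condition, timeout=30.0, interval=0.01):
    """Poll `condition()` until it returns True or `timeout` seconds pass.

    Returns True if the condition held before the deadline, False otherwise.
    The condition is always checked at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True  # stop early: no need to wait out the full timeout
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test can then use a generous timeout for robustness while still finishing quickly in the common case, for example `assert eventually(lambda: len(results) == expected, timeout=60.0)`, where `results` and `expected` are whatever the test collects.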
*	[SPARK-9955] [SQL] correct error message for aggregate	Wenchen Fan	2015-08-15	3	-7/+12
	We should skip unresolved `LogicalPlan`s for `PullOutNondeterministic`, as calling `output` on an unresolved `LogicalPlan` will produce a confusing error message. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8203 from cloud-fan/error-msg and squashes the following commits: 1c67ca7 [Wenchen Fan] move test; 7593080 [Wenchen Fan] correct error message for aggregate
*	[SPARK-9980] [BUILD] Fix SBT publishLocal error due to invalid characters in doc	Herman van Hovell	2015-08-15	9	-19/+19
	Tiny modifications to a few comments make `sbt publishLocal` work again. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #8209 from hvanhovell/SPARK-9980.
*	[SPARK-9725] [SQL] fix serialization of UTF8String across different JVM	Davies Liu	2015-08-14	1	-6/+25
	The BYTE_ARRAY_OFFSET could be different in JVMs with different configurations (for example, different heap sizes: 24 if heap > 32G, otherwise 16), so the offset of a UTF8String is not portable. We should handle that during serialization. Author: Davies Liu <davies@databricks.com> Closes #8210 from davies/serialize_utf8string.
*	[SPARK-9960] [GRAPHX] sendMessage type fix in LabelPropagation.scala	zc he	2015-08-14	1	-1/+1
	Author: zc he <farseer90718@gmail.com> Closes #8188 from farseer90718/farseer-patch-1.
*	[SPARK-9984] [SQL] Create local physical operator interface.	Reynold Xin	2015-08-14	4	-0/+224
	This pull request creates a new operator interface that is more similar to traditional database query iterators (with open/close/next/get). These local operators are not currently used anywhere, but will become the basis for SPARK-9983 (local physical operators for query execution). cc zsxwing Author: Reynold Xin <rxin@databricks.com> Closes #8212 from rxin/SPARK-9984.
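To illustrate the open/next/get/close iterator style mentioned above, here is a small Python sketch. It shows the general database-iterator pattern only; the class names `ScanOperator` and `FilterOperator` and the helper `collect` are hypothetical and are not the interface introduced by the pull request.

```python
class ScanOperator:
    """Leaf operator that iterates over an in-memory list of rows."""

    def __init__(self, rows):
        self._rows = rows
        self._pos = -1

    def open(self):
        self._pos = -1  # position before the first row

    def next(self):
        """Advance to the next row; return True if one is available."""
        self._pos += 1
        return self._pos < len(self._rows)

    def get(self):
        """Return the current row; valid only after next() returned True."""
        return self._rows[self._pos]

    def close(self):
        pass


class FilterOperator:
    """Passes through only the child rows that satisfy `predicate`."""

    def __init__(self, child, predicate):
        self._child = child
        self._predicate = predicate

    def open(self):
        self._child.open()

    def next(self):
        # Pull from the child until a matching row is found or input ends.
        while self._child.next():
            if self._predicate(self._child.get()):
                return True
        return False

    def get(self):
        return self._child.get()

    def close(self):
        self._child.close()


def collect(op):
    """Drive an operator tree to completion and return all rows."""
    op.open()
    try:
        rows = []
        while op.next():
            rows.append(op.get())
        return rows
    finally:
        op.close()


evens = collect(FilterOperator(ScanOperator([1, 2, 3, 4]), lambda r: r % 2 == 0))
```

Each operator pulls rows from its child one at a time, so a pipeline of operators runs without materializing intermediate results.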
*	[SPARK-8887] [SQL] Explicit define which data types can be used as dynamic partition columns	Yijie Shen	2015-08-14	5	-4/+41
	This PR enforces dynamic partition column data type requirements by adding analysis rules. JIRA: https://issues.apache.org/jira/browse/SPARK-8887 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #8201 from yjshen/dynamic_partition_columns.
*	[SPARK-9634] [SPARK-9323] [SQL] cleanup unnecessary Aliases in LogicalPlan at the end of analysis	Wenchen Fan	2015-08-14	9	-24/+120
	Also alias the ExtractValue instead of wrapping it with UnresolvedAlias when resolving attributes in LogicalPlan, as this alias will be trimmed if it's unnecessary. Based on #7957 without the changes to mllib, but instead maintaining earlier behavior when using `withColumn` on expressions that already have metadata. Author: Wenchen Fan <cloud0fan@outlook.com> Author: Michael Armbrust <michael@databricks.com> Closes #8215 from marmbrus/pr/7957.
*	[HOTFIX] fix duplicated braces	Davies Liu	2015-08-14	13	-15/+15
	Author: Davies Liu <davies@databricks.com> Closes #8219 from davies/fix_typo.
*	[SPARK-9934] Deprecate NIO ConnectionManager.	Reynold Xin	2015-08-14	2	-1/+4
	Deprecate NIO ConnectionManager in Spark 1.5.0, before removing it in Spark 1.6.0. Author: Reynold Xin <rxin@databricks.com> Closes #8162 from rxin/SPARK-9934.
*	[SPARK-9949] [SQL] Fix TakeOrderedAndProject's output.	Yin Huai	2015-08-14	2	-4/+28
	https://issues.apache.org/jira/browse/SPARK-9949 Author: Yin Huai <yhuai@databricks.com> Closes #8179 from yhuai/SPARK-9949.