spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-10547] [TEST] Streamline / improve style of Java API tests	Sean Owen	2015-09-12	15	-761/+755
\| \| \| \| \| \| \| \|	Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order Author: Sean Owen <sowen@cloudera.com> Closes #8706 from srowen/SPARK-10547.
*	[SPARK-10554] [CORE] Fix NPE with ShutdownHook	Nithin Asokan	2015-09-12	1	-1/+3
\| \| \| \| \| \| \| \| \| \|	https://issues.apache.org/jira/browse/SPARK-10554 Fixes NPE when ShutdownHook tries to cleanup temporary folders Author: Nithin Asokan <Nithin.Asokan@Cerner.com> Closes #8720 from nasokan/SPARK-10554.
*	[SPARK-10566] [CORE] SnappyCompressionCodec init exception handling masks ↵	Daniel Imfeld	2015-09-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	important error information When throwing an IllegalArgumentException in SnappyCompressionCodec.init, chain the existing exception. This allows potentially important debugging info to be passed to the user. Manual testing shows the exception chained properly, and the test suite still looks fine as well. This contribution is my original work and I license the work to the project under the project's open source license. Author: Daniel Imfeld <daniel@danielimfeld.com> Closes #8725 from dimfeld/dimfeld-patch-1.
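The chaining this commit describes can be illustrated in Python, where `raise ... from` plays the role of passing the original exception as the cause. This is a minimal sketch and not the Spark code: `load_native_codec`, `init_codec` and `describe_failure` are hypothetical names, and the simulated error stands in for whatever the underlying library actually raises.

```python
def load_native_codec():
    # Hypothetical stand-in for loading a native library; it always fails here.
    raise OSError("native library not found")


def init_codec():
    try:
        load_native_codec()
    except OSError as e:
        # Chain the original exception so its message and traceback survive.
        raise ValueError("codec initialization failed") from e


def describe_failure():
    # Returns the outer message and a description of the chained cause.
    try:
        init_codec()
    except ValueError as e:
        return str(e), repr(e.__cause__)
    return None
```

Without the `from e` clause the caller would only see the outer message, which is the masking problem the commit title refers to.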
*	[SPARK-9014] [SQL] Allow Python spark API to use built-in exponential operator	0x0FFF	2015-09-11	2	-1/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR addresses [SPARK-9014](https://issues.apache.org/jira/browse/SPARK-9014)
Added functionality: the `Column` object in Python now supports the exponential operator `**`
Example:
```
from pyspark.sql import *
df = sqlContext.createDataFrame([Row(a=2)])
df.select(3**df.a, df.a**3, df.a**df.a).collect()
```
Outputs:
```
[Row(POWER(3.0, a)=9.0, POWER(a, 3.0)=8.0, POWER(a, a)=4.0)]
```
Author: 0x0FFF <programmerag@gmail.com> Closes #8658 from 0x0FFF/SPARK-9014.
*	[SPARK-10564] ThreadingSuite: assertion failures in threads don't fail the test	Andrew Or	2015-09-11	1	-23/+45
\| \| \| \| \| \| \| \|	This commit ensures if an assertion fails within a thread, it will ultimately fail the test. Otherwise we end up potentially masking real bugs by not propagating assertion failures properly. Author: Andrew Or <andrew@databricks.com> Closes #8723 from andrewor14/fix-threading-suite.
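The general pattern behind this fix, collecting a failure raised inside a thread and re-raising it in the thread that runs the test, can be sketched in Python as follows. This is an illustration under assumptions, not the ThreadingSuite code; `run_in_thread` and `failing_check` are hypothetical helpers.

```python
import threading


def run_in_thread(fn):
    """Run fn in a new thread and re-raise any exception in the caller."""
    errors = []

    def wrapper():
        try:
            fn()
        except BaseException as e:  # includes AssertionError
            errors.append(e)

    t = threading.Thread(target=wrapper)
    t.start()
    t.join()
    if errors:
        # Surface the worker's failure in the calling (test) thread.
        raise errors[0]


def failing_check():
    assert 1 + 1 == 3, "assertion inside worker thread"
```

Calling `run_in_thread(failing_check)` raises `AssertionError` in the caller, whereas starting a plain thread would leave the calling thread unaffected by the failure.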
*	[SPARK-9990] [SQL] Local hash join follow-ups	Andrew Or	2015-09-11	4	-5/+125
\| \| \| \| \| \| \| \| \|	1. Hide `LocalNodeIterator` behind the `LocalNode#asIterator` method 2. Add tests for this Author: Andrew Or <andrew@databricks.com> Closes #8708 from andrewor14/local-hash-join-follow-up.
*	[SPARK-9992] [SPARK-9994] [SPARK-9998] [SQL] Implement the local TopK, ↵	zsxwing	2015-09-11	8	-1/+353
\| \| \| \| \| \| \| \| \| \|	sample and intersect operators This PR is in conflict with #8535. I will update this one when #8535 gets merged. Author: zsxwing <zsxwing@gmail.com> Closes #8573 from zsxwing/more-local-operators.
*	[SPARK-7142] [SQL] Minor enhancement to BooleanSimplification Optimizer ↵	Yash Datta	2015-09-11	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \|	rule. Incorporate review comments Adding changes suggested by cloud-fan in #5700 cc marmbrus Author: Yash Datta <Yash.Datta@guavus.com> Closes #8716 from saucam/bool_simp.
*	[SPARK-10442] [SQL] fix string to boolean cast	Wenchen Fan	2015-09-11	4	-24/+82
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we cast a string to boolean in Hive, it returns `true` if the length of the string is > 0, and Spark SQL follows this behavior. However, this behavior is very different from other SQL systems:
1. [presto](https://github.com/facebook/presto/blob/master/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L89-L118) will return `true` for 't' 'true' '1', `false` for 'f' 'false' '0', throw exception for others.
2. [redshift](http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html) will return `true` for 't' 'true' 'y' 'yes' '1', `false` for 'f' 'false' 'n' 'no' '0', null for others.
3. [postgresql](http://www.postgresql.org/docs/devel/static/datatype-boolean.html) will return `true` for 't' 'true' 'y' 'yes' 'on' '1', `false` for 'f' 'false' 'n' 'no' 'off' '0', throw exception for others.
4. [vertica](https://my.vertica.com/docs/5.0/HTML/Master/2983.htm) will return `true` for 't' 'true' 'y' 'yes' '1', `false` for 'f' 'false' 'n' 'no' '0', null for others.
5. [impala](http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_boolean.html) throws an exception when trying to cast a string to boolean.
6. mysql, oracle, sqlserver don't have a boolean type.
Whether we should change the cast behavior to match other SQL systems is not decided yet; this PR is a test to see how many compatibility tests would fail if we did.
Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8698 from cloud-fan/string2boolean.
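The difference between the length-based rule and a literal-based rule can be made concrete with a small Python sketch. The literal sets are taken from the Redshift and Vertica entries in this commit message; the function names are hypothetical, and lower-casing the input is an assumption of this sketch rather than documented behavior of any of those systems.

```python
TRUE_STRINGS = {"t", "true", "y", "yes", "1"}
FALSE_STRINGS = {"f", "false", "n", "no", "0"}


def hive_style_cast(s):
    # Behavior described for Hive: any non-empty string is true.
    return len(s) > 0


def null_for_others_cast(s):
    # Literal-based rule: known literals map to a boolean,
    # everything else maps to None (standing in for SQL NULL).
    v = s.lower()  # assumption: case-insensitive matching
    if v in TRUE_STRINGS:
        return True
    if v in FALSE_STRINGS:
        return False
    return None
```

Under the length-based rule the string 'false' casts to true, which is the kind of surprise the comparison in the commit message is pointing at.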
*	[PYTHON] Fixed typo in exception message	Icaro Medeiros	2015-09-11	1	-1/+1
\| \| \| \| \| \| \| \|	Just fixing a typo in exception message, raised when attempting to pickle SparkContext. Author: Icaro Medeiros <icaro.medeiros@gmail.com> Closes #8724 from icaromedeiros/master.
*	[SPARK-10546] Check partitionId's range in ExternalSorter#spill()	tedyu	2015-09-11	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	See this thread for background: http://search-hadoop.com/m/q3RTt0rWvIkHAE81 We should check the range of partition Id and provide meaningful message through exception. Alternatively, we can use abs() and modulo to force the partition Id into legitimate range. However, expectation is that user should correct the logic error in his / her code. Author: tedyu <yuzhihong@gmail.com> Closes #8703 from tedyu/master.
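The two options weighed in this commit message can be sketched in Python. This is an illustration only, with hypothetical function names and an invented error message; it is not the check added to ExternalSorter.

```python
def check_partition_id(partition_id, num_partitions):
    # Fail fast with a descriptive message instead of an obscure error later.
    if not 0 <= partition_id < num_partitions:
        raise ValueError(
            "partition id %d is outside the valid range [0, %d)"
            % (partition_id, num_partitions)
        )
    return partition_id


def force_into_range(partition_id, num_partitions):
    # The alternative mentioned in the message: abs() and modulo.
    # This hides the caller's logic error rather than reporting it.
    return abs(partition_id) % num_partitions
```

The first variant matches the stated expectation that the user should correct the logic error in the partitioner; the second silently maps bad ids onto valid partitions.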
*	[SPARK-8530] [ML] add python API for MinMaxScaler	Yuhao Yang	2015-09-11	1	-5/+99
\| \| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-8530 add python API for MinMaxScaler jira for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514 Author: Yuhao Yang <hhbyyh@gmail.com> Closes #7150 from hhbyyh/pythonMinMax.
*	[SPARK-10540] [SQL] Ignore HadoopFsRelationTest's "test all data types" if ↵	Yin Huai	2015-09-11	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	it is too flaky If hadoopFsRelationSuites's "test all data types" is too flaky we can disable it for now. https://issues.apache.org/jira/browse/SPARK-10540 Author: Yin Huai <yhuai@databricks.com> Closes #8705 from yhuai/SPARK-10540-ignore.
*	[MINOR] [MLLIB] [ML] [DOC] Minor doc fixes for StringIndexer and MetadataUtils	Joseph K. Bradley	2015-09-11	3	-29/+20
\| \| \| \| \| \| \| \| \| \| \| \|	Changes: * Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited. * MetadataUtils.scala: “ Helper utilities for tree-based algorithms” —> not just trees anymore CC: holdenk mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8679 from jkbradley/doc-fixes-1.5.
*	[SPARK-10537] [ML] document LIBSVM source options in public API doc and some ↵	Xiangrui Meng	2015-09-11	3	-43/+66
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	minor improvements We should document options in public API doc. Otherwise, it is hard to find out the options without looking at the code. I tried to make `DefaultSource` private and put the documentation to package doc. However, since then there exists no public class under `source.libsvm`, the Java package doc doesn't show up in the generated html file (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4492654). So I put the doc to `DefaultSource` instead. There are several minor updates in this PR: 1. Do `vectorType == "sparse"` only once. 2. Update `hashCode` and `equals`. 3. Remove inherited doc. 4. Delete temp dir in `afterAll`. Lewuathe Author: Xiangrui Meng <meng@databricks.com> Closes #8699 from mengxr/SPARK-10537.
*	[SPARK-9773] [ML] [PySpark] Add Python API for MultilayerPerceptronClassifier	Yanbo Liang	2015-09-11	2	-1/+140
\| \| \| \| \| \| \| \|	Add Python API for ```MultilayerPerceptronClassifier```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8067 from yanboliang/SPARK-9773.
*	[SPARK-10026] [ML] [PySpark] Implement some common Params for regression in ↵	Yanbo Liang	2015-09-11	4	-96/+143
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	PySpark LinearRegression and LogisticRegression lack some Params in Python, and some Params are not shared classes, so they have to be written separately for each class. These Params are listed here:
```scala
HasElasticNetParam
HasFitIntercept
HasStandardization
HasThresholds
```
Here we implement them as shared params on the Python side and bring the LinearRegression/LogisticRegression parameters to parity with the Scala ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8508 from yanboliang/spark-10026.
*	[SPARK-10518] [DOCS] Update code examples in spark.ml user guide to use ↵	y-shimizu	2015-09-11	3	-104/+47
\| \| \| \| \| \| \| \| \| \|	LIBSVM data source instead of MLUtils I fixed to use LIBSVM data source in the example code in spark.ml instead of MLUtils Author: y-shimizu <y.shimizu0429@gmail.com> Closes #8697 from y-shimizu/SPARK-10518.
*	[SPARK-10556] Remove explicit Scala version for sbt project build files	Ahir Reddy	2015-09-11	1	-2/+0
\| \| \| \| \| \| \| \| \| \|	Previously, project/plugins.sbt explicitly set scalaVersion to 2.10.4. This can cause issues when using a version of sbt that is compiled against a different version of Scala (for example sbt 0.13.9 uses 2.10.5). Removing this explicit setting will cause build files to be compiled and run against the same version of Scala that sbt is compiled against. Note that this only applies to the project build files (items in project/), it is distinct from the version of Scala we target for the actual spark compilation. Author: Ahir Reddy <ahirreddy@gmail.com> Closes #8709 from ahirreddy/sbt-scala-version-fix.
*	[SPARK-10472] [SQL] Fixes DataType.typeName for UDT	Cheng Lian	2015-09-11	2	-1/+9
\| \| \| \| \| \| \| \|	Before this fix, `MyDenseVectorUDT.typeName` gives `mydensevecto`, which is not desirable. Author: Cheng Lian <lian@databricks.com> Closes #8640 from liancheng/spark-10472/udt-type-name.
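One plausible way to end up with `mydensevecto` is a default name derived from the class name by dropping a fixed four-character suffix (which suits names ending in `Type`) and lower-casing the rest. The Python sketch below assumes that derivation; both function names are hypothetical, and the second function is only one possible remedy, not necessarily what the fix does.

```python
def default_type_name(class_name):
    # Assumed default: drop the last four characters (e.g. "Type"), lower-case.
    return class_name[:-4].lower()


def udt_type_name(class_name):
    # Hypothetical UDT-aware variant: strip a "UDT" suffix if present.
    if class_name.endswith("UDT"):
        class_name = class_name[:-3]
    return class_name.lower()
```

With these definitions, `default_type_name("IntegerType")` yields `integer`, while `default_type_name("MyDenseVectorUDT")` yields the truncated `mydensevecto` quoted in the commit message.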
*	[SPARK-10027] [ML] [PySpark] Add Python API missing methods for ml.feature	Yanbo Liang	2015-09-10	3	-8/+59
\| \| \| \| \| \| \| \| \| \| \|	The missing methods of ml.feature are listed here: ```StringIndexer``` lacks the parameter ```handleInvalid```. ```StringIndexerModel``` lacks the method ```labels```. ```VectorIndexerModel``` lacks the methods ```numFeatures``` and ```categoryMaps```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8313 from yanboliang/spark-10027.
*	[SPARK-10023] [ML] [PySpark] Unified DecisionTreeParams checkpointInterval ↵	Yanbo Liang	2015-09-10	4	-24/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	between Scala and Python API. "checkpointInterval" is a member of DecisionTreeParams in the Scala API, which is inconsistent with the Python API; we should unify them.
```
member of DecisionTreeParams <-> Scala API
shared param for all ML Transformer/Estimator <-> Python API
```
Proposal: "checkpointInterval" is also used by ALS, so we make it a shared param in Scala. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8528 from yanboliang/spark-10023.
*	[SPARK-9043] Serialize key, value and combiner classes in ShuffleDependency	Matt Massie	2015-09-10	9	-23/+168
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ShuffleManager implementations are currently not given type information for the key, value and combiner classes. Serialization of shuffle objects relies on objects being JavaSerializable, with methods defined for reading/writing the object or, alternatively, serialization via Kryo which uses reflection. Serialization systems like Avro, Thrift and Protobuf generate classes with zero argument constructors and explicit schema information (e.g. IndexedRecords in Avro have get, put and getSchema methods). By serializing the key, value and combiner class names in ShuffleDependency, shuffle implementations will have access to schema information when registerShuffle() is called. Author: Matt Massie <massie@cs.berkeley.edu> Closes #7403 from massie/shuffle-classtags.
*	[SPARK-7544] [SQL] [PySpark] pyspark.sql.types.Row implements __getitem__	Yanbo Liang	2015-09-10	1	-0/+15
\| \| \| \| \| \| \| \|	pyspark.sql.types.Row implements ```__getitem__``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #8333 from yanboliang/spark-7544.
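What `__getitem__` support on a row type looks like can be sketched with a small tuple subclass. `MiniRow` is a hypothetical stand-in written for this illustration, not the pyspark.sql.types.Row implementation; sorting the field names is an assumption of the sketch.

```python
class MiniRow(tuple):
    """Hypothetical stand-in for a row type: a tuple with named fields."""

    def __new__(cls, **kwargs):
        names = sorted(kwargs)  # assumption: fields are stored in sorted order
        row = tuple.__new__(cls, [kwargs[n] for n in names])
        row._fields = names
        return row

    def __getitem__(self, item):
        # Integer and slice access keep normal tuple behavior.
        if isinstance(item, (int, slice)):
            return super().__getitem__(item)
        # String access looks the field up by name.
        try:
            return super().__getitem__(self._fields.index(item))
        except ValueError:
            raise KeyError(item)
```

With this in place, `MiniRow(name="Alice", age=11)["name"]` returns the name field, while positional access such as `row[0]` keeps working as on any tuple.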
*	Add 1.5 to master branch EC2 scripts	Shivaram Venkataraman	2015-09-10	1	-2/+6
\| \| \| \| \| \| \| \|	This change brings it to par with `branch-1.5` (and 1.5.0 release) Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #8704 from shivaram/ec2-1.5-update.
*	[SPARK-10443] [SQL] Refactor SortMergeOuterJoin to reduce duplication	Andrew Or	2015-09-10	1	-61/+77
\| \| \| \| \| \| \| \|	`LeftOutputIterator` and `RightOutputIterator` are symmetrically identical and can share a lot of code. If someone makes a change in one but forgets to do the same thing in the other we'll end up with inconsistent behavior. This patch also adds inline comments to clarify the intention of the code. Author: Andrew Or <andrew@databricks.com> Closes #8596 from andrewor14/smoj-cleanup.
*	[SPARK-10049] [SPARKR] Support collecting data of ArrayType in DataFrame.	Sun Rui	2015-09-10	11	-151/+250
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR: 1. Enhances reflection in RBackend by automatically matching a Java array to a Scala Seq when finding methods. Util functions like seq() and listToSeq() on the R side can be removed, as they will conflict with the SerDe logic that transfers a Scala Seq to the R side. 2. Enhances the SerDe to support transferring a Scala Seq to the R side. Data of ArrayType in a DataFrame after collection is observed to be of Scala Seq type. 3. Supports ArrayType in createDataFrame(). Author: Sun Rui <rui.sun@intel.com> Closes #8458 from sun-rui/SPARK-10049.
*	[SPARK-9990] [SQL] Create local hash join operator	zsxwing	2015-09-10	16	-24/+455
\| \| \| \| \| \| \| \| \| \| \|	This PR includes the following changes: - Add SQLConf to LocalNode - Add HashJoinNode - Add ConvertToUnsafeNode and ConvertToSafeNode.scala to test unsafe hash join. Author: zsxwing <zsxwing@gmail.com> Closes #8535 from zsxwing/SPARK-9990.
*	[SPARK-10514] [MESOS] waiting for min no of total cores acquired by Spark by ↵	Akash Mishra	2015-09-10	2	-2/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	implementing the sufficientResourcesRegistered method The spark.scheduler.minRegisteredResourcesRatio configuration parameter works in YARN mode but not in Mesos coarse-grained mode. If the parameter is not specified, the default value of 0 is set for spark.scheduler.minRegisteredResourcesRatio in the base class and this method always returns true. There are no existing tests for YARN mode either, so no test was added for this change. Author: Akash Mishra <akash.mishra20@gmail.com> Closes #8672 from SleepyThread/master.
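The readiness check this commit talks about can be sketched as a single comparison. The Python below is an illustration under assumptions: the function and parameter names are hypothetical, and the exact expression used in the Mesos backend is not shown in this log.

```python
def sufficient_resources_registered(total_cores_acquired, max_cores,
                                    min_registered_ratio):
    # Ready once the acquired cores reach the configured fraction of the maximum.
    return total_cores_acquired >= max_cores * min_registered_ratio
```

With a ratio of 0 the comparison holds for any number of acquired cores, which is consistent with the message's remark that the method always returns true under the default.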
*	[SPARK-6350] [MESOS] Fine-grained mode scheduler respects mesosExecutor.cores	Iulian Dragos	2015-09-10	2	-3/+33
\| \| \| \| \| \| \| \| \| \|	This is a regression introduced in #4960, this commit fixes it and adds a test. tnachen andrewor14 please review, this should be an easy one. Author: Iulian Dragos <jaguarul@gmail.com> Closes #8653 from dragos/issue/mesos/fine-grained-maxExecutorCores.
*	[SPARK-8167] Make tasks that fail from YARN preemption not fail job	mcheah	2015-09-10	17	-79/+261
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The architecture is that, in YARN mode, if the driver detects that an executor has disconnected, it asks the ApplicationMaster why the executor died. If the ApplicationMaster is aware that the executor died because of preemption, all tasks associated with that executor are not marked as failed. The executor is still removed from the driver's list of available executors, however. There's a few open questions: 1. Should standalone mode have a similar "get executor loss reason" as well? I localized this change as much as possible to affect only YARN, but there could be a valid case to differentiate executor losses in standalone mode as well. 2. I make a pretty strong assumption in YarnAllocator that getExecutorLossReason(executorId) will only be called once per executor id; I do this so that I can remove the metadata from the in-memory map to avoid object accumulation. It's not clear if I'm being overly zealous to save space, however. cc vanzin specifically for review because it collided with some earlier YARN scheduling work. cc JoshRosen because it's similar to output commit coordination we did in the past cc andrewor14 for our discussion on how to get executor exit codes and loss reasons Author: mcheah <mcheah@palantir.com> Closes #8007 from mccheah/feature/preemption-handling.
*	[SPARK-10469] [DOC] Try and document the three options	Holden Karau	2015-09-10	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	From JIRA: Add documentation for tungsten-sort. From the mailing list "I saw a new "spark.shuffle.manager=tungsten-sort" implemented in https://issues.apache.org/jira/browse/SPARK-7081, but it can't be found its corresponding description in http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/configuration.html (Currently there are only 'sort' and 'hash' two options)." Author: Holden Karau <holden@pigscanfly.ca> Closes #8638 from holdenk/SPARK-10469-document-tungsten-sort.
*	[SPARK-10466] [SQL] UnsafeRow SerDe exception with data spill	Cheng Hao	2015-09-10	3	-5/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Data Spill with UnsafeRow causes assert failure.
```
java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:165)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
```
To reproduce that with code (thanks andrewor14):
```scala
bin/spark-shell --master local --conf spark.shuffle.memoryFraction=0.005 --conf spark.shuffle.sort.bypassMergeThreshold=0
sc.parallelize(1 to 2 * 1000 * 1000, 10)
  .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()
```
Author: Cheng Hao <hao.cheng@intel.com> Closes #8635 from chenghao-intel/unsafe_spill.
*	[SPARK-10301] [SPARK-10428] [SQL] Addresses comments of PR #8583 and #8509 ↵	Cheng Lian	2015-09-10	4	-45/+522
\| \| \| \| \| \| \| \|	for master Author: Cheng Lian <lian@databricks.com> Closes #8670 from liancheng/spark-10301/address-pr-comments.
*	[SPARK-7142] [SQL] Minor enhancement to BooleanSimplification Optimizer rule	Yash Datta	2015-09-10	2	-0/+25
\| \| \| \| \| \| \| \| \| \| \| \|	Use these in the optimizer as well: A and (not(A) or B) => A and B not(A and B) => not(A) or not(B) not(A or B) => not(A) and not(B) Author: Yash Datta <Yash.Datta@guavus.com> Closes #5700 from saucam/bool_simp.
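The three rewrites listed in this commit message are ordinary Boolean identities, so they can be checked exhaustively over all truth assignments. The Python below is a small verification sketch (the function name `rules_hold` is hypothetical); it says nothing about how the optimizer rule itself is implemented.

```python
from itertools import product


def rules_hold():
    """Check the three rewrite rules for every combination of A and B."""
    for a, b in product([False, True], repeat=2):
        # A and (not(A) or B)  =>  A and B
        if (a and ((not a) or b)) != (a and b):
            return False
        # not(A and B)  =>  not(A) or not(B)
        if (not (a and b)) != ((not a) or (not b)):
            return False
        # not(A or B)  =>  not(A) and not(B)
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True
```

The check covers two-valued logic only; SQL's three-valued logic with NULL is outside the scope of this sketch.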
*	[SPARK-10065] [SQL] avoid the extra copy when generate unsafe array	Wenchen Fan	2015-09-10	1	-60/+24
\| \| \| \| \| \| \| \| \| \| \| \|	The reason for this extra copy is that we iterate the array twice: once to calculate the elements' data size and once to copy the elements to the array buffer. A simple solution is to follow `createCodeForStruct`: we can dynamically grow the buffer when needed and thus don't need to know the data size ahead of time. This PR also includes some typo and style fixes, and does some minor refactoring to make sure `input.primitive` is always a variable name, not code, when generating unsafe code. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8496 from cloud-fan/avoid-copy.
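The contrast between sizing a buffer up front and growing it on demand can be sketched in Python with byte strings. This is a loose analogy under assumptions, not the generated code the commit changes; both function names and the doubling policy are choices made for this sketch.

```python
def write_two_pass(values):
    # First pass: compute the total size; second pass: copy the elements.
    total = sum(len(v) for v in values)
    buf = bytearray(total)
    pos = 0
    for v in values:
        buf[pos:pos + len(v)] = v
        pos += len(v)
    return bytes(buf)


def write_growing(values, initial_capacity=8):
    # Single pass: double the buffer whenever the next element does not fit.
    buf = bytearray(max(1, initial_capacity))
    pos = 0
    for v in values:
        while pos + len(v) > len(buf):
            buf.extend(bytes(len(buf)))  # append len(buf) zero bytes
        buf[pos:pos + len(v)] = v
        pos += len(v)
    return bytes(buf[:pos])
```

Both functions produce the same bytes; the second never needs the total size in advance, at the cost of occasional reallocation.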
*	[SPARK-10497] [BUILD] [TRIVIAL] Handle both locations for JIRAError with ↵	Holden Karau	2015-09-10	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \|	python-jira The location of JIRAError has moved between old and new versions of the python-jira package. Longer term it probably makes sense to pin to specific versions (as mentioned in https://issues.apache.org/jira/browse/SPARK-10498 ) but for now, make the release tools work with both new and old versions of python-jira. Author: Holden Karau <holden@pigscanfly.ca> Closes #8661 from holdenk/SPARK-10497-release-utils-does-not-work-with-new-jira-python.
*	[MINOR] [MLLIB] [ML] [DOC] fixed typo: label for negative result should be ↵	Sean Paradiso	2015-09-09	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	0.0 (original: 1.0) Small typo in the example for `LabeledPoint` in the MLlib docs. Author: Sean Paradiso <seanparadiso@gmail.com> Closes #8680 from sparadiso/docs_mllib_smalltypo.
*	[SPARK-9772] [PYSPARK] [ML] Add Python API for ml.feature.VectorSlicer	Yanbo Liang	2015-09-09	1	-5/+90
\| \| \| \| \| \| \| \|	Add Python API for ml.feature.VectorSlicer. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8102 from yanboliang/SPARK-9772.
*	[SPARK-9730] [SQL] Add Full Outer Join support for SortMergeJoin	Liang-Chi Hsieh	2015-09-09	5	-34/+259
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR is based on #8383 , thanks to viirya JIRA: https://issues.apache.org/jira/browse/SPARK-9730 This patch adds the Full Outer Join support for SortMergeJoin. A new class SortMergeFullJoinScanner is added to scan rows from left and right iterators. FullOuterIterator is simply a wrapper of type RowIterator to consume joined rows from SortMergeFullJoinScanner. Closes #8383 Author: Liang-Chi Hsieh <viirya@appier.com> Author: Davies Liu <davies@databricks.com> Closes #8579 from davies/smj_fullouter.
*	[SPARK-10461] [SQL] make sure `input.primitive` is always variable name not ↵	Wenchen Fan	2015-09-09	5	-67/+75
\| \| \| \| \| \| \| \| \| \| \| \|	code at `GenerateUnsafeProjection` When we generate unsafe code inside `createCodeForXXX`, we always assign `input.primitive` to a temp variable in case `input.primitive` is expression code. This PR does some refactoring to make sure `input.primitive` is always a variable name, and includes some other typo and style fixes. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8613 from cloud-fan/minor.
*	[SPARK-10481] [YARN] SPARK_PREPEND_CLASSES make spark-yarn related jar could ↵	Jeff Zhang	2015-09-09	1	-1/+4
\| \| \| \| \| \| \| \| \| \|	n… Throw a more readable exception. Please help review. Thanks Author: Jeff Zhang <zjffdu@apache.org> Closes #8649 from zjffdu/SPARK-10481.
*	[SPARK-10117] [MLLIB] Implement SQL data source API for reading LIBSVM data	lewuathe	2015-09-09	4	-0/+256
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It is convenient to implement the data source API for the LIBSVM format to have a better integration with DataFrames and the ML pipeline API. Two options are implemented:
* `numFeatures`: Specify the dimension of features vector
* `featuresType`: Specify the type of output vector. `sparse` is default.
Author: lewuathe <lewuathe@me.com> Closes #8537 from Lewuathe/SPARK-10117 and squashes the following commits:
986999d [lewuathe] Change unit test phrase
11d513f [lewuathe] Fix some reviews
21600a4 [lewuathe] Merge branch 'master' into SPARK-10117
9ce63c7 [lewuathe] Rewrite service loader file
1fdd2df [lewuathe] Merge branch 'SPARK-10117' of github.com:Lewuathe/spark into SPARK-10117
ba3657c [lewuathe] Merge branch 'master' into SPARK-10117
0ea1c1c [lewuathe] LibSVMRelation is registered into META-INF
4f40891 [lewuathe] Improve test suites
5ab62ab [lewuathe] Merge branch 'master' into SPARK-10117
8660d0e [lewuathe] Fix Java unit test
b56a948 [lewuathe] Merge branch 'master' into SPARK-10117
2c12894 [lewuathe] Remove unnecessary tag
7d693c2 [lewuathe] Resolv conflict
62010af [lewuathe] Merge branch 'master' into SPARK-10117
a97ee97 [lewuathe] Fix some points
aef9564 [lewuathe] Fix
70ee4dd [lewuathe] Add Java test
3fd8dce [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
40d3027 [lewuathe] Add Java test
7056d4a [lewuathe] Merge branch 'master' into SPARK-10117
99accaa [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
*	[SPARK-10227] fatal warnings with sbt on Scala 2.11	Luc Bourlier	2015-09-09	60	-151/+158
\| \| \| \| \| \| \| \| \| \| \|	The bulk of the changes concern the `transient` annotation on class parameters. Often the compiler doesn't generate a field for these parameters, so the transient annotation would be unnecessary. But if the class parameters are used in methods, then fields are created, so it is safer to keep the annotations. The remainder are some potential bugs and deprecated syntax. Author: Luc Bourlier <luc.bourlier@typesafe.com> Closes #8433 from skyluc/issue/sbt-2.11.
*	[SPARK-10249] [ML] [DOC] Add Python Code Example to StopWordsRemover User Guide	Yuhao Yang	2015-09-08	1	-0/+19
\| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-10249 update user guide since python support added. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #8620 from hhbyyh/swPyDocExample.
*	[SPARK-9654] [ML] [PYSPARK] Add IndexToString to PySpark	Holden Karau	2015-09-08	3	-6/+73
\| \| \| \| \| \| \| \|	Adds IndexToString to PySpark. Author: Holden Karau <holden@pigscanfly.ca> Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
*	[SPARK-10094] Pyspark ML Feature transformers marked as experimental	noelsmith	2015-09-08	1	-0/+52
\| \| \| \| \| \| \| \|	Modified class-level docstrings to mark all feature transformers in pyspark.ml as experimental. Author: noelsmith <mail@noelsmith.com> Closes #8623 from noel-smith/SPARK-10094-mark-pyspark-ml-trans-exp.
*	[SPARK-10373] [PYSPARK] move @since into pyspark from sql	Davies Liu	2015-09-08	9	-25/+23
\| \| \| \| \| \| \| \|	cc mengxr Author: Davies Liu <davies@databricks.com> Closes #8657 from davies/move_since.
*	[SPARK-10464] [MLLIB] Add WeibullGenerator for RandomDataGenerator	Yanbo Liang	2015-09-08	2	-3/+40
\| \| \| \| \| \| \| \| \|	Add WeibullGenerator for RandomDataGenerator. #8611 needs to use WeibullGenerator to generate random data based on the Weibull distribution. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8622 from yanboliang/spark-10464.
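Generating Weibull-distributed values can be illustrated with textbook inverse-transform sampling. The Python below is a sketch of that general technique with hypothetical names; it makes no claim about how WeibullGenerator itself draws its samples.

```python
import math
import random


def weibull_sample(shape, scale, rng):
    # Inverse-transform sampling: solve u = 1 - exp(-(x / scale) ** shape) for x.
    u = rng.random()  # uniform in [0, 1), so 1 - u is in (0, 1]
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def weibull_samples(shape, scale, n, seed=0):
    # A seeded generator keeps the sketch reproducible.
    rng = random.Random(seed)
    return [weibull_sample(shape, scale, rng) for _ in range(n)]
```

With shape 1 the distribution reduces to an exponential with mean equal to the scale, which gives a simple sanity check on the sample mean.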
*	[SPARK-9834] [MLLIB] implement weighted least squares via normal equation	Xiangrui Meng	2015-09-08	4	-1/+438
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The goal of this PR is to have a weighted least squares implementation that takes the normal equation approach, and hence to be able to provide R-like summary statistics and support IRLS (used by GLMs). The tests match R's lm and glmnet. There are a couple of TODOs that can be addressed in future PRs:
* consolidate summary statistics aggregators
* move `dspr` to `BLAS`
* etc
It would be nice to have this merged first because it blocks a couple of other features. dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8588 from mengxr/SPARK-9834.
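The normal equation approach can be shown on the smallest useful case: one feature plus an intercept, where the weighted normal equations form a 2x2 system that can be solved in closed form. The Python below is a sketch under that simplifying assumption, with a hypothetical function name; it does not reflect the structure or numerics of the implementation in this PR.

```python
def weighted_least_squares(xs, ys, ws):
    """Fit y ~ b0 + b1 * x by minimizing sum(w * (y - b0 - b1 * x) ** 2)."""
    # Weighted sums that make up the 2x2 normal equations.
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    # Solve [[sw, swx], [swx, swxx]] @ [b0, b1] = [swy, swxy].
    det = sw * swxx - swx * swx
    if det == 0:
        raise ValueError("normal equations are singular")
    b1 = (sw * swxy - swx * swy) / det
    b0 = (swy - b1 * swx) / sw
    return b0, b1
```

When the data lie exactly on a line, the fit recovers that line regardless of the weights; the weights only matter once the observations disagree.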