Commit message (Author, Date, Files, Lines -/+)
* [SPARK-12232][SPARKR] New R API for read.table to avoid name conflict (felixcheung, 2016-01-19, 5 files, -27/+21)
shivaram Sorry it took longer to fix some conflicts; this is the change to add an alias for `table`. Author: felixcheung <felixcheung_m@hotmail.com> Closes #10406 from felixcheung/readtable.
* Revert "[SPARK-11295] Add packages to JUnit output for Python tests" (Xiangrui Meng, 2016-01-19, 5 files, -18/+10)
This reverts commit c6f971b4aeca7265ab374fa46c5c452461d9b6a7.
* [SPARK-12337][SPARKR] Implement dropDuplicates() method of DataFrame in SparkR (Sun Rui, 2016-01-19, 4 files, -1/+75)
Author: Sun Rui <rui.sun@intel.com> Closes #10309 from sun-rui/SPARK-12337.
* [SPARK-12168][SPARKR] Add automated tests for conflicted function in R (felixcheung, 2016-01-19, 2 files, -1/+24)
Currently this is reported when loading the SparkR package in R (probably would add is.nan):
```
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from ‘package:base’:

    colnames, colnames<-, intersect, rank, rbind, sample, subset,
    summary, table, transform
```
Adding this test gives us an automated way to track changes to masked methods. Also, the second part of this test checks for those functions that would not be accessible without a namespace/package prefix. Incidentally, this might point to how we would fix those inaccessible functions in base or stats. Looking for feedback on adding this test. Author: felixcheung <felixcheung_m@hotmail.com> Closes #10171 from felixcheung/rmaskedtest.
* [SPARK-12770][SQL] Implement rules for branch elimination for CaseWhen (Reynold Xin, 2016-01-19, 2 files, -0/+55)
The three optimization cases are:
1. If the first branch's condition is a true literal, remove the CaseWhen and use the value from that branch.
2. If a branch's condition is a false or null literal, remove that branch.
3. If only the else branch is left, remove the CaseWhen and use the value from the else branch.
Author: Reynold Xin <rxin@databricks.com> Closes #10827 from rxin/SPARK-12770.
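A minimal, self-contained sketch of these three rules over a toy AST (illustrative names only, not Spark's actual Catalyst classes):
```scala
sealed trait Expr
case class Literal(value: Any) extends Expr
case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Expr) extends Expr

def simplifyCaseWhen(cw: CaseWhen): Expr = {
  // Rule 2: drop branches whose condition is a false or null literal.
  val kept = cw.branches.filter {
    case (Literal(false) | Literal(null), _) => false
    case _ => true
  }
  kept match {
    // Rule 1: if the first remaining branch's condition is a true
    // literal, the whole CaseWhen collapses to that branch's value.
    case (Literal(true), value) +: _ => value
    // Rule 3: if no branches remain, only the else value is left.
    case Seq() => cw.elseValue
    case remaining => CaseWhen(remaining, cw.elseValue)
  }
}
```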
* [SPARK-9716][ML] BinaryClassificationEvaluator should accept Double prediction column (BenFradet, 2016-01-19, 4 files, -5/+58)
This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10472 from BenFradet/SPARK-9716.
* [SPARK-2750][WEB UI] Add https support to the Web UI (scwf, 2016-01-19, 22 files, -93/+338)
Author: scwf <wangfei1@huawei.com> Author: Marcelo Vanzin <vanzin@cloudera.com> Author: WangTaoTheTonic <wangtao111@huawei.com> Author: w00228970 <wangfei1@huawei.com> Closes #10238 from vanzin/SPARK-2750.
* [BUILD] Runner for spark packages (Michael Armbrust, 2016-01-19, 1 file, -0/+15)
This is a convenience method added to the SBT build for developers, though if people think it's useful we could consider adding an official script that runs using the assembly instead of compiling on demand. It simply compiles Spark (without requiring an assembly) and invokes Spark Submit to download/run the package. Example usage:
```
$ build/sbt
> sparkPackage com.databricks:spark-sql-perf_2.10:0.2.4 com.databricks.spark.sql.perf.RunBenchmark --help
```
Author: Michael Armbrust <michael@databricks.com> Closes #10834 from marmbrus/sparkPackageRunner.
* [SPARK-11295] Add packages to JUnit output for Python tests (Gábor Lipták, 2016-01-19, 5 files, -10/+18)
This improves grouping/display of test case results. Author: Gábor Lipták <gliptak@gmail.com> Closes #9263 from gliptak/SPARK-11295.
* [SPARK-12816][SQL] De-alias type when generating schemas (Jakob Odersky, 2016-01-19, 2 files, -1/+12)
Call `dealias` on local types to fix schema generation for abstract type members, such as:
```scala
type KeyValue = (Int, String)
```
Also adds a simple test. Author: Jakob Odersky <jodersky@gmail.com> Closes #10749 from jodersky/aliased-schema.
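A small sketch of what dealiasing does, assuming Scala 2.11+ reflection where `Type.dealias` is available (not the patch itself):
```scala
import scala.reflect.runtime.universe._

object DealiasDemo {
  type KeyValue = (Int, String)

  def main(args: Array[String]): Unit = {
    val tpe = typeOf[KeyValue]
    println(tpe)          // the opaque alias: DealiasDemo.KeyValue
    println(tpe.dealias)  // the underlying type: (Int, String)
  }
}
```
Without dealiasing, schema generation sees only the alias and cannot inspect the underlying tuple's fields.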
* [SPARK-12560][SQL] SqlTestUtils.stripSparkFilter needs to copy utf8strings (Imran Rashid, 2016-01-19, 1 file, -1/+1)
See https://issues.apache.org/jira/browse/SPARK-12560. This isn't causing any problems currently because the tests for string predicate pushdown are disabled. I ran into this while trying to turn them back on with a different version of Parquet; it seemed good to fix now in any case. Author: Imran Rashid <irashid@cloudera.com> Closes #10510 from squito/SPARK-12560.
* [SPARK-12867][SQL] Nullability of Intersect can be stricter (gatorsmile, 2016-01-19, 2 files, -6/+33)
JIRA: https://issues.apache.org/jira/browse/SPARK-12867 When intersecting one nullable column with one non-nullable column, the result will not contain any null. Thus, we can make the nullability of `intersect` stricter. liancheng Could you please check if the code changes are appropriate? Also added test cases to verify the results. Thanks! Author: gatorsmile <gatorsmile@gmail.com> Closes #10812 from gatorsmile/nullabilityIntersect.
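An illustrative sketch of the stricter rule (toy types, not Spark's code): an INTERSECT row must appear on both sides, so a column is nullable only when both corresponding inputs are nullable.
```scala
case class Col(name: String, nullable: Boolean)

def intersectOutput(left: Seq[Col], right: Seq[Col]): Seq[Col] =
  left.zip(right).map { case (l, r) =>
    // null can survive the intersection only if both sides allow it
    l.copy(nullable = l.nullable && r.nullable)
  }

// e.g. intersecting a nullable column with a non-nullable one:
// intersectOutput(Seq(Col("a", true)), Seq(Col("a", false)))
//   == Seq(Col("a", false))
```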
* [SPARK-12804][ML] Fix LogisticRegression with FitIntercept on all same label training data (Feynman Liang, 2016-01-19, 2 files, -95/+148)
CC jkbradley mengxr dbtsai Author: Feynman Liang <feynman.liang@gmail.com> Closes #10743 from feynmanliang/SPARK-12804.
* [SPARK-12887] Do not expose vars in TaskMetrics (Andrew Or, 2016-01-19, 27 files, -246/+281)
This is a step in implementing SPARK-10620, which migrates TaskMetrics to accumulators. TaskMetrics has a bunch of vars; some are fully public, some are `private[spark]`. This is bad coding style that makes it easy to accidentally overwrite previously set metrics. This has happened a few times in the past and caused bugs that were difficult to debug. Instead, we should have get-or-create semantics, which are more readily understandable. This makes sense in the case of TaskMetrics because these are just aggregated metrics that we want to collect throughout the task, so it doesn't matter who's incrementing them. Parent PR: #10717 Author: Andrew Or <andrew@databricks.com> Author: Josh Rosen <joshrosen@databricks.com> Author: andrewor14 <andrew@databricks.com> Closes #10815 from andrewor14/get-or-create-metrics.
* [SPARK-12870][SQL] Better format for bucket id in file name (Wenchen Fan, 2016-01-19, 4 files, -7/+13)
For a normal Parquet file without buckets, the file name ends with a jobUUID, which may be all digits and mistakenly regarded as a bucket id. This PR improves the format of the bucket id in the file name by using a different separator, `_`, so that the regex is more robust. Author: Wenchen Fan <wenchen@databricks.com> Closes #10799 from cloud-fan/fix-bucket.
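A hedged sketch of the ambiguity and the fix (the file-name layout and regex here are assumed for illustration, not Spark's exact ones): if the bucket id is glued directly onto an all-digit jobUUID, a trailing-digits regex cannot tell them apart; a dedicated `_` separator makes extraction unambiguous.
```scala
// Assumed pattern: bucket id appears as "_<digits>" right before the extension.
val bucketIdPattern = "_(\\d+)\\.".r

def bucketIdOf(fileName: String): Option[Int] =
  bucketIdPattern.findFirstMatchIn(fileName).map(_.group(1).toInt)

// bucketIdOf("part-r-00009-1234567890_00002.parquet") == Some(2)
```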
* [SPARK-11944][PYSPARK][MLLIB] Python mllib.clustering.bisecting k-means (Holden Karau, 2016-01-19, 3 files, -5/+159)
From the coverage issues for 1.6: add a Python API for mllib.clustering.BisectingKMeans. Author: Holden Karau <holden@us.ibm.com> Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
* [MLLIB] Fix CholeskyDecomposition assertion's message (Wojciech Jurczyk, 2016-01-19, 1 file, -1/+1)
Change the assertion's message so it is consistent with the code. The old message says the invoked method was lapack.dports, whereas in fact it was the lapack.dppsv method. Author: Wojciech Jurczyk <wojtek.jurczyk@gmail.com> Closes #10818 from wjur/wjur/rename_error_message.
* [SPARK-7683][PYSPARK] Confusing behavior of fold function of RDD in PySpark (Sean Owen, 2016-01-19, 1 file, -1/+1)
Fix the order of arguments that PySpark's RDD.fold passes to its op: it should be (acc, obj), like other implementations. Obviously, this is a potentially breaking change, so it can only happen for 2.x. CC davies Author: Sean Owen <sowen@cloudera.com> Closes #10771 from srowen/SPARK-7683.
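A Scala analogue (not PySpark's code) of the contract the fix restores: the combining function receives (accumulator, element), in that order, and with an asymmetric op, swapping the arguments changes the result.
```scala
val nums = Seq(1, 2, 3, 4)
// (acc, obj) order: ((((0 - 1) - 2) - 3) - 4) = -10
val correct = nums.foldLeft(0)((acc, obj) => acc - obj)
// (obj, acc) order gives a different answer entirely: 2
val swapped = nums.foldLeft(0)((acc, obj) => obj - acc)
println((correct, swapped)) // (-10, 2)
```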
* [SQL][MINOR] Fix one little mismatched comment according to the code in interface.scala (proflin, 2016-01-19, 1 file, -1/+1)
Author: proflin <proflin.me@gmail.com> Closes #10824 from proflin/master.
* [SPARK-12668][SQL] Providing aliases for CSV options to be similar to Pandas and R (hyukjinkwon, 2016-01-18, 3 files, -6/+20)
https://issues.apache.org/jira/browse/SPARK-12668 The Spark CSV datasource is being merged (filed in [SPARK-12420](https://issues.apache.org/jira/browse/SPARK-12420)). This is a quick PR that simply adds aliases for several CSV options to be similar to Pandas and R:
- delimiter -> sep
- charset -> encoding
Author: hyukjinkwon <gurwls223@gmail.com> Closes #10800 from HyukjinKwon/SPARK-12668.
* [HOT][BUILD] Changed the import order (gatorsmile, 2016-01-18, 2 files, -2/+2)
This PR fixes the master's build break. The following tests failed due to import order issues in master: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49651/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49652/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49653/consoleFull Author: gatorsmile <gatorsmile@gmail.com> Closes #10823 from gatorsmile/importOrder.
* [SPARK-12885][MINOR] Rename 3 fields in ShuffleWriteMetrics (Andrew Or, 2016-01-18, 24 files, -114/+126)
This is a small step in implementing SPARK-10620, which migrates TaskMetrics to accumulators. This patch is strictly a cleanup patch and introduces no change in functionality. It literally just renames 3 fields for consistency. Today we have:
```
inputMetrics.recordsRead
outputMetrics.bytesWritten
shuffleReadMetrics.localBlocksFetched
...
shuffleWriteMetrics.shuffleRecordsWritten
shuffleWriteMetrics.shuffleBytesWritten
shuffleWriteMetrics.shuffleWriteTime
```
The shuffle write ones are redundant; we can drop the `shuffle` part from the method names. I added backward-compatible (but deprecated) methods with the old names. Parent PR: #10717 Author: Andrew Or <andrew@databricks.com> Closes #10811 from andrewor14/rename-things.
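An illustrative sketch (not the actual Spark class) of the rename plus a backward-compatible, deprecated forwarder carrying the old name:
```scala
class ShuffleWriteMetricsSketch {
  private var _bytesWritten: Long = 0L

  // new, shorter name
  def bytesWritten: Long = _bytesWritten
  def incBytesWritten(v: Long): Unit = _bytesWritten += v

  // old name kept alive for source compatibility, but flagged
  @deprecated("use bytesWritten instead", "2.0.0")
  def shuffleBytesWritten: Long = bytesWritten
}
```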
* [SPARK-12700] [SQL] Embed condition into SMJ and BroadcastHashJoin (Davies Liu, 2016-01-18, 6 files, -72/+96)
Currently SortMergeJoin and BroadcastHashJoin do not support a join condition; they need a Filter on top for that, and the result projection to generate UnsafeRow could be very expensive if they generate lots of rows that are mostly filtered out by the condition. This PR brings support for a condition to SortMergeJoin and BroadcastHashJoin, just like other outer joins have. This improves the performance of Q72 by 7x (from 120s to 16.5s). Author: Davies Liu <davies@databricks.com> Closes #10653 from davies/filter_join.
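An assumed example of the query shape this affects (written against the modern DataFrame API for brevity, not the patch itself): a join mixing an equi part (k = k2) with a non-equi part (c > d). Before the change, the non-equi part would be planned as a separate Filter above the join; afterwards the join operator evaluates it while matching rows.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val left  = Seq((1, 10), (2, 20)).toDF("k", "c")
val right = Seq((1, 5),  (2, 30)).toDF("k2", "d")

// equi condition plus non-equi condition, evaluated inside the join
val joined = left.join(right, left("k") === right("k2") && left("c") > right("d"))
```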
* [SPARK-12889][SQL] Rename ParserDialect -> ParserInterface (Reynold Xin, 2016-01-18, 7 files, -10/+10)
Based on discussions in #10801, I'm submitting a pull request to rename ParserDialect to ParserInterface. Author: Reynold Xin <rxin@databricks.com> Closes #10817 from rxin/SPARK-12889.
* [SPARK-12894][DOCUMENT] Add deploy instructions for Python in Kinesis integration doc (Shixiong Zhu, 2016-01-18, 1 file, -2/+12)
This PR added instructions for Python users to get the Kinesis assembly jar in the Kinesis integration page, like the Kafka doc. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10822 from zsxwing/kinesis-doc.
* Revert "[SPARK-12829] Turn Java style checker on" (Shixiong Zhu, 2016-01-18, 1 file, -1/+2)
This reverts commit 591c88c9e2a6c2e2ca84f1b66c635f198a16d112. `lint-java` doesn't work on a machine with a clean Maven cache.
* [SPARK-12814][DOCUMENT] Add deploy instructions for Python in flume integration doc (Shixiong Zhu, 2016-01-18, 2 files, -4/+13)
This PR added instructions for Python users to get the flume assembly jar in the flume integration page, like the Kafka doc. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10746 from zsxwing/flume-doc.
* [SPARK-12882][SQL] Simplify bucket tests and add more comments (Wenchen Fan, 2016-01-18, 2 files, -46/+78)
Right now the bucket tests are kind of hard to understand; this PR simplifies them and adds more comments. Author: Wenchen Fan <wenchen@databricks.com> Closes #10813 from cloud-fan/bucket-comment.
* [SPARK-12841][SQL] Fix cast in filter (Wenchen Fan, 2016-01-18, 3 files, -8/+18)
In SPARK-10743 we wrapped casts with `UnresolvedAlias` to give `Cast` a better alias if possible. However, for cases like `filter`, the `UnresolvedAlias` can't be resolved, and we don't actually need a better alias in this case. This PR moves the cast-wrapping logic to `Column.named` so that we only do it when we need an alias name. Author: Wenchen Fan <wenchen@databricks.com> Closes #10781 from cloud-fan/bug.
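A sketch of the two shapes involved, assuming `df` is any existing DataFrame with a string column "age" (illustration only): a cast inside a filter predicate never needs an alias, while a cast in a select list does, and `Column.named` supplies the alias only in the latter case.
```scala
// predicate position: no alias is ever needed here
val adults = df.filter(df("age").cast("int") > 21)

// named output position: this is where Column.named attaches the alias
val ages = df.select(df("age").cast("int"))
```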
* [SPARK-12855][SQL] Remove parser dialect developer API (Reynold Xin, 2016-01-18, 13 files, -139/+16)
This pull request removes the public developer parser API for external parsers. Given that everything a parser depends on (e.g. logical plans and expressions) is internal and not stable, external parsers will break with every release of Spark. It is a bad idea to create the illusion that Spark actually supports pluggable parsers. In addition, this also reduces incentives for 3rd-party projects to contribute parser improvements back to Spark. Author: Reynold Xin <rxin@databricks.com> Closes #10801 from rxin/SPARK-12855.
* [SPARK-10985][CORE] Avoid passing evicted blocks throughout BlockManager (Josh Rosen, 2016-01-18, 14 files, -241/+170)
This patch refactors portions of the BlockManager and CacheManager in order to avoid having to pass `evictedBlocks` lists throughout the code. It appears that these lists were only consumed by `TaskContext.taskMetrics`, so the new code now directly updates the metrics from the lower-level BlockManager methods. Author: Josh Rosen <joshrosen@databricks.com> Closes #10776 from JoshRosen/SPARK-10985.
* [SPARK-12884] Move classes to their own files for readability (Andrew Or, 2016-01-18, 8 files, -360/+493)
This is a small step in implementing SPARK-10620, which migrates `TaskMetrics` to accumulators. This patch is strictly a cleanup patch and introduces no change in functionality. It literally just moves classes to their own files to avoid having single monolithic ones that contain 10 different classes. Parent PR: #10717 Author: Andrew Or <andrew@databricks.com> Closes #10810 from andrewor14/move-things.
* [SPARK-12346][ML] Missing attribute names in GLM for vector-type features (Eric Liang, 2016-01-18, 3 files, -5/+43)
Currently `summary()` fails on a GLM model fitted over a vector feature missing ML attrs, since the output feature attrs will also have no name. We can avoid this situation by forcing `VectorAssembler` to make up suitable names when inputs are missing names. cc mengxr Author: Eric Liang <ekl@databricks.com> Closes #10323 from ericl/spark-12346.
* [SPARK-12873][SQL] Add more comments in HiveTypeCoercion for type widening (Reynold Xin, 2016-01-18, 2 files, -40/+49)
I was reading this part of the analyzer code again and got confused by the difference between findWiderTypeForTwo and findTightestCommonTypeOfTwo. I also simplified WidenSetOperationTypes to make it a lot simpler. The easiest way to review this one is to just read the original code and the new code; the logic is super simple. Author: Reynold Xin <rxin@databricks.com> Closes #10802 from rxin/SPARK-12873.
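A toy model of the distinction being documented (explicitly not Spark's implementation): the "tightest" common type must lose no information, while the "wider" variant may fall back to a lossy but universally usable type.
```scala
sealed trait T
case object IntT extends T; case object LongT extends T
case object DoubleT extends T; case object StringT extends T

def findTightest(a: T, b: T): Option[T] = (a, b) match {
  case _ if a == b                   => Some(a)
  case (IntT, LongT) | (LongT, IntT) => Some(LongT) // lossless promotion
  case _                             => None        // no lossless common type
}

def findWider(a: T, b: T): Option[T] =
  findTightest(a, b).orElse((a, b) match {
    case (IntT | LongT | DoubleT, IntT | LongT | DoubleT) =>
      Some(DoubleT) // numeric widening, possibly losing precision
    case _ =>
      Some(StringT) // last resort: everything can render as a string
  })
```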
* [SPARK-12558][FOLLOW-UP] AnalysisException when multiple functions applied in GROUP BY clause (Dilip Biswal, 2016-01-18, 1 file, -5/+9)
Addresses the comments from Yin on https://github.com/apache/spark/pull/10520. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10758 from dilipbiswal/spark-12558-followup.
* [SPARK-10264][DOCUMENTATION] Added @Since to ml.recommendation (Tommy YU, 2016-01-18, 1 file, -3/+30)
I created a new PR since the original PR had not been updated for a long time. Please help to review. srowen Author: Tommy YU <tummyyu@163.com> Closes #10756 from Wenpei/add_since_to_recomm.
* [SQL] [MINOR] Speed up hashCode for UTF8String (Wenchen Fan, 2016-01-17, 1 file, -5/+2)
Similar to https://github.com/apache/spark/pull/10784, use `Murmur3_x86_32.hashUnsafeBytes` instead. Author: Wenchen Fan <wenchen@databricks.com> Closes #10791 from cloud-fan/string-hashcode.
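A self-contained illustration of the idea using Scala's standard-library Murmur3 (the patch itself calls Spark's `Murmur3_x86_32.hashUnsafeBytes` on the string's backing memory; the seed 42 here is assumed): replace a hand-rolled byte-at-a-time loop with a single Murmur3 call over the whole byte region.
```scala
import scala.util.hashing.MurmurHash3

def slowHash(bytes: Array[Byte]): Int =
  bytes.foldLeft(0)((h, b) => 31 * h + b) // old style: one byte per step

def fastHash(bytes: Array[Byte]): Int =
  MurmurHash3.bytesHash(bytes, 42)        // one bulk call over the region
```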
* [SPARK-12862][SPARKR] Jenkins does not run R tests (felixcheung, 2016-01-17, 2 files, -2/+2)
Slight correction: I'm leaving sparkR as-is (i.e. R files are not supported) and fixed only run-tests.sh, as shivaram described. I also assume we are going to cover all doc changes in https://issues.apache.org/jira/browse/SPARK-12846 instead of here. rxin shivaram zjffdu Author: felixcheung <felixcheung_m@hotmail.com> Closes #10792 from felixcheung/sparkRcmd.
* [SPARK-12860] [SQL] Speed up safe projection for primitive types (Wenchen Fan, 2016-01-17, 1 file, -2/+3)
The idea is simple: use `SpecificMutableRow` instead of `GenericMutableRow` as the result row for safe projection. A simple benchmark shows about a 1.5x speed up for primitive types; code: https://gist.github.com/cloud-fan/fa77713ccebf0823b2ab#file-safeprojectionbenchmark-scala Author: Wenchen Fan <wenchen@databricks.com> Closes #10790 from cloud-fan/safe-projection.
* [SPARK-12796] [SQL] Whole stage codegen (Davies Liu, 2016-01-16, 37 files, -107/+694)
This is the initial work for whole stage codegen; it supports Projection/Filter/Range, and we will continue working on this to support more physical operators. A micro benchmark shows that a query with range, filter and projection could be 3x faster than before. It's turned on by default. For a tree that has at least two chained plans, a WholeStageCodegen will be inserted into it. For example, the following plan
```
Limit 10
+- Project [(id#5L + 1) AS (id + 1)#6L]
   +- Filter ((id#5L & 1) = 1)
      +- Range 0, 1, 4, 10, [id#5L]
```
will be translated into
```
Limit 10
+- WholeStageCodegen
   +- Project [(id#1L + 1) AS (id + 1)#2L]
      +- Filter ((id#1L & 1) = 1)
         +- Range 0, 1, 4, 10, [id#1L]
```
Here is the call graph to generate Java source for A and B (A supports codegen, but B does not):
```
WholeStageCodegen       Plan A            FakeInput        Plan B
=========================================================================

-> execute()
     |
  doExecute() --------> produce()
                           |
                        doProduce() -------> produce()
                                                |
                                             doProduce() ---> execute()
                                                                 |
                                                              consume()
                        doConsume() ------------|
     |
  doConsume() <----- consume()
```
A SparkPlan that supports codegen needs to implement doProduce() and doConsume():
```
def doProduce(ctx: CodegenContext): (RDD[InternalRow], String)
def doConsume(ctx: CodegenContext, child: SparkPlan, input: Seq[ExprCode]): String
```
Author: Davies Liu <davies@databricks.com> Closes #10735 from davies/whole2.
* [SPARK-12722][DOCS] Fixed typo in Pipeline example (Jeff Lam, 2016-01-16, 1 file, -2/+2)
In http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline,
```
val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model")
```
should be
```
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
```
cc: jkbradley Author: Jeff Lam <sha0lin@alumni.carnegiemellon.edu> Closes #10769 from Agent007/SPARK-12722.
* [SPARK-12856] [SQL] Speed up hashCode of unsafe array (Wenchen Fan, 2016-01-16, 1 file, -5/+2)
We previously iterated over the bytes to calculate the hashCode, but now we have `Murmur3_x86_32.hashUnsafeBytes`, which doesn't require the bytes to be word-aligned, so we should use that instead. A simple benchmark shows it's about 3x faster; benchmark code: https://gist.github.com/cloud-fan/fa77713ccebf0823b2ab#file-arrayhashbenchmark-scala Author: Wenchen Fan <wenchen@databricks.com> Closes #10784 from cloud-fan/array-hashcode.
* [SPARK-12840] [SQL] Support passing arbitrary objects (not just expressions) into code generated classes (Davies Liu, 2016-01-15, 11 files, -49/+48)
This is a refactor to support codegen for aggregation and broadcast join. Author: Davies Liu <davies@databricks.com> Closes #10777 from davies/rename2.
* [SPARK-12644][SQL] Update parquet reader to be vectorized (Nong Li, 2016-01-15, 12 files, -56/+625)
This inlines a few of the Parquet decoders and adds vectorized APIs to support decoding in batch. There are a few particulars in the Parquet encodings that make this much more efficient; in particular, RLE encodings are very well suited for batch decoding, as are the Parquet 2.0 encodings. This is a work in progress and does not affect the current execution. In subsequent patches, we will support more encodings and types before enabling this. Simple benchmarks indicate this can decode single ints more than 3x faster. Author: Nong Li <nong@databricks.com> Author: Nong <nongli@gmail.com> Closes #10593 from nongli/spark-12644.
* [SPARK-12649][SQL] Support reading bucketed table (Wenchen Fan, 2016-01-15, 18 files, -45/+314)
This PR adds support for reading bucketed tables and correctly populates `outputPartitioning`, so that we can avoid shuffle for some cases. TODO (follow-up PRs):
* bucket pruning
* avoid shuffle for bucketed table join when using any superset of the bucketing key (we should revisit this after https://issues.apache.org/jira/browse/SPARK-12704 is fixed)
* recognize Hive bucketed tables
Author: Wenchen Fan <wenchen@databricks.com> Closes #10604 from cloud-fan/bucket-read.
* [SPARK-12842][TEST-HADOOP2.7] Add Hadoop 2.7 build profile (Josh Rosen, 2016-01-15, 7 files, -2/+206)
This patch adds a Hadoop 2.7 build profile in order to let us automate tests against that version. /cc rxin srowen Author: Josh Rosen <joshrosen@databricks.com> Closes #10775 from JoshRosen/add-hadoop-2.7-profile.
* [SPARK-12833][HOT-FIX] Reset the locale after we set it (Yin Huai, 2016-01-15, 1 file, -4/+9)
Author: Yin Huai <yhuai@databricks.com> Closes #10778 from yhuai/resetLocale.
* [SPARK-11925][ML][PYSPARK] Add PySpark missing methods for ml.feature during Spark 1.6 QA (Yanbo Liang, 2016-01-15, 1 file, -10/+62)
Add PySpark missing methods and params for ml.feature:
* `RegexTokenizer` should support setting `toLowercase`.
* `MinMaxScalerModel` should support outputting `originalMin` and `originalMax`.
* `PCAModel` should support outputting `pc`.
Author: Yanbo Liang <ybliang8@gmail.com> Closes #9908 from yanboliang/spark-11925.
* [SPARK-12575][SQL] Grammar parity with existing SQL parser (Herman van Hovell, 2016-01-15, 33 files, -972/+286)
In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser and removes it from the code base. Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:
- The SQL Parser allowed syntax like `APPROXIMATE(0.01) COUNT(DISTINCT a)`. In order to make this work we would need to hardcode approximate operators in the parser, or create an approximate expression. `APPROXIMATE_COUNT_DISTINCT(a, 0.01)` would also do the job and is much easier to maintain, so this PR **removes** this keyword.
- The old SQL Parser supported `LIMIT` clauses in nested queries. This is **not supported** anymore; see https://github.com/apache/spark/pull/10689 for the rationale.
- Hive supports a charset-name/charset-literal combination; for instance the expression `_ISO-8859-1 0x4341464562616265` would yield the string `CAFEbabe`. Hive only allows charset names that start with an underscore, which is quite annoying in Spark because tuple field names also start with an underscore. This PR **removes** this feature from the parser; it would be quite easy to implement it as an Expression later on.
- Hive and the SQL Parser treat decimal literals differently. Hive turns any decimal into a `Double`, whereas the SQL Parser converts a non-scientific decimal into a `BigDecimal` and a scientific decimal into a `Double`. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance `81923801.42BD`, which can be used when a big decimal is needed.
cc rxin viirya marmbrus yhuai cloud-fan Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10745 from hvanhovell/SPARK-12575-2.
* [SQL][MINOR] BoundReference does not need to be NamedExpression (Wenchen Fan, 2016-01-15, 1 file, -11/+1)
We made it a `NamedExpression` to work around some hacky cases a long time ago, and now it seems safe to remove it. Author: Wenchen Fan <wenchen@databricks.com> Closes #10765 from cloud-fan/minor.