spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	[SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in ↵	Josh Rosen	2015-09-15	17	-69/+174
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	OutputCommitCoordinator When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive lock on committing once that task fails. However, due to a unit mismatch (we used task attempt number in one place and task attempt id in another) the lock will not be released, causing Spark to go into an infinite retry loop. This bug was masked by the fact that the OutputCommitCoordinator does not have enough end-to-end tests (the current tests use many mocks). Other factors contributing to this bug are the fact that we have many similarly-named identifiers that have different semantics but the same data types (e.g. attemptNumber and taskAttemptId, with inconsistent variable naming which makes them difficult to distinguish). This patch adds a regression test and fixes this bug by always using task attempt numbers throughout this code. Author: Josh Rosen <joshrosen@databricks.com> Closes #8544 from JoshRosen/SPARK-10381.
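The failure mode described in this message can be illustrated with a minimal sketch. It is plain Python rather than Spark's Scala code, and `CommitCoordinator`, `can_commit`, and `task_failed` are hypothetical names chosen for illustration. The only point it makes is that the lock has to be taken and released with the same kind of identifier (here, the per-task attempt number).

```python
class CommitCoordinator:
    """Toy model (hypothetical): one authorized committer per output partition."""

    def __init__(self):
        # partition -> attempt number of the attempt that holds the commit lock
        self._authorized = {}

    def can_commit(self, partition, attempt_number):
        # The first attempt to ask wins the exclusive right to commit.
        holder = self._authorized.setdefault(partition, attempt_number)
        return holder == attempt_number

    def task_failed(self, partition, attempt_number):
        # Release the lock only if the failed attempt is the one holding it.
        # Passing a different kind of identifier here (for example a globally
        # unique task attempt id) would never match, so the lock would stay held.
        if self._authorized.get(partition) == attempt_number:
            del self._authorized[partition]


coordinator = CommitCoordinator()
first = coordinator.can_commit(partition=0, attempt_number=0)   # lock granted
second = coordinator.can_commit(partition=0, attempt_number=1)  # lock denied
coordinator.task_failed(partition=0, attempt_number=0)          # lock released
retry = coordinator.can_commit(partition=0, attempt_number=1)   # lock granted
```

If `task_failed` were called with an unrelated id instead of attempt number `0`, `retry` would be denied as well, which corresponds to the retry loop described above.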
*	[SPARK-10575] [SPARK CORE] Wrapped RDD.takeSample with Scope	vinodkc	2015-09-15	1	-37/+31
\| \| \| \| \| \| \| \| \| \|	Remove return statements in RDD.takeSample and wrap it withScope Author: vinodkc <vinod.kc.in@gmail.com> Author: vinodkc <vinodkc@users.noreply.github.com> Author: Vinod K C <vinod.kc@huawei.com> Closes #8730 from vinodkc/fix_takesample_return.
*	[SPARK-10612] [SQL] Add prepare to LocalNode.	Reynold Xin	2015-09-15	1	-0/+8
\| \| \| \| \| \| \| \|	The idea is that we should separate the function call that does memory reservation (i.e. prepare) from the function call that consumes the input (e.g. open()), so that all operators get a chance to reserve memory before the input is consumed. Author: Reynold Xin <rxin@databricks.com> Closes #8761 from rxin/SPARK-10612.
*	[SPARK-10548] [SPARK-10563] [SQL] Fix concurrent SQL executions	Andrew Or	2015-09-15	3	-43/+132
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Note: this is for master branch only. The fix for branch-1.5 is at #8721. The query execution ID is currently passed from a thread to its children, which is not the intended behavior. This led to `IllegalArgumentException: spark.sql.execution.id is already set` when running queries in parallel, e.g.: ``` (1 to 100).par.foreach { _ => sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count() } ``` The cause is `SparkContext`'s local properties are inherited by default. This patch adds a way to exclude keys we don't want to be inherited, and makes SQL go through that code path. Author: Andrew Or <andrew@databricks.com> Closes #8710 from andrewor14/concurrent-sql-executions.
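As a rough illustration of "excluding keys we don't want to be inherited", the sketch below copies a parent's properties for a child thread while skipping a set of excluded keys. It is plain Python, not the Spark implementation; `inherit_properties` and `non_inherited_keys` are hypothetical names and the property values are made up.

```python
def inherit_properties(parent_properties, non_inherited_keys):
    """Copy a parent's properties for a child, skipping the excluded keys."""
    return {
        key: value
        for key, value in parent_properties.items()
        if key not in non_inherited_keys
    }


# Illustrative values only.
parent = {"spark.sql.execution.id": "7", "spark.job.description": "nightly report"}
child = inherit_properties(parent, non_inherited_keys={"spark.sql.execution.id"})
```

With this shape the child starts without an execution ID, so setting one for its own query does not collide with the parent's.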
*	[SPARK-7685] [ML] Apply weights to different samples in Logistic Regression	DB Tsai	2015-09-15	7	-128/+303
\| \| \| \| \| \| \| \| \| \| \|	In a fraud detection dataset, almost all the samples are negative while only a couple of them are positive. This type of highly imbalanced data will bias the models toward the negative class, resulting in poor performance. scikit-learn provides a correction that allows users to over-/undersample the samples of each class according to the given weights. In auto mode, it selects weights inversely proportional to class frequencies in the training set. This can be done more efficiently by multiplying the weights into the loss and gradient instead of doing actual over/undersampling in the training dataset, which is very expensive. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html On the other hand, some of the training data may be more important, such as the training samples from tenured users, while the training samples from new users may be less important. We should be able to provide another "weight: Double" field in the LabeledPoint to weight them differently in the learning algorithm. Author: DB Tsai <dbt@netflix.com> Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com> Closes #7884 from dbtsai/SPARK-7685.
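The claim that weights can be multiplied into the loss and gradient instead of oversampling can be checked with a small sketch. This is plain Python with a hypothetical helper `weighted_loss_and_gradient`, not the Spark ML code; it assumes labels in {0, 1} and the standard logistic loss.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_loss_and_gradient(coefficients, samples):
    """samples: list of (label, weight, features) with label in {0, 1}."""
    loss = 0.0
    gradient = [0.0] * len(coefficients)
    for label, weight, features in samples:
        margin = sum(c * x for c, x in zip(coefficients, features))
        p = sigmoid(margin)
        # The weight scales this sample's contribution to the loss ...
        loss += -weight * (label * math.log(p) + (1 - label) * math.log(1 - p))
        # ... and to every component of the gradient.
        for j, x in enumerate(features):
            gradient[j] += weight * (p - label) * x
    return loss, gradient


coefficients = [0.5, -0.25]
# One positive sample with weight 3 ...
weighted = [(1, 3.0, [1.0, 2.0]), (0, 1.0, [1.0, -1.0])]
# ... versus the same positive sample physically repeated three times.
oversampled = [(1, 1.0, [1.0, 2.0])] * 3 + [(0, 1.0, [1.0, -1.0])]

loss_w, grad_w = weighted_loss_and_gradient(coefficients, weighted)
loss_o, grad_o = weighted_loss_and_gradient(coefficients, oversampled)
```

For integer weights the two variants give the same loss and gradient, while the weighted one touches each sample only once.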
*	[SPARK-10475] [SQL] improve column pruning for Project on Sort	Wenchen Fan	2015-09-15	2	-4/+26
\| \| \| \| \| \| \| \|	Sometimes we can't push down the whole `Project` through `Sort`, but we still have a chance to push down part of it. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8644 from cloud-fan/column-prune.
*	[SPARK-10437] [SQL] Support aggregation expressions in Order By	Liang-Chi Hsieh	2015-09-15	2	-4/+30
\| \| \| \| \| \| \| \| \| \|	JIRA: https://issues.apache.org/jira/browse/SPARK-10437 If an expression in `SortOrder` is a resolved one, such as `count(1)`, the corresponding rule in `Analyzer` to make it work in order by will not be applied. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8599 from viirya/orderby-agg.
*	Revert "[SPARK-10300] [BUILD] [TESTS] Add support for test tags in ↵	Marcelo Vanzin	2015-09-15	26	-124/+147
\| \| \| \| \| \|	run-tests.py." This reverts commit 8abef21dac1a6538c4e4e0140323b83d804d602b.
*	[DOCS] Small fixes to Spark on Yarn doc	Jacek Laskowski	2015-09-15	1	-6/+6
\| \| \| \| \| \| \| \| \|	* a follow-up to 16b6d18613e150c7038c613992d80a7828413e66 as `--num-executors` flag is not supported. * links + formatting Author: Jacek Laskowski <jacek.laskowski@deepsense.io> Closes #8762 from jaceklaskowski/docs-spark-on-yarn.
*	Closes #8738	Xiangrui Meng	2015-09-15	0	-0/+0
\| \| \| \| \| \| \| \|	Closes #8767 Closes #2491 Closes #6795 Closes #2096 Closes #7722
*	[PYSPARK] [MLLIB] [DOCS] Replaced addversion with versionadded in mllib.random	noelsmith	2015-09-15	1	-1/+1
\| \| \| \| \| \| \| \|	Missed this when reviewing `pyspark.mllib.random` for SPARK-10275. Author: noelsmith <mail@noelsmith.com> Closes #8773 from noel-smith/mllib-random-versionadded-fix.
*	[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py.	Marcelo Vanzin	2015-09-15	26	-147/+124
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This change does two things: - tag a few tests and adds the mechanism in the build to be able to disable those tags, both in maven and sbt, for both junit and scalatest suites. - add some logic to run-tests.py to disable some tags depending on what files have changed; that's used to disable expensive tests when a module hasn't explicitly been changed, to speed up testing for changes that don't directly affect those modules. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8437 from vanzin/test-tags.
*	[SPARK-10491] [MLLIB] move RowMatrix.dspr to BLAS	Yuhao Yang	2015-09-15	4	-41/+72
\| \| \| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-10491 We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`. Let me know if new UT needed. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #8663 from hhbyyh/movedspr.
*	Update version to 1.6.0-SNAPSHOT.	Reynold Xin	2015-09-15	38	-40/+49
\| \| \| \| \| \|	Author: Reynold Xin <rxin@databricks.com> Closes #8350 from rxin/1.6.
*	[SPARK-10598] [DOCS]	Robin East	2015-09-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	Comments preceding toMessage method state: "The edge partition is encoded in the lower * 30 bytes of the Int, and the position is encoded in the upper 2 bytes of the Int.". References to bytes should be changed to bits. This contribution is my original work and I license the work to the Spark project under its open source license. Author: Robin East <robin.east@xense.co.uk> Closes #8756 from insidedctm/master.
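The corrected wording (30 bits for the partition, 2 bits for the position) can be made concrete with a short sketch. It is plain Python, and `encode` and `decode` are hypothetical helpers rather than the GraphX methods; because Python integers are unbounded, the packed value is shown as an unsigned 32-bit number, whereas a JVM `Int` would show the same bit pattern as a signed value.

```python
POSITION_SHIFT = 30
PARTITION_MASK = (1 << POSITION_SHIFT) - 1  # lower 30 bits set


def encode(partition, position):
    """Pack a partition id (30 bits) and a position (2 bits) into 32 bits."""
    if not 0 <= partition <= PARTITION_MASK:
        raise ValueError("partition does not fit in 30 bits")
    if not 0 <= position < 4:
        raise ValueError("position does not fit in 2 bits")
    return (position << POSITION_SHIFT) | partition


def decode(encoded):
    """Return (partition, position) from a packed 32-bit value."""
    return encoded & PARTITION_MASK, encoded >> POSITION_SHIFT


packed = encode(partition=12345, position=3)
unpacked = decode(packed)
```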
*	Small fixes to docs	Jacek Laskowski	2015-09-14	1	-5/+5
\| \| \| \| \| \| \| \|	Links work now properly + consistent use of Spark standalone cluster (Spark uppercase + lowercase the rest -- seems agreed in the other places in the docs). Author: Jacek Laskowski <jacek.laskowski@deepsense.io> Closes #8759 from jaceklaskowski/docs-submitting-apps.
*	[SPARK-10275] [MLLIB] Add @since annotation to pyspark.mllib.random	Yu ISHIKAWA	2015-09-14	1	-0/+15
\| \| \| \| \| \|	Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8666 from yu-iskw/SPARK-10275.
*	[SPARK-10273] Add @since annotation to pyspark.mllib.feature	noelsmith	2015-09-14	1	-1/+57
\| \| \| \| \| \| \| \| \| \|	Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings). Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark). Author: noelsmith <mail@noelsmith.com> Closes #8633 from noel-smith/SPARK-10273-since-mllib-feature.
*	[SPARK-9793] [MLLIB] [PYSPARK] PySpark DenseVector, SparseVector implement ↵	Yanbo Liang	2015-09-14	2	-15/+107
\| \| \| \| \| \| \| \| \| \| \|	__eq__ and __hash__ correctly PySpark DenseVector, SparseVector ```__eq__``` method should use semantic equality, and a DenseVector can be compared with a SparseVector. Implement the PySpark DenseVector, SparseVector ```__hash__``` method based on the first 16 entries. That allows PySpark Vector objects to be used in collections. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8166 from yanboliang/spark-9793.
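A minimal sketch of what semantic equality and a prefix-based hash could look like is given below. The classes `Dense` and `Sparse` are hypothetical stand-ins, not the `pyspark.mllib.linalg` classes, and hashing the non-zero entries among the first 16 indices is an assumption about one way to keep equal dense and sparse vectors in the same hash bucket.

```python
class _Vector:
    """Hypothetical base class: equality by content, hash from a short prefix."""

    def __eq__(self, other):
        if not isinstance(other, _Vector):
            return NotImplemented
        return len(self) == len(other) and self.nonzeros() == other.nonzeros()

    def __hash__(self):
        # Only non-zero entries among the first 16 indices feed the hash, so
        # equal vectors hash equally regardless of their representation.
        head = tuple(sorted((i, v) for i, v in self.nonzeros().items() if i < 16))
        return hash((len(self), head))


class Dense(_Vector):
    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __len__(self):
        return len(self.values)

    def nonzeros(self):
        return {i: v for i, v in enumerate(self.values) if v != 0.0}


class Sparse(_Vector):
    def __init__(self, size, entries):
        self.size = size
        self.entries = {int(i): float(v) for i, v in entries.items()}

    def __len__(self):
        return self.size

    def nonzeros(self):
        return {i: v for i, v in self.entries.items() if v != 0.0}


dense = Dense([1.0, 0.0, 2.0])
sparse = Sparse(3, {0: 1.0, 2: 2.0})
unique = {dense, sparse}  # both representations collapse to one set element
```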
*	[SPARK-10542] [PYSPARK] fix serialize namedtuple	Davies Liu	2015-09-14	3	-1/+20
\| \| \| \| \| \|	Author: Davies Liu <davies@databricks.com> Closes #8707 from davies/fix_namedtuple.
*	[SPARK-9851] Support submitting map stages individually in DAGScheduler	Matei Zaharia	2015-09-14	12	-63/+710
\| \| \| \| \| \| \| \| \| \|	This patch adds support for submitting map stages in a DAG individually so that we can make downstream decisions after seeing statistics about their output, as part of SPARK-9850. I also added more comments to many of the key classes in DAGScheduler. By itself, the patch is not super useful except maybe to switch between a shuffle and broadcast join, but with the other subtasks of SPARK-9850 we'll be able to make more interesting decisions. The main entry point is SparkContext.submitMapStage, which lets you run a map stage and see stats about the map output sizes. Other stats could also be collected through accumulators. See AdaptiveSchedulingSuite for a short example. Author: Matei Zaharia <matei@databricks.com> Closes #8180 from mateiz/spark-9851.
*	[SPARK-10564] ThreadingSuite: assertion failures in threads don't fail the ↵	Andrew Or	2015-09-14	1	-8/+15
\| \| \| \| \| \| \| \| \| \|	test (round 2) This is a follow-up patch to #8723. I missed one case there. Author: Andrew Or <andrew@databricks.com> Closes #8727 from andrewor14/fix-threading-suite.
*	[SPARK-10543] [CORE] Peak Execution Memory Quantile should be Per-task Basis	Forest Fang	2015-09-14	2	-8/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Read `PEAK_EXECUTION_MEMORY` using `update` to get the per-task partial value instead of the cumulative value. I tested with this workload:
```scala
val size = 1000
val repetitions = 10
val data = sc.parallelize(1 to size, 5).map(x => (util.Random.nextInt(size / repetitions), util.Random.nextDouble)).toDF("key", "value")
val res = data.toDF.groupBy("key").agg(sum("value")).count
```
Before: ![image](https://cloud.githubusercontent.com/assets/4317392/9828197/07dd6874-58b8-11e5-9bd9-6ba927c38b26.png) After: ![image](https://cloud.githubusercontent.com/assets/4317392/9828151/a5ddff30-58b7-11e5-8d31-eda5dc4eae79.png) Tasks view: ![image](https://cloud.githubusercontent.com/assets/4317392/9828199/17dc2b84-58b8-11e5-92a8-be89ce4d29d1.png) cc andrewor14 I would appreciate it if you could give feedback on this since I think you introduced display of this metric. Author: Forest Fang <forest.fang@outlook.com> Closes #8726 from saurfang/stagepage.
*	[SPARK-10549] scala 2.11 spark on yarn with security - Repl doesn't work	Tom Graves	2015-09-14	1	-1/+2
\| \| \| \| \| \| \| \| \|	Make this lazy so that it can set the yarn mode before creating the securityManager. Author: Tom Graves <tgraves@yahoo-inc.com> Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com> Closes #8719 from tgravescs/SPARK-10549.
*	[SPARK-10576] [BUILD] Move .java files out of src/main/scala	Sean Owen	2015-09-14	8	-0/+0
\| \| \| \| \| \| \| \|	Move .java files in `src/main/scala` to `src/main/java` root, except for `package-info.java` (to stay next to package.scala) Author: Sean Owen <sowen@cloudera.com> Closes #8736 from srowen/SPARK-10576.
*	[SPARK-10594] [YARN] Remove reference to --num-executors, add --properties-file	Erick Tryzelaar	2015-09-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	`ApplicationMaster` no longer has the `--num-executors` flag, and had an undocumented `--properties-file` configuration option. cc srowen Author: Erick Tryzelaar <erick.tryzelaar@gmail.com> Closes #8754 from erickt/master.
*	[SPARK-9996] [SPARK-9997] [SQL] Add local expand and NestedLoopJoin operators	zsxwing	2015-09-14	7	-15/+574
\| \| \| \| \| \| \| \|	This PR is in conflict with #8535 and #8573. Will update this one when they are merged. Author: zsxwing <zsxwing@gmail.com> Closes #8642 from zsxwing/expand-nest-join.
*	[SPARK-6981] [SQL] Factor out SparkPlanner and QueryExecution from SQLContext	Edoardo Vacchi	2015-09-14	6	-128/+195
\| \| \| \| \| \| \| \| \| \|	Alternative to PR #6122; in this case the refactored out classes are replaced by inner classes with the same name for backwards binary compatibility * process in a lighter-weight, backwards-compatible way Author: Edoardo Vacchi <uncommonnonsense@gmail.com> Closes #6356 from evacchi/sqlctx-refactoring-lite.
*	[SPARK-10522] [SQL] Nanoseconds of Timestamp in Parquet should be positive	Davies Liu	2015-09-14	2	-14/+15
\| \| \| \| \| \| \| \| \| \|	Or Hive can't read it back correctly. Thanks vanzin for reporting this. Author: Davies Liu <davies@databricks.com> Closes #8674 from davies/positive_nano.
*	[SPARK-10573] [ML] IndexToString output schema should be StringType	Nick Pritchard	2015-09-14	2	-3/+10
\| \| \| \| \| \| \| \|	Fixes bug where IndexToString output schema was DoubleType. Correct me if I'm wrong, but it doesn't seem like the output needs to have any "ML Attribute" metadata. Author: Nick Pritchard <nicholas.pritchard@falkonry.com> Closes #8751 from pnpritchard/SPARK-10573.
*	[SPARK-10194] [MLLIB] [PYSPARK] SGD algorithms need convergenceTol parameter ↵	Yanbo Liang	2015-09-14	3	-21/+48
\| \| \| \| \| \| \| \| \| \|	in Python [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed). Author: Yanbo Liang <ybliang8@gmail.com> Closes #8457 from yanboliang/spark-10194.
*	[SPARK-10584] [DOC] [SQL] Documentation about ↵	Kousuke Saruta	2015-09-14	2	-5/+8
\| \| \| \| \| \| \| \| \| \| \|	spark.sql.hive.metastore.version is wrong. The default value of hive metastore version is 1.2.1 but the documentation says the value of `spark.sql.hive.metastore.version` is 0.13.1. Also, we cannot get the default value by `sqlContext.getConf("spark.sql.hive.metastore.version")`. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #8739 from sarutak/SPARK-10584.
*	[SPARK-9899] [SQL] log warning for direct output committer with speculation ↵	Wenchen Fan	2015-09-14	3	-9/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	enabled This is a follow-up of https://github.com/apache/spark/pull/8317. When speculation is enabled, there may be multiple tasks writing to the same path. Generally it's OK as we will write to a temporary directory first and only one task can commit the temporary directory to the target path. However, when we use a direct output committer, tasks will write data to the target path directly without a temporary directory. This causes problems like corrupted data. Please see [PR comment](https://github.com/apache/spark/pull/8191#issuecomment-131598385) for more details. Unfortunately, we don't have a simple flag to tell if an output committer will write to a temporary directory or not, so for safety, we have to disable any customized output committer when `speculation` is true. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8687 from cloud-fan/direct-committer.
*	[SPARK-9720] [ML] Identifiable types need UID in toString methods	Bertrand Dechoux	2015-09-14	8	-9/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	A few Identifiable types did override their toString method but without using the parent implementation. As a consequence, the uid was not present anymore in the toString result. It is the default behaviour. This patch is a quick fix. The question of enforcement is still up. No tests have been written to verify the toString method behaviour. That would be long to do because all types should be tested and not only those which have a regression now. It is possible to enforce the condition using the compiler by making the toString method final but that would introduce unwanted potential API breaking changes (see jira). Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com> Closes #8062 from BertrandDechoux/SPARK-9720.
*	[SPARK-10222] [GRAPHX] [DOCS] More thoroughly deprecate Bagel in favor of GraphX	Sean Owen	2015-09-13	4	-11/+8
\| \| \| \| \| \| \| \|	Finish deprecating Bagel; remove reference to nonexistent example Author: Sean Owen <sowen@cloudera.com> Closes #8731 from srowen/SPARK-10222.
*	[SPARK-10330] Add Scalastyle rule to require use of SparkHadoopUtil ↵	Josh Rosen	2015-09-12	15	-20/+61
\| \| \| \| \| \| \| \| \| \|	JobContext methods This is a followup to #8499 which adds a Scalastyle rule to mandate the use of SparkHadoopUtil's JobContext accessor methods and fixes the existing violations. Author: Josh Rosen <joshrosen@databricks.com> Closes #8521 from JoshRosen/SPARK-10330-part2.
*	[SPARK-6548] Adding stddev to DataFrame functions	JihongMa	2015-09-12	16	-64/+574
\| \| \| \| \| \| \| \| \| \| \|	Adding STDDEV support for DataFrame using a 1-pass online/parallel algorithm to compute variance. Please review the code change. Author: JihongMa <linlin200605@gmail.com> Author: Jihong MA <linlin200605@gmail.com> Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com> Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local> Closes #6297 from JihongMA/SPARK-SQL.
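The message does not say which one-pass algorithm is used, so the sketch below shows one well-known option as an assumption: a Welford-style running update per partition plus a pairwise merge of partial results. `RunningStats` is a hypothetical class in plain Python, and returning the sample standard deviation (dividing by n - 1) is likewise an assumption.

```python
import math


class RunningStats:
    """One-pass count/mean/M2 accumulator that can be merged across partitions."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Combine two partial results without revisiting the data."""
        merged = RunningStats()
        merged.count = self.count + other.count
        if merged.count == 0:
            return merged
        delta = other.mean - self.mean
        merged.mean = self.mean + delta * other.count / merged.count
        merged.m2 = (self.m2 + other.m2
                     + delta * delta * self.count * other.count / merged.count)
        return merged

    def stddev(self):
        # Sample standard deviation; undefined for fewer than two values.
        if self.count < 2:
            return float("nan")
        return math.sqrt(self.m2 / (self.count - 1))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
left, right = RunningStats(), RunningStats()
for value in data[:3]:
    left.add(value)    # first "partition"
for value in data[3:]:
    right.add(value)   # second "partition"
combined = left.merge(right)
```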
*	[SPARK-10547] [TEST] Streamline / improve style of Java API tests	Sean Owen	2015-09-12	15	-761/+755
\| \| \| \| \| \| \| \|	Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order Author: Sean Owen <sowen@cloudera.com> Closes #8706 from srowen/SPARK-10547.
*	[SPARK-10554] [CORE] Fix NPE with ShutdownHook	Nithin Asokan	2015-09-12	1	-1/+3
\| \| \| \| \| \| \| \| \| \|	https://issues.apache.org/jira/browse/SPARK-10554 Fixes NPE when ShutdownHook tries to cleanup temporary folders Author: Nithin Asokan <Nithin.Asokan@Cerner.com> Closes #8720 from nasokan/SPARK-10554.
*	[SPARK-10566] [CORE] SnappyCompressionCodec init exception handling masks ↵	Daniel Imfeld	2015-09-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	important error information When throwing an IllegalArgumentException in SnappyCompressionCodec.init, chain the existing exception. This allows potentially important debugging info to be passed to the user. Manual testing shows the exception chained properly, and the test suite still looks fine as well. This contribution is my original work and I license the work to the project under the project's open source license. Author: Daniel Imfeld <daniel@danielimfeld.com> Closes #8725 from dimfeld/dimfeld-patch-1.
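Exception chaining of the kind described here can be sketched in Python with `raise ... from ...`. The helper `load_codec` and the error messages are hypothetical; the Spark change itself is in Scala and is not reproduced here.

```python
def load_codec(loader):
    """Run a codec initializer, keeping the original failure as the cause."""
    try:
        return loader()
    except Exception as cause:
        # Chaining keeps the underlying error (and its traceback) visible
        # to whoever inspects the higher-level exception.
        raise ValueError("codec initialization failed") from cause


def broken_loader():
    # Stand-in for a native library that fails to load.
    raise OSError("native library missing")


try:
    load_codec(broken_loader)
except ValueError as error:
    chained_cause = error.__cause__
```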
*	[SPARK-9014] [SQL] Allow Python spark API to use built-in exponential operator	0x0FFF	2015-09-11	2	-1/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR addresses [SPARK-9014](https://issues.apache.org/jira/browse/SPARK-9014). Added functionality: the `Column` object in Python now supports the exponential operator `**`. Example:
```
from pyspark.sql import *
df = sqlContext.createDataFrame([Row(a=2)])
df.select(3**df.a, df.a**3, df.a**df.a).collect()
```
Outputs:
```
[Row(POWER(3.0, a)=9.0, POWER(a, 3.0)=8.0, POWER(a, a)=4.0)]
```
Author: 0x0FFF <programmerag@gmail.com> Closes #8658 from 0x0FFF/SPARK-9014.
*	[SPARK-10564] ThreadingSuite: assertion failures in threads don't fail the test	Andrew Or	2015-09-11	1	-23/+45
\| \| \| \| \| \| \| \|	This commit ensures if an assertion fails within a thread, it will ultimately fail the test. Otherwise we end up potentially masking real bugs by not propagating assertion failures properly. Author: Andrew Or <andrew@databricks.com> Closes #8723 from andrewor14/fix-threading-suite.
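The general pattern of propagating a failure from a worker thread to the test thread can be sketched as follows. This is plain Python with a hypothetical helper `run_in_thread`, not the ThreadingSuite code: the worker records any exception, and the caller re-raises it after joining.

```python
import threading


def run_in_thread(body):
    """Run body() in a thread and re-raise any failure in the calling thread."""
    failures = []

    def wrapper():
        try:
            body()
        except BaseException as error:  # includes AssertionError
            failures.append(error)

    worker = threading.Thread(target=wrapper)
    worker.start()
    worker.join()
    if failures:
        # Without this step the failure would be lost with the worker thread.
        raise failures[0]


def failing_check():
    raise AssertionError("value computed in thread was wrong")


try:
    run_in_thread(failing_check)
    propagated = False
except AssertionError:
    propagated = True
```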
*	[SPARK-9990] [SQL] Local hash join follow-ups	Andrew Or	2015-09-11	4	-5/+125
\| \| \| \| \| \| \| \| \|	1. Hide `LocalNodeIterator` behind the `LocalNode#asIterator` method 2. Add tests for this Author: Andrew Or <andrew@databricks.com> Closes #8708 from andrewor14/local-hash-join-follow-up.
*	[SPARK-9992] [SPARK-9994] [SPARK-9998] [SQL] Implement the local TopK, ↵	zsxwing	2015-09-11	8	-1/+353
\| \| \| \| \| \| \| \| \| \|	sample and intersect operators This PR is in conflict with #8535. I will update this one when #8535 gets merged. Author: zsxwing <zsxwing@gmail.com> Closes #8573 from zsxwing/more-local-operators.
*	[SPARK-7142] [SQL] Minor enhancement to BooleanSimplification Optimizer ↵	Yash Datta	2015-09-11	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \|	rule. Incorporate review comments Adding changes suggested by cloud-fan in #5700 cc marmbrus Author: Yash Datta <Yash.Datta@guavus.com> Closes #8716 from saucam/bool_simp.
*	[SPARK-10442] [SQL] fix string to boolean cast	Wenchen Fan	2015-09-11	4	-24/+82
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we cast string to boolean in hive, it returns `true` if the length of string is > 0, and spark SQL follows this behavior. However, this behavior is very different from other SQL systems: 1. [presto](https://github.com/facebook/presto/blob/master/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L89-L118) will return `true` for 't' 'true' '1', `false` for 'f' 'false' '0', throw exception for others. 2. [redshift](http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html) will return `true` for 't' 'true' 'y' 'yes' '1', `false` for 'f' 'false' 'n' 'no' '0', null for others. 3. [postgresql](http://www.postgresql.org/docs/devel/static/datatype-boolean.html) will return `true` for 't' 'true' 'y' 'yes' 'on' '1', `false` for 'f' 'false' 'n' 'no' 'off' '0', throw exception for others. 4. [vertica](https://my.vertica.com/docs/5.0/HTML/Master/2983.htm) will return `true` for 't' 'true' 'y' 'yes' '1', `false` for 'f' 'false' 'n' 'no' '0', null for others. 5. [impala](http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_boolean.html) throws an exception when trying to cast string to boolean. 6. mysql, oracle, sqlserver don't have a boolean type. Whether we should change the cast behavior to match other SQL systems is not decided yet; this PR is a test to see how many compatibility tests would fail if we changed it. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8698 from cloud-fan/string2boolean.
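For comparison, the Redshift/Vertica-style mapping listed above can be written as a small lookup. This is a plain-Python sketch with a hypothetical function name; it matches the listed literals exactly and leaves case and whitespace handling out, since the message does not describe them, and it is not what the PR itself implements.

```python
TRUE_LITERALS = {"t", "true", "y", "yes", "1"}
FALSE_LITERALS = {"f", "false", "n", "no", "0"}


def cast_string_to_boolean(text):
    """Return True/False for the recognized literals and None for anything else."""
    if text in TRUE_LITERALS:
        return True
    if text in FALSE_LITERALS:
        return False
    return None  # plays the role of SQL NULL in this sketch


results = [cast_string_to_boolean(s) for s in ("yes", "0", "abc")]
```

The Hive-style rule described at the start of the message would instead return `true` for all three inputs, because each string has a length greater than zero.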
*	[PYTHON] Fixed typo in exception message	Icaro Medeiros	2015-09-11	1	-1/+1
\| \| \| \| \| \| \| \|	Just fixing a typo in exception message, raised when attempting to pickle SparkContext. Author: Icaro Medeiros <icaro.medeiros@gmail.com> Closes #8724 from icaromedeiros/master.
*	[SPARK-10546] Check partitionId's range in ExternalSorter#spill()	tedyu	2015-09-11	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	See this thread for background: http://search-hadoop.com/m/q3RTt0rWvIkHAE81 We should check the range of partition Id and provide meaningful message through exception. Alternatively, we can use abs() and modulo to force the partition Id into legitimate range. However, expectation is that user should correct the logic error in his / her code. Author: tedyu <yuzhihong@gmail.com> Closes #8703 from tedyu/master.
*	[SPARK-8530] [ML] add python API for MinMaxScaler	Yuhao Yang	2015-09-11	1	-5/+99
\| \| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-8530 add python API for MinMaxScaler jira for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514 Author: Yuhao Yang <hhbyyh@gmail.com> Closes #7150 from hhbyyh/pythonMinMax.
*	[SPARK-10540] [SQL] Ignore HadoopFsRelationTest's "test all data types" if ↵	Yin Huai	2015-09-11	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	it is too flaky If hadoopFsRelationSuites's "test all data types" is too flaky we can disable it for now. https://issues.apache.org/jira/browse/SPARK-10540 Author: Yin Huai <yhuai@databricks.com> Closes #8705 from yhuai/SPARK-10540-ignore.