spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-10328] [SPARKR] Fix generic for na.omit	Shivaram Venkataraman	2015-08-28	4	-6/+27
\| \| \| \| \| \| \| \| \| \|	S3 function is at https://stat.ethz.ch/R-manual/R-patched/library/stats/html/na.fail.html Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com> Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8495 from shivaram/na-omit-fix.
*	[SPARK-10188] [PYSPARK] Pyspark CrossValidator with RMSE selects incorrect model	noelsmith	2015-08-27	3	-1/+104
\| \| \| \| \| \| \| \| \| \| \| \| \|	* Added isLargerBetter() method to PySpark Evaluator to match the Scala version. * JavaEvaluator delegates isLargerBetter() to underlying Scala object. * Added check for isLargerBetter() in CrossValidator to determine whether to use argmin or argmax. * Added test cases for where smaller is better (RMSE) and larger is better (R-Squared). (This contribution is my original work and that I license the work to the project under Spark's open source license) Author: noelsmith <mail@noelsmith.com> Closes #8399 from noel-smith/pyspark-rmse-xval-fix.
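The argmin/argmax choice described in this commit can be illustrated with a minimal, Spark-free sketch. The function name `select_best_index` and the sample metric lists are hypothetical and are not taken from the PySpark source; only the idea of switching on an `isLargerBetter()`-style flag comes from the commit message.

```python
from typing import Sequence


def select_best_index(avg_metrics: Sequence[float], is_larger_better: bool) -> int:
    """Return the index of the best model given one averaged metric per model.

    Hypothetical helper: mirrors the argmax/argmin switch the commit describes,
    not the actual CrossValidator code.
    """
    if not avg_metrics:
        raise ValueError("avg_metrics must not be empty")
    indices = range(len(avg_metrics))
    if is_larger_better:
        # Metrics such as R-squared: the largest value wins (argmax).
        return max(indices, key=lambda i: avg_metrics[i])
    # Metrics such as RMSE: the smallest value wins (argmin).
    return min(indices, key=lambda i: avg_metrics[i])


# Made-up cross-validation averages for three candidate models.
rmse_per_model = [0.9, 0.4, 0.7]
r2_per_model = [0.2, 0.8, 0.5]

best_by_rmse = select_best_index(rmse_per_model, is_larger_better=False)
best_by_r2 = select_best_index(r2_per_model, is_larger_better=True)
```

Always taking the argmax would pick index 0 for the RMSE list above, which is the worst model; that is the kind of incorrect selection the commit title refers to.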
*	[SPARK-SQL] [MINOR] Fixes some typos in HiveContext	Cheng Lian	2015-08-27	2	-5/+5
\| \| \| \| \| \|	Author: Cheng Lian <lian@databricks.com> Closes #8481 from liancheng/hive-context-typo.
*	[SPARK-9905] [ML] [DOC] Adds LinearRegressionSummary user guide	Feynman Liang	2015-08-27	1	-13/+127
\| \| \| \| \| \| \| \| \| \| \|	* Adds user guide for `LinearRegressionSummary` * Fixes unresolved issues in #8197 CC jkbradley mengxr Author: Feynman Liang <fliang@databricks.com> Closes #8491 from feynmanliang/SPARK-9905.
*	[SPARK-9911] [DOC] [ML] Update Userguide for Evaluator	MechCoder	2015-08-27	1	-0/+13
\| \| \| \| \| \| \| \|	I added a small note about the different types of evaluator and the metrics used. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8304 from MechCoder/multiclass_evaluator.
*	[SPARK-8505] [SPARKR] Add settings to kick `lint-r` from `./dev/run-test.py`	Yu ISHIKAWA	2015-08-27	5	-12/+47
\| \| \| \| \| \| \| \| \| \| \| \|	JoshRosen we'd like to check the SparkR source code with the `dev/lint-r` script on Jenkins. I tried to incorporate the script into `dev/run-test.py`. Could you review it when you have time? shivaram I modified `dev/lint-r` and `dev/lint-r.R` to install the lintr package into a local directory (`R/lib/`) and to exit with a lint status. Could you review it? - [[SPARK-8505] Add settings to kick `lint-r` from `./dev/run-test.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8505) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #7883 from yu-iskw/SPARK-8505.
*	[SPARK-10321] sizeInBytes in HadoopFsRelation	Davies Liu	2015-08-27	1	-0/+2
\| \| \| \| \| \| \| \| \| \|	Having sizeInBytes in HadoopFsRelation to enable broadcast join. cc marmbrus Author: Davies Liu <davies@databricks.com> Closes #8490 from davies/sizeInByte.
*	[SPARK-10287] [SQL] Fixes JSONRelation refreshing on read path	Yin Huai	2015-08-27	4	-25/+7
\| \| \| \| \| \| \| \| \| \|	https://issues.apache.org/jira/browse/SPARK-10287 After porting json to HadoopFsRelation, it seems hard to keep the behavior of picking up new files automatically for JSON. This PR removes this behavior, so JSON is consistent with others (ORC and Parquet). Author: Yin Huai <yhuai@databricks.com> Closes #8469 from yhuai/jsonRefresh.
*	[SPARK-9680] [MLLIB] [DOC] StopWordsRemovers user guide and Java compatibility test	Feynman Liang	2015-08-27	2	-3/+171
\| \| \| \| \| \| \| \| \| \| \| \| \|	* Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine * Cleans up scaladocs for public methods * Adds test for Java compatibility * Follow up Python user guide code example is tracked by SPARK-10249 Author: Feynman Liang <fliang@databricks.com> Closes #8436 from feynmanliang/SPARK-10230.
*	[SPARK-9906] [ML] User guide for LogisticRegressionSummary	MechCoder	2015-08-27	1	-16/+133
\| \| \| \| \| \| \| \| \| \|	User guide for LogisticRegression summaries Author: MechCoder <manojkumarsivaraj334@gmail.com> Author: Manoj Kumar <mks542@nyu.edu> Author: Feynman Liang <fliang@databricks.com> Closes #8197 from MechCoder/log_summary_user_guide.
*	[SPARK-9901] User guide for RowMatrix Tall-and-skinny QR	Yuhao Yang	2015-08-27	1	-1/+10
\| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-9901 The jira covers only the document update. I can further provide example code for QR (like the ones for SVD and PCA) in a separate PR. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #8462 from hhbyyh/qrDoc.
*	[SPARK-10315] remove document on spark.akka.failure-detector.threshold	CodingCat	2015-08-27	1	-10/+0
\| \| \| \| \| \| \| \| \| \|	https://issues.apache.org/jira/browse/SPARK-10315 This parameter is no longer used, and the current documentation contains a mistake: the name should be 'akka.remote.watch-failure-detector.threshold'. Author: CodingCat <zhunansjtu@gmail.com> Closes #8483 from CodingCat/SPARK_10315.
*	[SPARK-9148] [SPARK-10252] [SQL] Update SQL Programming Guide	Michael Armbrust	2015-08-27	1	-19/+73
\| \| \| \| \| \|	Author: Michael Armbrust <michael@databricks.com> Closes #8441 from marmbrus/documentation.
*	[SPARK-10182] [MLLIB] GeneralizedLinearModel doesn't unpersist cached data	Vyacheslav Baranov	2015-08-27	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \|	`GeneralizedLinearModel` creates a cached RDD when building a model. It's inconvenient, since these RDDs flood the memory when building several models in a row, so useful data might get evicted from the cache. The proposed solution is to always cache the dataset & remove the warning. There's a caveat though: the input dataset gets evaluated twice, in line 270 when fitting `StandardScaler` for the first time, and when running the optimizer for the second time. So it might be worth restoring the removed warning. Another possible solution is to disable caching entirely & restore the removed warning. I don't really know which approach is better. Author: Vyacheslav Baranov <slavik.baranov@gmail.com> Closes #8395 from SlavikBaranov/SPARK-10182.
*	[SPARK-10257] [MLLIB] Removes Guava from all spark.mllib Java tests	Feynman Liang	2015-08-27	14	-74/+71
\| \| \| \| \| \| \| \| \| \| \| \|	* Replaces instances of `Lists.newArrayList` with `Arrays.asList` * Replaces `com.google.collections.Strings` with `commons.lang.StringUtils` * Uses the `List` interface in place of `ArrayList` implementations This PR along with #8445 #8446 #8447 completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests. Author: Feynman Liang <fliang@databricks.com> Closes #8451 from feynmanliang/SPARK-10257.
*	[SPARK-9613] [HOTFIX] Fix usage of JavaConverters removed in Scala 2.11	Jacek Laskowski	2015-08-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) being removed in 2.11.7 and hence the build fails with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles. BUILD SUCCESS in both cases. Build for 2.10: ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install and 2.11: ./dev/change-scala-version.sh 2.11 ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install Author: Jacek Laskowski <jacek@japila.pl> Closes #8479 from jaceklaskowski/SPARK-9613-hotfix.
*	[SPARK-10256] [ML] Removes guava dependency from spark.ml.classification JavaTests	Feynman Liang	2015-08-27	1	-2/+2
\| \| \| \| \| \| \| \|	Author: Feynman Liang <fliang@databricks.com> Closes #8447 from feynmanliang/SPARK-10256.
*	[SPARK-10255] [ML] Removes Guava dependencies from spark.ml.param JavaTests	Feynman Liang	2015-08-27	2	-6/+6
\| \| \| \| \| \|	Author: Feynman Liang <fliang@databricks.com> Closes #8446 from feynmanliang/SPARK-10255.
*	[SPARK-10254] [ML] Removes Guava dependencies in spark.ml.feature JavaTests	Feynman Liang	2015-08-27	11	-30/+35
\| \| \| \| \| \| \| \| \|	* Replaces `com.google.common` dependencies with `java.util.Arrays` * Small clean up in `JavaNormalizerSuite` Author: Feynman Liang <fliang@databricks.com> Closes #8445 from feynmanliang/SPARK-10254.
*	[DOCS] [STREAMING] [KAFKA] Fix typo in exactly once semantics	Moussa Taifi	2015-08-27	1	-1/+1
\| \| \| \| \| \| \| \| \|	Fix Typo in exactly once semantics [Semantics of output operations] link Author: Moussa Taifi <moutai10@gmail.com> Closes #8468 from moutai/patch-3.
*	[SPARK-10251] [CORE] some common types are not registered for Kryo Serialization by default	Ram Sriharsha	2015-08-26	2	-1/+64
\| \| \| \| \| \| \| \|	Author: Ram Sriharsha <rsriharsha@hw11853.local> Closes #8465 from harsha2010/SPARK-10251.
*	[SPARK-10219] [SPARKR] Fix varargsToEnv and add test case	Shivaram Venkataraman	2015-08-26	2	-1/+8
\| \| \| \| \| \| \| \|	cc sun-rui davies Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #8475 from shivaram/varargs-fix.
*	[SPARK-9964] [PYSPARK] [SQL] PySpark DataFrameReader accept RDD of String for JSON	Yanbo Liang	2015-08-26	1	-6/+22
\| \| \| \| \| \| \| \| \| \| \|	PySpark DataFrameReader should accept an RDD of Strings (like the Scala version does) for JSON, rather than only taking a path. If this PR is merged, it should be duplicated to cover the other input types (not just JSON). Author: Yanbo Liang <ybliang8@gmail.com> Closes #8444 from yanboliang/spark-9964.
*	[SPARK-9424] [SQL] Parquet programming guide updates for 1.5	Cheng Lian	2015-08-26	1	-8/+37
\| \| \| \| \| \|	Author: Cheng Lian <lian@databricks.com> Closes #8467 from liancheng/spark-9424/parquet-docs-for-1.5.
*	[MINOR] [SPARKR] Fix some validation problems in SparkR	Yu ISHIKAWA	2015-08-26	3	-11/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Getting rid of some validation problems in SparkR https://github.com/apache/spark/pull/7883 cc shivaram ``` inst/tests/test_Serde.R:26:1: style: Trailing whitespace is superfluous. ^~ inst/tests/test_Serde.R:34:1: style: Trailing whitespace is superfluous. ^~ inst/tests/test_Serde.R:37:38: style: Trailing whitespace is superfluous. expect_equal(class(x), "character") ^~ inst/tests/test_Serde.R:50:1: style: Trailing whitespace is superfluous. ^~ inst/tests/test_Serde.R:55:1: style: Trailing whitespace is superfluous. ^~ inst/tests/test_Serde.R:60:1: style: Trailing whitespace is superfluous. ^~ inst/tests/test_sparkSQL.R:611:1: style: Trailing whitespace is superfluous. ^~ R/DataFrame.R:664:1: style: Trailing whitespace is superfluous. ^~~~~~~~~~~~~~ R/DataFrame.R:670:55: style: Trailing whitespace is superfluous. df <- data.frame(row.names = 1 : nrow) ^~~~~~~~~~~~~~~~ R/DataFrame.R:672:1: style: Trailing whitespace is superfluous. ^~~~~~~~~~~~~~ R/DataFrame.R:686:49: style: Trailing whitespace is superfluous. df[[names[colIndex]]] <- vec ^~~~~~~~~~~~~~~~~~ ``` Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8474 from yu-iskw/minor-fix-sparkr.
*	[SPARK-10308] [SPARKR] Add %in% to the exported namespace	Shivaram Venkataraman	2015-08-26	1	-3/+4
\| \| \| \| \| \| \| \| \| \|	I also checked all the other functions defined in column.R, functions.R and DataFrame.R and everything else looked fine. cc yu-iskw Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #8473 from shivaram/in-namespace.
*	[SPARK-10305] [SQL] fix create DataFrame from Python class	Davies Liu	2015-08-26	2	-0/+18
\| \| \| \| \| \| \| \|	cc jkbradley Author: Davies Liu <davies@databricks.com> Closes #8470 from davies/fix_create_df.
*	[SPARK-10241] [MLLIB] update since versions in mllib.recommendation	Xiangrui Meng	2015-08-26	2	-5/+25
\| \| \| \| \| \| \| \| \| \|	Same as #8421 but for `mllib.recommendation`. cc srowen coderxiang Author: Xiangrui Meng <meng@databricks.com> Closes #8432 from mengxr/SPARK-10241.
*	HOTFIX: Increase PRB timeout	Patrick Wendell	2015-08-26	1	-2/+2
\|
*	[SPARK-9665] [MLLIB] audit MLlib API annotations	Xiangrui Meng	2015-08-26	1	-4/+8
\| \| \| \| \| \| \| \| \| \|	I only found `ml.NaiveBayes` missing `Experimental` annotation. This PR doesn't cover Python APIs. cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8452 from mengxr/SPARK-9665.
*	Closes #8443	Reynold Xin	2015-08-26	0	-0/+0
\|
*	[SPARK-9316] [SPARKR] Add support for filtering using `[` (synonym for filter / select)	felixcheung	2015-08-25	2	-1/+48
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add support for ``` df[df$name == "Smith", c(1,2)] df[df$age %in% c(19, 30), 1:2] ``` shivaram Author: felixcheung <felixcheung_m@hotmail.com> Closes #8394 from felixcheung/rsubset.
*	[SPARK-10236] [MLLIB] update since versions in mllib.feature	Xiangrui Meng	2015-08-25	8	-16/+21
\| \| \| \| \| \| \| \| \| \| \| \| \|	Same as #8421 but for `mllib.feature`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8449 from mengxr/SPARK-10236.feature and squashes the following commits: 0e8d658 [Xiangrui Meng] remove unnecessary comment ad70b03 [Xiangrui Meng] update since versions in mllib.feature
*	[SPARK-10235] [MLLIB] update since versions in mllib.regression	Xiangrui Meng	2015-08-25	8	-29/+47
\| \| \| \| \| \| \| \| \| \| \| \|	Same as #8421 but for `mllib.regression`. cc freeman-lab dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8426 from mengxr/SPARK-10235 and squashes the following commits: 6cd28e4 [Xiangrui Meng] update since versions in mllib.regression
*	[SPARK-10243] [MLLIB] update since versions in mllib.tree	Xiangrui Meng	2015-08-25	12	-44/+57
\| \| \| \| \| \| \| \| \| \|	Same as #8421 but for `mllib.tree`. cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8442 from mengxr/SPARK-10236.
*	[SPARK-10234] [MLLIB] update since version in mllib.clustering	Xiangrui Meng	2015-08-25	7	-23/+44
\| \| \| \| \| \| \| \| \| \|	Same as #8421 but for `mllib.clustering`. cc feynmanliang yu-iskw Author: Xiangrui Meng <meng@databricks.com> Closes #8435 from mengxr/SPARK-10234.
*	[SPARK-10240] [SPARK-10242] [MLLIB] update since versions in mllib.random and mllib.stat	Xiangrui Meng	2015-08-25	4	-25/+117
\| \| \| \| \| \| \| \| \| \| \| \|	The same as #8241 but for `mllib.stat` and `mllib.random`. cc feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8439 from mengxr/SPARK-10242.
*	[SPARK-10238] [MLLIB] update since versions in mllib.linalg	Xiangrui Meng	2015-08-25	8	-31/+64
\| \| \| \| \| \| \| \| \| \| \| \|	Same as #8421 but for `mllib.linalg`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8440 from mengxr/SPARK-10238 and squashes the following commits: b38437e [Xiangrui Meng] update since versions in mllib.linalg
*	[SPARK-10233] [MLLIB] update since version in mllib.evaluation	Xiangrui Meng	2015-08-25	4	-7/+27
\| \| \| \| \| \| \| \| \| \|	Same as #8421 but for `mllib.evaluation`. cc avulanov Author: Xiangrui Meng <meng@databricks.com> Closes #8423 from mengxr/SPARK-10233.
*	[SPARK-9888] [MLLIB] User guide for new LDA features	Feynman Liang	2015-08-25	3	-20/+117
\| \| \| \| \| \| \| \| \| \| \| \|	* Adds two new sections to LDA's user guide; one for each optimizer/model * Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization) * Cleans up a TODO and sets a default parameter in LDA code jkbradley hhbyyh Author: Feynman Liang <fliang@databricks.com> Closes #8254 from feynmanliang/SPARK-9888.
*	[SPARK-10215] [SQL] Fix precision of division (follow the rule in Hive)	Davies Liu	2015-08-25	4	-13/+39
\| \| \| \| \| \| \| \| \| \|	Follow the rule in Hive for decimal division. see https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDivide.java#L113 cc chenghao-intel Author: Davies Liu <davies@databricks.com> Closes #8415 from davies/decimal_div2.
*	[SPARK-10245] [SQL] Fix decimal literals with precision < scale	Davies Liu	2015-08-25	3	-6/+19
\| \| \| \| \| \| \| \|	In BigDecimal or java.math.BigDecimal, the precision can be smaller than the scale; for example, BigDecimal("0.001") has precision = 1 and scale = 3. But DecimalType requires that the precision not be smaller than the scale, so we should use the maximum of precision and scale when inferring the schema from a decimal literal. Author: Davies Liu <davies@databricks.com> Closes #8428 from davies/smaller_decimal.
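The precision/scale rule in this commit can be worked through with Python's standard `decimal` module. This is only an illustrative sketch: the helper names `precision_and_scale` and `inferred_decimal_type` are hypothetical, and Python's `Decimal` stands in for `java.math.BigDecimal`, which the commit actually refers to.

```python
from decimal import Decimal
from typing import Tuple


def precision_and_scale(value: Decimal) -> Tuple[int, int]:
    """Return (precision, scale) of a finite decimal literal.

    Precision is the number of digits in the unscaled value; scale is the
    number of digits after the decimal point (clamped at 0 in this sketch).
    """
    _sign, digits, exponent = value.as_tuple()
    if not isinstance(exponent, int):
        # NaN and infinities carry a string exponent and have no precision/scale.
        raise ValueError("value must be finite")
    precision = len(digits) + max(0, exponent)
    scale = max(0, -exponent)
    return precision, scale


def inferred_decimal_type(value: Decimal) -> Tuple[int, int]:
    """Apply the rule from the commit: precision becomes max(precision, scale)."""
    precision, scale = precision_and_scale(value)
    return max(precision, scale), scale


raw = precision_and_scale(Decimal("0.001"))
fixed = inferred_decimal_type(Decimal("0.001"))
unchanged = inferred_decimal_type(Decimal("12.5"))
```

For `0.001` the raw pair is precision 1, scale 3, which a type requiring precision to be at least the scale would reject; taking the maximum yields (3, 3). A literal such as `12.5` already satisfies the constraint and is left as (3, 1).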
*	[SPARK-10239] [SPARK-10244] [MLLIB] update since versions in mllib.pmml and mllib.util	Xiangrui Meng	2015-08-25	9	-11/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Same as #8421 but for `mllib.pmml` and `mllib.util`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8430 from mengxr/SPARK-10239 and squashes the following commits: a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util
*	[SPARK-9797] [MLLIB] [DOC] StreamingLinearRegressionWithSGD.setConvergenceTol default value	Feynman Liang	2015-08-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc Author: Feynman Liang <fliang@databricks.com> Closes #8424 from feynmanliang/SPARK-9797.
*	[SPARK-10237] [MLLIB] update since versions in mllib.fpm	Xiangrui Meng	2015-08-25	3	-7/+32
\| \| \| \| \| \| \| \| \| \|	Same as #8421 but for `mllib.fpm`. cc feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8429 from mengxr/SPARK-10237.
*	[SPARK-9800] Adds docs for GradientDescent$.runMiniBatchSGD alias	Feynman Liang	2015-08-25	1	-1/+4
\| \| \| \| \| \| \| \| \|	* Adds doc for alias of runMiniBatchSGD documenting default value for convergeTol * Cleans up a note in code Author: Feynman Liang <fliang@databricks.com> Closes #8425 from feynmanliang/SPARK-9800.
*	[SPARK-10048] [SPARKR] Support arbitrary nested Java array in serde.	Sun Rui	2015-08-25	8	-127/+216
\| \| \| \| \| \| \| \| \| \| \|	This PR: 1. supports transferring arbitrary nested arrays from the JVM to the R side in SerDe; 2. based on 1, the collect() implementation is improved. Now it can support collecting data of complex types from a DataFrame. Author: Sun Rui <rui.sun@intel.com> Closes #8276 from sun-rui/SPARK-10048.
*	[SPARK-10231] [MLLIB] update @Since annotation for mllib.classification	Xiangrui Meng	2015-08-25	5	-21/+58
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Update `Since` annotation in `mllib.classification`: 1. add version to classes, objects, constructors, and public variables declared in constructors 2. correct some versions 3. remove `Since` on `toString` MechCoder dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8421 from mengxr/SPARK-10231 and squashes the following commits: b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification
*	[SPARK-10230] [MLLIB] Rename optimizeAlpha to optimizeDocConcentration	Feynman Liang	2015-08-25	2	-9/+9
\| \| \| \| \| \| \| \| \| \|	See [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770) CC jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8422 from feynmanliang/SPARK-10230.
*	[SPARK-8531] [ML] Update ML user guide for MinMaxScaler	Yuhao Yang	2015-08-25	1	-0/+71
\| \| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-8531 Update ML user guide for MinMaxScaler Author: Yuhao Yang <hhbyyh@gmail.com> Author: unknown <yuhaoyan@yuhaoyan-MOBL1.ccr.corp.intel.com> Closes #7211 from hhbyyh/minmaxdoc.