spark - Mirror of Apache Spark

	Commit message	Author	Age	Files	Lines
*	[SPARK-12299][CORE] Remove history serving functionality from Master	Bryan Cutler	2016-05-04	1	-5/+0
	Remove history server functionality from standalone Master. Previously, the Master process rebuilt a SparkUI once the application was completed, which sometimes caused problems, such as OOM, when the application event log was large (see SPARK-6270). Keeping this functionality out of the Master will help to simplify the process and increase stability. Testing for this change included running core unit tests and manually running an application on a standalone cluster to verify that it completed successfully and that the Master UI functioned correctly. Also added 2 unit tests to verify killing an application and driver from MasterWebUI makes the correct request to the Master. Author: Bryan Cutler <cutlerb@gmail.com> Closes #10991 from BryanCutler/remove-history-master-SPARK-12299.
*	[SPARK-4224][CORE][YARN] Support group acls	Dhruve Ashar	2016-05-04	3	-8/+57
	## What changes were proposed in this pull request? Currently only a list of users can be specified for view and modify acls. This change enables a group of admins/devs/users to be provisioned for viewing and modifying Spark jobs. Changes proposed in the fix: three new corresponding config entries have been added where the user can specify the groups to be given access: `spark.admin.acls.groups`, `spark.modify.acls.groups`, and `spark.ui.view.acls.groups`. New config entries were added because specifying the users and groups explicitly is a better and cleaner way compared to specifying them in the existing config entry using a delimiter. A generic trait has been introduced to provide the user-to-group mapping, which makes it pluggable to support a variety of mapping protocols, similar to the one used in Hadoop. A default Unix shell based implementation has been provided. A custom user-to-group mapping protocol can be specified and configured by the entry `spark.user.groups.mapping`. How the patch was tested: we ran different Spark jobs setting the config entries in combinations of admin, modify and ui acls. For modify acls we tried killing the job stages from the UI and using yarn commands. For view acls we tried accessing the UI tabs and the logs. Headless accounts were used to launch these jobs and different users tried to modify and view the jobs to ensure that the groups mapping applied correctly. Additional unit tests have been added without modifying the existing ones. These test different ways of setting the acls through configuration and/or API and validate the expected behavior. Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #12760 from dhruve/impr/SPARK-4224.
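The group ACL check described in SPARK-4224 can be pictured roughly as follows. This is a minimal sketch in plain Python, not Spark's implementation: `is_allowed`, `split_csv`, the sample users and groups, and the comma-separated value format are all assumptions made for illustration; only the config key names come from the commit message and the existing user ACL settings.

```python
def split_csv(value):
    """Split a comma-separated config value into a set of names (assumed format)."""
    return {item.strip() for item in value.split(",") if item.strip()}


def is_allowed(user, user_groups, acl_users, acl_groups):
    """Return True if the user, or any group the user belongs to, is listed."""
    return user in acl_users or bool(set(user_groups) & set(acl_groups))


# Hypothetical configuration values, for illustration only.
conf = {
    "spark.ui.view.acls": "alice",
    "spark.ui.view.acls.groups": "analysts,devs",
}

view_users = split_csv(conf["spark.ui.view.acls"])
view_groups = split_csv(conf["spark.ui.view.acls.groups"])

# bob is not listed as a user but belongs to the "devs" group.
print(is_allowed("bob", ["devs"], view_users, view_groups))
```

In this model the user-to-group lookup is passed in as `user_groups`; in the change itself that lookup is the pluggable mapping configured through `spark.user.groups.mapping`.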
*	[MINOR][DOC] Fixed some python snippets in mllib data types documentation.	Shuai Lin	2016-05-03	1	-3/+3
	## What changes were proposed in this pull request? Some Python snippets were using Scala imports and comments. ## How was this patch tested? Generated the docs locally with `SKIP_API=1 jekyll build` and viewed the changes in the browser. Author: Shuai Lin <linshuai2012@gmail.com> Closes #12869 from lins05/fix-mllib-python-snippets.
*	[MINOR][DOCS] Fix type Information in Quick Start and Programming Guide	Sandeep Singh	2016-05-03	2	-5/+5
	Author: Sandeep Singh <sandeep@techaddict.me> Closes #12841 from techaddict/improve_docs_1.
*	Fix reference to external metrics documentation	Ben McCann	2016-05-01	1	-1/+1
	Author: Ben McCann <benjamin.j.mccann@gmail.com> Closes #12833 from benmccann/patch-1.
*	[SPARK-13973][PYSPARK] Make pyspark fail noisily if IPYTHON or IPYTHON_OPTS ↵	pshearer	2016-04-30	1	-5/+6
	are set ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13973 Following discussion with srowen the IPYTHON and IPYTHON_OPTS variables are removed. If they are set in the user's environment, pyspark will not execute and prints an error message. Failing noisily will force users to remove these options and learn the new configuration scheme, which is much more sustainable and less confusing. ## How was this patch tested? Manual testing; set IPYTHON=1 and verified that the error message prints. Author: pshearer <pshearer@massmutual.com> Author: shearerp <shearerp@umich.edu> Closes #12528 from shearerp/master.
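The "fail noisily" behavior from SPARK-13973 amounts to an environment check before startup. The sketch below models it in Python purely for illustration; the real check lives in the pyspark launcher script, and the function name `check_pyspark_env` and the wording of the error are assumptions.

```python
def check_pyspark_env(env):
    """Raise if either of the removed variables is present in the environment."""
    offending = [name for name in ("IPYTHON", "IPYTHON_OPTS") if name in env]
    if offending:
        # Hypothetical message text; the launcher's actual wording may differ.
        raise RuntimeError(
            "Removed variable(s) set: " + ", ".join(offending)
            + ". Remove them from the environment and use the new configuration scheme."
        )


try:
    check_pyspark_env({"IPYTHON": "1"})
except RuntimeError as err:
    print(err)
```

Passing the environment in as a dictionary (for example `dict(os.environ)`) keeps the check easy to exercise without touching the real process environment.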
*	[SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR.	Sun Rui	2016-04-29	1	-0/+5
	## What changes were proposed in this pull request? dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame. The function signature is `dapply(df, function(localDF) {}, schema = NULL)`. R function input: local data.frame from the partition on the local node. R function output: local data.frame. Schema specifies the Row format of the resulting DataFrame. It must match the R function's output. If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such a resulting DataFrame can be processed by successive calls to dapply(). ## How was this patch tested? SparkR unit tests. Author: Sun Rui <rui.sun@intel.com> Author: Sun Rui <sunrui2016@gmail.com> Closes #12493 from sun-rui/SPARK-12919.
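The per-partition semantics of dapply() can be modeled without Spark. The following is only a conceptual sketch in Python: a "DataFrame" is represented as a list of partitions, each a list of row dictionaries, and `dapply_model` and `add_double` are hypothetical names, not SparkR functions.

```python
def dapply_model(partitions, func):
    """Apply func to each partition (a list of row dicts), keeping the partitioning."""
    return [func(list(partition)) for partition in partitions]


def add_double(rows):
    """Example per-partition function: local rows in, local rows out."""
    return [dict(row, doubled=row["x"] * 2) for row in rows]


partitions = [[{"x": 1}, {"x": 2}], [{"x": 3}]]
result = dapply_model(partitions, add_double)
print(result)
```

The function sees one whole partition at a time, which is what distinguishes this from a row-by-row map.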
*	[SPARK-14882][DOCS] Clarify that Spark can be cross-built for other Scala ↵	Sean Owen	2016-04-28	1	-1/+2
	versions ## What changes were proposed in this pull request? Add simple clarification that Spark can be cross-built for other Scala versions. ## How was this patch tested? Automated doc build. Author: Sean Owen <sowen@cloudera.com> Closes #12757 from srowen/SPARK-14882.
*	[SPARK-6735][YARN] Add window based executor failure tracking mechanism for ↵	jerryshao	2016-04-28	1	-0/+8
	long running service. This work is based on twinkle-sachdeva's proposal. In parallel to the existing mechanism for AM failures, this adds a similar mechanism for executor failure tracking, which is useful for long running Spark services to mitigate executor failure problems. Please help to review, tgravescs sryza and vanzin. Author: jerryshao <sshao@hortonworks.com> Closes #10241 from jerryshao/SPARK-6735.
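A window-based failure tracker of the kind SPARK-6735 describes could look like the sketch below. It is an illustration under assumptions, not the YARN allocator's code: the class name `FailureWindow`, its parameters, and the use of plain numeric timestamps are all hypothetical.

```python
from collections import deque


class FailureWindow:
    """Count failures that fall inside a sliding time window."""

    def __init__(self, max_failures, window_seconds):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()  # failure timestamps, oldest first

    def record_failure(self, now):
        self._failures.append(now)

    def failures_in_window(self, now):
        # Drop failures that are older than the window.
        while self._failures and self._failures[0] <= now - self.window_seconds:
            self._failures.popleft()
        return len(self._failures)

    def exceeded(self, now):
        return self.failures_in_window(now) > self.max_failures


tracker = FailureWindow(max_failures=2, window_seconds=60)
for t in (0, 10, 20):
    tracker.record_failure(t)
print(tracker.exceeded(30), tracker.exceeded(75))
```

For a long running service, old failures age out of the window, so a handful of failures spread over days does not accumulate toward the limit.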
*	[SPARK-14514][DOC] Add python example for VectorSlicer	Zheng RuiFeng	2016-04-26	1	-0/+8
	## What changes were proposed in this pull request? Add the missing python example for VectorSlicer ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12282 from zhengruifeng/vecslicer_pe.
*	Fix dynamic allocation docs to address cached data.	Michael Gummelt	2016-04-26	1	-2/+3
	## What changes were proposed in this pull request? Documentation changes ## How was this patch tested? No tests Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #12664 from mgummelt/fix-dynamic-docs.
*	[SPARK-14883][DOCS] Fix wrong R examples and make them up-to-date	Dongjoon Hyun	2016-04-24	2	-23/+18
	## What changes were proposed in this pull request? This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules. - Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later. - Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency - Fix datatypes in `sparkr.md`. - Update a data result in `sparkr.md`. - Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`. - Other minor syntax fixes and a typo. ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12649 from dongjoon-hyun/SPARK-14883.
*	[DOCS][MINOR] Screenshot + minor fixes to improve reading for accumulators	Jacek Laskowski	2016-04-24	2	-6/+12
	## What changes were proposed in this pull request? Added screenshot + minor fixes to improve reading ## How was this patch tested? Manual Author: Jacek Laskowski <jacek@japila.pl> Closes #12569 from jaceklaskowski/docs-accumulators.
*	[SPARK-13267][WEB UI] document the ?param arguments of the REST API; lift the…	Steve Loughran	2016-04-24	1	-16/+51
	Add to the REST API details on the ? args and examples from the test suite. I've used the existing table, adding all the fields to the second table. see [in the pr](https://github.com/steveloughran/spark/blob/history/SPARK-13267-doc-params/docs/monitoring.md). There's a slightly more sophisticated option: make the table 3 columns wide, and for all existing entries, have the initial `td` span 2 columns. The new entries would then have an empty 1st column, param in 2nd and text in 3rd, with any examples after a `br` entry. Author: Steve Loughran <stevel@hortonworks.com> Closes #11152 from steveloughran/history/SPARK-13267-doc-params.
*	[SPARK-12148][SPARKR] fix doc after renaming DataFrame to SparkDataFrame	felixcheung	2016-04-23	1	-2/+3
	## What changes were proposed in this pull request? Fixed inadvertent roxygen2 doc changes, added class name change to programming guide. Follow up of #12621 ## How was this patch tested? manually checked Author: felixcheung <felixcheung_m@hotmail.com> Closes #12647 from felixcheung/rdataframe.
*	[SPARK-13988][CORE] Make replaying event logs multi threaded in History ↵	Parth Brahmbhatt	2016-04-21	1	-0/+7
	server to ensure a single large log does not block other logs from being rendered. ## What changes were proposed in this pull request? The patch makes event log processing multi threaded. ## How was this patch tested? Existing tests pass; no new tests are needed to test the functionality as this is a perf improvement. I tested the patch locally by generating one big event log (big1), one small event log (small1) and again a big event log (big2). Without this patch the UI does not render any app for almost 30 seconds, and then big2 and small1 appear. After another 30 second delay, big1 finally shows up in the UI as well. With this change small1 shows up immediately and big1 and big2 come up in 30 seconds. Locally it also displays them in the correct order in the UI. Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com> Closes #11800 from Parth-Brahmbhatt/SPARK-13988.
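The idea behind SPARK-13988, replaying each event log on a worker thread so that one large log does not hold up the rest, can be sketched with a thread pool. This is a structural illustration in Python only; `replay`, `replay_all`, and the in-memory "logs" are hypothetical stand-ins for the History Server's replay logic, and CPU-bound Python threads will not show the same speedup as JVM threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def replay(log):
    """Stand-in for replaying one event log: count its events."""
    name, events = log
    return name, len(events)


def replay_all(logs, workers=4):
    """Replay logs concurrently and collect results as each one finishes."""
    finished = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(replay, log) for log in logs]
        for future in as_completed(futures):
            finished.append(future.result())
    return finished


logs = [
    ("big1", ["event"] * 100000),
    ("small1", ["event"] * 10),
    ("big2", ["event"] * 100000),
]
print(sorted(replay_all(logs)))
```

Because results are collected in completion order rather than submission order, a small log does not have to wait behind a large one submitted before it.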
*	[SPARK-14742][DOCS] Redirect spark-ec2 doc to new location	Sean Owen	2016-04-20	1	-0/+7
	## What changes were proposed in this pull request? Restore `ec2-scripts.md` as a redirect to amplab/spark-ec2 docs ## How was this patch tested? `jekyll build` and checked with the browser Author: Sean Owen <sowen@cloudera.com> Closes #12534 from srowen/SPARK-14742.
*	[SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF	Yuhao Yang	2016-04-20	1	-3/+12
	## What changes were proposed in this pull request? Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this. ## How was this patch tested? unit tests and doc generation Author: Yuhao Yang <hhbyyh@gmail.com> Closes #12454 from hhbyyh/tfdoc.
*	[SPARK-14667] Remove HashShuffleManager	Reynold Xin	2016-04-18	1	-9/+0
	## What changes were proposed in this pull request? The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager. ## How was this patch tested? Removed some tests related to the old manager. Author: Reynold Xin <rxin@databricks.com> Closes #12423 from rxin/SPARK-14667.
*	[SPARK-14515][DOC] Add python example for ChiSqSelector	Zheng RuiFeng	2016-04-18	1	-0/+8
	## What changes were proposed in this pull request? Add the missing python example for ChiSqSelector ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12283 from zhengruifeng/chi2_pe.
*	[SPARK-14601][DOC] Minor doc/usage changes related to removal of Spark assembly	Mark Grover	2016-04-14	2	-3/+3
	## What changes were proposed in this pull request? Removing references to assembly jar in documentation. Adding an additional (previously undocumented) usage of spark-submit to run examples. ## How was this patch tested? Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit. Author: Mark Grover <mark@apache.org> Closes #12365 from markgrover/spark-14601.
*	[SPARK-14572][DOC] Update config docs to allow -Xms in extraJavaOptions	Dhruve Ashar	2016-04-14	2	-5/+10
	## What changes were proposed in this pull request? The configuration docs are updated to reflect the changes introduced with [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This allows the user to specify initial heap memory settings through the extraJavaOptions for executor, driver and AM. ## How was this patch tested? The changes are tested in [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This is just documenting the changes made. Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #12333 from dhruve/doc/SPARK-14572.
*	[SPARK-13089][ML] [Doc] spark.ml Naive Bayes user guide and examples	Yuhao Yang	2016-04-13	1	-0/+34
	jira: https://issues.apache.org/jira/browse/SPARK-13089 Add section in ml-classification.md for NaiveBayes DataFrame-based API, plus example code (using include_example to clip code from examples/ folder files). Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11015 from hhbyyh/naiveBayesDoc.
*	[SPARK-14509][DOC] Add python CountVectorizerExample	Zheng RuiFeng	2016-04-13	1	-0/+9
	## What changes were proposed in this pull request? Add python CountVectorizerExample ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11917 from zhengruifeng/cv_pe.
*	[MINOR][DOCS] Fix wrong data types in JSON Datasets example.	Dongjoon Hyun	2016-04-11	1	-4/+4
	## What changes were proposed in this pull request? This PR fixes the `age` data types from `integer` to `long` in `SQL Programming Guide: JSON Datasets`. ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12290 from dongjoon-hyun/minor_fix_type_in_json_example.
*	[SPARK-14339][DOC] Add python examples for DCT,MinMaxScaler,MaxAbsScaler	Zheng RuiFeng	2016-04-09	1	-0/+24
	## What changes were proposed in this pull request? add three python examples ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12063 from zhengruifeng/dct_pe.
*	[DOCS][MINOR] Remove sentence about Mesos not supporting cluster mode.	Michael Gummelt	2016-04-07	1	-2/+1
	Docs change to remove the sentence about Mesos not supporting cluster mode. It was not. Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #12249 from mgummelt/fix-mesos-cluster-docs.
*	Better host description for multi-master mesos	Malte	2016-04-07	1	-1/+1
	## What changes were proposed in this pull request? Since not having the correct zk url causes job failure, the documentation should include all parameters. ## How was this patch tested? no tests necessary Author: Malte <elmalto@users.noreply.github.com> Closes #12218 from elmalto/patch-1.
*	[SPARK-10063][SQL] Remove DirectParquetOutputCommitter	Reynold Xin	2016-04-07	1	-33/+0
	## What changes were proposed in this pull request? This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer only works when there are no failures. If there are multiple attempts of the same task (e.g. speculation or task failures or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue. ## How was this patch tested? Removed the related tests also. Author: Reynold Xin <rxin@databricks.com> Closes #12229 from rxin/SPARK-10063.
*	[SPARK-14424][BUILD][DOCS] Update the build docs to switch from assembly to ↵	Holden Karau	2016-04-06	1	-10/+3
	package and add a no… ## What changes were proposed in this pull request? Change our build docs & shell scripts so that developers are aware of the change from "assembly" to "package" ## How was this patch tested? Manually ran ./bin/spark-shell after ./build/sbt assembly and verified error message printed, ran new suggested build target and verified ./bin/spark-shell runs after this. Author: Holden Karau <holden@pigscanfly.ca> Author: Holden Karau <holden@us.ibm.com> Closes #12197 from holdenk/SPARK-1424-spark-class-broken-fix-build-docs.
*	[SPARK-13063][YARN] Make the SPARK YARN STAGING DIR as configurable	Devaraj K	2016-04-05	1	-0/+7
	## What changes were proposed in this pull request? Made the Spark YARN staging directory configurable through the configuration 'spark.yarn.staging-dir'. ## How was this patch tested? I have verified it manually by running applications on YARN. If 'spark.yarn.staging-dir' is configured, its value is used as the staging directory; otherwise the default value is used, i.e. the file system’s home directory for the user. Author: Devaraj K <devaraj@apache.org> Closes #12082 from devaraj-kavali/SPARK-13063.
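The fallback rule described for the staging directory is simple enough to state as code. This is a small Python sketch under assumptions: `resolve_staging_dir` is a hypothetical helper, the configuration is modeled as a plain dictionary, and the paths are made-up examples; only the config key and the fallback to the user's home directory come from the commit message.

```python
def resolve_staging_dir(conf, home_dir):
    """Use the configured staging directory if set, else the user's home directory."""
    configured = conf.get("spark.yarn.staging-dir")
    return configured if configured else home_dir


# Hypothetical paths, for illustration only.
print(resolve_staging_dir({"spark.yarn.staging-dir": "/tmp/staging"}, "/user/alice"))
print(resolve_staging_dir({}, "/user/alice"))
```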
*	[SPARK-13579][BUILD] Stop building the main Spark assembly.	Marcelo Vanzin	2016-04-04	1	-6/+1
	This change modifies the "assembly/" module to just copy needed dependencies to its build directory, and modifies the packaging script to pick those up (and remove duplicate jars packaged in the examples module). I also made some minor adjustments to dependencies to remove some test jars from the final packaging, and remove jars that conflict with each other when packaged separately (e.g. servlet api). Also note that this change restores guava in applications' classpaths, even though it's still shaded inside Spark. This is now needed for the Hadoop libraries that are packaged with Spark, which now are not processed by the shade plugin. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11796 from vanzin/SPARK-13579.
*	[SPARK-14342][CORE][DOCS][TESTS] Remove straggler references to Tachyon	Liwei Lin	2016-04-02	2	-3/+3
	## What changes were proposed in this pull request? Straggler references to Tachyon were removed: - for docs, `tachyon` has been generalized as `off-heap memory`; - for Mesos test suites, the key-value `tachyon:true`/`tachyon:false` has been changed to `os:centos`/`os:ubuntu`, since `os` is an example constraint used by the [Mesos official docs](http://mesos.apache.org/documentation/attributes-resources/). ## How was this patch tested? Existing test suites. Author: Liwei Lin <lwlin7@gmail.com> Closes #12129 from lw-lin/tachyon-cleanup.
*	[SPARK-12343][YARN] Simplify Yarn client and client argument	jerryshao	2016-04-01	1	-0/+7
	## What changes were proposed in this pull request? Currently in Spark on YARN, configurations can be passed through SparkConf, env and command arguments, and some parts are duplicated, like client arguments and SparkConf. So this proposes to simplify the command arguments. ## How was this patch tested? This patch is tested manually with unit tests. CC vanzin tgravescs, please help to review this proposal. The original purpose of this JIRA is to remove `ClientArguments`, but some arguments like `--class` and `--arg` are not so easy to replace through refactoring, so here I remove most of the command line arguments and keep only the minimal set. Author: jerryshao <sshao@hortonworks.com> Closes #11603 from jerryshao/SPARK-12343.
*	[SPARK-14281][TESTS] Fix java8-tests and simplify their build	Josh Rosen	2016-03-31	1	-4/+4
	This patch fixes a compilation / build break in Spark's `java8-tests` and refactors their POM to simplify the build. See individual commit messages for more details. Author: Josh Rosen <joshrosen@databricks.com> Closes #12073 from JoshRosen/fix-java8-tests.
*	[Docs] Update monitoring.md to accurately describe the history server	Michael Gummelt	2016-03-31	1	-29/+29
	It looks like the docs were recently updated to reflect the History Server's support for incomplete applications, but they still had wording that suggested only completed applications were viewable. This fixes that. My editor also introduced several whitespace removal changes, that I hope are OK, as text files shouldn't have trailing whitespace. To verify they're purely whitespace changes, add `&w=1` to your browser address. If this isn't acceptable, let me know and I'll update the PR. I also didn't think this required a JIRA. Let me know if I should create one. Not tested. Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #12045 from mgummelt/update-history-docs.
*	[SPARK-13874][DOC] Remove docs of streaming-akka, streaming-zeromq, ↵	Shixiong Zhu	2016-03-26	1	-54/+13
	streaming-mqtt and streaming-twitter ## What changes were proposed in this pull request? This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects since I have already copied them to https://github.com/spark-packages. Also remove mqtt_wordcount.py that I forgot to remove previously. ## How was this patch tested? Jenkins PR Build. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11824 from zsxwing/remove-doc.
*	[SPARK-13017][DOCS] Replace example code in mllib-feature-extraction.md ↵	Xin Ren	2016-03-24	1	-361/+14
	using include_example Replace example code in mllib-feature-extraction.md using include_example https://issues.apache.org/jira/browse/SPARK-13017 The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. `{% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}` Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/TFIDFExample.scala` and pick code blocks marked "example" and replace code block in `{% highlight %}` in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 Author: Xin Ren <iamshrek@126.com> Closes #11142 from keypointt/SPARK-13017.
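The "pick code blocks marked example" step can be sketched as a small extraction routine. The real include_example tag is a Jekyll plugin, so this Python version only illustrates the idea; the function name `extract_example` is hypothetical, and the marker strings `$example on$` / `$example off$` are an assumption about how the blocks are marked, not something stated in the commit message.

```python
def extract_example(source, on="$example on$", off="$example off$"):
    """Return only the lines that sit between the on/off marker comments."""
    keep = False
    selected = []
    for line in source.splitlines():
        if on in line:
            keep = True
            continue
        if off in line:
            keep = False
            continue
        if keep:
            selected.append(line)
    return "\n".join(selected)


# Hypothetical example file content.
sample = """import org.apache.spark.SparkContext
// $example on$
val tf = hashingTF.transform(documents)
// $example off$
sc.stop()"""

print(extract_example(sample))
```

Keeping setup and teardown lines outside the markers lets the full file compile and run in the examples module while the guide shows only the relevant part.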
*	[SPARK-13019][DOCS] fix for scala-2.10 build: Replace example code in ↵	Xin Ren	2016-03-24	1	-382/+56
	mllib-statistics.md using include_example ## What changes were proposed in this pull request? This PR for ticket SPARK-13019 is based on the previous PR (https://github.com/apache/spark/pull/11108). Since that PR (https://github.com/apache/spark/pull/11108) is breaking the scala-2.10 build, more work is needed to fix build errors. What is new in this PR is adding the keyword argument for 'fractions': ` val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)` ` val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)` I reopened the ticket on JIRA but sorry, I don't know how to reopen a GitHub pull request, so I am just submitting a new pull request. ## How was this patch tested? Manual build testing on local machine, build based on scala-2.10. Author: Xin Ren <iamshrek@126.com> Closes #11901 from keypointt/SPARK-13019.
*	Revert "[SPARK-13019][DOCS] Replace example code in mllib-statistics.md ↵	Xiangrui Meng	2016-03-21	1	-56/+382
	using include_example" This reverts commit 1af8de200c4d3357bcb09e7bbc6deece00e885f2.
*	[SPARK-13019][DOCS] Replace example code in mllib-statistics.md using ↵	Xin Ren	2016-03-21	1	-382/+56
	include_example https://issues.apache.org/jira/browse/SPARK-13019 The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. `{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}` Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala` and pick code blocks marked "example" and replace code block in `{% highlight %}` in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 Author: Xin Ren <iamshrek@126.com> Closes #11108 from keypointt/SPARK-13019.
*	[MINOR][DOCS] Update build descriptions and commands	Dongjoon Hyun	2016-03-18	4	-10/+13
	## What changes were proposed in this pull request? This PR updates Scala and Hadoop versions in the build description and commands in `Building Spark` documents. ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11838 from dongjoon-hyun/fix_doc_building_spark.
*	[MINOR][DOC] Add JavaStreamingTestExample	Zheng RuiFeng	2016-03-17	1	-0/+7
	## What changes were proposed in this pull request? Add the java example of StreamingTest ## How was this patch tested? manual tests in CLI: bin/run-example mllib.JavaStreamingTestExample dataDir 5 100 Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11776 from zhengruifeng/streaming_je.
*	[SPARK-12855][MINOR][SQL][DOC][TEST] remove spark.sql.dialect from doc and test	Daoyuan Wang	2016-03-16	1	-7/+0
	## What changes were proposed in this pull request? Since the developer API of the pluggable parser has been removed in #10801, docs should be updated accordingly. ## How was this patch tested? This patch will not affect the real code path. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #11758 from adrian-wang/spark12855.
*	[SPARK-13942][CORE][DOCS] Remove Shark-related docs for 2.x	Dongjoon Hyun	2016-03-16	1	-45/+0
	## What changes were proposed in this pull request? `Shark` was merged into `Spark SQL` in [July 2014](https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html). The following seem to be the only legacy. For Spark 2.x, we had better clean up those docs. Migration Guide ``` - ## Migration Guide for Shark Users - ... - ### Scheduling - ... - ### Reducer number - ... - ### Caching ``` ## How was this patch tested? Pass the Jenkins test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11770 from dongjoon-hyun/SPARK-13942.
*	[SPARK-13888][DOC] Remove Akka Receiver doc and refer to the DStream Akka ↵	Shixiong Zhu	2016-03-14	2	-78/+7
	project ## What changes were proposed in this pull request? I have copied the docs of Streaming Akka to https://github.com/spark-packages/dstream-akka/blob/master/README.md so we can remove them from Spark now. ## How was this patch tested? Only document changes. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Shixiong Zhu <shixiong@databricks.com> Closes #11711 from zsxwing/remove-akka-doc.
*	[MINOR][DOCS] Added Missing back slashes	Daniel Santana	2016-03-14	1	-5/+5
	## What changes were proposed in this pull request? When studying Spark, many users just copy examples from the documentation and paste them into their terminals, and because of that the missing backslashes lead them to run into shell errors. The added backslashes avoid that problem for Spark users with that behavior. ## How was this patch tested? I generated the documentation locally using jekyll and checked the generated pages. Author: Daniel Santana <mestresan@gmail.com> Closes #11699 from danielsan/master.
*	[SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> ↵	Sean Owen	2016-03-13	1	-2/+4
	byte[] conversions (and remaining Coverity items) ## What changes were proposed in this pull request? - Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8 - Same for `InputStreamReader` and `OutputStreamWriter` constructors - Standardizes on UTF-8 everywhere - Standardizes specifying the encoding with `StandardCharsets.UTF_8`, not the Guava constant or "UTF-8" (which means handling `UnsupportedEncodingException`) - (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit https://github.com/srowen/spark/commit/1deecd8d9ca986d8adb1a42d315890ce5349d29c ) ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11657 from srowen/SPARK-13823.
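The reason for always naming the charset can be shown with a short round trip. The change itself is about Java and Scala code; the snippet below is a Python analogue offered only as an illustration, with a made-up sample string: bytes written as UTF-8 come back intact only when they are decoded as UTF-8.

```python
text = "naïve café"

# Name the encoding explicitly on both sides of the conversion.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

# Decoding the same bytes with a different charset silently garbles the text,
# which is what can happen when each side relies on its own platform default.
garbled = encoded.decode("latin-1")

print(decoded == text, garbled == text)
```

The same reasoning applies to readers and writers: passing the encoding explicitly, rather than inheriting a platform default, keeps behavior identical across machines.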
*	[SPARK-13577][YARN] Allow Spark jar to be multiple jars, archive.	Marcelo Vanzin	2016-03-11	1	-7/+18
	In preparation for the demise of assemblies, this change allows the YARN backend to use multiple jars and globs as the "Spark jar". The config option has been renamed to "spark.yarn.jars" to reflect that. A second option "spark.yarn.archive" was also added; if set, this takes precedence and uploads an archive expected to contain the jar files with the Spark code and its dependencies. Existing deployments should keep working, mostly. This change drops support for the "SPARK_JAR" environment variable, and also does not fall back to using "jarOfClass" if no configuration is set, falling back to finding files under SPARK_HOME instead. This should be fine since "jarOfClass" probably wouldn't work unless you were using spark-submit anyway. Tested with the unit tests, and trying the different config options on a YARN cluster. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11500 from vanzin/SPARK-13577.
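The precedence described here (archive first, then the jar list, then files under SPARK_HOME) can be written as a small resolution function. This is a Python sketch under assumptions, not the YARN client's code: `resolve_spark_jars` is a hypothetical helper, the configuration is a plain dictionary, the jar list is assumed to be comma-separated, and the paths are made-up examples.

```python
def resolve_spark_jars(conf, spark_home):
    """Pick the source of Spark jars following the precedence in the commit message."""
    archive = conf.get("spark.yarn.archive")
    if archive:
        # The archive option takes precedence when set.
        return "archive", [archive]
    jars = conf.get("spark.yarn.jars")
    if jars:
        # Assumed comma-separated list; entries may be globs.
        return "jars", [entry.strip() for entry in jars.split(",") if entry.strip()]
    # No configuration: fall back to looking under SPARK_HOME.
    return "spark_home", [spark_home]


conf = {
    "spark.yarn.jars": "hdfs:///spark/jars/*.jar, hdfs:///extra/dep.jar",
    "spark.yarn.archive": "hdfs:///spark/spark-libs.zip",
}
print(resolve_spark_jars(conf, "/opt/spark"))
```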
*	[SPARK-13512][ML] add example and doc for MaxAbsScaler	Yuhao Yang	2016-03-11	1	-0/+32
	## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-13512 Add example and doc for ml.feature.MaxAbsScaler. ## How was this patch tested? unit tests Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11392 from hhbyyh/maxabsdoc.