aboutsummaryrefslogtreecommitdiff
path: root/docs
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-17089][DOCS] Remove api doc link for mapReduceTriplets operatorsandy2016-08-161-3/+2
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Remove the api doc link for mapReduceTriplets operator because in latest api they are remove so when user link to that api they will not get mapReduceTriplets there so its more good to remove than confuse the user. ## How was this patch tested? Run all the test cases ![screenshot from 2016-08-16 23-08-25](https://cloud.githubusercontent.com/assets/8075390/17709393/8cfbf75a-6406-11e6-98e6-38f7b319d833.png) Author: sandy <phalodi@gmail.com> Closes #14669 from phalodi/SPARK-17089.
* [MINOR][DOC] Correct code snippet results in quick start documentationlinbojin2016-08-161-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? As README.md file is updated over time. Some code snippet outputs are not correct based on new README.md file. For example: ``` scala> textFile.count() res0: Long = 126 ``` should be ``` scala> textFile.count() res0: Long = 99 ``` This pr is to add comments to point out this problem so that new spark learners have a correct reference. Also, fixed a samll bug, inside current documentation, the outputs of linesWithSpark.count() without and with cache are different (one is 15 and the other is 19) ``` scala> val linesWithSpark = textFile.filter(line => line.contains("Spark")) linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:27 scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"? res3: Long = 15 ... scala> linesWithSpark.cache() res7: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:27 scala> linesWithSpark.count() res8: Long = 19 ``` ## How was this patch tested? manual test: run `$ SKIP_API=1 jekyll serve --watch` Author: linbojin <linbojin203@gmail.com> Closes #14645 from linbojin/quick-start-documentation.
* [SPARK-12370][DOCUMENTATION] Documentation should link to examples …Jagadeesan2016-08-136-28/+28
| | | | | | | | | | | | ## What changes were proposed in this pull request? When documentation is built is should reference examples from the same build. There are times when the docs have links that point to files in the GitHub head which may not be valid on the current release. Changed that in URLs to make them point to the right tag in git using ```SPARK_VERSION_SHORT``` …from its own release version] [Streaming programming guide] Author: Jagadeesan <as2@us.ibm.com> Closes #14596 from jagadeesanas2/SPARK-12370.
* [DOC] add config option spark.ui.enabled into documentWeichenXu2016-08-121-0/+7
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The configuration doc lost the config option `spark.ui.enabled` (default value is `true`) I think this option is important because many cases we would like to turn it off. so I add it. ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14604 from WeichenXu123/add_doc_param_spark_ui_enabled.
* [MINOR][DOC] Fix style in examples across documentationhyukjinkwon2016-08-125-47/+47
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes the documentation as below: - Python has 4 spaces and Java and Scala has 2 spaces (See https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide). - Avoid excessive parentheses and curly braces for anonymous functions. (See https://github.com/databricks/scala-style-guide#anonymous) ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes #14593 from HyukjinKwon/minor-documentation.
* [SPARK-13081][PYSPARK][SPARK_SUBMIT] Allow set pythonExec of driver and ↵Jeff Zhang2016-08-111-2/+19
| | | | | | | | | | | | executor through conf… Before this PR, user have to export environment variable to specify the python of driver & executor which is not so convenient for users. This PR is trying to allow user to specify python through configuration "--pyspark-driver-python" & "--pyspark-executor-python" Manually test in local & yarn mode for pyspark-shell and pyspark batch mode. Author: Jeff Zhang <zjffdu@apache.org> Closes #13146 from zjffdu/SPARK-13081.
* [SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and ↵hyukjinkwon2016-08-111-101/+101
| | | | | | | | | | | | | | | | | | | | | | | | | | | indentation in documentation ## What changes were proposed in this pull request? Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments. This PR fixes three things below: - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java. - Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples. - Fix `StructuredNetworkWordCountWindowed` and `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially). ## How was this patch tested? N/A Closes https://github.com/apache/spark/pull/14491 Author: hyukjinkwon <gurwls223@gmail.com> Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local> Closes #14564 from HyukjinKwon/SPARK-16886.
* Correct example value for spark.ssl.YYY.XXX settingsAndrew Ash2016-08-111-2/+4
| | | | | | | | | | Docs adjustment to: - link to other relevant section of docs - correct statement about the only value when actually other values are supported Author: Andrew Ash <andrew@andrewash.com> Closes #14581 from ash211/patch-10.
* [SPARK-17010][MINOR][DOC] Wrong description in memory management documentTao Wang2016-08-101-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? change the remain percent to right one. ## How was this patch tested? Manual review Author: Tao Wang <wangtao111@huawei.com> Closes #14591 from WangTaoTheTonic/patch-1.
* [SPARK-14743][YARN] Add a configurable credential manager for Spark running ↵jerryshao2016-08-101-8/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | on YARN ## What changes were proposed in this pull request? Add a configurable token manager for Spark on running on yarn. ### Current Problems ### 1. Supported token provider is hard-coded, currently only hdfs, hbase and hive are supported and it is impossible for user to add new token provider without code changes. 2. Also this problem exits in timely token renewer and updater. ### Changes In This Proposal ### In this proposal, to address the problems mentioned above and make the current code more cleaner and easier to understand, mainly has 3 changes: 1. Abstract a `ServiceTokenProvider` as well as `ServiceTokenRenewable` interface for token provider. Each service wants to communicate with Spark through token way needs to implement this interface. 2. Provide a `ConfigurableTokenManager` to manage all the register token providers, also token renewer and updater. Also this class offers the API for other modules to obtain tokens, get renewal interval and so on. 3. Implement 3 built-in token providers `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider` to keep the same semantics as supported today. Whether to load in these built-in token providers is controlled by configuration "spark.yarn.security.tokens.${service}.enabled", by default for all the built-in token providers are loaded. ### Behavior Changes ### For the end user there's no behavior change, we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive). For user implemented token provider (assume the name of token provider is "test") needs to add into this class should have two configurations: 1. `spark.yarn.security.tokens.test.enabled` to true 2. `spark.yarn.security.tokens.test.class` to the full qualified class name. So we still keep the same semantics as current code while add one new configuration. ### Current Status ### - [x] token provider interface and management framework. - [x] implement built-in token providers (hdfs, hbase, hive). - [x] Coverage of unit test. - [x] Integrated test with security cluster. ## How was this patch tested? Unit test and integrated test. Please suggest and review, any comment is greatly appreciated. Author: jerryshao <sshao@hortonworks.com> Closes #14065 from jerryshao/SPARK-16342.
* [SPARK-16927][SPARK-16923] Override task properties at dispatcher.Timothy Chen2016-08-101-0/+11
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? - enable setting default properties for all jobs submitted through the dispatcher [SPARK-16927] - remove duplication of conf vars on cluster submitted jobs [SPARK-16923] (this is a small fix, so I'm including in the same PR) ## How was this patch tested? mesos/spark integration test suite manual testing Author: Timothy Chen <tnachen@gmail.com> Closes #14511 from mgummelt/override-props.
* [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY configurableJosh Rosen2016-08-091-0/+15
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration. **Background:** This application-killing was added in 6b5980da796e0204a7735a31fb454f312bc9daac (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path. **Motivation for making this configurable:** Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running then the Spark master would kill that application, but this behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances then it's possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use-case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative. I'd like to merge this patch into master, branch-2.0, and branch-1.6. ## How was this patch tested? I tested this manually using `spark-shell` and `local-cluster` mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now. Author: Josh Rosen <joshrosen@databricks.com> Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.
* [SPARK-16809] enable history server links in dispatcher UIMichael Gummelt2016-08-091-0/+10
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Links the Spark Mesos Dispatcher UI to the history server UI - adds spark.mesos.dispatcher.historyServer.url - explicitly generates frameworkIDs for the launched drivers, so the dispatcher knows how to correlate drivers and frameworkIDs ## How was this patch tested? manual testing Author: Michael Gummelt <mgummelt@mesosphere.io> Author: Sergiusz Urbaniak <sur@mesosphere.io> Closes #14414 from mgummelt/history-server.
* Update docs to include SASL support for RPCMichael Gummelt2016-08-082-4/+6
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Update docs to include SASL support for RPC Evidence: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L63 ## How was this patch tested? Docs change only Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14549 from mgummelt/sasl.
* [SPARK-16911] Fix the links in the programming guideShivansh2016-08-073-106/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix the broken links in the programming guide of the Graphx Migration and understanding closures ## How was this patch tested? By running the test cases and checking the links. Author: Shivansh <shiv4nsh@gmail.com> Closes #14503 from shiv4nsh/SPARK-16911.
* [SPARK-16870][DOCS] Summary:add "spark.sql.broadcastTimeout" into ↵keliang2016-08-071-0/+9
| | | | | | | | | | | | | | | | | | | | | | | docs/sql-programming-gu… ## What changes were proposed in this pull request? default value for spark.sql.broadcastTimeout is 300s. and this property do not show in any docs of spark. so add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people to how to fix this timeout error when it happenned ## How was this patch tested? not need (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) …ide.md JIRA_ID:SPARK-16870 Description:default value for spark.sql.broadcastTimeout is 300s. and this property do not show in any docs of spark. so add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people to how to fix this timeout error when it happenned Test:done Author: keliang <keliang@cmss.chinamobile.com> Closes #14477 from biglobster/keliang.
* [SPARK-16932][DOCS] Changed programming guide to not reference old ↵Bryan Cutler2016-08-071-14/+27
| | | | | | | | | | | | | | | accumulator API in Scala ## What changes were proposed in this pull request? In the programming guide, the accumulator section mixes up both the old and new APIs causing it to be confusing. This is not necessary for Scala, so all references to the old API are removed. For Java, it is somewhat fixed up except for the example of a custom accumulator because I don't think an API exists yet. Python has not currently implemented the new API. ## How was this patch tested? built doc locally Author: Bryan Cutler <cutlerb@gmail.com> Closes #14516 from BryanCutler/fixup-accumulator-programming-guide-SPARK-15702.
* document that Mesos cluster mode supports pythonMichael Gummelt2016-08-071-1/+2
| | | | | | | | update docs to be consistent with SPARK-14645 https://issues.apache.org/jira/browse/SPARK-14645 Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14514 from mgummelt/fix-docs.
* [SPARK-16312][STREAMING][KAFKA][DOC] Doc for Kafka 0.10 integrationcody koeninger2016-08-054-207/+452
| | | | | | | | | | | | ## What changes were proposed in this pull request? Doc for the Kafka 0.10 integration ## How was this patch tested? Scala code examples were taken from my example repo, so hopefully they compile. Author: cody koeninger <cody@koeninger.org> Closes #14385 from koeninger/SPARK-16312.
* [SPARK-15074][SHUFFLE] Cache shuffle index file to speedup shuffle fetchSital Kedia2016-08-041-0/+7
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Shuffle fetch on large intermediate dataset is slow because the shuffle service open/close the index file for each shuffle fetch. This change introduces a cache for the index information so that we can avoid accessing the index files for each block fetch ## How was this patch tested? Tested by running a job on the cluster and the shuffle read time was reduced by 50%. Author: Sital Kedia <skedia@fb.com> Closes #12944 from sitalkedia/shuffle_service.
* [SPARK-16822][DOC] Support latex in scaladoc.Shuai Lin2016-08-021-0/+20
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Support using latex in scaladoc by adding MathJax javascript to the js template. ## How was this patch tested? Generated scaladoc. Preview: - LogisticGradient: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) - MinMaxScaler: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) Author: Shuai Lin <linshuai2012@gmail.com> Closes #14438 from lins05/spark-16822-support-latex-in-scaladoc.
* [SPARK-16734][EXAMPLES][SQL] Revise examples of all language bindingsCheng Lian2016-08-021-42/+14
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR makes various minor updates to examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (JDBC example in Scala/Java/Python) are also fixed. ## How was this patch tested? Manually tested. Author: Cheng Lian <lian@databricks.com> Closes #14368 from liancheng/revise-examples.
* [SPARK-16761][DOC][ML] Fix doc link in docs/ml-guide.mdSun Dapeng2016-07-291-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix the link at http://spark.apache.org/docs/latest/ml-guide.html. ## How was this patch tested? None Author: Sun Dapeng <sdp@apache.org> Closes #14386 from sundapeng/doclink.
* [SPARK-16637] Unified containerizerMichael Gummelt2016-07-292-1/+11
| | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? New config var: spark.mesos.docker.containerizer={"mesos","docker" (default)} This adds support for running docker containers via the Mesos unified containerizer: http://mesos.apache.org/documentation/latest/container-image/ The benefit is losing the dependency on `dockerd`, and all the costs which it incurs. I've also updated the supported Mesos version to 0.28.2 for support of the required protobufs. This is blocked on: https://github.com/apache/spark/pull/14167 ## How was this patch tested? - manually testing jobs submitted with both "mesos" and "docker" settings for the new config var. - spark/mesos integration test suite Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14275 from mgummelt/unified-containerizer.
* [MINOR][DOC] missing keyword newBartek Wiśniewski2016-07-271-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? added missing keyword for java example ## How was this patch tested? wasn't Author: Bartek Wiśniewski <wedi@Ava.local> Closes #14381 from wedi-dev/quickfix/missing_keyword.
* [SPARK-5847][CORE] Allow for configuring MetricsSystem's use of app ID to ↵Mark Grover2016-07-271-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | namespace all metrics ## What changes were proposed in this pull request? Adding a new property to SparkConf called spark.metrics.namespace that allows users to set a custom namespace for executor and driver metrics in the metrics systems. By default, the root namespace used for driver or executor metrics is the value of `spark.app.id`. However, often times, users want to be able to track the metrics across apps for driver and executor metrics, which is hard to do with application ID (i.e. `spark.app.id`) since it changes with every invocation of the app. For such use cases, users can set the `spark.metrics.namespace` property to another spark configuration key like `spark.app.name` which is then used to populate the root namespace of the metrics system (with the app name in our example). `spark.metrics.namespace` property can be set to any arbitrary spark property key, whose value would be used to set the root namespace of the metrics system. Non driver and executor metrics are never prefixed with `spark.app.id`, nor does the `spark.metrics.namespace` property have any such affect on such metrics. ## How was this patch tested? Added new unit tests, modified existing unit tests. Author: Mark Grover <mark@apache.org> Closes #14270 from markgrover/spark-5847.
* [SPARK-15271][MESOS] Allow force pulling executor docker imagesPhilipp Hoffmann2016-07-262-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Mesos agents by default will not pull docker images which are cached locally already. In order to run Spark executors from mutable tags like `:latest` this commit introduces a Spark setting (`spark.mesos.executor.docker.forcePullImage`). Setting this flag to true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous implementation and Mesos' default behaviour). Author: Philipp Hoffmann <mail@philipphoffmann.de> Closes #14348 from philipphoffmann/force-pull-image.
* Fix description of spark.speculation.quantileNicholas Brown2016-07-251-1/+1
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Minor doc fix regarding the spark.speculation.quantile configuration parameter. It incorrectly states it should be a percentage, when it should be a fraction. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) I tried building the documentation but got some unidoc errors. I also got them when building off origin/master, so I don't think I caused that problem. I did run the web app and saw the changes reflected as expected. Author: Nicholas Brown <nbrown@adroitdigital.com> Closes #14352 from nwbvt/master.
* [SQL][DOC] Fix a default name for parquet compressionTakeshi YAMAMURO2016-07-251-1/+1
| | | | | | | | | ## What changes were proposed in this pull request? This pr is to fix a wrong description for parquet default compression. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #14351 from maropu/FixParquetDoc.
* Revert "[SPARK-15271][MESOS] Allow force pulling executor docker images"Josh Rosen2016-07-252-13/+1
| | | | This reverts commit 978cd5f125eb5a410bad2e60bf8385b11cf1b978.
* [SPARK-16485][DOC][ML] Fixed several inline formatting in ml features docShuai Lin2016-07-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fixed several inline formatting in ml features doc. Before: <img width="475" alt="screen shot 2016-07-14 at 12 24 57 pm" src="https://cloud.githubusercontent.com/assets/717363/16827974/1e1b6e04-49be-11e6-8aa9-4a0cb6cd3b4e.png"> After: <img width="404" alt="screen shot 2016-07-14 at 12 25 48 pm" src="https://cloud.githubusercontent.com/assets/717363/16827976/2576510a-49be-11e6-96dd-92a1fa464d36.png"> ## How was this patch tested? Genetate the docs locally by `SKIP_API=1 jekyll build` and view it in the browser. Author: Shuai Lin <linshuai2012@gmail.com> Closes #14194 from lins05/fix-docs-formatting.
* [SPARK-15271][MESOS] Allow force pulling executor docker imagesPhilipp Hoffmann2016-07-252-1/+13
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Mesos agents by default will not pull docker images which are cached locally already. In order to run Spark executors from mutable tags like `:latest` this commit introduces a Spark setting `spark.mesos.executor.docker.forcePullImage`. Setting this flag to true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous implementation and Mesos' default behaviour). ## How was this patch tested? I ran a sample application including this change on a Mesos cluster and verified the correct behaviour for both, with and without, force pulling the executor image. As expected the image is being force pulled if the flag is set. Author: Philipp Hoffmann <mail@philipphoffmann.de> Closes #13051 from philipphoffmann/force-pull-image.
* [SPARKR][DOCS] fix broken url in docFelix Cheung2016-07-251-54/+53
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix broken url, also, sparkR.session.stop doc page should have it in the header, instead of saying "sparkR.stop" ![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png) Data type section is in the middle of a list of gapply/gapplyCollect subsections: ![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png) ## How was this patch tested? manual test Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14329 from felixcheung/rdoclinkfix.
* [SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python ↵Cheng Lian2016-07-231-216/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | language binding This PR is based on PR #14098 authored by wangmiao1981. ## What changes were proposed in this pull request? This PR replaces the original Python Spark SQL example file with the following three files: - `sql/basic.py` Demonstrates basic Spark SQL features. - `sql/datasource.py` Demonstrates various Spark SQL data sources. - `sql/hive.py` Demonstrates Spark SQL Hive interaction. This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag. ## How was this patch tested? Manually tested. Author: wm624@hotmail.com <wm624@hotmail.com> Author: Cheng Lian <lian@databricks.com> Closes #14317 from liancheng/py-examples-update.
* [SPARK-16650] Improve documentation of spark.task.maxFailuresTom Graves2016-07-221-1/+3
| | | | | | | | | | Clarify documentation on spark.task.maxFailures No tests run as its documentation Author: Tom Graves <tgraves@yahoo-inc.com> Closes #14287 from tgravescs/SPARK-16650.
* [SPARK-16194] Mesos Driver env varsMichael Gummelt2016-07-211-0/+10
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Added new configuration namespace: spark.mesos.env.* This allows a user submitting a job in cluster mode to set arbitrary environment variables on the driver. spark.mesos.driverEnv.KEY=VAL will result in the env var "KEY" being set to "VAL" I've also refactored the tests a bit so we can re-use code in MesosClusterScheduler. And I've refactored the command building logic in `buildDriverCommand`. Command builder values were very intertwined before, and now it's easier to determine exactly how each variable is set. ## How was this patch tested? unit tests Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14167 from mgummelt/driver-env-vars.
* [MINOR][DOCS][STREAMING] Minor docfix schema of csv rather than parquet in ↵Holden Karau2016-07-211-2/+2
| | | | | | | | | | | | | | | comments ## What changes were proposed in this pull request? Fix parquet to csv in a comment to match the input format being read. ## How was this patch tested? N/A (doc change only) Author: Holden Karau <holden@us.ibm.com> Closes #14274 from holdenk/minor-docfix-schema-of-csv-rather-than-parquet.
* [SPARK-15951] Change Executors Page to use datatables to support sorting ↵Kishor Patil2016-07-201-1/+5
| | | | | | | | | | | | | | | | | | | columns and searching 1. Create the executorspage-template.html for displaying application information in datables. 2. Added REST API endpoint "allexecutors" to be able to see all executors created for particular job. 3. The executorspage.js uses jQuery to access the data from /api/v1/applications/appid/allexecutors REST API, and use DataTable to display executors for the application. It also, generates summary of dead/live and total executors created during life of the application. 4. Similar changes applicable to Executors Page on history server for a given application. Snapshots for how it looks like now: <img width="938" alt="screen shot 2016-06-14 at 2 45 44 pm" src="https://cloud.githubusercontent.com/assets/6090397/16060092/ad1de03a-324b-11e6-8469-9eaa3f2548b5.png"> New Executors Page screenshot looks like this: <img width="1436" alt="screen shot 2016-06-15 at 10 12 01 am" src="https://cloud.githubusercontent.com/assets/6090397/16085514/ee7004f0-32e1-11e6-9340-33d91e407f2b.png"> Author: Kishor Patil <kpatil@yahoo-inc.com> Closes #13670 from kishorvpatil/execTemplates.
* [SPARK-15923][YARN] Spark Application rest api returns 'no such app: …Weiqing Yang2016-07-201-3/+4
| | | | | | | | | | | ## What changes were proposed in this pull request? Update monitoring.md. …<appId>' Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #14163 from Sherry302/master.
* [SPARK-16568][SQL][DOCUMENTATION] update sql programming guide refreshTable ↵WeichenXu2016-07-191-2/+2
| | | | | | | | | | | | | | | | | | API in python code ## What changes were proposed in this pull request? update `refreshTable` API in python code of the sql-programming-guide. This API is added in SPARK-15820 ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14220 from WeichenXu123/update_sql_doc_catalog.
* [MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammarAhmed Mahran2016-07-191-83/+71
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Minor fixes correcting some typos, punctuations, grammar. Adding more anchors for easy navigation. Fixing minor issues with code snippets. ## How was this patch tested? `jekyll serve` Author: Ahmed Mahran <ahmed.mahran@mashin.io> Closes #14234 from ahmed-mahran/b-struct-streaming-docs.
* [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example updateCheng Lian2016-07-181-29/+28
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR moves one and the last hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that all "Sql" in the file names are updated to "SQL". ## How was this patch tested? Manually verified the generated HTML page. Author: Cheng Lian <lian@databricks.com> Closes #14245 from liancheng/minor-scala-example-update.
* [SPARKR][DOCS] minor code sample update in R programming guideFelix Cheung2016-07-181-2/+2
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix code style from ad hoc review of RC4 doc ## How was this patch tested? manual shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14250 from felixcheung/rdocs2rc4.
* [SPARK-16112][SPARKR] Programming guide for gapply/gapplyCollectNarine Kokhlikyan2016-07-161-4/+134
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Updates programming guide for spark.gapply/spark.gapplyCollect. Similar to other examples I used `faithful` dataset to demonstrate gapply's functionality. Please, let me know if you prefer another example. ## How was this patch tested? Existing test cases in R Author: Narine Kokhlikyan <narine@slice.com> Closes #14090 from NarineK/gapplyProgGuide.
* [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guideJoseph K. Bradley2016-07-1538-742/+807
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Made DataFrame-based API primary * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API * **Reviewers: please check this carefully** * (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix * Moved migration guide to ml-guide from mllib-guide * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides * **Reviewers**: I did not change any of the content of the migration guides. Reorganized DataFrame-based guide: * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc. * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html * **Reviewers**: I did not change the content of these guides, except some intro text. * Sidebar remains the same, but with pipeline and tuning sections added Other: * ml-classification-regression.html: Moved text about linear methods to new section in page ## How was this patch tested? Generated docs locally Author: Joseph K. Bradley <joseph@databricks.com> Closes #14213 from jkbradley/ml-guide-2.0.
* [SPARK-16555] Work around Jekyll error-handling bug which led to silent failuresJosh Rosen2016-07-141-1/+9
| | | | | | | | | | | | | | | | | | | If a custom Jekyll template tag throws Ruby's equivalent of a "file not found" exception, then Jekyll will stop the doc building process but will exit with a successful status, causing our doc publishing jobs to silently fail. This is caused by https://github.com/jekyll/jekyll/issues/5104, a case of bad error-handling logic in Jekyll. This patch works around this by updating our `include_example.rb` plugin to catch the exception and exit rather than allowing it to bubble up and be ignored by Jekyll. I tested this manually with ``` rm ./examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala cd docs SKIP_API=1 jekyll build echo $? ``` Author: Josh Rosen <joshrosen@databricks.com> Closes #14209 from JoshRosen/fix-doc-building.
* [SPARK-16553][DOCS] Fix SQL example file name in docsShivaram Venkataraman2016-07-141-1/+1
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fixes a typo in the sql programming guide ## How was this patch tested? Building docs locally (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #14208 from shivaram/spark-sql-doc-fix.
* [SPARK-16505][YARN] Optionally propagate error during shuffle service startup.Marcelo Vanzin2016-07-142-12/+32
| | | | | | | | | | | This prevents the NM from starting when something is wrong, which would lead to later errors which are confusing and harder to debug. Added a unit test to verify startup fails if something is wrong. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #14162 from vanzin/SPARK-16505.
* [SPARKR][DOCS][MINOR] R programming guide to include csv data source exampleFelix Cheung2016-07-131-9/+18
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Minor documentation update for code example, code style, and missed reference to "sparkR.init" ## How was this patch tested? manual shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14178 from felixcheung/rcsvprogrammingguide.
* [SPARK-16114][SQL] updated structured streaming guideJames Thomas2016-07-131-26/+23
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Updated structured streaming programming guide with new windowed example. ## How was this patch tested? Docs Author: James Thomas <jamesjoethomas@gmail.com> Closes #14183 from jjthomas/ss_docs_update.