path: root/docs
Commit message | Author | Age | Files | Lines
* [SQL][Minor] Update sql doc according to data type APIs changes | scwf | 2015-01-18 | 1 | -20/+21
  Follow up of #3925

  /cc rxin

  Author: scwf <wangfei1@huawei.com>
  Closes #4095 from scwf/sql-doc and squashes the following commits:
    97e311b [scwf] update sql doc since now expose only one version of the data type APIs
* [SPARK-733] Add documentation on use of accumulators in lazy transformation | Ilya Ganelin | 2015-01-16 | 1 | -0/+28
  I've added documentation clarifying the particular lack of clarity highlighted in the relevant JIRA. I've also added code examples for this issue to clarify the explanation.

  Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
  Closes #4022 from ilganeli/SPARK-733 and squashes the following commits:
    587def5 [Ilya Ganelin] Updated to clarify verbage
    df3afd7 [Ilya Ganelin] Revert "Partially updated task metrics to make some vars private"
    3f6c512 [Ilya Ganelin] Revert "Completed refactoring to make vars in TaskMetrics class private"
    58034fb [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-733
    4dc2cdb [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-733
    3a38db1 [Ilya Ganelin] Verified documentation update by building via jekyll
    33b5a2d [Ilya Ganelin] Added code examples for java and python
    1fd59b2 [Ilya Ganelin] Updated documentation for accumulators to highlight lazy evaluation issue
    5525c20 [Ilya Ganelin] Completed refactoring to make vars in TaskMetrics class private
    c64da4f [Ilya Ganelin] Partially updated task metrics to make some vars private
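The lazy-evaluation issue that SPARK-733 documents can be illustrated without Spark. The following is a plain-Python analogy, not the Spark accumulator API: the dictionary `counter` stands in for an accumulator, the built-in lazy `map` for a transformation, and `list(...)` for an action. All names are made up for this sketch.

```python
# Plain-Python analogy (not the Spark API): a counter updated inside a lazy
# transformation does not change until something consumes the result.
counter = {"value": 0}


def add_and_double(x):
    counter["value"] += x  # side effect, analogous to `accum += x`
    return x * 2


data = [1, 2, 3, 4]

# `map` is lazy in Python 3: no element has been processed yet.
mapped = map(add_and_double, data)
before = counter["value"]

# Consuming the iterator plays the role of an action and triggers the updates.
result = list(mapped)
after = counter["value"]

print(before, after, result)
```

The counter only reflects the side effects after the lazy pipeline has been consumed, which is the behaviour the added documentation warns about for accumulators used inside transformations.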
* [DOCS] Fix typo in return type of cogroup | Sean Owen | 2015-01-16 | 1 | -1/+1
  This fixes a simple typo in the cogroup docs noted in http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAMAsSdJ8_24evMAMg7fOZCQjwimisbYWa9v8BN6Rc3JCauja6wmail.gmail.com%3E

  I didn't bother with a JIRA

  Author: Sean Owen <sowen@cloudera.com>
  Closes #4072 from srowen/CogroupDocFix and squashes the following commits:
    43c850b [Sean Owen] Fix typo in return type of cogroup
* [SPARK-1507][YARN]specify # cores for ApplicationMaster | WangTaoTheTonic | 2015-01-16 | 2 | -4/+28
  Based on top of changes in https://github.com/apache/spark/pull/3806.

  https://issues.apache.org/jira/browse/SPARK-1507

  `--driver-cores` and `spark.driver.cores` for all cluster modes and `spark.yarn.am.cores` for yarn client mode.

  Author: WangTaoTheTonic <barneystinson@aliyun.com>
  Author: WangTao <barneystinson@aliyun.com>
  Closes #4018 from WangTaoTheTonic/SPARK-1507 and squashes the following commits:
    01419d3 [WangTaoTheTonic] amend the args name
    b255795 [WangTaoTheTonic] indet thing
    d86557c [WangTaoTheTonic] some comments amend
    43c9392 [WangTao] fix compile error
    b39a100 [WangTao] specify # cores for ApplicationMaster
* [SPARK-5254][MLLIB] remove developers section from spark.ml guide | Xiangrui Meng | 2015-01-14 | 1 | -14/+0
  Forgot to remove this section in #4052.

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #4053 from mengxr/SPARK-5254-update and squashes the following commits:
    f295bde [Xiangrui Meng] remove developers section from spark.ml guide
* [SPARK-5254][MLLIB] Update the user guide to position spark.ml better | Xiangrui Meng | 2015-01-14 | 2 | -14/+21
  The current statement in the user guide may deliver confusing messages to users. spark.ml contains high-level APIs for building ML pipelines, but that doesn't mean that spark.mllib is being deprecated.

  First of all, the pipeline API is in its alpha stage and we need to see more use cases from the community to stabilize it, which may take several releases. Secondly, the components in spark.ml are simple wrappers over spark.mllib implementations. Neither the APIs nor the implementations from spark.mllib are being deprecated. We expect users to use spark.ml pipeline APIs to build their ML pipelines, but we will keep supporting and adding features to spark.mllib. For example, there are many features in review at https://spark-prs.appspot.com/#mllib. So users should be comfortable with using spark.mllib features and expect more to come. The user guide needs to be updated to make the message clear.

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #4052 from mengxr/SPARK-5254 and squashes the following commits:
    6d5f1d3 [Xiangrui Meng] typo
    0cc935b [Xiangrui Meng] update user guide to position spark.ml better
* [SPARK-5131][Streaming][DOC]: There is a discrepancy in WAL implementation and configuration doc. | uncleGen | 2015-01-13 | 2 | -2/+2
  There is a discrepancy in WAL implementation and configuration doc.

  Author: uncleGen <hustyugm@gmail.com>
  Closes #3930 from uncleGen/master-clean-doc and squashes the following commits:
    3a4245f [uncleGen] doc typo
    8e407d3 [uncleGen] doc typo
* [SPARK-5073] spark.storage.memoryMapThreshold have two default value | lewuathe | 2015-01-11 | 1 | -1/+1
  Because the page size of major OSes is about 4KB, the default value of spark.storage.memoryMapThreshold is unified to 2 * 4096

  Author: lewuathe <lewuathe@me.com>
  Closes #3900 from Lewuathe/integrate-memoryMapThreshold and squashes the following commits:
    e417acd [lewuathe] [SPARK-5073] Update docs/configuration
    834aba4 [lewuathe] [SPARK-5073] Fix style
    adcea33 [lewuathe] [SPARK-5073] Integrate memory map threshold to 2MB
    fcce2e5 [lewuathe] [SPARK-5073] spark.storage.memoryMapThreshold have two default value
* [DOC] Fixed Mesos version in doc from 0.18.1 to 0.21.0 | Kousuke Saruta | 2015-01-09 | 1 | -1/+1
  #3934 upgraded the Mesos version, so we should also fix the docs, right?

  This issue is really minor, so I did not file a JIRA.

  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
  Closes #3982 from sarutak/fix-mesos-version and squashes the following commits:
    9a86ee3 [Kousuke Saruta] Fixed mesos version from 0.18.1 to 0.21.0
* [SPARK-1953][YARN]yarn client mode Application Master memory size is same as driver memory... | WangTaoTheTonic | 2015-01-09 | 1 | -2/+17
  ... size

  Ways to set the Application Master's memory in yarn-client mode:
  1. `spark.yarn.am.memory` in SparkConf or System Properties
  2. default value 512m

  Note: this argument is only available in yarn-client mode.

  Author: WangTaoTheTonic <barneystinson@aliyun.com>
  Closes #3607 from WangTaoTheTonic/SPARK4181 and squashes the following commits:
    d5ceb1b [WangTaoTheTonic] spark.driver.memeory is used in both modes
    6c1b264 [WangTaoTheTonic] rebase
    b8410c0 [WangTaoTheTonic] minor optiminzation
    ddcd592 [WangTaoTheTonic] fix the bug produced in rebase and some improvements
    3bf70cc [WangTaoTheTonic] rebase and give proper hint
    987b99d [WangTaoTheTonic] disable --driver-memory in client mode
    2b27928 [WangTaoTheTonic] inaccurate description
    b7acbb2 [WangTaoTheTonic] incorrect method invoked
    2557c5e [WangTaoTheTonic] missing a single blank
    42075b0 [WangTaoTheTonic] arrange the args and warn logging
    69c7dba [WangTaoTheTonic] rebase
    1960d16 [WangTaoTheTonic] fix wrong comment
    7fa9e2e [WangTaoTheTonic] log a warning
    f6bee0e [WangTaoTheTonic] docs issue
    d619996 [WangTaoTheTonic] Merge branch 'master' into SPARK4181
    b09c309 [WangTaoTheTonic] use code format
    ab16bb5 [WangTaoTheTonic] fix bug and add comments
    44e48c2 [WangTaoTheTonic] minor fix
    6fd13e1 [WangTaoTheTonic] add overhead mem and remove some configs
    0566bb8 [WangTaoTheTonic] yarn client mode Application Master memory size is same as driver memory size
* SPARK-5136 [DOCS] Improve documentation around setting up Spark IntelliJ project | Sean Owen | 2015-01-09 | 1 | -2/+3
  This PR simply points to the IntelliJ wiki page instead of also including IntelliJ notes in the docs. The intent however is to also update the wiki page with updated tips. This is the text I propose for the IntelliJ section on the wiki. I realize it omits some of the existing instructions on the wiki, about enabling Hive, but I think those are actually optional.

  ------

  IntelliJ supports both Maven- and SBT-based projects. It is recommended, however, to import Spark as a Maven project. Choose "Import Project..." from the File menu, and select the `pom.xml` file in the Spark root directory.

  It is fine to leave all settings at their default values in the Maven import wizard, with two caveats. First, it is usually useful to enable "Import Maven projects automatically", since changes to the project structure will automatically update the IntelliJ project.

  Second, note the step that prompts you to choose active Maven build profiles. As documented above, some build configurations require specific profiles to be enabled. The same profiles that are enabled with `-P[profile name]` above may be enabled on this screen. For example, if developing for Hadoop 2.4 with YARN support, enable profiles `yarn` and `hadoop-2.4`.

  These selections can be changed later by accessing the "Maven Projects" tool window from the View menu, and expanding the Profiles section.

  "Rebuild Project" can fail the first time the project is compiled, because generated source files are not created automatically. Try clicking the "Generate Sources and Update Folders For All Projects" button in the "Maven Projects" tool window to manually generate these sources.

  Compilation may fail with an error like "scalac: bad option: -P:/home/jakub/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar". If so, go to Preferences > Build, Execution, Deployment > Scala Compiler and clear the "Additional compiler options" field. It will work then, although the option will come back when the project reimports.

  Author: Sean Owen <sowen@cloudera.com>
  Closes #3952 from srowen/SPARK-5136 and squashes the following commits:
    f3baa66 [Sean Owen] Point to new IJ / Eclipse wiki link
    016b7df [Sean Owen] Point to IntelliJ wiki page instead of also including IntelliJ notes in the docs
* [SPARK-2165][YARN]add support for setting maxAppAttempts in the ApplicationSubmissionContext | WangTaoTheTonic | 2015-01-07 | 1 | -0/+8
  ...xt

  https://issues.apache.org/jira/browse/SPARK-2165

  I still have 2 questions:
  * If this config is not set, should we use yarn's corresponding value or a default value (like 2) on the spark side?
  * Is the config name the best one? Or "spark.yarn.am.maxAttempts"?

  Author: WangTaoTheTonic <barneystinson@aliyun.com>
  Closes #3878 from WangTaoTheTonic/SPARK-2165 and squashes the following commits:
    1416c83 [WangTaoTheTonic] use the name spark.yarn.maxAppAttempts
    202ac85 [WangTaoTheTonic] rephrase some
    afdfc99 [WangTaoTheTonic] more detailed description
    91562c6 [WangTaoTheTonic] add support for setting maxAppAttempts in the ApplicationSubmissionContext
* [SPARK-5093] Set spark.network.timeout to 120s consistently. | Reynold Xin | 2015-01-05 | 1 | -3/+3
  Author: Reynold Xin <rxin@databricks.com>
  Closes #3903 from rxin/timeout-120 and squashes the following commits:
    7c2138e [Reynold Xin] [SPARK-5093] Set spark.network.timeout to 120s consistently.
* [SPARK-4688] Have a single shared network timeout in Spark | Varun Saxena | 2015-01-05 | 1 | -0/+10
  [SPARK-4688] Have a single shared network timeout in Spark

  Author: Varun Saxena <vsaxena.varun@gmail.com>
  Author: varunsaxena <vsaxena.varun@gmail.com>
  Closes #3562 from varunsaxena/SPARK-4688 and squashes the following commits:
    6e97f72 [Varun Saxena] [SPARK-4688] Single shared network timeout
    cd783a2 [Varun Saxena] SPARK-4688
    d6f8c29 [Varun Saxena] SCALA-4688
    9562b15 [Varun Saxena] SPARK-4688
    a75f014 [varunsaxena] SPARK-4688
    594226c [varunsaxena] SPARK-4688
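One way to picture a single shared network timeout is as a fallback lookup. The sketch below is an assumption about how such a lookup could work, not Spark's implementation: a plain dictionary stands in for the configuration, `example.fetch.timeout` is a hypothetical component-specific key, and the 120-second default is taken from the [SPARK-5093] entry.

```python
# Hypothetical sketch of a shared-timeout fallback; not Spark code.
DEFAULT_NETWORK_TIMEOUT_S = 120  # default named in the [SPARK-5093] entry


def resolve_timeout(conf, specific_key):
    """Return the timeout in seconds for `specific_key`.

    A component-specific setting wins; otherwise the shared
    `spark.network.timeout` is used, and finally the built-in default.
    """
    if specific_key in conf:
        return int(conf[specific_key])
    return int(conf.get("spark.network.timeout", DEFAULT_NETWORK_TIMEOUT_S))


# `example.fetch.timeout` is an invented key used only for illustration.
print(resolve_timeout({}, "example.fetch.timeout"))
print(resolve_timeout({"spark.network.timeout": "300"}, "example.fetch.timeout"))
print(resolve_timeout({"example.fetch.timeout": "30"}, "example.fetch.timeout"))
```

With one shared key, a user who needs longer timeouts everywhere changes a single value instead of tracking every component-specific setting.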
* [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs | Josh Rosen | 2015-01-04 | 1 | -1/+3
  This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery.

  Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists. SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat.

  In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times. In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions. When output spec. validation is enabled, the second calls to these actions will fail due to existing output.

  This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler. This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits:
    36eaf35 [Josh Rosen] Add comment explaining use of transform() in test.
    6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform()
    7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide
    bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming.
    e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic.
    762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.
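The idea of propagating a bypass setting through a dynamically scoped variable, rather than mutating the configuration, can be sketched in Python with `contextvars`. This is an analogy to Scala's `DynamicVariable`, not the patch itself: the function names are modelled loosely on the commit titles above, and the dictionary `conf` is a stand-in for SparkConf. The sketch assumes the check is controlled by the `spark.hadoop.validateOutputSpecs` setting.

```python
import contextvars
from contextlib import contextmanager

# Scoped flag: True means validation is bypassed in the current context.
# The variable name is made up for this sketch.
_disable_output_spec_validation = contextvars.ContextVar(
    "disable_output_spec_validation", default=False
)


@contextmanager
def output_spec_validation_disabled():
    """Bypass validation for the duration of the `with` block only."""
    token = _disable_output_spec_validation.set(True)
    try:
        yield
    finally:
        # Restore the previous value, even if the block raised.
        _disable_output_spec_validation.reset(token)


def is_output_spec_validation_enabled(conf):
    """Validation runs only if the config allows it and no bypass is active."""
    configured = conf.get("spark.hadoop.validateOutputSpecs", "true") == "true"
    return configured and not _disable_output_spec_validation.get()


conf = {}  # stand-in for SparkConf; it is never mutated
print(is_output_spec_validation_enabled(conf))
with output_spec_validation_disabled():
    print(is_output_spec_validation_enabled(conf))
print(is_output_spec_validation_enabled(conf))
```

Because the flag is scoped to the block, code outside a streaming-scheduled job still sees validation enabled, and the user's configuration is left untouched.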
* [SPARK-5058] Updated broken links | sigmoidanalytics | 2015-01-03 | 1 | -1/+1
  Updated the broken link pointing to the KafkaWordCount example to the correct one.

  Author: sigmoidanalytics <mayur@sigmoidanalytics.com>
  Closes #3877 from sigmoidanalytics/patch-1 and squashes the following commits:
    3e19b31 [sigmoidanalytics] Updated broken links
* Fixed typos in streaming-kafka-integration.md | Akhil Das | 2015-01-02 | 1 | -1/+1
  Changed projrect to project :)

  Author: Akhil Das <akhld@darktech.ca>
  Closes #3876 from akhld/patch-1 and squashes the following commits:
    e0cf9ef [Akhil Das] Fixed typos in streaming-kafka-integration.md
* [SPARK-4930][SQL][DOCS]Update SQL programming guide, CACHE TABLE is eager | luogankun | 2014-12-30 | 1 | -5/+4
  `CACHE TABLE tbl` is now __eager__ by default not __lazy__

  Author: luogankun <luogankun@gmail.com>
  Closes #3773 from luogankun/SPARK-4930 and squashes the following commits:
    cc17b7d [luogankun] [SPARK-4930][SQL][DOCS]Update SQL programming guide, add CACHE [LAZY] TABLE [AS SELECT] ...
    bffe0e8 [luogankun] [SPARK-4930][SQL][DOCS]Update SQL programming guide, CACHE TABLE tbl is eager
* [SPARK-4916][SQL][DOCS]Update SQL programming guide about cache section | luogankun | 2014-12-30 | 1 | -4/+1
  `SchemaRDD.cache()` now uses in-memory columnar storage.

  Author: luogankun <luogankun@gmail.com>
  Closes #3759 from luogankun/SPARK-4916 and squashes the following commits:
    7b39864 [luogankun] [SPARK-4916]Update SQL programming guide
    6018122 [luogankun] Merge branch 'master' of https://github.com/apache/spark into SPARK-4916
    0b93785 [luogankun] [SPARK-4916]Update SQL programming guide
    99b2336 [luogankun] [SPARK-4916]Update SQL programming guide
* [SPARK-4982][DOC] `spark.ui.retainedJobs` description is wrong in Spark UI configuration guide | wangxiaojing | 2014-12-29 | 1 | -1/+1
  Author: wangxiaojing <u9jing@gmail.com>
  Closes #3818 from wangxiaojing/SPARK-4982 and squashes the following commits:
    fe2ad5f [wangxiaojing] change stages to jobs
* [SPARK-4501][Core] - Create build/mvn to automatically download maven/zinc/scalac | Brennon York | 2014-12-27 | 4 | -24/+41
  Creates a top level directory script (as `build/mvn`) to automatically download zinc and the specific version of scala used to easily build spark. This will also download and install maven if the user doesn't already have it and all packages are hosted under the `build/` directory. Tested on both Linux and OSX OS's and both work. All commands pass through to the maven binary so it acts exactly as a traditional maven call would.

  Author: Brennon York <brennon.york@capitalone.com>
  Closes #3707 from brennonyork/SPARK-4501 and squashes the following commits:
    0e5a0e4 [Brennon York] minor incorrect doc verbage (with -> this)
    9b79e38 [Brennon York] fixed merge conflicts with dev/run-tests, properly quoted args in sbt/sbt, fixed bug where relative paths would fail if passed in from build/mvn
    d2d41b6 [Brennon York] added blurb about leverging zinc with build/mvn
    b979c58 [Brennon York] updated the merge conflict
    c5634de [Brennon York] updated documentation to overview build/mvn, updated all points where sbt/sbt was referenced with build/sbt
    b8437ba [Brennon York] set progress bars for curl and wget when not run on jenkins, no progress bar when run on jenkins, moved sbt script to build/sbt, wrote stub and warning under sbt/sbt which calls build/sbt, modified build/sbt to use the correct directory, fixed bug in build/sbt-launch-lib.bash to correctly pull the sbt version
    be11317 [Brennon York] added switch to silence download progress only if AMPLAB_JENKINS is set
    28d0a99 [Brennon York] updated to remove the python dependency, uses grep instead
    7e785a6 [Brennon York] added silent and quiet flags to curl and wget respectively, added single echo output to denote start of a download if download is needed
    14a5da0 [Brennon York] removed unnecessary zinc output on startup
    1af4a94 [Brennon York] fixed bug with uppercase vs lowercase variable
    3e8b9b3 [Brennon York] updated to properly only restart zinc if it was freshly installed
    a680d12 [Brennon York] Added comments to functions and tested various mvn calls
    bb8cc9d [Brennon York] removed package files
    ef017e6 [Brennon York] removed OS complexities, setup generic install_app call, removed extra file complexities, removed help, removed forced install (defaults now), removed double-dash from cli
    07bf018 [Brennon York] Updated to specifically handle pulling down the correct scala version
    f914dea [Brennon York] Beginning final portions of localized scala home
    69c4e44 [Brennon York] working linux and osx installers for purely local mvn build
    4a1609c [Brennon York] finalizing working linux install for maven to local ./build/apache-maven folder
    cbfcc68 [Brennon York] Changed the default sbt/sbt to build/sbt and added a build/mvn which will automatically download, install, and execute maven with zinc for easier build capability
* [SPARK-4608][Streaming] Reorganize StreamingContext implicit to improve API convenience | zsxwing | 2014-12-25 | 1 | -2/+2
  There is only one implicit function `toPairDStreamFunctions` in `StreamingContext`. This PR does a reorganization similar to [SPARK-4397](https://issues.apache.org/jira/browse/SPARK-4397).

  Compiled the following code with Spark Streaming 1.1.0 and ran it with this PR. Everything is fine.

  ```Scala
  import org.apache.spark._
  import org.apache.spark.streaming._
  import org.apache.spark.streaming.StreamingContext._

  object StreamingApp {

    def main(args: Array[String]) {
      val conf = new SparkConf().setMaster("local[2]").setAppName("FileWordCount")
      val ssc = new StreamingContext(conf, Seconds(10))
      // Watch a directory for new text files and count words per batch.
      val lines = ssc.textFileStream("/some/path")
      val words = lines.flatMap(_.split(" "))
      val pairs = words.map(word => (word, 1))
      val wordCounts = pairs.reduceByKey(_ + _)
      wordCounts.print()
      ssc.start()
      ssc.awaitTermination()
    }
  }
  ```

  Author: zsxwing <zsxwing@gmail.com>
  Closes #3464 from zsxwing/SPARK-4608 and squashes the following commits:
    aa6d44a [zsxwing] Fix a copy-paste error
    f74c190 [zsxwing] Merge branch 'master' into SPARK-4608
    e6f9cc9 [zsxwing] Update the docs
    27833bb [zsxwing] Remove `import StreamingContext._`
    c15162c [zsxwing] Reorganize StreamingContext implicit to improve API convenience
* [SPARK-4953][Doc] Fix the description of building Spark with YARN | Kousuke Saruta | 2014-12-25 | 1 | -1/+1
  In the section "Specifying the Hadoop Version" in building-spark.md, there is a description of building with YARN with Hadoop 0.23. Spark 1.3.0 will not support Hadoop 0.23, so we should fix the description.

  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
  Closes #3787 from sarutak/SPARK-4953 and squashes the following commits:
    ee9c355 [Kousuke Saruta] Removed description related to a specific vendor
    9ab0c24 [Kousuke Saruta] Fix the description about building SPARK with YARN
* [SPARK-4931][Yarn][Docs] Fix the format of running-on-yarn.md | zsxwing | 2014-12-23 | 1 | -5/+5
  Currently, the format about log4j in running-on-yarn.md is a bit messy.

  ![running-on-yarn](https://cloud.githubusercontent.com/assets/1000778/5535248/204c4b64-8ab4-11e4-83c3-b4722ea0ad9d.png)

  Author: zsxwing <zsxwing@gmail.com>
  Closes #3774 from zsxwing/SPARK-4931 and squashes the following commits:
    4a5f853 [zsxwing] Fix the format of running-on-yarn.md
* [Docs] Minor typo fixes | Nicholas Chammas | 2014-12-22 | 1 | -3/+3
  Author: Nicholas Chammas <nicholas.chammas@gmail.com>
  Closes #3772 from nchammas/patch-1 and squashes the following commits:
    b7d9083 [Nicholas Chammas] [Docs] Minor typo fixes
* [SPARK-4864] Add documentation to Netty-based configs | Aaron Davidson | 2014-12-22 | 1 | -0/+35
  Author: Aaron Davidson <aaron@databricks.com>
  Closes #3713 from aarondav/netty-configs and squashes the following commits:
    8a8b373 [Aaron Davidson] Address Patrick's comments
    3b1f84e [Aaron Davidson] [SPARK-4864] Add documentation to Netty-based configs
* [SPARK-4915][YARN] Fix classname to be specified for external shuffle service. | Tsuyoshi Ozawa | 2014-12-22 | 1 | -1/+1
  Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@lab.ntt.co.jp>
  Closes #3757 from oza/SPARK-4915 and squashes the following commits:
    3b0d6d6 [Tsuyoshi Ozawa] Fix classname to be specified for external shuffle service.
* [SPARK-4140] Document dynamic allocation | Andrew Or | 2014-12-19 | 2 | -0/+169
  Once the external shuffle service is also documented, the dynamic allocation section will link to it. Let me know if the whole dynamic allocation should be moved to its separate page; I personally think the organization might be cleaner that way.

  This patch builds on top of oza's work in #3689.

  aarondav pwendell

  Author: Andrew Or <andrew@databricks.com>
  Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@gmail.com>
  Closes #3731 from andrewor14/document-dynamic-allocation and squashes the following commits:
    1281447 [Andrew Or] Address a few comments
    b9843f2 [Andrew Or] Document the configs as well
    246fb44 [Andrew Or] Merge branch 'SPARK-4839' of github.com:oza/spark into document-dynamic-allocation
    8c64004 [Andrew Or] Add documentation for dynamic allocation (without configs)
    6827b56 [Tsuyoshi Ozawa] Fixing a documentation of spark.dynamicAllocation.enabled.
    53cff58 [Tsuyoshi Ozawa] Adding a documentation about dynamic resource allocation.
* change signature of example to match released code | Eran Medan | 2014-12-19 | 1 | -1/+1
  The signature of registerKryoClasses actually takes an Array[Class[_]], not a Seq.

  Author: Eran Medan <ehrann.mehdan@gmail.com>
  Closes #3747 from eranation/patch-1 and squashes the following commits:
    ee9885d [Eran Medan] change signature of example to match released code
* Add mesos specific configurations into doc | Timothy Chen | 2014-12-18 | 1 | -0/+45
  Author: Timothy Chen <tnachen@gmail.com>
  Closes #3349 from tnachen/mesos_doc and squashes the following commits:
    737ef49 [Timothy Chen] Add TOC
    5ca546a [Timothy Chen] Update description around cores requested.
    26283a5 [Timothy Chen] Add mesos specific configurations into doc
* SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be... | Sandy Ryza | 2014-12-18 | 1 | -3/+5
  ... changed to a time period

  Author: Sandy Ryza <sandy@cloudera.com>
  Closes #3471 from sryza/sandy-spark-3779 and squashes the following commits:
    20b9887 [Sandy Ryza] Deprecate old property
    42b5df7 [Sandy Ryza] Review feedback
    9a959a1 [Sandy Ryza] SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period
* [SPARK-4461][YARN] pass extra java options to yarn application master | Zhan Zhang | 2014-12-18 | 1 | -0/+8
  Currently, there is no way to pass YARN AM-specific java options. This causes potential issues when reading the classpath from the hadoop configuration file. Hadoop configuration replaces variables in its properties with the system properties passed in the java options. How to specify the value depends on the hadoop distribution.

  The new options are SPARK_YARN_JAVA_OPTS or spark.yarn.extraJavaOptions. I made it a spark global-level setting, because typically we don't want users to specify this on their command line each time they submit a spark job once it is set up in spark-defaults.conf.

  In addition, allowing these extra options to be passed to the AM provides more flexibility.

  For example, in the following valid mapred-site.xml file, the class path specifies values using a system property. Hadoop can correctly handle it because it has the java options passed in.

  This is the example; currently spark will break because hadoop.version is not passed in.

    <property>
      <name>mapreduce.application.classpath</name>
      <value>/etc/hadoop/${hadoop.version}/mapreduce/*</value>
    </property>

  In the meantime, we cannot rely on mapreduce.admin.map.child.java.opts in mapred-site.xml, because it has its own extra java options specified, which do not apply to Spark.

Author: Zhan Zhang <zhazhan@gmail.com> Closes #3409 from zhzhan/Spark-4461 and squashes the following commits: daec3d0 [Zhan Zhang] solve review comments 08f44a7 [Zhan Zhang] add warning in driver mode if spark.yarn.am.extraJavaOptions is configured 5a505d3 [Zhan Zhang] solve review comments 4ed43ad [Zhan Zhang] solve review comments ad777ed [Zhan Zhang] Merge branch 'master' into Spark-4461 3e9e574 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark e3f9abe [Zhan Zhang] solve review comments 8963552 [Zhan Zhang] rebase f8f6700 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark dea1692 [Zhan Zhang] change the option key name to client mode specific 90d5dff [Zhan Zhang] rebase 8ac9254 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 092a25f [Zhan Zhang] solve review comments bc5a9ae [Zhan Zhang] solve review comments 782b014 [Zhan Zhang] add new configuration to docs/running-on-yarn.md and remove it from spark-defaults.conf.template 6faaa97 [Zhan Zhang] solve review comments 369863f [Zhan Zhang] clean up unnecessary var 733de9c [Zhan Zhang] Merge branch 'master' into Spark-4461 a68e7f0 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 864505a [Zhan Zhang] Add extra java options to be passed to Yarn application master 15830fc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 685d911 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 03ebad3 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark 46d9e3d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark ebb213a [Zhan Zhang] revert b983ef3 [Zhan Zhang] test c4efb9b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 779d67b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 4daae6d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 12e1be5 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark ce0ca7b 
[Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 93f3081 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 3764505 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark a9d372b [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark a00f60f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark f6a8a40 [Zhan Zhang] revert ba14f28 [Zhan Zhang] test
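The variable substitution described in the [SPARK-4461] entry can be sketched as follows. This is a simplified stand-in for what Hadoop's configuration loader does, not its actual code: `expand_properties` is a made-up helper, the dictionary plays the role of the JVM system properties, and the version `2.4.0` is only an example value.

```python
import re


def expand_properties(value, properties):
    """Replace each ${name} in `value` with properties[name].

    In this sketch, unknown names are left untouched, which mirrors the
    failure case from the commit message: without `hadoop.version` the
    path keeps a literal `${hadoop.version}` and points nowhere.
    """
    def replace(match):
        name = match.group(1)
        return properties.get(name, match.group(0))

    return re.sub(r"\$\{([^}]+)\}", replace, value)


classpath = "/etc/hadoop/${hadoop.version}/mapreduce/*"

# With the system property passed in (as the new AM java options allow):
print(expand_properties(classpath, {"hadoop.version": "2.4.0"}))

# Without it, the placeholder survives and the classpath is unusable:
print(expand_properties(classpath, {}))
```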
* [DOCS][SQL] Add a Note on jsonFile having separate JSON objects per line | Peter Vandenabeele | 2014-12-16 | 1 | -0/+12
  * This commit hopes to avoid the confusion I faced when trying to submit a regular, valid multi-line JSON file, also see http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html

  Author: Peter Vandenabeele <peter@vandenabeele.com>
  Closes #3517 from petervandenabeele/pv-docs-note-on-jsonFile-format/01 and squashes the following commits:
    1f98e52 [Peter Vandenabeele] Revert to people.json and simple Note text
    6b6e062 [Peter Vandenabeele] Change the "JSON" connotation to "txt"
    fca7dfb [Peter Vandenabeele] Add a Note on jsonFile having separate JSON objects per line
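The difference between a file with one JSON object per line and a regular multi-line JSON document can be shown with Python's standard `json` module. This sketch only models a line-by-line reader; it does not call Spark's `jsonFile`, and the sample records are illustrative.

```python
import json


def parse_json_lines(text):
    """Parse text in which every non-empty line is a complete JSON object."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# One self-contained object per line: each line parses on its own.
one_per_line = '{"name": "Michael"}\n{"name": "Andy", "age": 30}\n'
records = parse_json_lines(one_per_line)
print(records)

# A valid, pretty-printed JSON document: no single line is valid JSON,
# so a line-oriented reader fails on the very first line.
pretty_printed = '{\n  "name": "Michael"\n}\n'
try:
    parse_json_lines(pretty_printed)
    multi_line_ok = True
except json.JSONDecodeError:
    multi_line_ok = False
print(multi_line_ok)
```

A pretty-printed file is valid JSON as a whole, yet a per-line reader rejects it, which matches the confusion the note is meant to prevent.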
* [SQL] SPARK-4700: Add HTTP protocol spark thrift server | Judy Nash | 2014-12-16 | 1 | -0/+12
  Add HTTP protocol support and test cases to spark thrift server, so users can deploy thrift server in both TCP and http mode.

  Author: Judy Nash <judynash@microsoft.com>
  Author: judynash <judynash@microsoft.com>
  Closes #3672 from judynash/master and squashes the following commits:
    526315d [Judy Nash] correct spacing on startThriftServer method
    31a6520 [Judy Nash] fix code style issues and update sql programming guide format issue
    47bf87e [Judy Nash] modify withJdbcStatement method definition to meet less than 100 line length
    2e9c11c [Judy Nash] add thrift server in http mode documentation on sql programming guide
    1cbd305 [Judy Nash] Merge remote-tracking branch 'upstream/master'
    2b1d312 [Judy Nash] updated http thrift server support based on feedback
    377532c [judynash] add HTTP protocol spark thrift server
* [SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py | Mike Jennings | 2014-12-16 | 1 | -0/+19
  Based on this gist: https://gist.github.com/amar-analytx/0b62543621e1f246c0a2

  We use security group ids instead of security group to get around this issue: https://github.com/boto/boto/issues/350

  Author: Mike Jennings <mvj101@gmail.com>
  Author: Mike Jennings <mvj@google.com>
  Closes #2872 from mvj101/SPARK-3405 and squashes the following commits:
    be9cb43 [Mike Jennings] `pep8 spark_ec2.py` runs cleanly.
    4dc6756 [Mike Jennings] Remove duplicate comment
    731d94c [Mike Jennings] Update for code review.
    ad90a36 [Mike Jennings] Merge branch 'master' of https://github.com/apache/spark into SPARK-3405
    1ebffa1 [Mike Jennings] Merge branch 'master' into SPARK-3405
    52aaeec [Mike Jennings] [SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py
* [SPARK-4668] Fix some documentation typos. (Ryan Williams, 2014-12-15, 3 files, -8/+22)

    Author: Ryan Williams <ryan.blake.williams@gmail.com>

    Closes #3523 from ryan-williams/tweaks and squashes the following commits:

    d2eddaa [Ryan Williams] code review feedback
    ce27fc1 [Ryan Williams] CoGroupedRDD comment nit
    c6cfad9 [Ryan Williams] remove unnecessary if statement
    b74ea35 [Ryan Williams] comment fix
    b0221f0 [Ryan Williams] fix a gendered pronoun
    c71ffed [Ryan Williams] use names on a few boolean parameters
    89954aa [Ryan Williams] clarify some comments in {Security,Shuffle}Manager
    e465dac [Ryan Williams] Saved building-spark.md with Dillinger.io
    83e8358 [Ryan Williams] fix pom.xml typo
    dc4662b [Ryan Williams] typo fixes in tuning.md, configuration.md
* [SPARK-4806] Streaming doc update for 1.2 (Tathagata Das, 2014-12-11, 7 files, -551/+819)

    Important updates to the streaming programming guide
    - Make the fault-tolerance properties easier to understand, with information about write ahead logs
    - Update the information about deploying the spark streaming app with information about Driver HA
    - Update Receiver guide to discuss reliable vs unreliable receivers.

    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Author: Josh Rosen <joshrosen@databricks.com>
    Author: Josh Rosen <rosenville@gmail.com>

    Closes #3653 from tdas/streaming-doc-update-1.2 and squashes the following commits:

    f53154a [Tathagata Das] Addressed Josh's comments.
    ce299e4 [Tathagata Das] Minor update.
    ca19078 [Tathagata Das] Minor change
    f746951 [Tathagata Das] Mentioned performance problem with WAL
    7787209 [Tathagata Das] Merge branch 'streaming-doc-update-1.2' of github.com:tdas/spark into streaming-doc-update-1.2
    2184729 [Tathagata Das] Updated Kafka and Flume guides with reliability information.
    2f3178c [Tathagata Das] Added more information about writing reliable receivers in the custom receiver guide.
    91aa5aa [Tathagata Das] Improved API Docs menu
    5707581 [Tathagata Das] Added Pythn API badge
    b9c8c24 [Tathagata Das] Merge pull request #26 from JoshRosen/streaming-programming-guide
    b8c8382 [Josh Rosen] minor fixes
    a4ef126 [Josh Rosen] Restructure parts of the fault-tolerance section to read a bit nicer when skipping over the headings
    65f66cd [Josh Rosen] Fix broken link to fault-tolerance semantics section.
    f015397 [Josh Rosen] Minor grammar / pluralization fixes.
    3019f3a [Josh Rosen] Fix minor Markdown formatting issues
    aa8bb87 [Tathagata Das] Small update.
    195852c [Tathagata Das] Updated based on Josh's comments, updated receiver reliability and deploying section, and also updated configuration.
    17b99fb [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-doc-update-1.2
    a0217c0 [Tathagata Das] Changed Deploying menu layout
    67fcffc [Tathagata Das] Added cluster mode + supervise example to submitting application guide.
    e45453b [Tathagata Das] Update streaming guide, added deploying section.
    192c7a7 [Tathagata Das] Added more info about Python API, and rewrote the checkpointing section.
* SPARK-3526 Add section about data locality to the tuning guide (Andrew Ash, 2014-12-10, 1 file, -0/+33)

    cc kayousterhout

    I have a few outstanding questions from compiling this documentation:
    - What's the difference between NO_PREF and ANY? I understand the implications of the ordering but don't know what an example of each would be
    - Why is NO_PREF ahead of RACK_LOCAL? I would think it'd be better to schedule rack-local tasks ahead of no preference if you could only do one or the other. Is the idea to wait longer and hope for the rack-local tasks to turn into node-local or better?
    - Will there be a datacenter-local locality level in the future? Apache Cassandra for example has this level

    Author: Andrew Ash <andrew@andrewash.com>

    Closes #2519 from ash211/SPARK-3526 and squashes the following commits:

    44cff28 [Andrew Ash] Link to spark.locality parameters rather than copying the list
    6d5d966 [Andrew Ash] Stay focused on Spark, no astronaut architecture mumbo-jumbo
    20e0e31 [Andrew Ash] SPARK-3526 Add section about data locality to the tuning guide
* [SPARK-4771][Docs] Document standalone cluster supervise mode (Andrew Or, 2014-12-10, 1 file, -1/+10)

    tdas looks like streaming already refers to the supervise mode. The link from there is broken though.

    Author: Andrew Or <andrew@databricks.com>

    Closes #3627 from andrewor14/document-supervise and squashes the following commits:

    9ca0908 [Andrew Or] Wording changes
    2b55ed2 [Andrew Or] Document standalone cluster supervise mode
* SPARK-4338. [YARN] Ditch yarn-alpha. (Sandy Ryza, 2014-12-09, 2 files, -24/+3)

    Sorry if this is a little premature with 1.2 still not out the door, but it will make other work like SPARK-4136 and SPARK-2089 a lot easier.

    Author: Sandy Ryza <sandy@cloudera.com>

    Closes #3215 from sryza/sandy-spark-4338 and squashes the following commits:

    1c5ac08 [Sandy Ryza] Update building Spark docs and remove unnecessary newline
    9c1421c [Sandy Ryza] SPARK-4338. Ditch yarn-alpha.
* SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio doc... (Sandy Ryza, 2014-12-08, 1 file, -2/+2)

    ...umented default is incorrect for YARN

    Author: Sandy Ryza <sandy@cloudera.com>

    Closes #3624 from sryza/sandy-spark-4770 and squashes the following commits:

    bd81a3a [Sandy Ryza] SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN
* Streaming doc : do you mean inadvertently? (CrazyJvm, 2014-12-05, 1 file, -1/+1)

    Author: CrazyJvm <crazyjvm@gmail.com>

    Closes #3620 from CrazyJvm/streaming-foreachRDD and squashes the following commits:

    b72886b [CrazyJvm] do you mean inadvertently?
* Revert "SPARK-2624 add datanucleus jars to the container in yarn-cluster" (Andrew Or, 2014-12-04, 1 file, -15/+0)

    This reverts commit a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53.
* [SPARK-4464] Description about configuration options need to be modified in docs. (Masayoshi TSUZUKI, 2014-12-04, 1 file, -2/+10)

    Added description about -h and -host.
    Modified description about -i and -ip which are now deprecated.
    Added description about --properties-file.

    Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

    Closes #3329 from tsudukim/feature/SPARK-4464 and squashes the following commits:

    6c07caf [Masayoshi TSUZUKI] [SPARK-4464] Description about configuration options need to be modified in docs.
* Fix typo in Spark SQL docs. (Andy Konwinski, 2014-12-04, 1 file, -1/+1)

    Author: Andy Konwinski <andykonwinski@gmail.com>

    Closes #3611 from andyk/patch-3 and squashes the following commits:

    7bab333 [Andy Konwinski] Fix typo in Spark SQL docs.
* [SPARK-4421] Wrong link in spark-standalone.html (Masayoshi TSUZUKI, 2014-12-04, 1 file, -1/+1)

    Modified the link of building Spark.

    Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

    Closes #3279 from tsudukim/feature/SPARK-4421 and squashes the following commits:

    56e31c1 [Masayoshi TSUZUKI] Modified the link of building Spark.
* [SPARK-4652][DOCS] Add docs about spark-git-repo option (lewuathe, 2014-12-04, 1 file, -0/+5)

    There might be some cases when WIPS spark version need to be run on EC2 cluster. In order to setup this type of cluster more easily, add --spark-git-repo option description to ec2 documentation.

    Author: lewuathe <lewuathe@me.com>
    Author: Josh Rosen <joshrosen@databricks.com>

    Closes #3513 from Lewuathe/doc-for-development-spark-cluster and squashes the following commits:

    6dae8ee [lewuathe] Wrap consistent with other descriptions
    cfaf9be [lewuathe] Add docs about spark-git-repo option

    (Editing / cleanup by Josh Rosen)
* [FIX][DOC] Fix broken links in ml-guide.md (Xiangrui Meng, 2014-12-04, 1 file, -4/+4)

    and some minor changes in ScalaDoc.

    Author: Xiangrui Meng <meng@databricks.com>

    Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:

    c559768 [Xiangrui Meng] minor code update
    ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
    0b5c182 [Xiangrui Meng] fix links in ml-guide
* [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes (Joseph K. Bradley, 2014-12-04, 5 files, -1/+714)

    Documentation:
    * Added ml-guide.md, linked from mllib-guide.md
    * Updated mllib-guide.md with small section pointing to ml-guide.md

    Examples:
    * CrossValidatorExample
    * SimpleParamsExample
    * (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

    Bug fixes:
    * PipelineModel: did not use ParamMaps correctly
    * UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)

    CC: mengxr shivaram etrain

    Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.

    Author: Joseph K. Bradley <joseph@databricks.com>
    Author: jkbradley <joseph.kurata.bradley@gmail.com>
    Author: Xiangrui Meng <meng@databricks.com>

    Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:

    d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml
    c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
    99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
    ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
    3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
    41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
* [docs] Fix outdated comment in tuning guide (Joseph K. Bradley, 2014-12-04, 1 file, -2/+1)

    When you use the SPARK_JAVA_OPTS env variable, Spark complains:

    ```
    SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ').
    This is deprecated in Spark 1.0+.

    Please instead use:
     - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
     - ./spark-submit with --driver-java-options to set -X options for a driver
     - spark.executor.extraJavaOptions to set -X options for executors
     - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
    ```

    This updates the docs to redirect the user to the relevant part of the configuration docs.

    CC: mengxr but please CC someone else as needed

    Author: Joseph K. Bradley <joseph@databricks.com>

    Closes #3592 from jkbradley/tuning-doc and squashes the following commits:

    0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide