path: root/docs
Commit message | Author | Age | Files | Lines
* [FIX][DOC] Fix broken links in ml-guide.md | Xiangrui Meng | 2014-12-04 | 1 | -4/+4
| | | | | | | | | | | | | | | and some minor changes in ScalaDoc. Author: Xiangrui Meng <meng@databricks.com> Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits: c559768 [Xiangrui Meng] minor code update ce94da8 [Xiangrui Meng] Java Bean -> JavaBean 0b5c182 [Xiangrui Meng] fix links in ml-guide (cherry picked from commit 7e758d709286e73d2c878d4a2d2b4606386142c7) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes | Joseph K. Bradley | 2014-12-04 | 5 | -1/+714
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Documentation: * Added ml-guide.md, linked from mllib-guide.md * Updated mllib-guide.md with small section pointing to ml-guide.md Examples: * CrossValidatorExample * SimpleParamsExample * (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md) Bug fixes: * PipelineModel: did not use ParamMaps correctly * UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!) CC: mengxr shivaram etrain Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete. Author: Joseph K. Bradley <joseph@databricks.com> Author: jkbradley <joseph.kurata.bradley@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #3588 from jkbradley/ml-package-docs and squashes the following commits: d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml c38469c [Joseph K. Bradley] Updated ml-guide with CV examples 99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold. ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs 3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype 41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works. (cherry picked from commit 469a6e5f3bdd5593b3254bc916be8236e7c6cb74) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [docs] Fix outdated comment in tuning guide | Joseph K. Bradley | 2014-12-04 | 1 | -2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When you use the SPARK_JAVA_OPTS env variable, Spark complains: ``` SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps '). This is deprecated in Spark 1.0+. Please instead use: - ./spark-submit with conf/spark-defaults.conf to set defaults for an application - ./spark-submit with --driver-java-options to set -X options for a driver - spark.executor.extraJavaOptions to set -X options for executors - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker) ``` This updates the docs to redirect the user to the relevant part of the configuration docs. CC: mengxr but please CC someone else as needed Author: Joseph K. Bradley <joseph@databricks.com> Closes #3592 from jkbradley/tuning-doc and squashes the following commits: 0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide (cherry picked from commit 529439bd506949f272a2b6f099ea549b097428f3) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + ↵ | Joseph K. Bradley | 2014-12-04 | 3 | -98/+825
DecisionTree API fix

Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)

Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
  * Use train/test split, and compute test error instead of training error.
* Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)

Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming

I have run all examples and relevant unit tests.

CC: mengxr manishamde codedeft

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:

70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples

(cherry picked from commit 657a88835d8bf22488b53d50f75281d7dc32442e)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-4711] [mllib] [docs] Programming guide advice on choosing optimizer | Joseph K. Bradley | 2014-12-04 | 2 | -9/+18
| | | | | | | | | | | | | | | | | I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #3569 from jkbradley/lr-doc and squashes the following commits: 654aeb5 [Joseph K. Bradley] updated section header for mllib-optimization 5035ad0 [Joseph K. Bradley] updated based on review 94f6dec [Joseph K. Bradley] Updated linear methods and optimization docs with quick advice on choosing an optimization method (cherry picked from commit 27ab0b8a03b711e8d86b6167df833f012205ccc7) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-4642] Add description about spark.yarn.queue to running-on-YARN document. | Masayoshi TSUZUKI | 2014-12-03 | 1 | -1/+8
Added descriptions of these parameters:
- spark.yarn.queue

Modified the description of the default value of this parameter:
- spark.yarn.submit.file.replication

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #3500 from tsudukim/feature/SPARK-4642 and squashes the following commits:

ce99655 [Masayoshi TSUZUKI] better gramatically.
21cf624 [Masayoshi TSUZUKI] Removed intentionally undocumented properties.
88cac9b [Masayoshi TSUZUKI] [SPARK-4642] Documents about running-on-YARN needs update

(cherry picked from commit 692f49378f7d384d5c9c5ab7451a1c1e66f91c50)
Signed-off-by: Andrew Or <andrew@databricks.com>
* SPARK-2624 add datanucleus jars to the container in yarn-cluster | Jim Lim | 2014-12-03 | 1 | -0/+15
| | | | | | | | | | | | | | | | | If `spark-submit` finds the datanucleus jars, it adds them to the driver's classpath, but does not add it to the container. This patch modifies the yarn deployment class to copy all `datanucleus-*` jars found in `[spark-home]/libs` to the container. Author: Jim Lim <jim@quixey.com> Closes #3238 from jimjh/SPARK-2624 and squashes the following commits: 3633071 [Jim Lim] SPARK-2624 update documentation and comments fe95125 [Jim Lim] SPARK-2624 keep java imports together 6c31fe0 [Jim Lim] SPARK-2624 update documentation 6690fbf [Jim Lim] SPARK-2624 add tests d28d8e9 [Jim Lim] SPARK-2624 add spark.yarn.datanucleus.dir option 84e6cba [Jim Lim] SPARK-2624 add datanucleus jars to the container in yarn-cluster
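The jar discovery step described in this commit can be illustrated with a small sketch. This is not the code from the patch: the object name `DatanucleusJars` and the method `find` are hypothetical, the directory is passed in by the caller, and the step that ships the jars to the YARN container is not shown.

```scala
import java.io.File

// Hypothetical helper, for illustration only; not part of Spark.
object DatanucleusJars {
  /** Lists files named like `datanucleus-*.jar` directly under `libDir`,
    * sorted by file name. Returns an empty sequence if `libDir` does not
    * exist or is not a directory. */
  def find(libDir: File): Seq[File] = {
    // listFiles() returns null when the path is not a readable directory.
    val entries = Option(libDir.listFiles()).getOrElse(Array.empty[File])
    entries
      .filter { f =>
        f.isFile && f.getName.startsWith("datanucleus-") && f.getName.endsWith(".jar")
      }
      .sortBy(_.getName)
      .toSeq
  }
}
```

A caller would pass the library directory of its Spark installation and then add each returned file to whatever is uploaded to the container.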
* [SPARK-4686] Link to allowed master URLs is broken | Kay Ousterhout | 2014-12-02 | 1 | -1/+1
| | | | | | | | | | | | | | | The link points to the old scala programming guide; it should point to the submitting applications page. This should be backported to 1.1.2 (it's been broken as of 1.0). Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #3542 from kayousterhout/SPARK-4686 and squashes the following commits: a8fc43b [Kay Ousterhout] [SPARK-4686] Link to allowed master URLs is broken (cherry picked from commit d9a148ba6a67a01e4bf77c35c41dd4cbc8918c82) Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com>
* [SQL][DOC] Date type in SQL programming guide | Daoyuan Wang | 2014-12-01 | 1 | -0/+23
| | | | | | | | | | | Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3535 from adrian-wang/datedoc and squashes the following commits: 18ff1ed [Daoyuan Wang] [DOC] Date type (cherry picked from commit 5edbcbfb61703398a24ce5162a74aba04e365b0c) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SQL] Minor fix for doc and comment | wangfei | 2014-12-01 | 1 | -1/+2
| | | | | | | | | | | Author: wangfei <wangfei1@huawei.com> Closes #3533 from scwf/sql-doc1 and squashes the following commits: 962910b [wangfei] doc and comment fix (cherry picked from commit 7b79957879db4dfcc7c3601cb40ac4fd576259a5) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4258][SQL][DOC] Documents spark.sql.parquet.filterPushdown | Cheng Lian | 2014-12-01 | 1 | -6/+16
Documents `spark.sql.parquet.filterPushdown`, explains why it's turned off by default and when it's safe to be turned on.

Author: Cheng Lian <lian@databricks.com>

Closes #3440 from liancheng/parquet-filter-pushdown-doc and squashes the following commits:

2104311 [Cheng Lian] Documents spark.sql.parquet.filterPushdown

(cherry picked from commit 5db8dcaf494e0dffed4fc22f19b0334d95ab6bfb)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* Documentation: add description for repartitionAndSortWithinPartitions | Madhu Siddalingaiah | 2014-12-01 | 1 | -0/+6
| | | | | | | | | | | | | | Author: Madhu Siddalingaiah <madhu@madhu.com> Closes #3390 from msiddalingaiah/master and squashes the following commits: cbccbfe [Madhu Siddalingaiah] Documentation: replace <b> with <code> (again) 332f7a2 [Madhu Siddalingaiah] Documentation: replace <b> with <code> cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master' 0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions (cherry picked from commit 2b233f5fc4beb2c6ed4bc142e923e96f8bad3ec4) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* [DOC] Fixes formatting typo in SQL programming guide | Cheng Lian | 2014-11-30 | 1 | -2/+0
Author: Cheng Lian <lian@databricks.com>

Closes #3498 from liancheng/fix-sql-doc-typo and squashes the following commits:

865ecd7 [Cheng Lian] Fixes formatting typo in SQL programming guide

(cherry picked from commit 2a4d389f70b2066b1ac32b081bef44e61fefb03c)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* [SPARK-4656][Doc] Typo in Programming Guide markdown | lewuathe | 2014-11-30 | 1 | -1/+1
| | | | | | | | | | | | | Grammatical error in Programming Guide document Author: lewuathe <lewuathe@me.com> Closes #3412 from Lewuathe/typo-programming-guide and squashes the following commits: a3e2f00 [lewuathe] Typo in Programming Guide markdown (cherry picked from commit a217ec5fd5cd7addc69e538d6ec6dd64956cc8ed) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* [DOCS][BUILD] Add instruction to use change-version-to-2.11.sh in 'Building ↵ | Takuya UESHIN | 2014-11-30 | 1 | -0/+1
| | | | | | | | | | | | | | | for Scala 2.11'. To build with Scala 2.11, we have to execute `change-version-to-2.11.sh` before Maven execute, otherwise inter-module dependencies are broken. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #3361 from ueshin/docs/building-spark_2.11 and squashes the following commits: 1d29126 [Takuya UESHIN] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'. (cherry picked from commit 0fcd24cc542040ff3555290eec7b021062e7e6ac) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-732][SPARK-3628][CORE][RESUBMIT] eliminate duplicate update on accmulator | CodingCat | 2014-11-26 | 1 | -0/+6
https://issues.apache.org/jira/browse/SPARK-3628

In the current implementation, an accumulator is updated for every successfully finished task, even when the task belongs to a resubmitted stage, which makes accumulator values counter-intuitive.

In this patch, I changed the way the DAGScheduler updates accumulators. The DAGScheduler maintains a HashTable that maps each stage id to the <accumulator_id, value> pairs received for it. The values of those pairs are accumulated only when the stage becomes independent (no job needs it any more). When a task finishes, we check whether the HashTable already contains its stageId; the <accumulator_id, value> pair is saved only when the task is the first finished task of a new stage or the stage is running for its first attempt...

Author: CodingCat <zhunansjtu@gmail.com>

Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits:

701a1e8 [CodingCat] roll back change on Accumulator.scala
1433e6f [CodingCat] make MIMA happy
b233737 [CodingCat] address Matei's comments
02261b8 [CodingCat] rollback some changes
6b0aff9 [CodingCat] update document
2b2e8cf [CodingCat] updateAccumulator
83b75f8 [CodingCat] style fix
84570d2 [CodingCat] re-enable the bad accumulator guard
1e9e14d [CodingCat] add NPE guard
21b6840 [CodingCat] simplify the patch
88d1f03 [CodingCat] fix rebase error
f74266b [CodingCat] add test case for resubmitted result stage
5cf586f [CodingCat] de-duplicate on task level
138f9b3 [CodingCat] make MIMA happy
67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator

(cherry picked from commit 5af53ada65f62e6b5987eada288fb48e9211ef9d)
Signed-off-by: Matei Zaharia <matei@databricks.com>
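The general idea of not counting an accumulator update twice when a stage is resubmitted can be sketched in a few lines. This is a simplified, task-level illustration and not the DAGScheduler code from the patch: the class `AccumulatorLedger`, its methods, and the restriction to `Long` counters are all assumptions made for the example.

```scala
import scala.collection.mutable

// Hypothetical, simplified model; not Spark's DAGScheduler.
class AccumulatorLedger {
  // accumulator id -> merged value (Long counters only, for brevity)
  private val totals = mutable.Map.empty[Long, Long].withDefaultValue(0L)
  // (stageId, partitionId) pairs whose updates have already been applied
  private val applied = mutable.Set.empty[(Int, Int)]

  /** Applies `updates` only the first time this (stage, partition) finishes.
    * Returns true if the updates were applied, false if they were ignored
    * as a duplicate from a resubmitted task. */
  def taskFinished(stageId: Int, partitionId: Int, updates: Map[Long, Long]): Boolean = {
    if (applied.add((stageId, partitionId))) {
      updates.foreach { case (accId, delta) => totals(accId) += delta }
      true
    } else {
      false
    }
  }

  /** Current merged value of an accumulator (0 if never updated). */
  def value(accId: Long): Long = totals(accId)
}
```

With this ledger, a task that reruns for partition 0 of stage 1 reports its update again, but the second report is dropped, so the accumulator total reflects each partition once.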
* HOTFIX: Updating additional version data | Patrick Wendell | 2014-11-26 | 1 | -1/+1
* [Spark-4509] Revert EC2 tag-based cluster membership patch | Xiangrui Meng | 2014-11-25 | 1 | -8/+6
| | | | | | | | | | | | | | | | | | | | | | | | | This PR reverts changes related to tag-based cluster membership. As discussed in SPARK-3332, we didn't figure out a safe strategy to use tags to determine cluster membership, because tagging is not atomic. The following changes are reverted: SPARK-2333: 94053a7b766788bb62e2dbbf352ccbcc75f71fc0 SPARK-3213: 7faf755ae4f0cf510048e432340260a6e609066d SPARK-3608: 78d4220fa0bf2f9ee663e34bbf3544a5313b02f0. I tested launch, login, and destroy. It is easy to check the diff by comparing it to Josh's patch for branch-1.1: https://github.com/apache/spark/pull/2225/files JoshRosen I sent the PR to master. It might be easier for us to keep master and branch-1.2 the same at this time. We can always re-apply the patch once we figure out a stable solution. Author: Xiangrui Meng <meng@databricks.com> Closes #3453 from mengxr/SPARK-4509 and squashes the following commits: f0b708b [Xiangrui Meng] revert 94053a7b766788bb62e2dbbf352ccbcc75f71fc0 4298ea5 [Xiangrui Meng] revert 7faf755ae4f0cf510048e432340260a6e609066d 35963a1 [Xiangrui Meng] Revert "SPARK-3608 Break if the instance tag naming succeeds" (cherry picked from commit 7eba0fbe456c451122d7a2353ff0beca00f15223) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-4546] Improve HistoryServer first time user experience | Andrew Or | 2014-11-25 | 1 | -1/+1
| | | | | | | | | | | | | | | | | | | | | | The documentation points the user to run the following ``` sbin/start-history-server.sh ``` The first thing this does is throw an exception that complains a log directory is not specified. The exception message itself does not say anything about what to set. Instead we should have a default and a landing page with a better message. The new default log directory is `file:/tmp/spark-events`. This is what it looks like as of this PR: ![after](https://issues.apache.org/jira/secure/attachment/12682985/after.png) Author: Andrew Or <andrew@databricks.com> Closes #3411 from andrewor14/minor-history-improvements and squashes the following commits: f33d6b3 [Andrew Or] Point user to set config if default log dir does not exist fc4c17a [Andrew Or] Improve HistoryServer UX (cherry picked from commit 9afcbe494a3535a9bf7958429b72e989972f82d9) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-4344][DOCS] adding documentation on spark.yarn.user.classpath.first | arahuja | 2014-11-25 | 1 | -0/+1
| | | | | | | | | | | | | The documentation for the two parameters is the same with a pointer from the standalone parameter to the yarn parameter Author: arahuja <aahuja11@gmail.com> Closes #3209 from arahuja/yarn-classpath-first-param and squashes the following commits: 51cb9b2 [arahuja] [SPARK-4344][DOCS] adding documentation for YARN on userClassPathFirst (cherry picked from commit d240760191f692ee7b88dfc82f06a31a340a88a2) Signed-off-by: Thomas Graves <tgraves@apache.org>
* [DOC][Build] Wrong cmd for build spark with apache hadoop 2.4.X and hive 12 | wangfei | 2014-11-24 | 1 | -1/+1
| | | | | | | | | | | | Author: wangfei <wangfei1@huawei.com> Closes #3335 from scwf/patch-10 and squashes the following commits: d343113 [wangfei] add '-Phive' 60d595e [wangfei] [DOC] Wrong cmd for build spark with apache hadoop 2.4.X and Hive 12 support (cherry picked from commit 0fe54cff19759dad2dc2a0950bd6c1d31c95e858) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-4457. Document how to build for Hadoop versions greater than 2.4 | Sandy Ryza | 2014-11-24 | 1 | -2/+5
| | | | | | | | | | | | | Author: Sandy Ryza <sandy@cloudera.com> Closes #3322 from sryza/sandy-spark-4457 and squashes the following commits: 5e72b77 [Sandy Ryza] Feedback 0cf05c1 [Sandy Ryza] Caveat be8084b [Sandy Ryza] SPARK-4457. Document how to build for Hadoop versions greater than 2.4 (cherry picked from commit 29372b63185a4a170178b6ec2362d7112f389852) Signed-off-by: Thomas Graves <tgraves@apache.org>
* add Sphinx as a dependency of building docs | Davies Liu | 2014-11-20 | 1 | -1/+6
| | | | | | | | | | | Author: Davies Liu <davies@databricks.com> Closes #3388 from davies/doc_readme and squashes the following commits: daa1482 [Davies Liu] add Sphinx dependency (cherry picked from commit 8cd6eea6298fc8e811dece38c2875e94ff863948) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Updating GraphX programming guide and documentation | Joseph E. Gonzalez | 2014-11-19 | 1 | -144/+216
| | | | | | | | | | | | | This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator. Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits: 4421964 [Joseph E. Gonzalez] updating documentation for graphx (cherry picked from commit 377b06820934cab6d67f3a9182528c7f417a7d98) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-4180] [Core] Prevent creation of multiple active SparkContexts | Josh Rosen | 2014-11-17 | 1 | -0/+2
This patch adds error-detection logic to throw an exception when attempting to create multiple active SparkContexts in the same JVM, since this is currently unsupported and has been known to cause confusing behavior (see SPARK-2243 for more details).

**The solution implemented here is only a partial fix.** A complete fix would have the following properties:

1. Only one SparkContext may ever be under construction at any given time.
2. Once a SparkContext has been successfully constructed, any subsequent construction attempts should fail until the active SparkContext is stopped.
3. If the SparkContext constructor throws an exception, then all resources created in the constructor should be cleaned up (SPARK-4194).
4. If a user attempts to create a SparkContext but the creation fails, then the user should be able to create new SparkContexts.

This PR only provides 2) and 4); we should be able to provide all of these properties, but the correct fix will involve larger changes to SparkContext's construction / initialization, so we'll target it for a different Spark release.

### The correct solution:

I think that the correct way to do this would be to move the construction of SparkContext's dependencies into a static method in the SparkContext companion object. Specifically, we could make the default SparkContext constructor `private` and change it to accept a `SparkContextDependencies` object that contains all of SparkContext's dependencies (e.g. DAGScheduler, ContextCleaner, etc.). Secondary constructors could call a method on the SparkContext companion object to create the `SparkContextDependencies` and pass the result to the primary SparkContext constructor.
For example:

```scala
// The primary constructor is private and receives pre-built dependencies.
class SparkContext private (deps: SparkContextDependencies) {
  // Public secondary constructor delegates to the companion object.
  def this(conf: SparkConf) {
    this(SparkContext.getDeps(conf))
  }
}

object SparkContext {
  // synchronized: only one thread may build dependencies at a time.
  private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
    if (anotherSparkContextIsActive) { throw new Exception(...) }
    var dagScheduler: DAGScheduler = null
    try {
      dagScheduler = new DAGScheduler(...)
      [...]
    } catch {
      case e: Exception =>
        // Clean up whatever was created before the failure.
        Option(dagScheduler).foreach(_.stop())
        [...]
    }
    SparkContextDependencies(dagScheduler, ....)
  }
}
```

This gives us mutual exclusion and ensures that any resources created during the failed SparkContext initialization are properly cleaned up. This indirection is necessary to maintain binary compatibility. In retrospect, it would have been nice if SparkContext had no public constructors and could only be created through builder / factory methods on its companion object, since this buys us lots of flexibility and makes dependency injection easier.

### Alternative solutions:

As an alternative solution, we could refactor SparkContext's primary constructor to perform all object creation in a giant `try-finally` block. Unfortunately, this will require us to turn a bunch of `vals` into `vars` so that they can be assigned from the `try` block. If we still want `vals`, we could wrap each `val` in its own `try` block (since the try block can return a value), but this will lead to extremely messy code and won't guard against the introduction of future code which doesn't properly handle failures. The more complex approach outlined above gives us some nice dependency injection benefits, so I think that might be preferable to a `var`-ification.

### This PR's solution:

- At the start of the constructor, check whether some other SparkContext is active; if so, throw an exception.
- If another SparkContext might be under construction (or has thrown an exception during construction), allow the new SparkContext to begin construction but log a warning (since resources might have been leaked from a failed creation attempt). - At the end of the SparkContext constructor, check whether some other SparkContext constructor has raced and successfully created an active context. If so, throw an exception. This guarantees that no two SparkContexts will ever be active and exposed to users (since we check at the very end of the constructor). If two threads race to construct SparkContexts, then one of them will win and another will throw an exception. This exception can be turned into a warning by setting `spark.driver.allowMultipleContexts = true`. The exception is disabled in unit tests, since there are some suites (such as Hive) that may require more significant refactoring to clean up their SparkContexts. I've made a few changes to other suites' test fixtures to properly clean up SparkContexts so that the unit test logs contain fewer warnings. Author: Josh Rosen <joshrosen@databricks.com> Closes #3121 from JoshRosen/SPARK-4180 and squashes the following commits: 23c7123 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180 d38251b [Josh Rosen] Address latest round of feedback. c0987d3 [Josh Rosen] Accept boolean instead of SparkConf in methods. 85a424a [Josh Rosen] Incorporate more review feedback. 372d0d3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180 f5bb78c [Josh Rosen] Update mvn build, too. d809cb4 [Josh Rosen] Improve handling of failed SparkContext creation attempts. 79a7e6f [Josh Rosen] Fix commented out test a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180 7ba6db8 [Josh Rosen] Add utility to set system properties in tests. 4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests. 
ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests. 1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180 06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet. c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging. 918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation. afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts. (cherry picked from commit 0f3ceb56c78e7260725a09fba0e10aa193cbda4b) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
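The check-at-start and check-at-end approach described under "This PR's solution" can be sketched as follows. This is an illustrative stand-in, not Spark's implementation: the class `Context`, its `allowMultiple` flag (playing the role of `spark.driver.allowMultipleContexts`), and the helper names in the companion object are all hypothetical.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for SparkContext; not Spark's actual code.
class Context(val name: String, allowMultiple: Boolean = false) {
  // Check at the start of the constructor: fail fast if one is active.
  Context.assertNoOtherActive(this, allowMultiple)

  // ... expensive initialization would happen here ...

  // Check again at the end, in case another constructor raced and won.
  Context.markActive(this, allowMultiple)

  /** Stops this context so that a new one may be created. */
  def stop(): Unit = Context.clearActive(this)
}

object Context {
  // The single active context, or null when none is active.
  private val active = new AtomicReference[Context]()

  private def assertNoOtherActive(ctx: Context, allowMultiple: Boolean): Unit = {
    val other = active.get()
    if (other != null && (other ne ctx)) {
      val msg = s"Only one active context is allowed; '${other.name}' is still running"
      // With the override flag set, the error is downgraded to a warning.
      if (allowMultiple) Console.err.println(s"WARN: $msg")
      else throw new IllegalStateException(msg)
    }
  }

  private def markActive(ctx: Context, allowMultiple: Boolean): Unit = {
    // compareAndSet succeeds only if no context became active in the meantime.
    if (!active.compareAndSet(null, ctx)) {
      assertNoOtherActive(ctx, allowMultiple)
      active.set(ctx)
    }
  }

  private def clearActive(ctx: Context): Unit = {
    active.compareAndSet(ctx, null)
    ()
  }
}
```

In this sketch, creating a second `Context` while the first is running throws an `IllegalStateException`, and creation succeeds again once the first has been stopped, which corresponds to properties 2) and 4) in the list above. Cleaning up resources from a constructor that fails partway through (property 3) is not modeled.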
* [DOCS][SQL] Fix broken link to Row class scaladoc | Andy Konwinski | 2014-11-17 | 1 | -1/+1
| | | | | | | | | | | Author: Andy Konwinski <andykonwinski@gmail.com> Closes #3323 from andyk/patch-2 and squashes the following commits: 4699fdc [Andy Konwinski] Fix broken link to Row class scaladoc (cherry picked from commit cec1116b4b80c36b36a8a13338b948e4d6ade377) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4363][Doc] Update the Broadcast example | zsxwing | 2014-11-14 | 1 | -1/+1
| | | | | | | | | | | Author: zsxwing <zsxwing@gmail.com> Closes #3226 from zsxwing/SPARK-4363 and squashes the following commits: 8109914 [zsxwing] Update the Broadcast example (cherry picked from commit 861223ee5bea8e434a9ebb0d53f436ce23809f9c) Signed-off-by: Reynold Xin <rxin@databricks.com>
* SPARK-4375. no longer require -Pscala-2.10 | Sandy Ryza | 2014-11-14 | 1 | -2/+2
| | | | | | | | | | | | | | | It seems like the winds might have moved away from this approach, but wanted to post the PR anyway because I got it working and to show what it would look like. Author: Sandy Ryza <sandy@cloudera.com> Closes #3239 from sryza/sandy-spark-4375 and squashes the following commits: 0ffbe95 [Sandy Ryza] Enable -Dscala-2.11 in sbt cd42d94 [Sandy Ryza] Update doc f6644c3 [Sandy Ryza] SPARK-4375 take 2 (cherry picked from commit f5f757e4ed80759dc5668c63d5663651689f8da8) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Support cross building for Scala 2.11 | Prashant Sharma | 2014-11-11 | 2 | -12/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let's give this another go using a version of Hive that shades its JLine dependency. Author: Prashant Sharma <prashant.s@imaginea.com> Author: Patrick Wendell <pwendell@gmail.com> Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits: e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script. f65d17d [Patrick Wendell] Fixing build issue due to merge conflict a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state. 7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant 583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver 3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests." 935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily." 925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily. 2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future. 8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven. 5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs. 2121071 [Patrick Wendell] Migrating version detection to PySpark b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests. 1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11 f5cad4e [Patrick Wendell] Add Scala 2.11 docs 210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline" 48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles. 
e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only" 67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check 8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only e22b104 [Patrick Wendell] Small fix in pom file ec402ab [Patrick Wendell] Various fixes 0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline 4eaec65 [Prashant Sharma] Changed scripts to ignore target. 5167bea [Prashant Sharma] small correction a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins. 80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests. 034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt. d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11. 6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10 e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted. 937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION cb059b0 [Prashant Sharma] Code review 0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes. (cherry picked from commit daaca14c16dc2c1abc98f15ab8c6f7c14761b627) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-4330][Doc] Link to proper URL for YARN overview | Kousuke Saruta | 2014-11-10 | 1 | -1/+1
running-on-yarn.md links to the YARN overview, but the URL points to the YARN alpha documentation. It should point to the stable documentation.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3196 from sarutak/SPARK-4330 and squashes the following commits:

30baa21 [Kousuke Saruta] Fixed running-on-yarn.md to point proper URL for YARN

(cherry picked from commit 3c07b8f08240bafcdff5d174989fb433f4bc80b6)
Signed-off-by: Matei Zaharia <matei@databricks.com>
* SPARK-4230. Doc for spark.default.parallelism is incorrect | Sandy Ryza | 2014-11-10 | 1 | -2/+5
| | | | | | | | | | | | Author: Sandy Ryza <sandy@cloudera.com> Closes #3107 from sryza/sandy-spark-4230 and squashes the following commits: 37a1d19 [Sandy Ryza] Clear up a couple things 34d53de [Sandy Ryza] SPARK-4230. Doc for spark.default.parallelism is incorrect (cherry picked from commit c6f4e704214097f17d2d6abfbfef4bb208e4339f) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-971 [DOCS] Link to Confluence wiki from project website / documentationSean Owen2014-11-091-0/+1
| | | | | | | | | | | | | This is a trivial change to add links to the wiki from `README.md` and the main docs page. It is already linked to from spark.apache.org. Author: Sean Owen <sowen@cloudera.com> Closes #3169 from srowen/SPARK-971 and squashes the following commits: dcb84d0 [Sean Owen] Add link to wiki from README, docs home page (cherry picked from commit 8c99a47a4f0369ff3c1ecaeb860fa61ee789e987) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SQL][DOC][Minor] Spark SQL Hive now support dynamic partitioningwangfei2014-11-071-1/+0
| | | | | | | | | | | Author: wangfei <wangfei1@huawei.com> Closes #3127 from scwf/patch-9 and squashes the following commits: e39a560 [wangfei] now support dynamic partitioning (cherry picked from commit 636d7bcc96b912f5b5caa91110cd55b55fa38ad8) Signed-off-by: Michael Armbrust <michael@databricks.com>
* SPARK-4040. Update documentation to exemplify use of local (n) value, fo...jay@apache.org2014-11-052-7/+17
| | | | | | | | | | | | | This is a minor docs update which helps to clarify the way local[n] is used for streaming apps. Author: jay@apache.org <jayunit100> Closes #2964 from jayunit100/SPARK-4040 and squashes the following commits: 35b5a5e [jay@apache.org] SPARK-4040: Update documentation to exemplify use of local (n) value. (cherry picked from commit 868cd4c3ca11e6ecc4425b972d9a20c360b52425) Signed-off-by: Matei Zaharia <matei@databricks.com>
* [SPARK-2938] Support SASL authentication in NettyBlockTransferServiceAaron Davidson2014-11-051-1/+0
| | | | | | | | | | | | | | | | | | | Also lays the groundwork for supporting it inside the external shuffle service. Author: Aaron Davidson <aaron@databricks.com> Closes #3087 from aarondav/sasl and squashes the following commits: 3481718 [Aaron Davidson] Delete rogue println 44f8410 [Aaron Davidson] Delete documentation - muahaha! eb9f065 [Aaron Davidson] Improve documentation and add end-to-end test at Spark-level a6b95f1 [Aaron Davidson] Address comments 785bbde [Aaron Davidson] Cleanup 79973cb [Aaron Davidson] Remove unused file 151b3c5 [Aaron Davidson] Add docs, timeout config, better failure handling f6177d7 [Aaron Davidson] Cleanup SASL state upon connection termination 7b42adb [Aaron Davidson] Add unit tests 8191bcb [Aaron Davidson] [SPARK-2938] Support SASL authentication in NettyBlockTransferService
* [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python APIDavies Liu2014-11-041-0/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ``` pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None) :: Experimental :: If `observed` is a Vector, conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution, or against the uniform distribution (by default), with each category having an expected frequency of `1 / len(observed)`. (Note: `observed` cannot contain negative values) If `observed` is a matrix, conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0. If `observed` is an RDD of LabeledPoint, conduct Pearson's independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the chi-squared statistic is computed. All label and feature values must be categorical. :param observed: it could be a vector containing the observed categorical counts/relative frequencies, or the contingency matrix (containing either counts or relative frequencies), or an RDD of LabeledPoint containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value. :param expected: Vector containing the expected categorical counts/relative frequencies. `expected` is rescaled if the `expected` sum differs from the `observed` sum. :return: ChiSquaredTest object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis. ``` Author: Davies Liu <davies@databricks.com> Closes #3091 from davies/his and squashes the following commits: 145d16c [Davies Liu] address comments 0ab0764 [Davies Liu] fix float 5097d54 [Davies Liu] add Hypothesis test Python API (cherry picked from commit c8abddc5164d8cf11cdede6ab3d5d1ea08028708) Signed-off-by: Xiangrui Meng <meng@databricks.com>
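The goodness-of-fit case described in the docstring above can be sketched in plain Python. This is only an illustration of Pearson's statistic, not the MLlib implementation: the function name `chi_sq_goodness_of_fit` is hypothetical, it assumes every expected entry is positive, and it returns only the statistic and the degrees of freedom (no p-value).

```python
def chi_sq_goodness_of_fit(observed, expected=None):
    """Return (statistic, degrees_of_freedom) for Pearson's chi-squared
    goodness-of-fit test. Hypothetical helper, not part of pyspark."""
    n = len(observed)
    if n == 0:
        raise ValueError("observed must not be empty")
    if any(x < 0 for x in observed):
        raise ValueError("observed cannot contain negative values")
    if expected is None:
        # Default: uniform distribution, each category has frequency 1 / n.
        expected = [1.0 / n] * n
    if len(expected) != n:
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        # Assumption of this sketch: expected entries are strictly positive.
        raise ValueError("expected entries must be positive")

    # Rescale expected so that its sum matches the observed sum.
    scale = float(sum(observed)) / sum(expected)
    scaled = [e * scale for e in expected]

    # Pearson's statistic: sum over categories of (O - E)^2 / E.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, scaled))
    return statistic, n - 1
```

With counts `[10, 20, 30]` and the default uniform expectation, each category is expected to hold 20 observations, so the statistic is (100 + 0 + 100) / 20 = 10 with 2 degrees of freedom.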
* fixed MLlib Naive-Bayes java example bugDariusz Kobylarz2014-11-041-3/+3
| | | | | | | | | | | | | | the filter tests Double objects by references whereas it should test their values Author: Dariusz Kobylarz <darek.kobylarz@gmail.com> Closes #3081 from dkobylarz/master and squashes the following commits: 5d43a39 [Dariusz Kobylarz] naive bayes example update a304b93 [Dariusz Kobylarz] fixed MLlib Naive-Bayes java example bug (cherry picked from commit bcecd73fdd4d2ec209259cfd57d3ad1d63f028f2) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SQL] More aggressive defaultsMichael Armbrust2014-11-031-5/+13
| | | | | | | | | | | | | | | | | | | | | | | - Turns on compression for in-memory cached data by default - Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory) - Ups the batch size to 10,000 rows - Increases the broadcast threshold to 10mb. - Uses our parquet implementation instead of the hive one by default. - Cache parquet metadata by default. Author: Michael Armbrust <michael@databricks.com> Closes #3064 from marmbrus/fasterDefaults and squashes the following commits: 97ee9f8 [Michael Armbrust] parquet codec docs e641694 [Michael Armbrust] Remote also a12866a [Michael Armbrust] Cache metadata. 2d73acc [Michael Armbrust] Update docs defaults. d63d2d5 [Michael Armbrust] document parquet option da373f9 [Michael Armbrust] More aggressive defaults (cherry picked from commit 25bef7e6951301e93004567fc0cef96bf8d1a224) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4177][Doc]update build doc since JDBC/CLI support hive 13 nowwangfei2014-11-021-10/+7
| | | | | | | | | Author: wangfei <wangfei1@huawei.com> Closes #3042 from scwf/patch-9 and squashes the following commits: 3784ed1 [wangfei] remove 'TODO' 1891553 [wangfei] update build doc since JDBC/CLI support hive 13
* [SPARK-4183] Enable NettyBlockTransferService by defaultAaron Davidson2014-11-021-0/+10
| | | | | | | | | | Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues. Author: Aaron Davidson <aaron@databricks.com> Closes #3049 from aarondav/enable-netty and squashes the following commits: bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
* [SPARK-3466] Limit size of results that a driver collects for each actionDavies Liu2014-11-021-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | Right now, operations like collect() and take() can crash the driver with an OOM if they bring back too much data. This PR introduces spark.driver.maxResultSize; once it is set, the driver aborts a job if its result is bigger than that limit. By default, it's 1g (for backward compatibility in most cases). In local mode, the driver and executor share the same JVM, so the default setting cannot protect the JVM from OOM. cc mateiz Author: Davies Liu <davies@databricks.com> Closes #3003 from davies/collect and squashes the following commits: 248ed5e [Davies Liu] fix compile 272522e [Davies Liu] address comments 2c35773 [Davies Liu] add sizes in message of abort() 5d62303 [Davies Liu] address comments bc3c077 [Davies Liu] Merge branch 'master' of github.com:apache/spark into collect 11f97c5 [Davies Liu] address comments 47b144f [Davies Liu] check the size of result before send and fetch 3d81af2 [Davies Liu] address comments ca8267d [Davies Liu] limit the size of data by collect
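The idea behind the limit can be modelled with a toy driver loop in plain Python. This is a sketch of the behaviour described above, not Spark's code: `collect_with_limit` and `ResultSizeLimitExceeded` are hypothetical names, and task results are represented as `bytes` objects whose length stands in for the serialized result size.

```python
class ResultSizeLimitExceeded(Exception):
    """Raised when the accumulated result size passes the limit (hypothetical)."""


def collect_with_limit(task_results, max_result_bytes):
    """Gather task results, aborting once their total size exceeds the limit.

    task_results: iterable of bytes objects, one per finished task.
    max_result_bytes: plays the role of spark.driver.maxResultSize in this sketch.
    """
    total = 0
    collected = []
    for result in task_results:
        total += len(result)
        if total > max_result_bytes:
            # Abort before keeping more data than the driver is allowed to hold.
            raise ResultSizeLimitExceeded(
                "total result size %d bytes exceeds limit of %d bytes"
                % (total, max_result_bytes)
            )
        collected.append(result)
    return collected
```

The point of checking inside the loop is that the job fails as soon as the running total passes the limit, rather than after all results have already been pulled into driver memory.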
* Revert "[SPARK-4183] Enable NettyBlockTransferService by default"Patrick Wendell2014-11-011-10/+0
| | | | This reverts commit 59e626c701227634336110e1bc23afd94c535ede.
* [SPARK-4183] Enable NettyBlockTransferService by defaultAaron Davidson2014-11-011-0/+10
| | | | | | | | | | Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues. Author: Aaron Davidson <aaron@databricks.com> Closes #3049 from aarondav/enable-netty and squashes the following commits: bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
* Streaming KMeans [MLLIB][SPARK-3254]freeman2014-10-311-1/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches. The PR includes: - StreamingKMeans algorithm with decay factor settings - Usage example - Additions to documentation clustering page - Unit tests of basic behavior and decay behaviors tdas mengxr rezazadeh Author: freeman <the.freeman.lab@gmail.com> Author: Jeremy Freeman <the.freeman.lab@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #2942 from freeman-lab/streaming-kmeans and squashes the following commits: b2e5b4a [freeman] Fixes to docs / examples 078617c [Jeremy Freeman] Merge pull request #1 from mengxr/SPARK-3254 2e682c0 [Xiangrui Meng] take discount on previous weights; use BLAS; detect dying clusters 0411bf5 [freeman] Change decay parameterization 9f7aea9 [freeman] Style fixes 374a706 [freeman] Formatting ad9bdc2 [freeman] Use labeled points and predictOnValues in examples 77dbd3f [freeman] Make initialization check an assertion 9cfc301 [freeman] Make random seed an argument 44050a9 [freeman] Simpler constructor c7050d5 [freeman] Fix spacing 2899623 [freeman] Use pattern matching for clarity a4a316b [freeman] Use collect 1472ec5 [freeman] Doc formatting ea22ec8 [freeman] Fix imports 2086bdc [freeman] Log cluster center updates ea9877c [freeman] More documentation 9facbe3 [freeman] Bug fix 5db7074 [freeman] Example usage for StreamingKMeans f33684b [freeman] Add explanation and example to docs b5b5f8d [freeman] Add better documentation a0fd790 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans 9fd9c15 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans b93350f [freeman] Streaming KMeans with decay
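A decayed mini-batch update of the kind this commit describes can be sketched for a single cluster in plain Python. This is an illustration under stated assumptions, not the StreamingKMeans code: `update_center` is a hypothetical helper, the previous weight is discounted by the decay factor before being combined with the new batch, and assigning points to clusters is left out.

```python
def update_center(center, weight, batch_points, decay):
    """Update one cluster center with a batch of points assigned to it.

    center: list of floats, the current center.
    weight: effective number of points seen so far for this cluster.
    batch_points: list of points (lists of floats) assigned to this cluster.
    decay: 1.0 keeps all history, 0.0 forgets everything before this batch.
    Returns (new_center, new_weight).
    """
    discounted = weight * decay  # past data counts for less as decay shrinks
    m = len(batch_points)
    if m == 0:
        # Nothing new for this cluster: only the weight decays.
        return list(center), discounted

    dim = len(center)
    batch_mean = [sum(p[i] for p in batch_points) / float(m) for i in range(dim)]

    new_weight = discounted + m
    # Weighted average of the old center and the batch mean.
    new_center = [
        (c * discounted + x * m) / new_weight
        for c, x in zip(center, batch_mean)
    ]
    return new_center, new_weight
```

With `decay=1.0` this reduces to a running mean over all data seen so far; with `decay=0.0` the center jumps to the mean of the latest batch.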
* [SPARK-3838][examples][mllib][python] Word2Vec example in pythonAnant2014-10-311-0/+17
| | | | | | | | | | | | | | | | | | | | This pull request refers to issue: https://issues.apache.org/jira/browse/SPARK-3838 Python example for word2vec mengxr Author: Anant <anant.asty@gmail.com> Closes #2952 from anantasty/SPARK-3838 and squashes the following commits: 87bd723 [Anant] remove stop line 4bd439e [Anant] Changes as per code review. Fized error in word2vec python example, simplified example in docs. 3d3c9ee [Anant] Added empty line after python imports 0c90c31 [Anant] Fixed erroneous code. I was still treating each line to be a single word instead of 16 words ee4f5f6 [Anant] Fixes from code review comments c637bcf [Anant] Added word2vec python example to docs 269f31f [Anant] added example in docs c015b14 [Anant] Added python example for word2vec
* [SPARK-4089][Doc][Minor] The version number of Spark in _config.yaml is wrong.Kousuke Saruta2014-10-281-3/+3
| | | | | | | | | | The version number of Spark in docs/_config.yaml for master branch should be 1.2.0 for now. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2943 from sarutak/SPARK-4089 and squashes the following commits: aba7fb4 [Kousuke Saruta] Fixed the version number of Spark in _config.yaml
* [SPARK-3961] [MLlib] [PySpark] Python API for mllib.featureDavies Liu2014-10-281-0/+85
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Added completed Python API for MLlib.feature Normalizer StandardScalerModel StandardScaler HashTF IDFModel IDF cc mengxr Author: Davies Liu <davies@databricks.com> Author: Davies Liu <davies.liu@gmail.com> Closes #2819 from davies/feature and squashes the following commits: 4f48f48 [Davies Liu] add a note for HashingTF 67f6d21 [Davies Liu] address comments b628693 [Davies Liu] rollback changes in Word2Vec efb4f4f [Davies Liu] Merge branch 'master' into feature 806c7c2 [Davies Liu] address comments 3abb8c2 [Davies Liu] address comments 59781b9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into feature a405ae7 [Davies Liu] fix tests 7a1891a [Davies Liu] fix tests 486795f [Davies Liu] update programming guide, HashTF -> HashingTF 8a50584 [Davies Liu] Python API for mllib.feature
* [SPARK-4032] Deprecate YARN alpha support in Spark 1.2Prashant Sharma2014-10-271-1/+3
| | | | | | | | | | Author: Prashant Sharma <prashant.s@imaginea.com> Closes #2878 from ScrapCodes/SPARK-4032/deprecate-yarn-alpha and squashes the following commits: 17e9857 [Prashant Sharma] added deperecated comment to Client and ExecutorRunnable. 3a34b1e [Prashant Sharma] Updated docs... 4608dea [Prashant Sharma] [SPARK-4032] Deprecate YARN alpha support in Spark 1.2
* [SQL][DOC] Wrong package name "scala.math.sql" in sql-programming-guide.mdKousuke Saruta2014-10-261-1/+1
| | | | | | | | | | In sql-programming-guide.md, there is a wrong package name "scala.math.sql". Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2873 from sarutak/wrong-packagename-fix and squashes the following commits: 4d5ecf4 [Kousuke Saruta] Fixed wrong package name in sql-programming-guide.md
* [SPARK-2321] Stable pull-based progress / status APIJosh Rosen2014-10-251-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This pull request is a first step towards the implementation of a stable, pull-based progress / status API for Spark (see [SPARK-2321](https://issues.apache.org/jira/browse/SPARK-2321)). For now, I'd like to discuss the basic implementation, API names, and overall interface design. Once we arrive at a good design, I'll go back and add additional methods to expose more information via this API. #### Design goals: - Pull-based API - Usable from Java / Scala / Python (eventually, likely with a wrapper) - Can be extended to expose more information without introducing binary incompatibilities. - Returns immutable objects. - Don't leak any implementation details, preserving our freedom to change the implementation. #### Implementation: - Add public methods (`getJobInfo`, `getStageInfo`) to SparkContext to allow status / progress information to be retrieved. - Add public interfaces (`SparkJobInfo`, `SparkStageInfo`) for our API return values. These interfaces consist entirely of Java-style getter methods. The interfaces are currently implemented in Java. I decided to explicitly separate the interface from its implementation (`SparkJobInfoImpl`, `SparkStageInfoImpl`) in order to prevent users from constructing these responses themselves. - Allow an existing JobProgressListener to be used when constructing a live SparkUI. This allows us to re-use this listener in the implementation of this status API. There are a few reasons why this listener re-use makes sense: - The status API and web UI are guaranteed to show consistent information. - These listeners are already well-tested. - The same garbage-collection / information retention configurations can apply to both this API and the web UI. - Extend JobProgressListener to maintain `jobId -> Job` and `stageId -> Stage` mappings. The progress API methods are implemented in a separate trait that's mixed into SparkContext. This helps to keep SparkContext.scala from becoming larger and more difficult to read. Author: Josh Rosen <joshrosen@databricks.com> Author: Josh Rosen <joshrosen@apache.org> Closes #2696 from JoshRosen/progress-reporting-api and squashes the following commits: e6aa78d [Josh Rosen] Add tests. b585c16 [Josh Rosen] Accept SparkListenerBus instead of more specific subclasses. c96402d [Josh Rosen] Address review comments. 2707f98 [Josh Rosen] Expose current stage attempt id c28ba76 [Josh Rosen] Update demo code: 646ff1d [Josh Rosen] Document spark.ui.retainedJobs. 7f47d6d [Josh Rosen] Clean up SparkUI constructors, per Andrew's feedback. b77b3d8 [Josh Rosen] Merge remote-tracking branch 'origin/master' into progress-reporting-api 787444c [Josh Rosen] Move status API methods into trait that can be mixed into SparkContext. f9a9a00 [Josh Rosen] More review comments: 3dc79af [Josh Rosen] Remove creation of unused listeners in SparkContext. 249ca16 [Josh Rosen] Address several review comments: da5648e [Josh Rosen] Add example of basic progress reporting in Java. 7319ffd [Josh Rosen] Add getJobIdsForGroup() and num*Tasks() methods. cc568e5 [Josh Rosen] Add note explaining that interfaces should not be implemented outside of Spark. 6e840d4 [Josh Rosen] Remove getter-style names and "consistent snapshot" semantics: 08cbec9 [Josh Rosen] Begin to sketch the interfaces for a stable, public status API. ac2d13a [Josh Rosen] Add jobId->stage, stageId->stage mappings in JobProgressListener 24de263 [Josh Rosen] Create UI listeners in SparkContext instead of in Tabs: