| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
mengxr https://issues.apache.org/jira/browse/SPARK-11297
Add new code tags to keep the same look and feel as the previous documents.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9265 from yinxusen/SPARK-11297.
|
|
|
|
|
|
|
|
|
|
|
|
| |
include_example
mengxr https://issues.apache.org/jira/browse/SPARK-11289
I made some changes to the ML feature extractors, i.e. TF-IDF, Word2Vec, and CountVectorizer. I added new example code in spark/examples; I hope it is the right place to add those examples.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9266 from yinxusen/SPARK-11289.
|
|
|
|
|
|
|
|
| |
The SQL programming guide's link to the DataFrame functions reference points to the wrong location; this patch fixes that.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #9269 from JoshRosen/SPARK-11299.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a new spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes.
The existing spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395).
BTW, the [environment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate the R shell on the local host.
For your information, PySpark has two environment variables serving a similar purpose:
PYSPARK_PYTHON Python binary executable to use for PySpark in both driver and workers (default is `python`).
PYSPARK_DRIVER_PYTHON Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON).
PySpark uses the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the Python executable for a Python script.
Author: Sun Rui <rui.sun@intel.com>
Closes #9179 from sun-rui/SPARK-10971.
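The precedence described in this message can be sketched in a few lines. This is an illustration only, not Spark's launcher code: the function names and the dictionary-based configuration are hypothetical, and both the `Rscript` fallback and the fallback from the driver option to `spark.sparkr.r.command` in client mode are assumptions made for the example.

```python
def pyspark_driver_python(env):
    """Pick the driver-side Python executable as described in the message."""
    # PYSPARK_PYTHON defaults to "python";
    # PYSPARK_DRIVER_PYTHON defaults to PYSPARK_PYTHON.
    worker_python = env.get("PYSPARK_PYTHON", "python")
    return env.get("PYSPARK_DRIVER_PYTHON", worker_python)


def sparkr_driver_command(conf, deploy_mode):
    """Pick the R executable for the driver (hypothetical helper)."""
    # Assumption: "Rscript" is used when nothing is configured.
    command = conf.get("spark.sparkr.r.command", "Rscript")
    if deploy_mode == "client":
        # Assumption: the driver-only option wins in client mode
        # and otherwise falls back to the shared option.
        command = conf.get("spark.sparkr.r.driver.command", command)
    return command
```

Keeping the driver-only setting separate lets a client-mode driver use a locally installed R while the workers keep the cluster-wide executable.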
|
|
|
|
|
|
|
|
|
|
| |
Proof-of-concept code for making the example code in the user guide testable.
mengxr We still need to talk about the labels in code.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9109 from yinxusen/SPARK-10382.
|
|
|
|
|
|
|
|
| |
Fixed a typo on line 8 of the Markdown: "Received" -> "Receiver"
Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu>
Closes #9242 from RohanBhanderi/patch-1.
|
|
|
|
|
|
|
|
| |
There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Now that UnsafeShuffleManager supports large records, the two provide the same set of functionality, so I think that we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and merge the two managers together.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8829 from JoshRosen/consolidate-sort-shuffle-implementations.
|
|
|
|
|
|
|
|
|
|
|
| |
Currently the log4j.properties file is not uploaded to the executors, which leads them to use the default values. This fix makes sure that the file is always uploaded to the distributed cache so that executors use the latest settings.
If the user specifies log configurations through --files, then executors will pick up the configs from --files instead of $SPARK_CONF_DIR/log4j.properties.
Author: vundela <vsr@cloudera.com>
Author: Srinivasa Reddy Vundela <vsr@cloudera.com>
Closes #9118 from vundela/master.
|
|
|
|
|
|
|
|
| |
This patch fixes a small typo in the GraphX programming guide
Author: Lukasz Piepiora <lpiepiora@gmail.com>
Closes #9160 from lpiepiora/11174-fix-typo-in-graphx-programming-guide.
|
|
|
|
|
|
| |
Author: Britta Weber <britta.weber@elasticsearch.com>
Closes #9136 from brwe/typo-bellow.
|
|
|
|
|
|
|
|
|
|
| |
Add documentation for configuration:
- spark.sql.ui.retainedExecutions
- spark.streaming.ui.retainedBatches
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes #9052 from pnpritchard/SPARK-11039.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from each other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced:
- **spark.memory.fraction (default 0.75)**: fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.
- **spark.memory.storageFraction (default 0.5)**: size of the storage region within the space set aside by `spark.memory.fraction`. Cached data may only be evicted if total storage exceeds this region.
- **spark.memory.useLegacyMode (default false)**: whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility.
For a detailed description of the design, see [SPARK-10000](https://issues.apache.org/jira/browse/SPARK-10000). This patch builds on top of the `MemoryManager` interface introduced in #9000.
Author: Andrew Or <andrew@databricks.com>
Closes #9084 from andrewor14/unified-memory-manager.
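As a rough worked example of the two fractions above, the following sketch computes the region sizes for a given heap. It is simplified arithmetic under stated assumptions, not the `MemoryManager` implementation: the helper name is hypothetical, and any memory the JVM or Spark reserves outside these fractions is ignored.

```python
def unified_memory_regions(heap_bytes, memory_fraction=0.75, storage_fraction=0.5):
    """Return (unified_bytes, protected_storage_bytes) for a given heap size."""
    # Space shared by execution and storage (spark.memory.fraction).
    unified = int(heap_bytes * memory_fraction)
    # Portion of that space within which cached data is not evicted
    # by execution (spark.memory.storageFraction).
    protected_storage = int(unified * storage_fraction)
    return unified, protected_storage


unified, protected = unified_memory_regions(4 * 1024**3)
print(unified // 1024**2, "MB shared;", protected // 1024**2, "MB protected for storage")
```

Lowering `memory_fraction` shrinks both numbers, which matches the note above that a lower value means more frequent spills and cache eviction.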
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add an application attempt window for Spark on YARN so that old failures outside the window are ignored. This is useful for long-running applications that need to recover from failures.
Author: jerryshao <sshao@hortonworks.com>
Closes #8857 from jerryshao/SPARK-10739 and squashes the following commits:
36eabdc [jerryshao] change the doc
7f9b77d [jerryshao] Style change
1c9afd0 [jerryshao] Address the comments
caca695 [jerryshao] Add application attempt window for Spark on Yarn
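The idea of an attempt window can be illustrated with a small sketch. It is not YARN's or Spark's implementation: the function name, the use of plain timestamps in seconds, and the parameter names are all hypothetical and chosen only to show how failures outside the window stop counting.

```python
def attempts_exhausted(failure_times, now, max_attempts, window_seconds=None):
    """Return True if the application has used up its allowed attempts."""
    if window_seconds is None:
        # No window configured: every past failure counts.
        recent = list(failure_times)
    else:
        # Only failures inside the window count; older ones are ignored.
        recent = [t for t in failure_times if now - t <= window_seconds]
    return len(recent) >= max_attempts
```

With a one-hour window, a long-running application that failed twice days apart would still be allowed another attempt, whereas without a window those two failures could exhaust a limit of two.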
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit improves the documentation around building Spark to
(1) recommend using SBT interactive mode to avoid the overhead of
launching SBT and (2) refer to the wiki page that documents using
SPARK_PREPEND_CLASSES to avoid creating the assembly jar for each
compile.
cc srowen
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes #9068 from kayousterhout/SPARK-11056.
|
|
|
|
|
|
| |
Author: Jean-Baptiste Onofré <jbonofre@apache.org>
Closes #8993 from jbonofre/SPARK-10883-2.
|
|
|
|
|
|
|
|
| |
The 1.4 docs noted that the units were MB; I have assumed this is still the case.
Author: admackin <admackin@users.noreply.github.com>
Closes #9025 from admackin/master.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
spark.mllib
In the Markdown docs for the spark.mllib Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "ChiSqSelector" section in https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md
This JIRA is just for spark.mllib, not spark.ml.
Please let me know if more work is needed, thanks a lot.
Author: Xin Ren <iamshrek@126.com>
Closes #8977 from keypointt/SPARK-10669.
|
|
|
|
|
|
|
|
|
|
|
|
| |
YARN, -master yarn --deploy-mode x vs -master yarn-x'.
Recommend `--master yarn --deploy-mode {cluster,client}` consistently in docs.
Follow-on to https://github.com/apache/spark/pull/8385
CC nssalian
Author: Sean Owen <sowen@cloudera.com>
Closes #8968 from srowen/SPARK-9570.
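To make the recommendation concrete, the sketch below maps the older `yarn-client` / `yarn-cluster` master strings to the flag form recommended above. The helper is hypothetical and only illustrates the naming convention; it is not part of spark-submit.

```python
def recommended_master_args(master):
    """Translate a legacy yarn-client / yarn-cluster master into the recommended flags."""
    legacy = {"yarn-client": "client", "yarn-cluster": "cluster"}
    if master in legacy:
        return ["--master", "yarn", "--deploy-mode", legacy[master]]
    # Anything else is passed through unchanged.
    return ["--master", master]
```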
|
|
|
|
|
|
|
|
|
|
| |
jira: https://issues.apache.org/jira/browse/SPARK-10670
In the Markdown docs for the spark.ml Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "Word2Vec" section in https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/ml-features.md
This JIRA is just for spark.ml, not spark.mllib
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #8901 from hhbyyh/docAPI.
|
|
|
|
|
|
|
|
|
| |
seperate -> separate
sees -> see
Author: David Martin <dmartinpro@users.noreply.github.com>
Closes #8928 from dmartinpro/patch-1.
|
|
|
|
|
|
| |
Author: Bin Wang <wbin00@gmail.com>
Closes #8898 from wb14123/doc.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The Scala example under the "Example: Pipeline" heading in this
document initializes the "test" variable to a DataFrame. Because test
is already a DF, there is no need to call test.toDF as the example
does in a subsequent line: model.transform(test.toDF). So, I removed
the extraneous toDF invocation.
Author: Matt Hagen <anonz3000@gmail.com>
Closes #8875 from hagenhaus/SPARK-10663.
|
|
|
|
|
|
|
|
| |
…on for spark.mesos.constraints parameter.
Author: Akash Mishra <akash.mishra20@gmail.com>
Closes #8816 from SleepyThread/constraint-fix.
|
|
|
|
|
|
| |
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8803 from vanzin/SPARK-10676.
|
|
|
|
|
|
|
|
|
|
| |
* Backticks are processed properly in Spark Properties table
* Removed unnecessary spaces
* See http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
Closes #8795 from jaceklaskowski/docs-yarn-formatting.
|
|
|
|
|
|
|
|
|
|
| |
It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.
This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.
|
|
|
|
|
|
|
|
| |
Submitting this change on the master branch as requested in https://github.com/apache/spark/pull/8819#issuecomment-141505941
Author: Alexis Seigneurin <alexis.seigneurin@gmail.com>
Closes #8838 from aseigneurin/patch-2.
|
|
|
|
|
|
|
|
|
|
|
|
| |
wrong.
In Spark 1.5.0, Spark SQL is compatible with Hive 0.12.0 through 1.2.1 but the documentation is wrong.
/CC yhuai
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #8776 from sarutak/SPARK-10584-2.
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #8812 from rxin/SPARK-9808-1.
|
| |
|
|
|
|
|
|
|
|
| |
This PR simply uses the default value column for defaults.
Author: Felix Bechstein <felix.bechstein@otto.de>
Closes #8810 from felixb/fix_mesos_doc.
|
|
|
|
|
|
|
|
| |
The [published docs for 1.5.0](http://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/streaming/) have a bunch of test classes in them. The only way I can reproduce this is to `test:compile` before running `unidoc`. To prevent this from happening again, I've added a clean before doc generation.
Author: Michael Armbrust <michael@databricks.com>
Closes #8787 from marmbrus/testsInDocs.
|
|
|
|
|
|
|
|
| |
In the Configuration section, the default values of **spark.yarn.driver.memoryOverhead** and **spark.yarn.am.memoryOverhead** should be "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with minimum of 384" respectively, because since Spark 1.4.0 the **MEMORY_OVERHEAD_FACTOR** has been set to 0.10, not 0.07.
Author: yangping.wu <wyphao.2007@163.com>
Closes #8797 from 397090770/SparkOnYarnDocError.
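The corrected default can be written out as a one-line formula. The sketch below assumes the memory values are in MB and that the product is truncated to an integer; both are assumptions for illustration, and the function name is hypothetical.

```python
MEMORY_OVERHEAD_FACTOR = 0.10  # factor named in the message
MEMORY_OVERHEAD_MIN_MB = 384   # minimum named in the message (assumed to be MB)


def default_memory_overhead_mb(memory_mb):
    """Default overhead: 10% of the driver or AM memory, but at least 384."""
    return max(int(MEMORY_OVERHEAD_FACTOR * memory_mb), MEMORY_OVERHEAD_MIN_MB)
```

For small containers the 384 minimum dominates; the 10% term only takes over once the configured memory exceeds 3840.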
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Various ML guide cleanups.
* ml-guide.md: Make it easier to access the algorithm-specific guides.
* LDA user guide: EM often begins with useless topics, but running longer generally improves them dramatically. E.g., 10 iterations on a Wikipedia dataset produces useless topics, but 50 iterations produces very meaningful topics.
* mllib-feature-extraction.html#elementwiseproduct: “w” parameter should be “scalingVec”
* Clean up Binarizer user guide a little.
* Document in Pipeline that users should not put an instance into the Pipeline in more than 1 place.
* spark.ml Word2Vec user guide: clean up grammar/writing
* Chi Sq Feature Selector docs: Improve text in doc.
CC: mengxr feynmanliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8752 from jkbradley/mlguide-fixes-1.5.
|
|
|
|
|
|
|
|
|
| |
* a follow-up to 16b6d18613e150c7038c613992d80a7828413e66 as the `--num-executors` flag is not supported.
* links + formatting
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
Closes #8762 from jaceklaskowski/docs-spark-on-yarn.
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #8350 from rxin/1.6.
|
|
|
|
|
|
|
|
| |
Links now work properly + consistent use of *Spark standalone cluster* (Spark uppercase + lowercase the rest, which seems to be the agreed form elsewhere in the docs).
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
Closes #8759 from jaceklaskowski/docs-submitting-apps.
|
|
|
|
|
|
|
|
|
|
|
| |
spark.sql.hive.metastore.version is wrong.
The default value of hive metastore version is 1.2.1 but the documentation says the value of `spark.sql.hive.metastore.version` is 0.13.1.
Also, we cannot get the default value by `sqlContext.getConf("spark.sql.hive.metastore.version")`.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #8739 from sarutak/SPARK-10584.
|
|
|
|
|
|
|
|
| |
Finish deprecating Bagel; remove reference to nonexistent example
Author: Sean Owen <sowen@cloudera.com>
Closes #8731 from srowen/SPARK-10222.
|
|
|
|
|
|
|
|
|
|
| |
LIBSVM data source instead of MLUtils
I changed the example code in spark.ml to use the LIBSVM data source instead of MLUtils.
Author: y-shimizu <y.shimizu0429@gmail.com>
Closes #8697 from y-shimizu/SPARK-10518.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
implementing the sufficientResourcesRegistered method
spark.scheduler.minRegisteredResourcesRatio configuration parameter works for YARN mode but not for Mesos Coarse grained mode.
If the parameter is not specified, a default value of 0 is set for spark.scheduler.minRegisteredResourcesRatio in the base class and this method always returns true.
There is no existing test for YARN mode either, so no test was added for this.
Author: Akash Mishra <akash.mishra20@gmail.com>
Closes #8672 from SleepyThread/master.
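The check described in this message can be sketched as a single comparison. This is an illustration under assumptions, not the Mesos backend's code: treating cores as the registered resource, and the parameter names, are hypothetical.

```python
def sufficient_resources_registered(registered_cores, max_cores, min_registered_ratio=0.0):
    """Return True once enough of the requested resources have registered."""
    # With the default ratio of 0 the threshold is 0, so this is always True.
    return registered_cores >= max_cores * min_registered_ratio
```

Setting the ratio to 0.8, for instance, would make scheduling wait until 80% of the requested cores have registered.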
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
From JIRA:
Add documentation for tungsten-sort.
From the mailing list "I saw a new "spark.shuffle.manager=tungsten-sort" implemented in
https://issues.apache.org/jira/browse/SPARK-7081, but it can't be found its
corresponding description in
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/configuration.html (Currently
there are only 'sort' and 'hash' two options)."
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8638 from holdenk/SPARK-10469-document-tungsten-sort.
|
|
|
|
|
|
|
|
|
|
| |
0.0 (original: 1.0)
Small typo in the example for `LabeledPoint` in the MLlib docs.
Author: Sean Paradiso <seanparadiso@gmail.com>
Closes #8680 from sparadiso/docs_mllib_smalltypo.
|
|
|
|
|
|
|
|
|
|
| |
jira: https://issues.apache.org/jira/browse/SPARK-10249
Update the user guide now that Python support has been added.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #8620 from hhbyyh/swPyDocExample.
|
|
|
|
|
|
|
|
|
|
| |
about rate limiting and backpressure
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8656 from tdas/SPARK-10492 and squashes the following commits:
986cdd6 [Tathagata Das] Added information on backpressure
|
|
|
|
|
|
| |
Author: Jacek Laskowski <jacek@japila.pl>
Closes #8629 from jaceklaskowski/docs-fixes.
|
|
|
|
|
|
|
|
| |
… main README.
Author: Stephen Hopper <shopper@shopper-osx.local>
Closes #8646 from enragedginger/master.
|
|
|
|
|
|
|
|
| |
We introduced the Netty network module for shuffle in Spark 1.2 and have turned it on by default for 3 releases. The old ConnectionManager is difficult to maintain. If we merge the patch now, by the time it is released, ConnectionManager will have been off by default for 1 year. It's time to remove it.
Author: Reynold Xin <rxin@databricks.com>
Closes #8161 from rxin/SPARK-9767.
|
|
|
|
|
|
|
|
|
|
|
| |
guides and python docs
- Fixed information around Python API tags in streaming programming guides
- Added missing stuff in python docs
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8595 from tdas/SPARK-10440.
|
|
|
|
|
|
|
|
|
| |
Support running pyspark with cluster mode on Mesos!
This doesn't upload any scripts, so running on a remote Mesos cluster requires the user to specify the script from an available URI.
Author: Timothy Chen <tnachen@gmail.com>
Closes #8349 from tnachen/mesos_python.
|