| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adjust the default values of decision tree, based on the memory requirement discussed in https://github.com/apache/spark/pull/2125 :
1. maxMemoryInMB: 128 -> 256
2. maxBins: 100 -> 32
3. maxDepth: 4 -> 5 (in some example code)
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #2322 from mengxr/tree-defaults and squashes the following commits:
cda453a [Xiangrui Meng] fix tests
5900445 [Xiangrui Meng] update comments
8c81831 [Xiangrui Meng] update default values of tree:
|
|
|
|
|
|
|
|
| |
Author: Henry Cook <hcook@eecs.berkeley.edu>
Closes #2316 from hcook/sql-docs and squashes the following commits:
373f94b [Henry Cook] Minor edits to sql programming guide.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
See compiled doc at
http://people.apache.org/~rxin/tmp/openstack-swift/_site/storage-openstack-swift.html
This is based on #1010. Closes #1010.
Author: Reynold Xin <rxin@apache.org>
Author: Gil Vernik <gilv@il.ibm.com>
Closes #2298 from rxin/openstack-swift and squashes the following commits:
ff4e394 [Reynold Xin] Two minor comments from Patrick.
279f6de [Reynold Xin] core-sites -> core-site
dfb8fea [Reynold Xin] Updated based on Gil's suggestion.
846f5cb [Reynold Xin] Added a link from overview page.
0447c9f [Reynold Xin] Removed sample code.
e9c3761 [Reynold Xin] Merge pull request #1010 from gilv/master
9233fef [Gil Vernik] Fixed typos
6994827 [Gil Vernik] Merge pull request #1 from rxin/openstack
ac0679e [Reynold Xin] Fixed an unclosed tr.
47ce99d [Reynold Xin] Merge branch 'master' into openstack
cca7192 [Gil Vernik] Removed white spases from pom.xml
99f095d [Reynold Xin] Pending openstack changes.
eb22295 [Reynold Xin] Merge pull request #1010 from gilv/master
39a9737 [Gil Vernik] Spark integration with Openstack Swift
c977658 [Gil Vernik] Merge branch 'master' of https://github.com/gilv/spark
2aba763 [Gil Vernik] Fix to docs/openstack-integration.md
9b625b5 [Gil Vernik] Merge branch 'master' of https://github.com/gilv/spark
eff538d [Gil Vernik] SPARK-938 - Openstack Swift object storage support
ce483d7 [Gil Vernik] SPARK-938 - Openstack Swift object storage support
b6c37ef [Gil Vernik] Openstack Swift support
|
|
|
|
|
|
|
|
|
|
|
|
| |
Sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing.
Author: Reynold Xin <rxin@apache.org>
Closes #2178 from rxin/sort-shuffle and squashes the following commits:
713d341 [Reynold Xin] Fixed test failures by setting spark.shuffle.compress to the same value as spark.shuffle.spill.compress.
85165e6 [Reynold Xin] Fixed a comment typo.
aa0d372 [Reynold Xin] [SPARK-3280] Made sort-based shuffle the default implementation
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #2258 from marmbrus/sqlDocUpdate and squashes the following commits:
f3d450b [Michael Armbrust] fix brackets
bea3bfa [Michael Armbrust] Davies suggestions
3a29fe2 [Michael Armbrust] tighten visibility
a71aa36 [Michael Armbrust] Draft of doc updates
52932c0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into sqlDocUpdate
1e8c849 [Yin Huai] Update the example used for applySchema.
9457c39 [Yin Huai] Update doc.
31ba240 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeDoc
29bc668 [Yin Huai] Draft doc for data type and schema APIs.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Improvements to the kinesis integration guide from @cfregly
- More information about unified input dstreams in main guide
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Chris Fregly <chris@fregly.com>
Closes #2307 from tdas/streaming-doc-fix1 and squashes the following commits:
ec40b5d [Tathagata Das] Updated figure with kinesis
fdb9c5e [Tathagata Das] Fixed style issues with kinesis guide
036d219 [Chris Fregly] updated kinesis docs and added an arch diagram
24f622a [Tathagata Das] More modifications.
|
|
|
|
|
|
|
|
|
|
| |
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #2251 from sarutak/SPARK-3378 and squashes the following commits:
0bfe234 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3378
bb5938f [Kousuke Saruta] Replaced rest of "SparkSQL" with "Spark SQL"
6df66de [Kousuke Saruta] Replaced "SparkSQL" with "Spark SQL"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Updated the main streaming programming guide, and also added source-specific guides for Kafka, Flume, Kinesis.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Jacek Laskowski <jacek@japila.pl>
Closes #2254 from tdas/streaming-doc-fix and squashes the following commits:
e45c6d7 [Jacek Laskowski] More fixes from an old PR
5125316 [Tathagata Das] Fixed links
dc02f26 [Tathagata Das] Refactored streaming kinesis guide and made many other changes.
acbc3e3 [Tathagata Das] Fixed links between streaming guides.
cb7007f [Tathagata Das] Added Streaming + Flume integration guide.
9bd9407 [Tathagata Das] Updated streaming programming guide with additional information from SPARK-2419.
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Include kinesis in the unidocs
- Hide non-public classes from docs
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #2239 from tdas/kinesis-doc-fix and squashes the following commits:
156e20c [Tathagata Das] More fixes, based on PR comments.
e9a6c01 [Tathagata Das] Fixed docs related to kinesis
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As [reported on the dev list](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC2-tp8107p8131.html):
* Code fencing with triple-backticks doesn’t seem to work like it does on GitHub. Newlines are lost. Instead, use 4-space indent to format small code blocks.
* Nested bullets need 2 leading spaces, not 1.
* Spellcheck!
Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Author: nchammas <nicholas.chammas@gmail.com>
Closes #2201 from nchammas/sql-doc-fixes and squashes the following commits:
873f889 [Nicholas Chammas] [Docs] fix skip-api flag
5195e0c [Nicholas Chammas] [Docs] SQL doc formatting and typo fixes
3b26c8d [nchammas] [Spark QA] Link to console output on test time out
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The executors and the driver may not share the same Spark home. There is currently one way to set the executor side Spark home in Mesos, through setting `spark.home`. However, this is neither documented nor intuitive. This PR adds a more specific config `spark.mesos.executor.home` and exposes this to the user.
liancheng tnachen
Author: Andrew Or <andrewor14@gmail.com>
Closes #2166 from andrewor14/mesos-spark-home and squashes the following commits:
b87965e [Andrew Or] Merge branch 'master' of github.com:apache/spark into mesos-spark-home
f6abb2e [Andrew Or] Document spark.mesos.executor.home
ca7846d [Andrew Or] Add more specific configuration for executor Spark home in Mesos
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The only updates are in DecisionTree.
CC: mengxr
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Closes #2146 from jkbradley/mllib-migration and squashes the following commits:
5a1f487 [Joseph K. Bradley] small edit to doc
411d6d9 [Joseph K. Bradley] Added migration guide for v1.0 to v1.1. The only updates are in DecisionTree.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. renamed mllib-basics to mllib-data-types
1. renamed mllib-stats to mllib-statistics
1. moved random data generation to the bottom of mllib-stats
1. updated toc accordingly
atalwalkar
Author: Xiangrui Meng <meng@databricks.com>
Closes #2151 from mengxr/mllib-doc-1.1 and squashes the following commits:
0bd79f3 [Xiangrui Meng] add mllib-data-types
b64a5d7 [Xiangrui Meng] update the content list of basis statistics in mllib-guide
f625cc2 [Xiangrui Meng] move mllib-basics to mllib-data-types
4d69250 [Xiangrui Meng] move random data generation to the bottom of statistics
e64f3ce [Xiangrui Meng] move mllib-stats.md to mllib-statistics.md
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When using Mesos with the fine-grained mode, a Spark job can run into a dead lock on low allocatable memory on Mesos slaves. As a work-around 32 MB (= Mesos MIN_MEM) are allocated for each task, to ensure Mesos making new offers after task completion.
From my perspective, it would be better to fix this problem in Mesos by dropping the constraint on memory for offers, but as temporary solution this patch helps to avoid the dead lock on current Mesos versions.
See [[MESOS-1688] No offers if no memory is allocatable](https://issues.apache.org/jira/browse/MESOS-1688) for details for this problem.
Author: Martin Weindel <martin.weindel@gmail.com>
Closes #1860 from MartinWeindel/master and squashes the following commits:
5762030 [Martin Weindel] reverting work-around
a6bf837 [Martin Weindel] added known issue for issue MESOS-1688
d9d2ca6 [Martin Weindel] work around for problem with Mesos offering semantic (see [https://issues.apache.org/jira/browse/MESOS-1688])
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Documentation updated for the Statistics Toolkit of MLlib. mengxr atalwalkar
https://issues.apache.org/jira/browse/SPARK-2839
P.S. Accidentally closed #2123. New commits didn't show up after I reopened the PR. I've opened this instead and closed the old one.
Author: Burak <brkyvz@gmail.com>
Closes #2130 from brkyvz/StatsLib-Docs and squashes the following commits:
a54a855 [Burak] [SPARK-2839][MLlib] Addressed comments
bfc6896 [Burak] [SPARK-2839][MLlib] Added a more specific link to colStats() for pyspark
213fe3f [Burak] [SPARK-2839][MLlib] Modifications made according to review
fec4d9d [Burak] [SPARK-2830][MLlib] Stats Toolkit documentation updated
|
|
|
|
|
|
|
|
|
|
| |
to mention `-Pnetlib-lgpl` option. atalwalkar
Author: Xiangrui Meng <meng@databricks.com>
Closes #2128 from mengxr/mllib-native and squashes the following commits:
4cbba57 [Xiangrui Meng] update mllib dependencies
|
|
|
|
|
|
|
|
|
|
| |
It should be `spark-env.sh` rather than `spark.env.sh`.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #2119 from liancheng/fix-mesos-doc and squashes the following commits:
f360548 [Cheng Lian] Fixed a typo in docs/running-on-mesos.md
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Update the documentation to reflect the fact we can handle roughly square matrices.
Author: Reza Zadeh <rizlar@gmail.com>
Closes #2070 from rezazadeh/svddocs and squashes the following commits:
826b8fe [Reza Zadeh] left singular vectors
3f34fc6 [Reza Zadeh] PCA is still TS
7ffa2aa [Reza Zadeh] better title
aeaf39d [Reza Zadeh] More docs
788ed13 [Reza Zadeh] add computational cost explanation
6429c59 [Reza Zadeh] Add link to rowmatrix docs
1eeab8b [Reza Zadeh] Update SVD documentation to reflect roughly square
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Documentation for newly added feature transformations:
1. TF-IDF
2. StandardScaler
3. Normalizer
Author: DB Tsai <dbtsai@alpinenow.com>
Closes #2068 from dbtsai/transformer-documentation and squashes the following commits:
109f324 [DB Tsai] address feedback
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
and Thrift JDBC server is absent in proper document -
The most important things I mentioned in #1885 is as follows.
* People who build Spark is not always programmer.
* If a person who build Spark is not a programmer, he/she won't read programmer's guide before building.
So, how to build for using CLI and JDBC server is not only in programmer's guide.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #2080 from sarutak/SPARK-2963 and squashes the following commits:
ee07c76 [Kousuke Saruta] Modified regression of the description about building for using Thrift JDBC server and CLI
ed53329 [Kousuke Saruta] Modified description and notaton of proper noun
07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md
6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963
c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Updated DecisionTree documentation, with examples for Java, Python.
Added same Java example to code as well.
CC: @mengxr @manishamde @atalwalkar
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Closes #2063 from jkbradley/dt-docs and squashes the following commits:
2dd2c19 [Joseph K. Bradley] Last updates based on github review.
9dd1b6b [Joseph K. Bradley] Updated decision tree doc.
d802369 [Joseph K. Bradley] Updates based on comments: cache data, corrected doc text.
b9bee04 [Joseph K. Bradley] Updated DT examples
57eee9f [Joseph K. Bradley] Created JavaDecisionTree example from example in docs, and corrected doc example as needed.
d939a92 [Joseph K. Bradley] Updated DecisionTree documentation. Added Java, Python examples.
|
|
|
|
|
|
|
|
|
|
|
|
| |
atalwalkar srowen
Author: Xiangrui Meng <meng@databricks.com>
Closes #2064 from mengxr/als-doc and squashes the following commits:
b2e20ab [Xiangrui Meng] introduced -> discussed
98abdd7 [Xiangrui Meng] add reference
339bd08 [Xiangrui Meng] add a section about regularization parameter in ALS
|
|
|
|
|
|
|
|
|
|
|
| |
Moved TF-IDF before Word2Vec because the former is more basic. I also added a link for Word2Vec. atalwalkar
Author: Xiangrui Meng <meng@databricks.com>
Closes #2061 from mengxr/tfidf-doc and squashes the following commits:
ca04c70 [Xiangrui Meng] address comments
a5ea4b4 [Xiangrui Meng] add tf-idf user guide
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently we have a separate profile called hive-thriftserver. I originally suggested this in case users did not want to bundle the thriftserver, but it's ultimately lead to a lot of confusion. Since the thriftserver is only a few classes, I don't see a really good reason to isolate it from the rest of Hive. So let's go ahead and just include it in the same profile to simplify things.
This has been suggested in the past by liancheng.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #2006 from pwendell/hiveserver and squashes the following commits:
742ea40 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into hiveserver
034ad47 [Patrick Wendell] SPARK-3092: Always include the thriftserver when -Phive is enabled.
|
|
|
|
|
|
|
|
| |
Author: Ken Takagiwa <ugw.gi.world@gmail.com>
Closes #2042 from giwa/patch-1 and squashes the following commits:
216fe0e [Ken Takagiwa] Fixed wrong links
|
|
|
|
|
|
|
|
|
|
|
| |
because NB treats feature values as term frequencies. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #2038 from mengxr/nb-neg and squashes the following commits:
52c37c3 [Xiangrui Meng] address comments
65f892d [Xiangrui Meng] detect negative values in nb
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added a documentation section on StreamingLR to the ``MLlib - Linear Methods``, including a worked example.
mengxr tdas
Author: freeman <the.freeman.lab@gmail.com>
Closes #2047 from freeman-lab/streaming-lr-docs and squashes the following commits:
568d250 [freeman] Tweaks to wording / formatting
05a1139 [freeman] Added documentation and example for StreamingLR
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Though we don't use default argument for methods in RandomRDDs, it is still not easy for Java users to use because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users should expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in Scala methods in RandomRDDs, to make life easier for both Java and Scala users. This PR also contains documentation for random data generation. brkyvz
Author: Xiangrui Meng <meng@databricks.com>
Closes #2041 from mengxr/stat-doc and squashes the following commits:
fc5eedf [Xiangrui Meng] add missing comma
ffde810 [Xiangrui Meng] address comments
aef6d07 [Xiangrui Meng] add doc for random data generation
b99d94b [Xiangrui Meng] add java-friendly methods to RandomRDDs
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Uses the name tag to identify machines in a cluster.
- Allows overriding the security group name so it doesn't need to coincide with the cluster name.
- Outputs the request id's of up to 10 pending spot instance requests.
Author: Vida Ha <vida@databricks.com>
Closes #1899 from vidaha/vida/ec2-reuse-security-group and squashes the following commits:
c80d5c3 [Vida Ha] wrap retries in a try catch block
b2989d5 [Vida Ha] SPARK-2333: spark_ec2 script should allow option for existing security group
|
|
|
|
|
|
|
|
|
|
| |
Candidate splits were inconsistent with the example.
Author: Matt Forbes <matt@tellapart.com>
Closes #1837 from emef/tree-doc and squashes the following commits:
3be14a1 [Matt Forbes] Fix typo in decision tree docs
|
|
|
|
|
|
|
|
|
|
|
| |
This definitely needs review as I am not familiar with this part of Spark.
I tested this locally and it did seem to work.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #1937 from pwendell/scheduler and squashes the following commits:
b858e33 [Patrick Wendell] SPARK-3025: Allow JDBC clients to set a fair scheduler pool
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
mengxr
Documentation for Word2Vec
Author: Liquan Pei <liquanpei@gmail.com>
Closes #2003 from Ishiihara/Word2Vec-doc and squashes the following commits:
4ff11d4 [Liquan Pei] minor fix
8d7458f [Liquan Pei] code reformat
6df0dcb [Liquan Pei] add Word2Vec documentation
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
fixed markup, separated out sections more-clearly, more thorough explanations
Author: Chris Fregly <chris@fregly.com>
Closes #1757 from cfregly/master and squashes the following commits:
9b1c71a [Chris Fregly] better explained why spark checkpoints are disabled in the example (due to no stateful operations being used)
0f37061 [Chris Fregly] SPARK-1981: (Kinesis streaming support) updated streaming-kinesis.md
862df67 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
8e1ae2e [Chris Fregly] Merge remote-tracking branch 'upstream/master'
4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
74e5c7c [Chris Fregly] updated per TD's feedback. simplified examples, updated docs
e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
bf614e9 [Chris Fregly] per matei's feedback: moved the kinesis examples into the examples/ dir
d17ca6d [Chris Fregly] per TD's feedback: updated docs, simplified the KinesisUtils api
912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class
db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
338997e [Chris Fregly] improve build docs for kinesis
828f8ae [Chris Fregly] more cleanup
e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
cd68c0d [Chris Fregly] fixed typos and backward compatibility
d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #1632 from sarutak/SPARK-2677 and squashes the following commits:
cddbc7b [Kousuke Saruta] Removed Exception throwing when ConnectionManager#handleMessage receives ack for non-referenced message
d3bd2a8 [Kousuke Saruta] Modified configuration.md for spark.core.connection.ack.timeout
e85f88b [Kousuke Saruta] Removed useless synchronized blocks
7ed48be [Kousuke Saruta] Modified ConnectionManager to use ackTimeoutMonitor ConnectionManager-wide
9b620a6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
0dd9ad3 [Kousuke Saruta] Modified typo in ConnectionManagerSuite.scala
7cbb8ca [Kousuke Saruta] Modified to match with scalastyle
8a73974 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
ade279a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
0174d6a [Kousuke Saruta] Modified ConnectionManager.scala to handle the case remote Executor cannot ack
a454239 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
9b7b7c1 [Kousuke Saruta] (WIP) Modifying ConnectionManager.scala
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst case scenarios occur if the RDD is cached (guaranteed to load whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead.
Additionally, jobs that run locally in this manner do not show up in the web UI, and are thus harder to track or understand what is occurring.
This PR adds a flag to disable local execution, which is turned OFF by default, with the intention of perhaps eventually removing this functionality altogether. Removing it now is a tougher proposition since it is part of the public runJob API. An alternative solution would be to limit the flag to take()/first() to avoid impacting any external users of this API, but such usage (or, at least, reliance upon the feature) is hopefully minimal.
Author: Aaron Davidson <aaron@databricks.com>
Closes #1321 from aarondav/allowlocal and squashes the following commits:
136b253 [Aaron Davidson] Fix DAGSchedulerSuite
5599d55 [Aaron Davidson] [RFC] Disable local execution of Spark jobs by default
|
|
|
|
|
|
|
|
|
|
| |
These configs looked inconsistent from the rest.
Author: Andrew Or <andrewor14@gmail.com>
Closes #1936 from andrewor14/docs-code and squashes the following commits:
15f578a [Andrew Or] Add <code> tag
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
and CLI for SparkSQL
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #1885 from sarutak/SPARK-2963 and squashes the following commits:
ed53329 [Kousuke Saruta] Modified description and notaton of proper noun
07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md
6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963
c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just accepts "lz4", "lzf", "snappy".
Author: Reynold Xin <rxin@apache.org>
Closes #1873 from rxin/compressionCodecShortForm and squashes the following commits:
9f50962 [Reynold Xin] Specify short-form compression codec names first.
63f78ee [Reynold Xin] Updated configuration documentation.
47b3848 [Reynold Xin] [SPARK-2953] Allow using short names for io compression codecs
|
|
|
|
|
|
|
|
|
|
|
|
| |
As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.
Author: Ameet Talwalkar <atalwalkar@gmail.com>
Closes #1908 from atalwalkar/master and squashes the following commits:
fe6938a [Ameet Talwalkar] made xiangruis suggested changes
840028b [Ameet Talwalkar] made xiangruis suggested changes
7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In SPARK-1946(PR #900), configuration <code>spark.scheduler.minRegisteredExecutorsRatio</code> was introduced. However, in standalone mode, there is a race condition where isReady() can return true because totalExpectedExecutors has not been correctly set.
Because expected executors is uncertain in standalone mode, the PR try to use CPU cores(<code>--total-executor-cores</code>) as expected resources to judge whether SchedulerBackend is ready.
Author: li-zhihui <zhihui.li@intel.com>
Author: Li Zhihui <zhihui.li@intel.com>
Closes #1525 from li-zhihui/fixre4s and squashes the following commits:
e9a630b [Li Zhihui] Rename variable totalExecutors and clean codes
abf4860 [Li Zhihui] Push down variable totalExpectedResources to children classes
ca54bd9 [li-zhihui] Format log with String interpolation
88c7dc6 [li-zhihui] Few codes and docs refactor
41cf47e [li-zhihui] Fix race condition at SchedulerBackend.isReady in standalone mode
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
sorting/aggregation and # partitions is small
As described in https://issues.apache.org/jira/browse/SPARK-2787, right now sort-based shuffle is more expensive than hash-based for map operations that do no partial aggregation or sorting, such as groupByKey. This is because it has to serialize each data item twice (once when spilling to intermediate files, and then again when merging these files object-by-object). This patch adds a code path to just write separate files directly if the # of output partitions is small, and concatenate them at the end to produce a sorted file.
On the unit test side, I added some tests that force or don't force this bypass path to be used, and checked that our tests for other features (e.g. all the operations) cover both cases.
Author: Matei Zaharia <matei@databricks.com>
Closes #1799 from mateiz/SPARK-2787 and squashes the following commits:
88cf26a [Matei Zaharia] Fix rebase
10233af [Matei Zaharia] Review comments
398cb95 [Matei Zaharia] Fix looking up shuffle manager in conf
ca3efd9 [Matei Zaharia] Add docs for shuffle manager properties, and allow short names for them
d0ae3c5 [Matei Zaharia] Fix some comments
90d084f [Matei Zaharia] Add code path to bypass merge-sort in ExternalSorter, and tests
31e5d7c [Matei Zaharia] Move existing logic for writing partitioned files into ExternalSorter
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The goal of this PR is to allow users of Spark to write tight firewall rules for their clusters. This is currently not possible because Spark uses random ports in many places, notably the communication between executors and drivers. The changes in this PR are based on top of ash211's changes in #1107.
The list covered here may or may not be the complete set of port needed for Spark to operate perfectly. However, as of the latest commit there are no known sources of random ports (except in tests). I have not documented a few of the more obscure configs.
My spark-env.sh looks like this:
```
export SPARK_MASTER_PORT=6060
export SPARK_WORKER_PORT=7070
export SPARK_MASTER_WEBUI_PORT=9090
export SPARK_WORKER_WEBUI_PORT=9091
```
and my spark-defaults.conf looks like this:
```
spark.master spark://andrews-mbp:6060
spark.driver.port 5001
spark.fileserver.port 5011
spark.broadcast.port 5021
spark.replClassServer.port 5031
spark.blockManager.port 5041
spark.executor.port 5051
```
Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Ash <andrew@andrewash.com>
Closes #1777 from andrewor14/configure-ports and squashes the following commits:
621267b [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
8a6b820 [Andrew Or] Use a random UI port during tests
7da0493 [Andrew Or] Fix tests
523c30e [Andrew Or] Add test for isBindCollision
b97b02a [Andrew Or] Minor fixes
c22ad00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
93d359f [Andrew Or] Executors connect to wrong port when collision occurs
d502e5f [Andrew Or] Handle port collisions when creating Akka systems
a2dd05c [Andrew Or] Patrick's comment nit
86461e2 [Andrew Or] Remove spark.executor.env.port and spark.standalone.client.port
1d2d5c6 [Andrew Or] Fix ports for standalone cluster mode
cb3be88 [Andrew Or] Various doc fixes (broken link, format etc.)
e837cde [Andrew Or] Remove outdated TODOs
bfbab28 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
de1b207 [Andrew Or] Update docs to reflect new ports
b565079 [Andrew Or] Add spark.ports.maxRetries
2551eb2 [Andrew Or] Remove spark.worker.watcher.port
151327a [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
9868358 [Andrew Or] Add a few miscellaneous ports
6016e77 [Andrew Or] Add spark.executor.port
8d836e6 [Andrew Or] Also document SPARK_{MASTER/WORKER}_WEBUI_PORT
4d9e6f3 [Andrew Or] Fix super subtle bug
3f8e51b [Andrew Or] Correct erroneous docs...
e111d08 [Andrew Or] Add names for UI services
470f38c [Andrew Or] Special case non-"Address already in use" exceptions
1d7e408 [Andrew Or] Treat 0 ports specially + return correct ConnectionManager port
ba32280 [Andrew Or] Minor fixes
6b550b0 [Andrew Or] Assorted fixes
73fbe89 [Andrew Or] Move start service logic to Utils
ec676f4 [Andrew Or] Merge branch 'SPARK-2157' of github.com:ash211/spark into configure-ports
038a579 [Andrew Ash] Trust the server start function to report the port the service started on
7c5bdc4 [Andrew Ash] Fix style issue
0347aef [Andrew Ash] Unify port fallback logic to a single place
24a4c32 [Andrew Ash] Remove type on val to match surrounding style
9e4ad96 [Andrew Ash] Reformat for style checker
5d84e0e [Andrew Ash] Document new port configuration options
066dc7a [Andrew Ash] Fix up HttpServer port increments
cad16da [Andrew Ash] Add fallover increment logic for HttpServer
c5a0568 [Andrew Ash] Fix ConnectionManager to retry with increment
b80d2fd [Andrew Ash] Make Spark's block manager port configurable
17c79bb [Andrew Ash] Add a configuration option for spark-shell's class server
f34115d [Andrew Ash] SPARK-1176 Add port configuration for HttpBroadcast
49ee29b [Andrew Ash] SPARK-1174 Add port configuration for HttpFileServer
1c0981a [Andrew Ash] Make port in HttpServer configurable
|
|
|
|
|
|
|
|
|
|
| |
This can substantially reduce memory usage during shuffle.
Author: Reynold Xin <rxin@apache.org>
Closes #1781 from rxin/SPARK-2503-spark.shuffle.file.buffer.kb and squashes the following commits:
104b8d8 [Reynold Xin] [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Note that this also documents spark.executorEnv.* which to me means its public. If we don't want that please speak up.
Author: Thomas Graves <tgraves@apache.org>
Closes #1512 from tgravescs/SPARK-1680 and squashes the following commits:
11525df [Thomas Graves] more doc changes
553bad0 [Thomas Graves] fix documentation
152bf7c [Thomas Graves] fix docs
5382326 [Thomas Graves] try fix docs
32f86a4 [Thomas Graves] use configs for specifying environment variables on YARN
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds support for giving accumulators user-visible names and displaying accumulator values in the web UI. This allows users to create custom counters that can display in the UI. The current approach displays both the accumulator deltas caused by each task and a "current" value of the accumulator totals for each stage, which gets update as tasks finish.
Currently in Spark developers have been extending the `TaskMetrics` functionality to provide custom instrumentation for RDD's. This provides a potentially nicer alternative of going through the existing accumulator framework (actually `TaskMetrics` and accumulators are on an awkward collision course as we add more features to the former). The current patch demo's how we can use the feature to provide instrumentation for RDD input sizes. The nice thing about going through accumulators is that users can read the current value of the data being tracked in their programs. This could be useful to e.g. decide to short-circuit a Spark stage depending on how things are going.
![counters](https://cloud.githubusercontent.com/assets/320616/3488815/6ee7bc34-0505-11e4-84ce-e36d9886e2cf.png)
Author: Patrick Wendell <pwendell@gmail.com>
Closes #1309 from pwendell/metrics and squashes the following commits:
8815308 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into HEAD
93fbe0f [Patrick Wendell] Other minor fixes
cc43f68 [Patrick Wendell] Updating unit tests
c991b1b [Patrick Wendell] Moving some code into the Accumulators class
9a9ba3c [Patrick Wendell] More merge fixes
c5ace9e [Patrick Wendell] More merge conflicts
1da15e3 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into metrics
9860c55 [Patrick Wendell] Potential solution to posting listener events
0bb0e33 [Patrick Wendell] Remove "display" variable and assume display = name.isDefined
0ec4ac7 [Patrick Wendell] Java API's
e95bf69 [Patrick Wendell] Stash
be97261 [Patrick Wendell] Style fix
8407308 [Patrick Wendell] Removing examples in Hadoop and RDD class
64d405f [Patrick Wendell] Adding missing file
5d8b156 [Patrick Wendell] Changes based on Kay's review.
9f18bad [Patrick Wendell] Minor style changes and tests
7a63abc [Patrick Wendell] Adding Json serialization and responding to Reynold's feedback
ad85076 [Patrick Wendell] Example of using named accumulators for custom RDD metrics.
0b72660 [Patrick Wendell] Initial WIP example of supporing globally named accumulators.
|
|
|
|
|
|
|
|
|
|
|
|
| |
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-2859
Kryo project has been migrated from googlecode to github, hence we need to update its URL in related docs such as tuning.md.
Author: Guancheng (G.C.) Chen <chenguancheng@gmail.com>
Closes #1782 from gchen/kryo-docs and squashes the following commits:
b62543c [Guancheng (G.C.) Chen] update url of Kryo project
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It was easier to combine these 2 jira since they touch many of the same places. This pr adds the following:
- adds modify acls
- adds admin acls (list of admins/users that get added to both view and modify acls)
- modify Kill button on UI to take modify acls into account
- changes config name of spark.ui.acls.enable to spark.acls.enable since I choose poorly in original name. We keep backwards compatibility so people can still use spark.ui.acls.enable. The acls should apply to any web ui as well as any CLI interfaces.
- send view and modify acls information on to YARN so that YARN interfaces can use (yarn cli for killing applications for example).
Author: Thomas Graves <tgraves@apache.org>
Closes #1196 from tgravescs/SPARK-1890 and squashes the following commits:
8292eb1 [Thomas Graves] review comments
b92ec89 [Thomas Graves] remove unneeded variable from applistener
4c765f4 [Thomas Graves] Add in admin acls
72eb0ac [Thomas Graves] Add modify acls
|
|
|
|
|
|
|
|
|
|
|
| |
Add a config (spark.yarn.access.namenodes) to allow applications running on yarn to access other secure HDFS cluster. User just specifies the namenodes of the other clusters and we get Tokens for those and ship them with the spark application.
Author: Thomas Graves <tgraves@apache.org>
Closes #1159 from tgravescs/spark-1528 and squashes the following commits:
ddbcd16 [Thomas Graves] review comments
0ac8501 [Thomas Graves] SPARK-1528 - add support for accessing remote HDFS
|
|
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@apache.org>
Closes #1780 from rxin/kryo-init-size and squashes the following commits:
551b935 [Reynold Xin] [SPARK-2856] Decrease initial buffer size for Kryo to 64KB.
|