path: root/docs
Commit message | Author | Age | Files | Lines
* [Docs] SQL doc formatting and typo fixes | Nicholas Chammas | 2014-08-29 | 2 | -59/+52
  As [reported on the dev list](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC2-tp8107p8131.html):

  * Code fencing with triple-backticks doesn’t seem to work like it does on GitHub. Newlines are lost. Instead, use 4-space indent to format small code blocks.
  * Nested bullets need 2 leading spaces, not 1.
  * Spellcheck!

  Author: Nicholas Chammas <nicholas.chammas@gmail.com>
  Author: nchammas <nicholas.chammas@gmail.com>

  Closes #2201 from nchammas/sql-doc-fixes and squashes the following commits:

  873f889 [Nicholas Chammas] [Docs] fix skip-api flag
  5195e0c [Nicholas Chammas] [Docs] SQL doc formatting and typo fixes
  3b26c8d [nchammas] [Spark QA] Link to console output on test time out

  (cherry picked from commit 53aa8316e88980c6f46d3b9fc90d935a4738a370)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-3264] Allow users to set executor Spark home in Mesos | Andrew Or | 2014-08-28 | 1 | -0/+10
  The executors and the driver may not share the same Spark home. There is currently one way to set the executor side Spark home in Mesos, through setting `spark.home`. However, this is neither documented nor intuitive. This PR adds a more specific config `spark.mesos.executor.home` and exposes this to the user.

  liancheng tnachen

  Author: Andrew Or <andrewor14@gmail.com>

  Closes #2166 from andrewor14/mesos-spark-home and squashes the following commits:

  b87965e [Andrew Or] Merge branch 'master' of github.com:apache/spark into mesos-spark-home
  f6abb2e [Andrew Or] Document spark.mesos.executor.home
  ca7846d [Andrew Or] Add more specific configuration for executor Spark home in Mesos

  (cherry picked from commit 41dc5987d9abeca6fc0f5935c780d48f517cdf95)
  Signed-off-by: Andrew Or <andrewor14@gmail.com>
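As a rough illustration of the two settings named in this commit, the sketch below resolves an executor-side Spark home from a plain dictionary of configuration values. The lookup order (the specific key first, then `spark.home`) is an assumption made for the example, and the function name and paths are hypothetical; this is not Spark's actual code.

```python
def executor_spark_home(conf):
    """Return the Spark home to use on executors, or None if neither key is set.

    Assumption for this sketch: the Mesos-specific key wins over the
    older, generic `spark.home` key.
    """
    for key in ("spark.mesos.executor.home", "spark.home"):
        value = conf.get(key)
        if value:
            return value
    return None


# Hypothetical paths, for illustration only.
conf = {
    "spark.home": "/opt/spark-driver",
    "spark.mesos.executor.home": "/opt/spark-executor",
}
print(executor_spark_home(conf))
```

With both keys present, the sketch picks the Mesos-specific one, which matches the commit's stated goal of letting executors and the driver use different Spark homes.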
* [SPARK-3227] [mllib] Added migration guide for v1.0 to v1.1 | Joseph K. Bradley | 2014-08-27 | 1 | -1/+27
  The only updates are in DecisionTree.

  CC: mengxr

  Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

  Closes #2146 from jkbradley/mllib-migration and squashes the following commits:

  5a1f487 [Joseph K. Bradley] small edit to doc
  411d6d9 [Joseph K. Bradley] Added migration guide for v1.0 to v1.1. The only updates are in DecisionTree.

  (cherry picked from commit 171a41cb034f4ea80f6a3c91a6872970de16a14a)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-2830][MLLIB] doc update for 1.1 | Xiangrui Meng | 2014-08-27 | 4 | -86/+87
  1. renamed mllib-basics to mllib-data-types
  2. renamed mllib-stats to mllib-statistics
  3. moved random data generation to the bottom of mllib-stats
  4. updated toc accordingly

  atalwalkar

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #2151 from mengxr/mllib-doc-1.1 and squashes the following commits:

  0bd79f3 [Xiangrui Meng] add mllib-data-types
  b64a5d7 [Xiangrui Meng] update the content list of basis statistics in mllib-guide
  f625cc2 [Xiangrui Meng] move mllib-basics to mllib-data-types
  4d69250 [Xiangrui Meng] move random data generation to the bottom of statistics
  e64f3ce [Xiangrui Meng] move mllib-stats.md to mllib-statistics.md

  (cherry picked from commit 43dfc84f883822ea27b6e312d4353bf301c2e7ef)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* Fix unclosed HTML tag in Yarn docs. | Josh Rosen | 2014-08-26 | 1 | -1/+1
* [SPARK-2839][MLlib] Stats Toolkit documentation updated | Burak | 2014-08-26 | 1 | -41/+331
  Documentation updated for the Statistics Toolkit of MLlib. mengxr atalwalkar

  https://issues.apache.org/jira/browse/SPARK-2839

  P.S. Accidentally closed #2123. New commits didn't show up after I reopened the PR. I've opened this instead and closed the old one.

  Author: Burak <brkyvz@gmail.com>

  Closes #2130 from brkyvz/StatsLib-Docs and squashes the following commits:

  a54a855 [Burak] [SPARK-2839][MLlib] Addressed comments
  bfc6896 [Burak] [SPARK-2839][MLlib] Added a more specific link to colStats() for pyspark
  213fe3f [Burak] [SPARK-2839][MLlib] Modifications made according to review
  fec4d9d [Burak] [SPARK-2830][MLlib] Stats Toolkit documentation updated

  (cherry picked from commit 1208f72ac78960fe5060187761479b2a9a417c1b)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-3226][MLLIB] doc update for native libraries | Xiangrui Meng | 2014-08-26 | 1 | -10/+15
  to mention `-Pnetlib-lgpl` option. atalwalkar

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #2128 from mengxr/mllib-native and squashes the following commits:

  4cbba57 [Xiangrui Meng] update mllib dependencies

  (cherry picked from commit adbd5c1636669fc474ab02b54cd1ced353f68712)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* Fixed a typo in docs/running-on-mesos.md | Cheng Lian | 2014-08-25 | 1 | -1/+1
  It should be `spark-env.sh` rather than `spark.env.sh`.

  Author: Cheng Lian <lian.cs.zju@gmail.com>

  Closes #2119 from liancheng/fix-mesos-doc and squashes the following commits:

  f360548 [Cheng Lian] Fixed a typo in docs/running-on-mesos.md

  (cherry picked from commit 805fec845b7aa8b4763e3e0e34bec6c3872469f4)
  Signed-off-by: Josh Rosen <joshrosen@apache.org>
* [MLlib][SPARK-2997] Update SVD documentation to reflect roughly square | Reza Zadeh | 2014-08-24 | 1 | -6/+23
  Update the documentation to reflect the fact we can handle roughly square matrices.

  Author: Reza Zadeh <rizlar@gmail.com>

  Closes #2070 from rezazadeh/svddocs and squashes the following commits:

  826b8fe [Reza Zadeh] left singular vectors
  3f34fc6 [Reza Zadeh] PCA is still TS
  7ffa2aa [Reza Zadeh] better title
  aeaf39d [Reza Zadeh] More docs
  788ed13 [Reza Zadeh] add computational cost explanation
  6429c59 [Reza Zadeh] Add link to rowmatrix docs
  1eeab8b [Reza Zadeh] Update SVD documentation to reflect roughly square

  (cherry picked from commit b1b20301b3a1b35564d61e58eb5964d5ad5e4d7d)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-2841][MLlib] Documentation for feature transformations | DB Tsai | 2014-08-24 | 1 | -2/+107
  Documentation for newly added feature transformations:

  1. TF-IDF
  2. StandardScaler
  3. Normalizer

  Author: DB Tsai <dbtsai@alpinenow.com>

  Closes #2068 from dbtsai/transformer-documentation and squashes the following commits:

  109f324 [DB Tsai] address feedback

  (cherry picked from commit 572952ae615895efaaabcd509d582262000c0852)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-2963] REGRESSION - The description about how to build for using CLI and Thrift JDBC server is absent in proper document | Kousuke Saruta | 2014-08-22 | 1 | -4/+7
  The most important points I mentioned in #1885 are as follows:

  * People who build Spark are not always programmers.
  * If a person who builds Spark is not a programmer, he/she won't read the programmer's guide before building.

  So, the instructions on how to build for using the CLI and JDBC server should not be only in the programmer's guide.

  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

  Closes #2080 from sarutak/SPARK-2963 and squashes the following commits:

  ee07c76 [Kousuke Saruta] Modified regression of the description about building for using Thrift JDBC server and CLI
  ed53329 [Kousuke Saruta] Modified description and notaton of proper noun
  07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md
  6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963
  c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL
* [SPARK-2840] [mllib] DecisionTree doc update (Java, Python examples) | Joseph K. Bradley | 2014-08-21 | 1 | -69/+283
  Updated DecisionTree documentation, with examples for Java, Python. Added same Java example to code as well.

  CC: @mengxr @manishamde @atalwalkar

  Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

  Closes #2063 from jkbradley/dt-docs and squashes the following commits:

  2dd2c19 [Joseph K. Bradley] Last updates based on github review.
  9dd1b6b [Joseph K. Bradley] Updated decision tree doc.
  d802369 [Joseph K. Bradley] Updates based on comments: cache data, corrected doc text.
  b9bee04 [Joseph K. Bradley] Updated DT examples
  57eee9f [Joseph K. Bradley] Created JavaDecisionTree example from example in docs, and corrected doc example as needed.
  d939a92 [Joseph K. Bradley] Updated DecisionTree documentation. Added Java, Python examples.

  (cherry picked from commit 050f8d01e47b9b67b02ce50d83fb7b4e528b7204)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-2843][MLLIB] add a section about regularization parameter in ALS | Xiangrui Meng | 2014-08-20 | 1 | -0/+11
  atalwalkar srowen

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #2064 from mengxr/als-doc and squashes the following commits:

  b2e20ab [Xiangrui Meng] introduced -> discussed
  98abdd7 [Xiangrui Meng] add reference
  339bd08 [Xiangrui Meng] add a section about regularization parameter in ALS

  (cherry picked from commit e0f946265b9ea5bc48849cf7794c2c03d5e29fba)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-3143][MLLIB] add tf-idf user guide | Xiangrui Meng | 2014-08-20 | 1 | -3/+80
  Moved TF-IDF before Word2Vec because the former is more basic. I also added a link for Word2Vec. atalwalkar

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #2061 from mengxr/tfidf-doc and squashes the following commits:

  ca04c70 [Xiangrui Meng] address comments
  a5ea4b4 [Xiangrui Meng] add tf-idf user guide

  (cherry picked from commit e1571874f26c1df2dfd5ac2959612372716cd2d8)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* SPARK-3092 [SQL]: Always include the thriftserver when -Phive is enabled. | Patrick Wendell | 2014-08-20 | 2 | -9/+3
  Currently we have a separate profile called hive-thriftserver. I originally suggested this in case users did not want to bundle the thriftserver, but it has ultimately led to a lot of confusion. Since the thriftserver is only a few classes, I don't see a really good reason to isolate it from the rest of Hive. So let's go ahead and just include it in the same profile to simplify things.

  This has been suggested in the past by liancheng.

  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #2006 from pwendell/hiveserver and squashes the following commits:

  742ea40 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into hiveserver
  034ad47 [Patrick Wendell] SPARK-3092: Always include the thriftserver when -Phive is enabled.

  (cherry picked from commit f2f26c2a1dc6d60078c3be9c3d11a21866d9a24f)
  Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [DOCS] Fixed wrong links | Ken Takagiwa | 2014-08-19 | 1 | -2/+2
  Author: Ken Takagiwa <ugw.gi.world@gmail.com>

  Closes #2042 from giwa/patch-1 and squashes the following commits:

  216fe0e [Ken Takagiwa] Fixed wrong links

  (cherry picked from commit 8a74e4b2a8c7dab154b406539487cf29d578d208)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-3130][MLLIB] detect negative values in naive Bayes | Xiangrui Meng | 2014-08-19 | 1 | -1/+2
  because NB treats feature values as term frequencies. jkbradley

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #2038 from mengxr/nb-neg and squashes the following commits:

  52c37c3 [Xiangrui Meng] address comments
  65f892d [Xiangrui Meng] detect negative values in nb

  (cherry picked from commit 068b6fe6a10eb1c6b2102d88832203267f030e85)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
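Since the commit's reasoning is that naive Bayes treats feature values as term frequencies, a check of this kind can be sketched in a few lines. The function name and error message below are hypothetical and only illustrate the idea; they are not taken from MLlib.

```python
def require_nonnegative(features):
    """Raise ValueError if any feature value is negative.

    Term frequencies are counts, so a negative value cannot be a valid input.
    """
    for index, value in enumerate(features):
        if value < 0:
            raise ValueError(
                f"naive Bayes requires nonnegative feature values, "
                f"but found {value} at index {index}"
            )
    return features


require_nonnegative([1.0, 0.0, 3.0])  # passes silently
```

Failing early with the offending index makes the bad input easier to locate than a confusing result later in training.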
* [SPARK-3112][MLLIB] Add documentation and example for StreamingLR | freeman | 2014-08-19 | 1 | -0/+75
  Added a documentation section on StreamingLR to the ``MLlib - Linear Methods``, including a worked example.

  mengxr tdas

  Author: freeman <the.freeman.lab@gmail.com>

  Closes #2047 from freeman-lab/streaming-lr-docs and squashes the following commits:

  568d250 [freeman] Tweaks to wording / formatting
  05a1139 [freeman] Added documentation and example for StreamingLR

  (cherry picked from commit c7252b0097cfacd36f17357d195b12a59e503b35)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-3136][MLLIB] Create Java-friendly methods in RandomRDDs | Xiangrui Meng | 2014-08-19 | 2 | -2/+74
  Though we don't use default arguments for methods in RandomRDDs, it is still not easy for Java users to use because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users should expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in Scala methods in RandomRDDs, to make life easier for both Java and Scala users. This PR also contains documentation for random data generation.

  brkyvz

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #2041 from mengxr/stat-doc and squashes the following commits:

  fc5eedf [Xiangrui Meng] add missing comma
  ffde810 [Xiangrui Meng] address comments
  aef6d07 [Xiangrui Meng] add doc for random data generation
  b99d94b [Xiangrui Meng] add java-friendly methods to RandomRDDs

  (cherry picked from commit 825d4fe47b9c4d48de88622dd48dcf83beb8b80a)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* SPARK-2333 - spark_ec2 script should allow option for existing security group | Vida Ha | 2014-08-19 | 1 | -6/+8
  - Uses the name tag to identify machines in a cluster.
  - Allows overriding the security group name so it doesn't need to coincide with the cluster name.
  - Outputs the request id's of up to 10 pending spot instance requests.

  Author: Vida Ha <vida@databricks.com>

  Closes #1899 from vidaha/vida/ec2-reuse-security-group and squashes the following commits:

  c80d5c3 [Vida Ha] wrap retries in a try catch block
  b2989d5 [Vida Ha] SPARK-2333: spark_ec2 script should allow option for existing security group

  (cherry picked from commit 94053a7b766788bb62e2dbbf352ccbcc75f71fc0)
  Signed-off-by: Josh Rosen <joshrosen@apache.org>
* Fix typo in decision tree docs | Matt Forbes | 2014-08-18 | 1 | -2/+2
  Candidate splits were inconsistent with the example.

  Author: Matt Forbes <matt@tellapart.com>

  Closes #1837 from emef/tree-doc and squashes the following commits:

  3be14a1 [Matt Forbes] Fix typo in decision tree docs

  (cherry picked from commit cd0720ca77894d481fb73a8b5bb517013843cb1e)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* SPARK-3025 [SQL]: Allow JDBC clients to set a fair scheduler pool | Patrick Wendell | 2014-08-18 | 1 | -0/+5
  This definitely needs review as I am not familiar with this part of Spark. I tested this locally and it did seem to work.

  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #1937 from pwendell/scheduler and squashes the following commits:

  b858e33 [Patrick Wendell] SPARK-3025: Allow JDBC clients to set a fair scheduler pool

  (cherry picked from commit 6bca8898a1aa4ca7161492229bac1748b3da2ad7)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2842][MLlib]Word2Vec documentation | Liquan Pei | 2014-08-17 | 1 | -1/+62
  mengxr

  Documentation for Word2Vec

  Author: Liquan Pei <liquanpei@gmail.com>

  Closes #2003 from Ishiihara/Word2Vec-doc and squashes the following commits:

  4ff11d4 [Liquan Pei] minor fix
  8d7458f [Liquan Pei] code reformat
  6df0dcb [Liquan Pei] add Word2Vec documentation

  (cherry picked from commit eef779b8d631de971d440051cae21040f4de558f)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-1981] updated streaming-kinesis.md | Chris Fregly | 2014-08-17 | 1 | -48/+49
  fixed markup, separated out sections more-clearly, more thorough explanations

  Author: Chris Fregly <chris@fregly.com>

  Closes #1757 from cfregly/master and squashes the following commits:

  9b1c71a [Chris Fregly] better explained why spark checkpoints are disabled in the example (due to no stateful operations being used)
  0f37061 [Chris Fregly] SPARK-1981: (Kinesis streaming support) updated streaming-kinesis.md
  862df67 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
  8e1ae2e [Chris Fregly] Merge remote-tracking branch 'upstream/master'
  4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
  0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
  691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
  0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
  74e5c7c [Chris Fregly] updated per TD's feedback. simplified examples, updated docs
  e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
  bf614e9 [Chris Fregly] per matei's feedback: moved the kinesis examples into the examples/ dir
  d17ca6d [Chris Fregly] per TD's feedback: updated docs, simplified the KinesisUtils api
  912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class
  db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
  21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
  6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
  338997e [Chris Fregly] improve build docs for kinesis
  828f8ae [Chris Fregly] more cleanup
  e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
  cd68c0d [Chris Fregly] fixed typos and backward compatibility
  d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
  b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support

  (cherry picked from commit 99243288b049f4a4fb4ba0505ea2310be5eb4bd2)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-2677] BasicBlockFetchIterator#next can wait forever | Kousuke Saruta | 2014-08-16 | 1 | -0/+9
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

  Closes #1632 from sarutak/SPARK-2677 and squashes the following commits:

  cddbc7b [Kousuke Saruta] Removed Exception throwing when ConnectionManager#handleMessage receives ack for non-referenced message
  d3bd2a8 [Kousuke Saruta] Modified configuration.md for spark.core.connection.ack.timeout
  e85f88b [Kousuke Saruta] Removed useless synchronized blocks
  7ed48be [Kousuke Saruta] Modified ConnectionManager to use ackTimeoutMonitor ConnectionManager-wide
  9b620a6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
  0dd9ad3 [Kousuke Saruta] Modified typo in ConnectionManagerSuite.scala
  7cbb8ca [Kousuke Saruta] Modified to match with scalastyle
  8a73974 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
  ade279a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
  0174d6a [Kousuke Saruta] Modified ConnectionManager.scala to handle the case remote Executor cannot ack
  a454239 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
  9b7b7c1 [Kousuke Saruta] (WIP) Modifying ConnectionManager.scala

  (cherry picked from commit 76fa0eaf515fd6771cdd69422b1259485debcae5)
  Signed-off-by: Josh Rosen <joshrosen@apache.org>
* [SPARK-3029] Disable local execution of Spark jobs by default | Aaron Davidson | 2014-08-14 | 1 | -0/+9
  Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst-case scenarios occur if the RDD is cached (guaranteed to load whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead.

  Additionally, jobs that run locally in this manner do not show up in the web UI, and are thus harder to track or understand what is occurring.

  This PR adds a flag that controls local execution and turns local execution OFF by default, with the intention of perhaps eventually removing this functionality altogether. Removing it now is a tougher proposition since it is part of the public runJob API. An alternative solution would be to limit the flag to take()/first() to avoid impacting any external users of this API, but such usage (or, at least, reliance upon the feature) is hopefully minimal.

  Author: Aaron Davidson <aaron@databricks.com>

  Closes #1321 from aarondav/allowlocal and squashes the following commits:

  136b253 [Aaron Davidson] Fix DAGSchedulerSuite
  5599d55 [Aaron Davidson] [RFC] Disable local execution of Spark jobs by default

  (cherry picked from commit d069c5d9d2f6ce06389ca2ddf0b3ae4db72c5797)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* [Docs] Add missing <code> tags (minor) | Andrew Or | 2014-08-13 | 1 | -2/+2
  These configs looked inconsistent with the rest.

  Author: Andrew Or <andrewor14@gmail.com>

  Closes #1936 from andrewor14/docs-code and squashes the following commits:

  15f578a [Andrew Or] Add <code> tag

  (cherry picked from commit e4245656438d00714ebd59e89c4de3fdaae83494)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2963] [SQL] There no documentation about building to use HiveServer and CLI for SparkSQL | Kousuke Saruta | 2014-08-13 | 1 | -0/+9
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

  Closes #1885 from sarutak/SPARK-2963 and squashes the following commits:

  ed53329 [Kousuke Saruta] Modified description and notaton of proper noun
  07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md
  6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963
  c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL

  (cherry picked from commit 869f06c759c29b09c8dc72e0e4034c03f908ba30)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2953] Allow using short names for io compression codecs | Reynold Xin | 2014-08-12 | 1 | -3/+5
  Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just accepts "lz4", "lzf", "snappy".

  Author: Reynold Xin <rxin@apache.org>

  Closes #1873 from rxin/compressionCodecShortForm and squashes the following commits:

  9f50962 [Reynold Xin] Specify short-form compression codec names first.
  63f78ee [Reynold Xin] Updated configuration documentation.
  47b3848 [Reynold Xin] [SPARK-2953] Allow using short names for io compression codecs

  (cherry picked from commit 676f98289dad61c091bb45bd35a2b9613b22d64a)
  Signed-off-by: Reynold Xin <rxin@apache.org>
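The short-name idea can be sketched as a lookup table that falls back to the given string, so fully qualified class names keep working. Only the LZ4 class name appears in the commit message; the other two class names are assumed here to follow the same naming pattern, and the case-insensitive match is also an assumption of this sketch.

```python
# Short name -> fully qualified codec class.
# Only the lz4 entry is quoted in the commit message; the lzf and snappy
# class names are assumptions that follow the same pattern.
SHORT_CODEC_NAMES = {
    "lz4": "org.apache.spark.io.LZ4CompressionCodec",
    "lzf": "org.apache.spark.io.LZFCompressionCodec",
    "snappy": "org.apache.spark.io.SnappyCompressionCodec",
}


def resolve_codec_class(name):
    """Map a short codec name to its class name; pass other names through."""
    return SHORT_CODEC_NAMES.get(name.lower(), name)


print(resolve_codec_class("lz4"))
print(resolve_codec_class("com.example.MyCodec"))  # hypothetical custom codec
```

Passing unknown names through unchanged keeps existing configurations that spell out the full class name valid.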
* SPARK-2830 [MLlib]: re-organize mllib documentation | Ameet Talwalkar | 2014-08-12 | 10 | -220/+317
  As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.

  Author: Ameet Talwalkar <atalwalkar@gmail.com>

  Closes #1908 from atalwalkar/master and squashes the following commits:

  fe6938a [Ameet Talwalkar] made xiangruis suggested changes
  840028b [Ameet Talwalkar] made xiangruis suggested changes
  7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation

  (cherry picked from commit c235b83e2782cce0626ecc403c0a67e442be52c1)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-2635] Fix race condition at SchedulerBackend.isReady in standalone mode | li-zhihui | 2014-08-08 | 1 | -6/+7
  In SPARK-1946 (PR #900), configuration <code>spark.scheduler.minRegisteredExecutorsRatio</code> was introduced. However, in standalone mode, there is a race condition where isReady() can return true because totalExpectedExecutors has not been correctly set.

  Because the number of expected executors is uncertain in standalone mode, this PR uses CPU cores (<code>--total-executor-cores</code>) as the expected resources to judge whether SchedulerBackend is ready.

  Author: li-zhihui <zhihui.li@intel.com>
  Author: Li Zhihui <zhihui.li@intel.com>

  Closes #1525 from li-zhihui/fixre4s and squashes the following commits:

  e9a630b [Li Zhihui] Rename variable totalExecutors and clean codes
  abf4860 [Li Zhihui] Push down variable totalExpectedResources to children classes
  ca54bd9 [li-zhihui] Format log with String interpolation
  88c7dc6 [li-zhihui] Few codes and docs refactor
  41cf47e [li-zhihui] Fix race condition at SchedulerBackend.isReady in standalone mode

  (cherry picked from commit 28dbae85aaf6842e22cd7465cb11cb34d58fc56d)
  Signed-off-by: Patrick Wendell <pwendell@gmail.com>
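The readiness check described here can be sketched as a ratio test over resources. The function and parameter names below are hypothetical, and the comparison is an assumption about the general shape of such a check rather than the scheduler's real code. The sketch also shows why an unset expectation is a problem: with an expected total of zero, the test passes before anything has registered.

```python
def is_ready(registered_cores, total_expected_cores, min_registered_ratio):
    """Return True once enough of the expected cores have registered."""
    return registered_cores >= total_expected_cores * min_registered_ratio


# Expected total not yet set (0): the check is trivially true.
print(is_ready(0, 0, 0.8))
# Expected total known up front, e.g. from --total-executor-cores.
print(is_ready(4, 16, 0.8))
print(is_ready(13, 16, 0.8))
```

Using a quantity that is known at submission time, such as the requested core count, removes the window in which the expectation is still zero.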
* SPARK-2787: Make sort-based shuffle write files directly when there's no sorting/aggregation and # partitions is small | Matei Zaharia | 2014-08-07 | 1 | -0/+18
  As described in https://issues.apache.org/jira/browse/SPARK-2787, right now sort-based shuffle is more expensive than hash-based for map operations that do no partial aggregation or sorting, such as groupByKey. This is because it has to serialize each data item twice (once when spilling to intermediate files, and then again when merging these files object-by-object). This patch adds a code path to just write separate files directly if the # of output partitions is small, and concatenate them at the end to produce a sorted file.

  On the unit test side, I added some tests that force or don't force this bypass path to be used, and checked that our tests for other features (e.g. all the operations) cover both cases.

  Author: Matei Zaharia <matei@databricks.com>

  Closes #1799 from mateiz/SPARK-2787 and squashes the following commits:

  88cf26a [Matei Zaharia] Fix rebase
  10233af [Matei Zaharia] Review comments
  398cb95 [Matei Zaharia] Fix looking up shuffle manager in conf
  ca3efd9 [Matei Zaharia] Add docs for shuffle manager properties, and allow short names for them
  d0ae3c5 [Matei Zaharia] Fix some comments
  90d084f [Matei Zaharia] Add code path to bypass merge-sort in ExternalSorter, and tests
  31e5d7c [Matei Zaharia] Move existing logic for writing partitioned files into ExternalSorter

  (cherry picked from commit 6906b69cf568015f20c7d7c77cbcba650e5431a9)
  Signed-off-by: Reynold Xin <rxin@apache.org>
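The bypass path described above can be sketched with in-memory lists standing in for the per-partition files. Everything here is illustrative: the function names are hypothetical, the threshold value is a placeholder rather than a documented default, and `hash(key) % num_partitions` is a stand-in for whatever partitioner is in use.

```python
def should_bypass_merge_sort(num_partitions, needs_aggregation, needs_ordering,
                             threshold=200):
    """Decide whether to write per-partition files directly.

    `threshold` is a placeholder value chosen for this sketch.
    """
    return (not needs_aggregation
            and not needs_ordering
            and num_partitions <= threshold)


def write_partitioned(records, num_partitions):
    """Append each record to its partition's buffer, then concatenate.

    Returns the concatenated output plus the start offset of every
    partition, so a reader can find partition i at
    output[offsets[i]:offsets[i + 1]].
    """
    # One buffer per partition stands in for one temporary file each.
    buffers = [[] for _ in range(num_partitions)]
    for key, value in records:
        buffers[hash(key) % num_partitions].append((key, value))

    output, offsets = [], [0]
    for buffer in buffers:
        output.extend(buffer)
        offsets.append(len(output))
    return output, offsets


records = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
if should_bypass_merge_sort(2, needs_aggregation=False, needs_ordering=False):
    output, offsets = write_partitioned(records, 2)
    print(output, offsets)
```

Each record is handled once on this path, which is the saving the commit describes: the result is ordered by partition without a spill-then-merge pass.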
* Updating versions for Spark 1.1.0 | Patrick Wendell | 2014-08-06 | 1 | -2/+2
* [SPARK-2157] Enable tight firewall rules for Spark | Andrew Or | 2014-08-06 | 3 | -90/+179
  The goal of this PR is to allow users of Spark to write tight firewall rules for their clusters. This is currently not possible because Spark uses random ports in many places, notably the communication between executors and drivers. The changes in this PR are based on top of ash211's changes in #1107.

  The list covered here may or may not be the complete set of ports needed for Spark to operate perfectly. However, as of the latest commit there are no known sources of random ports (except in tests). I have not documented a few of the more obscure configs.

  My spark-env.sh looks like this:

  ```
  export SPARK_MASTER_PORT=6060
  export SPARK_WORKER_PORT=7070
  export SPARK_MASTER_WEBUI_PORT=9090
  export SPARK_WORKER_WEBUI_PORT=9091
  ```

  and my spark-defaults.conf looks like this:

  ```
  spark.master spark://andrews-mbp:6060
  spark.driver.port 5001
  spark.fileserver.port 5011
  spark.broadcast.port 5021
  spark.replClassServer.port 5031
  spark.blockManager.port 5041
  spark.executor.port 5051
  ```

  Author: Andrew Or <andrewor14@gmail.com>
  Author: Andrew Ash <andrew@andrewash.com>

  Closes #1777 from andrewor14/configure-ports and squashes the following commits:

  621267b [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
  8a6b820 [Andrew Or] Use a random UI port during tests
  7da0493 [Andrew Or] Fix tests
  523c30e [Andrew Or] Add test for isBindCollision
  b97b02a [Andrew Or] Minor fixes
  c22ad00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
  93d359f [Andrew Or] Executors connect to wrong port when collision occurs
  d502e5f [Andrew Or] Handle port collisions when creating Akka systems
  a2dd05c [Andrew Or] Patrick's comment nit
  86461e2 [Andrew Or] Remove spark.executor.env.port and spark.standalone.client.port
  1d2d5c6 [Andrew Or] Fix ports for standalone cluster mode
  cb3be88 [Andrew Or] Various doc fixes (broken link, format etc.)
  e837cde [Andrew Or] Remove outdated TODOs
  bfbab28 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
  de1b207 [Andrew Or] Update docs to reflect new ports
  b565079 [Andrew Or] Add spark.ports.maxRetries
  2551eb2 [Andrew Or] Remove spark.worker.watcher.port
  151327a [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
  9868358 [Andrew Or] Add a few miscellaneous ports
  6016e77 [Andrew Or] Add spark.executor.port
  8d836e6 [Andrew Or] Also document SPARK_{MASTER/WORKER}_WEBUI_PORT
  4d9e6f3 [Andrew Or] Fix super subtle bug
  3f8e51b [Andrew Or] Correct erroneous docs...
  e111d08 [Andrew Or] Add names for UI services
  470f38c [Andrew Or] Special case non-"Address already in use" exceptions
  1d7e408 [Andrew Or] Treat 0 ports specially + return correct ConnectionManager port
  ba32280 [Andrew Or] Minor fixes
  6b550b0 [Andrew Or] Assorted fixes
  73fbe89 [Andrew Or] Move start service logic to Utils
  ec676f4 [Andrew Or] Merge branch 'SPARK-2157' of github.com:ash211/spark into configure-ports
  038a579 [Andrew Ash] Trust the server start function to report the port the service started on
  7c5bdc4 [Andrew Ash] Fix style issue
  0347aef [Andrew Ash] Unify port fallback logic to a single place
  24a4c32 [Andrew Ash] Remove type on val to match surrounding style
  9e4ad96 [Andrew Ash] Reformat for style checker
  5d84e0e [Andrew Ash] Document new port configuration options
  066dc7a [Andrew Ash] Fix up HttpServer port increments
  cad16da [Andrew Ash] Add fallover increment logic for HttpServer
  c5a0568 [Andrew Ash] Fix ConnectionManager to retry with increment
  b80d2fd [Andrew Ash] Make Spark's block manager port configurable
  17c79bb [Andrew Ash] Add a configuration option for spark-shell's class server
  f34115d [Andrew Ash] SPARK-1176 Add port configuration for HttpBroadcast
  49ee29b [Andrew Ash] SPARK-1174 Add port configuration for HttpFileServer
  1c0981a [Andrew Ash] Make port in HttpServer configurable

  (cherry picked from commit 09f7e4587bbdf74207d2629e8c1314f93d865999)
  Signed-off-by: Patrick Wendell <pwendell@gmail.com>
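Several of the squashed commits above mention retrying with an incremented port, treating port 0 specially, trusting the start function to report the bound port, and special-casing errors other than "Address already in use". A minimal sketch of that kind of fallback loop follows. The function name, the default retry count, and the injected `start_service` callable are all hypothetical; this is not the logic in Spark's Utils.

```python
import errno


def start_service_on_port(start_port, start_service, max_retries=16):
    """Try start_port, then start_port + 1, ..., up to max_retries retries.

    `start_service(port)` must return the port it actually bound to and
    raise OSError on failure. `max_retries=16` is a placeholder default.
    """
    for offset in range(max_retries + 1):
        # Port 0 means "let the OS choose", so it is retried unchanged.
        port = 0 if start_port == 0 else start_port + offset
        try:
            return start_service(port)
        except OSError as error:
            # Only a bind collision is worth retrying; re-raise anything else.
            if error.errno != errno.EADDRINUSE:
                raise
    raise OSError(
        errno.EADDRINUSE,
        f"no free port after {max_retries} retries starting at {start_port}",
    )


# A fake start function so the example runs without opening sockets.
busy_ports = {5001, 5002}

def fake_start(port):
    if port in busy_ports:
        raise OSError(errno.EADDRINUSE, "Address already in use")
    return port

print(start_service_on_port(5001, fake_start))
```

Because the fallback is deterministic (the next port up), a firewall rule can cover a small contiguous range per service instead of the full ephemeral range.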
* [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB. | Reynold Xin | 2014-08-05 | 1 | -1/+1
  This can substantially reduce memory usage during shuffle.

  Author: Reynold Xin <rxin@apache.org>

  Closes #1781 from rxin/SPARK-2503-spark.shuffle.file.buffer.kb and squashes the following commits:

  104b8d8 [Reynold Xin] [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB.

  (cherry picked from commit acff9a7f13b98f10a08aea1d11cfa685c3419367)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2856] Decrease initial buffer size for Kryo to 64KB. | Reynold Xin | 2014-08-05 | 1 | -1/+1
  Author: Reynold Xin <rxin@apache.org>

  Closes #1780 from rxin/kryo-init-size and squashes the following commits:

  551b935 [Reynold Xin] [SPARK-2856] Decrease initial buffer size for Kryo to 64KB.

  (cherry picked from commit 184048f80b6fa160c89d5bb47b937a0a89534a95)
  Signed-off-by: Reynold Xin <rxin@apache.org>
* SPARK-1680: use configs for specifying environment variables on YARN | Thomas Graves | 2014-08-05 | 2 | -5/+25
  Note that this also documents spark.executorEnv.* which to me means it's public. If we don't want that please speak up.

  Author: Thomas Graves <tgraves@apache.org>

  Closes #1512 from tgravescs/SPARK-1680 and squashes the following commits:

  11525df [Thomas Graves] more doc changes
  553bad0 [Thomas Graves] fix documentation
  152bf7c [Thomas Graves] fix docs
  5382326 [Thomas Graves] try fix docs
  32f86a4 [Thomas Graves] use configs for specifying environment variables on YARN

  (cherry picked from commit 41e0a21b22ccd2788dc079790788e505b0d4e37d)
  Signed-off-by: Thomas Graves <tgraves@apache.org>
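The `spark.executorEnv.*` pattern mentioned in this commit maps configuration keys to environment variable names by prefix. A small sketch of that mapping over a plain dictionary follows; the function name and the sample values are hypothetical, and the code is not taken from Spark.

```python
def env_from_conf(conf, prefix="spark.executorEnv."):
    """Collect environment variables from keys that start with `prefix`.

    The variable name is whatever follows the prefix; keys that consist
    of the bare prefix alone are ignored.
    """
    return {
        key[len(prefix):]: value
        for key, value in conf.items()
        if key.startswith(prefix) and len(key) > len(prefix)
    }


# Hypothetical configuration values, for illustration only.
conf = {
    "spark.executorEnv.JAVA_HOME": "/usr/lib/jvm/java-7",
    "spark.executorEnv.PYTHONPATH": "/opt/libs",
    "spark.master": "yarn-client",
}
print(env_from_conf(conf))
```

Keeping the variables in configuration keys means they travel with the rest of the application's settings instead of depending on the shell environment at launch time.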
* SPARK-2380: Support displaying accumulator values in the web UIPatrick Wendell2014-08-051-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds support for giving accumulators user-visible names and displaying accumulator values in the web UI. This allows users to create custom counters that can be displayed in the UI. The current approach displays both the accumulator deltas caused by each task and a "current" value of the accumulator totals for each stage, which gets updated as tasks finish. Currently in Spark, developers have been extending the `TaskMetrics` functionality to provide custom instrumentation for RDDs. This provides a potentially nicer alternative of going through the existing accumulator framework (actually `TaskMetrics` and accumulators are on an awkward collision course as we add more features to the former). The current patch demos how we can use the feature to provide instrumentation for RDD input sizes. The nice thing about going through accumulators is that users can read the current value of the data being tracked in their programs. This could be useful, e.g., to decide whether to short-circuit a Spark stage depending on how things are going. 
![counters](https://cloud.githubusercontent.com/assets/320616/3488815/6ee7bc34-0505-11e4-84ce-e36d9886e2cf.png) Author: Patrick Wendell <pwendell@gmail.com> Closes #1309 from pwendell/metrics and squashes the following commits: 8815308 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into HEAD 93fbe0f [Patrick Wendell] Other minor fixes cc43f68 [Patrick Wendell] Updating unit tests c991b1b [Patrick Wendell] Moving some code into the Accumulators class 9a9ba3c [Patrick Wendell] More merge fixes c5ace9e [Patrick Wendell] More merge conflicts 1da15e3 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into metrics 9860c55 [Patrick Wendell] Potential solution to posting listener events 0bb0e33 [Patrick Wendell] Remove "display" variable and assume display = name.isDefined 0ec4ac7 [Patrick Wendell] Java API's e95bf69 [Patrick Wendell] Stash be97261 [Patrick Wendell] Style fix 8407308 [Patrick Wendell] Removing examples in Hadoop and RDD class 64d405f [Patrick Wendell] Adding missing file 5d8b156 [Patrick Wendell] Changes based on Kay's review. 9f18bad [Patrick Wendell] Minor style changes and tests 7a63abc [Patrick Wendell] Adding Json serialization and responding to Reynold's feedback ad85076 [Patrick Wendell] Example of using named accumulators for custom RDD metrics. 0b72660 [Patrick Wendell] Initial WIP example of supporing globally named accumulators.
* [SPARK-2859] Update url of Kryo project in related docsGuancheng (G.C.) Chen2014-08-051-2/+2
| | | | | | | | | | | | | | | JIRA Issue: https://issues.apache.org/jira/browse/SPARK-2859 Kryo project has been migrated from googlecode to github, hence we need to update its URL in related docs such as tuning.md. Author: Guancheng (G.C.) Chen <chenguancheng@gmail.com> Closes #1782 from gchen/kryo-docs and squashes the following commits: b62543c [Guancheng (G.C.) Chen] update url of Kryo project (cherry picked from commit ac3440f4f3c4b79070ffec7db0b08ad062b4df90) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1890 and SPARK-1891- add admin and modify aclsThomas Graves2014-08-052-7/+27
| | | | | | | | | | | | | | | | | | | | | | It was easier to combine these two JIRAs since they touch many of the same places. This PR adds the following: - adds modify acls - adds admin acls (list of admins/users that get added to both view and modify acls) - modifies the Kill button on the UI to take modify acls into account - changes the config name spark.ui.acls.enable to spark.acls.enable, since I chose poorly in the original name. We keep backwards compatibility so people can still use spark.ui.acls.enable. The acls should apply to any web UI as well as any CLI interfaces. - sends view and modify acls information on to YARN so that YARN interfaces can use it (the yarn CLI for killing applications, for example). Author: Thomas Graves <tgraves@apache.org> Closes #1196 from tgravescs/SPARK-1890 and squashes the following commits: 8292eb1 [Thomas Graves] review comments b92ec89 [Thomas Graves] remove unneeded variable from applistener 4c765f4 [Thomas Graves] Add in admin acls 72eb0ac [Thomas Graves] Add modify acls (cherry picked from commit 1c5555a23d3aa40423d658cfbf2c956ad415a6b1) Signed-off-by: Thomas Graves <tgraves@apache.org>
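The ACL rules in this message come down to two small pieces of logic: admins are added to both the view and modify lists, and the old enable flag is still honored. The sketch below models only those two rules in plain Python. The function names are hypothetical, and the precedence of the new key over the old one is an assumption, since the message states only that the old key keeps working.

```python
def effective_acls(view_acls, modify_acls, admin_acls):
    """Return (view, modify) user sets with admins added to both."""
    admins = set(admin_acls)
    return set(view_acls) | admins, set(modify_acls) | admins


def acls_enabled(conf):
    """Read the enable flag, falling back to the older key.

    Assumption: spark.acls.enable wins when both keys are present.
    """
    value = conf.get("spark.acls.enable",
                     conf.get("spark.ui.acls.enable", "false"))
    return value.strip().lower() == "true"


def may_kill(user, modify_acls, admin_acls):
    """A Kill-button style check: the user needs modify permission."""
    _, modify = effective_acls([], modify_acls, admin_acls)
    return user in modify
```

For instance, with `modify_acls=["bob"]` and `admin_acls=["root"]`, both `bob` and `root` pass `may_kill`, while a user who appears only in the view list does not.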
* SPARK-1528 - spark on yarn, add support for accessing remote HDFSThomas Graves2014-08-051-0/+7
| | | | | | | | | | | | | | Add a config (spark.yarn.access.namenodes) to allow applications running on YARN to access other secure HDFS clusters. The user just specifies the namenodes of the other clusters, and we get tokens for those and ship them with the Spark application. Author: Thomas Graves <tgraves@apache.org> Closes #1159 from tgravescs/spark-1528 and squashes the following commits: ddbcd16 [Thomas Graves] review comments 0ac8501 [Thomas Graves] SPARK-1528 - add support for accessing remote HDFS (cherry picked from commit 2c0f705e26ca3dfc43a1e9a0722c0e57f67c970a) Signed-off-by: Thomas Graves <tgraves@apache.org>
* [SPARK-2857] Correct properties to set Master / Worker portsAndrew Or2014-08-051-2/+2
| | | | | | | | | | | | | | `master.ui.port` and `worker.ui.port` were never picked up by SparkConf, simply because they are not prefixed with "spark." Unfortunately, this is also currently the documented way of setting these values. Author: Andrew Or <andrewor14@gmail.com> Closes #1779 from andrewor14/master-worker-port and squashes the following commits: 8475e95 [Andrew Or] Update docs to reflect changes in configs 4db3d5d [Andrew Or] Stop using configs that don't actually work (cherry picked from commit a646a365e3beb8d0cd7e492e625ce68ee9439a07) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
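The failure described here follows from prefix filtering: a loader that keeps only keys starting with "spark." silently drops everything else. A minimal sketch of that behavior is shown below. The function name is hypothetical, and `spark.master.ui.port` appears only as an illustrative prefixed key, since the message does not spell out the corrected names.

```python
def load_spark_properties(system_properties):
    """Keep only the entries whose key starts with 'spark.'.

    Stand-in for a configuration loader that reads from system
    properties; unprefixed keys are ignored without any warning.
    """
    return {key: value
            for key, value in system_properties.items()
            if key.startswith("spark.")}


props = {
    "master.ui.port": "8081",        # unprefixed: never picked up
    "spark.master.ui.port": "8082",  # prefixed (illustrative name)
}
loaded = load_spark_properties(props)
```

Here `loaded` contains only the prefixed key, so the value set under `master.ui.port` never reaches the application.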
* SPARK-2792. Fix reading too much or too little data from each stream in ↵Matei Zaharia2014-08-041-1/+1
| | | | | | | | | | | | | | | | | | | | ExternalMap / Sorter All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization) and that would confuse the previous code into reading it as part of the next batch. There are also improvements to cleanup to make sure files are closed. In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues. Author: Matei Zaharia <matei@databricks.com> Closes #1722 from mateiz/spark-2792 and squashes the following commits: 5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too 18fe865 [Matei Zaharia] Update docs on objectStreamReset 576ee83 [Matei Zaharia] Allow objectStreamReset to be 0 0374217 [Matei Zaharia] Remove super paranoid code to close file handles bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too 0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap 9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
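The fix described above amounts to recording each batch's byte length at write time and reading exactly that many bytes per batch, so bytes a serializer emits after the last object stay inside their own batch. The sketch below illustrates this with Python's `pickle` and an in-memory buffer. The `TRAILER` byte, the function names, and the one-pickle-per-batch layout are assumptions for illustration, not the ExternalAppendOnlyMap implementation.

```python
import io
import pickle

# Stand-in for bytes a serializer may write after the last object
# (the message cites Java serialization's TC_RESET flag as an example).
TRAILER = b"\x79"


def write_batches(batches):
    """Serialize each batch, append TRAILER, and record its byte length."""
    out = io.BytesIO()
    sizes = []
    for batch in batches:
        start = out.tell()
        out.write(pickle.dumps(batch))
        out.write(TRAILER)
        sizes.append(out.tell() - start)  # length includes the trailer
    return out.getvalue(), sizes


def read_batches(blob, sizes):
    """Read exactly sizes[i] bytes per batch, then drop the trailer."""
    src = io.BytesIO(blob)
    for size in sizes:
        chunk = src.read(size)  # bounded read: never crosses into the next batch
        yield pickle.loads(chunk[:-len(TRAILER)])
```

Because every read is bounded by the recorded size, the reader never has to guess where a batch ends, whatever the serializer appends.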
* [SPARK-2784][SQL] Deprecate hql() method in favor of a config option, ↵Michael Armbrust2014-08-031-9/+9
| | | | | | | | | | | | | | | | | | | | | | | 'spark.sql.dialect' Many users have reported being confused by the distinction between the `sql` and `hql` methods. Specifically, many users think that `sql(...)` cannot be used to read hive tables. In this PR I introduce a new configuration option `spark.sql.dialect` that picks which dialect will be used for parsing. For `SQLContext` this must be set to `sql`. In `HiveContext` it defaults to `hiveql` but can also be set to `sql`. The `hql` and `hiveql` methods continue to act the same but are now marked as deprecated. **This is a possibly breaking change for some users unless they set the dialect manually, though this is unlikely.** For example: `hiveContext.sql("SELECT 1")` will now throw a parsing exception by default. Author: Michael Armbrust <michael@databricks.com> Closes #1746 from marmbrus/sqlLanguageConf and squashes the following commits: ad375cc [Michael Armbrust] Merge remote-tracking branch 'apache/master' into sqlLanguageConf 20c43f8 [Michael Armbrust] override function instead of just setting the value 7e4ae93 [Michael Armbrust] Deprecate hql() method in favor of a config option, 'spark.sql.dialect' (cherry picked from commit 236dfac6769016e433b2f6517cda2d308dea74bc) Signed-off-by: Michael Armbrust <michael@databricks.com>
* SPARK-2712 - Add a small note to maven doc that mvn package must happen ...Stephen Boesch2014-08-031-1/+6
| | | | | | | | | | | | | | Per request by Reynold adding small note about proper sequencing of build then test. Author: Stephen Boesch <javadba@gmail.com> Closes #1615 from javadba/docs and squashes the following commits: 6c3183e [Stephen Boesch] Moved updated testing blurb per PWendell 5764757 [Stephen Boesch] SPARK-2712 - Add a small note to maven doc that mvn package must happen before test (cherry picked from commit f8cd143b6b1b4d8aac87c229e5af263b0319b3ea) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-2739][SQL] Rename registerAsTable to registerTempTableMichael Armbrust2014-08-021-9/+9
| | | | | | | | | | | | | | | | There have been user complaints that the difference between `registerAsTable` and `saveAsTable` is too subtle. This PR addresses this by renaming `registerAsTable` to `registerTempTable`, which more clearly reflects what is happening. `registerAsTable` remains, but will cause a deprecation warning. Author: Michael Armbrust <michael@databricks.com> Closes #1743 from marmbrus/registerTempTable and squashes the following commits: d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable 4dff086 [Michael Armbrust] Fix .java files too 89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable 0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable (cherry picked from commit 1a8043739dc1d9435def6ea3c6341498ba52b708) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-1981] Add AWS Kinesis streaming supportChris Fregly2014-08-023-6/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Author: Chris Fregly <chris@fregly.com> Closes #1434 from cfregly/master and squashes the following commits: 4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method 0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl 691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams 0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master' 74e5c7c [Chris Fregly] updated per TD's feedback. simplified examples, updated docs e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master' bf614e9 [Chris Fregly] per matei's feedback: moved the kinesis examples into the examples/ dir d17ca6d [Chris Fregly] per TD's feedback: updated docs, simplified the KinesisUtils api 912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master' 21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master' 6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client 338997e [Chris Fregly] improve build docs for kinesis 828f8ae [Chris Fregly] more cleanup e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master' cd68c0d [Chris Fregly] fixed typos and backward compatibility d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master' b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support (cherry picked from commit 91f9504e6086fac05b40545099f9818949c24bca) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SQL] Documentation: Explain cacheTable commandCrazyJvm2014-08-011-0/+10
| | | | | | | | | | | | | add the `cacheTable` specification Author: CrazyJvm <crazyjvm@gmail.com> Closes #1681 from CrazyJvm/sql-programming-guide-cache and squashes the following commits: 0a231e0 [CrazyJvm] grammar fixes a04020e [CrazyJvm] modify title to Cached tables 18b6594 [CrazyJvm] fix format 2cbbf58 [CrazyJvm] add cacheTable guide
* SPARK-2099. Report progress while task is running.Sandy Ryza2014-08-011-0/+7
| | | | | | | | | | | | | | | | | | This is a sketch of a patch that allows the UI to show metrics for tasks that have not yet completed. It adds a heartbeat every 2 seconds from the executors to the driver, reporting metrics for all of the executor's tasks. It still needs unit tests, polish, and cluster testing, but I wanted to put it up to get feedback on the approach. Author: Sandy Ryza <sandy@cloudera.com> Closes #1056 from sryza/sandy-spark-2099 and squashes the following commits: 93b9fdb [Sandy Ryza] Up heartbeat interval to 10 seconds and other tidying 132aec7 [Sandy Ryza] Heartbeat and HeartbeatResponse are already Serializable as case classes 38dffde [Sandy Ryza] Additional review feedback and restore test that was removed in BlockManagerSuite 51fa396 [Sandy Ryza] Remove hostname race, add better comments about threading, and some stylistic improvements 3084f10 [Sandy Ryza] Make TaskUIData a case class again 3bda974 [Sandy Ryza] Stylistic fixes 0dae734 [Sandy Ryza] SPARK-2099. Report progress while task is running.
* Docs: monitoring, streaming programming guidekballou2014-07-312-4/+4
| | | | | | | | | | | | | | | Fix several awkward wordings and grammatical issues in the following documents: * docs/monitoring.md * docs/streaming-programming-guide.md Author: kballou <kballou@devnulllabs.io> Closes #1662 from kennyballou/grammar_fixes and squashes the following commits: e1b8ad6 [kballou] Docs: monitoring, streaming programming guide