path: root/docs
Commit log (each entry: subject, author, date, files changed, lines -/+)
* [MINOR][DOCS] typo in docs/configuration.md (Kai Jiang, 2015-11-14, 1 file changed, -5/+5)
  The `</code>` end tags were missing their slash in docs/configuration.md (L308-L339); ref #8795.
  Author: Kai Jiang <jiangkai@gmail.com>
  Closes #9715 from vectorijk/minor-typo-docs.
* [SPARK-11336] Add links to example codes (Xusen Yin, 2015-11-13, 1 file changed, -2/+7)
  https://issues.apache.org/jira/browse/SPARK-11336 mengxr
  I add a hyperlink to Spark on GitHub in each code example, plus a hint that the examples live in the Spark code repo. I remove the config key for changing the example code dir, since we assume all examples should be in spark/examples. The hyperlink cannot be used until Spark v1.6.0 is released, so that is not a problem. Screenshots, so you can get an instant feel:
  https://cloud.githubusercontent.com/assets/2637239/10780634/bd20e072-7cfc-11e5-8960-def4fc62a8ea.png
  https://cloud.githubusercontent.com/assets/2637239/10780636/c3f6e180-7cfc-11e5-80b2-233589f4a9a3.png
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9320 from yinxusen/SPARK-11336.
* [SPARK-11723][ML][DOC] Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame (Yanbo Liang, 2015-11-13, 4 files changed, -19/+13)
  Use the LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrames:
  * Use the libsvm data source for all example code under examples/ml, and remove unused imports.
  * Use the libsvm data source for the ml-*** user guides that were omitted by #8697.
  * Fix bug: on the Java side we should use ```sqlContext.read().format("libsvm").load(path)```, but the API doc and user guides misuse it as ```sqlContext.read.format("libsvm").load(path)```.
  * Code cleanup.
  mengxr
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9690 from yanboliang/spark-11723.
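  For illustration, a minimal PySpark sketch of the data source usage described above (the path is a placeholder from the Spark examples tree):
  ```python
  # Load a LibSVM-formatted file into a DataFrame of (label, features) rows.
  df = sqlContext.read.format("libsvm") \
      .load("data/mllib/sample_libsvm_data.txt")
  df.show()
  ```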
* [SPARK-11445][DOCS] Replaced example code in mllib-ensembles.md using include_example (Rishabh Bhardwaj, 2015-11-13, 1 file changed, -514/+12)
  I have made the required changes and tested. Kindly review the changes.
  Author: Rishabh Bhardwaj <rbnext29@gmail.com>
  Closes #9407 from rishabhbhardwaj/SPARK-11445.
* [SPARK-11629][ML][PYSPARK][DOC] Python example code for Multilayer Perceptron Classification (Yanbo Liang, 2015-11-12, 1 file changed, -66/+5)
  Add Python example code for Multilayer Perceptron Classification, and make the example code in the user guide document testable. mengxr
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9594 from yanboliang/spark-11629.
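  A hedged sketch of what such a Python example looks like, assuming the spark.ml API (`train`, `test`, and the layer sizes are placeholders):
  ```python
  from pyspark.ml.classification import MultilayerPerceptronClassifier

  # Layers: input size, two hidden layers, output size (illustrative values).
  trainer = MultilayerPerceptronClassifier(
      layers=[4, 5, 4, 3], blockSize=128, maxIter=100, seed=1234)
  model = trainer.fit(train)           # train: DataFrame of (label, features)
  predictions = model.transform(test)  # adds a prediction column
  ```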
* [SPARK-11667] Update dynamic allocation docs to reflect supported cluster managers (Andrew Or, 2015-11-12, 1 file changed, -28/+27)
  Author: Andrew Or <andrew@databricks.com>
  Closes #9637 from andrewor14/update-da-docs.
* [SPARK-11670] Fix incorrect kryo buffer default value in docs (Andrew Or, 2015-11-12, 1 file changed, -2/+2)
  Screenshot: https://cloud.githubusercontent.com/assets/2133137/11108261/35d183d4-889a-11e5-9572-85e9d6cebd26.png
  Author: Andrew Or <andrew@databricks.com>
  Closes #9638 from andrewor14/fix-kryo-docs.
* [SPARK-11335][STREAMING] Update Kafka direct Python docs on how to get the offset ranges for a KafkaRDD (Nick Evans, 2015-11-11, 1 file changed, -1/+14)
  tdas koeninger
  This updates the Spark Streaming + Kafka Integration Guide with a working method for accessing the offsets of a `KafkaRDD` through Python.
  Author: Nick Evans <me@nicolasevans.org>
  Closes #9289 from manygrams/update_kafka_direct_python_docs.
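  The documented pattern looks roughly like the sketch below (an illustration; `directKafkaStream` is assumed to come from `KafkaUtils.createDirectStream`):
  ```python
  offsetRanges = []

  def store_offset_ranges(rdd):
      # offsetRanges() is only available on RDDs produced by the direct stream.
      global offsetRanges
      offsetRanges = rdd.offsetRanges()
      return rdd

  def print_offset_ranges(rdd):
      for o in offsetRanges:
          print("%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

  directKafkaStream \
      .transform(store_offset_ranges) \
      .foreachRDD(print_offset_ranges)
  ```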
* [SPARK-6152] Use shaded ASM5 to support closure cleaning of Java 8 compiled classes (Josh Rosen, 2015-11-11, 1 file changed, -0/+4)
  This patch modifies Spark's closure cleaner (and a few other places) to use ASM 5, which is necessary in order to support cleaning of closures that were compiled by Java 8. In order to avoid ASM dependency conflicts, Spark excludes ASM from all of its dependencies and uses a shaded version of ASM 4 that comes from `reflectasm` (see [SPARK-782](https://issues.apache.org/jira/browse/SPARK-782) and #232). This patch updates Spark to use a shaded version of ASM 5.0.4 that was published by the Apache XBean project; the POM used to create the shaded artifact can be found at https://github.com/apache/geronimo-xbean/blob/xbean-4.4/xbean-asm5-shaded/pom.xml. http://movingfulcrum.tumblr.com/post/80826553604/asm-framework-50-the-missing-migration-guide was a useful resource while upgrading the code to the new ASM 5 opcodes. I also added new regression tests in the `java8-tests` subproject; the existing tests were insufficient to catch this bug, which only affected Scala 2.11 user code compiled targeting Java 8.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #9512 from JoshRosen/SPARK-6152.
* [SPARK-11550][DOCS] Replace example code in mllib-optimization.md using include_example (Pravin Gadakh, 2015-11-10, 1 file changed, -143/+2)
  Author: Pravin Gadakh <pravingadakh177@gmail.com>
  Closes #9516 from pravingadakh/SPARK-11550.
* [SPARK-11382] Replace example code in mllib-decision-tree.md using include_example (Xusen Yin, 2015-11-10, 1 file changed, -247/+6)
  https://issues.apache.org/jira/browse/SPARK-11382
  By the way, I fixed an error in naive_bayes_example.py.
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9596 from yinxusen/SPARK-11382.
* [SPARK-11360][DOC] Loss of nullability when writing parquet files (gatorsmile, 2015-11-09, 1 file changed, -1/+2)
  This fix adds one line to explain the current behavior of Spark SQL when writing Parquet files: all columns are forced to be nullable for compatibility reasons.
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #9314 from gatorsmile/lossNull.
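  A quick PySpark illustration of that behavior (the schema and path are placeholders):
  ```python
  from pyspark.sql.types import StructType, StructField, LongType, StringType

  schema = StructType([
      StructField("id", LongType(), nullable=False),
      StructField("name", StringType(), nullable=True)])
  df = sqlContext.createDataFrame([(1, "a")], schema)
  df.write.parquet("/tmp/nullable_demo")
  # After the Parquet round trip, `id` is reported as nullable = true as well.
  sqlContext.read.parquet("/tmp/nullable_demo").printSchema()
  ```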
* [SPARK-11548][DOCS] Replaced example code in mllib-collaborative-filtering.md using include_example (Rishabh Bhardwaj, 2015-11-09, 1 file changed, -135/+3)
  Kindly review the changes.
  Author: Rishabh Bhardwaj <rbnext29@gmail.com>
  Closes #9519 from rishabhbhardwaj/SPARK-11337.
* [SPARK-11552][DOCS] Replaced example code in ml-decision-tree.md using include_example (sachin aggarwal, 2015-11-09, 1 file changed, -330/+8)
  I have tested it locally; it works fine. Please review.
  Author: sachin aggarwal <different.sachin@gmail.com>
  Closes #9539 from agsachin/SPARK-11552-real.
* [SPARK-11581][DOCS] Example mllib code in documentation incorrectly computes MSE (Bharat Lal, 2015-11-09, 1 file changed, -1/+1)
  Author: Bharat Lal <bharat.iisc@gmail.com>
  Closes #9560 from bharatl/SPARK-11581.
* [DOCS] Fix typo for Python section on unifying Kafka streams (chriskang90, 2015-11-09, 1 file changed, -2/+2)
  1) kafkaStreams is a list. The list should be unpacked when passing it into the streaming context union method, which accepts a variable number of streams.
  2) print() should be pprint() for pyspark.
  This contribution is my original work, and I license the work to the project under the project's open source license.
  Author: chriskang90 <jckang@uchicago.edu>
  Closes #9545 from c-kang/streaming_python_typo.
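  In code, the two fixes amount to the following sketch (assuming `kafkaStreams` is a Python list of DStreams and `ssc` is the StreamingContext):
  ```python
  # Unpack the list when calling union, and use pprint() in PySpark.
  unifiedStream = ssc.union(*kafkaStreams)
  unifiedStream.pprint()
  ```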
* [SPARK-10689][ML][DOC] User guide and example code for AFTSurvivalRegression (Yanbo Liang, 2015-11-09, 2 files changed, -0/+97)
  Add a user guide and example code for ```AFTSurvivalRegression```.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9491 from yanboliang/spark-10689.
* [DOC][MINOR][SQL] Fix internal link (Rohit Agarwal, 2015-11-09, 1 file changed, -1/+1)
  It doesn't show up as a hyperlink currently; it will after this change.
  Author: Rohit Agarwal <mindprince@gmail.com>
  Closes #9544 from mindprince/patch-2.
* [SPARK-10046][SQL] Hive warehouse dir not set in current directory when not … (xin Wu, 2015-11-08, 1 file changed, -2/+4)
  Doc change to align with the HiveConf default in terms of where the `warehouse` directory is created.
  Author: xin Wu <xinwu@us.ibm.com>
  Closes #9365 from xwu0226/spark-10046-commit.
* [DOC][SQL] Remove redundant out-of-place python snippet (Rohit Agarwal, 2015-11-08, 1 file changed, -9/+0)
  This snippet seems to have been mistakenly introduced in two places in #5348.
  Author: Rohit Agarwal <mindprince@gmail.com>
  Closes #9540 from mindprince/patch-1.
* [SPARK-11476][DOCS] Incorrect function referred to in MLlib random data generation documentation (Sean Owen, 2015-11-08, 1 file changed, -1/+1)
  Fix the Python example to use normalRDD as advertised.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #9529 from srowen/SPARK-11476.
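  For reference, a minimal sketch of the corrected call (the sizes are illustrative):
  ```python
  from pyspark.mllib.random import RandomRDDs

  # 1,000,000 i.i.d. samples from N(0, 1), spread over 10 partitions.
  u = RandomRDDs.normalRDD(sc, 1000000, 10)
  # Shift and scale to get samples from N(1, 4).
  v = u.map(lambda x: 1.0 + 2.0 * x)
  ```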
* [MINOR][ML][DOC] Rename weights to coefficients in user guide (Yanbo Liang, 2015-11-05, 1 file changed, -12/+12)
  We should use ```coefficients``` rather than ```weights``` in the user guide, so that newcomers pick up the conventional name from the outset. mengxr vectorijk
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9493 from yanboliang/docs-coefficients.
* [SPARK-11491] Update build to use Scala 2.10.5 (Josh Rosen, 2015-11-04, 1 file changed, -1/+1)
  Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #9450 from JoshRosen/upgrade-to-scala-2.10.5.
* [SPARK-11197][SQL] Add doc for running SQL directly on files (Wenchen Fan, 2015-11-04, 1 file changed, -0/+38)
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9467 from cloud-fan/doc.
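  The feature being documented lets you query a file without registering it as a table first, by qualifying the "table" name with the data source type. A PySpark sketch (the path is a placeholder from the Spark examples tree):
  ```python
  df = sqlContext.sql(
      "SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
  df.show()
  ```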
* [SPARK-11443] Reserve space lines (Xusen Yin, 2015-11-04, 1 file changed, -1/+1)
  The trim_codeblock(lines) function in include_example.rb was removing some blank lines from the code.
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9400 from yinxusen/SPARK-11443.
* [SPARK-11380][DOCS] Replace example code in mllib-frequent-pattern-mining.md using include_example (Pravin Gadakh, 2015-11-04, 1 file changed, -161/+7)
  Author: Pravin Gadakh <pravingadakh177@gmail.com>
  Author: Pravin Gadakh <prgadakh@in.ibm.com>
  Closes #9340 from pravingadakh/SPARK-11380.
* [DOC] Missing link to R DataFrame API doc (lewuathe, 2015-11-03, 1 file changed, -1/+1)
  Author: lewuathe <lewuathe@me.com>
  Author: Lewuathe <lewuathe@me.com>
  Closes #9394 from Lewuathe/missing-link-to-R-dataframe.
* [SPARK-11407][SPARKR] Add doc for running from RStudio (felixcheung, 2015-11-03, 1 file changed, -3/+43)
  Screenshot (without CSS, but you get the idea): https://cloud.githubusercontent.com/assets/8969467/10871746/612ba44a-80a4-11e5-99a0-40b9931dee52.png
  shivaram
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #9401 from felixcheung/rstudioprogrammingguide.
* [SPARK-11383][DOCS] Replaced example code in mllib-naive-bayes.md/mllib-isotonic-regression.md using include_example (Rishabh Bhardwaj, 2015-11-02, 2 files changed, -207/+6)
  I have made the required changes in mllib-naive-bayes.md and mllib-isotonic-regression.md and verified them. Kindly review.
  Author: Rishabh Bhardwaj <rbnext29@gmail.com>
  Closes #9353 from rishabhbhardwaj/SPARK-11383.
* [SPARK-11305][DOCS] Remove Third-Party Hadoop Distributions Doc Page (Sean Owen, 2015-11-01, 5 files changed, -125/+18)
  Remove the Hadoop third-party distro page, and move the Hadoop cluster config info to the configuration page. CC pwendell
  Author: Sean Owen <sowen@cloudera.com>
  Closes #9298 from srowen/SPARK-11305.
* [SPARK-11340][SPARKR] Support setting driver properties when starting Spark from R programmatically or from RStudio (felixcheung, 2015-10-30, 1 file changed, -8/+20)
  Maps spark.driver.memory from sparkEnvir to spark-submit command-line arguments. shivaram suggested that we possibly add other spark.driver.* properties; do we want to add all of those? I thought those could be set in SparkConf? sun-rui
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #9290 from felixcheung/rdrivermem.
* [SPARK-11318] Include hive profile in make-distribution.sh command (tedyu, 2015-10-29, 1 file changed, -1/+1)
  Author: tedyu <yuzhihong@gmail.com>
  Closes #9281 from tedyu/master.
* Typo in mllib-evaluation-metrics.md (Mageswaran.D, 2015-10-28, 1 file changed, -2/+2)
  The recall-by-threshold snippet was using "precisionByThreshold".
  Author: Mageswaran.D <mageswaran1989@gmail.com>
  Closes #9333 from Mageswaran1989/Typo_in_mllib-evaluation-metrics.md.
* [SPARK-11297] Add new code tags (Xusen Yin, 2015-10-26, 1 file changed, -0/+4)
  mengxr https://issues.apache.org/jira/browse/SPARK-11297
  Add new code tags to keep the same look and feel as previous documents.
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9265 from yinxusen/SPARK-11297.
* [SPARK-11289][DOC] Substitute code examples in ML features extractors with include_example (Xusen Yin, 2015-10-26, 1 file changed, -209/+8)
  mengxr https://issues.apache.org/jira/browse/SPARK-11289
  I make some changes in the ML feature extractors, i.e. TF-IDF, Word2Vec, and CountVectorizer. I add new example code in spark/examples; I hope it is the right place for those examples.
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9266 from yinxusen/SPARK-11289.
* [SPARK-11299][DOC] Fix link to Scala DataFrame Functions reference (Josh Rosen, 2015-10-25, 1 file changed, -1/+1)
  The SQL programming guide's link to the DataFrame functions reference points to the wrong location; this patch fixes that.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #9269 from JoshRosen/SPARK-11299.
* [SPARK-10971][SPARKR] RRunner should allow setting path to Rscript (Sun Rui, 2015-10-23, 1 file changed, -0/+18)
  Add a new Spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes. The existing Spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes, for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395). BTW, [environment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate the R shell on the local host.
  For your information, PySpark has two environment variables serving a similar purpose:
  PYSPARK_PYTHON: Python binary executable to use for PySpark in both driver and workers (default is `python`).
  PYSPARK_DRIVER_PYTHON: Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON).
  PySpark uses the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the Python executable for a Python script.
  Author: Sun Rui <rui.sun@intel.com>
  Closes #9179 from sun-rui/SPARK-10971.
* [SPARK-10382] Make example code in user guide testable (Xusen Yin, 2015-10-23, 1 file changed, -0/+96)
  A proof-of-concept for making example code in the user guide testable. mengxr We still need to talk about the labels in code.
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9109 from yinxusen/SPARK-10382.
* Fix typo "Received" to "Receiver" in streaming-kafka-integration.mdRohan Bhanderi2015-10-231-1/+1
| | | | | | | | Removed typo on line 8 in markdown : "Received" -> "Receiver" Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu> Closes #9242 from RohanBhanderi/patch-1.
* [SPARK-10708] Consolidate sort shuffle implementations (Josh Rosen, 2015-10-22, 1 file changed, -5/+2)
  There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality, and that UnsafeShuffleManager supports large records, I think we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and merge the two managers together.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #8829 from JoshRosen/consolidate-sort-shuffle-implementations.
* [SPARK-11105][YARN] Distribute log4j.properties to executors (vundela, 2015-10-20, 1 file changed, -1/+4)
  Currently the log4j.properties file is not uploaded to executors, which leads them to use the default values. This fix makes sure the file is always uploaded to the distributed cache so that executors use the latest settings. If the user specifies log configurations through --files, then executors pick up the configs from --files instead of $SPARK_CONF_DIR/log4j.properties.
  Author: vundela <vsr@cloudera.com>
  Author: Srinivasa Reddy Vundela <vsr@cloudera.com>
  Closes #9118 from vundela/master.
* [SPARK-11174][DOCS] Fix typo in the GraphX programming guide (Lukasz Piepiora, 2015-10-18, 1 file changed, -1/+1)
  This patch fixes a small typo in the GraphX programming guide.
  Author: Lukasz Piepiora <lpiepiora@gmail.com>
  Closes #9160 from lpiepiora/11174-fix-typo-in-graphx-programming-guide.
* Fix typo bellow -> below (Britta Weber, 2015-10-15, 2 files changed, -3/+3)
  Author: Britta Weber <britta.weber@elasticsearch.com>
  Closes #9136 from brwe/typo-bellow.
* [SPARK-11039][Documentation][Web UI] Document additional ui configurations (Nick Pritchard, 2015-10-15, 1 file changed, -0/+14)
  Add documentation for these configurations:
  - spark.sql.ui.retainedExecutions
  - spark.streaming.ui.retainedBatches
  Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
  Closes #9052 from pnpritchard/SPARK-11039.
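  For context, these settings cap how much completed-work metadata the web UI retains. A sketch of setting them (the values are illustrative, not the defaults):
  ```python
  from pyspark import SparkConf

  conf = (SparkConf()
          .set("spark.sql.ui.retainedExecutions", "100")
          .set("spark.streaming.ui.retainedBatches", "100"))
  ```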
* [SPARK-10983] Unified memory manager (Andrew Or, 2015-10-13, 1 file changed, -29/+70)
  This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from the other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced:
  - spark.memory.fraction (default 0.75): fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached-data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.
  - spark.memory.storageFraction (default 0.5): size of the storage region within the space set aside by `spark.memory.fraction`. Cached data may only be evicted if total storage exceeds this region.
  - spark.memory.useLegacyMode (default false): whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility.
  For a detailed description of the design, see [SPARK-10000](https://issues.apache.org/jira/browse/SPARK-10000). This patch builds on top of the `MemoryManager` interface introduced in #9000.
  Author: Andrew Or <andrew@databricks.com>
  Closes #9084 from andrewor14/unified-memory-manager.
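  A minimal sketch of setting these knobs from PySpark, using the defaults quoted in the list above:
  ```python
  from pyspark import SparkConf, SparkContext

  conf = (SparkConf()
          .set("spark.memory.fraction", "0.75")         # execution + storage share of heap
          .set("spark.memory.storageFraction", "0.5")   # eviction-protected storage share
          .set("spark.memory.useLegacyMode", "false"))  # pre-1.6 behavior if true
  sc = SparkContext(conf=conf)
  ```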
* [SPARK-10739][YARN] Add application attempt window for Spark on Yarn (jerryshao, 2015-10-12, 1 file changed, -0/+9)
  Add an application attempt window for Spark on YARN so that old, out-of-window failures are ignored; this is useful for long-running applications recovering from failures.
  Author: jerryshao <sshao@hortonworks.com>
  Closes #8857 from jerryshao/SPARK-10739 and squashes the following commits:
  36eabdc [jerryshao] change the doc
  7f9b77d [jerryshao] Style change
  1c9afd0 [jerryshao] Address the comments
  caca695 [jerryshao] Add application attempt window for Spark on Yarn
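  The message does not name the new key; assuming it is the spark.yarn.am.attemptFailuresValidityInterval setting documented in later Spark releases, usage would look like:
  ```python
  from pyspark import SparkConf

  # Assumption: this is the key added by SPARK-10739. AM attempt failures
  # older than this window are not counted toward the max-attempts limit.
  conf = SparkConf().set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
  ```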
* [SPARK-11056] Improve documentation of SBT build (Kay Ousterhout, 2015-10-12, 1 file changed, -0/+5)
  This commit improves the documentation around building Spark to (1) recommend using SBT interactive mode to avoid the overhead of launching SBT and (2) refer to the wiki page that documents using SPARK_PREPEND_CLASSES to avoid creating the assembly jar for each compile. cc srowen
  Author: Kay Ousterhout <kayousterhout@gmail.com>
  Closes #9068 from kayousterhout/SPARK-11056.
* [SPARK-10883] Add a note about how to build Spark sub-modules (reactor) (Jean-Baptiste Onofré, 2015-10-08, 1 file changed, -0/+11)
  Author: Jean-Baptiste Onofré <jbonofre@apache.org>
  Closes #8993 from jbonofre/SPARK-10883-2.
* Akka framesize units should be specified (admackin, 2015-10-08, 1 file changed, -1/+1)
  The 1.4 docs noted that the units were MB; I have assumed this is still the case.
  Author: admackin <admackin@users.noreply.github.com>
  Closes #9025 from admackin/master.
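  The message does not name the setting; assuming it refers to spark.akka.frameSize, the value is a bare number interpreted as megabytes:
  ```python
  from pyspark import SparkConf

  # Assumption: spark.akka.frameSize is the setting in question; "128" means 128 MB.
  conf = SparkConf().set("spark.akka.frameSize", "128")
  ```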
* [SPARK-10669][DOCS] Link to each language's API in codetabs in ML docs: spark.mllib (Xin Ren, 2015-10-07, 15 files changed, -30/+274)
  In the Markdown docs for the spark.mllib Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want, see the "ChiSqSelector" section in https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md This JIRA is just for spark.mllib, not spark.ml. Please let me know if more work is needed; thanks a lot.
  Author: Xin Ren <iamshrek@126.com>
  Closes #8977 from keypointt/SPARK-10669.