spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-11965][ML][DOC] Update user guide for RFormula feature interactions	Yanbo Liang	2016-01-25	1	-1/+19
	Update the user guide for RFormula feature interactions, and also cover other features new in Spark 1.6, such as support for string labels.
	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #10222 from yanboliang/spark-11965.
*	[SPARK-12760][DOCS] inaccurate description for difference between local vs cluster mode in closure handling	Sean Owen	2016-01-23	1	-4/+4
	Clarify that modifying a driver local variable won't have the desired effect in cluster modes, and may or may not work as intended in local mode.
	Author: Sean Owen <sowen@cloudera.com>
	Closes #10866 from srowen/SPARK-12760.
*	[SPARK-12760][DOCS] invalid lambda expression in python example for local vs cluster	Mortada Mehyar	2016-01-23	1	-2/+5
	srowen thanks for the PR at https://github.com/apache/spark/pull/10866! Sorry it took me a while.
	This is related to https://github.com/apache/spark/pull/10866: the assignment in the lambda expression in the Python example is invalid.
	```
	In [1]: data = [1, 2, 3, 4, 5]
	In [2]: counter = 0
	In [3]: rdd = sc.parallelize(data)
	In [4]: rdd.foreach(lambda x: counter += x)
	  File "<ipython-input-4-fcb86c182bad>", line 1
	    rdd.foreach(lambda x: counter += x)
	                                  ^
	SyntaxError: invalid syntax
	```
	Author: Mortada Mehyar <mortada.mehyar@gmail.com>
	Closes #10867 from mortada/doc_python_fix.
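The failure above is a Python syntax rule rather than anything Spark-specific, so it can be reproduced without a SparkContext. The sketch below is only an illustration: the helper names `lambda_compiles` and `sum_with_named_function` are hypothetical, and a plain loop stands in for `rdd.foreach`. Even with valid syntax, updating a driver-side variable from a closure is not expected to work in cluster mode, as the previous commit notes.

```python
def lambda_compiles(source):
    """Return True if `source` is a valid Python expression, else False."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


def sum_with_named_function(data):
    """Accumulate values using a named function instead of a lambda."""
    counter = 0

    def increment_counter(x):
        # A def body may contain statements, so augmented assignment is legal.
        nonlocal counter
        counter += x

    # Plain local loop standing in for rdd.foreach(increment_counter).
    for value in data:
        increment_counter(value)
    return counter


# A lambda body must be a single expression; `counter += x` is a statement.
invalid_lambda_ok = lambda_compiles("lambda x: counter += x")
valid_lambda_ok = lambda_compiles("lambda x: counter + x")
total = sum_with_named_function([1, 2, 3, 4, 5])
print(invalid_lambda_ok, valid_lambda_ok, total)
```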
*	[SPARK-7997][CORE] Remove Akka from Spark Core and Streaming	Shixiong Zhu	2016-01-22	3	-88/+9
	- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
	- Remove HttpFileServer.
	- Remove Akka configs from SparkConf and SSLOptions.
	- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth keeping this config because using `DirectTaskResult` or `IndirectTaskResult` depends on it.
	- Update comments and docs.
	Author: Shixiong Zhu <shixiong@databricks.com>
	Closes #10854 from zsxwing/remove-akka.
*	[SPARK-12534][DOC] update documentation to list command line equivalent to properties	felixcheung	2016-01-21	3	-6/+36
	Several Spark properties equivalent to Spark submit command line options are missing.
	Author: felixcheung <felixcheung_m@hotmail.com>
	Closes #10491 from felixcheung/sparksubmitdoc.
*	[SPARK-12204][SPARKR] Implement drop method for DataFrame in SparkR.	Sun Rui	2016-01-20	1	-0/+13
	Author: Sun Rui <rui.sun@intel.com>
	Closes #10201 from sun-rui/SPARK-12204.
*	[SPARK-7799][SPARK-12786][STREAMING] Add "streaming-akka" project	Shixiong Zhu	2016-01-20	2	-12/+41
	Include the following changes:
	1. Add "streaming-akka" project and org.apache.spark.streaming.akka.AkkaUtils for creating an actorStream
	2. Remove "StreamingContext.actorStream" and "JavaStreamingContext.actorStream"
	3. Update the ActorWordCount example and add the JavaActorWordCount example
	4. Make "streaming-zeromq" depend on "streaming-akka" and update the code accordingly
	Author: Shixiong Zhu <shixiong@databricks.com>
	Closes #10744 from zsxwing/streaming-akka-2.
*	[SPARK-12232][SPARKR] New R API for read.table to avoid name conflict	felixcheung	2016-01-19	1	-7/+4
	shivaram sorry it took longer to fix some conflicts. This is the change to add an alias for `table`.
	Author: felixcheung <felixcheung_m@hotmail.com>
	Closes #10406 from felixcheung/readtable.
*	[SPARK-2750][WEB UI] Add https support to the Web UI	scwf	2016-01-19	2	-6/+65
	Author: scwf <wangfei1@huawei.com>
	Author: Marcelo Vanzin <vanzin@cloudera.com>
	Author: WangTaoTheTonic <wangtao111@huawei.com>
	Author: w00228970 <wangfei1@huawei.com>
	Closes #10238 from vanzin/SPARK-2750.
*	[SPARK-12894][DOCUMENT] Add deploy instructions for Python in Kinesis integration doc	Shixiong Zhu	2016-01-18	1	-2/+12
	This PR adds instructions for getting the Kinesis assembly jar for Python users to the Kinesis integration page, as in the Kafka doc.
	Author: Shixiong Zhu <shixiong@databricks.com>
	Closes #10822 from zsxwing/kinesis-doc.
*	[SPARK-12814][DOCUMENT] Add deploy instructions for Python in flume integration doc	Shixiong Zhu	2016-01-18	2	-4/+13
	This PR adds instructions for getting the flume assembly jar for Python users to the flume integration page, as in the Kafka doc.
	Author: Shixiong Zhu <shixiong@databricks.com>
	Closes #10746 from zsxwing/flume-doc.
*	[SPARK-12722][DOCS] Fixed typo in Pipeline example	Jeff Lam	2016-01-16	1	-2/+2
	http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline
	```
	val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model")
	```
	should be
	```
	val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
	```
	cc: jkbradley
	Author: Jeff Lam <sha0lin@alumni.carnegiemellon.edu>
	Closes #10769 from Agent007/SPARK-12722.
*	[SPARK-12842][TEST-HADOOP2.7] Add Hadoop 2.7 build profile	Josh Rosen	2016-01-15	1	-1/+2
	This patch adds a Hadoop 2.7 build profile in order to let us automate tests against that version.
	/cc rxin srowen
	Author: Josh Rosen <joshrosen@databricks.com>
	Closes #10775 from JoshRosen/add-hadoop-2.7-profile.
*	[SPARK-2930] clarify docs on using webhdfs with spark.yarn.access.namenodes	Tom Graves	2016-01-15	1	-4/+4
	Author: Tom Graves <tgraves@yahoo-inc.com>
	Closes #10699 from tgravescs/SPARK-2930.
*	[SPARK-12703][MLLIB][DOC][PYTHON] Fixed pyspark.mllib.clustering.KMeans user guide example	Joseph K. Bradley	2016-01-13	1	-5/+1
	Fixed WSSSE computeCost in Python mllib KMeans user guide example by using new computeCost method API in Python.
	Author: Joseph K. Bradley <joseph@databricks.com>
	Closes #10707 from jkbradley/kmeans-doc-fix.
*	[SPARK-12805][MESOS] Fixes documentation on Mesos run modes	Luc Bourlier	2016-01-13	1	-7/+5
	The default run mode has changed, but the documentation didn't fully reflect the change.
	Author: Luc Bourlier <luc.bourlier@typesafe.com>
	Closes #10740 from skyluc/issue/mesos-modes-doc.
*	[SPARK-5273][MLLIB][DOCS] Improve documentation examples for LinearRegression	Sean Owen	2016-01-12	1	-3/+5
	Use a much smaller step size in LinearRegressionWithSGD MLlib examples to achieve a reasonable RMSE. Our training folks hit this exact same issue when concocting an example and had the same solution.
	Author: Sean Owen <sowen@cloudera.com>
	Closes #10675 from srowen/SPARK-5273.
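Why a smaller step size matters can be seen in a toy setting. The sketch below is not MLlib's `LinearRegressionWithSGD`; it is plain full-batch gradient descent on a single weight with made-up data, and the function name `train_rmse` is hypothetical. With unscaled features, a large step makes each update overshoot and the error grows, while a small step converges.

```python
import math


def train_rmse(xs, ys, step_size, iterations):
    """Fit y = w * x by batch gradient descent and return the training RMSE."""
    w = 0.0
    n = len(xs)
    for _ in range(iterations):
        # Gradient of (1 / 2n) * sum((w * x - y)^2) with respect to w.
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= step_size * grad
    mse = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    return math.sqrt(mse)


# Made-up data: y = 2x with unscaled features 1..10.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x for x in xs]

rmse_large_step = train_rmse(xs, ys, step_size=1.0, iterations=20)
rmse_small_step = train_rmse(xs, ys, step_size=0.01, iterations=20)
print(rmse_large_step, rmse_small_step)
```

For this data the update is stable only when the step size is below 2 divided by the mean of x squared (about 0.052), which is why the step of 1.0 diverges and 0.01 does not.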
*	[SPARK-12758][SQL] add note to Spark SQL Migration guide about TimestampType casting	Brandon Bradley	2016-01-11	1	-0/+5
	Warning users about casting changes.
	Author: Brandon Bradley <bradleytastic@gmail.com>
	Closes #10708 from blbradley/spark-12758.
*	[SPARK-12735] Consolidate & move spark-ec2 to AMPLab managed repository.	Reynold Xin	2016-01-09	4	-199/+2
	Author: Reynold Xin <rxin@databricks.com>
	Closes #10673 from rxin/SPARK-12735.
*	[SPARK-4819] Remove Guava's "Optional" from public API	Sean Owen	2016-01-08	1	-1/+0
	Replace Guava `Optional` with (an API clone of) Java 8 `java.util.Optional` (edit: and a clone of Guava `Optional`).
	See also https://github.com/apache/spark/pull/10512
	Author: Sean Owen <sowen@cloudera.com>
	Closes #10513 from srowen/SPARK-4819.
*	[DOCUMENTATION] doc fix of job scheduling	Jeff Zhang	2016-01-08	1	-1/+1
	spark.shuffle.service.enabled is a Spark application configuration, so it is not necessary to set it in yarn-site.xml.
	Author: Jeff Zhang <zjffdu@apache.org>
	Closes #10657 from zjffdu/doc-fix.
*	[SPARK-12507][STREAMING][DOCUMENT] Expose closeFileAfterWrite and allowBatching configurations for Streaming	Shixiong Zhu	2016-01-07	2	-7/+23
	/cc tdas brkyvz
	Author: Shixiong Zhu <shixiong@databricks.com>
	Closes #10453 from zsxwing/streaming-conf.
*	[STREAMING][DOCS][EXAMPLES] Minor fixes	Jacek Laskowski	2016-01-07	1	-4/+4
	Author: Jacek Laskowski <jacek@japila.pl>
	Closes #10603 from jaceklaskowski/streaming-actor-custom-receiver.
*	[DOC] fix 'spark.memory.offHeap.enabled' default value to false	zzcclp	2016-01-06	1	-1/+1
	Change the documented default value of 'spark.memory.offHeap.enabled' to false.
	Author: zzcclp <xm_zzc@sina.com>
	Closes #10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.
*	[SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0	Josh Rosen	2016-01-06	1	-11/+0
	This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.
	Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted / deleted, leading to hard-to-diagnose bugs.
	For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads.
	Author: Josh Rosen <joshrosen@databricks.com>
	Closes #10534 from JoshRosen/remove-ttl-based-cleaning.
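The hazard described above, a TTL set too low evicting data that is still in use, can be illustrated with a toy store. This is a minimal sketch under stated assumptions, not Spark's cleaning code: the class `TtlMetadataStore`, its methods, and the key name are all hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
class TtlMetadataStore:
    """Toy key-value store that evicts entries purely by age."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._entries = {}  # key -> (value, insertion_time)

    def put(self, key, value, now):
        self._entries[key] = (value, now)

    def get(self, key):
        entry = self._entries.get(key)
        return None if entry is None else entry[0]

    def clean(self, now):
        """Drop every entry older than the TTL, whether or not it is in use."""
        expired = [key for key, (_, inserted) in self._entries.items()
                   if now - inserted > self.ttl_seconds]
        for key in expired:
            del self._entries[key]
        return expired


# A long-running job registers metadata at t=0 and still needs it at t=30,
# but the TTL of 10 seconds knows nothing about that.
store = TtlMetadataStore(ttl_seconds=10)
store.put("rdd_1_metadata", "partition locations", now=0)
evicted = store.clean(now=30)
lookup_after_clean = store.get("rdd_1_metadata")
print(evicted, lookup_after_clean)
```

Age-based eviction has no notion of reachability, which is the gap that reference-based cleanup avoids.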
*	[SPARK-12368][ML][DOC] Better doc for the binary classification evaluator's metricName	BenFradet	2016-01-06	1	-2/+2
	For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC".
	Also, in the documentation, it is said that: "The default metric used to choose the best ParamMap can be overridden by the setMetric method in each of these evaluators." However, the method is called setMetricName.
	This PR aims to fix both issues.
	Author: BenFradet <benjamin.fradet@gmail.com>
	Closes #10328 from BenFradet/SPARK-12368.
*	[SPARK-12570][ML][DOC] DecisionTreeRegressor: provide variance of prediction: user guide update	Yanbo Liang	2016-01-05	1	-1/+10
	Update user guide doc for `DecisionTreeRegressor` providing variance of prediction.
	cc jkbradley
	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #10594 from yanboliang/spark-12570.
*	[SPARKR][DOC] minor doc update for version in migration guide	felixcheung	2016-01-05	1	-3/+3
	Checked that the change is in Spark 1.6.0.
	shivaram
	Author: felixcheung <felixcheung_m@hotmail.com>
	Closes #10574 from felixcheung/rwritemodedoc.
*	[SPARK-12579][SQL] Force user-specified JDBC driver to take precedence	Josh Rosen	2016-01-04	1	-3/+1
	Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection.
	In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection.
	This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly).
	If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors; this guards against corner-case bugs in situations where the driver and executor JVMs might have different sets of JDBC drivers on their classpaths (previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths were different).
	This patch is inspired by a similar patch that I made to the `spark-redshift` library (https://github.com/databricks/spark-redshift/pull/143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons).
	Author: Josh Rosen <joshrosen@databricks.com>
	Closes #10519 from JoshRosen/jdbc-driver-precedence.
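The selection problem described above can be modelled in a few lines. The sketch below is a toy Python model, not Spark's or JDBC's actual code: `ToyDriver`, `ToyDriverManager`, `connect_with_user_driver`, the `com.example` class names, and the `exampledb` subprotocol are all hypothetical. It shows why "first registered driver that accepts the URL" can ignore the user's choice, and how iterating over the registered drivers by class name avoids that.

```python
class ToyDriver:
    """Stand-in for a JDBC driver that claims one subprotocol."""

    def __init__(self, class_name, subprotocol):
        self.class_name = class_name
        self.subprotocol = subprotocol

    def accepts_url(self, url):
        return url.startswith("jdbc:" + self.subprotocol + ":")

    def connect(self, url):
        return "connection to {} via {}".format(url, self.class_name)


class ToyDriverManager:
    """Stand-in for a registry that picks the first driver accepting a URL."""

    def __init__(self):
        self.drivers = []

    def register(self, driver):
        self.drivers.append(driver)

    def get_connection(self, url):
        for driver in self.drivers:
            if driver.accepts_url(url):
                return driver.connect(url)
        raise LookupError("no driver accepts " + url)


def connect_with_user_driver(manager, driver_class_name, url):
    """Scan the registered drivers and use the one the user asked for."""
    for driver in manager.drivers:
        if driver.class_name == driver_class_name:
            return driver.connect(url)
    raise LookupError("driver not registered: " + driver_class_name)


# Two hypothetical drivers both claim the same subprotocol.
manager = ToyDriverManager()
manager.register(ToyDriver("com.example.VendorADriver", "exampledb"))
manager.register(ToyDriver("com.example.VendorBDriver", "exampledb"))

url = "jdbc:exampledb://host/db"
default_choice = manager.get_connection(url)
explicit_choice = connect_with_user_driver(
    manager, "com.example.VendorBDriver", url)
print(default_choice)
print(explicit_choice)
```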
*	[SPARK-12588] Remove HttpBroadcast in Spark 2.0.	Reynold Xin	2015-12-30	2	-28/+4
	We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0.
	Author: Reynold Xin <rxin@databricks.com>
	Closes #10531 from rxin/SPARK-12588.
*	[SPARK-12429][STREAMING][DOC] Add Accumulator and Broadcast example for Streaming	Shixiong Zhu	2015-12-22	2	-3/+168
	This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing.
	Author: Shixiong Zhu <shixiong@databricks.com>
	Closes #10385 from zsxwing/accumulator-broadcast-example.
*	[SPARK-12487][STREAMING][DOCUMENT] Add docs for Kafka message handler	Shixiong Zhu	2015-12-22	1	-0/+3
	Author: Shixiong Zhu <shixiong@databricks.com>
	Closes #10439 from zsxwing/kafka-message-handler-doc.
*	[SPARK-11807] Remove support for Hadoop < 2.2	Reynold Xin	2015-12-21	1	-14/+4
	i.e. Hadoop 1 and Hadoop 2.0
	Author: Reynold Xin <rxin@databricks.com>
	Closes #10404 from rxin/SPARK-11807.
*	[SPARK-12388] change default compression to lz4	Davies Liu	2015-12-21	1	-1/+1
	According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy.
	After changing the compressor to LZ4, I saw a 20% improvement in end-to-end time for a TPCDS query (Q4).
	[1] https://github.com/ning/jvm-compressor-benchmark/wiki
	cc rxin
	Author: Davies Liu <davies@databricks.com>
	Closes #10342 from davies/lz4.
*	[SPARK-11808] Remove Bagel.	Reynold Xin	2015-12-19	2	-160/+0
	Author: Reynold Xin <rxin@databricks.com>
	Closes #10395 from rxin/SPARK-11808.
*	Bump master version to 2.0.0-SNAPSHOT.	Reynold Xin	2015-12-19	1	-2/+2
	Author: Reynold Xin <rxin@databricks.com>
	Closes #10387 from rxin/version-bump.
*	[SPARK-12091] [PYSPARK] Deprecate the JAVA-specific deserialized storage levels	gatorsmile	2015-12-18	2	-7/+10
	The current default storage level of the Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official document and RDD APIs.
	davies Is this inconsistency intentional? Thanks!
	Updates: Since the data is always serialized on the Python side, the JAVA-specific deserialized storage levels, such as MEMORY_ONLY, are not removed.
	Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.
	Author: gatorsmile <gatorsmile@gmail.com>
	Closes #10092 from gatorsmile/persistStorageLevel.
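What "always serialized with Pickle" means in practice can be shown with the standard library alone. The sketch below only illustrates the pickle round trip on made-up records; it does not reproduce PySpark's internal serializers or storage path.

```python
import pickle

# Made-up records standing in for the elements of a persisted RDD.
records = [("spark", 1), ("pyspark", 2)]

# Python objects are turned into bytes before they are stored...
payload = pickle.dumps(records)

# ...and turned back into objects when they are read again.
restored = pickle.loads(payload)

print(type(payload).__name__, restored == records)
```

Because the stored form is already a byte string, a separate "serialized" variant of each storage level adds nothing on the Python side, which is the reasoning given in the commit message.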
*	[SPARK-11985][STREAMING][KINESIS][DOCS] Update Kinesis docs	Burak Yavuz	2015-12-18	1	-9/+45
	- Provide example on `message handler`
	- Provide bit on KPL record de-aggregation
	- Fix typos
	Author: Burak Yavuz <brkyvz@gmail.com>
	Closes #9970 from brkyvz/kinesis-docs.
*	[SPARK-11608][MLLIB][DOC] Added migration guide for MLlib 1.6	Joseph K. Bradley	2015-12-16	2	-15/+42
	No known breaking changes, but some deprecations and changes of behavior.
	CC: mengxr
	Author: Joseph K. Bradley <joseph@databricks.com>
	Closes #10235 from jkbradley/mllib-guide-update-1.6.
*	[SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means	Yu ISHIKAWA	2015-12-16	2	-0/+36
	This PR includes only example code in order to finish it quickly. I'll send another PR for the docs soon.
	Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
	Closes #9952 from yu-iskw/SPARK-6518.
*	[SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml	Yu ISHIKAWA	2015-12-16	1	-0/+71
	cc jkbradley
	Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
	Closes #10244 from yu-iskw/SPARK-12215.
*	[SPARK-12318][SPARKR] Save mode in SparkR should be error by default	Jeff Zhang	2015-12-16	1	-1/+8
	shivaram Please help review.
	Author: Jeff Zhang <zjffdu@apache.org>
	Closes #10290 from zjffdu/SPARK-12318.
*	[SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation	Timothy Hunter	2015-12-16	3	-33/+141
	This fixes the sidebar, using a pure CSS mechanism to hide it when the browser's viewport is too narrow. Credit goes to the original author Titan-C (mentioned in the NOTICE).
	Note that I am not a CSS expert, so I can only address comments to some extent.
	Default view:
	<img width="936" alt="screen shot 2015-12-14 at 12 46 39 pm" src="https://cloud.githubusercontent.com/assets/7594753/11793597/6d1d6eda-a261-11e5-836b-6eb2054e9054.png">
	When collapsed manually by the user:
	<img width="1004" alt="screen shot 2015-12-14 at 12 54 02 pm" src="https://cloud.githubusercontent.com/assets/7594753/11793669/c991989e-a261-11e5-8bf6-aecf3bdb6319.png">
	Disappears when column is too narrow:
	<img width="697" alt="screen shot 2015-12-14 at 12 47 22 pm" src="https://cloud.githubusercontent.com/assets/7594753/11793607/7754dbcc-a261-11e5-8b15-e0d074b0e47c.png">
	Can still be opened by the user if necessary:
	<img width="651" alt="screen shot 2015-12-14 at 12 51 15 pm" src="https://cloud.githubusercontent.com/assets/7594753/11793612/7bf82968-a261-11e5-9cc3-e827a7a6b2b0.png">
	Author: Timothy Hunter <timhunter@databricks.com>
	Closes #10297 from thunterdb/12324.
*	[SPARK-10123][DEPLOY] Support specifying deploy mode from configuration	jerryshao	2015-12-15	1	-3/+12
	Please help to review, thanks a lot.
	Author: jerryshao <sshao@hortonworks.com>
	Closes #10195 from jerryshao/SPARK-10123.
*	[SPARK-12351][MESOS] Add documentation about submitting Spark with mesos cluster mode.	Timothy Chen	2015-12-15	2	-6/+35
	Adding more documentation about submitting jobs with mesos cluster mode.
	Author: Timothy Chen <tnachen@gmail.com>
	Closes #10086 from tnachen/mesos_supervise_docs.
*	[MINOR][DOC] Fix broken word2vec link	BenFradet	2015-12-14	1	-1/+1
	Follow-up of [SPARK-12199](https://issues.apache.org/jira/browse/SPARK-12199) and #10193 where a broken link has been left as is.
	Author: BenFradet <benjamin.fradet@gmail.com>
	Closes #10282 from BenFradet/SPARK-12199.
*	[SPARK-12199][DOC] Follow-up: Refine example code in ml-features.md	Xusen Yin	2015-12-12	1	-11/+11
	https://issues.apache.org/jira/browse/SPARK-12199
	Follow-up PR of SPARK-11551. Fix some errors in ml-features.md.
	mengxr
	Author: Xusen Yin <yinxusen@gmail.com>
	Closes #10193 from yinxusen/SPARK-12199.
*	[SPARK-12217][ML] Document invalid handling for StringIndexer	BenFradet	2015-12-11	1	-0/+36
	Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation. I wonder if I should also add a snippet to the code example, input welcome.
	Author: BenFradet <benjamin.fradet@gmail.com>
	Closes #10257 from BenFradet/SPARK-12217.
*	[SPARK-11964][DOCS][ML] Add in Pipeline Import/Export Documentation	anabranch	2015-12-11	1	-0/+13
	Adding in Pipeline Import and Export Documentation.
	Author: anabranch <wac.chambers@gmail.com>
	Author: Bill Chambers <wchambers@ischool.berkeley.edu>
	Closes #10179 from anabranch/master.
*	[STREAMING][DOC][MINOR] Update the description of direct Kafka stream doc	jerryshao	2015-12-10	1	-1/+1
	With the merge of [SPARK-8337](https://issues.apache.org/jira/browse/SPARK-8337), the Python API now has the same functionality as Scala/Java, so this changes the description to make it more precise.
	zsxwing tdas, please review, thanks a lot.
	Author: jerryshao <sshao@hortonworks.com>
	Closes #10246 from jerryshao/direct-kafka-doc-update.