| Commit message | Author | Age | Files | Lines |
are masked by functions with same name in SparkR
Added tests for functions that are reported as masked, to make sure the base:: or stats:: function can be called.
For those we can't call, added them to the SparkR programming guide.
It would seem to me that `table`, `sample`, `subset`, `filter`, and `cov` not working is not actually expected - I investigated/experimented with them but couldn't get them to work. It looks like, as they are defined in base or stats, they are missing the S3 generic, e.g.
```
> methods("transform")
[1] transform,ANY-method transform.data.frame
[3] transform,DataFrame-method transform.default
see '?methods' for accessing help and source code
> methods("subset")
[1] subset.data.frame subset,DataFrame-method subset.default
[4] subset.matrix
see '?methods' for accessing help and source code
Warning message:
In .S3methods(generic.function, class, parent.frame()) :
function 'subset' appears not to be S3 generic; found functions that look like S3 methods
```
Any idea?
More information on masking:
http://www.ats.ucla.edu/stat/r/faq/referencing_objects.htm
http://www.sfu.ca/~sweldon/howTo/guide4.pdf
This is what the output doc looks like (minus css):
![image](https://cloud.githubusercontent.com/assets/8969467/11229714/2946e5de-8d4d-11e5-94b0-dda9696b6fdd.png)
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9785 from felixcheung/rmasked.
codes
This PR includes:
* Update SparkR:::glm, SparkR:::summary API docs.
* Update SparkR machine learning user guide and example code to show:
* supporting feature interaction in R formula.
* summary for gaussian GLM model.
* coefficients for binomial GLM model.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9727 from yanboliang/spark-11684.
Based on my conversations with people, I believe the consensus is that the coarse-grained mode is more stable and easier to reason about. It is best to use it as the default rather than the flakier fine-grained mode.
Author: Reynold Xin <rxin@databricks.com>
Closes #9795 from rxin/SPARK-11809.
JIRA issue https://issues.apache.org/jira/browse/SPARK-11728.
The ml-ensembles.md file contains `OneVsRestExample`. Instead of writing two new `OneVsRestExample` code files, I use two existing files in the examples directory: `OneVsRestExample.scala` and `JavaOneVsRestExample.scala`.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9716 from yinxusen/SPARK-11728.
JIRA link: https://issues.apache.org/jira/browse/SPARK-11729
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9713 from yinxusen/SPARK-11729.
This PR adds a new option `spark.sql.hive.thriftServer.singleSession` for disabling multi-session support in the Thrift server.
Note that this option is added as a Spark configuration (retrieved from `SparkConf`) rather than a Spark SQL configuration (retrieved from `SQLConf`). This is because all SQL configurations are session-scoped. Since multi-session support is on by default, no JDBC connection could modify a global configuration like the newly added one.
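The reasoning above — a session-scoped setting cannot act as a global switch — can be sketched in a few lines of plain Python (hypothetical class names; a simplified model of SparkConf vs. SQLConf, not the actual Spark classes):

```python
class GlobalConf:
    """Models SparkConf: one shared settings map, read once at server startup."""
    def __init__(self, settings):
        self.settings = dict(settings)

    def get(self, key, default=None):
        return self.settings.get(key, default)


class SessionConf:
    """Models SQLConf: each JDBC session gets its own copy, so a SET
    command in one session never affects another session."""
    def __init__(self, parent):
        self.settings = dict(parent.settings)

    def set(self, key, value):
        self.settings[key] = value

    def get(self, key, default=None):
        return self.settings.get(key, default)


base = GlobalConf({"spark.sql.hive.thriftServer.singleSession": "false"})
s1, s2 = SessionConf(base), SessionConf(base)

# Session 1 flips the flag, but the change stays local to that session --
# which is why a truly global switch must live in the global SparkConf.
s1.set("spark.sql.hive.thriftServer.singleSession", "true")
print(s2.get("spark.sql.hive.thriftServer.singleSession"))  # false
```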
Author: Cheng Lian <lian@databricks.com>
Closes #9740 from liancheng/spark-11089.single-session-option.
MESOS_NATIVE_LIBRARY was renamed in favor of MESOS_NATIVE_JAVA_LIBRARY. This commit fixes the reference in the documentation.
Author: Philipp Hoffmann <mail@philipphoffmann.de>
Closes #9768 from philipphoffmann/patch-2.
In the **[Task Launching Overheads](http://spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads)** section,
>Task Serialization: Using Kryo serialization for serializing tasks can reduce the task sizes, and therefore reduce the time taken to send them to the slaves.
As we know, task serialization is configured by the **spark.closure.serializer** parameter, but currently only the Java serializer is supported. If we set **spark.closure.serializer** to **org.apache.spark.serializer.KryoSerializer**, this will throw an exception.
Author: yangping.wu <wyphao.2007@163.com>
Closes #9734 from 397090770/397090770-patch-1.
Author: Andrew Or <andrew@databricks.com>
Closes #9676 from andrewor14/memory-management-docs.
The `</code>` end tag was mistyped as `<\code>` (backslash instead of slash) in
docs/configuration.md{L308-L339}
ref #8795
Author: Kai Jiang <jiangkai@gmail.com>
Closes #9715 from vectorijk/minor-typo-docs.
https://issues.apache.org/jira/browse/SPARK-11336
mengxr I add a hyperlink to Spark on GitHub and, in each code example, a hint about where the examples live in the Spark code repo. I remove the config key for changing the example code dir, since we assume all examples should be in spark/examples.
The hyperlink cannot be used yet, since Spark v1.6.0 has not been released, but it will work after the release, so it is not a problem.
I add some screenshots so you can get an immediate impression.
<img width="949" alt="screen shot 2015-10-27 at 10 47 18 pm" src="https://cloud.githubusercontent.com/assets/2637239/10780634/bd20e072-7cfc-11e5-8960-def4fc62a8ea.png">
<img width="1144" alt="screen shot 2015-10-27 at 10 47 31 pm" src="https://cloud.githubusercontent.com/assets/2637239/10780636/c3f6e180-7cfc-11e5-80b2-233589f4a9a3.png">
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9320 from yinxusen/SPARK-11336.
MLUtils.loadLibSVMFile to load DataFrame
Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame, include:
* Use libSVM data source for all example codes under examples/ml, and remove unused import.
* Use libSVM data source for user guides under ml-*** which were omitted by #8697.
* Fix bug: we should use ```sqlContext.read().format("libsvm").load(path)``` on the Java side, but the API doc and user guides mistakenly use ```sqlContext.read.format("libsvm").load(path)```.
* Code cleanup.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9690 from yanboliang/spark-11723.
include_example
I have made the required changes and tested them.
Kindly review the changes.
Author: Rishabh Bhardwaj <rbnext29@gmail.com>
Closes #9407 from rishabhbhardwaj/SPARK-11445.
Perceptron Classification
Add Python example code for Multilayer Perceptron Classification, and make the example code in the user guide testable. mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9594 from yanboliang/spark-11629.
managers
Author: Andrew Or <andrew@databricks.com>
Closes #9637 from andrewor14/update-da-docs.
<img width="931" alt="screen shot 2015-11-11 at 1 53 21 pm" src="https://cloud.githubusercontent.com/assets/2133137/11108261/35d183d4-889a-11e5-9572-85e9d6cebd26.png">
Author: Andrew Or <andrew@databricks.com>
Closes #9638 from andrewor14/fix-kryo-docs.
offset ranges for a KafkaRDD
tdas koeninger
This updates the Spark Streaming + Kafka Integration Guide doc with a working method to access the offsets of a `KafkaRDD` through Python.
Author: Nick Evans <me@nicolasevans.org>
Closes #9289 from manygrams/update_kafka_direct_python_docs.
classes
This patch modifies Spark's closure cleaner (and a few other places) to use ASM 5, which is necessary in order to support cleaning of closures that were compiled by Java 8.
In order to avoid ASM dependency conflicts, Spark excludes ASM from all of its dependencies and uses a shaded version of ASM 4 that comes from `reflectasm` (see [SPARK-782](https://issues.apache.org/jira/browse/SPARK-782) and #232). This patch updates Spark to use a shaded version of ASM 5.0.4 that was published by the Apache XBean project; the POM used to create the shaded artifact can be found at https://github.com/apache/geronimo-xbean/blob/xbean-4.4/xbean-asm5-shaded/pom.xml.
http://movingfulcrum.tumblr.com/post/80826553604/asm-framework-50-the-missing-migration-guide was a useful resource while upgrading the code to use the new ASM5 opcodes.
I also added new regression tests in the `java8-tests` subproject; the existing tests were insufficient to catch this bug, which only affected Scala 2.11 user code compiled targeting Java 8.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #9512 from JoshRosen/SPARK-6152.
include_example
Author: Pravin Gadakh <pravingadakh177@gmail.com>
Closes #9516 from pravingadakh/SPARK-11550.
include_example
https://issues.apache.org/jira/browse/SPARK-11382
BTW, I fixed an error in naive_bayes_example.py.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9596 from yinxusen/SPARK-11382.
This fix is to add one line to explain the current behavior of Spark SQL when writing Parquet files. All columns are forced to be nullable for compatibility reasons.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #9314 from gatorsmile/lossNull.
mllib-collaborative-filtering.md using include_example
Kindly review the changes.
Author: Rishabh Bhardwaj <rbnext29@gmail.com>
Closes #9519 from rishabhbhardwaj/SPARK-11337.
include_example
I have tested it locally and it is working fine; please review.
Author: sachin aggarwal <different.sachin@gmail.com>
Closes #9539 from agsachin/SPARK-11552-real.
Author: Bharat Lal <bharat.iisc@gmail.com>
Closes #9560 from bharatl/SPARK-11581.
1) kafkaStreams is a list. The list should be unpacked when passing it into the streaming context union method, which accepts a variable number of streams.
2) print() should be pprint() for pyspark.
This contribution is my original work, and I license the work to the project under the project's open source license.
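The unpacking fix in point 1 can be illustrated in plain Python (a minimal sketch: `union` here is a stand-in for the real `StreamingContext.union`, and plain lists stand in for DStreams):

```python
def union(*streams):
    """Stand-in for StreamingContext.union, which accepts a variable
    number of stream arguments rather than a single list of them."""
    merged = []
    for s in streams:
        merged.extend(s)
    return merged


# Pretend each inner list is a Kafka DStream.
kafka_streams = [[1, 2], [3, 4], [5, 6]]

# Wrong: union(kafka_streams) would pass one argument (the list itself)
# to a varargs method. Right: unpack the list so each stream becomes a
# separate argument.
unified = union(*kafka_streams)
print(unified)  # [1, 2, 3, 4, 5, 6]
```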
Author: chriskang90 <jckang@uchicago.edu>
Closes #9545 from c-kang/streaming_python_typo.
Add user guide and example code for ```AFTSurvivalRegression```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9491 from yanboliang/spark-10689.
It doesn't show up as a hyperlink currently. It will show up as a hyperlink after this change.
Author: Rohit Agarwal <mindprince@gmail.com>
Closes #9544 from mindprince/patch-2.
Doc change to align with HiveConf default in terms of where to create `warehouse` directory.
Author: xin Wu <xinwu@us.ibm.com>
Closes #9365 from xwu0226/spark-10046-commit.
This snippet seems to be mistakenly introduced at two places in #5348.
Author: Rohit Agarwal <mindprince@gmail.com>
Closes #9540 from mindprince/patch-1.
generation documentation
Fix Python example to use normalRDD as advertised
Author: Sean Owen <sowen@cloudera.com>
Closes #9529 from srowen/SPARK-11476.
We should use ```coefficients``` rather than ```weights``` in the user guide so that newcomers learn the conventional name from the outset. mengxr vectorijk
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9493 from yanboliang/docs-coefficients.
Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479
Author: Josh Rosen <joshrosen@databricks.com>
Closes #9450 from JoshRosen/upgrade-to-scala-2.10.5.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9467 from cloud-fan/doc.
The trim_codeblock(lines) function in include_example.rb removes some blank lines in the code.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9400 from yinxusen/SPARK-11443.
using include_example
Author: Pravin Gadakh <pravingadakh177@gmail.com>
Author: Pravin Gadakh <prgadakh@in.ibm.com>
Closes #9340 from pravingadakh/SPARK-11380.
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <lewuathe@me.com>
Closes #9394 from Lewuathe/missing-link-to-R-dataframe.
![image](https://cloud.githubusercontent.com/assets/8969467/10871746/612ba44a-80a4-11e5-99a0-40b9931dee52.png)
(This is without css, but you get the idea)
shivaram
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9401 from felixcheung/rstudioprogrammingguide.
mllib-naive-bayes.md/mllib-isotonic-regression.md using include_example
I have made the required changes in mllib-naive-bayes.md/mllib-isotonic-regression.md and also verified them.
Kindly review it.
Author: Rishabh Bhardwaj <rbnext29@gmail.com>
Closes #9353 from rishabhbhardwaj/SPARK-11383.
Remove Hadoop third party distro page, and move Hadoop cluster config info to configuration page
CC pwendell
Author: Sean Owen <sowen@cloudera.com>
Closes #9298 from srowen/SPARK-11305.
from R programmatically or from RStudio
Mapping spark.driver.memory from sparkEnvir to spark-submit commandline arguments.
shivaram suggested that we possibly add other spark.driver.* properties - do we want to add all of those? I thought those could be set in SparkConf?
sun-rui
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9290 from felixcheung/rdrivermem.
Author: tedyu <yuzhihong@gmail.com>
Closes #9281 from tedyu/master.
The recall-by-threshold snippet was using "precisionByThreshold".
Author: Mageswaran.D <mageswaran1989@gmail.com>
Closes #9333 from Mageswaran1989/Typo_in_mllib-evaluation-metrics.md.
mengxr https://issues.apache.org/jira/browse/SPARK-11297
Add new code tags to hold the same look and feel with previous documents.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9265 from yinxusen/SPARK-11297.
include_example
mengxr https://issues.apache.org/jira/browse/SPARK-11289
I make some changes in the ML feature extractors, namely TF-IDF, Word2Vec, and CountVectorizer. I add new example code in spark/examples; I hope it is the right place to add those examples.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9266 from yinxusen/SPARK-11289.
The SQL programming guide's link to the DataFrame functions reference points to the wrong location; this patch fixes that.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #9269 from JoshRosen/SPARK-11299.
Add a new spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes.
The existing spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395).
BTW, the [environment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate the R shell on the local host.
For your information, PySpark has two environment variables serving a similar purpose:
* `PYSPARK_PYTHON`: Python binary executable to use for PySpark in both driver and workers (default is `python`).
* `PYSPARK_DRIVER_PYTHON`: Python binary executable to use for PySpark in the driver only (default is `PYSPARK_PYTHON`).
PySpark uses the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the Python executable for a Python script.
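The PySpark precedence described above can be sketched as follows (a simplified illustration with a hypothetical helper name; the real logic in SparkSubmitCommandBuilder handles more cases):

```python
def resolve_python_executable(env, for_driver):
    """Pick the Python executable the way PySpark's precedence works:
    PYSPARK_DRIVER_PYTHON (driver only) falls back to PYSPARK_PYTHON,
    which in turn falls back to plain 'python'."""
    worker = env.get("PYSPARK_PYTHON", "python")
    if for_driver:
        return env.get("PYSPARK_DRIVER_PYTHON", worker)
    return worker


env = {"PYSPARK_PYTHON": "python3"}
print(resolve_python_executable(env, for_driver=True))   # python3

env["PYSPARK_DRIVER_PYTHON"] = "ipython"
print(resolve_python_executable(env, for_driver=True))   # ipython
print(resolve_python_executable(env, for_driver=False))  # python3
```

The proposed `spark.sparkr.r.driver.command` option gives SparkR the same driver-only override that `PYSPARK_DRIVER_PYTHON` gives PySpark.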
Author: Sun Rui <rui.sun@intel.com>
Closes #9179 from sun-rui/SPARK-10971.
A POC code for making example code in user guide testable.
mengxr We still need to talk about the labels in code.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9109 from yinxusen/SPARK-10382.
Removed typo on line 8 in markdown : "Received" -> "Receiver"
Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu>
Closes #9242 from RohanBhanderi/patch-1.
There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality, now that UnsafeShuffleManager supports large records, I think that we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and should merge the two managers together.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8829 from JoshRosen/consolidate-sort-shuffle-implementations.
Currently the log4j.properties file is not uploaded to executors, which leads them to use the default values. This fix makes sure the file is always uploaded to the distributed cache so that executors use the latest settings.
If the user specifies log configurations through --files, then executors will pick up the configs from --files instead of $SPARK_CONF_DIR/log4j.properties.
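The precedence just described can be sketched as (a hypothetical helper for illustration; the real resolution happens inside the YARN client code):

```python
def pick_log4j_config(files_arg, spark_conf_dir):
    """Executors use a log4j.properties shipped via --files if one is
    present; otherwise they fall back to $SPARK_CONF_DIR/log4j.properties."""
    for path in files_arg:
        if path.endswith("log4j.properties"):
            return path
    return spark_conf_dir + "/log4j.properties"


print(pick_log4j_config(["/tmp/log4j.properties"], "/etc/spark/conf"))
# /tmp/log4j.properties
print(pick_log4j_config([], "/etc/spark/conf"))
# /etc/spark/conf/log4j.properties
```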
Author: vundela <vsr@cloudera.com>
Author: Srinivasa Reddy Vundela <vsr@cloudera.com>
Closes #9118 from vundela/master.