| Commit message | Author | Age | Files | Lines |
Author: Reynold Xin <rxin@databricks.com>
Closes #10673 from rxin/SPARK-12735.
Replace Guava `Optional` with (an API clone of) Java 8 `java.util.Optional` (edit: and a clone of Guava `Optional`)
See also https://github.com/apache/spark/pull/10512
Author: Sean Owen <sowen@cloudera.com>
Closes #10513 from srowen/SPARK-4819.
`spark.shuffle.service.enabled` is a Spark application-related configuration; it is not necessary to set it in yarn-site.xml.
Author: Jeff Zhang <zjffdu@apache.org>
Closes #10657 from zjffdu/doc-fix.
allowBatching configurations for Streaming
/cc tdas brkyvz
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10453 from zsxwing/streaming-conf.
Author: Jacek Laskowski <jacek@japila.pl>
Closes #10603 from jaceklaskowski/streaming-actor-custom-receiver.
Modify the default value of `spark.memory.offHeap.enabled` to false.
Author: zzcclp <xm_zzc@sina.com>
Closes #10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.
This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.
Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted/deleted, leading to hard-to-diagnose bugs.
For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10534 from JoshRosen/remove-ttl-based-cleaning.
metricName
For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC".
Also, in the documentation, it is said that:
"The default metric used to choose the best ParamMap can be overridden by the setMetric method in each of these evaluators."
However, the method is called setMetricName.
This PR aims to fix both issues.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10328 from BenFradet/SPARK-12368.
prediction: user guide update
Update user guide doc for ```DecisionTreeRegressor``` providing variance of prediction.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10594 from yanboliang/spark-12570.
checked that the change is in Spark 1.6.0.
shivaram
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10574 from felixcheung/rwritemodedoc.
Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection.
In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection.
This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly).
If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors; this guards against corner-case bugs in situations where the driver and executor JVMs might have different sets of JDBC drivers on their classpaths (previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths were different).
This patch is inspired by a similar patch that I made to the `spark-redshift` library (https://github.com/databricks/spark-redshift/pull/143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons).
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10519 from JoshRosen/jdbc-driver-precedence.
We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0.
Author: Reynold Xin <rxin@databricks.com>
Closes #10531 from rxin/SPARK-12588.
Streaming
This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10385 from zsxwing/accumulator-broadcast-example.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10439 from zsxwing/kafka-message-handler-doc.
i.e. Hadoop 1 and Hadoop 2.0
Author: Reynold Xin <rxin@databricks.com>
Closes #10404 from rxin/SPARK-11807.
According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy.
After changing the compressor to LZ4, I saw 20% improvement on end-to-end time for a TPCDS query (Q4).
[1] https://github.com/ning/jvm-compressor-benchmark/wiki
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes #10342 from davies/lz4.
Author: Reynold Xin <rxin@databricks.com>
Closes #10395 from rxin/SPARK-11808.
Author: Reynold Xin <rxin@databricks.com>
Closes #10387 from rxin/version-bump.
The current default storage level of Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official document and RDD APIs.
davies Is this inconsistency intentional? Thanks!
Updates: Since the data is always serialized on the Python side, the Java-specific deserialized storage levels, such as MEMORY_ONLY, are not removed.
Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.
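As a toy illustration of the documented behavior (a hypothetical helper, not the actual `pyspark` API):

```python
# Hypothetical model of PySpark's persist(): the default level is MEMORY_ONLY,
# and only the levels listed above are valid choices in Python.
PYTHON_STORAGE_LEVELS = {
    "MEMORY_ONLY", "MEMORY_ONLY_2",
    "MEMORY_AND_DISK", "MEMORY_AND_DISK_2",
    "DISK_ONLY", "DISK_ONLY_2",
    "OFF_HEAP",
}

def persist(level="MEMORY_ONLY"):
    # Objects are always pickled in Python, so every level is effectively a
    # serialized level; a MEMORY_ONLY_SER distinction would be meaningless.
    if level not in PYTHON_STORAGE_LEVELS:
        raise ValueError("unsupported storage level: " + level)
    return level

print(persist())             # MEMORY_ONLY -- the new default
print(persist("DISK_ONLY"))  # DISK_ONLY
```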
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10092 from gatorsmile/persistStorageLevel.
- Provide example on `message handler`
- Provide a bit on KPL record de-aggregation
- Fix typos
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #9970 from brkyvz/kinesis-docs.
No known breaking changes, but some deprecations and changes of behavior.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #10235 from jkbradley/mllib-guide-update-1.6.
bisecting k-means
This PR includes only example code in order to finish it quickly.
I'll send another PR for the docs soon.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #9952 from yu-iskw/SPARK-6518.
cc jkbradley
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #10244 from yu-iskw/SPARK-12215.
shivaram Please help review.
Author: Jeff Zhang <zjffdu@apache.org>
Closes #10290 from zjffdu/SPARK-12318.
This fixes the sidebar, using a pure CSS mechanism to hide it when the browser's viewport is too narrow.
Credit goes to the original author Titan-C (mentioned in the NOTICE).
Note that I am not a CSS expert, so I can only address comments to some extent.
Default view: https://cloud.githubusercontent.com/assets/7594753/11793597/6d1d6eda-a261-11e5-836b-6eb2054e9054.png
When collapsed manually by the user: https://cloud.githubusercontent.com/assets/7594753/11793669/c991989e-a261-11e5-8bf6-aecf3bdb6319.png
Disappears when the column is too narrow: https://cloud.githubusercontent.com/assets/7594753/11793607/7754dbcc-a261-11e5-8b15-e0d074b0e47c.png
Can still be opened by the user if necessary: https://cloud.githubusercontent.com/assets/7594753/11793612/7bf82968-a261-11e5-9cc3-e827a7a6b2b0.png
Author: Timothy Hunter <timhunter@databricks.com>
Closes #10297 from thunterdb/12324.
Please help to review, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes #10195 from jerryshao/SPARK-10123.
cluster mode.
Adding more documentation about submitting jobs with mesos cluster mode.
Author: Timothy Chen <tnachen@gmail.com>
Closes #10086 from tnachen/mesos_supervise_docs.
Follow-up of [SPARK-12199](https://issues.apache.org/jira/browse/SPARK-12199) and #10193 where a broken link has been left as is.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10282 from BenFradet/SPARK-12199.
https://issues.apache.org/jira/browse/SPARK-12199
Follow-up PR of SPARK-11551. Fix some errors in ml-features.md
mengxr
Author: Xusen Yin <yinxusen@gmail.com>
Closes #10193 from yinxusen/SPARK-12199.
Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation.
I wonder if I should also add a snippet to the code example, input welcome.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10257 from BenFradet/SPARK-12217.
Adding in Pipeline Import and Export Documentation.
Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>
Closes #10179 from anabranch/master.
With the merge of [SPARK-8337](https://issues.apache.org/jira/browse/SPARK-8337), the Python API now has the same functionality as Scala/Java, so this changes the description to make it more precise.
zsxwing tdas , please review, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes #10246 from jerryshao/direct-kafka-doc-update.
This patch adds documentation for Spark configurations that affect off-heap memory and makes some naming and validation improvements for those configs.
- Change `spark.memory.offHeapSize` to `spark.memory.offHeap.size`. This is fine because this configuration has not shipped in any Spark release yet (it's new in Spark 1.6).
- Deprecate `spark.unsafe.offHeap` in favor of a new `spark.memory.offHeap.enabled` configuration. The motivation behind this change is to gather all memory-related configurations under the same prefix.
- Add a check which prevents users from setting `spark.memory.offHeap.enabled=true` when `spark.memory.offHeap.size == 0`. After SPARK-11389 (#9344), which was committed in Spark 1.6, Spark enforces a hard limit on the amount of off-heap memory that it will allocate to tasks. As a result, enabling off-heap execution memory without setting `spark.memory.offHeap.size` will lead to immediate OOMs. The new configuration validation makes this scenario easier to diagnose, helping to avoid user confusion.
- Document these configurations on the configuration page.
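The new validation can be sketched like this (an illustrative Python helper, not the actual Scala code):

```python
# Sketch of the config check described above: enabling off-heap memory
# requires a positive spark.memory.offHeap.size, otherwise tasks would
# OOM immediately; failing fast makes the misconfiguration obvious.
def validate_off_heap(conf):
    enabled = conf.get("spark.memory.offHeap.enabled", "false") == "true"
    size = int(conf.get("spark.memory.offHeap.size", "0"))
    if enabled and size <= 0:
        raise ValueError("spark.memory.offHeap.size must be > 0 when "
                         "spark.memory.offHeap.enabled is true")
    return enabled, size

# OK: off-heap enabled with 1 GB reserved.
validate_off_heap({"spark.memory.offHeap.enabled": "true",
                   "spark.memory.offHeap.size": str(1 << 30)})
```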
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10237 from JoshRosen/SPARK-12251.
This avoids bringing up yet another HTTP server on the driver, and
instead reuses the file server already managed by the driver's
RpcEnv. As a bonus, the repl now inherits the security features of
the network library.
There's also a small change to create the directory for storing classes
under the root temp dir for the application (instead of directly
under java.io.tmpdir).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #9923 from vanzin/SPARK-11563.
spark.mllib and mllib in the documentation.
Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark).
It also removes some files that I forgot to delete with #10207
Author: Timothy Hunter <timhunter@databricks.com>
Closes #10234 from thunterdb/12212.
This PR adds document for `basePath`, which is a new parameter used by `HadoopFsRelation`.
The compiled doc is shown below.
![image](https://cloud.githubusercontent.com/assets/2072857/11673132/1ba01192-9dcb-11e5-98d9-ac0b4e92e98c.png)
JIRA: https://issues.apache.org/jira/browse/SPARK-11678
Author: Yin Huai <yhuai@databricks.com>
Closes #10211 from yhuai/basePathDoc.
from 1.1
The "Migration from 1.1" section added to the GraphX doc in 1.2.0 (see https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#migrating-from-spark-11) uses {{site.SPARK_VERSION}} as the version where changes were introduced; it should be just 1.2.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes #10206 from aray/graphx-doc-1.1-migration.
PR on behalf of somideshmukh, thanks!
Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>
Closes #10219 from yinxusen/SPARK-11551.
This PR moves pieces of the spark.ml user guide to reflect suggestions in SPARK-8517. It does not introduce new content, as requested.
<img width="192" alt="screen shot 2015-12-08 at 11 36 00 am" src="https://cloud.githubusercontent.com/assets/7594753/11666166/e82b84f2-9d9f-11e5-8904-e215424d8444.png">
Author: Timothy Hunter <timhunter@databricks.com>
Closes #10207 from thunterdb/spark-8517.
Author: Michael Armbrust <michael@databricks.com>
Closes #10060 from marmbrus/docs.
Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10166 from BenFradet/SPARK-12159.
This reverts PR #10002, commit 78209b0ccaf3f22b5e2345dfb2b98edfdb746819.
The original PR wasn't tested on Jenkins before being merged.
Author: Cheng Lian <lian@databricks.com>
Closes #10200 from liancheng/revert-pr-10002.
Add ```SQLTransformer``` user guide, example code and make Scala API doc more clear.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10006 from yanboliang/spark-11958.
include_example
Made a new patch containing only the markdown examples moved to the example folder.
Only three Java examples were not shifted since they contained compilation errors; these classes are:
1) StandardScale 2) NormalizerExample 3) VectorIndexer
Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>
Closes #10002 from somideshmukh/SomilBranch1.33.
https://issues.apache.org/jira/browse/SPARK-11963
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9962 from yinxusen/SPARK-11963.
Author: rotems <roter>
Closes #10078 from Botnaim/KryoMultipleCustomRegistrators.
conflicts with dplyr
shivaram
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10119 from felixcheung/rdocdplyrmasked.
cc mengxr
Author: Jeff Zhang <zjffdu@apache.org>
Closes #10093 from zjffdu/mllib_typo.
The existing `spark.memory.fraction` (default 0.75) gives the system 25% of the space to work with. For small heaps, this is not enough: e.g. default 1GB leaves only 250MB system memory. This is especially a problem in local mode, where the driver and executor are crammed in the same JVM. Members of the community have reported driver OOM's in such cases.
**New proposal.** We now reserve 300MB before taking the 75%. For 1GB JVMs, this leaves `(1024 - 300) * 0.75 = 543MB` for execution and storage. This is proposal (1) listed in the [JIRA](https://issues.apache.org/jira/browse/SPARK-12081).
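The arithmetic behind the proposal, assuming the default `spark.memory.fraction` of 0.75:

```python
# Reserve 300 MB of system memory first, then give spark.memory.fraction of
# the remainder to execution and storage.
RESERVED_SYSTEM_MB = 300
MEMORY_FRACTION = 0.75  # default spark.memory.fraction

def spark_memory_mb(heap_mb):
    return (heap_mb - RESERVED_SYSTEM_MB) * MEMORY_FRACTION

print(spark_memory_mb(1024))  # 543.0 -- matches the 1 GB example above
```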
Author: Andrew Or <andrew@databricks.com>
Closes #10081 from andrewor14/unified-memory-small-heaps.
https://issues.apache.org/jira/browse/SPARK-11961
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9965 from yinxusen/SPARK-11961.