| Commit message | Author | Age | Files | Lines |
| |
The existing `spark.memory.fraction` (default 0.75) gives the system 25% of the space to work with. For small heaps, this is not enough: e.g. the default 1GB heap leaves only 250MB of system memory. This is especially a problem in local mode, where the driver and executor are crammed into the same JVM. Members of the community have reported driver OOMs in such cases.
**New proposal.** We now reserve 300MB before taking the 75%. For 1GB JVMs, this leaves `(1024 - 300) * 0.75 = 543MB` for execution and storage. This is proposal (1) listed in the [JIRA](https://issues.apache.org/jira/browse/SPARK-12081).
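For concreteness, a minimal sketch of the arithmetic behind the new reservation (the constants are the ones quoted above; this is not the actual UnifiedMemoryManager code):
```scala
object UnifiedMemorySketch {
  val reservedSystemMemory: Long = 300L * 1024 * 1024 // 300MB set aside before the fraction
  val memoryFraction = 0.75                           // spark.memory.fraction

  // Memory left for execution and storage after the reservation.
  def usableMemory(systemMemory: Long): Long =
    ((systemMemory - reservedSystemMemory) * memoryFraction).toLong

  def main(args: Array[String]): Unit = {
    val oneGB = 1024L * 1024 * 1024
    println(usableMemory(oneGB) / (1024 * 1024) + "MB") // (1024 - 300) * 0.75 = 543MB
  }
}
```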
Author: Andrew Or <andrew@databricks.com>
Closes #10081 from andrewor14/unified-memory-small-heaps.
|
| |
Cleanups are triggered by garbage collection on the driver. If the driver JVM is huge and there is little memory pressure, GC may rarely run, so we may never clean up shuffle files on executors. This is a problem for long-running applications (e.g. streaming).
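A minimal sketch of the approach, assuming the fix is to force a periodic GC on the driver (the names and interval are illustrative, not the actual ContextCleaner code):
```scala
import java.util.concurrent.{Executors, TimeUnit}

object PeriodicGcSketch {
  def start(intervalMinutes: Long): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable {
      // A full GC lets weak references to unused RDDs/shuffles be processed,
      // which in turn triggers cleanup requests to the executors.
      override def run(): Unit = System.gc()
    }
    scheduler.scheduleAtFixedRate(task, intervalMinutes, intervalMinutes, TimeUnit.MINUTES)
  }
}
```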
Author: Andrew Or <andrew@databricks.com>
Closes #10070 from andrewor14/periodic-gc.
|
| |
the current TreeNode, we should only return the simpleString.
In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we will only return the simpleString.
I tested the [following case provided by Cristian](https://issues.apache.org/jira/browse/SPARK-11596?focusedCommentId=15019241&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15019241).
```
val c = (1 to 20).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
println(s"PROCESSING >>>>>>>>>>> $idx")
val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
val union = curr.map(_.unionAll(df)).getOrElse(df)
union.cache()
Some(union)
}
c.get.explain(true)
```
Without the change, `c.get.explain(true)` took 100s. With the change, `c.get.explain(true)` took 26ms.
https://issues.apache.org/jira/browse/SPARK-11596
Author: Yin Huai <yhuai@databricks.com>
Closes #10079 from yhuai/SPARK-11596.
|
| |
https://issues.apache.org/jira/browse/SPARK-11352
Author: Yin Huai <yhuai@databricks.com>
Closes #10072 from yhuai/SPARK-11352.
|
| |
When querying a Timestamp or Date column like the following
val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" < end)
the generated SQL query is "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0".
It should have quotes around the Timestamp/Date value, such as "TIMESTAMP_COLUMN >= '2015-01-01 00:00:00.0'".
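A hedged sketch of the kind of fix involved (the helper below is illustrative, not the exact JDBC data source code): when a pushed-down filter is compiled to SQL, Timestamp and Date values need to be rendered as quoted literals, just like Strings.
```scala
import java.sql.{Date, Timestamp}

object FilterLiteralSketch {
  def compileValue(value: Any): String = value match {
    case s: String    => s"'${s.replace("'", "''")}'"
    case t: Timestamp => s"'$t'"                  // e.g. '2015-01-01 00:00:00.0'
    case d: Date      => s"'$d'"                  // e.g. '2015-01-01'
    case other        => other.toString
  }

  def main(args: Array[String]): Unit =
    println("TIMESTAMP_COLUMN >= " + compileValue(Timestamp.valueOf("2015-01-01 00:00:00")))
}
```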
Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
Closes #9872 from huaxingao/spark-11788.
|
| |
The issue is that the output committer is not idempotent, so retry attempts will
fail because the output file already exists. It is not safe to clean up the file,
as this output committer is by design not retryable. Currently, the job fails
with a confusing file-exists error. This patch is a stop-gap to tell the user
to look at the top of the error log for the proper message.
This is difficult to test locally as Spark is hardcoded not to retry. Manually
verified by upping the retry attempts.
Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>
Closes #10080 from nongli/spark-11328.
|
| |
TestHive.reset()
When profiling HiveCompatibilitySuite, I noticed that most of the time seems to be spent in expensive `TestHive.reset()` calls. This patch speeds up suites based on HiveComparisonTest, such as HiveCompatibilitySuite, with the following changes:
- Avoid `TestHive.reset()` whenever possible:
- Use a simple set of heuristics to guess whether we need to call `reset()` in between tests.
- As a safety-net, automatically re-run failed tests by calling `reset()` before the re-attempt.
- Speed up the expensive parts of `TestHive.reset()`: loading the `src` and `srcpart` tables took roughly 600ms per test, so we now avoid this by using a simple heuristic which only loads those tables for tests that reference them. This is based on simple string matching over the test queries, which errs on the side of loading tables in more situations than might be strictly necessary (sketched below).
After these changes, HiveCompatibilitySuite seems to run in about 10 minutes.
This PR is a revival of #6663, an earlier experimental PR from June, where I played around with several possible speedups for this suite.
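A hedged sketch of the string-matching heuristic described above (the object and method names are illustrative, not the actual HiveComparisonTest code):
```scala
object TestTableHeuristicSketch {
  // Only load the expensive test tables for queries that appear to mention them.
  private val expensiveTestTables = Seq("src", "srcpart")

  def tablesToLoad(queries: Seq[String]): Seq[String] = {
    val lowerCased = queries.map(_.toLowerCase)
    // Errs on the side of loading too much: a plain substring match is enough.
    expensiveTestTables.filter(table => lowerCased.exists(_.contains(table)))
  }

  def main(args: Array[String]): Unit =
    println(tablesToLoad(Seq("SELECT key, value FROM src LIMIT 10"))) // List(src)
}
```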
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10055 from JoshRosen/speculative-testhive-reset.
|
| |
recovery issue
Fixed a minor race condition in #10017
Closes #10017
Author: jerryshao <sshao@hortonworks.com>
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10074 from zsxwing/review-pr10017.
|
| |
https://issues.apache.org/jira/browse/SPARK-11961
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9965 from yinxusen/SPARK-11961.
|
| |
JavaSerializerInstance.serialize"
This reverts commit 1401166576c7018c5f9c31e0a6703d5fb16ea339.
|
| |
The solution is to save the RDD partitioner in a separate file in the RDD checkpoint directory, i.e. `<checkpoint dir>/_partitioner`. In most cases, whether or not the RDD partitioner is recovered does not affect correctness; it only affects performance. So this solution makes a best-effort attempt to save and recover the partitioner. If either step fails, the checkpointing is not affected. This makes the patch safe and backward compatible.
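A best-effort sketch of the idea using plain Java serialization to a local file (illustrative only; not the actual ReliableCheckpointRDD code):
```scala
import java.io._

object PartitionerFileSketch {
  // Write the partitioner next to the checkpoint data; swallow failures so
  // checkpointing itself is never affected (losing it only costs performance).
  def writePartitioner(checkpointDir: File, partitioner: Serializable): Unit =
    try {
      val out = new ObjectOutputStream(new FileOutputStream(new File(checkpointDir, "_partitioner")))
      try out.writeObject(partitioner) finally out.close()
    } catch { case _: Exception => () }

  // Best-effort read: None simply means the recovered RDD has no partitioner.
  def readPartitioner(checkpointDir: File): Option[AnyRef] =
    try {
      val in = new ObjectInputStream(new FileInputStream(new File(checkpointDir, "_partitioner")))
      try Some(in.readObject()) finally in.close()
    } catch { case _: Exception => None }
}
```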
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #9983 from tdas/SPARK-12004.
|
| |
This bug was exposed as memory corruption in Timsort, which uses copyMemory to copy
large regions that can overlap. The prior implementation always copied forward, so it
mishandled half of the overlapping cases (destination starting after the source), leaving the data corrupt.
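A plain-array sketch of the overlap-aware behaviour described above (illustrative, not the actual Platform.copyMemory change): when the destination overlaps the source and starts after it, the copy must run backwards so source bytes are not overwritten before they are read.
```scala
object OverlapSafeCopySketch {
  def copy(src: Array[Byte], srcPos: Int, dst: Array[Byte], dstPos: Int, len: Int): Unit = {
    if ((dst eq src) && dstPos > srcPos && dstPos < srcPos + len) {
      var i = len - 1
      while (i >= 0) { dst(dstPos + i) = src(srcPos + i); i -= 1 } // backward copy
    } else {
      var i = 0
      while (i < len) { dst(dstPos + i) = src(srcPos + i); i += 1 } // forward copy
    }
  }

  def main(args: Array[String]): Unit = {
    val bytes = Array[Byte](1, 2, 3, 4, 5)
    copy(bytes, 0, bytes, 1, 4)   // overlapping shift by one
    println(bytes.toSeq)          // List(1, 1, 2, 3, 4), not corrupted
  }
}
```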
Author: Nong Li <nong@databricks.com>
Closes #10068 from nongli/spark-12030.
|
| |
This commit upgrades the Tachyon dependency from 0.8.1 to 0.8.2.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10054 from JoshRosen/upgrade-to-tachyon-0.8.2.
|
| |
andrewor14 the same PR as in branch 1.5
harishreedharan
Author: woj-i <wojciechindyk@gmail.com>
Closes #9859 from woj-i/master.
|
| |
Persist and unpersist exist in both the RDD and DataFrame APIs, and I think they are still very important in the Dataset API. Not sure if my understanding is correct? If so, could you help me check whether the implementation is acceptable?
Please provide your opinions. marmbrus rxin cloud-fan
Thank you very much!
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes #9889 from gatorsmile/persistDS.
|
| |
Create a Java version of `constructorFor` and `extractorFor` in `JavaTypeInference`.
Author: Wenchen Fan <wenchen@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>
Closes #9937 from cloud-fan/pojo.
|
| |
compatible with encoder schema
When we build the `fromRowExpression` for an encoder, we set up a lot of "unresolved" stuff and lose the required data type, which may lead to a runtime error if the real type doesn't match the encoder's schema.
For example, if we build an encoder for `case class Data(a: Int, b: String)` and the real type is `[a: int, b: long]`, we will hit a runtime error saying that we can't construct class `Data` from an int and a long, because we lost the information that `b` should be a string.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9840 from cloud-fan/err-msg.
|
| |
The reason is that, for a single-column `RowEncoder` (or a single-field product encoder), when we use it as the encoder for the grouping key, we should also combine the grouping attributes, even though there is only one grouping attribute.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10059 from cloud-fan/bug.
|
| |
This PR backports PR #10039 to master
Author: Cheng Lian <lian@databricks.com>
Closes #10063 from liancheng/spark-12046.doc-fix.master.
|
| |
`JavaSerializerInstance.serialize` uses `ByteArrayOutputStream.toByteArray` to get the serialized data. `ByteArrayOutputStream.toByteArray` needs to copy the content in the internal array to a new array. However, since the array will be converted to `ByteBuffer` at once, we can avoid the memory copy.
This PR added `ByteBufferOutputStream` to access the protected `buf` and convert it to a `ByteBuffer` directly.
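A simplified sketch of such a stream (assuming roughly this shape; the class added by the PR may differ in detail):
```scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

// `buf` and `count` are protected fields of ByteArrayOutputStream, so a
// subclass can wrap the internal array directly instead of copying it.
class ByteBufferOutputStreamSketch extends ByteArrayOutputStream {
  def toByteBuffer: ByteBuffer = ByteBuffer.wrap(buf, 0, count)
}
```
Serialization can then call `toByteBuffer` instead of `ByteBuffer.wrap(toByteArray)`, skipping one full copy of the serialized bytes.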
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10051 from zsxwing/SPARK-12060.
|
| |
correct results for null values
JIRA: https://issues.apache.org/jira/browse/SPARK-11949
The result of a cube plan uses an incorrect schema. The schema of the cube result should set the nullable property to true, because the grouping expressions will produce null values.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #10038 from viirya/fix-cube.
|
| |
jira: https://issues.apache.org/jira/browse/SPARK-11898
syn0Global and syn1Global in word2vec are quite large objects, with size (vocab * vectorSize * 8), yet they are passed to workers using basic task serialization.
Using broadcast can greatly improve performance. My benchmark shows that, for a 1M vocabulary and the default vectorSize of 100, switching to broadcast can:
1. decrease the worker memory consumption by 45%.
2. decrease running time by 40%.
This will also help extend the upper limit for Word2Vec.
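A hedged sketch of the idea (illustrative sizes and names, not the actual Word2Vec code): ship the large shared arrays to executors once as broadcast variables instead of capturing them in every task closure.
```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastWeightsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("w2v-sketch"))
    val syn0Global = new Array[Float](10000 * 100)   // vocab * vectorSize, illustrative
    val syn0Bc = sc.broadcast(syn0Global)            // shipped to each executor once
    val partialResults = sc.parallelize(1 to 4, 4).mapPartitions { _ =>
      val syn0 = syn0Bc.value                        // local reference, no per-task copy
      Iterator(syn0.length)
    }
    println(partialResults.collect().toSeq)
    sc.stop()
  }
}
```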
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #9878 from hhbyyh/w2vBC.
|
| |
JIRA: https://issues.apache.org/jira/browse/SPARK-12018
The code for common subexpression elimination can be refactored and simplified, and some unnecessary variables can be removed.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #10009 from viirya/refactor-subexpr-eliminate.
|
| |
I accidentally omitted these as part of #10049.
|
| |
Avoid potential deadlock with a user app's shutdown hook thread by more narrowly synchronizing access to 'hooks'
Author: Sean Owen <sowen@cloudera.com>
Closes #10042 from srowen/SPARK-12049.
|
| |
This change seems large, but most of it is just replacing `byte[]`
with `ByteBuffer` and `new byte[]` with `ByteBuffer.allocate()`,
since it changes the network library's API.
The following are parts of the code that actually have meaningful
changes:
- The Message implementations were changed to inherit from a new
AbstractMessage that can optionally hold a reference to a body
(in the form of a ManagedBuffer); this is similar to how
ResponseWithBody worked before, except now it's not restricted
to just responses.
- The TransportFrameDecoder was pretty much rewritten to avoid
copies as much as possible; it doesn't rely on CompositeByteBuf
to accumulate incoming data anymore, since CompositeByteBuf
has issues when slices are retained. The code now is able to
create frames without having to resort to copying bytes except
for a few bytes (containing the frame length) in very rare cases.
- Some minor changes in the SASL layer to convert things back to
`byte[]` since the JDK SASL API operates on those.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #9987 from vanzin/SPARK-12007.
|
| |
startDriverHeartbeat
https://issues.apache.org/jira/browse/SPARK-12037
a simple fix by changing the order of the statements
Author: CodingCat <zhunansjtu@gmail.com>
Closes #10032 from CodingCat/SPARK-12037.
|
| |
https://issues.apache.org/jira/browse/SPARK-12035
When we debug lots of example code files, as in https://github.com/apache/spark/pull/10002, it's hard to know which file causes errors due to the limited information in `include_example.rb`. With the filenames included, we can locate bugs easily.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #10026 from yinxusen/SPARK-12035.
|
| |
This pull request fixes multiple issues with API doc generation.
- Modify the Jekyll plugin so that the entire doc build fails if API docs cannot be generated. This will make it easy to detect when the doc build breaks, since this will now trigger Jenkins failures.
- Change how we handle the `-target` compiler option flag in order to fix `javadoc` generation.
- Incorporate doc changes from thunterdb (in #10048).
Closes #10048.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Timothy Hunter <timhunter@databricks.com>
Closes #10049 from JoshRosen/fix-doc-build.
|
| |
KinesisStreamTests in test.py is broken because of #9403. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46896/testReport/(root)/KinesisStreamTests/test_kinesis_stream/
Because Python streaming didn't work when https://github.com/apache/spark/pull/9403 was merged, the PR build didn't actually report the Python test failure.
This PR just disables the test to unblock #10039
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10047 from zsxwing/disable-python-kinesis-test.
|
| |
CC jkbradley mengxr josepablocam
Author: Feynman Liang <feynman.liang@gmail.com>
Closes #10005 from feynmanliang/streaming-test-user-guide.
|
| |
Remove duplicate mllib example (DT/RF/GBT in Java/Python).
Since we have tutorial code for DT/RF/GBT classification/regression in Scala/Java/Python and example applications for DT/RF/GBT in Scala, we mark these as duplicated and remove them.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9954 from yanboliang/SPARK-11975.
|
| |
jira: https://issues.apache.org/jira/browse/SPARK-11689
Add simple user guide for LDA under spark.ml and example code under examples/. Use include_example to include example code in the user guide markdown. Check SPARK-11606 for instructions.
The original PR was reverted due to a document build error: https://github.com/apache/spark/pull/9722
mengxr feynmanliang yinxusen Sorry for the trouble.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #9974 from hhbyyh/ldaMLExample.
|
| |
```EventLoggingListener.getLogPath``` needs 4 input arguments:
https://github.com/apache/spark/blob/v1.6.0-preview2/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L276-L280
the 3rd parameter should be appAttemptId, and the 4th parameter is the codec...
Author: Teng Qiu <teng.qiu@gmail.com>
Closes #10044 from chutium/SPARK-12053.
|
| |
This reverts commit cc243a079b1c039d6e7f0b410d1654d94a090e14 / PR #9297
I'm reverting this because it broke SQLListenerMemoryLeakSuite in the master Maven builds.
See #9991 for a discussion of why this broke the tests.
|
| |
The list in ml-ensembles.md wasn't properly formatted and, as a result, was looking like this:
![old](http://i.imgur.com/2ZhELLR.png)
This PR aims to make it look like this:
![new](http://i.imgur.com/0Xriwd2.png)
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10025 from BenFradet/ml-ensembles-doc.
|
| |
This PR improves the performance of CartesianProduct by caching the result of the right plan.
After this patch, the query time of TPC-DS Q65 goes down from 28 minutes to 4 seconds (420x faster).
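An illustrative sketch of the optimization in plain Scala (not the actual CartesianProduct operator): materialize the right side once and reuse it, instead of recomputing it for every partition of the left side.
```scala
object CartesianSketch {
  def cartesian[A, B](left: Iterator[A], computeRight: () => Iterator[B]): Iterator[(A, B)] = {
    val cachedRight = computeRight().toArray              // evaluate the right plan only once
    left.flatMap(a => cachedRight.iterator.map(b => (a, b)))
  }

  def main(args: Array[String]): Unit = {
    val pairs = cartesian(Iterator(1, 2), () => Iterator("a", "b")).toList
    println(pairs) // List((1,a), (1,b), (2,a), (2,b))
  }
}
```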
cc nongli
Author: Davies Liu <davies@databricks.com>
Closes #9969 from davies/improve_cartesian.
|
| |
In 1.6, we introduced a public API to get the SQLContext for the current thread; SparkPlan should use it.
Author: Davies Liu <davies@databricks.com>
Closes #9990 from davies/leak_context.
|
| |
database supports transactions
Fixes [SPARK-11989](https://issues.apache.org/jira/browse/SPARK-11989)
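The description here is brief, but judging from the title, the idea is to make transaction handling conditional on what the target database supports. A hedged sketch using only standard JDBC metadata (illustrative, not the actual JdbcUtils code):
```scala
import java.sql.Connection

object MaybeTransactionalWriteSketch {
  def writeBatch(conn: Connection)(doInserts: Connection => Unit): Unit = {
    // Only manage a transaction when the target database supports one.
    val supportsTransactions = conn.getMetaData.supportsTransactions()
    if (supportsTransactions) conn.setAutoCommit(false)
    doInserts(conn)
    if (supportsTransactions) conn.commit()
  }
}
```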
Author: CK50 <christian.kurz@oracle.com>
Author: Christian Kurz <christian.kurz@oracle.com>
Closes #9973 from CK50/branch-1.6_non-transactional.
(cherry picked from commit a589736a1b237ef2f3bd59fbaeefe143ddcc8f4e)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
| |
this is a trivial fix, discussed [here](http://stackoverflow.com/questions/28500401/maven-assembly-plugin-warning-the-assembly-descriptor-contains-a-filesystem-roo/).
Author: Prashant Sharma <scrapcodes@gmail.com>
Closes #10014 from ScrapCodes/assembly-warning.
|
| |
Top is implemented in terms of takeOrdered, which already maintains the
order, so top should, too.
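A plain-Scala illustration of that relationship (not the RDD implementation itself): `top(n)` is `takeOrdered(n)` under the reversed ordering, so its result is already in descending order.
```scala
object TopOrderSketch {
  def takeOrdered[T](xs: Seq[T], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    xs.sorted(ord).take(n)

  def top[T](xs: Seq[T], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    takeOrdered(xs, n)(ord.reverse)

  def main(args: Array[String]): Unit =
    println(top(Seq(3, 1, 4, 1, 5, 9, 2, 6), 3)) // List(9, 6, 5)
}
```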
Author: Wieland Hoffmann <themineo@gmail.com>
Closes #10013 from mineo/top-order.
|
| |
support SBT pom reader only.
Author: Prashant Sharma <scrapcodes@gmail.com>
Closes #10012 from ScrapCodes/minor-build-comment.
|
| |
zk://host:port for a multi-master Mesos cluster using ZooKeeper
* According to the doc below and the validation logic in [SparkSubmit.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L231), the master URL for a Mesos cluster should always start with `mesos://`
http://spark.apache.org/docs/latest/running-on-mesos.html
`The Master URLs for Mesos are in the form mesos://host:5050 for a single-master Mesos cluster, or mesos://zk://host:2181 for a multi-master Mesos cluster using ZooKeeper.`
* However, [SparkContext.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2749) does not enforce this and can accept a master URL in the form `zk://host:port`
* For master URLs in the form `zk://host:port`, the valid form should be `mesos://zk://host:port`
* This PR restricts the validation in `SparkContext.scala` so that only Mesos master URLs prefixed with `mesos://` are accepted (see the sketch below).
* This PR also updates the corresponding unit test.
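A minimal sketch of the tightened check (illustrative only; the actual change adjusts the master-URL matching in SparkContext.scala):
```scala
object MesosMasterUrlSketch {
  // A Mesos master URL must carry the mesos:// prefix, including the
  // ZooKeeper form mesos://zk://host:2181; a bare zk://host:port is rejected.
  def isValidMesosUrl(master: String): Boolean = master.startsWith("mesos://")

  def main(args: Array[String]): Unit = {
    println(isValidMesosUrl("mesos://zk://host:2181")) // true
    println(isValidMesosUrl("zk://host:2181"))         // false
  }
}
```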
Author: toddwan <tawan0109@outlook.com>
Closes #9886 from toddwan/S11859.
|
| |
Parquet relation with decimal column".
https://issues.apache.org/jira/browse/SPARK-12039
Since it is pretty flaky in hadoop 1 tests, we can disable it while we are investigating the cause.
Author: Yin Huai <yhuai@databricks.com>
Closes #10035 from yhuai/SPARK-12039-ignore.
|
| |
In https://github.com/apache/spark/pull/9409 we enabled multi-column counting. The approach taken in that PR introduces a bit of overhead by first creating a row only to check if all of the columns are non-null.
This PR fixes that technical debt. Count now takes multiple columns as its input. In order to make this work I have also added support for multiple columns in the single distinct code path.
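As a plain-Scala illustration of the semantics involved (not the actual aggregate code path), count over multiple columns counts only the rows in which every argument is non-null, which the rewritten Count can now check directly instead of first assembling a row just to null-check it:
```scala
object MultiColumnCountSketch {
  def count(rows: Seq[Seq[Any]]): Long =
    rows.count(columns => columns.forall(_ != null)).toLong

  def main(args: Array[String]): Unit =
    println(count(Seq(Seq(1, "x"), Seq(2, null), Seq(null, "y")))) // 1
}
```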
cc yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #10015 from hvanhovell/SPARK-12024.
|
| |
Author: Sun Rui <rui.sun@intel.com>
Closes #9769 from sun-rui/SPARK-11781.
|
| |
Add support for colnames, colnames<-, coltypes<-.
Also added tests for names and names<-, which had no tests previously.
I merged with PR 8984 (coltypes). Clicked the wrong thing and screwed up the PR, so I recreated it here. Was #9218.
shivaram sun-rui
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9654 from felixcheung/colnamescoltypes.
|
| |
tests, fix doc and add examples
shivaram sun-rui
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10019 from felixcheung/rfunctionsdoc.
|
| |
value is null literals
When calling `get_json_object` for the following two cases, both results are `"null"`:
```scala
val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil
val df: DataFrame = tuple.toDF("key", "jstring")
val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect()
```
```scala
val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil
val df2: DataFrame = tuple2.toDF("key", "jstring")
val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect()
```
Fixed the problem and also added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10018 from gatorsmile/get_json_object.
|