New user guide section ml-decision-tree.md, including code examples.
I have run all examples, including the Java ones.
CC: manishamde yanboliang mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8244 from jkbradley/ml-dt-docs.
Add warnings according to SPARK-8949 in `SparkContext`
- warnings in scaladoc
- log warnings when preferred locations feature is used through `SparkContext`'s constructor
However, I didn't find any documentation reference for this feature. Please point me to it if you know of any reference.
Author: Han JU <ju.han.felix@gmail.com>
Closes #7874 from darkjh/SPARK-8949.
By using `StringIndexer`, we can obtain an indexed label in a new column, so a downstream estimator that wants to use the string-indexed label should read this new column through the pipeline.
I think it is better to make this explicit in the documentation.
Author: lewuathe <lewuathe@me.com>
Closes #8205 from Lewuathe/SPARK-9977.
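As a rough illustration of what the indexed-label column contains: StringIndexer assigns indices by descending label frequency. A plain-Scala sketch of that convention (illustrative only, not Spark's implementation; the tie-breaking rule here is an assumption made to keep the sketch deterministic):

```scala
// Sketch of StringIndexer's convention: labels are indexed by descending
// frequency, most frequent label -> 0.0. Plain Scala, not the Spark code.
// A downstream pipeline stage would consume the resulting indexed column.
val labels = Seq("yes", "no", "yes", "yes", "maybe")

val counts = labels.groupBy(identity).map { case (l, occ) => (l, occ.size) }
// Break frequency ties alphabetically, just to make this sketch deterministic.
val ordered = counts.toSeq.sortBy { case (l, c) => (-c, l) }.map(_._1)
val labelToIndex: Map[String, Double] =
  ordered.zipWithIndex.map { case (l, i) => (l, i.toDouble) }.toMap

val indexedColumn = labels.map(labelToIndex) // Seq(0.0, 2.0, 0.0, 0.0, 1.0)
```

The point of the doc change: it is `indexedColumn` (the new column), not the raw string column, that a following estimator should be configured to use.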
Fix typo in ntile function.
Author: Moussa Taifi <moutai10@gmail.com>
Closes #8261 from moutai/patch-2.
`Lists.newArrayList` -> `Arrays.asList`
CC jkbradley feynmanliang
Anybody interested in replacing usages of `Lists.newArrayList` in the examples / source code too? This method isn't useful in Java 7 and beyond.
Author: Sean Owen <sowen@cloudera.com>
Closes #8272 from srowen/SPARK-10070.
Link was broken because it included tick marks.
Author: Bill Chambers <wchambers@ischool.berkeley.edu>
Closes #8302 from anabranch/patch-1.
spark.streaming.backpressure.{enable-->enabled} and fixed deprecated annotations
Small changes
- Renamed conf spark.streaming.backpressure.{enable --> enabled}
- Changed Java `@Deprecated` annotations to the Scala `@deprecated` annotation with more information.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8299 from tdas/SPARK-9967.
accesses cacheLocs
In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine):
```scala
val numItems = 100000
val s = Seq.fill(numItems)(1)
for (i <- 0 until numItems) s(i)
```
It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput.
This patch fixes this by replacing `Seq` with `Array`.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8178 from JoshRosen/dagscheduler-perf.
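The slowdown and the fix can be sketched outside Spark (illustrative only; the sizes are reduced from the description above so this runs quickly, and the names are not from the patch):

```scala
// Indexing a List is O(n); indexing an Array is O(1). Replacing the Seq
// (concretely a List) with an Array turns the implicit O(n^2) loop over
// partitions into O(n).
val numItems = 20000
val asSeq: Seq[Int] = Seq.fill(numItems)(1)       // concretely a List
val asArray: Array[Int] = Array.fill(numItems)(1)

def nanos(body: => Unit): Long = {
  val start = System.nanoTime(); body; System.nanoTime() - start
}

val listTime  = nanos { var i = 0; while (i < numItems) { asSeq(i);  i += 1 } }
val arrayTime = nanos { var i = 0; while (i < numItems) { asArray(i); i += 1 } }
// arrayTime is typically orders of magnitude smaller than listTime
```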
SPARK-9436 simplifies the Pregel code. The graphx-programming-guide needs to be updated accordingly, since it lists the old Pregel code.
Author: Alexander Ulanov <nashb@yandex.ru>
Closes #7831 from avulanov/SPARK-9508-pregel-doc2.
cc JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes #8245 from davies/python_doc.
UDFs on complex types
This is kind of a weird case, but given a sufficiently complex query plan (in this case a TungstenProject with an Exchange underneath), we could get NPEs on the executors because of when we were calling transformAllExpressions.
In general we should ensure that all transformations occur on the driver and not on the executors. Some reasons for avoiding executor-side transformations include:
* (this case) Some operator constructors require state, such as access to the Spark/SQL conf, so doing a makeCopy on the executor can fail.
* (an unrelated reason to avoid executor-side transformations) ExprIds are calculated using an atomic integer, so you can violate their uniqueness constraint by constructing them anywhere other than the driver.
This subsumes #8285.
Author: Reynold Xin <rxin@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #8295 from rxin/SPARK-10096.
In UnsafeRow, we used the private field of BigInteger for better performance, but it actually didn't contribute much (3% in one benchmark) to end-to-end runtime, and it makes the code non-portable (it may fail on other JVM implementations).
So we should use the public API instead.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes #8286 from davies/portable_decimal.
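For reference, the portable route to the same information goes through `BigDecimal`'s public accessors (a sketch of the API in question, not the UnsafeRow code):

```scala
import java.math.{BigDecimal => JBigDecimal, BigInteger}

// Public, portable API for decomposing a decimal into its unscaled value
// and scale -- no reliance on BigInteger's private fields, so it works on
// any JVM implementation.
val d = new JBigDecimal("123456.789")
val unscaled: BigInteger = d.unscaledValue() // BigInteger 123456789
val scale: Int = d.scale()                   // 3
val rebuilt = new JBigDecimal(unscaled, scale)
```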
- Add `when` and `otherwise` as `Column` methods
- Add `When` as an expression function
- Add `%otherwise%` infix as an alias of `otherwise`
Since R doesn't support method chaining, the `otherwise(when(condition, value), value)` style is a little annoying for me. If `%otherwise%` looks strange to shivaram, I can remove it. What do you think?
### JIRA
[[SPARK-10075] Add `when` expression function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8266 from yu-iskw/SPARK-10075.
HiveSparkSubmitSuite and HiveThriftServer2 test suites
The Scala process API has a known bug ([SI-8768] [1]), which may be the reason why several test suites that fork sub-processes are flaky.
This PR replaces the Scala process API with the Java process API in `CliSuite`, `HiveSparkSubmitSuite`, and `HiveThriftServer2`-related test suites to see whether it fixes these flaky tests.
[1]: https://issues.scala-lang.org/browse/SI-8768
Author: Cheng Lian <lian@databricks.com>
Closes #8168 from liancheng/spark-9939/use-java-process-api.
before setting trackerState to Started
Test failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3305/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/stop_gracefully/
There is a race condition where setting `trackerState` to `Started` can happen after `startReceiver` is called. In that case `startReceiver` won't start the receivers, because it uses `!isTrackerStarted` to check whether ReceiverTracker is stopping or stopped, while `trackerState` is actually still `Initialized` and will be changed to `Started` soon.
Therefore, we should use `isTrackerStopping || isTrackerStopped`.
Author: zsxwing <zsxwing@gmail.com>
Closes #8294 from zsxwing/SPARK-9504.
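The effect of the changed check can be sketched with a toy state machine (names are illustrative, not the ReceiverTracker code):

```scala
// Toy version of the race: startReceiver may run while the state is still
// Initialized. Gating on "!started" wrongly refuses startup in that window;
// gating on "stopping || stopped" does not.
object TrackerState extends Enumeration {
  val Initialized, Started, Stopping, Stopped = Value
}

// Old check: refuse to start receivers unless the tracker is already Started.
def oldCheckRefuses(state: TrackerState.Value): Boolean =
  state != TrackerState.Started

// Fixed check: refuse only if the tracker is shutting down or already stopped.
def newCheckRefuses(state: TrackerState.Value): Boolean =
  state == TrackerState.Stopping || state == TrackerState.Stopped
```

With `state = Initialized` (the racy window), the old check refuses to start the receivers while the fixed check lets them start; both checks still refuse once the tracker is `Stopping` or `Stopped`.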
generate blocks fills up to capacity
Generated blocks are inserted into an ArrayBlockingQueue, and another thread pulls blocks from that queue and pushes them into the BlockManager. If the queue fills up to capacity (default is 10 blocks), then inserting into the queue (done in the function updateCurrentBuffer) blocks inside a synchronized block. However, the thread that is pulling blocks from the queue takes the same lock to check the current state (active or stopped) while pulling. Since the block-generating thread is blocked on the lock (as the queue is full), the thread that is supposed to drain the queue also gets blocked. Ergo, deadlock.
Solution: Moved blocking call to ArrayBlockingQueue outside the synchronized to prevent deadlock.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8257 from tdas/SPARK-10072.
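The fixed pattern, blocking queue insertion outside the shared lock, looks roughly like this (hypothetical names and a minimal sketch, not the actual streaming code):

```scala
import java.util.concurrent.ArrayBlockingQueue

// Deadlock-free pattern: check shared state under the lock, but perform the
// potentially blocking put() outside it, so a full queue can never block the
// thread that needs the same lock to drain the queue.
class BlockBuffer(capacity: Int = 10) {
  private val queue = new ArrayBlockingQueue[String](capacity)
  private val lock = new Object
  private var active = true

  def push(block: String): Boolean = {
    val accepting = lock.synchronized { active } // state check under the lock only
    if (accepting) queue.put(block)              // blocking call OUTSIDE the lock
    accepting
  }

  def drainOne(): Option[String] = lock.synchronized {
    if (active) Option(queue.poll()) else None   // poll() never blocks
  }

  def stop(): Unit = lock.synchronized { active = false }
}
```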
```
R/functions.R:74:1: style: lines should not be more than 100 characters.
jc <- callJStatic("org.apache.spark.sql.functions", "lit", ifelse(class(x) == "Column", xjc, x))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8297 from yu-iskw/minor-lint-r.
This proposes to remove the old MRJobConfig#DEFAULT_APPLICATION_CLASSPATH support, since we have now moved to the stable YARN API.
vanzin and sryza, any opinion on this? If we still want to support the old API, I can close it. But as far as I know, all major Hadoop releases have now moved to the stable API.
Author: jerryshao <sshao@hortonworks.com>
Closes #8192 from jerryshao/SPARK-9969.
This patch is against master, but we need to apply it to 1.5 branch as well.
cc shivaram and rxin
Author: Hossein <hossein@databricks.com>
Closes #8291 from falaki/SparkRVersion1.5.
mengxr jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes #8184 from feynmanliang/SPARK-9889-DCT-docs.
FailureSuite
Failures in streaming.FailureSuite can leak a StreamingContext and SparkContext, which fails all subsequent tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8289 from tdas/SPARK-10098.
Currently there is no test case for `Params#arrayLengthGt`.
Author: lewuathe <lewuathe@me.com>
Closes #8223 from Lewuathe/SPARK-10012.
Added since tags to mllib.tree
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes #7380 from BryanCutler/sinceTag-mllibTree-8924.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8282 from vanzin/SPARK-10088.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8283 from vanzin/SPARK-10089.
Add a new test case in yarn/ClientSuite which checks how the various SparkConf
and ClientArguments propagate into the ApplicationSubmissionContext.
Author: Dennis Huo <dhuo@google.com>
Closes #8072 from dennishuo/dhuo-yarn-application-tags.
Turns out that inner classes of inner objects are referenced directly, and thus moving them will break binary compatibility.
Author: Michael Armbrust <michael@databricks.com>
Closes #8281 from marmbrus/binaryCompat.
spark-streaming-XXX-assembly jars
Removed contents already included in Spark assembly jar from spark-streaming-XXX-assembly jars.
Author: zsxwing <zsxwing@gmail.com>
Closes #8069 from zsxwing/SPARK-9574.
See https://issues.apache.org/jira/browse/SPARK-10085
Author: Piotr Migdal <pmigdal@gmail.com>
Closes #8284 from stared/spark-10085.
Add Python example for mllib LDAModel user guide
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8227 from yanboliang/spark-10032.
user guide
Add Python examples for mllib IsotonicRegression user guide
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8225 from yanboliang/spark-10029.
Updates FPM user guide to include Association Rules.
Author: Feynman Liang <fliang@databricks.com>
Closes #8207 from feynmanliang/SPARK-9900-arules.
The fix for SPARK-7736 introduced a race where a port value of "-1"
could be passed down to the pyspark process, causing it to fail to
connect back to the JVM. This change adds code to fix that race.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8258 from vanzin/SPARK-7736.
CountVectorizerModel
jira: https://issues.apache.org/jira/browse/SPARK-9028
Add an estimator for CountVectorizerModel. The estimator extracts a vocabulary from document collections according to term frequency.
I changed the meaning of minCount to be a filter across the corpus. This aligns with Word2Vec and the similar parameter in scikit-learn.
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #7388 from hhbyyh/cvEstimator.
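Conceptually, the fitting step boils down to counting terms across the corpus and keeping those at or above minCount. A plain-Scala sketch of that idea (illustrative only, not the Spark estimator):

```scala
// Sketch of vocabulary extraction with minCount as a corpus-wide filter:
// a term enters the vocabulary only if its total count across ALL documents
// reaches minCount.
val docs = Seq(
  Seq("a", "b", "c", "d"),
  Seq("a", "b", "b", "c", "a")
)
val minCount = 2

val termCounts = docs.flatten.groupBy(identity).map { case (t, occ) => (t, occ.size) }
val vocabulary = termCounts.toSeq
  .filter { case (_, c) => c >= minCount }  // corpus-wide frequency filter
  .sortBy { case (t, c) => (-c, t) }        // most frequent terms first
  .map(_._1)
// counts are a=3, b=3, c=2, d=1, so "d" is filtered out by minCount = 2
```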
parameters functions
### JIRA
[[SPARK-10007] Update `NAMESPACE` file in SparkR for simple parameters functions - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10007)
Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8277 from yu-iskw/SPARK-10007.
Parquet hard-codes a JUL logger that always writes to stdout. This PR redirects it via the SLF4J JUL bridge handler, so that we can control Parquet logs via `log4j.properties`.
This solution is inspired by https://github.com/Parquet/parquet-mr/issues/390#issuecomment-46064909.
Author: Cheng Lian <lian@databricks.com>
Closes #8196 from liancheng/spark-8118/redirect-parquet-jul.
It might be a typo introduced at the very beginning, or a leftover from some renaming: the method accessing the index file is now called `getBlockData` (not `getBlockLocation`, as the comment indicates).
Author: CodingCat <zhunansjtu@gmail.com>
Closes #8238 from CodingCat/minor_1.
Fix the issue that `layers` and `weights` should be public variables of `MultilayerPerceptronClassificationModel`. Users currently cannot get `layers` and `weights` from a `MultilayerPerceptronClassificationModel`.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8263 from yanboliang/mlp-public.
binary in ArrayData
The type for an array of arrays in Java is slightly different from arrays of other element types.
cc cloud-fan
Author: Davies Liu <davies@databricks.com>
Closes #8250 from davies/array_binary.
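One reason arrays of binary are a special case: nested Java arrays don't compare element-wise, so they need dedicated handling. An illustration of the pitfall (not the ArrayData fix itself):

```scala
import java.util.Arrays

// == on arrays is reference equality, and even Arrays.equals is shallow for
// nested arrays; only deepEquals compares the byte[] elements by content.
val a: Array[Array[Byte]] = Array(Array[Byte](1, 2), Array[Byte](3))
val b: Array[Array[Byte]] = Array(Array[Byte](1, 2), Array[Byte](3))

val byReference = a == b                      // false: different array objects
val deep = Arrays.deepEquals(
  a.asInstanceOf[Array[AnyRef]],
  b.asInstanceOf[Array[AnyRef]])              // true: element-wise, content-based
```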
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8265 from yu-iskw/minor-translate-comment.
This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #8260 from mengxr/SPARK-7808.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8251 from vanzin/SPARK-10059.
KS test
added doc examples for python.
Author: jose.cambronero <jose.cambronero@cloudera.com>
Closes #8154 from josepablocam/spark_9902.
Author: Sandy Ryza <sandy@cloudera.com>
Closes #8230 from sryza/sandy-spark-7707.
Adds user guide for `PrefixSpan`, including Scala and Java example code.
mengxr zhangjiajin
Author: Feynman Liang <fliang@databricks.com>
Closes #8253 from feynmanliang/SPARK-9898.
Added since tags to mllib.regression
Author: Prayag Chandran <prayagchandran@gmail.com>
Closes #7518 from prayagchandran/sinceTags and squashes the following commits:
fa4dda2 [Prayag Chandran] Re-formatting
6c6d584 [Prayag Chandran] Corrected a few tags. Removed a few unnecessary tags
1a0365f [Prayag Chandran] Reformatting and adding a few more tags
89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
ml.feature.ElementwiseProduct
Add Python API, user guide and example for ml.feature.ElementwiseProduct.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8061 from yanboliang/SPARK-9768.
com.twitter:parquet-hadoop-bundle:1.6.0 is in SBT assembly jar
PR #7967 enables Spark SQL to persist Parquet tables in a Hive-compatible format when possible. One of the consequences is that we have to set the input/output classes to `MapredParquetInputFormat`/`MapredParquetOutputFormat`, which rely on com.twitter:parquet-hadoop:1.6.0 bundled with Hive 1.2.1.
When loading such a table in Spark SQL, `o.a.h.h.ql.metadata.Table` first loads these input/output format classes, and thus classes in com.twitter:parquet-hadoop:1.6.0. However, the scope of this dependency is defined as "runtime", so it is not packaged into the Spark assembly jar. This results in a `ClassNotFoundException`.
This issue can be worked around by asking users to add parquet-hadoop 1.6.0 via the `--driver-class-path` option. However, considering that the Maven build is immune to this problem, I feel it can be confusing and inconvenient for users.
So this PR fixes this issue by changing scope of parquet-hadoop 1.6.0 to "compile".
Author: Cheng Lian <lian@databricks.com>
Closes #8198 from liancheng/spark-9974/bundle-parquet-1.6.0.
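The scope change presumably amounts to a one-line pom edit along these lines (hypothetical fragment; only the coordinates come from the description above, and the exact artifactId in the build may differ):

```xml
<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-hadoop-bundle</artifactId>
  <version>1.6.0</version>
  <!-- previously "runtime", which kept these classes out of the assembly jar -->
  <scope>compile</scope>
</dependency>
```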
Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.Samavihome>
Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.local>
Closes #7729 from sabhyankar/branch_8920.
mengxr jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes #8255 from feynmanliang/SPARK-10068.