| Commit message | Author | Age | Files | Lines |
| |
In particular, when HADOOP_CONF_DIR is not specified.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #488 from pwendell/hadoop-cleanup and squashes the following commits:
fe95f13 [Patrick Wendell] Changes based on Andrew's feeback
18d09c1 [Patrick Wendell] Review comments from Andrew
17929cc [Patrick Wendell] Assorted clean-up for Spark-on-YARN.
|
| |
I think I hit a class loading issue when running the JavaSparkSQL example with spark-submit in local mode.
Author: Kan Zhang <kzhang@apache.org>
Closes #484 from kanzhang/SPARK-1570 and squashes the following commits:
feaaeba [Kan Zhang] [SPARK-1570] Fix classloading in JavaSQLContext.applySchema
|
| |
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #483 from vanzin/yarn-2.4 and squashes the following commits:
0fc57d8 [Marcelo Vanzin] Fix compilation on Hadoop 2.4.x.
|
| |
**Bug**: In the existing history server, there is a delay of `spark.history.updateInterval` seconds before application logs show up on the UI.
**Cause**: This is because the following events happen in this order: (1) The background thread that checks for logs starts, but realizes the server has not yet bound and so waits for N seconds, (2) server binds, (3) N seconds later the background thread finds that the server has finally bound to a port, and so finally checks for application logs.
**Fix**: This PR forces the log checking thread to start immediately after binding. It also documents two relevant environment variables that are currently missing.
Author: Andrew Or <andrewor14@gmail.com>
Closes #441 from andrewor14/history-server-fix and squashes the following commits:
b2eb46e [Andrew Or] Document SPARK_PUBLIC_DNS and SPARK_HISTORY_OPTS for the history server
e8d1fbc [Andrew Or] Eliminate delay between binding and checking for logs
|
| |
Preview: http://54.82.240.23:4000/mllib-guide.html
Table of contents:
* Basics
  * Data types
  * Summary statistics
* Classification and regression
  * linear support vector machine (SVM)
  * logistic regression
  * linear least squares, Lasso, and ridge regression
  * decision tree
  * naive Bayes
* Collaborative Filtering
  * alternating least squares (ALS)
* Clustering
  * k-means
* Dimensionality reduction
  * singular value decomposition (SVD)
  * principal component analysis (PCA)
* Optimization
  * stochastic gradient descent
  * limited-memory BFGS (L-BFGS)
Author: Xiangrui Meng <meng@databricks.com>
Closes #422 from mengxr/mllib-doc and squashes the following commits:
944e3a9 [Xiangrui Meng] merge master
f9fda28 [Xiangrui Meng] minor
9474065 [Xiangrui Meng] add alpha to ALS examples
928e630 [Xiangrui Meng] initialization_mode -> initializationMode
5bbff49 [Xiangrui Meng] add imports to labeled point examples
c17440d [Xiangrui Meng] fix python nb example
28f40dc [Xiangrui Meng] remove localhost:4000
369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc
7dc95cc [Xiangrui Meng] update linear methods
053ad8a [Xiangrui Meng] add links to go back to the main page
abbbf7e [Xiangrui Meng] update ALS argument names
648283e [Xiangrui Meng] level down statistics
14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide
8cd2441 [Xiangrui Meng] minor updates
186ab07 [Xiangrui Meng] update section names
6568d65 [Xiangrui Meng] update toc, level up lr and svm
162ee12 [Xiangrui Meng] rename section names
5c1e1b1 [Xiangrui Meng] minor
8aeaba1 [Xiangrui Meng] wrap long lines
6ce6a6f [Xiangrui Meng] add summary statistics to toc
5760045 [Xiangrui Meng] claim beta
cc604bf [Xiangrui Meng] remove classification and regression
92747b3 [Xiangrui Meng] make section titles consistent
e605dd6 [Xiangrui Meng] add LIBSVM loader
f639674 [Xiangrui Meng] add python section to migration guide
c82ffb4 [Xiangrui Meng] clean optimization
31660eb [Xiangrui Meng] update linear algebra and stat
0a40837 [Xiangrui Meng] first pass over linear methods
1fc8271 [Xiangrui Meng] update toc
906ed0a [Xiangrui Meng] add a python example to naive bayes
5f0a700 [Xiangrui Meng] update collaborative filtering
656d416 [Xiangrui Meng] update mllib-clustering
86e143a [Xiangrui Meng] remove data types section from main page
8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples
d1b5cbf [Xiangrui Meng] merge master
72e4804 [Xiangrui Meng] one pass over tree guide
64f8995 [Xiangrui Meng] move decision tree guide to a separate file
9fca001 [Xiangrui Meng] add first version of linear algebra guide
53c9552 [Xiangrui Meng] update dependencies
f316ec2 [Xiangrui Meng] add migration guide
f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction
182460f [Xiangrui Meng] add guide for naive Bayes
137fd1d [Xiangrui Meng] re-organize toc
a61e434 [Xiangrui Meng] update mllib's toc
|
| |
ALS was using HashPartitioner and explicit uses of `%` together. Further, the naked use of `%` meant that, if the number of partitions corresponded with the stride of arithmetic progressions appearing in user and product ids, users and products could be mapped into buckets in an unfair or unwise way.
This pull request:
1) Makes the Partitioner an instance variable of ALS.
2) Replaces the direct uses of `%` with calls to a Partitioner.
3) Defines an anonymous Partitioner that scrambles the bits of the object's hashCode before reducing to the number of present buckets.
This pull request does not make the partitioner user-configurable.
I'm not all that happy about the way I did (1). It introduces an icky lifetime issue and dances around it by nulling something. However, I don't know a better way to make the partitioner visible everywhere it needs to be visible.
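As an illustration of point (3), a partitioner along these lines might look like the sketch below; the bit-mixing function here is an assumption for illustration, not the exact one used in this PR.
```scala
import org.apache.spark.Partitioner

// Hypothetical sketch of a hash-scrambling partitioner: mix the key's hashCode
// before reducing modulo numPartitions, so ids that form arithmetic progressions
// do not all collapse into a few buckets.
class ScrambledPartitioner(override val numPartitions: Int) extends Partitioner {
  private def mix(h: Int): Int = {
    val x = h * 0x9e3779b1  // multiply by an odd constant to spread low-order patterns
    x ^ (x >>> 16)          // fold high bits back into the low bits
  }

  override def getPartition(key: Any): Int = {
    val mod = mix(key.hashCode) % numPartitions
    if (mod < 0) mod + numPartitions else mod  // keep the bucket index non-negative
  }
}
```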
Author: Tor Myklebust <tmyklebu@gmail.com>
Closes #407 from tmyklebu/master and squashes the following commits:
dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
674933a [Tor Myklebust] Fix style.
40edc23 [Tor Myklebust] Fix missing space.
f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
36a0f43 [Tor Myklebust] Make the partitioner private.
d872b09 [Tor Myklebust] Add negative id ALS test.
df27697 [Tor Myklebust] Support custom partitioners. Currently we use the same partitioner for users and products.
c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
|
| |
Without a `transpose()` on `self.theta`, a
*ValueError: matrices are not aligned*
occurs. The previous test case simply ignored this situation.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #463 from yinxusen/python-naive-bayes and squashes the following commits:
fcbe3bc [Xusen Yin] fix bugs of dot in python
|
| |
Changed the Pyrolite dependency to a build which targets Java 6.
Author: Ahir Reddy <ahirreddy@gmail.com>
Closes #479 from ahirreddy/java6-pyrolite and squashes the following commits:
8ea25d3 [Ahir Reddy] Updated maven build to use java 6 compatible pyrolite
dabc703 [Ahir Reddy] Updated Pyrolite dependency to be Java 6 compatible
|
| |
The original PR was merged before this mistake was found, so the fix goes here.
Sorry about that @pwendell, @andrewor14; I will be more careful next time.
Author: CodingCat <zhunansjtu@gmail.com>
Closes #474 from CodingCat/hotfix_1399 and squashes the following commits:
f3a8ba9 [CodingCat] move outdated comments
|
| |
A simple change; most of the work was updating a bunch of example code.
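For example, a call site after this change might look roughly like the sketch below; the `setJars` usage here is assumed for illustration.
```scala
import org.apache.spark.{SparkConf, SparkContext}

object Example {
  def main(args: Array[String]): Unit = {
    // jarOfClass now returns Option[String] instead of Seq[String],
    // so convert it explicitly where a sequence of jars is expected.
    val conf = new SparkConf()
      .setAppName("Example")
      .setJars(SparkContext.jarOfClass(this.getClass).toSeq)
    val sc = new SparkContext(conf)
    sc.stop()
  }
}
```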
Author: Patrick Wendell <pwendell@gmail.com>
Closes #438 from pwendell/jar-of-class and squashes the following commits:
aa010ff [Patrick Wendell] SPARK-1496: Have jarOfClass return Option[String]
|
| |
Use local path (and not complete URL) when opening local log file.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #375 from vanzin/event-file and squashes the following commits:
f673029 [Marcelo Vanzin] [SPARK-1459] Use local path (and not complete URL) when opening local log file.
|
| |
... so that we don't follow an unspoken set of forbidden rules for adding **@AlphaComponent**, **@DeveloperApi**, and **@Experimental** annotations in the code.
In addition, this PR
(1) removes unnecessary `:: * ::` tags,
(2) adds missing `:: * ::` tags, and
(3) removes annotations for internal APIs.
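For reference, the documented convention pairs the annotation with a matching scaladoc tag; a hypothetical example (the class name is made up) might look like:
```scala
import org.apache.spark.annotation.Experimental

/**
 * :: Experimental ::
 * A model produced by an experimental algorithm; this API may change in a future release.
 */
@Experimental
class ExampleModel(val weights: Array[Double])
```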
Author: Andrew Or <andrewor14@gmail.com>
Closes #470 from andrewor14/annotations-fix and squashes the following commits:
92a7f42 [Andrew Or] Document + fix annotation usages
|
| |
I used the sbt-unidoc plugin (https://github.com/sbt/sbt-unidoc) to create a unified Scaladoc of our public packages, and to generate Javadocs as well. One limitation is that I haven't found an easy way to exclude packages from the Javadoc; there is an SBT task that identifies Java sources to run javadoc on, but it's been very difficult to modify it from outside to change what is set in the unidoc package. Some SBT-savvy people should help with this. The Javadoc site also lacks package-level descriptions and things like that, so we may want to look into that. We may decide not to post these right now if it's too limited compared to the Scala one.
Example of the built doc site: http://people.csail.mit.edu/matei/spark-unified-docs/
Author: Matei Zaharia <matei@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <pwendell@gmail.com>
Closes #457 from mateiz/better-docs and squashes the following commits:
a63d4a3 [Matei Zaharia] Skip Java/Scala API docs for Python package
5ea1f43 [Matei Zaharia] Fix links to Java classes in Java guide, fix some JS for scrolling to anchors on page load
f05abc0 [Matei Zaharia] Don't include java.lang package names
995e992 [Matei Zaharia] Skip internal packages and class names with $ in JavaDoc
a14a93c [Matei Zaharia] typo
76ce64d [Matei Zaharia] Add groups to Javadoc index page, and a first package-info.java
ed6f994 [Matei Zaharia] Generate JavaDoc as well, add titles, update doc site to use unified docs
acb993d [Matei Zaharia] Add Unidoc plugin for the projects we want Unidoced
|
| |
[WIP]
The current Network Receiver API makes it slightly complicated to write a new receiver, as one needs to create an instance of BlockGenerator, as shown in SocketReceiver:
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/SocketInputDStream.scala#L51
Exposing the BlockGenerator interface has made it harder to improve the receiving process. The API of NetworkReceiver (which was not a very stable API anyway) needs to be changed if we are to ensure future stability.
Additionally, the functions like streamingContext.socketStream that create input streams return DStream objects. That makes it hard to expose functionality (say, rate limits) unique to input DStreams. They should return InputDStream or NetworkInputDStream. This is not yet implemented.
This PR is blocked on the graceful shutdown PR #247
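For a rough idea of what writing a receiver looks like against the refactored API, here is a hedged sketch based on the description above; class and method names may differ slightly from the final API.
```scala
import java.net.Socket
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A minimal custom receiver: connect to a socket and store each line.
class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive on a separate thread so that onStart() returns quickly.
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing special here: the reading thread exits when the socket is closed.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      for (line <- Source.fromInputStream(socket.getInputStream).getLines()) {
        store(line)  // hand each record to Spark, which handles block generation
      }
      socket.close()
      restart("Connection closed, restarting")
    } catch {
      case e: Exception => restart("Error receiving data", e)
    }
  }
}
```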
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #300 from tdas/network-receiver-api and squashes the following commits:
ea27b38 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into network-receiver-api
3a4777c [Tathagata Das] Renamed NetworkInputDStream to ReceiverInputDStream, and ActorReceiver related stuff.
838dd39 [Tathagata Das] Added more events to the StreamingListener to report errors and stopped receivers.
a75c7a6 [Tathagata Das] Address some PR comments and fixed other issues.
91bfa72 [Tathagata Das] Fixed bugs.
8533094 [Tathagata Das] Scala style fixes.
028bde6 [Tathagata Das] Further refactored receiver to allow restarting of a receiver.
43f5290 [Tathagata Das] Made functions that create input streams return InputDStream and NetworkInputDStream, for both Scala and Java.
2c94579 [Tathagata Das] Fixed graceful shutdown by removing interrupts on receiving thread.
9e37a0b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into network-receiver-api
3223e95 [Tathagata Das] Refactored the code that runs the NetworkReceiver into further classes and traits to make them more testable.
a36cc48 [Tathagata Das] Refactored the NetworkReceiver API for future stability.
|
| |
https://issues.apache.org/jira/browse/SPARK-1399
Refactor StageTable a bit to support an additional column for failed stages.
Author: CodingCat <zhunansjtu@gmail.com>
Author: Nan Zhu <CodingCat@users.noreply.github.com>
Closes #421 from CodingCat/SPARK-1399 and squashes the following commits:
2caba36 [CodingCat] remove dummy tag
77cf305 [CodingCat] create dummy element to wrap columns
3989ce2 [CodingCat] address Aaron's comments
18fc09f [Nan Zhu] fix compile error
00ea30a [Nan Zhu] address Kay's comments
16ac83d [CodingCat] set a default value of failureReason
35df3df [CodingCat] address andrew's comments
06d21a4 [CodingCat] address andrew's comments
25a6db6 [CodingCat] style fix
dc8856d [CodingCat] show stage failure reason in UI
|
| |
SPARK-1386 changed RDDPage to RddPage but didn't change the filename. I tried sbt/sbt publish-local. Inside the spark-core jar, the class file is RDDPage.class, and hence I got the following error:
~~~
[error] (run-main) java.lang.NoClassDefFoundError: org/apache/spark/ui/storage/RddPage
java.lang.NoClassDefFoundError: org/apache/spark/ui/storage/RddPage
at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:59)
at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:52)
at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:42)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:215)
at MovieLensALS$.main(MovieLensALS.scala:38)
at MovieLensALS.main(MovieLensALS.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.ui.storage.RddPage
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:59)
at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:52)
at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:42)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:215)
at MovieLensALS$.main(MovieLensALS.scala:38)
at MovieLensALS.main(MovieLensALS.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
~~~
This can be fixed after renaming RddPage to RDDPage, or renaming RDDPage.scala to RddPage.scala. I chose the former since the name `RDD` is common in Spark code.
Author: Xiangrui Meng <meng@databricks.com>
Closes #454 from mengxr/rddpage-fix and squashes the following commits:
f75e544 [Xiangrui Meng] rename RddPage to RDDPage
|
| |
#446 faced a connection refused exception from these tests, causing them to time out and fail after a long time. For now, let's disable these tests.
(We recently disabled the corresponding test in streaming in 7863ecca35be9af1eca0dfe5fd8806c5dd710fd6; these tests are very similar.)
Author: Andrew Or <andrewor14@gmail.com>
Closes #466 from andrewor14/ignore-ui-tests and squashes the following commits:
6f5a362 [Andrew Or] Ignore org.apache.spark.ui.UISuite tests
|
| |
Over time, as we've added more deployment modes, user-facing configuration in Spark has gotten a bit unwieldy. Going forward we'll advise all users to run `spark-submit` to launch applications. This is a WIP patch but it makes the following improvements:
1. Improved `spark-env.sh.template` which was missing a lot of things users now set in that file.
2. Removes the shipping of SPARK_CLASSPATH, SPARK_JAVA_OPTS, and SPARK_LIBRARY_PATH to the executors on the cluster. This was an ugly hack. Instead it introduces config variables spark.executor.extraJavaOpts, spark.executor.extraLibraryPath, and spark.executor.extraClassPath.
3. Adds ability to set these same variables for the driver using `spark-submit`.
4. Allows you to load system properties from a `spark-defaults.conf` file when running `spark-submit`. This will allow setting both SparkConf options and other system properties utilized by `spark-submit`.
5. Made `SPARK_LOCAL_IP` an environment variable rather than a SparkConf property. This is more consistent with it being set on each node.
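A hedged sketch of what items 2-4 enable, setting the new executor properties programmatically with SparkConf; the same keys can go in `conf/spark-defaults.conf` for `spark-submit`. The values are made up, and the exact key spellings are assumptions based on the description above.
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails")  // extra JVM options for executors
  .set("spark.executor.extraClassPath", "/opt/deps/extra.jar")    // prepended to the executor classpath
  .set("spark.executor.extraLibraryPath", "/opt/native/lib")      // native library path for executors
```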
Author: Patrick Wendell <pwendell@gmail.com>
Closes #299 from pwendell/config-cleanup and squashes the following commits:
127f301 [Patrick Wendell] Improvements to testing
a006464 [Patrick Wendell] Moving properties file template.
b4b496c [Patrick Wendell] spark-defaults.properties -> spark-defaults.conf
0086939 [Patrick Wendell] Minor style fixes
af09e3e [Patrick Wendell] Mention config file in docs and clean-up docs
b16e6a2 [Patrick Wendell] Cleanup of spark-submit script and Scala quick start guide
af0adf7 [Patrick Wendell] Automatically add user jar
a56b125 [Patrick Wendell] Responses to Tom's review
d50c388 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
a762901 [Patrick Wendell] Fixing test failures
ffa00fe [Patrick Wendell] Review feedback
fda0301 [Patrick Wendell] Note
308f1f6 [Patrick Wendell] Properly escape quotes and other clean-up for YARN
e83cd8f [Patrick Wendell] Changes to allow re-use of test applications
be42f35 [Patrick Wendell] Handle case where SPARK_HOME is not set
c2a2909 [Patrick Wendell] Test compile fixes
4ee6f9d [Patrick Wendell] Making YARN doc changes consistent
afc9ed8 [Patrick Wendell] Cleaning up line limits and two compile errors.
b08893b [Patrick Wendell] Additional improvements.
ace4ead [Patrick Wendell] Responses to review feedback.
b72d183 [Patrick Wendell] Review feedback for spark env file
46555c1 [Patrick Wendell] Review feedback and import clean-ups
437aed1 [Patrick Wendell] Small fix
761ebcd [Patrick Wendell] Library path and classpath for drivers
7cc70e4 [Patrick Wendell] Clean up terminology inside of spark-env script
5b0ba8e [Patrick Wendell] Don't ship executor envs
84cc5e5 [Patrick Wendell] Small clean-up
1f75238 [Patrick Wendell] SPARK_JAVA_OPTS --> SPARK_MASTER_OPTS for master settings
4982331 [Patrick Wendell] Remove SPARK_LIBRARY_PATH
6eaf7d0 [Patrick Wendell] executorJavaOpts
0faa3b6 [Patrick Wendell] Stash of adding config options in submit script and YARN
ac2d65e [Patrick Wendell] Change spark.local.dir -> SPARK_LOCAL_DIRS
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #451 from marmbrus/replCleanup and squashes the following commits:
088526a [Michael Armbrust] REPL cleanup.
|
| |
`new DoubleMatrix(double[])` creates a garbage `double[]` of the same length as its argument and immediately throws it away. This pull request avoids that constructor in the ALS code.
Author: Tor Myklebust <tmyklebu@gmail.com>
Closes #442 from tmyklebu/foo2 and squashes the following commits:
2784fc5 [Tor Myklebust] Mention that this is probably fixed as of jblas 1.2.4; repunctuate.
a09904f [Tor Myklebust] Helper function for wrapping Array[Double]'s with DoubleMatrix's.
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #447 from marmbrus/pythonInsert and squashes the following commits:
c7ab692 [Michael Armbrust] Keep docstrings < 72 chars.
ff62870 [Michael Armbrust] Add insertInto and saveAsTable to Python API.
|
| |
This gets rid of a warning when compiling core (since we were depending on a deprecated interface with a non-deprecated function). I also tested with javac, and this does the right thing when compiling java code.
Author: Michael Armbrust <michael@databricks.com>
Closes #452 from marmbrus/scalaDeprecation and squashes the following commits:
f628b4d [Michael Armbrust] Use scala deprecation instead of java.
|
| |
Author: Reynold Xin <rxin@apache.org>
Closes #443 from rxin/readme and squashes the following commits:
16853de [Reynold Xin] Updated SBT and Scala instructions.
3ac3ceb [Reynold Xin] README update
|
| |
Fix potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset.
`writer.close` should be put in the `finally` block to avoid potential resource leaks.
JIRA: https://issues.apache.org/jira/browse/SPARK-1482
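The general pattern is sketched below with a plain file writer; the save methods apply the same idea to their Hadoop record writers, and the file name and records here are illustrative.
```scala
import java.io.{BufferedWriter, FileWriter}

// Close in `finally` so a failure while writing records cannot leak the writer.
def writeAll(path: String, records: Iterator[String]): Unit = {
  val writer = new BufferedWriter(new FileWriter(path))
  try {
    records.foreach { record =>
      writer.write(record)
      writer.newLine()
    }
  } finally {
    writer.close()
  }
}
```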
Author: zsxwing <zsxwing@gmail.com>
Closes #400 from zsxwing/SPARK-1482 and squashes the following commits:
06b197a [zsxwing] SPARK-1482: Fix potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset
|
| |
Remove view bounds on Ordered in favor of a context bound on Ordering.
This doesn't require creating new Ordering objects per row. Additionally, [view bounds are going to be deprecated](https://issues.scala-lang.org/browse/SI-7629), so we should get rid of them while APIs are still flexible.
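A hedged before/after sketch of the signature change; the method names are illustrative, not the actual Spark API.
```scala
// Before: a view bound requires an implicit conversion to Ordered and wraps
// elements in Ordered objects when comparing.
def maxKeyOld[K <% Ordered[K]](keys: Seq[K]): K =
  keys.reduce((a, b) => if (a >= b) a else b)

// After: a context bound on Ordering compares elements directly, with no
// per-element wrappers and no deprecated view bound.
def maxKeyNew[K: Ordering](keys: Seq[K]): K = {
  val ord = implicitly[Ordering[K]]
  keys.reduce((a, b) => if (ord.gteq(a, b)) a else b)
}
```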
Author: Michael Armbrust <michael@databricks.com>
Closes #410 from marmbrus/viewBounds and squashes the following commits:
c574221 [Michael Armbrust] fix example.
812008e [Michael Armbrust] Update Java API.
1b9b85c [Michael Armbrust] Update scala doc.
35798a8 [Michael Armbrust] Remove view bounds on Ordered in favor of a context bound on Ordering.
|
| |
Author: Reynold Xin <rxin@apache.org>
Closes #444 from rxin/pyspark and squashes the following commits:
fc11356 [Reynold Xin] Made the PySpark shell version checking compatible with Python 2.6.
571830b [Reynold Xin] Fixed broken pyspark shell.
|
| |
This is split out from https://github.com/apache/spark/pull/85, as suggested by @rxin.
Compare
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala#L122
and
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala#L117
The first one uses get and then toLong, the second one uses getLong; better to make them consistent.
Very, very small fix.
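In other words, the two call sites read settings in two different styles, roughly as below; the key names here are for illustration.
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// One call site: fetch the value as a string, then convert it.
val pauses = conf.get("spark.akka.heartbeat.pauses", "600").toLong
// The other: fetch it directly as a long, the more readable form.
val interval = conf.getLong("spark.akka.heartbeat.interval", 1000)
```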
Author: CodingCat <zhunansjtu@gmail.com>
Closes #434 from CodingCat/SPARK-1523 and squashes the following commits:
0e86f3f [CodingCat] improve the readability of code in AkkaUtil
|
| |
Per discussion, this is my suggestion to make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0. See what you think of this much.
Author: Sean Owen <sowen@cloudera.com>
Closes #372 from srowen/SPARK-1357Addendum and squashes the following commits:
17cf1ea [Sean Owen] Remove (another) blank line after ":: Experimental ::"
6800e4c [Sean Owen] Remove blank line after ":: Experimental ::"
b3a88d2 [Sean Owen] Make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0
|
| |
A quick fix for https://issues.apache.org/jira/browse/SPARK-1520
By excluding fastutil, we bring the number of files in the assembly jar back under 65536, so Java 7 won't create the assembly jar in zip64 format, which cannot be read by Java 6.
With this change, the assembly jar now has about 60000 entries (58000 files), tested with both sbt and maven.
Author: Xiangrui Meng <meng@databricks.com>
Closes #437 from mengxr/remove-fastutil and squashes the following commits:
00f9beb [Xiangrui Meng] remove fastutil from dependencies
|
| |
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #432 from liancheng/reuseRow and squashes the following commits:
9e6d083 [Cheng Lian] Simplified code with BufferedIterator
52acec9 [Cheng Lian] Reuses Row object in ExistingRdd.productToRowRdd()
|
| |
https://issues.apache.org/jira/browse/SPARK-1483
From the original JIRA: " The parameter name is part of the public API in Scala and Python, since you can pass named parameters to a method, so we should name it to this more descriptive term. Everywhere else we refer to "splits" as partitions." - @mateiz
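A small usage sketch after the rename, assuming an existing SparkContext `sc`; the path is made up.
```scala
// The named parameter is now minPartitions; minSplits is deprecated.
val lines = sc.textFile("hdfs:///data/input.txt", minPartitions = 8)
```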
Author: CodingCat <zhunansjtu@gmail.com>
Closes #430 from CodingCat/SPARK-1483 and squashes the following commits:
4b60541 [CodingCat] deprecate defaultMinSplits
ba2c663 [CodingCat] Rename minSplits to minPartitions in public APIs
|
| |
This is currently causing many builds to hang.
https://issues.apache.org/jira/browse/SPARK-1530
Author: Patrick Wendell <pwendell@gmail.com>
Closes #440 from pwendell/uitest-fix and squashes the following commits:
9a143dc [Patrick Wendell] Ignore streaming UI test
|
| |
This will make the tests more stable when not running SQL tests.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #439 from pwendell/hive-tests and squashes the following commits:
88a6032 [Patrick Wendell] FIX: Don't build Hive in assembly unless running Hive tests.
|
| |
Modify Spark on Yarn to point to the history server when the application finishes.
Note this is dependent on https://github.com/apache/spark/pull/204 to have a working history server, but there are no code dependencies.
This also fixes SPARK-1288 (yarn stable finishApplicationMaster incomplete). Since I was in there, I made the diagnostic message get passed properly.
Author: Thomas Graves <tgraves@apache.org>
Closes #362 from tgravescs/SPARK-1408 and squashes the following commits:
ec89705 [Thomas Graves] Fix typo.
446122d [Thomas Graves] Make config yarn specific
f5d5373 [Thomas Graves] SPARK-1408 Modify Spark on Yarn to point to the history server when app finishes
|
| |
This only works for the three paths defined in the environment
(SPARK_JAR, SPARK_YARN_APP_JAR and SPARK_LOG4J_CONF).
Tested by running SparkPi with local: and file: URIs against Yarn cluster (no "upload" shows up in logs in the local case).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #303 from vanzin/yarn-local and squashes the following commits:
82219c1 [Marcelo Vanzin] [SPARK-1395] Allow "local:" URIs to work on Yarn.
|
| |
pyspark requires Python 2, failing if the system default is Python 3 (checked from shell.py).
Python alternative for https://github.com/apache/spark/pull/392; managed from shell.py
Author: AbhishekKr <abhikumar163@gmail.com>
Closes #399 from abhishekkr/pyspark_shell and squashes the following commits:
134bdc9 [AbhishekKr] pyspark require Python2, failing if system default is Py3 from shell.py
|
| |
This will also fix SPARK-1464: Update MLLib Examples to Use Breeze.
Author: Sandeep <sandeep@techaddict.me>
Closes #416 from techaddict/1462 and squashes the following commits:
a43638e [Sandeep] Some Style Changes
3ce69c3 [Sandeep] Fix Ordering and Naming of Imports in Examples
6c7e543 [Sandeep] SPARK-1462: Examples of ML algorithms are using deprecated APIs
|
| |
It is very confusing when your code throws an exception, but the only stack trace shown is the one from the DAGScheduler. This is a simple patch to include the stack trace for the actual failure in the error message. Suggestions on formatting are welcome.
Before:
```
scala> sc.parallelize(1 :: Nil).map(_ => sys.error("Ahh!")).collect()
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:3 failed 1 times (most recent failure: Exception failure in TID 3 on host localhost: java.lang.RuntimeException: Ahh!)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1055)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1039)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1037)
...
```
After:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:3 failed 1 times, most recent failure: Exception failure in TID 3 on host localhost: java.lang.RuntimeException: Ahh!
scala.sys.package$.error(package.scala:27)
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13)
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$6.apply(RDD.scala:676)
org.apache.spark.rdd.RDD$$anonfun$6.apply(RDD.scala:676)
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1048)
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1048)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:110)
org.apache.spark.scheduler.Task.run(Task.scala:50)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:46)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1055)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1039)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1037)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1037)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:614)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:614)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:614)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:143)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
```
Author: Michael Armbrust <michael@databricks.com>
Closes #409 from marmbrus/stacktraces and squashes the following commits:
3e4eb65 [Michael Armbrust] indent. include header for driver stack trace.
018b06b [Michael Armbrust] Include stack trace for exceptions in user code.
|
| |
change _slideDuration to _windowDuration
Author: baishuo(白硕) <vc_java@hotmail.com>
Closes #425 from baishuo/master and squashes the following commits:
6f09ea1 [baishuo(白硕)] Update ReducedWindowedDStream.scala
|
| |
"By default, this uses only 8 parallel tasks to do the grouping." is a bit misleading. Please refer to https://github.com/apache/spark/pull/389
The details are in the following code:
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
  for (r <- bySize if r.partitioner.isDefined) {
    return r.partitioner.get
  }
  if (rdd.context.conf.contains("spark.default.parallelism")) {
    new HashPartitioner(rdd.context.defaultParallelism)
  } else {
    new HashPartitioner(bySize.head.partitions.size)
  }
}
Author: Chen Chao <crazyjvm@gmail.com>
Closes #403 from CrazyJvm/patch-4 and squashes the following commits:
42f6c9e [Chen Chao] fix format
829a995 [Chen Chao] fix format
1568336 [Chen Chao] misleading task number of groupByKey
|
| |
Author: Kan Zhang <kzhang@apache.org>
Closes #401 from kanzhang/fix-1475 and squashes the following commits:
c6058bd [Kan Zhang] Fixing a race condition in event listener unit test
|
| |
delete semicolon
Author: Chen Chao <crazyjvm@gmail.com>
Closes #411 from CrazyJvm/patch-5 and squashes the following commits:
72333a3 [Chen Chao] remove unnecessary brace
de5d9a7 [Chen Chao] style fix
|
| |
Each vertex partition is co-located with a pid2vid array created in RoutingTable.scala. This array maps edge partition IDs to the list of vertices in the current vertex partition that are mentioned by edges in that partition. Therefore the pid2vid array should have one entry per edge partition.
GraphX currently creates one entry per *vertex* partition, which is a bug that leads to an ArrayIndexOutOfBoundsException when there are more edge partitions than vertex partitions. This commit fixes the bug and adds a test for this case.
Resolves SPARK-1329. Thanks to Daniel Darabos for reporting this bug.
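Conceptually (this is an illustrative sketch, not the actual GraphX code), the routing table needs one bucket per edge partition:
```scala
import scala.collection.mutable.ArrayBuffer

// Size the lookup by the number of *edge* partitions so that indexing by an
// edge-partition id stays in bounds even when there are more edge partitions
// than vertex partitions. The numbers are made up for illustration.
val numEdgePartitions = 8
val pid2vid = Array.fill(numEdgePartitions)(ArrayBuffer.empty[Long])
pid2vid(5) += 42L  // edge partition 5 references vertex 42 from this vertex partition
```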
Author: Ankur Dave <ankurdave@gmail.com>
Closes #368 from ankurdave/fix-pid2vid-size and squashes the following commits:
5a5c52a [Ankur Dave] SPARK-1329: Create pid2vid with correct number of partitions
|
| |
GraphImpl.reverse used to reverse edges in each partition of the edge RDD but preserve the routing table and replicated vertex view, since reversing should not affect partitioning.
However, the old routing table would then have incorrect information for srcAttrOnly and dstAttrOnly. These RDDs should be switched.
A simple fix is for Graph.reverse to rebuild the routing table and replicated vertex view.
Thanks to Bogdan Ghidireac for reporting this issue on the [mailing list](http://apache-spark-user-list.1001560.n3.nabble.com/graph-reverse-amp-Pregel-API-td4338.html).
Author: Ankur Dave <ankurdave@gmail.com>
Closes #431 from ankurdave/fix-reverse-bug and squashes the following commits:
75d63cb [Ankur Dave] Rebuild routing table after Graph.reverse
|
| |
JIRA issue:[SPARK-1511](https://issues.apache.org/jira/browse/SPARK-1511)
The TestUtils.createCompiledClass method uses renameTo() to move files, which fails when the src and dest files are on different disks or partitions. This PR uses Files.move() instead. The move method will try renameTo() and then fall back to copy() and delete(), which should handle this issue.
I didn't find a test suite for this file, so I added a file-existence check after the move.
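A minimal sketch of the change, assuming Guava's `Files.move`; the helper name and assertion mirror the description above, not the exact Spark code.
```scala
import java.io.File
import com.google.common.io.Files

def moveCompiledClass(src: File, dest: File): Unit = {
  // Files.move tries a rename first and falls back to copy + delete,
  // which also works when src and dest are on different disks or partitions.
  Files.move(src, dest)
  assert(dest.exists(), s"Failed to move $src to $dest")
}
```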
Author: Ye Xianjin <advancedxy@gmail.com>
Closes #427 from advancedxy/SPARK-1511 and squashes the following commits:
a2b97c7 [Ye Xianjin] Based on @srowen's comment, assert file existence.
6f95550 [Ye Xianjin] use Files.move instead of renameTo to handle the src and dest files are in different disks or partitions.
|
| |
YARN-1824 changes the APIs (addToEnvironment, setEnvFromInputString) in Apps, which breaks the Spark build when built against version 2.4.0. To fix this, Spark provides its own implementation of that functionality, which does not break compilation against 2.3 and other 2.x versions.
Author: xuan <xuan@MacBook-Pro.local>
Author: xuan <xuan@macbook-pro.home>
Closes #396 from xgong/master and squashes the following commits:
42b5984 [xuan] Remove two extra imports
bc0926f [xuan] Remove usage of org.apache.hadoop.util.Shell
be89fa7 [xuan] fix Spark compilation is broken with the latest hadoop-2.4.0 release
|
| |
Scheduler mode should accept lower-case definitions and have nicer error messages.
There are two improvements to Scheduler Mode:
1. Made the built-in ones case-insensitive (fair/FAIR, fifo/FIFO).
2. If an invalid mode is given, we should print a better error message.
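A hedged sketch of the intended behaviour (not the exact Spark code; the enumeration and message text are illustrative):
```scala
object SchedulingMode extends Enumeration {
  val FAIR, FIFO, NONE = Value
}

// Accept "fair"/"FAIR"/"fifo"/"FIFO" and fail with a clearer message otherwise.
def parseSchedulingMode(mode: String): SchedulingMode.Value =
  try {
    SchedulingMode.withName(mode.toUpperCase)
  } catch {
    case _: NoSuchElementException =>
      throw new IllegalArgumentException(
        s"Unrecognized spark.scheduler.mode: $mode (valid values are FAIR and FIFO)")
  }
```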
Author: Sandeep <sandeep@techaddict.me>
Closes #388 from techaddict/1469 and squashes the following commits:
a31bbd5 [Sandeep] SPARK-1469: Scheduler mode should accept lower-case definitions and have nicer error messages There are two improvements to Scheduler Mode: 1. Made the built in ones case insensitive (fair/FAIR, fifo/FIFO). 2. If an invalid mode is given we should print a better error message.
|