aboutsummaryrefslogtreecommitdiff
path: root/project
Commit message (Collapse)AuthorAgeFilesLines
* Updating versions for 1.0.2 release.Tathagata Das2014-07-252-2/+2
|
* Updating versions for branch-1.0Patrick Wendell2014-07-122-2/+2
|
* Updating versions for 1.0.1 releasePatrick Wendell2014-06-262-2/+2
|
* [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQLYin Huai2014-06-171-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-2060 Programming guide: http://yhuai.github.io/site/sql-programming-guide.html Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext Author: Yin Huai <huai@cse.ohio-state.edu> Closes #999 from yhuai/newJson and squashes the following commits: 227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson ce8eedd [Yin Huai] rxin's comments. bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 94ffdaa [Yin Huai] Remove "get" from method names. ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 79ea9ba [Yin Huai] Fix typos. 5428451 [Yin Huai] Newline 1f908ce [Yin Huai] Remove extra line. d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 7ea750e [Yin Huai] marmbrus's comments. 6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 83013fb [Yin Huai] Update Java Example. e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map. 6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 4fbddf0 [Yin Huai] Programming guide. 9df8c5a [Yin Huai] Python API. 7027634 [Yin Huai] Java API. cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset. d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson ab810b0 [Yin Huai] Make JsonRDD private. 6df0891 [Yin Huai] Apache header. 8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema. 8ffed79 [Yin Huai] Update the example. a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution. 65b87f0 [Yin Huai] Fix sampling... 8846af5 [Yin Huai] API doc. 52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 0387523 [Yin Huai] Address PR comments. 666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson a2313a6 [Yin Huai] Address PR comments. f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used. 0576406 [Yin Huai] Add Apache license header. af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD. f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema. (cherry picked from commit d2f4f30b12f99358953e2781957468e2cfe3c916) Signed-off-by: Reynold Xin <rxin@apache.org>
* [SQL] Update SparkSQL and ScalaTest in branch-1.0 to match master.Michael Armbrust2014-06-131-11/+11
| | | | | | | | | | | | | #511 and #863 got left out of branch-1.0 since we were really close to the release. Now that they have been tested a little I see no reason to leave them out. Author: Michael Armbrust <michael@databricks.com> Author: witgo <witgo@qq.com> Closes #1078 from marmbrus/branch-1.0 and squashes the following commits: 22be674 [witgo] [SPARK-1841]: update scalatest to version 2.1.5 fc8fc79 [Michael Armbrust] Include #1071 as well. c5d0adf [Michael Armbrust] Update SparkSQL in branch-1.0 to match master.
* [SPARK-2030] Bump SparkBuild.scala version number of branch-1.0 to ↵Takuya UESHIN2014-06-051-1/+1
| | | | | | | | | | 1.0.1-SNAPSHOT. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #975 from ueshin/issues/SPARK-2030 and squashes the following commits: aae1044 [Takuya UESHIN] Bump version number to 1.0.1-SNAPSHOT.
* Add support for Pivotal HD in the Maven build: SPARK-1992tzolov2014-06-031-0/+1
| | | | | | | | | | | | | | | | Allow Spark to build against particular Pivotal HD distributions. For example to build Spark against Pivotal HD 2.0.1 one can run: ``` mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0-gphd-3.0.1.0 -DskipTests clean package ``` Author: tzolov <christian.tzolov@gmail.com> Closes #942 from tzolov/master and squashes the following commits: bc3e05a [tzolov] Add support for Pivotal HD in the Maven build and SBT build: [SPARK-1992] (cherry picked from commit b1f285359a12345d94ae59f979d9a75482e732ce) Signed-off-by: Matei Zaharia <matei@databricks.com>
* Optionally include Hive as a dependency of the REPL.Michael Armbrust2014-05-311-1/+3
| | | | | | | | | | | | | Due to the way spark-shell launches from an assembly jar, I don't think this change will affect anyone who isn't trying to launch the shell directly from sbt. That said, it is kinda nice to be able to launch all things directly from SBT when developing. Author: Michael Armbrust <michael@databricks.com> Closes #801 from marmbrus/hiveRepl and squashes the following commits: 9570571 [Michael Armbrust] Optionally include Hive as a dependency of the REPL. (cherry picked from commit 7463cd248f81975bce9ff864002932864bd5b649) Signed-off-by: Aaron Davidson <aaron@databricks.com>
* SPARK-1935: Explicitly add commons-codec 1.5 as a dependency.Yin Huai2014-05-291-0/+1
| | | | | | | | | | | Author: Yin Huai <huai@cse.ohio-state.edu> Closes #889 from yhuai/SPARK-1935 and squashes the following commits: 7d50ef1 [Yin Huai] Explicitly add commons-codec 1.5 as a dependency. (cherry picked from commit 60b89fe6b09ff896a30d74204876da883e307de7) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1875]NoClassDefFoundError: StringUtils when building with hadoop 1.x ↵witgo2014-05-191-2/+1
| | | | | | | | | | | | | | and hive Author: witgo <witgo@qq.com> Closes #824 from witgo/SPARK-1875_commons-lang-2.6 and squashes the following commits: ef7231d [witgo] review commit ead3c3b [witgo] SPARK-1875:NoClassDefFoundError: StringUtils when building against Hadoop 1 (cherry picked from commit 6a2c5c610c259f62cb12d8cfc18bf59cdb334bb2) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* fix different versions of commons-lang dependency and apache/spark#746 addendumwitgo2014-05-141-5/+7
| | | | | | | | | | | | | | | Author: witgo <witgo@qq.com> Closes #754 from witgo/commons-lang and squashes the following commits: 3ebab31 [witgo] merge master f3b8fa2 [witgo] merge master 2083fae [witgo] repeat definition 5599cdb [witgo] multiple version of sbt dependency c1b66a1 [witgo] fix different versions of commons-lang dependency (cherry picked from commit bae07e36a6e0fb7982405316646b452b4ff06acc) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Fix: sbt test throw an java.lang.OutOfMemoryError: PermGen spacewitgo2014-05-141-0/+1
| | | | | | | | | | | Author: witgo <witgo@qq.com> Closes #773 from witgo/sbt_javaOptions and squashes the following commits: 26c7d38 [witgo] Improve sbt configuration (cherry picked from commit fde82c1549c78f1eebbb21ec34e60befbbff65f5) Signed-off-by: Reynold Xin <rxin@apache.org>
* SPARK-1828: Created forked version of hive-exec that doesn't bundle other ↵Patrick Wendell2014-05-141-3/+3
| | | | | | | | | | | | | | | | | | dependencies See https://issues.apache.org/jira/browse/SPARK-1828 for more information. This is being submitted to Jenkin's for testing. The dependency won't fully propagate in Maven central for a few more hours. Author: Patrick Wendell <pwendell@gmail.com> Closes #767 from pwendell/hive-shaded and squashes the following commits: ea10ac5 [Patrick Wendell] SPARK-1828: Created forked version of hive-exec that doesn't bundle other dependencies (cherry picked from commit d58cb33ffa9e98a64cecea7b40ce7bfbed145079) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1786: Reopening PR 724Ankur Dave2014-05-121-3/+5
| | | | | | | | | | | | | | | | | | Addressing issue in MimaBuild.scala. Author: Ankur Dave <ankurdave@gmail.com> Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> Closes #742 from jegonzal/edge_partition_serialization and squashes the following commits: 8ba6e0d [Ankur Dave] Add concatenation operators to MimaBuild.scala cb2ed3a [Joseph E. Gonzalez] addressing missing exclusion in MimaBuild.scala 5d27824 [Ankur Dave] Disable reference tracking to fix serialization test c0a9ae5 [Ankur Dave] Add failing test for EdgePartition Kryo serialization a4a3faa [Joseph E. Gonzalez] Making EdgePartition serializable. (cherry picked from commit 0e2bde2030f8e455c5a269fc38d4ff05b395ca32) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1806: Upgrade Mesos dependency to 0.18.1Bernardo Gomez Palacio2014-05-121-1/+1
| | | | | | | | | | | | | | | | | | | | Enabled Mesos (0.18.1) dependency with shaded protobuf Why is this needed? Avoids any protobuf version collision between Mesos and any other dependency in Spark e.g. Hadoop HDFS 2.2+ or 1.0.4. Ticket: https://issues.apache.org/jira/browse/SPARK-1806 * Should close https://issues.apache.org/jira/browse/SPARK-1433 Author berngp Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com> Closes #741 from berngp/feature/SPARK-1806 and squashes the following commits: 5d70646 [Bernardo Gomez Palacio] SPARK-1806: Upgrade Mesos dependency to 0.18.1
* Revert "Enabled incremental build that comes with sbt 0.13.2"Patrick Wendell2014-05-101-1/+1
| | | | | | | This reverts commit 71ce7eb0e5878f0bafd64bdd201ae257a3bfe106. I meant only to merge this into master. It's an experimental build feature.
* Enabled incremental build that comes with sbt 0.13.2Prashant Sharma2014-05-101-1/+1
| | | | | | | | | | | | | More info at. https://github.com/sbt/sbt/issues/1010 Author: Prashant Sharma <prashant.s@imaginea.com> Closes #525 from ScrapCodes/sbt-inc-opt and squashes the following commits: ba8fa42 [Prashant Sharma] Enabled incremental build that comes with sbt 0.13.2 (cherry picked from commit 70bcdef48a051028598d380d41dfce1c9bfb2b9b) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1789. Multiple versions of Netty dependencies cause FlumeStreamSuite ↵Sean Owen2014-05-101-17/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | failure TL;DR is there is a bit of JAR hell trouble with Netty, that can be mostly resolved and will resolve a test failure. I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamingSuite, and have for a short while (is it just me?) velvia notes: "I have found a workaround. If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty." There are at least 3 versions of Netty in play in the build: - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6. - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue. The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final. But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile. If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation. So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict: - Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts - Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty - Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent - Update SBT build accordingly A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible. Author: Sean Owen <sowen@cloudera.com> Closes #723 from srowen/SPARK-1789 and squashes the following commits: 43661b7 [Sean Owen] Update and add Netty excludes to prevent some JAR conflicts that cause test issues (cherry picked from commit 2b7bd29eb6ee5baf739eec143044ecfc296b9b1f) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Unify GraphImpl RDDs + other graph load optimizationsAnkur Dave2014-05-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR makes the following changes, primarily in e4fbd329aef85fe2c38b0167255d2a712893d683: 1. *Unify RDDs to avoid zipPartitions.* A graph used to be four RDDs: vertices, edges, routing table, and triplet view. This commit merges them down to two: vertices (with routing table), and edges (with replicated vertices). 2. *Avoid duplicate shuffle in graph building.* We used to do two shuffles when building a graph: one to extract routing information from the edges and move it to the vertices, and another to find nonexistent vertices referred to by edges. With this commit, the latter is done as a side effect of the former. 3. *Avoid no-op shuffle when joins are fully eliminated.* This is a side effect of unifying the edges and the triplet view. 4. *Join elimination for mapTriplets.* 5. *Ship only the needed vertex attributes when upgrading the triplet view.* If the triplet view already contains source attributes, and we now need both attributes, only ship destination attributes rather than re-shipping both. This is done in `ReplicatedVertexView#upgrade`. Author: Ankur Dave <ankurdave@gmail.com> Closes #497 from ankurdave/unify-rdds and squashes the following commits: 332ab43 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into unify-rdds 4933e2e [Ankur Dave] Exclude RoutingTable from binary compatibility check 5ba8789 [Ankur Dave] Add GraphX upgrade guide from Spark 0.9.1 13ac845 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into unify-rdds a04765c [Ankur Dave] Remove unnecessary toOps call 57202e8 [Ankur Dave] Replace case with pair parameter 75af062 [Ankur Dave] Add explicit return types 04d3ae5 [Ankur Dave] Convert implicit parameter to context bound c88b269 [Ankur Dave] Revert upgradeIterator to if-in-a-loop 0d3584c [Ankur Dave] EdgePartition.size should be val 2a928b2 [Ankur Dave] Set locality wait 10b3596 [Ankur Dave] Clean up public API ae36110 [Ankur Dave] Fix style errors e4fbd32 [Ankur Dave] Unify GraphImpl RDDs + other graph load optimizations d6d60e2 [Ankur Dave] In GraphLoader, coalesce to minEdgePartitions 62c7b78 [Ankur Dave] In Analytics, take PageRank numIter d64e8d4 [Ankur Dave] Log current Pregel iteration (cherry picked from commit 905173df57b90f90ebafb22e43f55164445330e6) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SQL] Upgrade parquet library.Michael Armbrust2014-05-101-1/+1
| | | | | | | | | | | | | | | I think we are hitting this issue in some perf tests: https://github.com/Parquet/parquet-mr/commit/6aed5288fd4a1398063a5a219b2ae4a9f71b02cf Credit to @aarondav ! Author: Michael Armbrust <michael@databricks.com> Closes #684 from marmbrus/upgradeParquet and squashes the following commits: e10a619 [Michael Armbrust] Upgrade parquet library. (cherry picked from commit 4d6055329846f5e09472e5f844127a5ab5880e15) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1644] The org.datanucleus:* should not be packaged into ↵witgo2014-05-101-5/+6
| | | | | | | | | | | | | | | | spark-assembly-*.jar Author: witgo <witgo@qq.com> Closes #688 from witgo/SPARK-1644 and squashes the following commits: 56ad6ac [witgo] review commit 87c03e4 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1644 6ffa7e4 [witgo] review commit a597414 [witgo] The org.datanucleus:* should not be packaged into spark-assembly-*.jar (cherry picked from commit 561510867a1b79beef57acf9df65c9f88481435d) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations...Kan Zhang2014-05-071-0/+2
| | | | | | | | | | | | | | | ... that do not change schema Author: Kan Zhang <kzhang@apache.org> Closes #448 from kanzhang/SPARK-1460 and squashes the following commits: 111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD 91dc787 [Kan Zhang] Taking into account newly added Ordering param 79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema (cherry picked from commit 967635a2425a769b932eea0984fe697d6721cab0) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1549] Add Python support to spark-submitMatei Zaharia2014-05-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN. This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging. In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit. In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0. Author: Matei Zaharia <matei@databricks.com> Closes #664 from mateiz/py-submit and squashes the following commits: 15e9669 [Matei Zaharia] Fix some uses of path.separator property 051278c [Matei Zaharia] Small style fixes 0afe886 [Matei Zaharia] Add license headers 4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests 15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside 47c0655 [Matei Zaharia] More work to make spark-submit work with Python: d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones (cherry picked from commit 951a5d939863b42da83ac2569d5e9d7ed680e119) Signed-off-by: Matei Zaharia <matei@databricks.com>
* SPARK-1474: Spark on yarn assembly doesn't include AmIpFilterThomas Graves2014-05-061-1/+2
| | | | | | | | | | | | | | | We use org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter in spark on yarn but are not included it in the assembly jar. I tested this on yarn cluster by removing the yarn jars from the classpath and spark runs fine now. Author: Thomas Graves <tgraves@apache.org> Closes #406 from tgravescs/SPARK-1474 and squashes the following commits: 1548bf9 [Thomas Graves] SPARK-1474: Spark on yarn assembly doesn't include org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter (cherry picked from commit 1e829905c791fbf1dfd8e0c1caa62ead7354605e) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1556. jets3t dep doesn't update properly with newer Hadoop versionsSean Owen2014-05-051-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | See related discussion at https://github.com/apache/spark/pull/468 This PR may still overstep what you have in mind, but let me put it on the table to start. Besides fixing the issue, it has one substantive change, and that is to manage Hadoop-specific things only in Hadoop-related profiles. This does _not_ remove `yarn.version`. - Moves the YARN and Hadoop profiles together in pom.xml. Sorry that this makes the diff a little hard to grok but the changes are only as follows. - Removes `hadoop.major.version` - Introduce `hadoop-2.2` and `hadoop-2.3` profiles to control Hadoop-specific changes: - like the protobuf version issue - this was only 'solved' now by enabling YARN for 2.2+, which is really an orthogonal issue - like the jets3t version issue now - Hadoop profiles set an appropriate default `hadoop.version`, that can be overridden - _(YARN profiles in the parent now only exist to add the sub-module)_ - Fixes the jets3t dependency issue - and makes it a runtime dependency - and centralizes config of this guy in the parent pom - Updates build docs - Updates SBT build too - and fixes a regex problem along the way Author: Sean Owen <sowen@cloudera.com> Closes #629 from srowen/SPARK-1556 and squashes the following commits: c3fa967 [Sean Owen] Fix hadoop-2.4 profile typo in doc a2105fd [Sean Owen] Add hadoop-2.4 profile and don't set hadoop.version in profiles 274f4f9 [Sean Owen] Make jets3t a runtime dependency, and bring its exclusion up into parent config bbed826 [Sean Owen] Use jets3t 0.9.0 for Hadoop 2.3+ (and correct similar regex issue in SBT build) f21f356 [Sean Owen] Build changes to set up for jets3t fix (cherry picked from commit 73b0cbcc241cca3d318ff74340e80b02f884acbd) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1629. Addendum: Depend on commons lang3 (already used by tachyon) as ↵Sean Owen2014-05-041-0/+1
| | | | | | | | | | | | | | | it's used in ReplSuite, and return to use lang3 utility in Utils.scala For consideration. This was proposed in related discussion: https://github.com/apache/spark/pull/569 Author: Sean Owen <sowen@cloudera.com> Closes #635 from srowen/SPARK-1629.2 and squashes the following commits: a442b98 [Sean Owen] Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala (cherry picked from commit f5041579ff573f988b673c2506fa4edc32f5ad84) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1636][MLLIB] Move main methods to examplesXiangrui Meng2014-04-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * `NaiveBayes` -> `SparseNaiveBayes` * `KMeans` -> `DenseKMeans` * `SVMWithSGD` and `LogisticRegerssionWithSGD` -> `BinaryClassification` * `ALS` -> `MovieLensALS` * `LinearRegressionWithSGD`, `LassoWithSGD`, and `RidgeRegressionWithSGD` -> `LinearRegression` * `DecisionTree` -> `DecisionTreeRunner` `scopt` is used for parsing command-line parameters. `scopt` has MIT license and it only depends on `scala-library`. Example help message: ~~~ BinaryClassification: an example app for binary classification. Usage: BinaryClassification [options] <input> --numIterations <value> number of iterations --stepSize <value> initial step size, default: 1.0 --algorithm <value> algorithm (SVM,LR), default: LR --regType <value> regularization type (L1,L2), default: L2 --regParam <value> regularization parameter, default: 0.1 <input> input paths to labeled examples in LIBSVM format ~~~ Author: Xiangrui Meng <meng@databricks.com> Closes #584 from mengxr/mllib-main and squashes the following commits: 7b58c60 [Xiangrui Meng] minor 6e35d7e [Xiangrui Meng] make imports explicit and fix code style c6178c9 [Xiangrui Meng] update TS PCA/SVD to use new spark-submit 6acff75 [Xiangrui Meng] use scopt for DecisionTreeRunner be86069 [Xiangrui Meng] use main instead of extending App b3edf68 [Xiangrui Meng] move DecisionTree's main method to examples 8bfaa5a [Xiangrui Meng] change NaiveBayesParams to Params fe23dcb [Xiangrui Meng] remove main from KMeans and add DenseKMeans as an example 67f4448 [Xiangrui Meng] remove main methods from linear regression algorithms and add LinearRegression example b066bbc [Xiangrui Meng] remove main from ALS and add MovieLensALS example b040f3b [Xiangrui Meng] change BinaryClassificationParams to Params 577945b [Xiangrui Meng] remove unused imports from NB 3d299bc [Xiangrui Meng] remove main from LR/SVM and add an example app for binary classification f70878e [Xiangrui Meng] remove main from NaiveBayes and add an example NaiveBayes app 01ec2cd [Xiangrui Meng] Merge branch 'master' into mllib-main 9420692 [Xiangrui Meng] add scopt to examples dependencies (cherry picked from commit 3f38334f441940ed0a5bbf5588ca7f22d3940359) Signed-off-by: Reynold Xin <rxin@apache.org>
* Improved build configurationwitgo2014-04-281-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1, Fix SPARK-1441: compile spark core error with hadoop 0.23.x 2, Fix SPARK-1491: maven hadoop-provided profile fails to build 3, Fix org.scala-lang: * ,org.apache.avro:* inconsistent versions dependency 4, A modified on the sql/catalyst/pom.xml,sql/hive/pom.xml,sql/core/pom.xml (Four spaces formatted into two spaces) Author: witgo <witgo@qq.com> Closes #480 from witgo/format_pom and squashes the following commits: 03f652f [witgo] review commit b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence 7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom 6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence 0da4bc3 [witgo] merge master d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom e345919 [witgo] add avro dependency to yarn-alpha 77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom 62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency 1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom 934f24d [witgo] review commit cf46edc [witgo] exclude jruby 06e7328 [witgo] Merge branch 'SparkBuild' into format_pom 99464d2 [witgo] fix maven hadoop-provided profile fails to build 0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x 6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml (cherry picked from commit 030f2c2126d5075576cd6d83a1ee7462c48b953b) Conflicts: sql/catalyst/pom.xml sql/core/pom.xml sql/hive/pom.xml
* Update the import package name for TestHive in sbt shellCheng Hao2014-04-271-1/+1
| | | | | | | | | | | | | sbt/sbt hive/console will fail as TestHive changed its package from "org.apache.spark.sql.hive" to "org.apache.spark.sql.hive.test". Author: Cheng Hao <hao.cheng@intel.com> Closes #574 from chenghao-intel/hive_console and squashes the following commits: de14035 [Cheng Hao] Update the import package name for TestHive in sbt shell (cherry picked from commit ea01affc34a8b816f92a71e0f8f6901397c8a4a3) Signed-off-by: Reynold Xin <rxin@apache.org>
* SPARK-1621 Upgrade Chill to 0.3.6Matei Zaharia2014-04-251-3/+3
| | | | | | | | | | | | | It registers more Scala classes, including things like Ranges that we had to register manually before. See https://github.com/twitter/chill/releases for Chill's change log. Author: Matei Zaharia <matei@databricks.com> Closes #543 from mateiz/chill-0.3.6 and squashes the following commits: a1dc5e0 [Matei Zaharia] Upgrade Chill to 0.3.6 and remove our special registration of Ranges (cherry picked from commit a24d918c71f6ac4adbe3ae363ef69f4658118938) Signed-off-by: Matei Zaharia <matei@databricks.com>
* SPARK-1584: Upgrade Flume dependency to 1.4.0tmalaska2014-04-241-1/+2
| | | | | | | | | | | | | | | Updated the Flume dependency in the maven pom file and the scala build file. Author: tmalaska <ted.malaska@cloudera.com> Closes #507 from tmalaska/master and squashes the following commits: 79492c8 [tmalaska] excluded all thrift 159c3f1 [tmalaska] fixed the flume pom file issues 5bf56a7 [tmalaska] Upgrade flume version (cherry picked from commit d5c6ae6cc3305b9aa3185486b5b6ba0a6e5aca90) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1119 and other build improvementsPatrick Wendell2014-04-231-1/+3
| | | | | | | | | | | | | | | 1. Makes assembly and examples jar naming consistent in maven/sbt. 2. Updates make-distribution.sh to use Maven and fixes some bugs. 3. Updates the create-release script to call make-distribution script. Author: Patrick Wendell <pwendell@gmail.com> Closes #502 from pwendell/make-distribution and squashes the following commits: 1a97f0d [Patrick Wendell] SPARK-1119 and other build improvements (cherry picked from commit cd4ed293262e2349794c13467d1737974385c019) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1562 Fix visibility / annotation of Spark SQL APIsMichael Armbrust2014-04-221-2/+2
| | | | | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #489 from marmbrus/sqlDocFixes and squashes the following commits: acee4f3 [Michael Armbrust] Fix visibility / annotation of Spark SQL APIs (cherry picked from commit aa77f8a6a604efe0d02bc8412b3f1ba3903b7a57) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1560]: Updated Pyrolite Dependency to be Java 6 compatibleAhir Reddy2014-04-221-1/+1
| | | | | | | | | | | | | | Changed the Pyrolite dependency to a build which targets Java 6. Author: Ahir Reddy <ahirreddy@gmail.com> Closes #479 from ahirreddy/java6-pyrolite and squashes the following commits: 8ea25d3 [Ahir Reddy] Updated maven build to use java 6 compatible pyrolite dabc703 [Ahir Reddy] Updated Pyrolite dependency to be Java 6 compatible (cherry picked from commit 0f87e6ad4366a8c453a7415bc89399030003c264) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1439, SPARK-1440] Generate unified Scaladoc across projects and JavadocsMatei Zaharia2014-04-212-12/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | I used the sbt-unidoc plugin (https://github.com/sbt/sbt-unidoc) to create a unified Scaladoc of our public packages, and generate Javadocs as well. One limitation is that I haven't found an easy way to exclude packages in the Javadoc; there is a SBT task that identifies Java sources to run javadoc on, but it's been very difficult to modify it from outside to change what is set in the unidoc package. Some SBT-savvy people should help with this. The Javadoc site also lacks package-level descriptions and things like that, so we may want to look into that. We may decide not to post these right now if it's too limited compared to the Scala one. Example of the built doc site: http://people.csail.mit.edu/matei/spark-unified-docs/ Author: Matei Zaharia <matei@databricks.com> This patch had conflicts when merged, resolved by Committer: Patrick Wendell <pwendell@gmail.com> Closes #457 from mateiz/better-docs and squashes the following commits: a63d4a3 [Matei Zaharia] Skip Java/Scala API docs for Python package 5ea1f43 [Matei Zaharia] Fix links to Java classes in Java guide, fix some JS for scrolling to anchors on page load f05abc0 [Matei Zaharia] Don't include java.lang package names 995e992 [Matei Zaharia] Skip internal packages and class names with $ in JavaDoc a14a93c [Matei Zaharia] typo 76ce64d [Matei Zaharia] Add groups to Javadoc index page, and a first package-info.java ed6f994 [Matei Zaharia] Generate JavaDoc as well, add titles, update doc site to use unified docs acb993d [Matei Zaharia] Add Unidoc plugin for the projects we want Unidoced (cherry picked from commit fc7838470465474f777bd17791c1bb5f9c348521) Conflicts: project/SparkBuild.scala
* [SPARK-1332] Improve Spark Streaming's Network Receiver and InputDStream API ↵Tathagata Das2014-04-211-14/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [WIP] The current Network Receiver API makes it slightly complicated to right a new receiver as one needs to create an instance of BlockGenerator as shown in SocketReceiver https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/SocketInputDStream.scala#L51 Exposing the BlockGenerator interface has made it harder to improve the receiving process. The API of NetworkReceiver (which was not a very stable API anyways) needs to be change if we are to ensure future stability. Additionally, the functions like streamingContext.socketStream that create input streams, return DStream objects. That makes it hard to expose functionality (say, rate limits) unique to input dstreams. They should return InputDStream or NetworkInputDStream. This is still not yet implemented. This PR is blocked on the graceful shutdown PR #247 Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #300 from tdas/network-receiver-api and squashes the following commits: ea27b38 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into network-receiver-api 3a4777c [Tathagata Das] Renamed NetworkInputDStream to ReceiverInputDStream, and ActorReceiver related stuff. 838dd39 [Tathagata Das] Added more events to the StreamingListener to report errors and stopped receivers. a75c7a6 [Tathagata Das] Address some PR comments and fixed other issues. 91bfa72 [Tathagata Das] Fixed bugs. 8533094 [Tathagata Das] Scala style fixes. 028bde6 [Tathagata Das] Further refactored receiver to allow restarting of a receiver. 43f5290 [Tathagata Das] Made functions that create input streams return InputDStream and NetworkInputDStream, for both Scala and Java. 2c94579 [Tathagata Das] Fixed graceful shutdown by removing interrupts on receiving thread. 9e37a0b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into network-receiver-api 3223e95 [Tathagata Das] Refactored the code that runs the NetworkReceiver into further classes and traits to make them more testable. a36cc48 [Tathagata Das] Refactored the NetworkReceiver API for future stability. (cherry picked from commit 04c37b6f749dc2418cc28c89964cdc687dfcbd51) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Updating versions for Spark 1.0Patrick Wendell2014-04-212-2/+2
|
* [SPARK-1520] remove fastutil from dependenciesXiangrui Meng2014-04-181-2/+2
| | | | | | | | | | | | | | | | | A quick fix for https://issues.apache.org/jira/browse/SPARK-1520 By excluding fastutil, we bring the number of files in the assembly jar back under 65536, so Java 7 won't create the assembly jar in zip64 format, which cannot be read by Java 6. With this change, the assembly jar now has about 60000 entries (58000 files), tested with both sbt and maven. Author: Xiangrui Meng <meng@databricks.com> Closes #437 from mengxr/remove-fastutil and squashes the following commits: 00f9beb [Xiangrui Meng] remove fastutil from dependencies (cherry picked from commit aa17f022c59af02b04b977da9017671ef14d664a) Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-959] Updated SBT from 0.13.1 to 0.13.2Cheng Lian2014-04-162-3/+1
| | | | | | | | | | | | | | | JIRA issue: [SPARK-959](https://spark-project.atlassian.net/browse/SPARK-959) SBT 0.13.2 has been officially released. This version updated Ivy 2.0 to Ivy 2.3, which fixes [IVY-899](https://issues.apache.org/jira/browse/IVY-899). This PR also removed previous workaround. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #426 from liancheng/updateSbt and squashes the following commits: 95e3dc8 [Cheng Lian] Updated SBT from 0.13.1 to 0.13.2 to fix SPARK-959 (cherry picked from commit 6a10d801626f1513b1b349b54ba0e2e6bf55c7e2) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [FIX] update sbt-idea to version 1.6.0Xiangrui Meng2014-04-151-1/+1
| | | | | | | | | | | | | I saw `No "scala-library*.jar" in Scala compiler library` error in IDEA. It seems upgrading `sbt-idea` to 1.6.0 fixed the problem. Author: Xiangrui Meng <meng@databricks.com> Closes #419 from mengxr/idea-plugin and squashes the following commits: fb3c35f [Xiangrui Meng] update sbt-idea to version 1.6.0 (cherry picked from commit 8517911efb89aade61c8b8c54fee216dae9a4b4f) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1374: PySpark API for SparkSQLAhir Reddy2014-04-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports sql queries. ``` from pyspark.context import SQLContext sqlCtx = SQLContext(sc) rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}]) srdd = sqlCtx.applySchema(rdd) sqlCtx.registerRDDAsTable(srdd, "table1") srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1") srdd2.collect() ``` The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]``` Author: Ahir Reddy <ahirreddy@gmail.com> Author: Michael Armbrust <michael@databricks.com> Closes #363 from ahirreddy/pysql and squashes the following commits: 0294497 [Ahir Reddy] Updated log4j properties to supress Hive Warns 307d6e0 [Ahir Reddy] Style fix 6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies 3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py 29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD f2312c7 [Ahir Reddy] Moved everything into sql.py a19afe4 [Ahir Reddy] Doc fixes 6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL 521ff6d [Ahir Reddy] Trying to get spark to build with hive ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins ded03e7 [Ahir Reddy] Added doc test for HiveContext 22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency e4da06c [Ahir Reddy] Display message if hive is not built into spark 227a0be [Michael Armbrust] Update API links. Fix Hive example. 58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api. Minor fixes. 4285340 [Michael Armbrust] Fix building of Hive API Docs. 38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs. 337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build 40491c9 [Ahir Reddy] PR Changes + Method Visibility 1836944 [Michael Armbrust] Fix comments. e00980f [Michael Armbrust] First draft of python sql programming guide. b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test f98a422 [Ahir Reddy] HiveContexts 79621cf [Ahir Reddy] cleaning up cruft b406ba0 [Ahir Reddy] doctest formatting 20936a5 [Ahir Reddy] Added tests and documentation e4d21b4 [Ahir Reddy] Added pyrolite dependency 79f739d [Ahir Reddy] added more tests 7515ba0 [Ahir Reddy] added more tests :) d26ec5e [Ahir Reddy] added test e9f5b8d [Ahir Reddy] adding tests 906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python 251f99d [Ahir Reddy] for now only allow dictionaries as input 09b9980 [Ahir Reddy] made jrdd explicitly lazy c608947 [Ahir Reddy] SchemaRDD now has all RDD operations 725c91e [Ahir Reddy] awesome row objects 55d1c76 [Ahir Reddy] return row objects 4fe1319 [Ahir Reddy] output dictionaries correctly be079de [Ahir Reddy] returning dictionaries works cd5f79f [Ahir Reddy] Switched to using Scala SQLContext e948bd9 [Ahir Reddy] yippie 4886052 [Ahir Reddy] even better c0fb1c6 [Ahir Reddy] more working 043ca85 [Ahir Reddy] working 5496f9f [Ahir Reddy] doesn't crash b8b904b [Ahir Reddy] Added schema rdd class 67ba875 [Ahir Reddy] java to python, and python to java bcc0f23 [Ahir Reddy] Java to python ab6025d [Ahir Reddy] compiling (cherry picked from commit c99bcb7feaa761c5826f2e1d844d0502a3b79538) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1488. Resolve scalac feature warnings during buildSean Owen2014-04-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For your consideration: scalac currently notes a number of feature warnings during compilation: ``` [warn] there were 65 feature warning(s); re-run with -feature for details ``` Warnings are like: ``` [warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled [warn] by making the implicit value scala.language.implicitConversions visible. [warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions' [warn] or by setting the compiler option -language:implicitConversions. [warn] See the Scala docs for value scala.language.implicitConversions for a discussion [warn] why the feature should be explicitly enabled. [warn] implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) = [warn] ^ ``` scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used. This PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build. Author: Sean Owen <sowen@cloudera.com> Closes #404 from srowen/SPARK-1488 and squashes the following commits: 8598980 [Sean Owen] Quiet scalac warnings about language features by explicitly importing language features. 39bc831 [Sean Owen] Enable -feature in scalac to emit language feature warnings (cherry picked from commit 0247b5c5467ca1b0d03ba929a78fa4d805582d84) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1386] Web UI for Spark StreamingTathagata Das2014-04-111-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When debugging Spark Streaming applications it is necessary to monitor certain metrics that are not shown in the Spark application UI. For example, what is average processing time of batches? What is the scheduling delay? Is the system able to process as fast as it is receiving data? How many records I am receiving through my receivers? While the StreamingListener interface introduced in the 0.9 provided some of this information, it could only be accessed programmatically. A UI that shows information specific to the streaming applications is necessary for easier debugging. This PR introduces such a UI. It shows various statistics related to the streaming application. Here is a screenshot of the UI running on my local machine. http://i.imgur.com/1ooDGhm.png This UI is integrated into the Spark UI running at 4040. Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Andrew Or <andrewor14@gmail.com> Closes #290 from tdas/streaming-web-ui and squashes the following commits: fc73ca5 [Tathagata Das] Merge pull request #9 from andrewor14/ui-refactor 642dd88 [Andrew Or] Merge SparkUISuite.scala into UISuite.scala eb30517 [Andrew Or] Merge github.com:apache/spark into ui-refactor f4f4cbe [Tathagata Das] More minor fixes. 34bb364 [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui 252c566 [Tathagata Das] Merge pull request #8 from andrewor14/ui-refactor e038b4b [Tathagata Das] Addressed Patrick's comments. 125a054 [Andrew Or] Disable serving static resources with gzip 90feb8d [Andrew Or] Address Patrick's comments 89dae36 [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui 72fe256 [Tathagata Das] Merge pull request #6 from andrewor14/ui-refactor 2fc09c8 [Tathagata Das] Added binary check exclusions aa396d4 [Andrew Or] Rename tabs and pages (No more IndexPage.scala) f8e1053 [Tathagata Das] Added Spark and Streaming UI unit tests. caa5e05 [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui 585cd65 [Tathagata Das] Merge pull request #5 from andrewor14/ui-refactor 914b8ff [Tathagata Das] Moved utils functions to UIUtils. 548c98c [Andrew Or] Wide refactoring of WebUI, UITab, and UIPage (see commit message) 6de06b0 [Tathagata Das] Merge remote-tracking branch 'apache/master' into streaming-web-ui ee6543f [Tathagata Das] Minor changes based on Andrew's comments. fa760fe [Tathagata Das] Fixed long line. 1c0bcef [Tathagata Das] Refactored streaming UI into two files. 1af239b [Tathagata Das] Changed streaming UI to attach itself as a tab with the Spark UI. 827e81a [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui 168fe86 [Tathagata Das] Merge pull request #2 from andrewor14/ui-refactor 3e986f8 [Tathagata Das] Merge remote-tracking branch 'apache/master' into streaming-web-ui c78c92d [Andrew Or] Remove outdated comment 8f7323b [Andrew Or] End of file new lines, indentation, and imports (minor) 0d61ee8 [Andrew Or] Merge branch 'streaming-web-ui' of github.com:tdas/spark into ui-refactor 9a48fa1 [Andrew Or] Allow adding tabs to SparkUI dynamically + add example 61358e3 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-web-ui 53be2c5 [Tathagata Das] Minor style updates. ed25dfc [Andrew Or] Generalize SparkUI header to display tabs dynamically a37ad4f [Andrew Or] Comments, imports and formatting (minor) cd000b0 [Andrew Or] Merge github.com:apache/spark into ui-refactor 7d57444 [Andrew Or] Refactoring the UI interface to add flexibility aef4dd5 [Tathagata Das] Added Apache licenses. db27bad [Tathagata Das] Added last batch processing time to StreamingUI. 4d86e98 [Tathagata Das] Added basic stats to the StreamingUI and refactored the UI to a Page to make it easier to transition to using SparkUI later. 93f1c69 [Tathagata Das] Added network receiver information to the Streaming UI. 56cc7fb [Tathagata Das] First cut implementation of Streaming UI.
* SPARK-1057 (alternative) Remove fastutilSean Owen2014-04-111-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (This is for discussion at this point -- I'm not suggesting this should be committed.) This is what removing fastutil looks like. Much of it is straightforward, like using `java.io` buffered stream classes, and Guava for murmurhash3. Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case though do I think the change to use `java.io` actually entails an extra array copy. The rest is using `OpenHashMap` and `OpenHashSet`. These are now written in terms of more scala-like operations. `OpenHashMap` is where I made three non-trivial changes to make it work, and they need review: - It is no longer private - The key must be a `ClassTag` - Unless a lot of other code changes, the key type can't enforce being a supertype of `Null` It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective. But what about those last changes? Author: Sean Owen <sowen@cloudera.com> Closes #266 from srowen/SPARK-1057-alternate and squashes the following commits: 2601129 [Sean Owen] Fix Map return type error not previously caught ec65502 [Sean Owen] Updates from matei's review 00bc81e [Sean Owen] Remove use of fastutil and replace with use of java.io, spark.util and Guava classes (cherry picked from commit 165e06a74c3d75e6b7341c120943add8b035b96a) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Revert "SPARK-1433: Upgrade Mesos dependency to 0.17.0"Patrick Wendell2014-04-101-1/+1
| | | | This reverts commit 12c077d5aa0b76a808a55db625c9677a52bd43f9.
* Spark-939: allow user jars to take precedence over spark jarsHolden Karau2014-04-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I still need to do a small bit of re-factoring [mostly the one Java file I'll switch it back to a Scala file and use it in both the close loaders], but comments on other things I should do would be great. Author: Holden Karau <holden@pigscanfly.ca> Closes #217 from holdenk/spark-939-allow-user-jars-to-take-precedence-over-spark-jars and squashes the following commits: cf0cac9 [Holden Karau] Fix the executorclassloader 1955232 [Holden Karau] Fix long line in TestUtils 8f89965 [Holden Karau] Fix tests for new class name 7546549 [Holden Karau] CR feedback, merge some of the testutils methods down, rename the classloader 644719f [Holden Karau] User the class generator for the repl class loader tests too f0b7114 [Holden Karau] Fix the core/src/test/scala/org/apache/spark/executor/ExecutorURLClassLoaderSuite.scala tests 204b199 [Holden Karau] Fix the generated classes 9f68f10 [Holden Karau] Start rewriting the ExecutorURLClassLoaderSuite to not use the hard coded classes 858aba2 [Holden Karau] Remove a bunch of test junk 261aaee [Holden Karau] simplify executorurlclassloader a bit 7a7bf5f [Holden Karau] CR feedback d4ae848 [Holden Karau] rewrite component into scala aa95083 [Holden Karau] CR feedback 7752594 [Holden Karau] re-add https comment a0ef85a [Holden Karau] Fix style issues 125ea7f [Holden Karau] Easier to just remove those files, we don't need them bb8d179 [Holden Karau] Fix issues with the repl class loader 241b03d [Holden Karau] fix my rat excludes a343350 [Holden Karau] Update rat-excludes and remove a useless file d90d217 [Holden Karau] Fix fall back with custom class loader and add a test for it 4919bf9 [Holden Karau] Fix parent calling class loader issue 8a67302 [Holden Karau] Test are good 9e2d236 [Holden Karau] It works comrade 691ee00 [Holden Karau] It works ish dc4fe44 [Holden Karau] Does not depend on being in my home directory 47046ff [Holden Karau] Remove bad import' 22d83cb [Holden Karau] Add a test suite for the executor url class loader suite 7ef4628 [Holden Karau] Clean up 792d961 [Holden Karau] Almost works 16aecd1 [Holden Karau] Doesn't quite work 8d2241e [Holden Karau] Adda FakeClass for testing ClassLoader precedence options 648b559 [Holden Karau] Both class loaders compile. Now for testing e1d9f71 [Holden Karau] One loader workers.
* SPARK-1433: Upgrade Mesos dependency to 0.17.0Sandeep2014-04-081-1/+1
| | | | | | | | | | | Mesos 0.13.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0 Author: Sandeep <sandeep@techaddict.me> Closes #355 from techaddict/mesos_update and squashes the following commits: f1abeee [Sandeep] SPARK-1433: Upgrade Mesos dependency to 0.17.0 Mesos 0.13.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0
* [SPARK-1331] Added graceful shutdown to Spark StreamingTathagata Das2014-04-081-11/+13
| | | | | | | | | | | | | | | | | | | | | Current version of StreamingContext.stop() directly kills all the data receivers (NetworkReceiver) without waiting for the data already received to be persisted and processed. This PR provides the fix. Now, when the StreamingContext.stop() is called, the following sequence of steps will happen. 1. The driver will send a stop signal to all the active receivers. 2. Each receiver, when it gets a stop signal from the driver, first stop receiving more data, then waits for the thread that persists data blocks to BlockManager to finish persisting all receive data, and finally quits. 3. After all the receivers have stopped, the driver will wait for the Job Generator and Job Scheduler to finish processing all the received data. It also fixes the semantics of StreamingContext.start and stop. It will throw appropriate errors and warnings if stop() is called before start(), stop() is called twice, etc. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #247 from tdas/graceful-shutdown and squashes the following commits: 61c0016 [Tathagata Das] Updated MIMA binary check excludes. ae1d39b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into graceful-shutdown 6b59cfc [Tathagata Das] Minor changes based on Andrew's comment on PR. d0b8d65 [Tathagata Das] Reduced time taken by graceful shutdown unit test. f55bc67 [Tathagata Das] Fix scalastyle c69b3a7 [Tathagata Das] Updates based on Patrick's comments. c43b8ae [Tathagata Das] Added graceful shutdown to Spark Streaming.
* [SQL] SPARK-1371 Hash Aggregation ImprovementsMichael Armbrust2014-04-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Given: ```scala case class Data(a: Int, b: Int) val rdd = sparkContext .parallelize(1 to 200) .flatMap(_ => (1 to 50000).map(i => Data(i % 100, i))) rdd.registerAsTable("data") cacheTable("data") ``` Before: ``` SELECT COUNT(*) FROM data:[10000000] 16795.567ms SELECT a, SUM(b) FROM data GROUP BY a 7536.436ms SELECT SUM(b) FROM data 10954.1ms ``` After: ``` SELECT COUNT(*) FROM data:[10000000] 1372.175ms SELECT a, SUM(b) FROM data GROUP BY a 2070.446ms SELECT SUM(b) FROM data 958.969ms ``` Author: Michael Armbrust <michael@databricks.com> Closes #295 from marmbrus/hashAgg and squashes the following commits: ec63575 [Michael Armbrust] Add comment. d0495a9 [Michael Armbrust] Use scaladoc instead. b4a6887 [Michael Armbrust] Address review comments. a2d90ba [Michael Armbrust] Capture child output statically to avoid issues with generators and serialization. 7c13112 [Michael Armbrust] Rewrite Aggregate operator to stream input and use projections. Remove unused local RDD functions implicits. 5096f99 [Michael Armbrust] Make HiveUDAF fields transient since object inspectors are not serializable. 6a4b671 [Michael Armbrust] Add option to avoid binding operators expressions automatically. 92cca08 [Michael Armbrust] Always include serialization debug info when running tests. 1279df2 [Michael Armbrust] Increase default number of partitions.
* SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packagingAaron Davidson2014-04-061-9/+16
| | | | | | | | | | | | | | | | | | | | | | | | Previously, we based our decision regarding including datanucleus jars based on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" is run. This means that a typical and previously supported pathway would start using hive jars. This patch has the following features/bug fixes: - Use of SPARK_HIVE (default false) to determine if we should include Hive in the assembly jar. - Analagous feature in Maven with -Phive (previously, there was no support for adding Hive to any of our jars produced by Maven) - assemble-deps fixed since we no longer use a different ASSEMBLY_DIR - avoid adding log message in compute-classpath.sh to the classpath :) Still TODO before mergeable: - We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set similar to how sbt downloads itself. - Spark SQL documentation updates. Author: Aaron Davidson <aaron@databricks.com> Closes #237 from aarondav/master and squashes the following commits: 5dc4329 [Aaron Davidson] Typo fixes dd4f298 [Aaron Davidson] Doc update dd1a365 [Aaron Davidson] Eliminate need for SPARK_HIVE at runtime by d/ling datanucleus from Maven a9269b5 [Aaron Davidson] [WIP] Use SPARK_HIVE to determine if we include Hive in packaging