path: root/project/SparkBuild.scala
Commit message · Author · Age · Files · Lines
...
* [SPARK-2467] Revert SparkBuild to publish-local to both .m2 and .ivy2. (Takuya UESHIN, 2014-07-14, 1 file, -1/+13)
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1398 from ueshin/issues/SPARK-2467 and squashes the following commits: 7f01d58 [Takuya UESHIN] Revert SparkBuild to publish-local to both .m2 and .ivy2.
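Since this entry is about sbt publishing behavior, here is a minimal sketch of how publishing to both local caches can be wired in an sbt 0.13 build. The key name `publishLocalBoth` is hypothetical (SparkBuild.scala wires this differently); `publishM2` is the built-in task added in sbt 0.13.1.

```scala
import sbt._
import Keys._

object LocalPublish {
  // Hypothetical key name, for illustration only.
  val publishLocalBoth = taskKey[Unit]("Publish artifacts to both ~/.m2 and ~/.ivy2")

  lazy val settings = Seq(
    publishLocalBoth := {
      publishM2.value    // install into the local Maven cache (~/.m2/repository)
      publishLocal.value // install into the local Ivy cache (~/.ivy2/local)
    }
  )
}
```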
* [SPARK-2437] Rename MAVEN_PROFILES to SBT_MAVEN_PROFILES and add SBT_MAVEN_PROPERTIES (Prashant Sharma, 2014-07-11, 1 file, -2/+7)
NOTE: It is not possible to use both the env variable `SBT_MAVEN_PROFILES` and the `-P` flag at the same time; `-P`, if specified, takes precedence.
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #1374 from ScrapCodes/SPARK-2437/rename-MAVEN_PROFILES and squashes the following commits: 8694bde [Prashant Sharma] [SPARK-2437] Rename MAVEN_PROFILES to SBT_MAVEN_PROFILES and add SBT_MAVEN_PROPERTIES
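A hedged sketch of the precedence rule in that NOTE (the function and variable names are illustrative, not the actual SparkBuild.scala code):

```scala
// Sketch: profiles from -P flags win outright; SBT_MAVEN_PROFILES is only
// consulted when no -P flag was given on the sbt command line.
def resolveProfiles(args: Seq[String]): Seq[String] = {
  val flagProfiles = args.filter(_.startsWith("-P")).map(_.stripPrefix("-P"))
  if (flagProfiles.nonEmpty) flagProfiles
  else sys.env.get("SBT_MAVEN_PROFILES")
    .map(_.split("[, ]+").filter(_.nonEmpty).toSeq)
    .getOrElse(Seq.empty)
}
```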
* [SPARK-1776] Have Spark's SBT build read dependencies from Maven. (Prashant Sharma, 2014-07-10, 1 file, -579/+201)
This patch introduces the new way of working while retaining the existing ways of doing things. For example, the yarn build instruction in Maven is `mvn -Pyarn -Phadoop-2.2 clean package -DskipTests`; in sbt it can become `MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`. Also supports `sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`.
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>
Closes #772 from ScrapCodes/sbt-maven and squashes the following commits: a8ac951 [Prashant Sharma] Updated sbt version. 62b09bb [Prashant Sharma] Improvements. fa6221d [Prashant Sharma] Excluding sql from mima 4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default. 72651ca [Prashant Sharma] Addresses code review comments. acab73d [Prashant Sharma] Revert "Small fix to run-examples script." ac4312c [Prashant Sharma] Revert "minor fix" 6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit. 65cf06c [Prashant Sharma] Servlet API jars mess up with the other servlet jars on the class path. 446768e [Prashant Sharma] minor fix 89b9777 [Prashant Sharma] Merge conflicts d0a02f2 [Prashant Sharma] Bumped up pom versions; since the build now depends on pom it is better updated there. + general cleanups. dccc8ac [Prashant Sharma] updated mima to check against 1.0 a49c61b [Prashant Sharma] Fix for tools jar a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies. cf88758 [Prashant Sharma] cleanup 9439ea3 [Prashant Sharma] Small fix to run-examples script. 96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven. 36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins. 4973dbd [Patrick Wendell] Example build using pom reader.
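The mechanism behind this change, sketched under the assumption that the sbt-pom-reader plugin is on the build classpath: the build object extends the plugin's `PomBuild`, so modules and dependencies come from the pom.xml files, with sbt-side settings layered on top (the extra setting shown is illustrative).

```scala
import sbt._
import Keys._
import com.typesafe.sbt.pom.PomBuild // sbt-pom-reader plugin

object SparkBuild extends PomBuild {
  // Derive sbt projects from the Maven POMs, then layer extra settings on top.
  override def projectDefinitions(baseDirectory: File): Seq[Project] =
    super.projectDefinitions(baseDirectory).map { project =>
      project.settings(scalacOptions += "-deprecation") // illustrative setting
    }
}
```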
* SPARK-2332 [build] Add exclusion for old servlet-api on hadoop-client in core (Peter MacKinnon, 2014-07-01, 1 file, -1/+1)
Fix for a class of test suite failures in Jenkins.
Author: Peter MacKinnon <pmackinn@redhat.com>
Closes #1271 from pdmack/master and squashes the following commits: cfe59fd [Peter MacKinnon] exclude servlet-api in hadoop-client for sbt 6f39fec [Peter MacKinnon] add exclusion for old servlet-api on hadoop-client in core
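In sbt's DSL (build.sbt style) such an exclusion looks roughly like the sketch below; the Hadoop version and the excluded organization are placeholders, not necessarily what the commit pinned:

```scala
// Sketch: keep the legacy servlet-api pulled in transitively by hadoop-client
// off the classpath, where it clashes with newer servlet jars.
libraryDependencies += ("org.apache.hadoop" % "hadoop-client" % "1.0.4")
  .excludeAll(ExclusionRule(organization = "javax.servlet"))
```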
* Fix some tests. (Marcelo Vanzin, 2014-06-20, 1 file, -4/+6)
- JavaAPISuite was trying to compare a bare path with a URI. Fix by extracting the path from the URI, since we know it should be a local path anyway.
- b9be1609 excluded the ASM dependency everywhere, but easymock needs it (because cglib needs it). So re-add the dependency, with test scope this time.
The second one above actually uncovered a weird situation: the maven test target works, even though I can't find the class sbt complains about in its classpath. sbt complains with: [error] Uncaught exception when running org.apache.spark.util.random.RandomSamplerSuite: java.lang.NoClassDefFoundError: org/objectweb/asm/Type To avoid more weirdness caused by that, I explicitly added the asm dependency to both maven and sbt (for tests only), and verified the classes don't end up in the final assembly.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #917 from vanzin/flaky-tests and squashes the following commits: d022320 [Marcelo Vanzin] Fix some tests.
* [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL (Yin Huai, 2014-06-17, 1 file, -4/+18)
JIRA: https://issues.apache.org/jira/browse/SPARK-2060
Programming guide: http://yhuai.github.io/site/sql-programming-guide.html
Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #999 from yhuai/newJson and squashes the following commits: 227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson ce8eedd [Yin Huai] rxin's comments. bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 94ffdaa [Yin Huai] Remove "get" from method names. ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 79ea9ba [Yin Huai] Fix typos. 5428451 [Yin Huai] Newline 1f908ce [Yin Huai] Remove extra line. d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 7ea750e [Yin Huai] marmbrus's comments. 6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 83013fb [Yin Huai] Update Java Example. e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map. 6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 4fbddf0 [Yin Huai] Programming guide. 9df8c5a [Yin Huai] Python API. 7027634 [Yin Huai] Java API. cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset. d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson ab810b0 [Yin Huai] Make JsonRDD private. 6df0891 [Yin Huai] Apache header. 8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema. 8ffed79 [Yin Huai] Update the example. a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution. 65b87f0 [Yin Huai] Fix sampling... 8846af5 [Yin Huai] API doc. 52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 0387523 [Yin Huai] Address PR comments. 666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson a2313a6 [Yin Huai] Address PR comments. f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used. 0576406 [Yin Huai] Add Apache license header. af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or an RDD[String] with one JSON object per string) and return a SchemaRDD. f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
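A short usage sketch of the feature this adds, following the linked programming guide; the file path is a placeholder and `sc` is assumed to be an existing SparkContext:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Load a JSON dataset (one JSON object per line) and infer its schema.
val people = sqlContext.jsonFile("examples/people.json")
people.registerAsTable("people")
sqlContext.sql("SELECT * FROM people").collect().foreach(println)
```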
* SPARK-1939 Refactor takeSample method in RDD to use ScaSRS (Doris Xin, 2014-06-12, 1 file, -0/+1)
Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes a sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate the choice of sampling rate.
Author: Doris Xin <doris.s.xin@gmail.com>
Author: dorx <doris.s.xin@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #916 from dorx/takeSample and squashes the following commits: 5b061ae [Doris Xin] merge master 444e750 [Doris Xin] edge cases 3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939 82dde31 [Xiangrui Meng] update pyspark's takeSample 48d954d [Doris Xin] remove unused imports from RDDSuite fb1452f [Doris Xin] allowing num to be greater than count in all cases 1481b01 [Doris Xin] washing test tubes and making coffee dc699f3 [Doris Xin] give back imports removed by accident in rdd.py 64e445b [Doris Xin] log warning as soon as it enters the while loop 55518ed [Doris Xin] added TODO for logging in rdd.py eff89e2 [Doris Xin] addressed reviewer comments. ecab508 [Doris Xin] fixed checkstyle violation 0a9b3e3 [Doris Xin] reviewer comment addressed f80f270 [Doris Xin] Merge branch 'master' into takeSample ae3ad04 [Doris Xin] fixed edge cases to prevent overflow 065ebcd [Doris Xin] Merge branch 'master' into takeSample 9bdd36e [Doris Xin] Check sample size and move computeFraction e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample 7cab53a [Doris Xin] fixed import bug in rdd.py ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD 1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
* SPARK-1843: Replace assemble-deps with env variable. (Patrick Wendell, 2014-06-12, 1 file, -4/+12)
(This change is actually small; I moved some logic into compute-classpath that was previously in spark-class.) Assemble-deps has existed for a while to allow developers to run local code with new changes quickly. When I'm developing I typically use a simpler approach which just prepends the Spark classes to the classpath before the assembly jar. This is well defined in the JVM and the Spark classes take precedence over those in the assembly. This approach is portable across both builds, which is the main reason I'd like to switch to it. It's also a bit easier to toggle on and off quickly. The way you use this is the following:
```
$ ./bin/spark-shell # Use spark with the normal assembly
$ export SPARK_PREPEND_CLASSES=true
$ ./bin/spark-shell # Now it's using compiled classes
$ unset SPARK_PREPEND_CLASSES
$ ./bin/spark-shell # Back to normal
```
Author: Patrick Wendell <pwendell@gmail.com>
Closes #877 from pwendell/assemble-deps and squashes the following commits: 8a11345 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into assemble-deps faa3168 [Patrick Wendell] Adding a warning for compatibility 3f151a7 [Patrick Wendell] Small fix bbfb73c [Patrick Wendell] Review feedback 328e9f8 [Patrick Wendell] SPARK-1843: Replace assemble-deps with env variable.
* [SPARK-2069] MIMA false positives (Prashant Sharma, 2014-06-11, 1 file, -0/+15)
Fixes SPARK-2070 and SPARK-2071.
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #1021 from ScrapCodes/SPARK-2070/package-private-methods and squashes the following commits: 7979a57 [Prashant Sharma] addressed code review comments 558546d [Prashant Sharma] A little fancy error message. 59275ab [Prashant Sharma] SPARK-2071 Mima ignores classes and their members from previous versions too. 0c4ff2b [Prashant Sharma] SPARK-2070 Ignore methods along with annotated classes.
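For context, MiMa false positives are typically silenced with exclusion rules like the following sketch; the filtered class and method names are illustrative, not the ones this commit added:

```scala
import com.typesafe.tools.mima.core._

// Sketch: suppress binary-compatibility reports for members that are not
// really public API, e.g. package-private methods the checker cannot see.
object MimaExcludes {
  val excludes = Seq(
    ProblemFilters.exclude[MissingMethodProblem](
      "org.apache.spark.util.SomeInternalClass.someMethod"))
}
```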
* [SPARK-1841]: update scalatest to version 2.1.5 (witgo, 2014-06-06, 1 file, -11/+11)
Author: witgo <witgo@qq.com>
Closes #713 from witgo/scalatest and squashes the following commits: b627a6a [witgo] merge master 51fb3d6 [witgo] merge master 3771474 [witgo] fix RDDSuite 996d6f9 [witgo] fix TimeStampedWeakValueHashMap test 9dfa4e7 [witgo] merge bug 1479b22 [witgo] merge master 29b9194 [witgo] fix code style 022a7a2 [witgo] fix test dependency a52c0fa [witgo] fix test dependency cd8f59d [witgo] Merge branch 'master' of https://github.com/apache/spark into scalatest 046540d [witgo] fix RDDSuite.scala 2c543b9 [witgo] fix ReplSuite.scala c458928 [witgo] update scalatest to version 2.1.5
* Remove compile-scoped junit dependency. (Marcelo Vanzin, 2014-06-05, 1 file, -1/+2)
This avoids having junit classes showing up in the assembly jar. I verified that only test classes in the jtransforms package use junit.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #794 from vanzin/junit-dep-exclusion and squashes the following commits: 274e1c2 [Marcelo Vanzin] Remove junit from assembly in sbt build also. ad950be [Marcelo Vanzin] Remove compile-scoped junit dependency.
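On the sbt side, the idiom is simply test scoping, sketched below (the version number is illustrative); a `"test"`-scoped dependency never reaches the assembly jar:

```scala
// Sketch: a test-only dependency stays out of the published/assembled artifact.
libraryDependencies += "junit" % "junit" % "4.10" % "test"
```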
* SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog. (Reynold Xin, 2014-06-03, 1 file, -1/+1)
I also corrected some errors made in the previous HLL count approximate API, including that relativeSD wasn't really a measure for error (and we used it to test error bounds in test results).
Author: Reynold Xin <rxin@apache.org>
Closes #897 from rxin/hll and squashes the following commits: 4d83f41 [Reynold Xin] New error bound and non-randomness. f154ea0 [Reynold Xin] Added a comment on the value bound for testing. e367527 [Reynold Xin] One more round of code review. 41e649a [Reynold Xin] Update final mima list. 9e320c8 [Reynold Xin] Incorporate code review feedback. e110d70 [Reynold Xin] Merge branch 'master' into hll 354deb8 [Reynold Xin] Added comment on the Mima exclude rules. acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes. 6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes. 1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check. 9221b27 [Reynold Xin] Merge branch 'master' into hll 88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility. 1294be6 [Reynold Xin] Updated HLL+. e7786cb [Reynold Xin] Merge branch 'master' into hll c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
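A usage sketch of the replacement class from stream-lib 2.7.0; the precision arguments here are illustrative, not Spark's defaults:

```scala
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

// Sketch: HyperLogLog++ with 2^4 registers (p = 4), sparse mode off (sp = 0).
val hll = new HyperLogLogPlus(4, 0)
Seq("a", "b", "a", "c").foreach(hll.offer)
println(hll.cardinality()) // approximate number of distinct items (~3)
```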
* Add support for Pivotal HD in the Maven build: SPARK-1992 (tzolov, 2014-06-03, 1 file, -0/+1)
Allow Spark to build against particular Pivotal HD distributions. For example, to build Spark against Pivotal HD 2.0.1 one can run:
```
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0-gphd-3.0.1.0 -DskipTests clean package
```
Author: tzolov <christian.tzolov@gmail.com>
Closes #942 from tzolov/master and squashes the following commits: bc3e05a [tzolov] Add support for Pivotal HD in the Maven build and SBT build: [SPARK-1992]
* Optionally include Hive as a dependency of the REPL. (Michael Armbrust, 2014-05-31, 1 file, -1/+3)
Due to the way spark-shell launches from an assembly jar, I don't think this change will affect anyone who isn't trying to launch the shell directly from sbt. That said, it is kinda nice to be able to launch all things directly from SBT when developing.
Author: Michael Armbrust <michael@databricks.com>
Closes #801 from marmbrus/hiveRepl and squashes the following commits: 9570571 [Michael Armbrust] Optionally include Hive as a dependency of the REPL.
* [SPARK-1971] Update MIMA to compare against Spark 1.0.0 (Prashant Sharma, 2014-05-30, 1 file, -3/+4)
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #910 from ScrapCodes/enable-mima/spark-core and squashes the following commits: 79f3687 [Prashant Sharma] updated Mima to check against version 1.0 1e8969c [Prashant Sharma] Spark core missed out on Mima settings. So in effect we never tested spark core for mima related errors.
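With the MiMa sbt plugin, pointing the check at a released baseline is roughly the following setting (shown build.sbt-style; the coordinates are sketched, not copied from the commit):

```scala
import com.typesafe.tools.mima.plugin.MimaKeys.previousArtifact

// Sketch: compare the current build's binary API against the 1.0.0 release.
previousArtifact := Some("org.apache.spark" % "spark-core_2.10" % "1.0.0")
```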
* SPARK-1935: Explicitly add commons-codec 1.5 as a dependency. (Yin Huai, 2014-05-29, 1 file, -0/+1)
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #889 from yhuai/SPARK-1935 and squashes the following commits: 7d50ef1 [Yin Huai] Explicitly add commons-codec 1.5 as a dependency.
* [SPARK-1875] NoClassDefFoundError: StringUtils when building with hadoop 1.x and hive (witgo, 2014-05-19, 1 file, -2/+1)
Author: witgo <witgo@qq.com>
Closes #824 from witgo/SPARK-1875_commons-lang-2.6 and squashes the following commits: ef7231d [witgo] review commit ead3c3b [witgo] SPARK-1875: NoClassDefFoundError: StringUtils when building against Hadoop 1
* Fix different versions of commons-lang dependency and apache/spark#746 addendum (witgo, 2014-05-14, 1 file, -5/+7)
Author: witgo <witgo@qq.com>
Closes #754 from witgo/commons-lang and squashes the following commits: 3ebab31 [witgo] merge master f3b8fa2 [witgo] merge master 2083fae [witgo] repeat definition 5599cdb [witgo] multiple version of sbt dependency c1b66a1 [witgo] fix different versions of commons-lang dependency
* Fix: sbt test throws java.lang.OutOfMemoryError: PermGen space (witgo, 2014-05-14, 1 file, -0/+1)
Author: witgo <witgo@qq.com>
Closes #773 from witgo/sbt_javaOptions and squashes the following commits: 26c7d38 [witgo] Improve sbt configuration
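The usual fix for this, sketched for an sbt 0.13 build on a pre-Java 8 JVM (the sizes are guesses, not necessarily the values this commit chose):

```scala
// Sketch: fork test JVMs and give them a larger PermGen, since Scala test
// suites generate many classes and exhaust the default PermGen quickly.
fork in Test := true
javaOptions in Test ++= Seq("-XX:MaxPermSize=512m", "-Xmx2g")
```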
* SPARK-1828: Created forked version of hive-exec that doesn't bundle other dependencies (Patrick Wendell, 2014-05-14, 1 file, -3/+3)
See https://issues.apache.org/jira/browse/SPARK-1828 for more information. This is being submitted to Jenkins for testing. The dependency won't fully propagate in Maven central for a few more hours.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #767 from pwendell/hive-shaded and squashes the following commits: ea10ac5 [Patrick Wendell] SPARK-1828: Created forked version of hive-exec that doesn't bundle other dependencies
* SPARK-1806: Upgrade Mesos dependency to 0.18.1 (Bernardo Gomez Palacio, 2014-05-12, 1 file, -1/+1)
Enabled Mesos (0.18.1) dependency with shaded protobuf. Why is this needed? Avoids any protobuf version collision between Mesos and any other dependency in Spark, e.g. Hadoop HDFS 2.2+ or 1.0.4. Ticket: https://issues.apache.org/jira/browse/SPARK-1806 * Should close https://issues.apache.org/jira/browse/SPARK-1433
Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
Closes #741 from berngp/feature/SPARK-1806 and squashes the following commits: 5d70646 [Bernardo Gomez Palacio] SPARK-1806: Upgrade Mesos dependency to 0.18.1
* Enabled incremental build that comes with sbt 0.13.2 (Prashant Sharma, 2014-05-10, 1 file, -1/+1)
More info at https://github.com/sbt/sbt/issues/1010.
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #525 from ScrapCodes/sbt-inc-opt and squashes the following commits: ba8fa42 [Prashant Sharma] Enabled incremental build that comes with sbt 0.13.2
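Name hashing, the improved incremental compilation in sbt 0.13.2, is opt-in; enabling it in a build definition is a one-liner, sketched here:

```scala
// Sketch: turn on sbt 0.13.2's name-hashing incremental compiler, which
// invalidates fewer downstream sources when a change doesn't affect used names.
incOptions := incOptions.value.withNameHashing(true)
```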
* SPARK-1789. Multiple versions of Netty dependencies cause FlumeStreamSuite failure (Sean Owen, 2014-05-10, 1 file, -17/+18)
TL;DR: there is a bit of JAR hell trouble with Netty that can be mostly resolved, and resolving it fixes a test failure. I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamingSuite, and have for a short while (is it just me?). velvia notes: "I have found a workaround. If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty." There are at least 3 versions of Netty in play in the build:
- the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem
- the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
- but, Spark Core directly uses io.netty:netty-all:4.0.17.Final
The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue. The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final. But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile. If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation. So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict:
- Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts
- Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty
- Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent
- Update SBT build accordingly
A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible.
Author: Sean Owen <sowen@cloudera.com>
Closes #723 from srowen/SPARK-1789 and squashes the following commits: 43661b7 [Sean Owen] Update and add Netty excludes to prevent some JAR conflicts that cause test issues
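The pattern the fix relies on, sketched in sbt syntax for one of the affected modules (the Flume coordinates come from the text above; the exclusion placement is illustrative):

```scala
// Sketch: stop flume-ng-sdk from dragging in the conflicting io.netty:netty
// 3.4.0, leaving io.netty:netty-all 4.x (used by Spark core) in place.
libraryDependencies += ("org.apache.flume" % "flume-ng-sdk" % "1.4.0")
  .exclude("io.netty", "netty")
```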
* [SQL] Upgrade parquet library. (Michael Armbrust, 2014-05-10, 1 file, -1/+1)
I think we are hitting this issue in some perf tests: https://github.com/Parquet/parquet-mr/commit/6aed5288fd4a1398063a5a219b2ae4a9f71b02cf Credit to @aarondav!
Author: Michael Armbrust <michael@databricks.com>
Closes #684 from marmbrus/upgradeParquet and squashes the following commits: e10a619 [Michael Armbrust] Upgrade parquet library.
* [SPARK-1644] The org.datanucleus:* should not be packaged into spark-assembly-*.jar (witgo, 2014-05-10, 1 file, -5/+6)
Author: witgo <witgo@qq.com>
Closes #688 from witgo/SPARK-1644 and squashes the following commits: 56ad6ac [witgo] review commit 87c03e4 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1644 6ffa7e4 [witgo] review commit a597414 [witgo] The org.datanucleus:* should not be packaged into spark-assembly-*.jar
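With the sbt-assembly plugin this kind of rule can be written as a classpath filter; a sketch against the plugin's 2014-era API (the filename predicate is illustrative):

```scala
import sbtassembly.Plugin.AssemblyKeys._

// Sketch: keep the datanucleus jars out of the fat jar; for Hive support they
// are shipped alongside the assembly instead of inside it.
excludedJars in assembly := {
  (fullClasspath in assembly).value
    .filter(_.data.getName.startsWith("datanucleus-"))
}
```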
* [SPARK-1549] Add Python support to spark-submit (Matei Zaharia, 2014-05-06, 1 file, -2/+2)
This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN. This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging. In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit. In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.
Author: Matei Zaharia <matei@databricks.com>
Closes #664 from mateiz/py-submit and squashes the following commits: 15e9669 [Matei Zaharia] Fix some uses of path.separator property 051278c [Matei Zaharia] Small style fixes 0afe886 [Matei Zaharia] Add license headers 4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests 15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside 47c0655 [Matei Zaharia] More work to make spark-submit work with Python: d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones
* SPARK-1474: Spark on yarn assembly doesn't include AmIpFilter (Thomas Graves, 2014-05-06, 1 file, -1/+2)
We use org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter in spark on yarn but do not include it in the assembly jar. I tested this on a yarn cluster by removing the yarn jars from the classpath, and spark runs fine now.
Author: Thomas Graves <tgraves@apache.org>
Closes #406 from tgravescs/SPARK-1474 and squashes the following commits: 1548bf9 [Thomas Graves] SPARK-1474: Spark on yarn assembly doesn't include org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
* SPARK-1556. jets3t dep doesn't update properly with newer Hadoop versions (Sean Owen, 2014-05-05, 1 file, -2/+3)
See related discussion at https://github.com/apache/spark/pull/468 This PR may still overstep what you have in mind, but let me put it on the table to start. Besides fixing the issue, it has one substantive change, and that is to manage Hadoop-specific things only in Hadoop-related profiles. This does _not_ remove `yarn.version`.
- Moves the YARN and Hadoop profiles together in pom.xml. Sorry that this makes the diff a little hard to grok but the changes are only as follows.
- Removes `hadoop.major.version`
- Introduce `hadoop-2.2` and `hadoop-2.3` profiles to control Hadoop-specific changes:
  - like the protobuf version issue - this was only 'solved' now by enabling YARN for 2.2+, which is really an orthogonal issue
  - like the jets3t version issue now
- Hadoop profiles set an appropriate default `hadoop.version`, that can be overridden
- _(YARN profiles in the parent now only exist to add the sub-module)_
- Fixes the jets3t dependency issue
  - and makes it a runtime dependency
  - and centralizes config of this guy in the parent pom
- Updates build docs
- Updates SBT build too
  - and fixes a regex problem along the way
Author: Sean Owen <sowen@cloudera.com>
Closes #629 from srowen/SPARK-1556 and squashes the following commits: c3fa967 [Sean Owen] Fix hadoop-2.4 profile typo in doc a2105fd [Sean Owen] Add hadoop-2.4 profile and don't set hadoop.version in profiles 274f4f9 [Sean Owen] Make jets3t a runtime dependency, and bring its exclusion up into parent config bbed826 [Sean Owen] Use jets3t 0.9.0 for Hadoop 2.3+ (and correct similar regex issue in SBT build) f21f356 [Sean Owen] Build changes to set up for jets3t fix
* SPARK-1629. Addendum: Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala (Sean Owen, 2014-05-04, 1 file, -0/+1)
For consideration. This was proposed in related discussion: https://github.com/apache/spark/pull/569
Author: Sean Owen <sowen@cloudera.com>
Closes #635 from srowen/SPARK-1629.2 and squashes the following commits: a442b98 [Sean Owen] Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala
* [SPARK-1636][MLLIB] Move main methods to examples (Xiangrui Meng, 2014-04-29, 1 file, -1/+2)
* `NaiveBayes` -> `SparseNaiveBayes`
* `KMeans` -> `DenseKMeans`
* `SVMWithSGD` and `LogisticRegressionWithSGD` -> `BinaryClassification`
* `ALS` -> `MovieLensALS`
* `LinearRegressionWithSGD`, `LassoWithSGD`, and `RidgeRegressionWithSGD` -> `LinearRegression`
* `DecisionTree` -> `DecisionTreeRunner`
`scopt` is used for parsing command-line parameters. `scopt` has an MIT license and it only depends on `scala-library`. Example help message:
~~~
BinaryClassification: an example app for binary classification.
Usage: BinaryClassification [options] <input>
  --numIterations <value>  number of iterations
  --stepSize <value>       initial step size, default: 1.0
  --algorithm <value>      algorithm (SVM,LR), default: LR
  --regType <value>        regularization type (L1,L2), default: L2
  --regParam <value>       regularization parameter, default: 0.1
  <input>                  input paths to labeled examples in LIBSVM format
~~~
Author: Xiangrui Meng <meng@databricks.com>
Closes #584 from mengxr/mllib-main and squashes the following commits: 7b58c60 [Xiangrui Meng] minor 6e35d7e [Xiangrui Meng] make imports explicit and fix code style c6178c9 [Xiangrui Meng] update TS PCA/SVD to use new spark-submit 6acff75 [Xiangrui Meng] use scopt for DecisionTreeRunner be86069 [Xiangrui Meng] use main instead of extending App b3edf68 [Xiangrui Meng] move DecisionTree's main method to examples 8bfaa5a [Xiangrui Meng] change NaiveBayesParams to Params fe23dcb [Xiangrui Meng] remove main from KMeans and add DenseKMeans as an example 67f4448 [Xiangrui Meng] remove main methods from linear regression algorithms and add LinearRegression example b066bbc [Xiangrui Meng] remove main from ALS and add MovieLensALS example b040f3b [Xiangrui Meng] change BinaryClassificationParams to Params 577945b [Xiangrui Meng] remove unused imports from NB 3d299bc [Xiangrui Meng] remove main from LR/SVM and add an example app for binary classification f70878e [Xiangrui Meng] remove main from NaiveBayes and add an example NaiveBayes app 01ec2cd [Xiangrui Meng] Merge branch 'master' into mllib-main 9420692 [Xiangrui Meng] add scopt to examples dependencies
* Improved build configuration (witgo, 2014-04-28, 1 file, -4/+5)
1. Fix SPARK-1441: compile spark core error with hadoop 0.23.x
2. Fix SPARK-1491: maven hadoop-provided profile fails to build
3. Fix org.scala-lang:* and org.apache.avro:* inconsistent dependency versions
4. Modified sql/catalyst/pom.xml, sql/hive/pom.xml, and sql/core/pom.xml (four-space indentation reformatted to two spaces)
Author: witgo <witgo@qq.com>
Closes #480 from witgo/format_pom and squashes the following commits: 03f652f [witgo] review commit b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence 7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom 6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence 0da4bc3 [witgo] merge master d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom e345919 [witgo] add avro dependency to yarn-alpha 77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom 62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency 1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom 934f24d [witgo] review commit cf46edc [witgo] exclude jruby 06e7328 [witgo] Merge branch 'SparkBuild' into format_pom 99464d2 [witgo] fix maven hadoop-provided profile fails to build 0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x 6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
* Update the import package name for TestHive in sbt shell (Cheng Hao, 2014-04-27, 1 file, -1/+1)
sbt/sbt hive/console will fail as TestHive changed its package from "org.apache.spark.sql.hive" to "org.apache.spark.sql.hive.test".
Author: Cheng Hao <hao.cheng@intel.com>
Closes #574 from chenghao-intel/hive_console and squashes the following commits: de14035 [Cheng Hao] Update the import package name for TestHive in sbt shell
* SPARK-1621 Upgrade Chill to 0.3.6 (Matei Zaharia, 2014-04-25, 1 file, -3/+3)
It registers more Scala classes, including things like Ranges that we had to register manually before. See https://github.com/twitter/chill/releases for Chill's change log.
Author: Matei Zaharia <matei@databricks.com>
Closes #543 from mateiz/chill-0.3.6 and squashes the following commits: a1dc5e0 [Matei Zaharia] Upgrade Chill to 0.3.6 and remove our special registration of Ranges
* SPARK-1584: Upgrade Flume dependency to 1.4.0 (tmalaska, 2014-04-24, 1 file, -1/+2)
Updated the Flume dependency in the Maven pom file and the Scala build file.
Author: tmalaska <ted.malaska@cloudera.com>
Closes #507 from tmalaska/master and squashes the following commits: 79492c8 [tmalaska] excluded all thrift 159c3f1 [tmalaska] fixed the flume pom file issues 5bf56a7 [tmalaska] Upgrade flume version
* SPARK-1119 and other build improvements (Patrick Wendell, 2014-04-23, 1 file, -1/+3)
1. Makes assembly and examples jar naming consistent in maven/sbt.
2. Updates make-distribution.sh to use Maven and fixes some bugs.
3. Updates the create-release script to call make-distribution script.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #502 from pwendell/make-distribution and squashes the following commits: 1a97f0d [Patrick Wendell] SPARK-1119 and other build improvements
* SPARK-1562 Fix visibility / annotation of Spark SQL APIs (Michael Armbrust, 2014-04-22, 1 file, -2/+2)
Author: Michael Armbrust <michael@databricks.com>
Closes #489 from marmbrus/sqlDocFixes and squashes the following commits: acee4f3 [Michael Armbrust] Fix visibility / annotation of Spark SQL APIs
* [SPARK-1560]: Updated Pyrolite Dependency to be Java 6 compatible (Ahir Reddy, 2014-04-22, 1 file, -1/+1)
Changed the Pyrolite dependency to a build which targets Java 6.
Author: Ahir Reddy <ahirreddy@gmail.com>
Closes #479 from ahirreddy/java6-pyrolite and squashes the following commits: 8ea25d3 [Ahir Reddy] Updated maven build to use java 6 compatible pyrolite dabc703 [Ahir Reddy] Updated Pyrolite dependency to be Java 6 compatible
* [SPARK-1439, SPARK-1440] Generate unified Scaladoc across projects and Javadocs (Matei Zaharia, 2014-04-21, 1 file, -12/+62)
I used the sbt-unidoc plugin (https://github.com/sbt/sbt-unidoc) to create a unified Scaladoc of our public packages, and generate Javadocs as well. One limitation is that I haven't found an easy way to exclude packages in the Javadoc; there is an SBT task that identifies Java sources to run javadoc on, but it's been very difficult to modify it from outside to change what is set in the unidoc package. Some SBT-savvy people should help with this. The Javadoc site also lacks package-level descriptions and things like that, so we may want to look into that. We may decide not to post these right now if it's too limited compared to the Scala one. Example of the built doc site: http://people.csail.mit.edu/matei/spark-unified-docs/
Author: Matei Zaharia <matei@databricks.com>
This patch had conflicts when merged, resolved by Committer: Patrick Wendell <pwendell@gmail.com>
Closes #457 from mateiz/better-docs and squashes the following commits: a63d4a3 [Matei Zaharia] Skip Java/Scala API docs for Python package 5ea1f43 [Matei Zaharia] Fix links to Java classes in Java guide, fix some JS for scrolling to anchors on page load f05abc0 [Matei Zaharia] Don't include java.lang package names 995e992 [Matei Zaharia] Skip internal packages and class names with $ in JavaDoc a14a93c [Matei Zaharia] typo 76ce64d [Matei Zaharia] Add groups to Javadoc index page, and a first package-info.java ed6f994 [Matei Zaharia] Generate JavaDoc as well, add titles, update doc site to use unified docs acb993d [Matei Zaharia] Add Unidoc plugin for the projects we want Unidoced
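A sketch of the sbt-unidoc wiring this describes, with illustrative project names: aggregate Scaladoc across all modules while filtering out the ones that shouldn't be documented.

```scala
import sbtunidoc.Plugin._
import UnidocKeys._

// Sketch: one unified Scaladoc over every module except a few internal ones.
lazy val docSettings = unidocSettings ++ Seq(
  unidocProjectFilter in (ScalaUnidoc, unidoc) :=
    inAnyProject -- inProjects(repl, examples) // `repl`, `examples` illustrative
)
```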
* [SPARK-1520] remove fastutil from dependencies (Xiangrui Meng, 2014-04-18, 1 file, -2/+2)
A quick fix for https://issues.apache.org/jira/browse/SPARK-1520
By excluding fastutil, we bring the number of files in the assembly jar back under 65536, so Java 7 won't create the assembly jar in zip64 format, which cannot be read by Java 6. With this change, the assembly jar now has about 60000 entries (58000 files), tested with both sbt and maven.
Author: Xiangrui Meng <meng@databricks.com>
Closes #437 from mengxr/remove-fastutil and squashes the following commits: 00f9beb [Xiangrui Meng] remove fastutil from dependencies
* [SPARK-959] Updated SBT from 0.13.1 to 0.13.2 (Cheng Lian, 2014-04-16, 1 file, -2/+0)
JIRA issue: [SPARK-959](https://spark-project.atlassian.net/browse/SPARK-959) SBT 0.13.2 has been officially released. This version updated Ivy 2.0 to Ivy 2.3, which fixes [IVY-899](https://issues.apache.org/jira/browse/IVY-899). This PR also removed the previous workaround.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #426 from liancheng/updateSbt and squashes the following commits: 95e3dc8 [Cheng Lian] Updated SBT from 0.13.1 to 0.13.2 to fix SPARK-959
* SPARK-1374: PySpark API for SparkSQL (Ahir Reddy, 2014-04-15, 1 file, -1/+2)
An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string), can be converted into a SchemaRDD that supports sql queries.
```
from pyspark.context import SQLContext
sqlCtx = SQLContext(sc)
rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
srdd = sqlCtx.applySchema(rdd)
sqlCtx.registerRDDAsTable(srdd, "table1")
srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
srdd2.collect()
```
The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]```
Author: Ahir Reddy <ahirreddy@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #363 from ahirreddy/pysql and squashes the following commits: 0294497 [Ahir Reddy] Updated log4j properties to suppress Hive Warns 307d6e0 [Ahir Reddy] Style fix 6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies 3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py 29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD f2312c7 [Ahir Reddy] Moved everything into sql.py a19afe4 [Ahir Reddy] Doc fixes 6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL 521ff6d [Ahir Reddy] Trying to get spark to build with hive ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins ded03e7 [Ahir Reddy] Added doc test for HiveContext 22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency e4da06c [Ahir Reddy] Display message if hive is not built into spark 227a0be [Michael Armbrust] Update API links. Fix Hive example. 58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api. Minor fixes. 4285340 [Michael Armbrust] Fix building of Hive API Docs. 38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs. 337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build 40491c9 [Ahir Reddy] PR Changes + Method Visibility 1836944 [Michael Armbrust] Fix comments. e00980f [Michael Armbrust] First draft of python sql programming guide. b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test f98a422 [Ahir Reddy] HiveContexts 79621cf [Ahir Reddy] cleaning up cruft b406ba0 [Ahir Reddy] doctest formatting 20936a5 [Ahir Reddy] Added tests and documentation e4d21b4 [Ahir Reddy] Added pyrolite dependency 79f739d [Ahir Reddy] added more tests 7515ba0 [Ahir Reddy] added more tests :) d26ec5e [Ahir Reddy] added test e9f5b8d [Ahir Reddy] adding tests 906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python 251f99d [Ahir Reddy] for now only allow dictionaries as input 09b9980 [Ahir Reddy] made jrdd explicitly lazy c608947 [Ahir Reddy] SchemaRDD now has all RDD operations 725c91e [Ahir Reddy] awesome row objects 55d1c76 [Ahir Reddy] return row objects 4fe1319 [Ahir Reddy] output dictionaries correctly be079de [Ahir Reddy] returning dictionaries works cd5f79f [Ahir Reddy] Switched to using Scala SQLContext e948bd9 [Ahir Reddy] yippie 4886052 [Ahir Reddy] even better c0fb1c6 [Ahir Reddy] more working 043ca85 [Ahir Reddy] working 5496f9f [Ahir Reddy] doesn't crash b8b904b [Ahir Reddy] Added schema rdd class 67ba875 [Ahir Reddy] java to python, and python to java bcc0f23 [Ahir Reddy] Java to python ab6025d [Ahir Reddy] compiling
* SPARK-1488. Resolve scalac feature warnings during build (Sean Owen, 2014-04-14, 1 file, -1/+1)
For your consideration: scalac currently notes a number of feature warnings during compilation:
```
[warn] there were 65 feature warning(s); re-run with -feature for details
```
Warnings are like:
```
[warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled
[warn] by making the implicit value scala.language.implicitConversions visible.
[warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions'
[warn] or by setting the compiler option -language:implicitConversions.
[warn] See the Scala docs for value scala.language.implicitConversions for a discussion
[warn] why the feature should be explicitly enabled.
[warn] implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
[warn] ^
```
scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used. This PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build.
Author: Sean Owen <sowen@cloudera.com>
Closes #404 from srowen/SPARK-1488 and squashes the following commits: 8598980 [Sean Owen] Quiet scalac warnings about language features by explicitly importing language features. 39bc831 [Sean Owen] Enable -feature in scalac to emit language feature warnings
* SPARK-1057 (alternative) Remove fastutil (Sean Owen, 2014-04-11, 1 file, -1/+0)
(This is for discussion at this point -- I'm not suggesting this should be committed.) This is what removing fastutil looks like. Much of it is straightforward, like using `java.io` buffered stream classes, and Guava for murmurhash3. Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case though do I think the change to use `java.io` actually entails an extra array copy. The rest is using `OpenHashMap` and `OpenHashSet`. These are now written in terms of more scala-like operations. `OpenHashMap` is where I made three non-trivial changes to make it work, and they need review:
- It is no longer private
- The key must be a `ClassTag`
- Unless a lot of other code changes, the key type can't enforce being a supertype of `Null`
It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective. But what about those last changes?
Author: Sean Owen <sowen@cloudera.com>
Closes #266 from srowen/SPARK-1057-alternate and squashes the following commits: 2601129 [Sean Owen] Fix Map return type error not previously caught ec65502 [Sean Owen] Updates from matei's review 00bc81e [Sean Owen] Remove use of fastutil and replace with use of java.io, spark.util and Guava classes
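A small usage sketch of `OpenHashMap` after those changes; the counting example is illustrative, and note the `String` key satisfies the new `ClassTag` bound:

```scala
import org.apache.spark.util.collection.OpenHashMap

// Sketch: word counting with the now-public OpenHashMap.
val counts = new OpenHashMap[String, Int]()
Seq("a", "b", "a").foreach { w =>
  counts.changeValue(w, 1, _ + 1) // insert 1, or increment an existing value
}
counts.iterator.foreach { case (k, v) => println(s"$k -> $v") }
```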
* Revert "SPARK-1433: Upgrade Mesos dependency to 0.17.0"Patrick Wendell2014-04-101-1/+1
| | | | This reverts commit 12c077d5aa0b76a808a55db625c9677a52bd43f9.
* Spark-939: allow user jars to take precedence over spark jars (Holden Karau, 2014-04-08, 1 file, -0/+1)
I still need to do a small bit of re-factoring [mostly the one Java file, which I'll switch back to a Scala file and use in both the class loaders], but comments on other things I should do would be great.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #217 from holdenk/spark-939-allow-user-jars-to-take-precedence-over-spark-jars and squashes the following commits: cf0cac9 [Holden Karau] Fix the executorclassloader 1955232 [Holden Karau] Fix long line in TestUtils 8f89965 [Holden Karau] Fix tests for new class name 7546549 [Holden Karau] CR feedback, merge some of the testutils methods down, rename the classloader 644719f [Holden Karau] Use the class generator for the repl class loader tests too f0b7114 [Holden Karau] Fix the core/src/test/scala/org/apache/spark/executor/ExecutorURLClassLoaderSuite.scala tests 204b199 [Holden Karau] Fix the generated classes 9f68f10 [Holden Karau] Start rewriting the ExecutorURLClassLoaderSuite to not use the hard coded classes 858aba2 [Holden Karau] Remove a bunch of test junk 261aaee [Holden Karau] simplify executorurlclassloader a bit 7a7bf5f [Holden Karau] CR feedback d4ae848 [Holden Karau] rewrite component into scala aa95083 [Holden Karau] CR feedback 7752594 [Holden Karau] re-add https comment a0ef85a [Holden Karau] Fix style issues 125ea7f [Holden Karau] Easier to just remove those files, we don't need them bb8d179 [Holden Karau] Fix issues with the repl class loader 241b03d [Holden Karau] fix my rat excludes a343350 [Holden Karau] Update rat-excludes and remove a useless file d90d217 [Holden Karau] Fix fall back with custom class loader and add a test for it 4919bf9 [Holden Karau] Fix parent calling class loader issue 8a67302 [Holden Karau] Test are good 9e2d236 [Holden Karau] It works comrade 691ee00 [Holden Karau] It works ish dc4fe44 [Holden Karau] Does not depend on being in my home directory 47046ff [Holden Karau] Remove bad import' 22d83cb [Holden Karau] Add a test suite for the executor url class loader suite 7ef4628 [Holden Karau] Clean up 792d961 [Holden Karau] Almost works 16aecd1 [Holden Karau] Doesn't quite work 8d2241e [Holden Karau] Adda FakeClass for testing ClassLoader precedence options 648b559 [Holden Karau] Both class loaders compile. Now for testing e1d9f71 [Holden Karau] One loader workers.
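The core mechanism here is a child-first classloader; a minimal sketch follows (the class name is illustrative; Spark's actual loader lives in the executor code and also handles resources and error cases):

```scala
import java.net.{URL, URLClassLoader}

// Sketch: search the user's jars before delegating to the Spark/parent loader,
// so user-supplied versions of a class take precedence.
class ChildFirstURLClassLoader(urls: Array[URL], realParent: ClassLoader)
    extends URLClassLoader(urls, null) {
  override def loadClass(name: String, resolve: Boolean): Class[_] =
    try super.loadClass(name, resolve) // user jars (and bootstrap) first
    catch {
      case _: ClassNotFoundException => realParent.loadClass(name) // fall back
    }
}
```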
* SPARK-1433: Upgrade Mesos dependency to 0.17.0 (Sandeep, 2014-04-08, 1 file, -1/+1)
Mesos 0.13.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0
Author: Sandeep <sandeep@techaddict.me>
Closes #355 from techaddict/mesos_update and squashes the following commits: f1abeee [Sandeep] SPARK-1433: Upgrade Mesos dependency to 0.17.0 Mesos 0.13.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0
* [SQL] SPARK-1371 Hash Aggregation Improvements (Michael Armbrust, 2014-04-07, 1 file, -0/+1)
Given:
```scala
case class Data(a: Int, b: Int)
val rdd = sparkContext
  .parallelize(1 to 200)
  .flatMap(_ => (1 to 50000).map(i => Data(i % 100, i)))
rdd.registerAsTable("data")
cacheTable("data")
```
Before:
```
SELECT COUNT(*) FROM data:[10000000] 16795.567ms
SELECT a, SUM(b) FROM data GROUP BY a 7536.436ms
SELECT SUM(b) FROM data 10954.1ms
```
After:
```
SELECT COUNT(*) FROM data:[10000000] 1372.175ms
SELECT a, SUM(b) FROM data GROUP BY a 2070.446ms
SELECT SUM(b) FROM data 958.969ms
```
Author: Michael Armbrust <michael@databricks.com>
Closes #295 from marmbrus/hashAgg and squashes the following commits: ec63575 [Michael Armbrust] Add comment. d0495a9 [Michael Armbrust] Use scaladoc instead. b4a6887 [Michael Armbrust] Address review comments. a2d90ba [Michael Armbrust] Capture child output statically to avoid issues with generators and serialization. 7c13112 [Michael Armbrust] Rewrite Aggregate operator to stream input and use projections. Remove unused local RDD functions implicits. 5096f99 [Michael Armbrust] Make HiveUDAF fields transient since object inspectors are not serializable. 6a4b671 [Michael Armbrust] Add option to avoid binding operators expressions automatically. 92cca08 [Michael Armbrust] Always include serialization debug info when running tests. 1279df2 [Michael Armbrust] Increase default number of partitions.
* SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging (Aaron Davidson, 2014-04-06, 1 file, -9/+16)
Previously, we based our decision regarding including datanucleus jars on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" is run. This means that a typical and previously supported pathway would start using hive jars. This patch has the following features/bug fixes:
- Use of SPARK_HIVE (default false) to determine if we should include Hive in the assembly jar.
- Analogous feature in Maven with -Phive (previously, there was no support for adding Hive to any of our jars produced by Maven)
- assemble-deps fixed since we no longer use a different ASSEMBLY_DIR
- avoid adding log message in compute-classpath.sh to the classpath :)
Still TODO before mergeable:
- We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set, similar to how sbt downloads itself.
- Spark SQL documentation updates.
Author: Aaron Davidson <aaron@databricks.com>
Closes #237 from aarondav/master and squashes the following commits: 5dc4329 [Aaron Davidson] Typo fixes dd4f298 [Aaron Davidson] Doc update dd1a365 [Aaron Davidson] Eliminate need for SPARK_HIVE at runtime by d/ling datanucleus from Maven a9269b5 [Aaron Davidson] [WIP] Use SPARK_HIVE to determine if we include Hive in packaging
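In a build definition this style of opt-in gating is a one-liner on the environment; a sketch with illustrative names:

```scala
// Sketch: include the Hive sub-project only when the user opts in with
// SPARK_HIVE=true; the default is false.
val isHiveEnabled: Boolean =
  sys.env.get("SPARK_HIVE").exists(_.toLowerCase == "true")

// Illustrative use when aggregating projects for the assembly:
// lazy val assemblyProjects = coreProjects ++ (if (isHiveEnabled) Seq(hive) else Nil)
```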
* SPARK-1387. Update build plugins, avoid plugin version warning, centralize versions (Sean Owen, 2014-04-06, 1 file, -21/+28)
Another handful of small build changes to organize and standardize a bit, and avoid warnings:
- Update Maven plugin versions for good measure
- Since plugins need Maven 3.0.4 already, require it explicitly (<3.0.4 had some bugs anyway)
- Use variables to define versions across dependencies where they should move in lock step
- ... and make this consistent between Maven/SBT
OK, I also updated the JIRA URL while I was at it here.
Author: Sean Owen <sowen@cloudera.com>
Closes #291 from srowen/SPARK-1387 and squashes the following commits: 461eca1 [Sean Owen] Couldn't resist also updating JIRA location to new one c2d5cc5 [Sean Owen] Update plugins and Maven version; use variables consistently across Maven/SBT to define dependency versions that should stay in step.
* SPARK-1305: Support persisting RDDs directly to Tachyon (Haoyuan Li, 2014-04-04, 1 file, -6/+11)
Move PR#468 of apache-incubator-spark to the apache-spark repository: "Adding an option to persist Spark RDD blocks into Tachyon."
Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
Author: RongGu <gurongwalker@gmail.com>
Closes #158 from RongGu/master and squashes the following commits: 72b7768 [Haoyuan Li] merge master 9f7fa1b [Haoyuan Li] fix code style ae7834b [Haoyuan Li] minor cleanup a8b3ec6 [Haoyuan Li] merge master branch e0f4891 [Haoyuan Li] better check offheap. 55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel 7cd4600 [RongGu] remove some logic code for tachyonstore's replication 51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore 8adfcfa [RongGu] address arron's comment on inTachyonSize 120e48a [RongGu] changed the root-level dir name in Tachyon 5cc041c [Haoyuan Li] address aaron's comments 9b97935 [Haoyuan Li] address aaron's comments d9a6438 [Haoyuan Li] fix for pspark 77d2703 [Haoyuan Li] change python api.git status 3dcace4 [Haoyuan Li] address matei's comments 91fa09d [Haoyuan Li] address patrick's comments 589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE 64348b2 [Haoyuan Li] update conf docs. ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1 619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore be79d77 [RongGu] find a way to clean up some unnecessary methods and classes to make the code simpler 49cc724 [Haoyuan Li] update docs with off_heap option 4572f9f [RongGu] reserving the old apply function API of StorageLevel 04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP 76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon; minor code style fix 939e467 [Haoyuan Li] 0.4.1-thrift from maven central 86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1 16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1 bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem 6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1 d827250 [RongGu] fix JsonProtocolSuite test failure 716e93b [Haoyuan Li] revert the version ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift 2825a13 [RongGu] up-merging to the current master branch of the apache spark 6a22c1a [Haoyuan Li] fix scalastyle 8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client. 77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice. 1dcadf9 [Haoyuan Li] typo bf278fa [Haoyuan Li] fix python tests e82909c [Haoyuan Li] minor cleanup 776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR 8859371 [Haoyuan Li] various minor fixes and clean up e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode. fcaeab2 [Haoyuan Li] address Aaron's comment e554b1e [Haoyuan Li] add python code 47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels. dc8ef24 [Haoyuan Li] add old storelevel constructor e01a271 [Haoyuan Li] update tachyon 0.4.1 8011a96 [RongGu] fix a brought-in mistake in StorageLevel 70ca182 [RongGu] a bit change in comment 556978b [RongGu] fix the scalastyle errors 791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
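The user-facing API this adds, in a short sketch; the RDD `rdd` is assumed to already exist:

```scala
import org.apache.spark.storage.StorageLevel

// Sketch: persist blocks off-heap (in Tachyon) instead of on the JVM heap.
rdd.persist(StorageLevel.OFF_HEAP)
```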