* [SPARK-1754] [SQL] Add missing arithmetic DSL operations. (Takuya UESHIN, 2014-05-08; 2 files changed, -1/+19)
  Add missing arithmetic DSL operations: `unary_-`, `%`.
  Author: Takuya UESHIN <ueshin@happy-camper.st>
  Closes #689 from ueshin/issues/SPARK-1754 and squashes the following commits:
    a09ef69 [Takuya UESHIN] Add also missing ! (not) operation.
    f73ae2c [Takuya UESHIN] Remove redundant tests.
    5b3f087 [Takuya UESHIN] Add tests relating DSL operations.
    e09c5b8 [Takuya UESHIN] Add missing arithmetic DSL operations.
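As an illustration of why the operators are spelled `unary_-` and `%`, here is a minimal, self-contained Scala sketch of an expression DSL (a toy, not Spark's actual Catalyst DSL) built on Scala's operator-method conventions:

```
// Scala turns `-e` into `e.unary_-` and `a % b` into `a.%(b)`,
// so defining these methods is all the DSL needs.
sealed trait Expr {
  def +(other: Expr): Expr = Add(this, other)
  def %(other: Expr): Expr = Remainder(this, other)
  def unary_- : Expr = Negate(this)
}
case class Lit(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Remainder(left: Expr, right: Expr) extends Expr
case class Negate(child: Expr) extends Expr

object DslDemo extends App {
  val expr = -(Lit(7) % Lit(3)) + Lit(1)   // reads like ordinary arithmetic
  println(expr)                            // Add(Negate(Remainder(Lit(7),Lit(3))),Lit(1))
}
```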
* Fixing typo in als.py (Evan Sparks, 2014-05-08; 1 file changed, -1/+1)
  XtY should be Xty.
  Author: Evan Sparks <evan.sparks@gmail.com>
  Closes #696 from etrain/patch-2 and squashes the following commits:
    634cb8d [Evan Sparks] Fixing typo in als.py
* [SPARK-1745] Move interrupted flag from TaskContext constructor (minor) (Andrew Or, 2014-05-08; 5 files changed, -22/+17)
  It makes little sense to start a TaskContext that is interrupted. Indeed, I searched for all use cases of it and didn't find a single instance in which `interrupted` is true on construction. This was inspired by reviewing #640, which adds an additional `@volatile var completed` that is similar. These are not the most urgent changes, but I wanted to push them out before I forget.
  Author: Andrew Or <andrewor14@gmail.com>
  Closes #675 from andrewor14/task-context and squashes the following commits:
    9575e02 [Andrew Or] Add space
    69455d1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into task-context
    c471490 [Andrew Or] Oops, removed one flag too many. Adding it back.
    85311f8 [Andrew Or] Move interrupted flag from TaskContext constructor
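A minimal sketch of the shape this leaves the class in (field names illustrative only, not the actual TaskContext): the cancellation flag is a mutable, volatile member flipped later by the scheduler rather than a constructor argument that is always false:

```
// Sketch only: cancellation/completion state starts false and changes during
// the task's lifetime, so it does not belong in the constructor parameter list.
class TaskContextSketch(val stageId: Int, val partitionId: Int, val attemptId: Long) {
  @volatile var interrupted: Boolean = false
  @volatile var completed: Boolean = false

  def markInterrupted(): Unit = { interrupted = true }
}
```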
* SPARK-1565, update examples to be used with spark-submit script. (Prashant Sharma, 2014-05-08; 56 files changed, -480/+405)
  Commit for initial feedback; basically I am curious whether we should prompt the user for providing args, especially when they are mandatory, and whether we can skip them if they are not. Also a few other things did not work, like `bin/spark-submit examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar --class org.apache.spark.examples.SparkALS --arg 100 500 10 5 2` -- not all the args get passed properly; maybe I have messed something up and will try to sort it out.
  Author: Prashant Sharma <prashant.s@imaginea.com>
  Closes #552 from ScrapCodes/SPARK-1565/update-examples and squashes the following commits:
    669dd23 [Prashant Sharma] Review comments
    2727e70 [Prashant Sharma] SPARK-1565, update examples to be used with spark-submit script.
* [SQL] Improve SparkSQL Aggregates (Michael Armbrust, 2014-05-08; 4 files changed, -10/+96)
  * Add native min/max (was using hive before).
  * Handle nulls correctly in Avg and Sum.
  Author: Michael Armbrust <michael@databricks.com>
  Closes #683 from marmbrus/aggFixes and squashes the following commits:
    64fe30b [Michael Armbrust] Improve SparkSQL Aggregates: add native min/max (was using hive before); handle nulls correctly in Avg and Sum.
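The null-handling fix can be pictured with a small, purely illustrative Scala sketch (plain Options rather than Catalyst's actual Sum/Avg expressions): null inputs are skipped, and an all-null group yields NULL instead of 0:

```
// Illustration only -- not Spark SQL's implementation.
def nullSafeSum(values: Seq[Option[Double]]): Option[Double] = {
  val present = values.flatten
  if (present.isEmpty) None else Some(present.sum)
}

def nullSafeAvg(values: Seq[Option[Double]]): Option[Double] = {
  val present = values.flatten
  if (present.isEmpty) None else Some(present.sum / present.size)
}

// nullSafeSum(Seq(Some(1.0), None, Some(2.0)))  == Some(3.0)
// nullSafeSum(Seq(None, None))                  == None (NULL, not 0)
```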
* Use numpy directly for matrix multiply. (Evan Sparks, 2014-05-08; 1 file changed, -8/+7)
  Using matrix multiply to compute XtX and XtY yields a 5-20x speedup depending on problem size. For example, the following takes 19s locally after this change vs. 5m21s before it (a 16x speedup):
  bin/pyspark examples/src/main/python/als.py local[8] 1000 1000 50 10 10
  Author: Evan Sparks <evan.sparks@gmail.com>
  Closes #687 from etrain/patch-1 and squashes the following commits:
    e094dbc [Evan Sparks] Touching only diagonals on update.
    d1ab9b6 [Evan Sparks] Use numpy directly for matrix multiply.
* SPARK-1668: Add implicit preference as an option to examples/MovieLensALS (Sandeep, 2014-05-08; 1 file changed, -9/+46)
  Add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/.
  Author: Sandeep <sandeep@techaddict.me>
  Closes #597 from techaddict/SPARK-1668 and squashes the following commits:
    8b371dc [Sandeep] Second Pass on reviews by mengxr
    eca9d37 [Sandeep] based on mengxr's suggestions
    937e54c [Sandeep] Changes
    5149d40 [Sandeep] Changes based on review
    1dd7657 [Sandeep] use mean()
    42444d7 [Sandeep] Based on Suggestions by mengxr
    e3082fa [Sandeep] SPARK-1668: Add implicit preference as an option to examples/MovieLensALS; add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/
* SPARK-1544 Add support for deep decision trees. (Manish Amde, 2014-05-07; 5 files changed, -33/+177)
  @etrain and I came up with a PR for arbitrarily deep decision trees at the cost of multiple passes over the data at deep tree levels. To summarize:
  1) We take a parameter that indicates the amount of memory users want to reserve for computation on each worker (and 2x that at the driver).
  2) Using that information, we calculate two things - the maximum depth to which we train as usual (which is, implicitly, the maximum number of nodes we want to train in parallel), and the size of the groups we should use in the case where we exceed this depth.
  cc: @atalwalkar, @hirakendu, @mengxr
  Author: Manish Amde <manish9ue@gmail.com>
  Author: manishamde <manish9ue@gmail.com>
  Author: Evan Sparks <sparks@cs.berkeley.edu>
  Closes #475 from manishamde/deep_tree and squashes the following commits:
    968ca9d [Manish Amde] merged master
    7fc9545 [Manish Amde] added docs
    ce004a1 [Manish Amde] minor formatting
    b27ad2c [Manish Amde] formatting
    426bb28 [Manish Amde] programming guide blurb
    8053fed [Manish Amde] more formatting
    5eca9e4 [Manish Amde] grammar
    4731cda [Manish Amde] formatting
    5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
    cbd9f14 [Manish Amde] modified scala.math to math
    dad9652 [Manish Amde] removed unused imports
    e0426ee [Manish Amde] renamed parameter
    718506b [Manish Amde] added unit test
    1517155 [Manish Amde] updated documentation
    9dbdabe [Manish Amde] merge from master
    719d009 [Manish Amde] updating user documentation
    fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
    0287772 [Evan Sparks] Fixing scalastyle issue.
    2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
    2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
    abc5a23 [Evan Sparks] Parameterizing max memory.
    50b143a [Manish Amde] adding support for very deep trees
* Update GradientDescentSuite.scala (baishuo(白硕), 2014-05-07; 1 file changed, -3/+3)
  Use a faster way to construct an array.
  Author: baishuo(白硕) <vc_java@hotmail.com>
  Closes #588 from baishuo/master and squashes the following commits:
    45b95fb [baishuo(白硕)] Update GradientDescentSuite.scala
    c03b61c [baishuo(白硕)] Update GradientDescentSuite.scala
    b666d27 [baishuo(白硕)] Update GradientDescentSuite.scala
* [SPARK-1743][MLLIB] add loadLibSVMFile and saveAsLibSVMFile to pyspark (Xiangrui Meng, 2014-05-07; 2 files changed, -2/+178)
  Make loading/saving labeled data easier for pyspark users. Also changed type check in `SparseVector` to allow numpy integers.
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #672 from mengxr/pyspark-mllib-util and squashes the following commits:
    2943fa7 [Xiangrui Meng] format docs
    d61668d [Xiangrui Meng] add loadLibSVMFile and saveAsLibSVMFile to pyspark
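For reference, the LibSVM text format these helpers read and write is `label index1:value1 index2:value2 ...` with 1-based indices. A rough sketch of parsing one line, written in Scala purely for illustration (not the pyspark implementation):

```
// Parse "1.0 3:2.5 7:0.1" into a label plus sparse (index, value) pairs.
def parseLibSVMLine(line: String): (Double, Seq[Int], Seq[Double]) = {
  val tokens = line.trim.split("\\s+").toSeq
  val label = tokens.head.toDouble
  val (indices, values) = tokens.tail.map { item =>
    val Array(i, v) = item.split(':')
    (i.toInt - 1, v.toDouble)   // convert 1-based file indices to 0-based
  }.unzip
  (label, indices, values)
}
```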
* SPARK-1569 Spark on Yarn, authentication broken by pr299 (Thomas Graves, 2014-05-07; 1 file changed, -19/+30)
  Pass the configs as java options since the executor needs to know before it registers whether to create the connection using authentication or not. We could see about passing only the authentication configs but for now I just had it pass them all. I also updated it to use a list to construct the command to make it the same as ClientBase and avoid any issues with spaces.
  Author: Thomas Graves <tgraves@apache.org>
  Closes #649 from tgravescs/SPARK-1569 and squashes the following commits:
    0178ab8 [Thomas Graves] add akka settings
    22a8735 [Thomas Graves] Change to only path spark.auth* configs
    8ccc1d4 [Thomas Graves] SPARK-1569 Spark on Yarn, authentication broken
* [SPARK-1688] Propagate PySpark worker stderr to driver (Andrew Or, 2014-05-07; 4 files changed, -101/+123)
  When at least one of the following conditions is true, PySpark cannot be loaded:
  1. PYTHONPATH is not set
  2. PYTHONPATH does not contain the python directory (or jar, in the case of YARN)
  3. The jar does not contain pyspark files (YARN)
  4. The jar does not contain py4j files (YARN)
  However, we currently throw the same random `java.io.EOFException` for all of the above cases, when trying to read from the python daemon's output. This message is super unhelpful. This PR includes the python stderr and the PYTHONPATH in the exception propagated to the driver. Now, the exception message looks something like:
  ```
  Error from python worker:
    : No module named pyspark
  PYTHONPATH was:
    /path/to/spark/python:/path/to/some/jar
  java.io.EOFException
  <stack trace>
  ```
  whereas before it was just
  ```
  java.io.EOFException
  <stack trace>
  ```
  Author: Andrew Or <andrewor14@gmail.com>
  Closes #603 from andrewor14/pyspark-exception and squashes the following commits:
    10d65d3 [Andrew Or] Throwable -> Exception, worker -> daemon
    862d1d7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
    a5ed798 [Andrew Or] Use block string and interpolation instead of var (minor)
    cc09c45 [Andrew Or] Account for the fact that the python daemon may not have terminated yet
    444f019 [Andrew Or] Use the new RedirectThread + include system PYTHONPATH
    aab00ae [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
    0cc2402 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
    783efe2 [Andrew Or] Make python daemon stderr indentation consistent
    9524172 [Andrew Or] Avoid potential NPE / error stream contention + Move things around
    29f9688 [Andrew Or] Add back original exception type
    e92d36b [Andrew Or] Include python worker stderr in the exception propagated to the driver
    7c69360 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
    cdbc185 [Andrew Or] Fix python attribute not found exception when PYTHONPATH is not set
    dcc0353 [Andrew Or] Check both python and system environment variables for PYTHONPATH
    6c09c21 [Andrew Or] Validate PYTHONPATH and PySpark modules before starting python workers
* Typo fix: fetchting -> fetching (Andrew Ash, 2014-05-07; 1 file changed, -1/+1)
  Author: Andrew Ash <andrew@andrewash.com>
  Closes #680 from ash211/patch-3 and squashes the following commits:
    9ce3746 [Andrew Ash] Typo fix: fetchting -> fetching
* Nicer logging for SecurityManager startup (Andrew Ash, 2014-05-07; 1 file changed, -2/+3)
  Happy to open a jira ticket if you'd like to track one there.
  Author: Andrew Ash <andrew@andrewash.com>
  Closes #678 from ash211/SecurityManagerLogging and squashes the following commits:
    2aa0b7a [Andrew Ash] Nicer logging for SecurityManager startup
* [SQL] Fix Performance Issue in data type casting (Cheng Hao, 2014-05-07; 1 file changed, -1/+1)
  Use a lazy val instead of a function in the Cast class, which improved performance by nearly 2X in my local micro-benchmark.
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #679 from chenghao-intel/fix_type_casting and squashes the following commits:
    71b0902 [Cheng Hao] using lazy val object instead of function for data type casting
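A sketch of the change in miniature (illustrative names, not the actual Cast expression): resolve the conversion function once with a lazy val instead of rebuilding it for every row:

```
class CastSketch(targetType: String) {
  // Before: a def, so the pattern match ran again on every row evaluated.
  // After: a lazy val, so the match runs once and the same closure is reused.
  private lazy val castFn: Any => Any = targetType match {
    case "int"    => v => v.toString.trim.toInt
    case "double" => v => v.toString.trim.toDouble
    case "string" => v => v.toString
  }

  def eval(value: Any): Any = castFn(value)
}
```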
* SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions (Aaron Davidson, 2014-05-07; 5 files changed, -107/+141)
  This patch includes several cleanups to PythonRDD, focused around fixing [SPARK-1579](https://issues.apache.org/jira/browse/SPARK-1579) cleanly. Listed in order of approximate importance:
  - The Python daemon waits for Spark to close the socket before exiting, in order to avoid causing spurious IOExceptions in Spark's `PythonRDD::WriterThread`.
  - Removes the Python Monitor Thread, which polled for task cancellations in order to kill the Python worker. Instead, we do this in the onCompleteCallback, since this is guaranteed to be called during cancellation.
  - Adds a "completed" variable to TaskContext to avoid the issue noted in [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), where onCompleteCallbacks may be execution-order dependent. Along with this, I removed the "context.interrupted = true" flag in the onCompleteCallback.
  - Extracts PythonRDD::WriterThread to its own class.
  Since this patch provides an alternative solution to [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), I did test it with
  ```
  sc.textFile("latlon.tsv").take(5)
  ```
  many times without error. Additionally, in order to test the unswallowed exceptions, I performed
  ```
  sc.textFile("s3n://<big file>").count()
  ```
  and cut my internet during execution. Prior to this patch, we got the "stdin writer exited early" message, which was unhelpful. Now, we get the SocketExceptions propagated through Spark to the user and get proper (though unsuccessful) task retries.
  Author: Aaron Davidson <aaron@databricks.com>
  Closes #640 from aarondav/pyspark-io and squashes the following commits:
    b391ff8 [Aaron Davidson] Detect "clean socket shutdowns" and stop waiting on the socket
    c0c49da [Aaron Davidson] SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions
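The cleanup pattern this relies on, sketched minimally (names are illustrative; TaskContext's real API differs in detail): teardown work is registered as a completion callback and runs exactly once when the task ends, whether it succeeded, failed, or was cancelled:

```
class TaskCompletionSketch {
  @volatile var completed = false
  private var callbacks: List[() => Unit] = Nil

  def addOnCompleteCallback(f: () => Unit): Unit = synchronized {
    callbacks = f :: callbacks
  }

  // Called by the executor when the task finishes for any reason;
  // e.g. the Python worker can be destroyed here instead of by a monitor thread.
  def markTaskCompleted(): Unit = {
    completed = true
    callbacks.foreach(_())
  }
}
```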
* [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations that do not change schema (Kan Zhang, 2014-05-07; 7 files changed, -22/+246)
  Author: Kan Zhang <kzhang@apache.org>
  Closes #448 from kanzhang/SPARK-1460 and squashes the following commits:
    111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD
    91dc787 [Kan Zhang] Taking into account newly added Ordering param
    79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
* [WIP][Spark-SQL] Optimize the Constant Folding for Expression (Cheng Hao, 2014-05-07; 14 files changed, -32/+1502)
  Currently, expressions do not support the "constant null" case well in constant folding; e.g. Sum(a, 0) actually always produces Literal(0, NumericType) at runtime. For example:
  ```
  explain select isnull(key+null) from src;
  == Logical Plan ==
  Project [HiveGenericUdf#isnull((key#30 + CAST(null, IntegerType))) AS c_0#28]
   MetastoreRelation default, src, None
  == Optimized Logical Plan ==
  Project [true AS c_0#28]
   MetastoreRelation default, src, None
  == Physical Plan ==
  Project [true AS c_0#28]
   HiveTableScan [], (MetastoreRelation default, src, None), None
  ```
  I've created a new optimization rule called NullPropagation for this kind of constant folding.
  Author: Cheng Hao <hao.cheng@intel.com>
  Author: Michael Armbrust <michael@databricks.com>
  Closes #482 from chenghao-intel/optimize_constant_folding and squashes the following commits:
    2f14b50 [Cheng Hao] Fix code style issues
    68b9fad [Cheng Hao] Remove the Literal pattern matching for NullPropagation
    29c8166 [Cheng Hao] Update the code for feedback of code review
    50444cc [Cheng Hao] Remove the unnecessary null checking
    80f9f18 [Cheng Hao] Update the UnitTest for aggregation constant folding
    27ea3d7 [Cheng Hao] Fix Constant Folding Bugs & Add More Unittests
    b28e03a [Cheng Hao] Merge pull request #1 from marmbrus/pr/482
    9ccefdb [Michael Armbrust] Add tests for optimized expression evaluation.
    543ef9d [Cheng Hao] fix code style issues
    9cf0396 [Cheng Hao] update code according to the code review comment
    536c005 [Cheng Hao] Add Exceptional case for constant folding
    3c045c7 [Cheng Hao] Optimize the Constant Folding by adding more rules
    2645d4f [Cheng Hao] Constant Folding(null propagation)
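The idea behind a null-propagation rule, shown on a toy expression tree (purely illustrative, not the Catalyst implementation): once a strict operator is known to have a null operand, the whole subtree folds to a literal at optimization time, which is what turns the isnull(key+null) projection above into a constant true:

```
sealed trait Exp
case object NullLit extends Exp
case class IntLit(v: Int) extends Exp
case class BoolLit(v: Boolean) extends Exp
case class Plus(l: Exp, r: Exp) extends Exp
case class IsNull(child: Exp) extends Exp

def propagateNulls(e: Exp): Exp = e match {
  case Plus(l, r) =>
    (propagateNulls(l), propagateNulls(r)) match {
      case (NullLit, _) | (_, NullLit) => NullLit   // null + anything is null
      case (pl, pr)                    => Plus(pl, pr)
    }
  case IsNull(child) =>
    propagateNulls(child) match {
      case NullLit => BoolLit(true)                 // statically known to be true
      case other   => IsNull(other)
    }
  case other => other
}

// propagateNulls(IsNull(Plus(IntLit(1), NullLit))) == BoolLit(true)
```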
* SPARK-1746: Support setting SPARK_JAVA_OPTS on executors for backwards compatibility (Patrick Wendell, 2014-05-07; 1 file changed, -2/+9)
  Author: Patrick Wendell <pwendell@gmail.com>
  Closes #676 from pwendell/worker-opts and squashes the following commits:
    54456c4 [Patrick Wendell] SPARK-1746: Support setting SPARK_JAVA_OPTS on executors for backwards compatibility
* [HOTFIX] SPARK-1637: There are some Streaming examples added after the PR #571 was last updated. This resulted in Compilation Errors. (Sandeep, 2014-05-06; 2 files changed, -6/+4)
  cc @mateiz -- the project is not compiling currently.
  Author: Sandeep <sandeep@techaddict.me>
  Closes #673 from techaddict/SPARK-1637-HOTFIX and squashes the following commits:
    b512f4f [Sandeep] [SPARK-1637][HOTFIX] There are some Streaming examples added after the PR #571 was last updated. This resulted in Compilation Errors.
* Proposal: clarify Scala programming guide on caching with regards to saved map output (Ethan Jewett, 2014-05-06; 1 file changed, -3/+5)
  Wording taken partially from Matei Zaharia's email to the Spark user list. http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html
  Author: Ethan Jewett <esjewett@gmail.com>
  Closes #668 from esjewett/Doc-update and squashes the following commits:
    11793ce [Ethan Jewett] Update based on feedback
    171e670 [Ethan Jewett] Clarify Scala programming guide on caching ...
* SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarily) MLlib docs (Sean Owen, 2014-05-06; 17 files changed, -68/+97)
  While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs. Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #653 from srowen/SPARK-1727 and squashes the following commits:
    6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count
    8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output)
    99966a9 [Sean Owen] Update issue tracker URL in docs
    23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak)
    8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs
* SPARK-1637: Clean up examples for 1.0 (Sandeep, 2014-05-06; 40 files changed, -72/+69)
  - [x] Move all of them into subpackages of org.apache.spark.examples (right now some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
  - [x] Move Python examples into examples/src/main/python
  - [x] Update docs to reflect these changes
  Author: Sandeep <sandeep@techaddict.me>
  This patch had conflicts when merged, resolved by Committer: Matei Zaharia <matei@databricks.com>
  Closes #571 from techaddict/SPARK-1637 and squashes the following commits:
    47ef86c [Sandeep] Changes based on Discussions on PR, removing use of RawTextHelper from examples
    8ed2d3f [Sandeep] Docs Updated for changes, Change for java examples
    5f96121 [Sandeep] Move Python examples into examples/src/main/python
    0a8dd77 [Sandeep] Move all Scala Examples to org.apache.spark.examples (some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
* SPARK-1737: Warn rather than fail when Java 7+ is used to create distributions (Patrick Wendell, 2014-05-06; 1 file changed, -10/+10)
  Also moves a few lines of code around in make-distribution.sh.
  Author: Patrick Wendell <pwendell@gmail.com>
  Closes #669 from pwendell/make-distribution and squashes the following commits:
    8bfac49 [Patrick Wendell] Small fix
    46918ec [Patrick Wendell] SPARK-1737: Warn rather than fail when Java 7+ is used to create distributions.
* [SPARK-1549] Add Python support to spark-submit (Matei Zaharia, 2014-05-06; 16 files changed, -194/+505)
  This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.
  This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging.
  In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.
  In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.
  Author: Matei Zaharia <matei@databricks.com>
  Closes #664 from mateiz/py-submit and squashes the following commits:
    15e9669 [Matei Zaharia] Fix some uses of path.separator property
    051278c [Matei Zaharia] Small style fixes
    0afe886 [Matei Zaharia] Add license headers
    4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
    15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
    47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
    d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones
* SPARK-1734: spark-submit throws an exception: Exception in thread "main" java.lang.ClassNotFoundException: org.apache.spark.broadcast.TorrentBroadcastFactory (witgo, 2014-05-06; 1 file changed, -1/+1)
  Author: witgo <witgo@qq.com>
  Closes #665 from witgo/SPARK-1734 and squashes the following commits:
    cacf238 [witgo] SPARK-1734: spark-submit throws an exception: Exception in thread "main" java.lang.ClassNotFoundException: org.apache.spark.broadcast.TorrentBroadcastFactory
* [SPARK-1685] Cancel retryTimer on restart of Worker or AppClient (Mark Hamstra, 2014-05-06; 2 files changed, -8/+15)
  See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up.
  Author: Mark Hamstra <markhamstra@gmail.com>
  Closes #602 from markhamstra/SPARK-1685 and squashes the following commits:
    11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer
    69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient
* Fix two download suggestions in the docs (Patrick Wendell, 2014-05-06; 2 files changed, -31/+13)
  1) On the quick start page provide a direct link to the downloads (suggested by @pbailis).
  2) On the index page, don't suggest users always have to build Spark, since many won't.
  Author: Patrick Wendell <pwendell@gmail.com>
  Closes #662 from pwendell/quick-start and squashes the following commits:
    0622f27 [Patrick Wendell] Fix two download suggestions in the docs:
* SPARK-1474: Spark on yarn assembly doesn't include AmIpFilter (Thomas Graves, 2014-05-06; 3 files changed, -2/+30)
  We use org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter in Spark on YARN but it is not included in the assembly jar. I tested this on a YARN cluster by removing the yarn jars from the classpath and Spark runs fine now.
  Author: Thomas Graves <tgraves@apache.org>
  Closes #406 from tgravescs/SPARK-1474 and squashes the following commits:
    1548bf9 [Thomas Graves] SPARK-1474: Spark on yarn assembly doesn't include org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
* Update OpenHashSet.scala (ArcherShao, 2014-05-06; 1 file changed, -2/+2)
  Fix an incorrect comment on the addWithoutResize function.
  Author: ArcherShao <ArcherShao@users.noreply.github.com>
  Closes #667 from ArcherShao/patch-3 and squashes the following commits:
    a607358 [ArcherShao] Update OpenHashSet.scala
* [SQL] SPARK-1732 - Support for null primitive values. (Michael Armbrust, 2014-05-05; 7 files changed, -5/+122)
  I also removed a println that I bumped into.
  Author: Michael Armbrust <michael@databricks.com>
  Closes #658 from marmbrus/nullPrimitives and squashes the following commits:
    a3ec4f3 [Michael Armbrust] Remove println.
    695606b [Michael Armbrust] Support for null primitives from using scala and java reflection.
* [SPARK-1735] Add the missing special profiles to make-distribution.sh (Andrew Or, 2014-05-05; 1 file changed, -20/+27)
  73b0cbcc241cca3d318ff74340e80b02f884acbd introduced a few special profiles that are not covered in the `make-distribution.sh`. This affects hadoop versions 2.2.x, 2.3.x, and 2.4.x. Without these special profiles, a java version error for protobufs is thrown at run time.
  I took the opportunity to rewrite the way we construct the maven command. Previously, the only hadoop version that triggered the `yarn-alpha` profile was 0.23.x, which was inconsistent with the [docs](https://github.com/apache/spark/blob/master/docs/building-with-maven.md). This is now generalized to hadoop versions from 0.23.x to 2.1.x.
  Author: Andrew Or <andrewor14@gmail.com>
  Closes #660 from andrewor14/hadoop-distribution and squashes the following commits:
    6740126 [Andrew Or] Generalize the yarn profile to hadoop versions 2.2+
    88f192d [Andrew Or] Add the required special profiles to make-distribution.sh
* [SPARK-1678][SPARK-1679] In-memory compression bug fix and made compression configurable, disabled by default (Cheng Lian, 2014-05-05; 17 files changed, -26/+105)
  In-memory compression is now configurable in `SparkConf` by the `spark.sql.inMemoryCompression.enabled` property, and is disabled by default.
  To help code review, the bug fix is in [the first commit](https://github.com/liancheng/spark/commit/d537a367edf0bf24d0b925cc58b21d805ccbc11f), compression configuration is in [the second one](https://github.com/liancheng/spark/commit/4ce09aa8aa820bbbbbaa0f3f084a6cff1d4e6195).
  Author: Cheng Lian <lian.cs.zju@gmail.com>
  Closes #608 from liancheng/spark-1678 and squashes the following commits:
    66c3a8d [Cheng Lian] Renamed in-memory compression configuration key
    f8fb3a0 [Cheng Lian] Added assertion for testing .hasNext of various decoder
    4ce09aa [Cheng Lian] Made in-memory compression configurable via SparkConf
    d537a36 [Cheng Lian] Fixed SPARK-1678
* [SPARK-1594][MLLIB] Cleaning up MLlib APIs and guide (Xiangrui Meng, 2014-05-05; 39 files changed, -329/+390)
  Final pass before the v1.0 release.
  * Remove `VectorRDDs`
  * Move `BinaryClassificationMetrics` from `evaluation.binary` to `evaluation`
  * Change default value of `addIntercept` to false and allow to add intercept in Ridge and Lasso.
  * Clean `DecisionTree` package doc and test suite.
  * Mark model constructors `private[spark]`
  * Rename `loadLibSVMData` to `loadLibSVMFile` and hide `LabelParser` from users.
  * Add `saveAsLibSVMFile`.
  * Add `appendBias` to `MLUtils`.
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #524 from mengxr/mllib-cleaning and squashes the following commits:
    295dc8b [Xiangrui Meng] update loadLibSVMFile doc
    1977ac1 [Xiangrui Meng] fix doc of appendBias
    649fcf0 [Xiangrui Meng] rename loadLibSVMData to loadLibSVMFile; hide LabelParser from user APIs
    54b812c [Xiangrui Meng] add appendBias
    a71e7d0 [Xiangrui Meng] add saveAsLibSVMFile
    d976295 [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
    b7e5cec [Xiangrui Meng] remove some experimental annotations and make model constructors private[mllib]
    9b02b93 [Xiangrui Meng] minor code style update
    a593ddc [Xiangrui Meng] fix python tests
    fc28c18 [Xiangrui Meng] mark more classes experimental
    f6cbbff [Xiangrui Meng] fix Java tests
    0af70b0 [Xiangrui Meng] minor
    6e139ef [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
    94e6dce [Xiangrui Meng] move BinaryLabelCounter and BinaryConfusionMatrixImpl to evaluation.binary
    df34907 [Xiangrui Meng] clean DecisionTreeSuite to use LocalSparkContext
    c81807f [Xiangrui Meng] set the default value of AddIntercept to false
    03389c0 [Xiangrui Meng] allow to add intercept in Ridge and Lasso
    c66c56f [Xiangrui Meng] move tree md to package object doc
    a2695df [Xiangrui Meng] update guide for BinaryClassificationMetrics
    9194f4c [Xiangrui Meng] move BinaryClassificationMetrics one level up
    1c1a0e3 [Xiangrui Meng] remove VectorRDDs because it only contains one function that is not necessary for us to maintain
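A short usage sketch of the renamed Scala helpers, assuming an existing SparkContext `sc`; the paths are placeholders:

```
import org.apache.spark.mllib.util.MLUtils

// Load LabeledPoints from a LibSVM-formatted text file ...
val examples = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

// ... and write them back out in the same format.
MLUtils.saveAsLibSVMFile(examples, "out/libsvm")
```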
* Expose SparkListeners and relevant classes as DeveloperApi (Andrew Or, 2014-05-05; 11 files changed, -33/+78)
  Hopefully this can go into 1.0, as a few people on the user list have asked for this.
  Author: Andrew Or <andrewor14@gmail.com>
  Closes #648 from andrewor14/expose-listeners and squashes the following commits:
    e45e1ef [Andrew Or] Add missing colons (minor)
    350d643 [Andrew Or] Expose SparkListeners and relevant classes as DeveloperApi
* SPARK-1728. JavaRDDLike.mapPartitionsWithIndex requires ClassTag (Sandy Ryza, 2014-05-05; 1 file changed, -2/+2)
  Author: Sandy Ryza <sandy@cloudera.com>
  Closes #657 from sryza/sandy-spark-1728 and squashes the following commits:
    4751443 [Sandy Ryza] SPARK-1728. JavaRDDLike.mapPartitionsWithIndex requires ClassTag
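For reference, a small usage sketch of the Scala-side operation the Java wrapper delegates to, assuming an existing SparkContext `sc`: tag each element with the partition it came from.

```
val rdd = sc.parallelize(1 to 10, 4)
val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
  iter.map(value => (partitionIndex, value))
}
tagged.collect().foreach(println)   // e.g. (0,1), (0,2), (1,3), ...
```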
* [SPARK-1681] Include datanucleus jars in Spark Hive distribution (Andrew Or, 2014-05-05; 2 files changed, -29/+40)
  This copies the datanucleus jars over from `lib_managed` into `dist/lib`, if any. The `CLASSPATH` must also be updated to reflect this change.
  Author: Andrew Or <andrewor14@gmail.com>
  Closes #610 from andrewor14/hive-distribution and squashes the following commits:
    a4bc96f [Andrew Or] Rename search path in jar error check
    fa205e1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into hive-distribution
    7855f58 [Andrew Or] Have jar command respect JAVA_HOME + check for jar errors both cases
    c16bbfd [Andrew Or] Merge branch 'master' of github.com:apache/spark into hive-distribution
    32f6826 [Andrew Or] Leave the double colons
    940a1bb [Andrew Or] Add back 2>/dev/null
    58357cc [Andrew Or] Include datanucleus jars in Spark distribution built with Hive support
* [SPARK-1504], [SPARK-1505], [SPARK-1558] Updated Spark Streaming guide (Tathagata Das, 2014-05-05; 10 files changed, -146/+360)
  - SPARK-1558: Updated custom receiver guide to match it with the new API
  - SPARK-1504: Added deployment and monitoring subsection to streaming
  - SPARK-1505: Added migration guide for migrating from 0.9.x and below to Spark 1.0
  - Updated various Java streaming examples to use JavaReceiverInputDStream to highlight the API change.
  - Removed the requirement for cleaner ttl from streaming guide
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #652 from tdas/doc-fix and squashes the following commits:
    cb4f4b7 [Tathagata Das] Possible fix for flaky graceful shutdown test.
    ab71f7f [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into doc-fix
    8d6ff9b [Tathagata Das] Addded migration guide to Spark Streaming.
    7d171df [Tathagata Das] Added reference to JavaReceiverInputStream in examples and streaming guide.
    49edd7c [Tathagata Das] Change java doc links to use Java docs.
    11528d7 [Tathagata Das] Updated links on index page.
    ff80970 [Tathagata Das] More updates to streaming guide.
    4dc42e9 [Tathagata Das] Added monitoring and other documentation in the streaming guide.
    14c6564 [Tathagata Das] Updated custom receiver guide.
* SPARK-1721: Reset the thread classLoader in the Mesos Executor (Bouke van der Bijl, 2014-05-05; 2 files changed, -4/+3)
  This is because Mesos calls it with a different environment or something; the result is that the Spark jar is missing and it can't load classes. This fixes http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html
  I have no idea whether this is the right fix, I can only confirm that it fixes the issue for us. The `registered` method is called from Mesos (https://github.com/apache/mesos/blob/765ff9bc2ac5a12d4362f8235b572a37d646390a/src/java/jni/org_apache_mesos_MesosExecutorDriver.cpp). I am unsure which commit caused this regression.
  Author: Bouke van der Bijl <boukevanderbijl@gmail.com>
  Closes #620 from bouk/mesos-classloader-fix and squashes the following commits:
    c13eae0 [Bouke van der Bijl] Use getContextOrSparkClassLoader in SparkEnv and CompressionCodec
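The fallback the fix leans on can be sketched as follows (a simplified rendering, not necessarily the exact Utils code): prefer the current thread's context classloader and fall back to the loader that loaded Spark's own classes.

```
object ClassLoaderUtil {
  // Prefer the thread's context classloader; fall back to the loader that
  // loaded Spark's own classes (e.g. when Mesos left the context loader unset).
  def getContextOrSparkClassLoader: ClassLoader =
    Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
}
```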
* SPARK-1556. jets3t dep doesn't update properly with newer Hadoop versions (Sean Owen, 2014-05-05; 4 files changed, -52/+94)
  See related discussion at https://github.com/apache/spark/pull/468
  This PR may still overstep what you have in mind, but let me put it on the table to start. Besides fixing the issue, it has one substantive change, and that is to manage Hadoop-specific things only in Hadoop-related profiles. This does _not_ remove `yarn.version`.
  - Moves the YARN and Hadoop profiles together in pom.xml. Sorry that this makes the diff a little hard to grok but the changes are only as follows.
  - Removes `hadoop.major.version`
  - Introduce `hadoop-2.2` and `hadoop-2.3` profiles to control Hadoop-specific changes:
    - like the protobuf version issue - this was only 'solved' now by enabling YARN for 2.2+, which is really an orthogonal issue
    - like the jets3t version issue now
  - Hadoop profiles set an appropriate default `hadoop.version`, that can be overridden
  - _(YARN profiles in the parent now only exist to add the sub-module)_
  - Fixes the jets3t dependency issue
    - and makes it a runtime dependency
    - and centralizes config of this guy in the parent pom
  - Updates build docs
  - Updates SBT build too
    - and fixes a regex problem along the way
  Author: Sean Owen <sowen@cloudera.com>
  Closes #629 from srowen/SPARK-1556 and squashes the following commits:
    c3fa967 [Sean Owen] Fix hadoop-2.4 profile typo in doc
    a2105fd [Sean Owen] Add hadoop-2.4 profile and don't set hadoop.version in profiles
    274f4f9 [Sean Owen] Make jets3t a runtime dependency, and bring its exclusion up into parent config
    bbed826 [Sean Owen] Use jets3t 0.9.0 for Hadoop 2.3+ (and correct similar regex issue in SBT build)
    f21f356 [Sean Owen] Build changes to set up for jets3t fix
* Updated doc for spark.closure.serializer to indicate only Java serializer works. (Reynold Xin, 2014-05-05; 1 file changed, -2/+1)
  See discussion from http://apache-spark-developers-list.1001551.n3.nabble.com/bug-using-kryo-as-closure-serializer-td6473.html
  Author: Reynold Xin <rxin@apache.org>
  Closes #642 from rxin/docs-ser and squashes the following commits:
    a507db5 [Reynold Xin] Use "Java" instead of default.
    5eb8cdd [Reynold Xin] Updated doc for spark.closure.serializer to indicate only the default serializer work.
* Address SPARK-1717 (msiddalingaiah, 2014-05-04; 1 file changed, -3/+3)
  I tested the change locally with Spark 0.9.1, but I can't test with 1.0.0 because there was no AMI for it at the time. It's a trivial fix, so it shouldn't cause any problems.
  Author: msiddalingaiah <madhu@madhu.com>
  Closes #641 from msiddalingaiah/master and squashes the following commits:
    a4f7404 [msiddalingaiah] Address SPARK-1717
* SPARK-1710: spark-submit should print better errors than "InvocationTargetException" (Sandeep, 2014-05-04; 1 file changed, -2/+10)
  Catching the InvocationTargetException, printing getTargetException.
  Author: Sandeep <sandeep@techaddict.me>
  Closes #630 from techaddict/SPARK-1710 and squashes the following commits:
    834d79b [Sandeep] changes from srowen suggestions
    109d604 [Sandeep] SPARK-1710: spark-submit should print better errors than "InvocationTargetException"
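The pattern described, as a brief hedged sketch (method names are illustrative, not spark-submit's exact code): when the user's main method is invoked via reflection, unwrap the InvocationTargetException so the real cause is what gets reported.

```
import java.lang.reflect.{InvocationTargetException, Method}

def runMain(mainMethod: Method, args: Array[String]): Unit = {
  try {
    mainMethod.invoke(null, args)   // static main(String[]) of the user's class
  } catch {
    case e: InvocationTargetException =>
      // Surface the user's actual exception, not the reflection wrapper.
      throw e.getTargetException
  }
}
```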
* EC2 script should exit with non-zero code on UsageError (Allan Douglas R. de Oliveira, 2014-05-04; 1 file changed, -0/+1)
  This is especially important because some ssh errors are raised as UsageError, preventing an automated usage of the script from detecting the failure.
  Author: Allan Douglas R. de Oliveira <allan@chaordicsystems.com>
  Closes #638 from douglaz/ec2_exit_code_fix and squashes the following commits:
    5915e6d [Allan Douglas R. de Oliveira] EC2 script should exit with non-zero code on UsageError
* SPARK-1693: Most of the tests throw a java.lang.SecurityException when Spark is built for Hadoop 2.3.0, 2.4.0 (witgo, 2014-05-04; 1 file changed, -0/+8)
  Author: witgo <witgo@qq.com>
  Closes #628 from witgo/SPARK-1693_new and squashes the following commits:
    e3af968 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1693_new
    dc63905 [witgo] SPARK-1693: Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0
* SPARK-1629. Addendum: Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala (Sean Owen, 2014-05-04; 4 files changed, -3/+12)
  For consideration. This was proposed in related discussion: https://github.com/apache/spark/pull/569
  Author: Sean Owen <sowen@cloudera.com>
  Closes #635 from srowen/SPARK-1629.2 and squashes the following commits:
    a442b98 [Sean Owen] Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala
* SPARK-1703 Warn users if Spark is run on JRE6 but compiled with JDK7. (Patrick Wendell, 2014-05-04; 3 files changed, -1/+36)
  This adds some guards and good warning messages if users hit this issue. /cc @aarondav with whom I discussed parts of the design.
  Author: Patrick Wendell <pwendell@gmail.com>
  Closes #627 from pwendell/jdk6 and squashes the following commits:
    a38a958 [Patrick Wendell] Code review feedback
    94e9f84 [Patrick Wendell] SPARK-1703 Warn users if Spark is run on JRE6 but compiled with JDK7.
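One way to picture the kind of check involved (a hedged sketch, not the scripts' actual implementation): a class file's header records the class-format major version, and version-51 (Java 7) classes cannot be loaded by a Java 6 JRE.

```
import java.io.{DataInputStream, FileInputStream}

def classMajorVersion(classFile: String): Int = {
  val in = new DataInputStream(new FileInputStream(classFile))
  try {
    require(in.readInt() == 0xCAFEBABE, s"$classFile is not a class file")
    in.readUnsignedShort()          // minor version (ignored)
    in.readUnsignedShort()          // major version: 50 = Java 6, 51 = Java 7
  } finally {
    in.close()
  }
}
```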
* SPARK-1663. (Addendum) Fix signature of one version of JavaPairRDDStream.reduceByKeyAndWindow() (Sean Owen, 2014-05-04; 1 file changed, -1/+1)
  It appears that one of these methods doesn't use `org.apache.spark.api.java.function.Function2` like all the others, but uses Scala's `Function2`.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #633 from srowen/SPARK-1663.2 and squashes the following commits:
    1e0232d [Sean Owen] Fix signature of one version of reduceByKeyAndWindow to use Java API Function2, as apparently intended
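A usage sketch of reduceByKeyAndWindow from the Scala streaming API (the operation the Java wrapper mirrors), assuming a socket text stream as input; host, port, and durations are placeholders:

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedCounts")
val ssc = new StreamingContext(conf, Seconds(1))

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Count words over the last 30 seconds of data, recomputed every 10 seconds.
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()

ssc.start()
ssc.awaitTermination()
```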
* SPARK-1658: Correctly identify if maven is installed and working (Rahul Singhal, 2014-05-04; 1 file changed, -2/+3)
  The current test is checking the exit code of "tail" rather than "mvn". This new check will make sure that mvn is installed and was able to execute the "version" command.
  Author: Rahul Singhal <rahul.singhal@guavus.com>
  Closes #580 from rahulsinghaliitd/SPARK-1658 and squashes the following commits:
    83c0313 [Rahul Singhal] SPARK-1658: Correctly identify if maven is installed and working
    bf821b9 [Rahul Singhal] SPARK-1658: Correctly identify if maven is installed and working
* The default version of yarn is equal to the hadoop version (witgo, 2014-05-03; 4 files changed, -12/+18)
  This is a part of [PR 590](https://github.com/apache/spark/pull/590)
  Author: witgo <witgo@qq.com>
  Closes #626 from witgo/yarn_version and squashes the following commits:
    c390631 [witgo] restore the yarn dependency declarations
    f8a4ad8 [witgo] revert remove the dependency of avro in yarn-alpha
    2df6cf5 [witgo] review commit
    a1d876a [witgo] review commit
    20e7e3e [witgo] review commit
    c76763b [witgo] The default value of yarn.version is equal to hadoop.version