aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* HOTFIX: clear() configs in SQLConf-related unit tests.Zongheng Yang2014-06-102-0/+3
| | | | | | | | | | | | Thanks goes to @liancheng, who pointed out that `sql/test-only *.SQLConfSuite *.SQLQuerySuite` passed but `sql/test-only *.SQLQuerySuite *.SQLConfSuite` failed. The reason is that some tests use the same test keys and without clear()'ing, they get carried over to other tests. This hotfix simply adds some `clear()` calls. This problem was not evident on Jenkins before probably because `parallelExecution` is not set to `false` for `sqlCoreSettings`. Author: Zongheng Yang <zongheng.y@gmail.com> Closes #1040 from concretevitamin/sqlconf-tests and squashes the following commits: 6d14ceb [Zongheng Yang] HOTFIX: clear() confs in SQLConf related unit tests.
* [SPARK-2065] give launched instances namesNicholas Chammas2014-06-101-0/+11
| | | | | | | | | | | | | | | | This update resolves [SPARK-2065](https://issues.apache.org/jira/browse/SPARK-2065). It gives launched EC2 instances descriptive names by using instance tags. Launched instances now show up in the EC2 console with these names. I used `format()` with named parameters, which I believe is the recommended practice for string formatting in Python, but which doesn’t seem to be used elsewhere in the script. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Author: nchammas <nicholas.chammas@gmail.com> Closes #1043 from nchammas/master and squashes the following commits: 69f6e22 [Nicholas Chammas] PEP8 fixes 2627247 [Nicholas Chammas] broke up lines before they hit 100 chars 6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names 69da6cf [nchammas] Merge pull request #1 from apache/master
* Resolve scalatest warnings during buildwitgo2014-06-1021-41/+41
| | | | | | | | Author: witgo <witgo@qq.com> Closes #1032 from witgo/ShouldMatchers and squashes the following commits: 7ebf34c [witgo] Resolve scalatest warnings during build
* [SPARK-1940] Enabling rolling of executor logs, and automatic cleanup of old ↵Tathagata Das2014-06-1012-40/+895
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | executor logs Currently, in the default log4j configuration, all the executor logs get sent to the file <code>[executor-working-dir]/stderr</code>. This does not all log files to be rolled, so old logs cannot be removed. Using log4j RollingFileAppender allows log4j logs to be rolled, but all the logs get sent to a different set of files, other than the files <code>stdout</code> and <code>stderr</code> . So the logs are not visible in the Spark web UI any more as Spark web UI only reads the files <code>stdout</code> and <code>stderr</code>. Furthermore, it still does not allow the stdout and stderr to be cleared periodically in case a large amount of stuff gets written to them (e.g. by explicit `println` inside map function). This PR solves this by implementing a simple `RollingFileAppender` within Spark (disabled by default). When enabled (using configuration parameter `spark.executor.rollingLogs.enabled`), the logs can get rolled over either by time interval (set with `spark.executor.rollingLogs.interval`, set to daily by default), or by size of logs (set with `spark.executor.rollingLogs.size`). Finally, old logs can be automatically deleted by specifying how many of the latest log files to keep (set with `spark.executor.rollingLogs.keepLastN`). The web UI has also been modified to show the logs across the rolled-over files. You can test this locally (without waiting a whole day) by setting configuration `spark.executor.rollingLogs.enabled=true` and `spark.executor.rollingLogs.interval=minutely`. Continuously generate logs by running spark jobs and the generated logs files would look like this (`stderr` and `stdout` are the most current log file that are being written to). ``` stderr stderr--2014-05-27--14-37 stderr--2014-05-27--14-47 stderr--2014-05-27--15-05 stdout stdout--2014-05-27--14-47 ``` The web ui should show logs across these files. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #895 from tdas/rolling-logs and squashes the following commits: fd8f87f [Tathagata Das] Minor change. d326aee [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs ad956c1 [Tathagata Das] Scala style fix. 1f0a6ec [Tathagata Das] Some more changes based on Patrick's PR comments. c8bfe4e [Tathagata Das] Refactore FileAppender to a package spark.util.logging and broke up the file into multiple files. Changed configuration parameter names. 4224409 [Tathagata Das] Style fix. 108a9f8 [Tathagata Das] Added better constraint handling for rolling policies. f7da977 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs 9134495 [Tathagata Das] Simplified rolling logs by removing Daily/Hourly/MinutelyRollingFileAppender, and removing the setting rollingLogs.enabled 312d874 [Tathagata Das] Minor fixes based on PR comments. 8a67d83 [Tathagata Das] Fixed comments. b36cfd6 [Tathagata Das] Implemented RollingPolicy, TimeBasedRollingPolicy and SizeBasedRollingPolicy, and changed RollingFileAppender accordingly. b7e8272 [Tathagata Das] Style fix, 374c9a9 [Tathagata Das] Added missing license. 24354ea [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs 6cc09c7 [Tathagata Das] Fixed bugs in rolling logs, and added more debug statements. adf4910 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs 931f8fb [Tathagata Das] Changed log viewer in Spark web UI to handle rolling log files. cb4fb6d [Tathagata Das] Added FileAppender and RollingFileAppender to generate rolling executor logs.
* [SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not re...joyyoj2014-06-101-2/+2
| | | | | | | | | | flume event sent to Spark will fail if the body is too large and numHeaders is greater than zero Author: joyyoj <sunshch@gmail.com> Closes #951 from joyyoj/master and squashes the following commits: f4660c5 [joyyoj] [SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not read properly
* [SQL] Add average overflow test case from #978egraldlo2014-06-102-0/+17
| | | | | | | | | | | | | | By @egraldlo. Author: egraldlo <egraldlo@gmail.com> Author: Michael Armbrust <michael@databricks.com> Closes #1033 from marmbrus/pr/978 and squashes the following commits: e228c5e [Michael Armbrust] Remove "test". 762aeaf [Michael Armbrust] Remove unneeded rule. More descriptive name for test table. d414cd7 [egraldlo] fommatting issues 1153f75 [egraldlo] do best to avoid overflowing in function avg().
* HOTFIX: Increase time limit for Bagel testAnkur Dave2014-06-101-2/+2
| | | | | | | | | | The test was timing out on some slow EC2 workers. Author: Ankur Dave <ankurdave@gmail.com> Closes #1037 from ankurdave/bagel-test-time-limit and squashes the following commits: 67fd487 [Ankur Dave] Increase time limit for Bagel test
* HOTFIX: Fix Python tests on Jenkins.Patrick Wendell2014-06-103-8/+12
| | | | | | | | | | Author: Patrick Wendell <pwendell@gmail.com> Closes #1036 from pwendell/jenkins-test and squashes the following commits: 9c99856 [Patrick Wendell] Better output during tests 71e7b74 [Patrick Wendell] Removing incorrect python path 74984db [Patrick Wendell] HOTFIX: Allow PySpark tests to run on Jenkins.
* [SPARK-2076][SQL] Pushdown the join filter & predication for outer joinCheng Hao2014-06-102-22/+277
| | | | | | | | | | | | As the rule described in https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior, we can optimize the SQL Join by pushing down the Join predicate and Where predicate. Author: Cheng Hao <hao.cheng@intel.com> Closes #1015 from chenghao-intel/join_predicate_push_down and squashes the following commits: 10feff9 [Cheng Hao] fix bug of changing the join type in PredicatePushDownThroughJoin 44c6700 [Cheng Hao] Add logical to support pushdown the join filter 0bce426 [Cheng Hao] Pushdown the join filter & predicate for outer join
* [SPARK-1978] In some cases, spark-yarn does not automatically restart the ↵witgo2014-06-102-27/+34
| | | | | | | | | | | | | | | | | | | failed container Author: witgo <witgo@qq.com> Closes #921 from witgo/allocateExecutors and squashes the following commits: bc3aa66 [witgo] review commit 8800eba [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors 32ac7af [witgo] review commit 056b8c7 [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors 04c6f7e [witgo] Merge branch 'master' into allocateExecutors aff827c [witgo] review commit 5c376e0 [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors 1faf4f4 [witgo] Merge branch 'master' into allocateExecutors 3c464bd [witgo] add time limit to allocateExecutors e00b656 [witgo] In some cases, yarn does not automatically restart the container
* Moved hiveOperators.scala to the right package folderCheng Lian2014-06-101-0/+0
| | | | | | | | | | The package is `org.apache.spark.sql.hive.execution`, while the file was placed under `sql/hive/src/main/scala/org/apache/spark/sql/hive/`. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1029 from liancheng/moveHiveOperators and squashes the following commits: d632eb8 [Cheng Lian] Moved hiveOperators.scala to the right package folder
* [SPARK-1508][SQL] Add SQLConf to SQLContext.Zongheng Yang2014-06-1014-61/+429
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR (1) introduces a new class SQLConf that stores key-value properties for a SQLContext (2) clean up the semantics of various forms of SET commands. The SQLConf class unlocks user-controllable optimization opportunities; for example, user can now override the number of partitions used during an Exchange. A SQLConf can be accessed and modified programmatically through its getters and setters. It can also be modified through SET commands executed by `sql()` or `hql()`. Note that users now have the ability to change a particular property for different queries inside the same Spark job, unlike settings configured in SparkConf. For SET commands: "SET" will return all properties currently set in a SQLConf, "SET key" will return the key-value pair (if set) or an undefined message, and "SET key=value" will call the setter on SQLConf, and if a HiveContext is used, it will be executed in Hive as well. Author: Zongheng Yang <zongheng.y@gmail.com> Closes #956 from concretevitamin/sqlconf and squashes the following commits: 4968c11 [Zongheng Yang] Very minor cleanup. d74dde5 [Zongheng Yang] Remove the redundant mkQueryExecution() method. c129b86 [Zongheng Yang] Merge remote-tracking branch 'upstream/master' into sqlconf 26c40eb [Zongheng Yang] Make SQLConf a trait and have SQLContext mix it in. dd19666 [Zongheng Yang] Update a comment. baa5d29 [Zongheng Yang] Remove default param for shuffle partitions accessor. 5f7e6d8 [Zongheng Yang] Add default num partitions. 22d9ed7 [Zongheng Yang] Fix output() of Set physical. Add SQLConf param accessor method. e9856c4 [Zongheng Yang] Use java.util.Collections.synchronizedMap on a Java HashMap. 88dd0c8 [Zongheng Yang] Remove redundant SET Keyword. 271f0b1 [Zongheng Yang] Minor change. f8983d1 [Zongheng Yang] Minor changes per review comments. 1ce8a5e [Zongheng Yang] Invoke runSqlHive() in SQLConf#get for the HiveContext case. b766af9 [Zongheng Yang] Remove a test. d52e1bd [Zongheng Yang] De-hardcode number of shuffle partitions for BasicOperators (read from SQLConf). 555599c [Zongheng Yang] Bullet-proof (relatively) parsing SET per review comment. c2067e8 [Zongheng Yang] Mark SQLContext transient and put it in a second param list. 2ea8cdc [Zongheng Yang] Wrap long line. 41d7f09 [Zongheng Yang] Fix imports. 13279e6 [Zongheng Yang] Refactor the logic of eagerly processing SET commands. b14b83e [Zongheng Yang] In a HiveContext, make SQLConf a subset of HiveConf. 6983180 [Zongheng Yang] Move a SET test to SQLQuerySuite and make it complete. 5b67985 [Zongheng Yang] New line at EOF. c651797 [Zongheng Yang] Add commands.scala. efd82db [Zongheng Yang] Clean up semantics of several cases of SET. c1017c2 [Zongheng Yang] WIP in changing SetCommand to take two Options (for different semantics of SETs). 0f00d86 [Zongheng Yang] Add a test for singleton set command in SQL. 41acd75 [Zongheng Yang] Add a test for hql() in HiveQuerySuite. 2276929 [Zongheng Yang] Fix default hive result for set commands in HiveComparisonTest. 3b0c71b [Zongheng Yang] Remove Parser for set commands. A few other fixes. d0c4578 [Zongheng Yang] Tmux typo. 0ecea46 [Zongheng Yang] Changes for HiveQl and HiveContext. ce22d80 [Zongheng Yang] Fix parsing issues. cb722c1 [Zongheng Yang] Finish up SQLConf patch. 4ebf362 [Zongheng Yang] First cut at SQLConf inside SQLContext.
* SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormatsNick Pentreath2014-06-0913-6/+1140
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So I finally resurrected this PR. It seems the old one against the incubator mirror is no longer available, so I cannot reference it. This adds initial support for reading Hadoop ```SequenceFile```s, as well as arbitrary Hadoop ```InputFormat```s, in PySpark. # Overview The basics are as follows: 1. ```PythonRDD``` object contains the relevant methods, that are in turn invoked by ```SparkContext``` in PySpark 2. The SequenceFile or InputFormat is read on the Scala side and converted from ```Writable``` instances to the relevant Scala classes (in the case of primitives) 3. Pyrolite is used to serialize Java objects. If this fails, the fallback is ```toString``` 4. ```PickleSerializer``` on the Python side deserializes. This works "out the box" for simple ```Writable```s: * ```Text``` * ```IntWritable```, ```DoubleWritable```, ```FloatWritable``` * ```NullWritable``` * ```BooleanWritable``` * ```BytesWritable``` * ```MapWritable``` It also works for simple, "struct-like" classes. Due to the way Pyrolite works, this requires that the classes satisfy the JavaBeans convenstions (i.e. with fields and a no-arg constructor and getters/setters). (Perhaps in future some sugar for case classes and reflection could be added). I've tested it out with ```ESInputFormat``` as an example and it works very nicely: ```python conf = {"es.resource" : "index/type" } rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf) rdd.first() ``` I suspect for things like HBase/Cassandra it will be a bit trickier to get it to work out the box. # Some things still outstanding: 1. ~~Requires ```msgpack-python``` and will fail without it. As originally discussed with Josh, add a ```as_strings``` argument that defaults to ```False```, that can be used if ```msgpack-python``` is not available~~ 2. ~~I see from https://github.com/apache/spark/pull/363 that Pyrolite is being used there for SerDe between Scala and Python. @ahirreddy @mateiz what is the plan behind this - is Pyrolite preferred? It seems from a cursory glance that adapting the ```msgpack```-based SerDe here to use Pyrolite wouldn't be too hard~~ 3. ~~Support the key and value "wrapper" that would allow a Scala/Java function to be plugged in that would transform whatever the key/value Writable class is into something that can be serialized (e.g. convert some custom Writable to a JavaBean or ```java.util.Map``` that can be easily serialized)~~ 4. Support ```saveAsSequenceFile``` and ```saveAsHadoopFile``` etc. This would require SerDe in the reverse direction, that can be handled by Pyrolite. Will work on this as a separate PR Author: Nick Pentreath <nick.pentreath@gmail.com> Closes #455 from MLnick/pyspark-inputformats and squashes the following commits: 268df7e [Nick Pentreath] Documentation changes mer @pwendell comments 761269b [Nick Pentreath] Address @pwendell comments, simplify default writable conversions and remove registry. 4c972d8 [Nick Pentreath] Add license headers d150431 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats cde6af9 [Nick Pentreath] Parameterize converter trait 5ebacfa [Nick Pentreath] Update docs for PySpark input formats a985492 [Nick Pentreath] Move Converter examples to own package 365d0be [Nick Pentreath] Make classes private[python]. Add docs and @Experimental annotation to Converter interface. eeb8205 [Nick Pentreath] Fix path relative to SPARK_HOME in tests 1eaa08b [Nick Pentreath] HBase -> Cassandra app name oversight 3f90c3e [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats 2c18513 [Nick Pentreath] Add examples for reading HBase and Cassandra InputFormats from Python b65606f [Nick Pentreath] Add converter interface 5757f6e [Nick Pentreath] Default key/value classes for sequenceFile asre None 085b55f [Nick Pentreath] Move input format tests to tests.py and clean up docs 43eb728 [Nick Pentreath] PySpark InputFormats docs into programming guide 94beedc [Nick Pentreath] Clean up args in PythonRDD. Set key/value converter defaults to None for PySpark context.py methods 1a4a1d6 [Nick Pentreath] Address @mateiz style comments 01e0813 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats 15a7d07 [Nick Pentreath] Remove default args for key/value classes. Arg names to camelCase 9fe6bd5 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats 84fe8e3 [Nick Pentreath] Python programming guide space formatting d0f52b6 [Nick Pentreath] Python programming guide 7caa73a [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats 93ef995 [Nick Pentreath] Add back context.py changes 9ef1896 [Nick Pentreath] Recover earlier changes lost in previous merge for serializers.py 077ecb2 [Nick Pentreath] Recover earlier changes lost in previous merge for context.py 5af4770 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats 35b8e3a [Nick Pentreath] Another fix for test ordering bef3afb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats e001b94 [Nick Pentreath] Fix test failures due to ordering 78978d9 [Nick Pentreath] Add doc for SequenceFile and InputFormat support to Python programming guide 64eb051 [Nick Pentreath] Scalastyle fix e7552fa [Nick Pentreath] Merge branch 'master' into pyspark-inputformats 44f2857 [Nick Pentreath] Remove msgpack dependency and switch serialization to Pyrolite, plus some clean up and refactoring c0ebfb6 [Nick Pentreath] Change sequencefile test data generator to easily be called from PySpark tests 1d7c17c [Nick Pentreath] Amend tests to auto-generate sequencefile data in temp dir 17a656b [Nick Pentreath] remove binary sequencefile for tests f60959e [Nick Pentreath] Remove msgpack dependency and serializer from PySpark 450e0a2 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats 31a2fff [Nick Pentreath] Scalastyle fixes fc5099e [Nick Pentreath] Add Apache license headers 4e08983 [Nick Pentreath] Clean up docs for PySpark context methods b20ec7e [Nick Pentreath] Clean up merge duplicate dependencies 951c117 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats f6aac55 [Nick Pentreath] Bring back msgpack 9d2256e [Nick Pentreath] Merge branch 'master' into pyspark-inputformats 1bbbfb0 [Nick Pentreath] Clean up SparkBuild from merge a67dfad [Nick Pentreath] Clean up Msgpack serialization and registering 7237263 [Nick Pentreath] Add back msgpack serializer and hadoop file code lost during merging 25da1ca [Nick Pentreath] Add generator for nulls, bools, bytes and maps 65360d5 [Nick Pentreath] Adding test SequenceFiles 0c612e5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats d72bf18 [Nick Pentreath] msgpack dd57922 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats e67212a [Nick Pentreath] Add back msgpack dependency f2d76a0 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats 41856a5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats 97ef708 [Nick Pentreath] Remove old writeToStream 2beeedb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats 795a763 [Nick Pentreath] Change name to WriteInputFormatTestDataGenerator. Cleanup some var names. Use SPARK_HOME in path for writing test sequencefile data. 174f520 [Nick Pentreath] Add back graphx settings 703ee65 [Nick Pentreath] Add back msgpack 619c0fa [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats 1c8efbc [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats eb40036 [Nick Pentreath] Remove unused comment lines 4d7ef2e [Nick Pentreath] Fix indentation f1d73e3 [Nick Pentreath] mergeConfs returns a copy rather than mutating one of the input arguments 0f5cd84 [Nick Pentreath] Remove unused pair UTF8 class. Add comments to msgpack deserializer 4294cbb [Nick Pentreath] Add old Hadoop api methods. Clean up and expand comments. Clean up argument names 818a1e6 [Nick Pentreath] Add seqencefile and Hadoop InputFormat support to PythonRDD 4e7c9e3 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats c304cc8 [Nick Pentreath] Adding supporting sequncefiles for tests. Cleaning up 4b0a43f [Nick Pentreath] Refactoring utils into own objects. Cleaning up old commented-out code d86325f [Nick Pentreath] Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop InputFormat
* Make sure that empty string is filtered out when we get the secondary jars ↵DB Tsai2014-06-091-2/+4
| | | | | | | | | | | from conf Author: DB Tsai <dbtsai@dbtsai.com> Closes #1027 from dbtsai/dbtsai-classloader and squashes the following commits: 9ac6be3 [DB Tsai] Fixed line too long c9c7ad7 [DB Tsai] Make sure that empty string is filtered out when we get the secondary jars from conf.
* [SPARK-1704][SQL] Fully support EXPLAIN commands as SchemaRDD.Zongheng Yang2014-06-096-4/+68
| | | | | | | | | | | | | | | This PR attempts to resolve [SPARK-1704](https://issues.apache.org/jira/browse/SPARK-1704) by introducing a physical plan for EXPLAIN commands, which just prints out the debug string (containing various SparkSQL's plans) of the corresponding QueryExecution for the actual query. Author: Zongheng Yang <zongheng.y@gmail.com> Closes #1003 from concretevitamin/explain-cmd and squashes the following commits: 5b7911f [Zongheng Yang] Add a regression test. 1bfa379 [Zongheng Yang] Modify output(). 719ada9 [Zongheng Yang] Override otherCopyArgs for ExplainCommandPhysical. 4318fd7 [Zongheng Yang] Make all output one Row. 439c6ab [Zongheng Yang] Minor cleanups. 408f574 [Zongheng Yang] SPARK-1704: Add CommandStrategy and ExplainCommandPhysical.
* [SQL] Simple framework for debugging query executionMichael Armbrust2014-06-093-50/+119
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only records number of tuples and unique dataTypes output right now... Example: ```scala scala> import org.apache.spark.sql.execution.debug._ scala> hql("SELECT value FROM src WHERE key > 10").debug(sparkContext) Results returned: 489 == Project [value#1:0] == Tuples output: 489 value StringType: {java.lang.String} == Filter (key#0:1 > 10) == Tuples output: 489 value StringType: {java.lang.String} key IntegerType: {java.lang.Integer} == HiveTableScan [value#1,key#0], (MetastoreRelation default, src, None), None == Tuples output: 500 value StringType: {java.lang.String} key IntegerType: {java.lang.Integer} ``` Author: Michael Armbrust <michael@databricks.com> Closes #1005 from marmbrus/debug and squashes the following commits: dcc3ca6 [Michael Armbrust] Add comments. c9dded2 [Michael Armbrust] Simple framework for debugging query execution
* [SPARK-1522] : YARN ClientBase throws a NPE if there is no YARN Application CPBernardo Gomez Palacio2014-06-092-34/+167
| | | | | | | | | | | | | | | | | | | | | The current implementation of ClientBase.getDefaultYarnApplicationClasspath inspects the MRJobConfig class for the field DEFAULT_YARN_APPLICATION_CLASSPATH when it should be really looking into YarnConfiguration. If the Application Configuration has no yarn.application.classpath defined a NPE exception will be thrown. Additional Changes include: * Test Suite for ClientBase added [ticket: SPARK-1522] : https://issues.apache.org/jira/browse/SPARK-1522 Author : bernardo.gomezpalacio@gmail.com Testing : SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt test Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com> Closes #433 from berngp/feature/SPARK-1522 and squashes the following commits: 2c2e118 [Bernardo Gomez Palacio] [SPARK-1522]: YARN ClientBase throws a NPE if there is no YARN Application specific CP
* Added a TaskSetManager unit test.Kay Ousterhout2014-06-091-0/+16
| | | | | | | | | | | | | | | | | | This test ensures that when there are no alive executors that satisfy a particular locality level, the TaskSetManager doesn't ever use that as the maximum allowed locality level (this optimization ensures that a job doesn't wait extra time in an attempt to satisfy a scheduling locality level that is impossible). @mateiz and @lirui-intel this unit test illustrates an issue with #892 (it fails with that patch). Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #1024 from kayousterhout/scheduler_unit_test and squashes the following commits: de6a08f [Kay Ousterhout] Added a TaskSetManager unit test.
* [SPARK-1495][SQL]add support for left semi joinDaoyuan2014-06-0937-3/+216
| | | | | | | | | | | | | | | | | | | Just submit another solution for #395 Author: Daoyuan <daoyuan.wang@intel.com> Author: Michael Armbrust <michael@databricks.com> Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #837 from adrian-wang/left-semi-join-support and squashes the following commits: d39cd12 [Daoyuan Wang] Merge pull request #1 from marmbrus/pr/837 6713c09 [Michael Armbrust] Better debugging for failed query tests. 035b73e [Michael Armbrust] Add test for left semi that can't be done with a hash join. 5ec6fa4 [Michael Armbrust] Add left semi to SQL Parser. 4c726e5 [Daoyuan] improvement according to Michael 8d4a121 [Daoyuan] add golden files for leftsemijoin 83a3c8a [Daoyuan] scala style fix 14cff80 [Daoyuan] add support for left semi join
* SPARK-1944 Document --verbose in spark-shell -hAndrew Ash2014-06-091-0/+3
| | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-1944 Author: Andrew Ash <andrew@andrewash.com> Closes #1020 from ash211/SPARK-1944 and squashes the following commits: a831c4d [Andrew Ash] SPARK-1944 Document --verbose in spark-shell -h
* [SPARK-1308] Add getNumPartitions to pyspark RDDSyed Hashmi2014-06-091-18/+27
| | | | | | | | | | Add getNumPartitions to pyspark RDD to provide an intuitive way to get number of partitions in RDD like we can do in scala today. Author: Syed Hashmi <shashmi@cloudera.com> Closes #995 from syedhashmi/master and squashes the following commits: de0ed5e [Syed Hashmi] [SPARK-1308] Add getNumPartitions to pyspark RDD
* Grammar: read -> readsAndrew Ash2014-06-081-1/+1
| | | | | | | | Author: Andrew Ash <andrew@andrewash.com> Closes #1016 from ash211/patch-6 and squashes the following commits: e3865c8 [Andrew Ash] Grammar: read -> reads
* [SPARK-2067] use relative path for Spark logo in UINeville Li2014-06-081-1/+1
| | | | | | | | Author: Neville Li <neville@spotify.com> Closes #1006 from nevillelyh/gh/SPARK-2067 and squashes the following commits: 9ee64cf [Neville Li] [SPARK-2067] use relative path for Spark logo in UI
* SPARK-1628 follow up: Improve RangePartitioner's documentation.Reynold Xin2014-06-081-1/+4
| | | | | | | | | | | | Adding a paragraph clarifying a weird behavior in RangePartitioner. See also #549. Author: Reynold Xin <rxin@apache.org> Closes #1012 from rxin/partitioner-doc and squashes the following commits: 6f0109e [Reynold Xin] SPARK-1628 follow up: Improve RangePartitioner's documentation.
* Update run-examplemaji20142014-06-081-1/+1
| | | | | | | | | | | | | Old code can only be ran under spark_home and use "bin/run-example". Error "./run-example: line 55: ./bin/spark-submit: No such file or directory" appears when running in other place. So change this Author: maji2014 <maji3@asiainfo-linkage.com> Closes #1011 from maji2014/master and squashes the following commits: 2cc1af6 [maji2014] Update run-example Closes #988.
* SPARK-1628: Add missing hashCode methods in Partitioner subclasseszsxwing2014-06-083-1/+20
| | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-1628 Added `hashCode` in HashPartitioner, RangePartitioner, PythonPartitioner and PageRankUtils.CustomPartitioner. Author: zsxwing <zsxwing@gmail.com> Closes #549 from zsxwing/SPARK-1628 and squashes the following commits: 2620936 [zsxwing] SPARK-1628: Add missing hashCode methods in Partitioner subclasses
* SPARK-1898: In deploy.yarn.Client, use YarnClient not YarnClientImplColin Patrick McCabe2014-06-082-10/+17
| | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-1898 Author: Colin Patrick McCabe <cmccabe@cloudera.com> Closes #850 from cmccabe/master and squashes the following commits: d66eddc [Colin Patrick McCabe] SPARK-1898: In deploy.yarn.Client, use YarnClient rather than YarnClientImpl
* SPARK-2026: Maven Hadoop Profiles Should Set The Hadoop VersionBernardo Gomez Palacio2014-06-081-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Maven Profiles that refer to hadoopX, e.g. `hadoop2.4`, should set the expected `hadoop.version` and `yarn.version`. e.g. ``` <profile> <id>hadoop-2.4</id> <properties> <hadoop.version>2.4.0</hadoop.version> <yarn.version>${hadoop.version}</yarn.version> <protobuf.version>2.5.0</protobuf.version> <jets3t.version>0.9.0</jets3t.version> </properties> </profile> ``` Builds can still define the `-Dhadoop.version` option but this will correctly default the Hadoop Version to the one that is expected according the profile that is selected. e.g. ```$ mvn -P hadoop-2.4,yarn clean install``` or ```$ mvn -P hadoop-0.23,yarn clean install``` [ticket] : https://issues.apache.org/jira/browse/SPARK-2026 Author : berngp Reviewer : ? Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com> Closes #998 from berngp/feature/SPARK-2026 and squashes the following commits: 07ba4f7 [Bernardo Gomez Palacio] SPARK-2026: Maven Hadoop Profiles Should Set The Hadoop Version
* SPARK-2056 Set RDD name to input pathNeville Li2014-06-071-4/+4
| | | | | | | | Author: Neville Li <neville@spotify.com> Closes #992 from nevillelyh/master and squashes the following commits: 3011739 [Neville Li] [SPARK-2056] Set RDD name to input path
* HOTFIX: Support empty body in merge scriptPatrick Wendell2014-06-071-2/+3
| | | | | | | | | | Discovered in #992 Author: Patrick Wendell <pwendell@gmail.com> Closes #1007 from pwendell/hotfix and squashes the following commits: af90aa0 [Patrick Wendell] HOTFIX: Support empty body in merge script
* [SPARK-1994][SQL] Weird data corruption bug when running Spark SQL on data ↵Michael Armbrust2014-06-071-10/+5
| | | | | | | | | | | | | | in HDFS Basically there is a race condition (possibly a scala bug?) when these values are recomputed on all of the slaves that results in an incorrect projection being generated (possibly because the GUID uniqueness contract is broken?). In general we should probably enforce that all expression planing occurs on the driver, as is now occurring here. Author: Michael Armbrust <michael@databricks.com> Closes #1004 from marmbrus/fixAggBug and squashes the following commits: e0c116c [Michael Armbrust] Compute aggregate expression during planning instead of lazily on workers.
* [SPARK-1841]: update scalatest to version 2.1.5witgo2014-06-0611-36/+47
| | | | | | | | | | | | | | | | | | | | Author: witgo <witgo@qq.com> Closes #713 from witgo/scalatest and squashes the following commits: b627a6a [witgo] merge master 51fb3d6 [witgo] merge master 3771474 [witgo] fix RDDSuite 996d6f9 [witgo] fix TimeStampedWeakValueHashMap test 9dfa4e7 [witgo] merge bug 1479b22 [witgo] merge master 29b9194 [witgo] fix code style 022a7a2 [witgo] fix test dependency a52c0fa [witgo] fix test dependency cd8f59d [witgo] Merge branch 'master' of https://github.com/apache/spark into scalatest 046540d [witgo] fix RDDSuite.scala 2c543b9 [witgo] fix ReplSuite.scala c458928 [witgo] update scalatest to version 2.1.5
* [SPARK-2050 - 2][SQL] DIV and BETWEEN should not be case sensitive.Michael Armbrust2014-06-064-4/+10
| | | | | | | | | | Followup: #989 Author: Michael Armbrust <michael@databricks.com> Closes #994 from marmbrus/caseSensitiveFunctions2 and squashes the following commits: 9d9c8ed [Michael Armbrust] Fix DIV and BETWEEN.
* [SPARK-1552] Fix type comparison bug in {map,outerJoin}VerticesAnkur Dave2014-06-055-8/+40
| | | | | | | | | | | | | | | | | | | | | In GraphImpl, mapVertices and outerJoinVertices use a more efficient implementation when the map function conserves vertex attribute types. This is implemented by comparing the ClassTags of the old and new vertex attribute types. However, ClassTags store erased types, so the comparison will return a false positive for types with different type parameters, such as Option[Int] and Option[Double]. This PR resolves the problem by requesting that the compiler generate evidence of equality between the old and new vertex attribute types, and providing a default value for the evidence parameter if the two types are not equal. The methods can then check the value of the evidence parameter to see whether the types are equal. It also adds a test called "mapVertices changing type with same erased type" that failed before the PR and succeeds now. Callers of mapVertices and outerJoinVertices can no longer use a wildcard for a graph's VD type. To avoid "Error occurred in an application involving default arguments," they must bind VD to a type parameter, as this PR does for ShortestPaths and LabelPropagation. Author: Ankur Dave <ankurdave@gmail.com> Closes #967 from ankurdave/SPARK-1552 and squashes the following commits: 68a4fff [Ankur Dave] Undo conserve naming 7388705 [Ankur Dave] Remove unnecessary ClassTag for VD parameters a704e5f [Ankur Dave] Use type equality constraint with default argument 29a5ab7 [Ankur Dave] Add failing test f458c83 [Ankur Dave] Revert "[SPARK-1552] Fix type comparison bug in mapVertices and outerJoinVertices" 16d6af8 [Ankur Dave] [SPARK-1552] Fix type comparison bug in mapVertices and outerJoinVertices
* [SPARK-2050][SQL] LIKE, RLIKE and IN in HQL should not be case sensitive.Michael Armbrust2014-06-051-4/+8
| | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #989 from marmbrus/caseSensitiveFuncitons and squashes the following commits: 681de54 [Michael Armbrust] LIKE, RLIKE and IN in HQL should not be case sensitive.
* SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keysMatei Zaharia2014-06-052-5/+44
| | | | | | | | | | | The current implementation reads one key with the next hash code as it finishes reading the keys with the current hash code, which may cause it to miss some matches of the next key. This can cause operations like join to give the wrong result when reduce tasks spill to disk and there are hash collisions, as values won't be matched together. This PR fixes it by not reading in that next key, using a peeking iterator instead. Author: Matei Zaharia <matei@databricks.com> Closes #986 from mateiz/spark-2043 and squashes the following commits: 0959514 [Matei Zaharia] Added unit test for having many hash collisions 892debb [Matei Zaharia] SPARK-2043: don't read a key with the next hash code in ExternalAppendOnlyMap, instead use a buffered iterator to only read values with the current hash code.
* [SPARK-2025] Unpersist edges of previous graph in PregelAnkur Dave2014-06-051-0/+1
| | | | | | | | | | | | | | Due to a bug introduced by apache/spark#497, Pregel does not unpersist replicated vertices from previous iterations. As a result, they stay cached until memory is full, wasting GC time. This PR corrects the problem by unpersisting both the edges and the replicated vertices of previous iterations. This is safe because the edges and replicated vertices of the current iteration are cached by the call to `g.cache()` and then materialized by the call to `messages.count()`. Therefore no unmaterialized RDDs depend on `prevG.edges`. I verified that no recomputation occurs by running PageRank with a custom patch to Spark that warns when a partition is recomputed. Thanks to Tim Weninger for reporting this bug. Author: Ankur Dave <ankurdave@gmail.com> Closes #972 from ankurdave/SPARK-2025 and squashes the following commits: 13d5b07 [Ankur Dave] Unpersist edges of previous graph in Pregel
* Use pluggable clock in DAGSheduler #SPARK-2031CrazyJvm2014-06-051-6/+7
| | | | | | | | | | DAGScheduler supports pluggable clock like what TaskSetManager does. Author: CrazyJvm <crazyjvm@gmail.com> Closes #976 from CrazyJvm/clock and squashes the following commits: 6779a4c [CrazyJvm] Use pluggable clock in DAGSheduler
* [SPARK-2041][SQL] Correctly analyze queries where columnName == tableName.Michael Armbrust2014-06-053-1/+11
| | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #985 from marmbrus/tableName and squashes the following commits: 3caaa27 [Michael Armbrust] Correctly analyze queries where columnName == tableName.
* Remove compile-scoped junit dependency.Marcelo Vanzin2014-06-052-1/+10
| | | | | | | | | | | | | This avoids having junit classes showing up in the assembly jar. I verified that only test classes in the jtransforms package use junit. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #794 from vanzin/junit-dep-exclusion and squashes the following commits: 274e1c2 [Marcelo Vanzin] Remove junit from assembly in sbt build also. ad950be [Marcelo Vanzin] Remove compile-scoped junit dependency.
* sbt 0.13.X should be using sbt-assembly 0.11.XKalpit Shah2014-06-051-1/+1
| | | | | | | | | | https://github.com/sbt/sbt-assembly/blob/master/README.md Author: Kalpit Shah <shahkalpit84@gmail.com> Closes #555 from kalpit/upgrade/sbtassembly and squashes the following commits: 1fa7324 [Kalpit Shah] sbt 0.13.X should be using sbt-assembly 0.11.X
* HOTFIX: Remove generated-mima-excludes file after runing MIMA.Patrick Wendell2014-06-051-0/+1
| | | | | | | | | | | This has been causing some false failures on PR's that don't merge correctly. Author: Patrick Wendell <pwendell@gmail.com> Closes #971 from pwendell/mima and squashes the following commits: 1dc80aa [Patrick Wendell] HOTFIX: Remove generated-mima-excludes file after runing MIMA.
* [SPARK-2036] [SQL] CaseConversionExpression should check if the evaluated ↵Takuya UESHIN2014-06-053-2/+28
| | | | | | | | | | | | value is null. `CaseConversionExpression` should check if the evaluated value is `null`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #982 from ueshin/issues/SPARK-2036 and squashes the following commits: 61e1c54 [Takuya UESHIN] Add check if the evaluated value is null.
* SPARK-1677: allow user to disable output dir existence checkingCodingCat2014-06-053-2/+34
| | | | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-1677 For compatibility with older versions of Spark it would be nice to have an option `spark.hadoop.validateOutputSpecs` (default true) for the user to disable the output directory existence checking Author: CodingCat <zhunansjtu@gmail.com> Closes #947 from CodingCat/SPARK-1677 and squashes the following commits: 7930f83 [CodingCat] miao c0c0e03 [CodingCat] bug fix and doc update 5318562 [CodingCat] bug fix 13219b5 [CodingCat] allow user to disable output dir existence checking
* [SPARK-2029] Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.Takuya UESHIN2014-06-0523-23/+23
| | | | | | | | Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #974 from ueshin/issues/SPARK-2029 and squashes the following commits: e19e8f4 [Takuya UESHIN] Bump version number to 1.1.0-SNAPSHOT.
* Fix issue in ReplSuite with hadoop-provided profile.Marcelo Vanzin2014-06-041-1/+13
| | | | | | | | | | | | | | | When building the assembly with the maven "hadoop-provided" profile, the executors were failing to come up because Hadoop classes were not found in the classpath anymore; so add them explicitly to the classpath using spark.executor.extraClassPath. This is only needed for the local-cluster mode, but doesn't affect other tests, so it's added for all of them to keep the code simpler. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #781 from vanzin/repl-test-fix and squashes the following commits: 4f0a3b0 [Marcelo Vanzin] Fix issue in ReplSuite with hadoop-provided profile.
* Minor: Fix documentation error from apache/spark#946Ankur Dave2014-06-041-2/+2
| | | | | | | | Author: Ankur Dave <ankurdave@gmail.com> Closes #970 from ankurdave/SPARK-1991_docfix and squashes the following commits: 6d07343 [Ankur Dave] Minor: Fix documentation error from apache/spark#946
* SPARK-1790: Update EC2 scripts to support r3 instance typesVarakhedi Sujeet2014-06-041-2/+12
| | | | | | | | Author: Varakhedi Sujeet <svarakhedi@gopivotal.com> Closes #960 from sujeetv/ec2-r3 and squashes the following commits: 3cb9fd5 [Varakhedi Sujeet] SPARK-1790: Update EC2 scripts to support r3 instance
* SPARK-1518: FileLogger: Fix compile against Hadoop trunkColin McCabe2014-06-041-4/+12
| | | | | | | | | | | | | In Hadoop trunk (currently Hadoop 3.0.0), the deprecated FSDataOutputStream#sync() method has been removed. Instead, we should call FSDataOutputStream#hflush, which does the same thing as the deprecated method used to do. Author: Colin McCabe <cmccabe@cloudera.com> Closes #898 from cmccabe/SPARK-1518 and squashes the following commits: 752b9d7 [Colin McCabe] FileLogger: Fix compile against Hadoop trunk
* [SPARK-1752][MLLIB] Standardize text format for vectors and labeled pointsXiangrui Meng2014-06-0418-72/+579
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following: 1. dense vector: `[v0,v1,..]` 2. sparse vector: `(size,[i0,i1],[v0,v1])` 3. labeled point: `(label,vector)` where "(..)" indicates a tuple and "[...]" indicate an array. `loadLabeledPoints` is added to pyspark's `MLUtils`. I didn't add `loadVectors` to pyspark because `RDD.saveAsTextFile` cannot stringify dense vectors in the proposed format automatically. `MLUtils#saveLabeledData` and `MLUtils#loadLabeledData` are deprecated. Users should use `RDD#saveAsTextFile` and `MLUtils#loadLabeledPoints` instead. In Scala, `MLUtils#loadLabeledPoints` is compatible with the format used by `MLUtils#loadLabeledData`. CC: @mateiz, @srowen Author: Xiangrui Meng <meng@databricks.com> Closes #685 from mengxr/labeled-io and squashes the following commits: 2d1116a [Xiangrui Meng] make loadLabeledData/saveLabeledData deprecated since 1.0.1 297be75 [Xiangrui Meng] change LabeledPoint.parse to LabeledPointParser.parse to maintain binary compatibility d6b1473 [Xiangrui Meng] Merge branch 'master' into labeled-io 56746ea [Xiangrui Meng] replace # by . 623a5f0 [Xiangrui Meng] merge master f06d5ba [Xiangrui Meng] add docs and minor updates 640fe0c [Xiangrui Meng] throw SparkException 5bcfbc4 [Xiangrui Meng] update test to add scientific notations e86bf38 [Xiangrui Meng] remove NumericTokenizer 050fca4 [Xiangrui Meng] use StringTokenizer 6155b75 [Xiangrui Meng] merge master f644438 [Xiangrui Meng] remove parse methods based on eval from pyspark a41675a [Xiangrui Meng] python loadLabeledPoint uses Scala's implementation ce9a475 [Xiangrui Meng] add deserialize_labeled_point to pyspark with tests e9fcd49 [Xiangrui Meng] add serializeLabeledPoint and tests aea4ae3 [Xiangrui Meng] minor updates 810d6df [Xiangrui Meng] update tokenizer/parser implementation 7aac03a [Xiangrui Meng] remove Scala parsers c1885c1 [Xiangrui Meng] add headers and minor changes b0c50cb [Xiangrui Meng] add customized parser d731817 [Xiangrui Meng] style update 63dc396 [Xiangrui Meng] add loadLabeledPoints to pyspark ea122b5 [Xiangrui Meng] Merge branch 'master' into labeled-io cd6c78f [Xiangrui Meng] add __str__ and parse to LabeledPoint a7a178e [Xiangrui Meng] add stringify to pyspark's Vectors 5c2dbfa [Xiangrui Meng] add parse to pyspark's Vectors 7853f88 [Xiangrui Meng] update pyspark's SparseVector.__str__ e761d32 [Xiangrui Meng] make LabelPoint.parse compatible with the dense format used before v1.0 and deprecate loadLabeledData and saveLabeledData 9e63a02 [Xiangrui Meng] add loadVectors and loadLabeledPoints 19aa523 [Xiangrui Meng] update toString and add parsers for Vectors and LabeledPoint