| Commit message | Author | Age | Files | Lines |
|
Since it was renamed in https://github.com/apache/spark/pull/560, log4j-spark-container.properties has never been used again.
I searched for its name globally in the code and found no references.
Author: WangTaoTheTonic <barneystinson@aliyun.com>
Closes #2977 from WangTaoTheTonic/delLog4j and squashes the following commits:
fb2729f [WangTaoTheTonic] delete the log4j file obsoleted
|
Added a complete Python API for MLlib.feature:
Normalizer
StandardScalerModel
StandardScaler
HashingTF
IDFModel
IDF
cc mengxr
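For reference, these bindings mirror MLlib's existing Scala transformers; a minimal sketch of the same TF-IDF flow on the Scala side (the document contents are made up):
```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}

// sc: an existing SparkContext; documents are pre-tokenized
val docs = sc.parallelize(Seq(
  Seq("spark", "mllib", "feature"),
  Seq("spark", "sql")))

val tf = new HashingTF().transform(docs) // term-frequency vectors
tf.cache()                               // IDF.fit makes a pass over the data
val tfidf = new IDF().fit(tf).transform(tf)
```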
Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>
Closes #2819 from davies/feature and squashes the following commits:
4f48f48 [Davies Liu] add a note for HashingTF
67f6d21 [Davies Liu] address comments
b628693 [Davies Liu] rollback changes in Word2Vec
efb4f4f [Davies Liu] Merge branch 'master' into feature
806c7c2 [Davies Liu] address comments
3abb8c2 [Davies Liu] address comments
59781b9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into feature
a405ae7 [Davies Liu] fix tests
7a1891a [Davies Liu] fix tests
486795f [Davies Liu] update programming guide, HashTF -> HashingTF
8a50584 [Davies Liu] Python API for mllib.feature
|
`read()` may return fewer bytes than requested; when this occurred, the old code silently returned less data than the caller asked for, which could cause stream corruption errors. `skip()` has the same problem.
This patch fixes several cases where we mishandle these methods' return values.
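The fix amounts to looping until the requested bytes arrive; a sketch of the pattern (names are illustrative, not the actual Spark code):
```scala
import java.io.IOException
import java.nio.ByteBuffer
import java.nio.channels.FileChannel

// Loop until the buffer is full; a single read() may legally return
// fewer bytes than requested.
def readFully(channel: FileChannel, buf: ByteBuffer): Unit = {
  while (buf.hasRemaining) {
    if (channel.read(buf) == -1) {
      throw new IOException("Reached EOF before reading the requested bytes")
    }
  }
}
```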
Author: Josh Rosen <joshrosen@databricks.com>
Closes #2969 from JoshRosen/file-channel-read-fix and squashes the following commits:
e724a9f [Josh Rosen] Fix similar issue of not checking skip() return value.
cbc03ce [Josh Rosen] Update the other log message, too.
01e6015 [Josh Rosen] file.getName -> file.getAbsolutePath
d961d95 [Josh Rosen] Fix another issue in FileServerSuite.
b9265d2 [Josh Rosen] Fix a similar (minor) issue in TestUtils.
cd9d76f [Josh Rosen] Fix a similar error in Tachyon:
3db0008 [Josh Rosen] Fix a similar read() error in Utils.offsetBytes().
db985ed [Josh Rosen] Fix unsafe usage of FileChannel.read():
|
seems like `building-spark.html` was renamed to `building-with-maven.html`?
Is Maven the blessed build tool these days, or SBT? I couldn't find a building-with-sbt page so I went with the Maven one here.
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #2859 from ryan-williams/broken-links-readme and squashes the following commits:
7692253 [Ryan Williams] fix broken links in README.md
|
cc @rxin
Author: GuoQiang Li <witgo@qq.com>
Closes #2929 from witgo/SPARK-4064 and squashes the following commits:
20110f2 [GuoQiang Li] Modify the exception msg
3425225 [GuoQiang Li] review commits
2b07e49 [GuoQiang Li] If we create a lot of big broadcast variables, Spark may hang
|
JIRA issue: [SPARK-3907](https://issues.apache.org/jira/browse/SPARK-3907)
Add TRUNCATE TABLE support:
    TRUNCATE TABLE table_name [PARTITION partition_spec];
    partition_spec:
      : (partition_col = partition_col_value, partition_col = partition_col_value, ...)
Removes all rows from a table or partition(s). Currently the target table must be a native/managed table, or an exception will be thrown. Users can specify a partial partition_spec for truncating multiple partitions at once; omitting partition_spec truncates all partitions in the table.
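For instance, once this is in, truncation could be issued through Spark SQL like so (the table and partition values are hypothetical):
```scala
// hiveContext: an existing HiveContext
hiveContext.sql("TRUNCATE TABLE logs PARTITION (dt = '2014-10-24')")
hiveContext.sql("TRUNCATE TABLE logs") // removes rows from every partition
```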
Author: wangxiaojing <u9jing@gmail.com>
Closes #2770 from wangxiaojing/spark-3907 and squashes the following commits:
63dbd81 [wangxiaojing] change hive scalastyle
7a03707 [wangxiaojing] add comment
f6e710e [wangxiaojing] change truncate table
a1f692c [wangxiaojing] Correct spelling mistakes
3b20007 [wangxiaojing] add truncate can not support column err message
e483547 [wangxiaojing] add golden file
77b1f20 [wangxiaojing] add truncate table support
|
`schemaRDD2` is not tested because `schemaRDD1` is registered again.
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #2869 from yhuai/JavaApplySchemaSuite and squashes the following commits:
95fe894 [Yin Huai] Correct variable name.
|
Lowercase attribute names when comparing with relation attributes.
In ```MetastoreRelation``` the attribute names are lowercase, because Hive uses lowercase for field names, so we should lowercase the attribute name in the table scan's ```indexWhere(_.name == a.name)``` comparison. Otherwise ```neededColumnIDs``` may be incorrect.
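A minimal sketch of the comparison change being described (`relation` and `a` are stand-ins for the surrounding table-scan code; the commits below ultimately settled on an `AttributeMap`):
```scala
// Hive stores field names lowercased, so normalize before matching
val lowerName = a.name.toLowerCase
val columnId = relation.attributes.indexWhere(_.name == lowerName)
```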
Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>
Closes #2884 from scwf/fixColumnIds and squashes the following commits:
6174046 [scwf] use AttributeMap for this issue
dc74a24 [wangfei] use lowerName and add a test case for this issue
3ff3a80 [wangfei] more safer change
294fcb7 [scwf] attributes names in table scan should convert lowercase in neededColumnsIDs
|
Add table properties from storage handler to job conf in SparkHadoopWriter class.
Author: Alex Liu <alex_liu68@yahoo.com>
Closes #2677 from alexliu68/SPARK-SQL-3816 and squashes the following commits:
79c269b [Alex Liu] [SPARK-3816][SQL] Add table properties from storage handler to job conf
|
```
explain extended select cos(null) from src limit 1;
```
outputs:
```
Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
MetastoreRelation default, src, None
== Optimized Logical Plan ==
Limit 1
Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
MetastoreRelation default, src, None
== Physical Plan ==
Limit 1
Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
HiveTableScan [], (MetastoreRelation default, src, None), None
```
After this PR is applied, it outputs:
```
== Parsed Logical Plan ==
Limit 1
Project ['cos(null) AS c_0#0]
UnresolvedRelation None, src, None
== Analyzed Logical Plan ==
Limit 1
Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#0]
MetastoreRelation default, src, None
== Optimized Logical Plan ==
Limit 1
Project [null AS c_0#0]
MetastoreRelation default, src, None
== Physical Plan ==
Limit 1
Project [null AS c_0#0]
HiveTableScan [], (MetastoreRelation default, src, None), None
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes #2771 from chenghao-intel/hive_udf_constant_folding and squashes the following commits:
1379c73 [Cheng Hao] duplicate the PlanTest with catalyst/plans/PlanTest
1e52dda [Cheng Hao] add unit test for hive simple udf constant folding
01609ff [Cheng Hao] support constant folding for HiveSimpleUdf
|
Also update step parameter to pass the proposed test
Author: coderxiang <shuoxiangpub@gmail.com>
Closes #2965 from coderxiang/nnls-test and squashes the following commits:
24b06f9 [coderxiang] add test case on objective value for NNLS; update step parameter to pass the test
|
This change replaces usages of colt with commons-math3 equivalents, and makes some minor necessary adjustments to related code and tests to match.
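As an illustration, a colt Poisson sampler maps to commons-math3 roughly like this (parameters are made up; the actual call sites vary):
```scala
import org.apache.commons.math3.distribution.PoissonDistribution

// commons-math3 replacement for a colt Poisson sampler; the default
// underlying RNG is Well19937c
val poisson = new PoissonDistribution(0.05) // mean
poisson.reseedRandomGenerator(42L)          // fixed seed for reproducible tests
val draw: Int = poisson.sample()
```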
Author: Sean Owen <sowen@cloudera.com>
Closes #2928 from srowen/SPARK-4022 and squashes the following commits:
61a232f [Sean Owen] Fix failure due to different sampling in JavaAPISuite.sample()
16d66b8 [Sean Owen] Simplify seeding with call to reseedRandomGenerator
a1a78e0 [Sean Owen] Use Well19937c
31c7641 [Sean Owen] Fix Python Poisson test by choosing a different seed; about 88% of seeds should work but 1 didn't, it seems
5c9c67f [Sean Owen] Additional test fixes from review
d8f88e0 [Sean Owen] Replace colt with commons-math3. Some tests do not pass yet.
|
PR #2860 refines in-memory table statistics and enables broader broadcast hash join optimization for in-memory tables. This makes `JoinSuite` fail when some test suite caches the test table `testData` and is executed before `JoinSuite`, because the expected `ShuffledHashJoin`s are optimized into `BroadcastHashJoin`s according to the collected in-memory table statistics.
This PR fixes this issue by clearing the cache before testing join operator selection. A separate test case is also added to test broadcasted hash join operator selection.
Author: Cheng Lian <lian@databricks.com>
Closes #2960 from liancheng/fix-join-suite and squashes the following commits:
715b2de [Cheng Lian] Fixes caching related JoinSuite failure
|
The patch takes advantage of an API provided in Hadoop 2.5 that allows getting accurate data on Hadoop FileSystem bytes read. It eliminates the old method, which naively reported the split size as the input bytes. One impact of this change is that input metrics go away when running against Hadoop versions earlier than 2.5. I can add this back in, but my opinion is that no metrics are better than inaccurate metrics.
This is difficult to write a test for because we don't usually build against a version of Hadoop that contains the function we need. I've tested it manually on a pseudo-distributed cluster.
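A rough sketch of how per-thread bytes read can be obtained from the Hadoop 2.5 API (the callback wiring in the actual patch is more involved):
```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

// Hadoop 2.5+ tracks FileSystem statistics per thread; summing bytesRead
// across registered filesystems gives this thread's actual input bytes
def threadBytesRead(): Long =
  FileSystem.getAllStatistics.asScala
    .map(_.getThreadStatistics.getBytesRead)
    .sum
```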
Author: Sandy Ryza <sandy@cloudera.com>
Closes #2087 from sryza/sandy-spark-2621 and squashes the following commits:
23010b8 [Sandy Ryza] Missing style fixes
74fc9bb [Sandy Ryza] Make getFSBytesReadOnThreadCallback private
1ab662d [Sandy Ryza] Clear things up a bit
984631f [Sandy Ryza] Switch from pull to push model and add test
7ef7b22 [Sandy Ryza] Add missing curly braces
219abc9 [Sandy Ryza] Fall back to split size
90dbc14 [Sandy Ryza] SPARK-2621. Update task InputMetrics incrementally
|
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #2878 from ScrapCodes/SPARK-4032/deprecate-yarn-alpha and squashes the following commits:
17e9857 [Prashant Sharma] added deperecated comment to Client and ExecutorRunnable.
3a34b1e [Prashant Sharma] Updated docs...
4608dea [Prashant Sharma] [SPARK-4032] Deprecate YARN alpha support in Spark 1.2
|
This change makes the destroy function public for broadcast variables. Motivation for the change is described in https://issues.apache.org/jira/browse/SPARK-4030.
This patch also logs where destroy was called from if a broadcast variable is used after destruction.
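Usage after this change might look like the following sketch (the data is a placeholder):
```scala
// sc: an existing SparkContext
val rdd = sc.parallelize(Seq("a", "b", "c"))
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val counts = rdd.map(x => lookup.value.getOrElse(x, 0)).collect()

// Permanently release the broadcast's memory and disk on the driver and
// all executors; later uses of lookup.value fail, and the log records
// where destroy() was called from.
lookup.destroy()
```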
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #2922 from shivaram/broadcast-destroy and squashes the following commits:
a11abab [Shivaram Venkataraman] Fix scala style in Utils.scala
bed9c9d [Shivaram Venkataraman] Make destroy blocking by default
e80c1ab [Shivaram Venkataraman] Make destroy public for broadcast variables Also log where destroy was called from if a broadcast variable is used after destruction.
|
The shutdown hook of `DiskBlockManager` already removes localDirs, so there is no need to also register them with `Utils.registerShutdownDeleteDir`. Doing so causes duplicate removal of these local dirs and the corresponding exceptions.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #2826 from viirya/fix_duplicate_localdir_remove and squashes the following commits:
051d4b5 [Liang-Chi Hsieh] check dir existing and return empty List as default.
2b91a9c [Liang-Chi Hsieh] remove duplicate removal of local dirs.
|
Append column ids and names before broadcasting ```hiveExtraConf``` in ```HadoopTableReader```.
Author: scwf <wangfei1@huawei.com>
Closes #2885 from scwf/HadoopTableReader and squashes the following commits:
a8c498c [scwf] append columns ids and names before broadcast
|
We cannot use an EOL character like \n or \r in the operand of a LIKE predicate, so the following condition is never true:
    -- someStr is 'hoge\nfuga'
    where someStr LIKE 'hoge_fuga'
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #2908 from sarutak/spark-sql-like-match-modification and squashes the following commits:
d15798b [Kousuke Saruta] Remove test setting for thriftserver
f99a2f4 [Kousuke Saruta] Fixed LIKE predicate so that we can use EOL character as in a operand
|
SqlParser fails to parse -9223372036854775808 (Long.MinValue), so we cannot write queries such as the following:
    SELECT value FROM someTable WHERE value > -9223372036854775808
Additionally, because of the wrong syntax definition, unary minus can be applied only to literals, so we cannot write expressions such as:
    -(value1 + value2) // Parenthesized expressions
    -column // Columns
    -MAX(column) // Functions
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #2816 from sarutak/spark-sql-dsl-improvement2 and squashes the following commits:
32a5005 [Kousuke Saruta] Remove test setting for thriftserver
c2bab5e [Kousuke Saruta] Fixed SPARK-3959 and SPARK-3960
|
Support special characters in column names by using backticks. Closed https://github.com/apache/spark/pull/2804 and created this PR, as that one had merge conflicts.
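A minimal illustration of the syntax this enables (the table and column names are hypothetical):
```scala
// a table with a dotted column name, registered in sqlContext
sqlContext.sql("SELECT `key.with.dots` FROM someTable")
```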
Author: ravipesala <ravindra.pesala@huawei.com>
Closes #2927 from ravipesala/SPARK-3483-NEW and squashes the following commits:
f6329f3 [ravipesala] Rebased with master
|
Please refer to added tests for cases that can trigger the bug.
JIRA: https://issues.apache.org/jira/browse/SPARK-4068
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #2918 from yhuai/SPARK-4068 and squashes the following commits:
d360eae [Yin Huai] Handle nulls when building key paths from elements of an array.
|
Use scala.collection.Map instead of Predef.Map (which is scala.collection.immutable.Map).
Please check https://issues.apache.org/jira/browse/SPARK-4052 for cases that trigger this bug.
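The distinction matters because Predef.Map only admits immutable maps; a minimal illustration (the method names are made up):
```scala
import scala.collection.mutable

def sizeOfImmutable(m: Map[String, Int]): Int = m.size // Predef.Map
def sizeOfAny(m: scala.collection.Map[String, Int]): Int = m.size

val m = mutable.Map("a" -> 1)
// sizeOfImmutable(m) // does not compile: mutable.Map is not Predef.Map
sizeOfAny(m)          // fine: scala.collection.Map covers both variants
```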
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #2899 from yhuai/SPARK-4052 and squashes the following commits:
1188f70 [Yin Huai] Address liancheng's comments.
b6712be [Yin Huai] Use scala.collection.Map instead of Predef.Map (scala.collection.immutable.Map).
|
In SqlParser.scala, there is the following code:
    case d ~ p ~ r ~ f ~ g ~ h ~ o ~ l =>
      val base = r.getOrElse(NoRelation)
      val withFilter = f.map(f => Filter(f, base)).getOrElse(base)
In the code above, two variables named "f" appear close together: one is the receiver "f" and the other is the bound variable "f".
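Continuing the quoted snippet, the improvement is simply a distinct parameter name (shown here as `cond`, an illustrative choice):
```scala
// before (the lambda parameter shadows the receiver `f`):
//   val withFilter = f.map(f => Filter(f, base)).getOrElse(base)
// after: a distinct parameter name keeps the two `f`s apart
val withFilter = f.map(cond => Filter(cond, base)).getOrElse(base)
```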
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #2807 from sarutak/SPARK-3953 and squashes the following commits:
4957c32 [Kousuke Saruta] Improved variable name in SqlParser.scala
|
In sql-programming-guide.md, there is a wrong package name "scala.math.sql".
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #2873 from sarutak/wrong-packagename-fix and squashes the following commits:
4d5ecf4 [Kousuke Saruta] Fixed wrong package name in sql-programming-guide.md
|
Author: GuoQiang Li <witgo@qq.com>
Closes #2846 from witgo/SPARK-3997 and squashes the following commits:
d6a57f8 [GuoQiang Li] scalastyle should output the error location
|
This PR refines in-memory columnar table statistics:
1. adds two more statistics for in-memory table columns: `count` and `sizeInBytes`
2. adds filter pushdown support for `IS NULL` and `IS NOT NULL`
3. caches and propagates statistics in `InMemoryRelation` once the underlying cached RDD is materialized
Statistics are collected on the driver side with an accumulator.
This PR also fixes SPARK-3914 by properly propagating in-memory statistics.
Author: Cheng Lian <lian@databricks.com>
Closes #2860 from liancheng/propagates-in-mem-stats and squashes the following commits:
0cc5271 [Cheng Lian] Restricts visibility of o.a.s.s.c.p.l.Statistics
c5ff904 [Cheng Lian] Fixes test table name conflict
a8c818d [Cheng Lian] Refines tests
1d01074 [Cheng Lian] Bug fix: shouldn't call STRING.actualSize on null string value
7dc6a34 [Cheng Lian] Adds more in-memory table statistics and propagates them properly
|
The Thrift server is not available in the default (hive13) profile yet, which is breaking all SQL-only PRs. This turns off these tests until #2685 is merged.
Author: Michael Armbrust <michael@databricks.com>
Closes #2950 from marmbrus/fixTests and squashes the following commits:
1a6dfee [Michael Armbrust] [HOTFIX][SQL] Temporarily turn of hive-server tests.
|
Ordering should not be considered when comparing old qualifiers with new qualifiers.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #2783 from viirya/full_qualifier_comp and squashes the following commits:
89f652c [Liang-Chi Hsieh] modification for comment.
abb5762 [Liang-Chi Hsieh] More comprehensive comparison of qualifiers.
|
Author: anant asthana <anant.asty@gmail.com>
Closes #2948 from anantasty/patch-1 and squashes the following commits:
d8fea0b [anant asthana] Just fixing comment that shows usage
|
This patch adds Selenium tests for Spark's web UI. To avoid adding extra
dependencies to the test environment, the tests use Selenium's HtmlUnitDriver,
which is pure-Java, instead of, say, ChromeDriver.
I added new tests to try to reproduce a few UI bugs reported on JIRA, namely
SPARK-3021, SPARK-2105, and SPARK-2527. I wasn't able to reproduce these bugs;
I suspect that the older ones might have been fixed by other patches.
In order to use HtmlUnitDriver, I added an explicit dependency on the
org.apache.httpcomponents version of httpclient in order to prevent jets3t's
older version from taking precedence on the classpath.
I also upgraded ScalaTest to 2.2.1.
Author: Josh Rosen <joshrosen@apache.org>
Author: Josh Rosen <joshrosen@databricks.com>
Closes #2474 from JoshRosen/webui-selenium-tests and squashes the following commits:
fcc9e83 [Josh Rosen] scalautils -> scalactic package rename
510e54a [Josh Rosen] [SPARK-3616] Add basic Selenium tests to WebUISuite.
|
Roaring has been updated to version 0.4.3. We fixed a rarely occurring bug with serialization. No API or format changes were made.
Author: Daniel Lemire <lemire@gmail.com>
Closes #2938 from lemire/master and squashes the following commits:
431f3a0 [Daniel Lemire] Recommended bug fix release
|
This follows https://github.com/apache/spark/pull/2893, but does not completely fix SPARK-3359 either. This fixes minor scaladoc/javadoc issues that Javadoc 8 will treat as errors.
Author: Sean Owen <sowen@cloudera.com>
Closes #2909 from srowen/SPARK-3359 and squashes the following commits:
f62c347 [Sean Owen] Fix some javadoc issues that javadoc 8 considers errors. This is not all of the errors turned up when javadoc 8 runs on output of genjavadoc.
|
In tests, we may want to have BlockManagers of size < 1MB (spark.storage.unrollMemoryThreshold). However, these BlockManagers are useless because we can never unroll anything in them. At the very least we need to log a warning.
tdas
Author: Andrew Or <andrew@databricks.com>
Closes #2917 from andrewor14/unroll-safely-logging and squashes the following commits:
38947e3 [Andrew Or] Warn against starting a block manager that's too small
fd621b4 [Andrew Or] Warn against failure to reserve initial memory threshold
|
This reverts commit 898b22ab1fe90e8a3935b19566465046f2256fa6.
Reverting because this may be causing OOMs.
|
In the case of take() or an exception in Python, the Python worker may exit before the JVM has read the entire response, and the write thread may then raise a "Connection reset" exception. Python should always wait for the JVM to close the socket first.
cc JoshRosen This is a quick fix, or the tests will be flaky; sorry for that.
Author: Davies Liu <davies@databricks.com>
Closes #2941 from davies/fix_exit and squashes the following commits:
9d4d21e [Davies Liu] fix race
|
This pull request is a first step towards the implementation of a stable, pull-based progress / status API for Spark (see [SPARK-2321](https://issues.apache.org/jira/browse/SPARK-2321)). For now, I'd like to discuss the basic implementation, API names, and overall interface design. Once we arrive at a good design, I'll go back and add additional methods to expose more information via these APIs.
#### Design goals:
- Pull-based API
- Usable from Java / Scala / Python (eventually, likely with a wrapper)
- Can be extended to expose more information without introducing binary incompatibilities.
- Returns immutable objects.
- Don't leak any implementation details, preserving our freedom to change the implementation.
#### Implementation:
- Add public methods (`getJobInfo`, `getStageInfo`) to SparkContext to allow status / progress information to be retrieved.
- Add public interfaces (`SparkJobInfo`, `SparkStageInfo`) for our API return values. These interfaces consist entirely of Java-style getter methods. The interfaces are currently implemented in Java. I decided to explicitly separate the interface from its implementation (`SparkJobInfoImpl`, `SparkStageInfoImpl`) in order to prevent users from constructing these responses themselves.
- Allow an existing JobProgressListener to be used when constructing a live SparkUI. This allows us to re-use these listeners in the implementation of this status API. There are a few reasons why this listener re-use makes sense:
- The status API and web UI are guaranteed to show consistent information.
- These listeners are already well-tested.
- The same garbage-collection / information retention configurations can apply to both this API and the web UI.
- Extend JobProgressListener to maintain `jobId -> Job` and `stageId -> Stage` mappings.
The progress API methods are implemented in a separate trait that's mixed into SparkContext. This helps keep SparkContext.scala from becoming larger and more difficult to read.
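Polling with the methods described above might then look like this sketch (method names follow this PR's commits; exact signatures are an assumption):
```scala
sc.setJobGroup("my-group", "demo job group")
// ... submit a job asynchronously, then poll from another thread:
for {
  jobId   <- sc.getJobIdsForGroup("my-group")
  job     <- sc.getJobInfo(jobId)
  stageId <- job.stageIds()
  stage   <- sc.getStageInfo(stageId)
} println(s"stage $stageId: ${stage.numCompletedTasks()} of " +
          s"${stage.numTasks()} tasks complete")
```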
Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <joshrosen@apache.org>
Closes #2696 from JoshRosen/progress-reporting-api and squashes the following commits:
e6aa78d [Josh Rosen] Add tests.
b585c16 [Josh Rosen] Accept SparkListenerBus instead of more specific subclasses.
c96402d [Josh Rosen] Address review comments.
2707f98 [Josh Rosen] Expose current stage attempt id
c28ba76 [Josh Rosen] Update demo code:
646ff1d [Josh Rosen] Document spark.ui.retainedJobs.
7f47d6d [Josh Rosen] Clean up SparkUI constructors, per Andrew's feedback.
b77b3d8 [Josh Rosen] Merge remote-tracking branch 'origin/master' into progress-reporting-api
787444c [Josh Rosen] Move status API methods into trait that can be mixed into SparkContext.
f9a9a00 [Josh Rosen] More review comments:
3dc79af [Josh Rosen] Remove creation of unused listeners in SparkContext.
249ca16 [Josh Rosen] Address several review comments:
da5648e [Josh Rosen] Add example of basic progress reporting in Java.
7319ffd [Josh Rosen] Add getJobIdsForGroup() and num*Tasks() methods.
cc568e5 [Josh Rosen] Add note explaining that interfaces should not be implemented outside of Spark.
6e840d4 [Josh Rosen] Remove getter-style names and "consistent snapshot" semantics:
08cbec9 [Josh Rosen] Begin to sketch the interfaces for a stable, public status API.
ac2d13a [Josh Rosen] Add jobId->stage, stageId->stage mappings in JobProgressListener
24de263 [Josh Rosen] Create UI listeners in SparkContext instead of in Tabs:
|
As part of the upgrade I also copy the newest version of the query tests, and whitelist a bunch of new ones that are now passing.
Author: Michael Armbrust <michael@databricks.com>
Closes #2936 from marmbrus/fix13tests and squashes the following commits:
d9cbdab [Michael Armbrust] Remove user specific tests
65801cd [Michael Armbrust] style and rat
8f6b09a [Michael Armbrust] Update test harness to work with both Hive 12 and 13.
f044843 [Michael Armbrust] Update Hive query tests and golden files to 0.13
|
This upgrades snappy-java to 1.1.1.5, which improves error messages when attempting to deserialize empty inputs using SnappyInputStream (see https://github.com/xerial/snappy-java/issues/89).
Author: Josh Rosen <rosenville@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes #2911 from JoshRosen/upgrade-snappy-java and squashes the following commits:
adec96c [Josh Rosen] Use snappy-java 1.1.1.5
cc953d6 [Josh Rosen] [SPARK-4056] Upgrade snappy-java to 1.1.1.4
|
If classes implementing Serializable or Externalizable interfaces throw
exceptions other than IOException or ClassNotFoundException from their
(de)serialization methods, then this results in an unhelpful
"IOException: unexpected exception type" rather than the actual exception that
produced the (de)serialization error.
This patch fixes this by adding a utility method that re-wraps any uncaught
exceptions in IOException (unless they are already instances of IOException).
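The utility boils down to something like the following sketch (the helper name here is my own label, not necessarily Spark's):
```scala
import java.io.IOException
import scala.util.control.NonFatal

// Wrap writeObject/readObject bodies with this so the object streams
// surface the real cause instead of "IOException: unexpected exception type".
def tryOrIOException(block: => Unit): Unit = {
  try {
    block
  } catch {
    case e: IOException => throw e
    case NonFatal(e) => throw new IOException(e)
  }
}
```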
Author: Josh Rosen <joshrosen@databricks.com>
Closes #2932 from JoshRosen/SPARK-4080 and squashes the following commits:
cd3a9be [Josh Rosen] [SPARK-4080] Only throw IOException from [write|read][Object|External].
|
Author: Michael Armbrust <michael@databricks.com>
Closes #2934 from marmbrus/patch-2 and squashes the following commits:
a96dab2 [Michael Armbrust] Remove sleep on reset() failure.
|
The graphx.SynthBenchmark example has an iteration-count option named "niter"; however, its documentation names it "niters". The mismatch between the implementation and the document causes an IllegalArgumentException when trying that example.
Author: Grace <jie.huang@intel.com>
Closes #2888 from GraceH/synthbenchmark and squashes the following commits:
f101ee1 [Grace] Modify option name according to example doc
|
https://issues.apache.org/jira/browse/SPARK-4067
Currently we call Utils.tryOrExit everywhere:
AppClient
Executor
TaskSchedulerImpl
This makes the name ExecutorUncaughtExceptionHandler unfit for the real usage.
Author: Nan Zhu <nanzhu@Nans-MacBook-Pro.local>
Author: Nan Zhu <nanzhu@nans-mbp.home>
Closes #2913 from CodingCat/SPARK-4067 and squashes the following commits:
035ee3d [Nan Zhu] make RAT happy
e62e416 [Nan Zhu] add some general Exit code
a10b63f [Nan Zhu] refactor
|
In the existing code, each coarse-grained executor has two concurrently running actor systems. This causes many more error messages to be logged than necessary when the executor is lost or killed because we receive a disassociation event for each of these actor systems.
This is blocking #2840.
Author: Andrew Or <andrewor14@gmail.com>
Closes #2863 from andrewor14/executor-actor-system and squashes the following commits:
44ce2e0 [Andrew Or] Avoid starting two actor systems on each executor
|
In deploy.ClientArguments.isValidJarUrl, the URL is checked as follows:
    def isValidJarUrl(s: String): Boolean = s.matches("(.+):(.+)jar")
So it accepts URLs like 'hdfs:file.jar' (no authority).
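A stricter check could parse the string as a URI and require both a scheme and an authority (a sketch along the lines the commit describes, not necessarily the exact patch):
```scala
import java.net.{URI, URISyntaxException}

// Require a real scheme and authority, not just "something:somethingjar"
def isValidJarUrl(s: String): Boolean = {
  try {
    val uri = new URI(s)
    uri.getScheme != null && uri.getAuthority != null && s.endsWith(".jar")
  } catch {
    case _: URISyntaxException => false
  }
}
```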
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #2925 from sarutak/uri-syntax-check-improvement and squashes the following commits:
cf06173 [Kousuke Saruta] Improved URI syntax checking
|
In sbin/spark-config.sh, parameter expansion is used to extract the source root as follows:
    this="${BASH_SOURCE-$0}"
I think the parameter expansion should use ":-" instead of "-". With "-", if BASH_SOURCE is set to the empty string (set but empty, not unset), the empty string is assigned to $this.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #2930 from sarutak/SPARK-4076 and squashes the following commits:
32a0370 [Kousuke Saruta] Fixed wrong parameter expansion
|
Executors of the same application on the same host should only download files & jars once.
If Spark launches multiple executors on one host for one application, every executor downloads its dependent files and jars (if not using a local: URL) independently. This may result in huge latency. In my case, it resulted in 20 seconds of latency to download the dependent jars (about 17MB) when I launched 32 executors on every host (4 hosts in total).
This patch caches downloaded files and jars for executors to reduce network throughput and download latency. In my case, the latency was reduced from 20 seconds to less than 1 second.
Author: Li Zhihui <zhihui.li@intel.com>
Author: li-zhihui <zhihui.li@intel.com>
Closes #1616 from li-zhihui/cachefiles and squashes the following commits:
36940df [Li Zhihui] Close cache for local mode
935fed6 [Li Zhihui] Clean code.
f9330d4 [Li Zhihui] Clean code again
7050d46 [Li Zhihui] Clean code
074a422 [Li Zhihui] Fix: deal with spark.files.overwrite
03ed3a8 [li-zhihui] rename cache file name as XXXXXXXXX_cache
2766055 [li-zhihui] Use url.hashCode + timestamp as cachedFileName
76a7b66 [Li Zhihui] Clean code & use applcation work directory as cache directory
3510eb0 [Li Zhihui] Keep fetchFile private
2ffd742 [Li Zhihui] add comment for FileLock
e0ebd48 [Li Zhihui] Try and finally lock.release
7fb7c0b [Li Zhihui] Release lock before copy files
6b997bf [Li Zhihui] Executors of same application in same host should only download files & jars once
|
As part of the effort to avoid data loss on Spark Streaming driver failure, we want to implement a write-ahead log that can write received data to HDFS. This allows the received data to persist across driver failures, so when the streaming driver is restarted, it can find and reprocess all the data that was received but not processed.
This was primarily implemented by @harishreedharan. This is still WIP, as he is going to improve the unit tests by using an HDFS mini cluster.
Author: Hari Shreedharan <hshreedharan@apache.org>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #2882 from tdas/driver-ha-wal and squashes the following commits:
e4bee20 [Tathagata Das] Removed synchronized, Path.getFileSystem is threadsafe
55514e2 [Tathagata Das] Minor changes based on PR comments.
d29fddd [Tathagata Das] Merge pull request #20 from harishreedharan/driver-ha-wal
a317a4d [Hari Shreedharan] Directory deletion should not fail tests
9514dc8 [Tathagata Das] Added unit tests to test reading of corrupted data and other minor edits
3881706 [Tathagata Das] Merge pull request #19 from harishreedharan/driver-ha-wal
4705fff [Hari Shreedharan] Sort listed files by name. Use local files for WAL tests.
eb356ca [Tathagata Das] Merge pull request #18 from harishreedharan/driver-ha-wal
82ce56e [Hari Shreedharan] Fix file ordering issue in WALManager tests
5ff90ee [Hari Shreedharan] Fix tests to not ignore ordering and also assert all data is present
ef8db09 [Tathagata Das] Merge pull request #17 from harishreedharan/driver-ha-wal
7e40e56 [Hari Shreedharan] Restore old build directory after tests
587b876 [Hari Shreedharan] Fix broken test. Call getFileSystem only from synchronized method.
b4be0c1 [Hari Shreedharan] Remove unused method
edcbee1 [Hari Shreedharan] Tests reading and writing data using writers now use Minicluster.
5c70d1f [Hari Shreedharan] Remove underlying stream from the WALWriter.
4ab602a [Tathagata Das] Refactored write ahead stuff from streaming.storage to streaming.util
b06be2b [Tathagata Das] Adding missing license.
5182ffb [Hari Shreedharan] Added documentation
172358d [Tathagata Das] Pulled WriteAheadLog-related stuff from tdas/spark/tree/driver-ha-working
|
Given that a lot of users are trying to use Hive 0.13 in Spark, and given the incompatibility between hive-0.12 and hive-0.13 at the API level, I want to propose the following approach, which has no or minimal impact on existing hive-0.12 support but jumpstarts development of hive-0.13 and future version support.
Approach: introduce a "hive-version" property, and manipulate the pom.xml files to support different Hive versions at compile time through a shim layer, e.g., hive-0.12.0 and hive-0.13.1. More specifically:
1. For each Hive version, there is a very light layer of shim code to handle API differences, sitting in sql/hive/hive-version, e.g., sql/hive/v0.12.0 or sql/hive/v0.13.1.
2. Add a new profile, hive-default, active by default, which picks up all existing configuration and the hive-0.12.0 shim (v0.12.0) if no hive.version is specified.
3. If the user specifies a different version (currently only 0.13.1, via -Dhive.version=0.13.1), the hive-versions profile is activated, which picks up the hive-version-specific shim layer and configuration, mainly the Hive jars and the hive-version shim, e.g., v0.13.1.
4. With this approach, nothing is changed in the current hive-0.12 support.
No change by default: sbt/sbt -Phive
For example: sbt/sbt -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly
To enable hive-0.13: sbt/sbt -Dhive.version=0.13.1
For example: sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly
Note that in hive-0.13, hive-thriftserver is not enabled, which should be fixed by another JIRA, and we don't need -Phive with -Dhive.version when building (probably we should use -Phive -Dhive.version=xxx instead once the Thrift server is also supported on hive-0.13.1).
Author: Zhan Zhang <zhazhan@gmail.com>
Author: zhzhan <zhazhan@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>
Closes #2241 from zhzhan/spark-2706 and squashes the following commits:
3ece905 [Zhan Zhang] minor fix
410b668 [Zhan Zhang] solve review comments
cbb4691 [Zhan Zhang] change run-test for new options
0d4d2ed [Zhan Zhang] rebase
497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
8fad1cf [Zhan Zhang] change the pom file and make hive-0.13.1 as the default
ab028d1 [Zhan Zhang] rebase
4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
4cb1b93 [zhzhan] Merge pull request #1 from pwendell/pr-2241
b0478c0 [Patrick Wendell] Changes to simplify the build of SPARK-2706
2b50502 [Zhan Zhang] rebase
a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cb22863 [Zhan Zhang] correct the typo
20f6cf7 [Zhan Zhang] solve compatability issue
f7912a9 [Zhan Zhang] rebase and solve review feedback
301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
10c3565 [Zhan Zhang] address review comments
6bc9204 [Zhan Zhang] rebase and remove temparory repo
d3aa3f2 [Zhan Zhang] Merge branch 'master' into spark-2706
cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3ced0d7 [Zhan Zhang] rebase
d9b981d [Zhan Zhang] rebase and fix error due to rollback
adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3dd50e8 [Zhan Zhang] solve conflicts and remove unnecessary implicts
d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
dc7bdb3 [Zhan Zhang] solve conflicts
7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
d7c3e1e [Zhan Zhang] Merge branch 'master' into spark-2706
68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
d48bd18 [Zhan Zhang] address review comments
3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
57ea52e [Zhan Zhang] Merge branch 'master' into spark-2706
2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
9412d24 [Zhan Zhang] address review comments
f4af934 [Zhan Zhang] rebase
1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
128b60b [Zhan Zhang] ignore 0.12.0 test cases for the time being
af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
5f5619f [Zhan Zhang] restructure the directory and different hive version support
05d3683 [Zhan Zhang] solve conflicts
e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
94b4fdc [Zhan Zhang] Spark-2706: hive-0.13.1 support on spark
87ebf3b [Zhan Zhang] Merge branch 'master' into spark-2706
921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
f896b2a [Zhan Zhang] Merge branch 'master' into spark-2706
789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
f6a8a40 [Zhan Zhang] revert
ba14f28 [Zhan Zhang] test
dbedff3 [Zhan Zhang] Merge remote-tracking branch 'upstream/master'
70964fe [Zhan Zhang] revert
fe0f379 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
70ffd93 [Zhan Zhang] revert
42585ec [Zhan Zhang] test
7d5fce2 [Zhan Zhang] test
|
Previously cached data was found by `sameResult` plan matching on optimized plans. This technique however fails to locate the cached data when a temporary table with a projection is queried with a further reduced projection. The failure is due to the fact that optimization will collapse the projections, producing a plan that no longer produces the sameResult as the cached data (though the cached data still subsumes the desired data). For example consider the following previously failing test case.
```scala
sql("CACHE TABLE tempTable AS SELECT key FROM testData")
assertCached(sql("SELECT COUNT(*) FROM tempTable"))
```
In this PR I change the matching to occur after analysis instead of optimization, so that in the case of temporary tables the plans will always match. I think this should work generally; however, this error does raise questions about the need to do more thorough subsumption checking when locating cached data.
Another question is what sort of semantics we want to provide when uncaching data from temporary tables. For example consider the following sequence of commands:
```scala
testData.select('key).registerTempTable("tempTable1")
testData.select('key).registerTempTable("tempTable2")
cacheTable("tempTable1")
// This obviously works.
assertCached(sql("SELECT COUNT(*) FROM tempTable1"))
// It seems good that this works ...
assertCached(sql("SELECT COUNT(*) FROM tempTable2"))
// ... but is this valid?
uncacheTable("tempTable2")
// Should this still be cached?
assertCached(sql("SELECT COUNT(*) FROM tempTable1"), 0)
```
Author: Michael Armbrust <michael@databricks.com>
Closes #2912 from marmbrus/cachingBug and squashes the following commits:
9c822d4 [Michael Armbrust] remove commented out code
5c72fb7 [Michael Armbrust] Add a test case / question about uncaching semantics.
63a23e4 [Michael Armbrust] Perform caching on analyzed instead of optimized plan.
03f1cfe [Michael Armbrust] Clean-up / add tests to SameResult suite.
|