diff options
author | Eric Liang <ekl@databricks.com> | 2016-06-09 18:05:16 -0700 |
---|---|---|
committer | Josh Rosen <joshrosen@databricks.com> | 2016-06-09 18:05:16 -0700 |
commit | b914e1930fd5c5f2808f92d4958ec6fbeddf2e30 (patch) | |
tree | 4503a4aeaf068a7e8e5565f8a9a3e1bd7732cb23 /core/src/test/scala/org | |
parent | 83070cd1d459101e1189f3b07ea59e22f98e84ce (diff) | |
download | spark-b914e1930fd5c5f2808f92d4958ec6fbeddf2e30.tar.gz spark-b914e1930fd5c5f2808f92d4958ec6fbeddf2e30.tar.bz2 spark-b914e1930fd5c5f2808f92d4958ec6fbeddf2e30.zip |
[SPARK-15794] Should truncate toString() of very wide plans
## What changes were proposed in this pull request?
With very wide tables, e.g. thousands of fields, the plan output is unreadable and often causes OOMs due to inefficient string processing. This truncates all struct and operator field lists to a user configurable threshold to limit performance impact.
It would also be nice to optimize string generation to avoid these sort of O(n^2) slowdowns entirely (i.e. use StringBuilder everywhere including expressions), but this is probably too large of a change for 2.0 at this point, and truncation has other benefits for usability.
## How was this patch tested?
Added a microbenchmark that covers this case particularly well. I also ran the microbenchmark while varying the truncation threshold.
```
numFields = 5
wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem) 2336 / 2558 0.0 23364.4 0.1X
numFields = 25
wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem) 4237 / 4465 0.0 42367.9 0.1X
numFields = 100
wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem) 10458 / 11223 0.0 104582.0 0.0X
numFields = Infinity
wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
[info] java.lang.OutOfMemoryError: Java heap space
```
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes #13537 from ericl/truncated-string.
Diffstat (limited to 'core/src/test/scala/org')
-rw-r--r-- | core/src/test/scala/org/apache/spark/util/UtilsSuite.scala | 8 |
1 files changed, 8 insertions, 0 deletions
diff --git a/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala b/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala index 6698749866..a5363f0bfd 100644 --- a/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala +++ b/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala @@ -40,6 +40,14 @@ import org.apache.spark.network.util.ByteUnit class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging { + test("truncatedString") { + assert(Utils.truncatedString(Nil, "[", ", ", "]", 2) == "[]") + assert(Utils.truncatedString(Seq(1, 2), "[", ", ", "]", 2) == "[1, 2]") + assert(Utils.truncatedString(Seq(1, 2, 3), "[", ", ", "]", 2) == "[1, ... 2 more fields]") + assert(Utils.truncatedString(Seq(1, 2, 3), "[", ", ", "]", -5) == "[, ... 3 more fields]") + assert(Utils.truncatedString(Seq(1, 2, 3), ", ") == "1, 2, 3") + } + test("timeConversion") { // Test -1 assert(Utils.timeStringAsSeconds("-1") === -1) |