author Josh Rosen <joshrosen@databricks.com> 2016-05-09 13:11:18 -0700
committer Davies Liu <davies.liu@gmail.com> 2016-05-09 13:11:18 -0700
commit c3350cadb8369ad016f89135bbcbe126705c463c
tree 0a18d5d85de0de13ac67a5c041b73d96ec58bd9a /core
parent 2adb11f6db591a7d8199e42dd23c7fb23ef5df3b
[SPARK-14972] Improve performance of JSON schema inference's compatibleType method
This patch improves the performance of `InferSchema.compatibleType` and `inferField`. The net result of this patch is a 6x speedup in local benchmarks running against cached data with a massive nested schema.

The key idea is to remove unnecessary sorting in `compatibleType`'s `StructType` merging code. This code takes two structs, merges the fields with matching names, and copies over the unique fields, producing a new schema which is the union of the two structs' schemas. Previously, this code performed a very inefficient `groupBy()` to match up fields with the same name, but this is unnecessary because `inferField` already sorts structs' fields by name: since both lists of fields are sorted, we can simply merge them in a single pass.

This patch also speeds up the existing field sorting in `inferField`: the old sorting code allocated unnecessary intermediate collections, while the new code uses mutable collections and performs in-place sorting.

I rewrote inefficient `equals()` implementations in `StructType` and `Metadata`, significantly reducing object allocations in those methods.

Finally, I replaced a `treeAggregate` call with `fold`: I doubt that `treeAggregate` will benefit us very much because the schemas would have to be enormous to realize large savings in network traffic. Since most schemas are probably fairly small in serialized form, they should typically fit within a direct task result and therefore can be incrementally merged at the driver as individual tasks finish. This change eliminates an entire (short) scheduler stage.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12750 from JoshRosen/schema-inference-speedups.
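
For illustration, the single-pass merge of two name-sorted field lists described above can be sketched roughly as follows. This is a minimal standalone sketch, not the code from the patch: `DType`, `Field`, `compatible`, and `mergeSorted` are hypothetical stand-ins for Spark's `DataType`, `StructField`, and `compatibleType`, and the widening rules are deliberately simplified.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for Spark's DataType / StructField (not the real classes).
sealed trait DType
case object LongT extends DType
case object DoubleT extends DType
case object StringT extends DType
final case class Field(name: String, dtype: DType)

object SortedStructMerge {
  // Simplified type widening, for illustration only.
  def compatible(a: DType, b: DType): DType = (a, b) match {
    case (x, y) if x == y => x
    case (LongT, DoubleT) | (DoubleT, LongT) => DoubleT
    case _ => StringT
  }

  // Merge two field arrays in one pass.
  // Precondition (assumed, not checked): both inputs are sorted by field name.
  def mergeSorted(f1: Array[Field], f2: Array[Field]): Array[Field] = {
    val out = ArrayBuffer.empty[Field]
    var i = 0
    var j = 0
    while (i < f1.length && j < f2.length) {
      val cmp = f1(i).name.compareTo(f2(j).name)
      if (cmp == 0) {
        // Same name on both sides: keep one field with the merged type.
        out += Field(f1(i).name, compatible(f1(i).dtype, f2(j).dtype))
        i += 1
        j += 1
      } else if (cmp < 0) {
        // Field only present (so far) on the left side.
        out += f1(i)
        i += 1
      } else {
        // Field only present (so far) on the right side.
        out += f2(j)
        j += 1
      }
    }
    // Copy whatever remains on either side; the result stays sorted by name.
    while (i < f1.length) { out += f1(i); i += 1 }
    while (j < f2.length) { out += f2(j); j += 1 }
    out.toArray
  }
}
```

Because the output is itself sorted by name, merged structs can be fed back into the same routine without re-sorting, which is what makes a name-based `groupBy()` unnecessary in this scheme.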
Diffstat (limited to 'core')
0 files changed, 0 insertions, 0 deletions