[SPARK-14972] Improve performance of JSON schema inference's compatibleType method - spark

author	Josh Rosen <joshrosen@databricks.com>	2016-05-09 13:11:18 -0700
committer	Davies Liu <davies.liu@gmail.com>	2016-05-09 13:11:18 -0700
commit	c3350cadb8369ad016f89135bbcbe126705c463c (patch)
tree	0a18d5d85de0de13ac67a5c041b73d96ec58bd9a /core
parent	2adb11f6db591a7d8199e42dd23c7fb23ef5df3b (diff)

[SPARK-14972] Improve performance of JSON schema inference's compatibleType method

This patch improves the performance of `InferSchema.compatibleType` and `inferField`. The net result of this patch is a 6x speedup in local benchmarks running against cached data with a massive nested schema.

The key idea is to remove unnecessary sorting in `compatibleType`'s `StructType` merging code. This code takes two structs, merges the fields with matching names, and copies over the unique fields, producing a new schema which is the union of the two structs' schemas. Previously, this code performed a very inefficient `groupBy()` to match up fields with the same name, but this is unnecessary because `inferField` already sorts structs' fields by name: since both lists of fields are sorted, we can simply merge them in a single pass.

This patch also speeds up the existing field sorting in `inferField`: the old sorting code allocated unnecessary intermediate collections, while the new code uses mutable collections and performs in-place sorting.

I rewrote inefficient `equals()` implementations in `StructType` and `Metadata`, significantly reducing object allocations in those methods.

Finally, I replaced a `treeAggregate` call with `fold`: I doubt that `treeAggregate` will benefit us very much because the schemas would have to be enormous to realize large savings in network traffic. Since most schemas are probably fairly small in serialized form, they should typically fit within a direct task result and therefore can be incrementally merged at the driver as individual tasks finish. This change eliminates an entire (short) scheduler stage.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12750 from JoshRosen/schema-inference-speedups.
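
For readers who want a concrete picture of the single-pass merge described in the message, here is a minimal sketch. It is not the code from this patch: `Field`, `SortedFieldMerge`, and the string-based `compatibleType` rule are hypothetical stand-ins for Spark's `StructField` and type-widening logic, and the only assumption carried over from the message is that both inputs are already sorted by field name.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a struct field; this is NOT Spark's StructField.
final case class Field(name: String, dataType: String)

object SortedFieldMerge {
  // Placeholder widening rule, assumed purely for illustration:
  // identical types are kept, anything else falls back to "string".
  def compatibleType(a: String, b: String): String =
    if (a == b) a else "string"

  // Merges two field arrays that are each already sorted by name.
  // Matching names are combined; unique names are copied over as-is.
  def mergeSorted(left: Array[Field], right: Array[Field]): Array[Field] = {
    val merged = new ArrayBuffer[Field](left.length + right.length)
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val cmp = left(i).name.compareTo(right(j).name)
      if (cmp == 0) {
        // Same field name on both sides: reconcile the two types.
        merged += Field(left(i).name,
          compatibleType(left(i).dataType, right(j).dataType))
        i += 1
        j += 1
      } else if (cmp < 0) {
        // Field only present (so far) on the left side.
        merged += left(i)
        i += 1
      } else {
        // Field only present (so far) on the right side.
        merged += right(j)
        j += 1
      }
    }
    // Copy whatever remains on either side; the result stays sorted.
    while (i < left.length) { merged += left(i); i += 1 }
    while (j < right.length) { merged += right(j); j += 1 }
    merged.toArray
  }
}
```

Because each input is visited exactly once, the merge is linear in the total number of fields and needs no grouping or re-sorting, which is the property the message relies on.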

Diffstat (limited to 'core')

0 files changed, 0 insertions, 0 deletions