author Tejas Patil <tejasp@fb.com> 2016-10-22 20:43:43 -0700
committer gatorsmile <gatorsmile@gmail.com> 2016-10-22 20:43:43 -0700
commit eff4aed1ac1e500d4aa40665dd06b527dffbc111
tree 130c5f6f52410aefec45b5b04b5cb2c5c0fb1fee /sql/hive
parent bc167a2a53f5a795d089e8a884569b1b3e2cd439
[SPARK-18035][SQL] Introduce performant and memory efficient APIs to create ArrayBasedMapData
## What changes were proposed in this pull request?

Jira: https://issues.apache.org/jira/browse/SPARK-18035

In HiveInspectors, I saw that converting a Java map to Spark's `ArrayBasedMapData` spent quite some time in buffer copying: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658

The reason is that `map.toSeq` allocates a new buffer and copies the map entries to it: https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323

This copy is not needed, because the buffer is discarded as soon as the key and value arrays have been extracted. Here is the call trace:

```
org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
scala.collection.AbstractMap.toSeq(Map.scala:59)
scala.collection.MapLike$class.toSeq(MapLike.scala:323)
scala.collection.AbstractMap.toBuffer(Map.scala:59)
scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
scala.collection.Iterator$class.foreach(Iterator.scala:893)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
```

Also, the earlier code populated the key and value arrays separately by iterating twice. The PR avoids the double iteration and fills both arrays in a single pass over the map.

EDIT: During code review, several more places in the code were found to do a similar thing. The PR dedupes those instances and introduces convenient APIs which are performant and memory efficient.

## Performance gains

The number is subjective and depends on how many map columns are accessed in the query and on the average number of entries per map. For one of the queries that I tried out, I saw 3% CPU savings (end-to-end).

## How was this patch tested?

This does not change the end result produced, so it relies on existing tests.

Author: Tejas Patil <tejasp@fb.com>

Closes #15573 from tejasapatil/SPARK-18035_avoid_toSeq.
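
To illustrate the idea behind the change, here is a minimal, Spark-free sketch of a single-pass conversion from a Java map to a pair of key and value arrays. The object `SinglePassMapConversion` and the helper `toKeyValueArrays` are hypothetical names chosen for this example. They are not the actual `ArrayBasedMapData` implementation, which may differ in details. The sketch only assumes what the description above states: no intermediate `toSeq` buffer, and one iteration that fills both arrays.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

object SinglePassMapConversion {

  // Hypothetical helper (not Spark's API): converts a Java map into
  // parallel key and value arrays in one pass, without first copying
  // the entries into an intermediate Seq.
  def toKeyValueArrays[K, V](
      javaMap: JMap[K, V],
      keyConverter: Any => Any,
      valueConverter: Any => Any): (Array[Any], Array[Any]) = {
    // The map size is known up front, so both arrays are allocated once.
    val size = javaMap.size()
    val keys = new Array[Any](size)
    val values = new Array[Any](size)

    // A single iteration over the entry set fills both arrays.
    val iter = javaMap.entrySet().iterator()
    var i = 0
    while (iter.hasNext) {
      val entry = iter.next()
      keys(i) = keyConverter(entry.getKey)
      values(i) = valueConverter(entry.getValue)
      i += 1
    }
    (keys, values)
  }

  def main(args: Array[String]): Unit = {
    // LinkedHashMap is used here only to make the iteration order predictable.
    val m = new JLinkedHashMap[String, Integer]()
    m.put("a", 1)
    m.put("b", 2)

    val (keys, values) =
      toKeyValueArrays(m, k => k.toString.toUpperCase, v => v)

    println(keys.mkString(","))
    println(values.mkString(","))
  }
}
```

Compared with the `toSeq` followed by two `map(...).toArray` calls, this shape allocates only the two result arrays and visits each entry once, which is the saving the description attributes to the new `ArrayBasedMapData(map, keyUnwrapper, valueUnwrapper)` call in the diff.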
Diffstat (limited to 'sql/hive')
-rw-r--r-- sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala 11
1 files changed, 3 insertions, 8 deletions
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala
index 1625116803..e303065127 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala
@@ -473,10 +473,8 @@ private[hive] trait HiveInspectors {
       case mi: StandardConstantMapObjectInspector =>
         val keyUnwrapper = unwrapperFor(mi.getMapKeyObjectInspector)
         val valueUnwrapper = unwrapperFor(mi.getMapValueObjectInspector)
-        val keyValues = mi.getWritableConstantValue.asScala.toSeq
-        val keys = keyValues.map(kv => keyUnwrapper(kv._1)).toArray
-        val values = keyValues.map(kv => valueUnwrapper(kv._2)).toArray
-        val constant = ArrayBasedMapData(keys, values)
+        val keyValues = mi.getWritableConstantValue
+        val constant = ArrayBasedMapData(keyValues, keyUnwrapper, valueUnwrapper)
         _ => constant
       case li: StandardConstantListObjectInspector =>
         val unwrapper = unwrapperFor(li.getListElementObjectInspector)
@@ -655,10 +653,7 @@ private[hive] trait HiveInspectors {
             if (map == null) {
               null
             } else {
-              val keyValues = map.asScala.toSeq
-              val keys = keyValues.map(kv => keyUnwrapper(kv._1)).toArray
-              val values = keyValues.map(kv => valueUnwrapper(kv._2)).toArray
-              ArrayBasedMapData(keys, values)
+              ArrayBasedMapData(map, keyUnwrapper, valueUnwrapper)
             }
           } else {
             null