aboutsummaryrefslogtreecommitdiff
path: root/sql/catalyst
diff options
context:
space:
mode:
authorPete Robbins <robbinspg@gmail.com>2016-06-16 22:27:32 -0700
committerDavies Liu <davies.liu@gmail.com>2016-06-16 22:27:32 -0700
commit5ada606144c7bf38a797764619d7d1ff677802b3 (patch)
treeac4a38521729a11816c2c9e4441506da5b45385f /sql/catalyst
parent513a03e41e27d9c5f70911faccc5d3aecd8bdde9 (diff)
downloadspark-5ada606144c7bf38a797764619d7d1ff677802b3.tar.gz
spark-5ada606144c7bf38a797764619d7d1ff677802b3.tar.bz2
spark-5ada606144c7bf38a797764619d7d1ff677802b3.zip
[SPARK-15822] [SQL] Prevent byte array backed classes from referencing freed memory
## What changes were proposed in this pull request? `UTF8String` and all `Unsafe*` classes are backed by either on-heap or off-heap byte arrays. The code generated version `SortMergeJoin` buffers the left hand side join keys during iteration. This was actually problematic in off-heap mode when one of the keys is a `UTF8String` (or any other 'Unsafe*` object) and the left hand side iterator was exhausted (and released its memory); the buffered keys would reference freed memory. This causes Seg-faults and all kinds of other undefined behavior when we would use one these buffered keys. This PR fixes this problem by creating copies of the buffered variables. I have added a general method to the `CodeGenerator` for this. I have checked all places in which this could happen, and only `SortMergeJoin` had this problem. This PR is largely based on the work of robbinspg and he should be credited for this. closes https://github.com/apache/spark/pull/13707 ## How was this patch tested? Manually tested on problematic workloads. Author: Pete Robbins <robbinspg@gmail.com> Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13723 from hvanhovell/SPARK-15822-2.
Diffstat (limited to 'sql/catalyst')
-rw-r--r--sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala16
1 files changed, 16 insertions, 0 deletions
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index ff97cd3211..6392ff42d7 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -130,6 +130,22 @@ class CodegenContext {
mutableStates += ((javaType, variableName, initCode))
}
+ /**
+ * Add buffer variable which stores data coming from an [[InternalRow]]. This methods guarantees
+ * that the variable is safely stored, which is important for (potentially) byte array backed
+ * data types like: UTF8String, ArrayData, MapData & InternalRow.
+ */
+ def addBufferedState(dataType: DataType, variableName: String, initCode: String): ExprCode = {
+ val value = freshName(variableName)
+ addMutableState(javaType(dataType), value, "")
+ val code = dataType match {
+ case StringType => s"$value = $initCode.clone();"
+ case _: StructType | _: ArrayType | _: MapType => s"$value = $initCode.copy();"
+ case _ => s"$value = $initCode;"
+ }
+ ExprCode(code, "false", value)
+ }
+
def declareMutableStates(): String = {
// It's possible that we add same mutable state twice, e.g. the `mergeExpressions` in
// `TypedAggregateExpression`, we should call `distinct` here to remove the duplicated ones.