SPARK-4963 [SQL] Add copy to SQL's Sample operator

https://issues.apache.org/jira/browse/SPARK-4963

SchemaRDD.sample() returns wrong results because GapSamplingIterator operates on mutable rows. HiveTableScan produces an RDD of SpecificMutableRow, and SchemaRDD.sample() returns a GapSamplingIterator to iterate over it.

    override def next(): T = {
      val r = data.next()
      advance
      r
    }

GapSamplingIterator.next() takes the current underlying element and assigns it to r. However, if the underlying iterator reuses a mutable row, as the one returned by HiveTableScan does, the underlying iterator and r point to the same object. The advance operation then drops some underlying elements, which also changes r unexpectedly, so the value returned differs from the initial r.

To fix this issue, the most direct way is to make HiveTableScan return a copy of each mutable row, as in the initial commit that I made. With this solution HiveTableScan cannot take full advantage of the reusable MutableRow, but the sample operation returns the correct result.

Furthermore, we need to investigate GapSamplingIterator.next() and make it perform the copy internally. To achieve this, every element type that an RDD can store would have to implement something like Cloneable, which would be a huge change.

Author: Yanbo Liang <yanbohappy@gmail.com>

Closes #3827 from yanbohappy/spark-4963 and squashes the following commits:

0912ca0 [Yanbo Liang] code format keep
65c4e7c [Yanbo Liang] import file and clear annotation
55c7c56 [Yanbo Liang] better output of test case
cea7e2e [Yanbo Liang] SchemaRDD add copy operation before Sample operator
e840829 [Yanbo Liang] HiveTableScan return mutable row with copy
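
The aliasing problem described above can be reproduced outside Spark with a small sketch. Everything here is a hypothetical stand-in, not Spark code: MutableRow mimics a reusable row, ReusingIterator mimics a scan that hands back the same row object every time, and FixedGapIterator follows the shape of the next() method quoted above but skips a fixed number of elements instead of a random gap, so the result is deterministic. The last variant applies map(_.copy()) before sampling, which is the change the diff below makes in Sample.execute().

```scala
// Hypothetical stand-in for a reusable mutable row (not Spark's SpecificMutableRow).
final class MutableRow(var value: Int) {
  def copy(): MutableRow = new MutableRow(value)
}

// Hypothetical source that reuses one row object for every element,
// similar in spirit to what the commit message describes for HiveTableScan.
final class ReusingIterator(values: Seq[Int]) extends Iterator[MutableRow] {
  private val underlying = values.iterator
  private val row = new MutableRow(0)
  def hasNext: Boolean = underlying.hasNext
  def next(): MutableRow = { row.value = underlying.next(); row }
}

// Simplified sampler: return an element, then skip a fixed number of elements.
// The real GapSamplingIterator draws random gaps; a fixed gap is an assumption
// made here to keep the example deterministic.
final class FixedGapIterator[T](data: Iterator[T], gap: Int) extends Iterator[T] {
  def hasNext: Boolean = data.hasNext
  def next(): T = {
    val r = data.next()
    advance()
    r
  }
  private def advance(): Unit = {
    var i = 0
    while (i < gap && data.hasNext) { data.next(); i += 1 }
  }
}

object SampleCopyDemo {
  val input: Seq[Int] = Seq(1, 2, 3, 4, 5, 6)

  // Without copy: r aliases the reused row, so advance() overwrites it
  // before the caller reads it.
  def withoutCopy: List[Int] =
    new FixedGapIterator(new ReusingIterator(input), 1).map(_.value).toList

  // With copy applied before sampling: each r is an independent object.
  def withCopy: List[Int] =
    new FixedGapIterator(new ReusingIterator(input).map(_.copy()), 1).map(_.value).toList

  def main(args: Array[String]): Unit = {
    println(withoutCopy) // List(2, 4, 6): the skipped rows leak into the result
    println(withCopy)    // List(1, 3, 5): the rows that were actually selected
  }
}
```

Copying every row costs an allocation per element, which is the trade-off the message mentions: the row reuse is given up in exchange for correct sampling.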
author: Yanbo Liang <yanbohappy@gmail.com> 2015-01-10 14:16:37 -0800
committer: Michael Armbrust <michael@databricks.com> 2015-01-10 14:19:32 -0800
commit: 77106df69147aba5eb1784adb84e2b574927c6de (patch)
tree: 92c29ca53a2b4ce66b2123df1729b5c40c940454 /sql/core
parent: b3e86dc62476abb03b330f86a788aa19a6565317 (diff)
download: spark-77106df69147aba5eb1784adb84e2b574927c6de.tar.gz
spark-77106df69147aba5eb1784adb84e2b574927c6de.tar.bz2
spark-77106df69147aba5eb1784adb84e2b574927c6de.zip
1 files changed, 1 insertions, 1 deletions
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
index e53723c176..16ca4be558 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
@@ -70,7 +70,7 @@ case class Sample(fraction: Double, withReplacement: Boolean, seed: Long, child:
   override def output = child.output
 
   // TODO: How to pick seed?
-  override def execute() = child.execute().sample(withReplacement, fraction, seed)
+  override def execute() = child.execute().map(_.copy()).sample(withReplacement, fraction, seed)
 }
 
 /**