path: root/dev/merge_spark_pr.py
author: zsxwing <zsxwing@gmail.com> 2014-11-29 20:23:08 -0500
committer: Patrick Wendell <pwendell@gmail.com> 2014-11-29 20:23:08 -0500
commit: c06222427f866fe216d819bbf4eba7b1c834835c
tree: 54b3475d50fa1a977558fd73eb48a32d1f4025be
parent: 938dc141ee4448c20441fa9dfa3a9897a11ed4b6
[SPARK-4505][Core] Add a ClassTag parameter to CompactBuffer[T]
Added a ClassTag parameter to CompactBuffer so that CompactBuffer[T] can create primitive arrays for primitive types. This significantly reduces memory usage for primitive types at the cost of a minor performance loss.

Here is my test code:

```Scala
// Call org.apache.spark.util.SizeEstimator.estimate via reflection
def estimateSize(obj: AnyRef): Long = {
  val c = Class.forName("org.apache.spark.util.SizeEstimator$")
  val f = c.getField("MODULE$")
  val o = f.get(c)
  val m = c.getMethod("estimate", classOf[Object])
  m.setAccessible(true)
  m.invoke(o, obj).asInstanceOf[Long]
}

sc.parallelize(1 to 10000).groupBy(_ => 1).foreach {
  case (k, v) => println(v.getClass() + " size: " + estimateSize(v))
}
```

With the previous CompactBuffer, the output was:

```
class org.apache.spark.util.collection.CompactBuffer size: 313358
```

With the new CompactBuffer, the output was:

```
class org.apache.spark.util.collection.CompactBuffer size: 65712
```

In this case, the new `CompactBuffer` used only about 20% of the memory of the previous one (65712 vs. 313358 bytes). This is especially helpful for `groupByKey` when the values are primitives.

Author: zsxwing <zsxwing@gmail.com>

Closes #3378 from zsxwing/SPARK-4505 and squashes the following commits:

4abdbba [zsxwing] Add a ClassTag parameter to reduce the memory usage of CompactBuffer[T] when T is a primitive type
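
To illustrate why a ClassTag makes the difference described above, here is a minimal sketch. `TinyBuffer` and `TinyBufferDemo` are hypothetical names made up for this example; this is not Spark's actual `CompactBuffer` code. It only shows the general mechanism: with a `ClassTag` in scope, `new Array[T]` can allocate a primitive array such as `int[]` instead of an `Object[]` of boxed values.

```scala
import scala.reflect.ClassTag

// Hypothetical, simplified growable buffer; NOT Spark's actual CompactBuffer.
class TinyBuffer[T: ClassTag] {
  // The ClassTag lets `new Array[T]` pick the runtime array class:
  // int[] for T = Int, rather than Object[] holding boxed Integers.
  private var data: Array[T] = new Array[T](4)
  private var count = 0

  def +=(value: T): this.type = {
    if (count == data.length) {
      // Grow by doubling; the new array has the same element type.
      val grown = new Array[T](data.length * 2)
      System.arraycopy(data, 0, grown, 0, count)
      data = grown
    }
    data(count) = value
    count += 1
    this
  }

  def apply(i: Int): T = {
    if (i < 0 || i >= count) throw new IndexOutOfBoundsException(i.toString)
    data(i)
  }

  def size: Int = count

  // Exposed only so the example can inspect the backing array's runtime class.
  def backingClass: Class[_] = data.getClass
}

object TinyBufferDemo {
  def main(args: Array[String]): Unit = {
    val ints = new TinyBuffer[Int]
    (1 to 10).foreach(ints += _)
    // True when the backing store is a primitive int[].
    println(ints.backingClass == classOf[Array[Int]])
  }
}
```

Without the `ClassTag` context bound, a generic class typically has to fall back to an `Array[AnyRef]` (or a cast of one), so each `Int` would be stored as a separate boxed object; that boxing overhead is what the size measurements above appear to reflect.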
Diffstat (limited to 'dev/merge_spark_pr.py')
0 files changed, 0 insertions, 0 deletions