author    Matei Zaharia <matei@eecs.berkeley.edu>  2013-10-10 13:55:47 -0700
committer Matei Zaharia <matei@eecs.berkeley.edu>  2013-10-10 13:55:47 -0700
commit    cd08f73483658b872701ec1f74ce84933a45c6f0 (patch)
tree      2ae8656cb3b7ee667bdd9dd727c4f91f26e3abcc /project
parent    320418f7c8b42d4ce781b32c9ee47a9b54550b89 (diff)
parent    001d13f7b9db9584492a8117077110a11eae4600 (diff)
Merge pull request #44 from mateiz/fast-map
A fast and low-memory append-only map for shuffle operations

This is a continuation of the old repo's pull request https://github.com/mesos/spark/pull/823 to add a more efficient hashmap class for shuffles. I've optimized and tested this more thoroughly now, so I think it's good to go. I've also addressed some of the comments that were outstanding there.

The idea is to reduce the cost of shuffles by taking advantage of the properties their hashmaps need. In particular, the hashmaps there are append-only, and a common operation is updating a key's value based on the old value. The included AppendOnlyMap class uses open hashing to use less space than Java's (by not having a linked list per bucket), does not support deletes, and has a changeValue operation to update a key in place without following the hash chain twice.

In micro-benchmarks against java.util.HashMap and scala.collection.mutable.HashMap, this is 20-30% smaller and 10-40% faster depending on the number and type of keys. It's also noticeably faster than fastutil's Object2ObjectOpenHashMap.

I've also tested this in Spark apps now. While the speed gain is modest (partly due to other overheads, like serialization), there is some, and I think the lower memory usage is worth it. Here's one example where the speedup is most noticeable, in spark-shell in local mode:

```
scala> val nums = sc.parallelize(1 to 8).flatMap(x => (1 to 5e6.toInt)).cache
scala> nums.count
scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now }
scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, x)).reduceByKey(_ + _).count) / 1000.0)
```

This prints the following times before and after this change:

```
Before: Vector(4.368, 2.635, 2.549, 2.522, 2.233, 2.222, 2.214, 2.195)
After:  Vector(3.588, 1.741, 1.706, 1.648, 1.777, 1.81, 1.776, 1.731)
```

I've also run the spark-perf suite, enhanced with some tests that use Ints (https://github.com/amplab/spark-perf/pull/9), and it shows some speedup on those tests, but less on the string ones (presumably due to existing overhead): https://gist.github.com/mateiz/6897121
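For illustration only, here is a minimal Scala sketch of the technique the message describes: an open-addressing (no per-bucket linked lists), append-only hash map whose changeValue finds or creates a key's slot in a single probe sequence. This is not the AppendOnlyMap class added by this PR; the class name, probing scheme, and load factor below are assumptions made for the example.

```scala
// Sketch only: open addressing with triangular-number probing on a
// power-of-2 table, no deletes, and an in-place changeValue.
// Null keys are not supported (null marks an empty slot).
class SimpleAppendOnlyMap[K, V](initialCapacity: Int = 64) {
  private var capacity = initialCapacity              // always a power of 2
  private var mask = capacity - 1
  private var data = new Array[AnyRef](2 * capacity)  // key at 2*i, value at 2*i+1
  private var size = 0

  private def rehash(h: Int): Int = {
    // Mix in the high bits so power-of-2 masking doesn't only see low bits.
    (h ^ (h >>> 16)) & mask
  }

  /** Update key's value in place; oldValue is None when the key is new. */
  def changeValue(key: K, updateFunc: Option[V] => V): V = {
    var pos = rehash(key.hashCode)
    var i = 1
    while (true) {
      val curKey = data(2 * pos)
      if (curKey == null) {                            // empty slot: insert
        val newValue = updateFunc(None)
        data(2 * pos) = key.asInstanceOf[AnyRef]
        data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
        size += 1
        if (size > capacity * 0.7) grow()              // assumed load factor
        return newValue
      } else if (curKey == key) {                      // existing key: update
        val newValue = updateFunc(Some(data(2 * pos + 1).asInstanceOf[V]))
        data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
        return newValue
      }
      pos = (pos + i) & mask                           // probe the next slot
      i += 1
    }
    throw new IllegalStateException("unreachable")
  }

  private def grow(): Unit = {
    val oldData = data
    capacity *= 2
    mask = capacity - 1
    data = new Array[AnyRef](2 * capacity)
    size = 0
    var i = 0
    while (i < oldData.length / 2) {                   // reinsert live entries
      if (oldData(2 * i) != null) {
        val k = oldData(2 * i).asInstanceOf[K]
        val v = oldData(2 * i + 1).asInstanceOf[V]
        changeValue(k, _ => v)
      }
      i += 1
    }
  }
}
```

With a map like this, a reduceByKey-style aggregation becomes a single call per record, e.g. `m.changeValue(k, old => old.getOrElse(0) + v)`, walking the probe sequence once instead of a lookup followed by a separate put.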
Diffstat (limited to 'project')
0 files changed, 0 insertions, 0 deletions