author    Matei Zaharia <matei@eecs.berkeley.edu>  2013-10-10 13:55:47 -0700
committer Matei Zaharia <matei@eecs.berkeley.edu>  2013-10-10 13:55:47 -0700
commit    cd08f73483658b872701ec1f74ce84933a45c6f0 (patch)
tree      2ae8656cb3b7ee667bdd9dd727c4f91f26e3abcc /project
parent    320418f7c8b42d4ce781b32c9ee47a9b54550b89 (diff)
parent    001d13f7b9db9584492a8117077110a11eae4600 (diff)
Merge pull request #44 from mateiz/fast-map
A fast and low-memory append-only map for shuffle operations

This is a continuation of the old repo's pull request https://github.com/mesos/spark/pull/823 to add a more efficient hashmap class for shuffles. I've optimized and tested this more thoroughly now, so I think it's good to go. I've also addressed some of the comments that were outstanding there.

The idea is to reduce the cost of shuffles by taking advantage of the properties their hashmaps need. In particular, the hashmaps there are append-only, and a common operation is updating a key's value based on the old value. The included AppendOnlyMap class uses open hashing to use less space than Java's (by not having a linked list per bucket), does not support deletes, and has a changeValue operation to update a key in place without following the hash chain twice.

In micro-benchmarks against java.util.HashMap and scala.collection.mutable.HashMap, this is 20-30% smaller and 10-40% faster depending on the number and type of keys. It's also noticeably faster than fastutil's Object2ObjectOpenHashMap.

I've also tested this in Spark apps now. While the speed gain is modest (partly due to other overheads, like serialization), there is some, and I think the lower memory usage is worth it. Here's one example where the speedup is most noticeable, in spark-shell in local mode:

```
scala> val nums = sc.parallelize(1 to 8).flatMap(x => (1 to 5e6.toInt)).cache
scala> nums.count
scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now }
scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, x)).reduceByKey(_ + _).count) / 1000.0)
```

This prints the following times before and after this change:

```
Before: Vector(4.368, 2.635, 2.549, 2.522, 2.233, 2.222, 2.214, 2.195)
After:  Vector(3.588, 1.741, 1.706, 1.648, 1.777, 1.81, 1.776, 1.731)
```

I've also run the spark-perf suite, enhanced with some tests that use Ints (https://github.com/amplab/spark-perf/pull/9), and it shows some speedup on those tests, but less on the string ones (presumably due to existing overhead): https://gist.github.com/mateiz/6897121
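For illustration only, here is a minimal Scala sketch of the technique the message describes: an open-addressing (no per-bucket linked lists), append-only hash map whose changeValue finds or creates a key's slot in a single probe sequence. This is not the AppendOnlyMap class added by this PR; the class name, probing scheme, and load factor below are assumptions made for the example.

```scala
// Sketch only: open addressing with triangular-number probing on a
// power-of-2 table, no deletes, and an in-place changeValue.
// Null keys are not supported (null marks an empty slot).
class SimpleAppendOnlyMap[K, V](initialCapacity: Int = 64) {
  private var capacity = initialCapacity              // always a power of 2
  private var mask = capacity - 1
  private var data = new Array[AnyRef](2 * capacity)  // key at 2*i, value at 2*i+1
  private var size = 0

  private def rehash(h: Int): Int = {
    // Mix in the high bits so power-of-2 masking doesn't only see low bits.
    (h ^ (h >>> 16)) & mask
  }

  /** Update key's value in place; oldValue is None when the key is new. */
  def changeValue(key: K, updateFunc: Option[V] => V): V = {
    var pos = rehash(key.hashCode)
    var i = 1
    while (true) {
      val curKey = data(2 * pos)
      if (curKey == null) {                            // empty slot: insert
        val newValue = updateFunc(None)
        data(2 * pos) = key.asInstanceOf[AnyRef]
        data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
        size += 1
        if (size > capacity * 0.7) grow()              // assumed load factor
        return newValue
      } else if (curKey == key) {                      // existing key: update
        val newValue = updateFunc(Some(data(2 * pos + 1).asInstanceOf[V]))
        data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
        return newValue
      }
      pos = (pos + i) & mask                           // probe the next slot
      i += 1
    }
    throw new IllegalStateException("unreachable")
  }

  private def grow(): Unit = {
    val oldData = data
    capacity *= 2
    mask = capacity - 1
    data = new Array[AnyRef](2 * capacity)
    size = 0
    var i = 0
    while (i < oldData.length / 2) {                   // reinsert live entries
      if (oldData(2 * i) != null) {
        val k = oldData(2 * i).asInstanceOf[K]
        val v = oldData(2 * i + 1).asInstanceOf[V]
        changeValue(k, _ => v)
      }
      i += 1
    }
  }
}
```

With a map like this, a reduceByKey-style aggregation becomes a single call per record, e.g. `m.changeValue(k, old => old.getOrElse(0) + v)`, walking the probe sequence once instead of a lookup followed by a separate put.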
Diffstat (limited to 'project')
0 files changed, 0 insertions, 0 deletions