path: root/project
author     Reynold Xin <rxin@apache.org>  2013-10-21 22:45:00 -0700
committer  Reynold Xin <rxin@apache.org>  2013-10-21 22:45:00 -0700
commit     48952d67e6dde25faaba241b9deba737b83a5372 (patch)
tree       bc0293e1bf603b8611590a80ad8f03bcdd4cee36 /project
parent     a51359c917a9ebe379b32ebc53fd093c454ea195 (diff)
parent     053ef949ace4fa5581e86d71c5a8162ff5e376a4 (diff)
download   spark-48952d67e6dde25faaba241b9deba737b83a5372.tar.gz
           spark-48952d67e6dde25faaba241b9deba737b83a5372.tar.bz2
           spark-48952d67e6dde25faaba241b9deba737b83a5372.zip
Merge pull request #87 from aarondav/shuffle-base
Basic shuffle file consolidation

The Spark shuffle phase can produce a large number of files, as one file is created per mapper per reducer. For large or repeated jobs, this often produces millions of shuffle files, which leads to severely degraded performance in the OS file system. This patch seeks to reduce that burden by combining multiple shuffle files into one.

This PR draws upon the work of @jason-dai in https://github.com/mesos/spark/pull/669. However, it simplifies the design in order to get the majority of the gain with less overall intellectual and code burden. The vast majority of code in this pull request is a refactor that allows the insertion of a clean layer of indirection between logical block ids and physical files. This, I feel, provides some design clarity in addition to enabling shuffle file consolidation.

The main goal is to produce one shuffle file per reducer per active mapper thread. This allows us to isolate the mappers (simplifying the failure modes) while still reducing the number of shuffle files tremendously for large tasks. To accomplish this, we simply create a new set of shuffle files for every parallel task and return the files to a pool, from which they are handed out to the next task that runs (a rough sketch of this pooling idea follows below).

I have run some ad hoc query testing on 5 m1.xlarge EC2 nodes with 2g of executor memory and the following microbenchmark:

scala> val nums = sc.parallelize(1 to 1000, 1000).flatMap(x => (1 to 1e6.toInt))
scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now }
scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, x)).reduceByKey(_ + _, 2000).count) / 1000.0)

For this particular workload, with 1000 mappers and 2000 reducers (so one file per mapper per reducer would mean 1000 × 2000 = 2,000,000 shuffle files), the old method ran in around 15 minutes, while the consolidated shuffle files ran in around 4 minutes. There was a very sharp increase in running time for the non-consolidated version after around 1 million total shuffle files; below this threshold, however, there wasn't a significant difference between the two. Better performance measurement of this patch is warranted, and I plan to do so in the near future as part of a general investigation of our shuffle file bottlenecks and performance.
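For illustration only, here is a minimal sketch of the pooling idea described above, assuming hypothetical names (ShuffleFileGroup, ShuffleFilePool, checkout, release); it is not Spark's actual implementation, just the shape of the design: each map task checks out a group of per-reducer files, appends its output, and returns the group so the next task reuses the same files.

    // Hypothetical sketch of pool-based shuffle file consolidation (names are illustrative).
    import java.io.File
    import java.util.concurrent.ConcurrentLinkedQueue

    // One group of consolidated files: one file per reducer, shared across map tasks.
    class ShuffleFileGroup(val groupId: Int, numReducers: Int, dir: File) {
      val files: Array[File] =
        Array.tabulate(numReducers)(r => new File(dir, s"merged_shuffle_${groupId}_$r"))
    }

    class ShuffleFilePool(numReducers: Int, dir: File) {
      private val unused = new ConcurrentLinkedQueue[ShuffleFileGroup]()
      private var nextGroupId = 0

      // Reuse an idle group if one exists; otherwise create a fresh set of files.
      def checkout(): ShuffleFileGroup = synchronized {
        Option(unused.poll()).getOrElse {
          nextGroupId += 1
          new ShuffleFileGroup(nextGroupId, numReducers, dir)
        }
      }

      // Return the group so the next map task on this executor appends to the same files.
      def release(group: ShuffleFileGroup): Unit = unused.add(group)
    }

With this scheme the number of shuffle files is bounded by (concurrent map task slots) x (reducers) rather than (total map tasks) x (reducers), which is where the consolidation gain comes from.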
Diffstat (limited to 'project')
0 files changed, 0 insertions, 0 deletions