author    | Reynold Xin <rxin@apache.org> | 2013-10-21 22:45:00 -0700
committer | Reynold Xin <rxin@apache.org> | 2013-10-21 22:45:00 -0700
commit    | 48952d67e6dde25faaba241b9deba737b83a5372
tree      | bc0293e1bf603b8611590a80ad8f03bcdd4cee36 /examples/pom.xml
parent    | a51359c917a9ebe379b32ebc53fd093c454ea195
parent    | 053ef949ace4fa5581e86d71c5a8162ff5e376a4
Merge pull request #87 from aarondav/shuffle-base
Basic shuffle file consolidation
The Spark shuffle phase can produce a large number of files, as one file is created
per mapper per reducer. For large or repeated jobs, this often produces millions of
shuffle files, which severely degrades OS file system performance. This patch seeks
to reduce that burden by combining multiple shuffle files into one.
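The file-count arithmetic is worth making concrete. A rough sketch, using illustrative numbers drawn from the benchmark later in this message (the variable names are mine, not Spark's):

```scala
// Illustrative file-count arithmetic for the shuffle; names are mine.
object ShuffleFileCounts {
  val mappers = 1000          // map tasks in the job
  val reducers = 2000         // reduce partitions
  val concurrentTasks = 16    // map tasks actually running in parallel

  // Without consolidation: one file per (mapper, reducer) pair.
  val withoutConsolidation: Int = mappers * reducers      // 2,000,000 files

  // With consolidation: one file per (active task slot, reducer) pair.
  val withConsolidation: Int = concurrentTasks * reducers // 32,000 files
}
```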
This PR draws upon the work of @jason-dai in https://github.com/mesos/spark/pull/669.
However, it simplifies the design in order to get the majority of the gain with less
overall intellectual and code burden. The vast majority of code in this pull request
is a refactor to allow the insertion of a clean layer of indirection between logical
block ids and physical files. This, I feel, provides some design clarity in addition
to enabling shuffle file consolidation.
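The indirection layer can be pictured as a small resolver that maps a logical shuffle block id to a segment (offset plus length) of some physical file, so that many logical blocks can share one consolidated file. This is only a sketch under simplified names of my own; ShuffleBlockId, FileSegment, and BlockResolver here are illustrative, not the patch's actual classes:

```scala
// Hypothetical sketch of a logical-block-to-physical-file indirection layer.
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int)
case class FileSegment(path: String, offset: Long, length: Long)

class BlockResolver {
  private val mapping =
    scala.collection.mutable.Map.empty[ShuffleBlockId, FileSegment]

  // Record where a logical block was physically written.
  def record(id: ShuffleBlockId, seg: FileSegment): Unit =
    mapping(id) = seg

  // Look up the physical location of a logical block, if any.
  def resolve(id: ShuffleBlockId): Option[FileSegment] =
    mapping.get(id)
}
```

Because readers go through resolve, the writer side is free to pack many map outputs into one file; only the recorded (offset, length) pairs change.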
The main goal is to produce one shuffle file per reducer per active mapper thread.
This allows us to isolate the mappers (simplifying the failure modes), while still
allowing us to reduce the number of shuffle files tremendously for large jobs. To
accomplish this, we simply create a new set of shuffle files for every parallel
task and return the files to a pool, from which they are handed out to the next
task that runs.
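The pooling scheme might look like the following sketch, assuming one file per reducer in each group (ShuffleFileGroup and ShuffleFilePool are hypothetical names, not the patch's API):

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// One group of shuffle file paths, one entry per reducer.
case class ShuffleFileGroup(files: Array[String])

// Each running task checks out a group; finished groups are recycled
// so the next task appends to the same files instead of creating new ones.
class ShuffleFilePool(numReducers: Int) {
  private val unused = new ConcurrentLinkedQueue[ShuffleFileGroup]()
  private var created = 0

  def acquire(): ShuffleFileGroup = synchronized {
    Option(unused.poll()).getOrElse {
      created += 1
      ShuffleFileGroup(Array.tabulate(numReducers)(r => s"shuffle_${created}_$r"))
    }
  }

  def release(group: ShuffleFileGroup): Unit = unused.add(group)
}
```

With this design, the number of groups ever created is bounded by the peak number of concurrently running tasks, not by the total number of map tasks.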
I have run some ad hoc query testing on 5 m1.xlarge EC2 nodes with 2g of executor memory and the following microbenchmark:
scala> val nums = sc.parallelize(1 to 1000, 1000).flatMap(x => (1 to 1e6.toInt))
scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now }
scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, x)).reduceByKey(_ + _, 2000).count) / 1000.0)
For this particular workload, with 1000 mappers and 2000 reducers, I saw the old method running at around 15 minutes, with the consolidated shuffle files running at around 4 minutes. There was a very sharp increase in running time for the non-consolidated version after around 1 million total shuffle files. Below this threshold, however, there wasn't a significant difference between the two.
Better performance measurement of this patch is warranted, and I plan on doing so in the near future as part of a general investigation of our shuffle file bottlenecks and performance.