[SPARK-17931] Eliminate unnecessary task (de) serialization - spark

author	Kay Ousterhout <kayousterhout@gmail.com>	2017-01-06 10:48:00 -0600
committer	Imran Rashid <irashid@cloudera.com>	2017-01-06 10:48:08 -0600
commit	2e139eed3194c7b8814ff6cf007d4e8a874c1e4d (patch)
tree	e175b38ba8df154564e74eb32fe44ee1ae783ea5 /sbin/start-history-server.sh
parent	4a4c3dc9ca10e52f7981b225ec44e97247986905 (diff)

[SPARK-17931] Eliminate unnecessary task (de) serialization

In the existing code, there are three layers of serialization involved in sending a task from the scheduler to an executor:

- A Task object is serialized.
- The serialized Task is copied to a byte buffer that also contains serialized information about any additional JARs, files, and Properties needed for the task to execute. This byte buffer is stored as the member variable serializedTask in the TaskDescription class.
- The TaskDescription is serialized (in addition to the serialized task + JARs, the TaskDescription class contains the task ID and other metadata) and sent in a LaunchTask message.

While it *is* necessary to have two layers of serialization, so that the JAR, file, and Property info can be deserialized prior to deserializing the Task object, the third layer of serialization is unnecessary. This commit eliminates a layer of serialization by moving the JARs, files, and Properties into the TaskDescription class. This commit also serializes the Properties manually (by traversing the map), as is done with the JARs and files, which reduces the final serialized size.

Tested with unit tests.

This is a simpler alternative to the approach proposed in #15505.

shivaram and I did some benchmarking of this and #15505 on 20 m2.4xlarge EC2 machines (160 cores). We ran ~30 trials of code [1] (a very simple job with 10K tasks per stage) and measured the average time per stage:

- Before this change: 2490 ms
- With this change: 2345 ms (so ~6% improvement over the baseline)
- With witgo's approach in #15505: 2046 ms (~18% improvement over baseline)

The reason that #15505 has a more significant improvement is that it also moves the serialization from the TaskSchedulerImpl thread to the CoarseGrainedSchedulerBackend thread. I added that functionality on top of this change, and got almost the same improvement [1] as #15505 (average of 2103 ms).

I think we should decouple these two changes, both so we have some record of the improvement from each individual change, and because this change is more about simplifying the code base (the improvement is negligible) while the other is about performance improvement. The plan, currently, is to merge this PR and then merge the remaining part of #15505 that moves serialization.

[1] The reason the improvement wasn't quite as good as with #15505 when we ran the benchmarks is almost certainly because, at the point when we ran the benchmarks, I hadn't updated the code to manually serialize the Properties (instead the code was using Java's default serialization for the Properties object, whereas #15505 manually serialized the Properties). This PR has since been updated to manually serialize the Properties, just like the other maps.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #16053 from kayousterhout/SPARK-17931.
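
To illustrate what "serializing the Properties manually (by traversing the map)" can look like, here is a minimal Java sketch. It is not the code from this commit: the class TaskEnvelopeSketch, its methods, and the property names are hypothetical and exist only for illustration. It writes each property as a pair of length-prefixed strings ahead of an opaque block of task bytes, so the metadata can be read back without touching the task itself, and it compares the encoded size with Java's default serialization of the same Properties object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Properties;
import java.util.Set;

// Hypothetical illustration only; not Spark's TaskDescription.
final class TaskEnvelopeSketch {

    /** Result of decoding: metadata is usable while the task bytes stay opaque. */
    static final class Decoded {
        final Properties properties;
        final byte[] taskBytes;

        Decoded(Properties properties, byte[] taskBytes) {
            this.properties = properties;
            this.taskBytes = taskBytes;
        }
    }

    /** Writes the properties by traversing the map, then the already-serialized task. */
    static byte[] encode(Properties props, byte[] serializedTask) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            Set<String> names = props.stringPropertyNames();
            out.writeInt(names.size());
            for (String name : names) {
                // writeUTF uses a 2-byte length prefix, so each string must
                // encode to at most 65535 bytes.
                out.writeUTF(name);
                out.writeUTF(props.getProperty(name));
            }
            out.writeInt(serializedTask.length);
            out.write(serializedTask);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the properties back first; the task bytes are copied out untouched. */
    static Decoded decode(byte[] buffer) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer))) {
            Properties props = new Properties();
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                String value = in.readUTF();
                props.setProperty(name, value);
            }
            byte[] task = new byte[in.readInt()];
            in.readFully(task);
            return new Decoded(props, task);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Java's default serialization of the same Properties, for a size comparison. */
    static byte[] javaSerialize(Properties props) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(props);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("example.job.description", "demo job");
        props.setProperty("example.scheduler.pool", "default");

        byte[] manual = encode(props, new byte[0]);
        byte[] viaJava = javaSerialize(props);
        System.out.println("manual encoding: " + manual.length + " bytes");
        System.out.println("Java serialization: " + viaJava.length + " bytes");

        Decoded decoded = decode(encode(props, new byte[] {1, 2, 3}));
        System.out.println("round trip ok: " + (decoded.properties.equals(props)
                && Arrays.equals(decoded.taskBytes, new byte[] {1, 2, 3})));
    }
}
```

The size difference in this sketch comes from skipping the class descriptors and stream header that Java's object serialization writes; how much that matters for a real task depends on how many properties are attached to it.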

Diffstat (limited to 'sbin/start-history-server.sh')

0 files changed, 0 insertions, 0 deletions

