path: root/python/pyspark/sql/functions.py
author: Mingfei <mingfei.shi@intel.com> 2015-05-20 22:33:03 -0700
committer: Patrick Wendell <patrick@databricks.com> 2015-05-20 22:33:03 -0700
commit: 04940c49755fd2e7f1ed7b875da287c946bfebeb
tree: 5eabc4b6d9509670bb3a06c57f19fe3934e4b6d2 /python/pyspark/sql/functions.py
parent: d0eb9ffe978c663b7aa06e908cadee81767d23d1
[SPARK-7389] [CORE] Tachyon integration improvement

Two main changes:

1. Add two functions to ExternalBlockManager, putValues and getValues, because an implementation may not rely on putBytes and getBytes.
2. Improve Tachyon integration. Currently, when putting data into Tachyon, Spark first serializes all data in one partition into a ByteBuffer and then writes it into Tachyon, which uses a lot of memory and increases GC overhead. When getting data from Tachyon, getValues depends on getBytes, which also reads all data into an on-heap byte array and results in high memory usage. This PR changes the approach of the two functions so that they read and write data by stream to reduce memory usage.

In our testing, when the data size is huge, this patch reduces GC time by about 30% and full GC time by about 70%, and reduces total execution time by about 10%.

Author: Mingfei <mingfei.shi@intel.com>

Closes #5908 from shimingfei/Tachyon-integration-rebase and squashes the following commits:

033bc57 [Mingfei] modify accroding to comments
747c69a [Mingfei] modify according to comments - format changes
ce52c67 [Mingfei] put close() in a finally block
d2c60bb [Mingfei] modify according to comments, some code style change
4c11591 [Mingfei] modify according to comments
        split putIntoExternalBlockStore into two functions
        add default implementation for getValues and putValues
cc0a32e [Mingfei] Make getValues read data from Tachyon by stream
        Make putValues write data to Tachyon by stream
017593d [Mingfei] add getValues and putValues in ExternalBlockManager's Interface
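
The difference between the buffered and the streamed approach described in the commit message can be illustrated with a small, language-neutral sketch. This is not Spark's or Tachyon's code: the function names `put_values_buffered`, `put_values_streamed`, and `get_values_streamed` are hypothetical, `pickle` stands in for Spark's serializer, and an in-memory `io.BytesIO` stands in for the external block store.

```python
import io
import pickle


def put_values_buffered(values, sink):
    """Serialize every value into one in-memory buffer, then write it out.

    Returns the size of the buffer, i.e. how many bytes were held at once.
    """
    buffer = io.BytesIO()
    for value in values:
        pickle.dump(value, buffer)
    payload = buffer.getvalue()  # the whole partition sits in memory here
    sink.write(payload)
    return len(payload)


def put_values_streamed(values, sink):
    """Serialize and write one value at a time.

    Returns the size of the largest single chunk that was held at once.
    """
    largest_chunk = 0
    for value in values:
        chunk = pickle.dumps(value)
        sink.write(chunk)
        largest_chunk = max(largest_chunk, len(chunk))
    return largest_chunk


def get_values_streamed(source):
    """Lazily yield values from a binary stream until it is exhausted."""
    while True:
        try:
            yield pickle.load(source)
        except EOFError:
            return


if __name__ == "__main__":
    # Hypothetical partition contents, made up for this illustration.
    partition = [{"id": i, "payload": "x" * 100} for i in range(50)]

    buffered_sink = io.BytesIO()
    streamed_sink = io.BytesIO()
    held_buffered = put_values_buffered(partition, buffered_sink)
    held_streamed = put_values_streamed(partition, streamed_sink)
    print("bytes held at once (buffered):", held_buffered)
    print("bytes held at once (streamed):", held_streamed)

    streamed_sink.seek(0)
    restored = list(get_values_streamed(streamed_sink))
    print("round trip ok:", restored == partition)
```

In the streamed variant the writer only ever holds one serialized value at a time, which is the property the commit message credits for the lower memory usage and GC overhead. The sketch only shows the shape of the idea and says nothing about the actual savings in Spark.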
Diffstat (limited to 'python/pyspark/sql/functions.py')
0 files changed, 0 insertions, 0 deletions