commit 1b6a5d4af9691c3f7f3ebee3146dc13d12a0e047 (patch)
tree   387ed802d0acb7508cbf1d6ffaf59e40b2a41cff /mllib/src/main/scala/org
parent 701fb5052080fa8c0a79ad7c1e65693ccf444787 (diff)
Author:    Davies Liu <davies@databricks.com>  2015-11-04 14:45:02 -0800
Committer: Josh Rosen <joshrosen@databricks.com>  2015-11-04 14:45:02 -0800
[SPARK-11493] remove bitset from BytesToBytesMap
Since each page begins with 4 bytes holding the number of records, a record address can never be zero, so we do not need the bitset. As for performance, the bitset could speed up a false lookup when the slot is empty (the bitset is smaller than the longArray, so its cache hit rate is higher). In practice the map is filled to 35%–70% (take 50% as the average), so only half of the false lookups benefit from it; all the others pay the cost of loading the bitset and still need to access the longArray anyway. For aggregation we always need to access the longArray (to insert a new key after a false lookup), which a benchmark also confirmed. For broadcast hash join there could be a regression, but a simple benchmark (in which most lookups are false) suggests there may not be one:

```
sqlContext.range(1<<20).write.parquet("small")
df = sqlContext.read.parquet('small')
for i in range(3):
    t = time.time()
    df2 = sqlContext.range(1<<26).selectExpr("id * 1111111111 % 987654321 as id2")
    df2.join(df, df.id == df2.id2).count()
    print time.time() - t
```

With the bitset (time in seconds):

```
17.5404241085
10.2758829594
10.5786800385
```

After removing the bitset (time in seconds):

```
21.8939979076
12.4132959843
9.97224712372
```

cc rxin nongli

Author: Davies Liu <davies@databricks.com>

Closes #9452 from davies/remove_bitset.
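The change hinges on one invariant: since every page starts with a 4-byte record count, no record address is ever 0, so the value 0 in the longArray can itself mark an empty slot, making a separate occupancy bitset redundant. A minimal sketch of that idea (hypothetical toy code, not Spark's actual BytesToBytesMap; `SketchMap` and its members are illustrative names):

```python
class SketchMap:
    """Toy open-addressing map: one address array, 0 means 'empty slot'."""

    def __init__(self, capacity=16):
        assert capacity & (capacity - 1) == 0   # power of two, so we can mask
        self.slots = [0] * capacity             # longArray stand-in; 0 == empty
        self.records = {}                       # address -> (key, value); stand-in for pages
        self.next_addr = 1                      # addresses start at 1, never 0

    def _probe(self, key):
        """Linear probing; returns (slot index, address or None if absent)."""
        mask = len(self.slots) - 1
        i = hash(key) & mask
        while True:
            addr = self.slots[i]
            if addr == 0:                       # empty slot: false lookup, no bitset read needed
                return i, None
            k, _ = self.records[addr]
            if k == key:                        # occupied slot with matching key
                return i, addr
            i = (i + 1) & mask                  # collision: try the next slot

    def put(self, key, value):
        i, addr = self._probe(key)
        if addr is None:                        # new key: claim the empty slot
            addr = self.next_addr
            self.next_addr += 1
            self.slots[i] = addr
        self.records[addr] = (key, value)

    def get(self, key):
        _, addr = self._probe(key)
        return self.records[addr][1] if addr is not None else None
```

A false lookup here touches only the slot array, which is exactly the cost argument in the message: with a bitset, the empty-slot check reads the bitset first but an aggregation still has to touch the longArray to insert, so the extra load rarely pays off.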
Diffstat (limited to 'mllib/src/main/scala/org')
0 files changed, 0 insertions, 0 deletions