commit 1b6a5d4af9691c3f7f3ebee3146dc13d12a0e047 (patch)
tree   387ed802d0acb7508cbf1d6ffaf59e40b2a41cff /mllib/src/main/scala/org
parent 701fb5052080fa8c0a79ad7c1e65693ccf444787 (diff)
Author:    Davies Liu <davies@databricks.com>  2015-11-04 14:45:02 -0800
Committer: Josh Rosen <joshrosen@databricks.com>  2015-11-04 14:45:02 -0800
[SPARK-11493] remove bitset from BytesToBytesMap
Since each page begins with 4 bytes holding the number of records, a record address can never be zero, so we do not need the bitset. As for performance, the bitset could speed up a false lookup when the slot is empty (the bitset is smaller than the longArray, so its cache hit rate is higher). In practice the map is filled to 35%–70% (take 50% as the average), so only half of the false lookups benefit from it; all the others pay the cost of loading the bitset and still need to access the longArray anyway. For aggregation we always need to access the longArray (to insert a new key after a false lookup), which a benchmark also confirmed. For broadcast hash join there could be a regression, but a simple benchmark (in which most lookups are false) suggests there may not be one:

```
sqlContext.range(1<<20).write.parquet("small")
df = sqlContext.read.parquet('small')
for i in range(3):
    t = time.time()
    df2 = sqlContext.range(1<<26).selectExpr("id * 1111111111 % 987654321 as id2")
    df2.join(df, df.id == df2.id2).count()
    print time.time() - t
```

With the bitset (time in seconds):

```
17.5404241085
10.2758829594
10.5786800385
```

After removing the bitset (time in seconds):

```
21.8939979076
12.4132959843
9.97224712372
```

cc rxin nongli

Author: Davies Liu <davies@databricks.com>

Closes #9452 from davies/remove_bitset.
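The change hinges on one invariant: since every page starts with a 4-byte record count, no record address is ever 0, so the value 0 in the longArray can itself mark an empty slot, making a separate occupancy bitset redundant. A minimal sketch of that idea (hypothetical toy code, not Spark's actual BytesToBytesMap; `SketchMap` and its members are illustrative names):

```python
class SketchMap:
    """Toy open-addressing map: one address array, 0 means 'empty slot'."""

    def __init__(self, capacity=16):
        assert capacity & (capacity - 1) == 0   # power of two, so we can mask
        self.slots = [0] * capacity             # longArray stand-in; 0 == empty
        self.records = {}                       # address -> (key, value); stand-in for pages
        self.next_addr = 1                      # addresses start at 1, never 0

    def _probe(self, key):
        """Linear probing; returns (slot index, address or None if absent)."""
        mask = len(self.slots) - 1
        i = hash(key) & mask
        while True:
            addr = self.slots[i]
            if addr == 0:                       # empty slot: false lookup, no bitset read needed
                return i, None
            k, _ = self.records[addr]
            if k == key:                        # occupied slot with matching key
                return i, addr
            i = (i + 1) & mask                  # collision: try the next slot

    def put(self, key, value):
        i, addr = self._probe(key)
        if addr is None:                        # new key: claim the empty slot
            addr = self.next_addr
            self.next_addr += 1
            self.slots[i] = addr
        self.records[addr] = (key, value)

    def get(self, key):
        _, addr = self._probe(key)
        return self.records[addr][1] if addr is not None else None
```

A false lookup here touches only the slot array, which is exactly the cost argument in the message: with a bitset, the empty-slot check reads the bitset first but an aggregation still has to touch the longArray to insert, so the extra load rarely pays off.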
Diffstat (limited to 'mllib/src/main/scala/org')
0 files changed, 0 insertions, 0 deletions