[SPARK-11410][SQL] Add APIs to provide functionality similar to Hive's DISTRIBUTE BY and SORT BY. - spark


author	Nong Li <nongli@gmail.com>	2015-11-01 14:32:21 -0800
committer	Yin Huai <yhuai@databricks.com>	2015-11-01 14:34:06 -0800
commit	046e32ed8467e0f46ffeca1a95d4d40017eb5bdb (patch)
tree	4b981c567dec32cd088d277541f63ae3cdd7b647 /core
parent	dc7e399fc01e74f2ba28ebd945785cc0f7759ccd (diff)

[SPARK-11410][SQL] Add APIs to provide functionality similar to Hive's DISTRIBUTE BY and SORT BY.

DISTRIBUTE BY allows the user to hash partition the data by specified exprs. It also allows for optional sorting within each resulting partition. There is no required relationship between the exprs for partitioning and sorting (i.e. one does not need to be a prefix of the other).

This patch adds two APIs to DataFrames which can be used together to provide this functionality:
  1. distributeBy() which partitions the data frame into a specified number of partitions using the partitioning exprs.
  2. localSort() which sorts each partition using the provided sorting exprs.

To get the DISTRIBUTE BY functionality, the user simply does: df.distributeBy(...).localSort(...)

Author: Nong Li <nongli@gmail.com>

Closes #9364 from nongli/spark-11410.
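The two-step pattern in the commit message (hash partition first, then sort within each partition) can be modelled without Spark. The sketch below is only an illustration of the semantics in plain Python, not the patch's implementation: the function names `distribute_by` and `local_sort`, the tuple-shaped rows, and the use of `zlib.crc32` as the hash are all assumptions made for the example.

```python
import zlib
from typing import Callable, Iterable, List, TypeVar

Row = TypeVar("Row")


def distribute_by(rows: Iterable[Row], num_partitions: int,
                  key: Callable[[Row], object]) -> List[List[Row]]:
    """Hash partition rows: equal keys always land in the same partition."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    partitions: List[List[Row]] = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 over repr() is a deterministic stand-in hash (an assumption),
        # not the hash function Spark uses.
        bucket = zlib.crc32(repr(key(row)).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return partitions


def local_sort(partitions: List[List[Row]],
               key: Callable[[Row], object]) -> List[List[Row]]:
    """Sort each partition independently; no global order is implied."""
    return [sorted(part, key=key) for part in partitions]


# Hypothetical (department, salary) rows.
rows = [("eng", 120), ("ops", 90), ("eng", 100), ("hr", 70), ("ops", 95)]

# Partition by department (field 0), then sort each partition by salary
# (field 1). The sort key need not be a prefix of the partitioning key.
result = local_sort(distribute_by(rows, 3, key=lambda r: r[0]),
                    key=lambda r: r[1])
```

In this model, every row for a given department ends up in exactly one partition, and rows inside a partition are ordered by salary, but nothing is guaranteed about ordering across partitions.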

Diffstat (limited to 'core')

0 files changed, 0 insertions, 0 deletions

