author | Nong Li <nongli@gmail.com> | 2015-11-01 14:32:21 -0800 |
---|---|---|
committer | Yin Huai <yhuai@databricks.com> | 2015-11-01 14:34:06 -0800 |
commit | 046e32ed8467e0f46ffeca1a95d4d40017eb5bdb (patch) | |
tree | 4b981c567dec32cd088d277541f63ae3cdd7b647 /core | |
parent | dc7e399fc01e74f2ba28ebd945785cc0f7759ccd (diff) | |
download | spark-046e32ed8467e0f46ffeca1a95d4d40017eb5bdb.tar.gz spark-046e32ed8467e0f46ffeca1a95d4d40017eb5bdb.tar.bz2 spark-046e32ed8467e0f46ffeca1a95d4d40017eb5bdb.zip |
[SPARK-11410][SQL] Add APIs to provide functionality similar to Hive's DISTRIBUTE BY and SORT BY.
DISTRIBUTE BY allows the user to hash partition the data by specified exprs. It also allows for
optional sorting within each resulting partition. There is no required relationship between the
exprs for partitioning and sorting (i.e. one does not need to be a prefix of the other).
This patch adds two APIs to DataFrames which can be used together to provide this functionality:
1. distributeBy() which partitions the data frame into a specified number of partitions using the
partitioning exprs.
2. localSort() which sorts each partition using the provided sorting exprs.
To get the DISTRIBUTE BY functionality, the user simply does: df.distributeBy(...).localSort(...)
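The semantics described above can be sketched outside Spark. The following is a minimal plain-Python model, not the patch's implementation: distribute_by and local_sort are hypothetical stand-ins for distributeBy() and localSort(), rows are plain tuples, and Python's built-in hash() is assumed in place of Spark's partitioner. It only illustrates that rows with equal partitioning keys land in the same partition and that the sort key is independent of the partitioning key.

```python
def distribute_by(rows, key, num_partitions):
    """Hash-partition rows so that equal keys share a partition (illustrative model)."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Assumption: built-in hash() stands in for the real hash partitioner.
        partitions[hash(key(row)) % num_partitions].append(row)
    return partitions


def local_sort(partitions, sort_key):
    """Sort each partition on its own; no ordering is imposed across partitions."""
    return [sorted(part, key=sort_key) for part in partitions]


# Rows are (dept_id, salary): partition on dept_id, sort on salary.
rows = [(1, 50), (2, 10), (1, 20), (3, 40), (2, 30)]
parts = local_sort(
    distribute_by(rows, key=lambda r: r[0], num_partitions=2),
    sort_key=lambda r: r[1],
)
for i, part in enumerate(parts):
    print(i, part)
```

Here the partitioning key (dept_id) and the sort key (salary) are unrelated columns, matching the note that neither needs to be a prefix of the other.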
Author: Nong Li <nongli@gmail.com>
Closes #9364 from nongli/spark-11410.
Diffstat (limited to 'core')
0 files changed, 0 insertions, 0 deletions