author    Michael Armbrust <michael@databricks.com>  2015-11-12 17:20:30 -0800
committer Michael Armbrust <michael@databricks.com>  2015-11-12 17:20:30 -0800
commit    41bbd2300472501d69ed46f0407d5ed7cbede4a8 (patch)
tree      d158a4f04eda93959aeb8ebb1b69b186fe1a8ee7 /examples/src/main
parent    dcb896fd8cec83483f700ee985c352be61cdf233 (diff)
[SPARK-11654][SQL] add reduce to GroupedDataset
This PR adds a new method, `reduce`, to `GroupedDataset`, which allows operations similar to `reduceByKey` on a traditional `PairRDD`.

```scala
val ds = Seq("abc", "xyz", "hello").toDS()
ds.groupBy(_.length).reduce(_ + _).collect() // not actually commutative :P

res0: Array(3 -> "abcxyz", 5 -> "hello")
```

While implementing this method and its test cases, several more deficiencies were found in our encoder handling. Specifically, in order to support positional resolution, named resolution, and tuple composition, it is important to keep the unresolved encoder around and to use it when constructing new `Datasets` with the same object type but different output attributes. We now divide the encoder lifecycle into three phases (mirroring the lifecycle of standard expressions) and have checks at various boundaries:

- Unresolved Encoders: all user-facing encoders (those constructed by implicits, static methods, or tuple composition) are unresolved, meaning they have only `UnresolvedAttributes` for named fields and `BoundReferences` for fields accessed by ordinal.
- Resolved Encoders: internal to a `[Grouped]Dataset`, the encoder is resolved, meaning all input has been resolved to a specific `AttributeReference`. Any encoders that are placed into a logical plan for use in object construction should be resolved.
- Bound Encoders: constructed by physical plans, right before the actual row -> object conversion is performed.

It is left to future work to add explicit checks for resolution and to provide good error messages when resolution fails. We might also consider enforcing the above constraints in the type system (i.e. `fromRow` only exists on a `ResolvedEncoder`), but we should probably wait before spending too much time on this.

Author: Michael Armbrust <michael@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9673 from marmbrus/pr/9628.
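For comparison with the `reduceByKey` analogy above, here is a minimal sketch of the equivalent pair-RDD pipeline. It is illustrative only and not part of this patch; it assumes a `SparkContext` named `sc` is already in scope.

```scala
// Rough RDD analogue of the Dataset example above (assumes an existing SparkContext `sc`).
// Key each string by its length, then reduce the values for each key by concatenation.
val pairs = sc.parallelize(Seq("abc", "xyz", "hello")).map(s => (s.length, s))
pairs.reduceByKey(_ + _).collect()
// e.g. Array((3, "abcxyz"), (5, "hello")) -- the concatenation order is not guaranteed,
// since string concatenation is not commutative (as noted in the example above).
```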
Diffstat (limited to 'examples/src/main')
0 files changed, 0 insertions, 0 deletions