author    Michael Armbrust <michael@databricks.com>    2015-10-27 13:28:52 -0700
committer Yin Huai <yhuai@databricks.com>    2015-10-27 13:28:52 -0700
commit    5a5f65905a202e59bc85170b01c57a883718ddf6 (patch)
tree      dcd1f9958573a0e3b419805609495fc1380b1565 /python/pyspark/ml/classification.py
parent    3bdbbc6c972567861044dd6a6dc82f35cd12442d (diff)
download  spark-5a5f65905a202e59bc85170b01c57a883718ddf6.tar.gz
          spark-5a5f65905a202e59bc85170b01c57a883718ddf6.tar.bz2
          spark-5a5f65905a202e59bc85170b01c57a883718ddf6.zip
[SPARK-11347] [SQL] Support for joinWith in Datasets
This PR adds a new operation `joinWith` to `Dataset`, which returns a `Tuple` for each pair where a given `condition` evaluates to true.

```scala
case class ClassData(a: String, b: Int)

val ds1 = Seq(ClassData("a", 1), ClassData("b", 2)).toDS()
val ds2 = Seq(("a", 1), ("b", 2)).toDS()

> ds1.joinWith(ds2, $"_1" === $"a").collect()
res0: Array((ClassData("a", 1), ("a", 1)), (ClassData("b", 2), ("b", 2)))
```

This operation is similar to the relational `join` function, with one important difference in the result schema. Since `joinWith` preserves the objects present on either side of the join, the result schema is likewise nested into a tuple under the column names `_1` and `_2`. This type of join is useful both for preserving type safety with the original object types and for working with relational data where the two sides of the join share column names.

## Required Changes to Encoders

In the process of working on this patch, several deficiencies in the way we were handling encoders were discovered. Specifically, it turned out to be very difficult to `rebind` the non-expression-based encoders to extract the nested objects from the results of joins (and also typed selects that return tuples). As a result, the following changes were made:

 - `ClassEncoder` has been renamed to `ExpressionEncoder` and has been improved to also handle primitive types. Additionally, it is now possible to take arbitrary expression encoders and rewrite them into a single encoder that returns a tuple.
 - All internal operations on `Dataset`s now require an `ExpressionEncoder`. If a user tries to pass in a non-`ExpressionEncoder`, an error will be thrown. We can relax this requirement in the future by constructing a wrapper class that uses expressions to project the row to the expected schema, shielding the user's code from the required remapping. This gives us a nice balance: we don't force user encoders to understand attribute references and binding, but we still allow our native encoders to leverage runtime code generation to construct specific encoders for a given schema that avoid an extra remapping step.
 - Additionally, the semantics for different types of objects are now better defined. As stated in the `ExpressionEncoder` scaladoc:
   - Classes will have their subfields extracted by name using `UnresolvedAttribute` and `UnresolvedExtractValue` expressions.
   - Tuples will have their subfields extracted by position using `BoundReference` expressions.
   - Primitives will have their values extracted from the first ordinal, with a schema that defaults to the name `value`.
 - Finally, the binding lifecycle for `Encoders` has been unified across the codebase. Encoders are now `resolved` to the appropriate schema in the constructor of `Dataset`. This process replaces unresolved expressions with concrete `AttributeReference` expressions. Binding then happens on demand, when an encoder is about to be used to construct an object. This closely mirrors the lifecycle of standard expressions when executing normal SQL or `DataFrame` queries.

Author: Michael Armbrust <michael@databricks.com>

Closes #9300 from marmbrus/datasets-tuples.
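For context, a minimal, self-contained sketch of exercising the new operation end to end is shown below. The surrounding bootstrap (a local `SparkContext`/`SQLContext`, the `JoinWithSketch` object name, and the follow-up `map` step) is assumed for illustration and is not part of this patch; the `joinWith` call itself mirrors the example above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Top-level case class so its encoder does not need an outer pointer.
case class ClassData(a: String, b: Int)

object JoinWithSketch {
  def main(args: Array[String]): Unit = {
    // Assumed bootstrap for a standalone 1.6-era app; in the shell these are already provided.
    val sc = new SparkContext(new SparkConf().setAppName("joinWith-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val ds1 = Seq(ClassData("a", 1), ClassData("b", 2)).toDS()
    val ds2 = Seq(("a", 1), ("b", 2)).toDS()

    // joinWith keeps both sides as whole objects; the result schema nests them
    // under the tuple columns `_1` and `_2` rather than flattening their fields.
    val joined = ds1.joinWith(ds2, $"_1" === $"a")

    // Each element is a (ClassData, (String, Int)) pair, so downstream code stays
    // typed instead of pulling columns out of a flattened Row.
    joined.map { case (left, right) => s"${left.a} -> ${right._2}" }
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```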
Diffstat (limited to 'python/pyspark/ml/classification.py')
0 files changed, 0 insertions, 0 deletions