author | Tarek Auel <tarek.auel@googlemail.com> | 2015-07-04 01:10:52 -0700 |
---|---|---|
committer | Reynold Xin <rxin@databricks.com> | 2015-07-04 01:10:52 -0700 |
commit | 6b3574e68704d58ba41efe0ea4fe928cc166afcd (patch) | |
tree | c8dc9f32d4081d94063df0d7cf6665d99e797641 /python/pyspark | |
parent | f35b0c3436898f22860d2c6c1d12f3a661005201 (diff) | |
[SPARK-8270][SQL] levenshtein distance
Jira: https://issues.apache.org/jira/browse/SPARK-8270
Info: I cannot build the latest master; the build gets stuck at: `[INFO] Dependency-reduced POM written at: /Users/tarek/test/spark/bagel/dependency-reduced-pom.xml`
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes #7214 from tarekauel/SPARK-8270 and squashes the following commits:
ab348b9 [Tarek Auel] Merge branch 'master' into SPARK-8270
a2ad318 [Tarek Auel] [SPARK-8270] changed order of fields
d91b12c [Tarek Auel] [SPARK-8270] python fix
adbd075 [Tarek Auel] [SPARK-8270] fixed typo
23185c9 [Tarek Auel] [SPARK-8270] levenshtein distance
Diffstat (limited to 'python/pyspark')
-rw-r--r-- | python/pyspark/sql/functions.py | 14 |
1 files changed, 14 insertions, 0 deletions
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 69e563ef36..49dd0332af 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -325,6 +325,20 @@ def explode(col):
 
 @ignore_unicode_prefix
 @since(1.5)
+def levenshtein(left, right):
+    """Computes the Levenshtein distance of the two given strings.
+
+    >>> df0 = sqlContext.createDataFrame([('kitten', 'sitting',)], ['l', 'r'])
+    >>> df0.select(levenshtein('l', 'r').alias('d')).collect()
+    [Row(d=3)]
+    """
+    sc = SparkContext._active_spark_context
+    jc = sc._jvm.functions.levenshtein(_to_java_column(left), _to_java_column(right))
+    return Column(jc)
+
+
+@ignore_unicode_prefix
+@since(1.5)
 def md5(col):
     """Calculates the MD5 digest and returns the value as a 32 character hex string.
 
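
For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of the Levenshtein (edit) distance that the new function's doctest exercises. It is not the code added by this commit: the PySpark wrapper only delegates to the JVM-side `functions.levenshtein`, and the name `levenshtein_distance` is a hypothetical helper used here for illustration. The sketch assumes the standard definition, in which insertions, deletions, and substitutions each cost 1.

```python
def levenshtein_distance(left: str, right: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `left` into `right`."""
    # previous[j] is the distance between the prefix of `left` processed
    # so far and right[:j]. For the empty prefix this is j insertions.
    previous = list(range(len(right) + 1))
    for i, left_char in enumerate(left, start=1):
        # Turning left[:i] into the empty string takes i deletions.
        current = [i]
        for j, right_char in enumerate(right, start=1):
            cost = 0 if left_char == right_char else 1
            current.append(min(
                previous[j] + 1,         # delete left_char
                current[j - 1] + 1,      # insert right_char
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


# Same input pair as the doctest in the diff.
print(levenshtein_distance("kitten", "sitting"))  # 3
```

Keeping only two rows of the dynamic-programming table holds memory proportional to the length of `right`, which is usually sufficient when only the distance, and not the edit script, is needed.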