# stringmetric
A small library of string metrics and phonetic algorithms. Each has a command line interface, is thoroughly unit tested, and is performant (verified via microbenchmark suites).

* __Phonetic metrics__ determine if two arguments sound the same phonetically. 
* __Phonetic algorithms__ determine the phonetic representation of the argument passed. All phonetic metrics have a standalone algorithm counterpart. 
* __Similarity metrics__ determine the distance or coefficient between two arguments.
* __Filters__, which can optionally be applied to metrics and algorithms, clean up arguments prior to evaluation. Filters can be combined via trait stacking.
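
To make the relationship between phonetic algorithms and phonetic metrics concrete, here is a minimal, self-contained sketch. It is not the library's implementation: `PhoneticSketch`, `soundex`, and `soundsAlike` are hypothetical names, and the encoding is a simplified American Soundex that silently drops non-letters. It only illustrates that a phonetic metric can be expressed as a comparison of two algorithm outputs.

```scala
// Illustrative only; names and behavior are assumptions, not the stringmetric API.
object PhoneticSketch {
  // Consonant classes used by American Soundex.
  private val codes: Map[Char, Char] = Map(
    'b' -> '1', 'f' -> '1', 'p' -> '1', 'v' -> '1',
    'c' -> '2', 'g' -> '2', 'j' -> '2', 'k' -> '2',
    'q' -> '2', 's' -> '2', 'x' -> '2', 'z' -> '2',
    'd' -> '3', 't' -> '3',
    'l' -> '4',
    'm' -> '5', 'n' -> '5',
    'r' -> '6'
  )

  // Phonetic algorithm: one argument in, its phonetic representation out.
  // Returns None when the argument contains no ASCII letters.
  def soundex(s: String): Option[String] = {
    val letters = s.toLowerCase.filter(c => c >= 'a' && c <= 'z')
    if (letters.isEmpty) None
    else {
      val sb = new StringBuilder
      sb.append(letters.head.toUpper)
      // Code of the previous letter; '0' means "no code".
      var last: Char = codes.getOrElse(letters.head, '0')
      for (c <- letters.tail) {
        if (sb.length < 4) {
          codes.get(c) match {
            case Some(code) =>
              // Adjacent letters with the same code are encoded once.
              if (code != last) sb.append(code)
              last = code
            case None =>
              // Vowels and 'y' separate equal codes; 'h' and 'w' do not.
              if (c != 'h' && c != 'w') last = '0'
          }
        }
      }
      // Pad with zeros to one letter plus three digits.
      Some((sb.toString + "000").take(4))
    }
  }

  // Phonetic metric: two arguments in, "do they sound the same?" out.
  def soundsAlike(a: String, b: String): Option[Boolean] =
    for (x <- soundex(a); y <- soundex(b)) yield x == y
}
```

Under these assumptions, `PhoneticSketch.soundex("Robert")` and `PhoneticSketch.soundex("Rupert")` produce the same code, so `PhoneticSketch.soundsAlike("Robert", "Rupert")` yields `Some(true)`.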

## Metrics and Algorithms
* __Dice / Sorensen__ (Similarity metric)
* __Hamming__ (Similarity metric)
* __Jaro__ (Similarity metric)
* __Jaro-Winkler__ (Similarity metric)
* __Levenshtein__ (Similarity metric)
* __Metaphone__ (Phonetic metric and algorithm)
* __N-Gram__ (Similarity metric and algorithm)
* __NYSIIS__ (Phonetic metric and algorithm)
* __Refined NYSIIS__ (Phonetic metric and algorithm)
* __Refined Soundex__ (Phonetic metric and algorithm)
* __Soundex__ (Phonetic metric and algorithm)
* __Weighted Levenshtein__ (Similarity metric)
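
As an illustration of what a similarity metric in this list computes, the following is a minimal, self-contained sketch of the classic Levenshtein edit distance. It is a textbook dynamic-programming version written for this README, not the library's code; `LevenshteinSketch` and `distance` are hypothetical names, and the library's own return type and argument handling may differ.

```scala
// Illustrative only; not the stringmetric implementation.
object LevenshteinSketch {
  // Minimum number of single-character insertions, deletions, and
  // substitutions needed to turn `a` into `b`.
  def distance(a: String, b: String): Int = {
    // previous(j) = distance between the first i - 1 chars of a and the first j chars of b.
    var previous = (0 to b.length).toArray
    for (i <- 1 to a.length) {
      val current = new Array[Int](b.length + 1)
      current(0) = i // deleting i characters reaches the empty prefix of b
      for (j <- 1 to b.length) {
        val cost = if (a.charAt(i - 1) == b.charAt(j - 1)) 0 else 1
        val substitution = previous(j - 1) + cost
        val deletion = previous(j) + 1
        val insertion = current(j - 1) + 1
        current(j) = math.min(substitution, math.min(deletion, insertion))
      }
      previous = current
    }
    previous(b.length)
  }
}
```

For example, `LevenshteinSketch.distance("kitten", "sitting")` is 3: two substitutions and one insertion.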

## Using the API
Basic example with no filtering.

    import org.hashtree.stringmetric.similarity.JaroWinklerMetric

    val distance = JaroWinklerMetric.compare("string1", "string2")

    if (distance >= 0.9) println("It's likely you're a match!")

Basic example with a single filter.

    import org.hashtree.stringmetric.filter.{ AsciiLetterCaseStringFilter, StringFilterDelegate }
    import org.hashtree.stringmetric.similarity.JaroWinklerMetric

    // Keep the filter argument list on the same line as the call so Scala does not end the statement early.
    val distance = JaroWinklerMetric.compare("string1", "string2")(new StringFilterDelegate with AsciiLetterCaseStringFilter)

    if (distance >= 0.9) println("It's likely you're a match!")

Basic example with stacked filters. Filters are applied in the reverse of the order in which they are mixed in.

    import org.hashtree.stringmetric.filter.{ AsciiLetterCaseStringFilter, AsciiLetterOnlyStringFilter, StringFilterDelegate }
    import org.hashtree.stringmetric.similarity.JaroWinklerMetric

    // Keep the filter argument list on the same line as the call so Scala does not end the statement early.
    val distance = JaroWinklerMetric.compare("string1", "string2")(new StringFilterDelegate with AsciiLetterCaseStringFilter with AsciiLetterOnlyStringFilter)

    if (distance >= 0.9) println("It's likely you're a match!")
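
The reverse ordering follows from how Scala linearizes stacked traits. The sketch below reproduces the effect with hypothetical stand-in traits (`Filter`, `Delegate`, `LowerCase`, and `UpperCaseOnly` are not part of the library), assuming each filter transforms its input and then delegates to `super`, which is the usual stackable-trait pattern.

```scala
// Illustrative only; these traits are stand-ins, not the stringmetric filters.
object StackingSketch {
  trait Filter { def filter(s: String): String }

  // Base of the stack: returns its input unchanged.
  class Delegate extends Filter { def filter(s: String): String = s }

  // Each stackable trait transforms the input, then passes it down the stack.
  trait LowerCase extends Filter {
    abstract override def filter(s: String): String = super.filter(s.toLowerCase)
  }
  trait UpperCaseOnly extends Filter {
    abstract override def filter(s: String): String = super.filter(s.filter(c => c.isUpper))
  }

  // The trait mixed in last runs first.
  val upperOnlyThenLower: Filter = new Delegate with LowerCase with UpperCaseOnly
  val lowerThenUpperOnly: Filter = new Delegate with UpperCaseOnly with LowerCase
}
```

With these stand-ins, `StackingSketch.upperOnlyThenLower.filter("Hello World")` keeps the upper-case letters first and then lower-cases them, giving `"hw"`. `StackingSketch.lowerThenUpperOnly.filter("Hello World")` lower-cases first, so no upper-case letters remain and the result is the empty string.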

You can also use the StringMetric, StringAlgorithm, and StringFilter convenience objects.

    import org.hashtree.stringmetric.{ StringAlgorithm, StringFilter, StringMetric }

    if (StringMetric.compareWithJaroWinkler("string1", "string2") >= 0.9)
        println("It's likely you're a match!")

    if (StringMetric.compareWithJaroWinkler("string1", "string2")(StringFilter.asciiLetterCase) >= 0.9)
        println("It's likely you're a match!")

## Using the CLI
Uncompress the built tar and ensure you have permission to execute the commands. Then execute the metric or algorithm of choice via the command line.

The help option prints command syntax and usage.

    jaroWinklerMetric --help
    metaphoneMetric --help
    metaphoneAlgorithm --help

Compare "abc" to "xyz" using the Jaro-Winkler metric.

    jaroWinklerMetric abc xyz

Compare "abc" to "xyz" using the Metaphone metric.

    metaphoneMetric abc xyz

Get the phonetic representation of "abc" using the Metaphone phonetic algorithm.

    metaphoneAlgorithm abc

## Depending on the API (via the Maven Central Repository)
* __groupId__: org.hashtree.stringmetric
* __artifactId__: stringmetric-core

## Depending on the CLI (via the Maven Central Repository)
* __groupId__: org.hashtree.stringmetric
* __artifactId__: stringmetric-cli

## Building the API (via Gradle)

    gradle :stringmetric-core:jar

## Building the CLI (via Gradle)

    gradle :stringmetric-cli:tar

## Requirements
* Scala 2.9.x
* Gradle 1.x

## Todo
* SmithWaterman
* MongeElkan
* NeedlemanWunsch
* Jaccard
* Double Metaphone
* Memoization decorator

## Versioning
Semantic Versioning v2.0

## License
Apache License v2.0