summaryrefslogtreecommitdiff
path: root/readme.md
blob: 2be22056e92914d901d06f457c201e6cfe4ccbf0 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
#stringmetric [![Build Status](https://secure.travis-ci.org/rockymadden/stringmetric.png)](http://travis-ci.org/rockymadden/stringmetric)
A small library of string metrics and phonetic algorithms. Each has a command line interface and [Google Caliper](http://code.google.com/p/caliper/) microbenchmark suite.

* __Phonetic metrics__ determine if two arguments sound the same phonetically. 
* __Phonetic algorithms__ determine the phonetic representation of the argument passed. All phonetic metrics have a standalone algorithm counterpart. 
* __Similarity metrics__ determine the distance or coefficient between two arguments.
* __Filters__, which can optionally be applied to metrics and algorithms, clean up arguments prior to evaluation. Filters can be combined via trait stacking.

## Metrics and Algorithms
* __[Dice / Sorensen](http://en.wikipedia.org/wiki/Dice%27s_coefficient)__ (Similarity metric)
* __[Hamming](http://en.wikipedia.org/wiki/Hamming_distance)__ (Similarity metric)
* __[Jaro](http://en.wikipedia.org/wiki/Jaro-Winkler_distance)__ (Similarity metric)
* __[Jaro-Winkler](http://en.wikipedia.org/wiki/Jaro-Winkler_distance)__ (Similarity metric)
* __[Levenshtein](http://en.wikipedia.org/wiki/Levenshtein_distance)__ (Similarity metric)
* __[Metaphone](http://en.wikipedia.org/wiki/Metaphone)__ (Phonetic metric and algorithm)
* __[N-Gram](http://en.wikipedia.org/wiki/N-gram)__ (Similarity metric and algorithm)
* __[NYSIIS](http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System)__ (Phonetic metric and algorithm)
* __[Refined NYSIIS](http://www.markcrocker.com/rexxtipsntricks/rxtt28.2.0482.html)__ (Phonetic metric and algorithm)
* __[Refined Soundex](http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html)__ (Phonetic metric and algorithm)
* __[Soundex](http://en.wikipedia.org/wiki/Soundex)__ (Phonetic metric and algorithm)
* __Weighted Levenshtein__ (Similarity metric)

## Building the API
```shell
gradle :stringmetric-core:jar
```

## Building the CLI
```shell
gradle :stringmetric-cli:tar
```

## Using the API
Basic example with no filtering.
```scala
import org.hashtree.stringmetric.similarity.JaroWinklerMetric  
  
val distance = JaroWinklerMetric.compare("string1", "string2")

if (distance >= 0.9) println("It's likely you're a match!")
```

Basic example with single filter.
```scala
import org.hashtree.stringmetric.filter.{ AsciiLetterCaseStringFilter, StringFilterDelegate }
import org.hashtree.stringmetric.similarity.JaroWinklerMetric

val distance = JaroWinklerMetric.compare("string1", "string2")
    (new StringFilterDelegate with AsciiLetterCaseStringFilter)

if (distance >= 0.9) println("It's likely you're a match!")
```

Basic example with stacked filter. Filters are applied in reverse order.
```scala
import org.hashtree.stringmetric.filter.{ AsciiLetterCaseStringFilter, AsciiLetterOnlyStringFilter, StringFilterDelegate }
import org.hashtree.stringmetric.similarity.JaroWinklerMetric

val distance = JaroWinklerMetric.compare("string1", "string2")
    (new StringFilterDelegate with AsciiLetterCaseStringFilter with AsciiLetterOnlyStringFilter)

if (distance >= 0.9) println("It's likely you're a match!")
```

You can also use the StringMetric, StringAlgorithm, and StringFilter convenience objects.
```scala
import org.hashtree.stringmetric.{ StringAlgorithm, StringFilter, StringMetric}
  
if (StringMetric.compareWithJaroWinkler("string1", "string2") >= 0.9) 
    println("It's likely you're a match!")
 
if (StringMetric.compareWithJaroWinkler("string1", "string2")(StringFilter.asciiLetterCase) >= 0.9) 
    println("It's likely you're a match!")
```

## Using the CLI
Uncompress the built tar and ensure you have ability to execute the commands. Execute the metric of choice via the command line:

The help option prints command syntax and usage.
```shell
jaroWinklerMetric --help
metaphoneMetric --help
metaphoneAlgorithm --help
```

Compare "abc" to "xyz" using the Jaro-Winkler metric.
```shell
jaroWinklerMetric abc xyz
```

Compare "abc "to "xyz" using the Metaphone metric.
```shell
metaphoneMetric abc xyz
```

Get the phonetic representation of "abc" using the Metaphone phonetic algorithm.
```shell 
metaphoneAlgorithm abc
```

## Requirements
* Scala 2.9.2
* Gradle 1.0 or above

## Todo
* SmithWaterman
* MongeElkan
* NeedlemanWunch
* Jaccard
* Double Metaphone
* Memoization decorator

## Versioning
[Semantic Versioning 2.0.0](http://semver.org/)

## License
[Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)