summaryrefslogtreecommitdiff
path: root/readme.md
blob: 85fc91dc27914306841e91497ddcc695e7809b21 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
#stringmetric
A collection of string metrics and phonetic algorithms implemented in Scala. All phonetic string metrics have a standalone algorithm counterpart. They provide a means to determine the phonetic representation of the argument passed, rather than evaluating if two arguments sound the same phonetically. __Each metric and algorithm has a CLI.__

## Metrics and Phonetic Algorithms
* __[Dice / Sorensen](http://en.wikipedia.org/wiki/Dice%27s_coefficient)__
	* API: org.hashtree.stringmetric.similarity.DiceSorensenMetric
	* CLI: diceSorensenMetric
* __[Hamming](http://en.wikipedia.org/wiki/Hamming_distance)__
	* API: org.hashtree.stringmetric.similarity.HammingMetric
	* CLI: hammingMetric
* __[Jaro](http://en.wikipedia.org/wiki/Jaro-Winkler_distance)__
	* API: org.hashtree.stringmetric.similarity.JaroMetric
	* CLI: jaroMetric
* __[Jaro-Winkler](http://en.wikipedia.org/wiki/Jaro-Winkler_distance)__
	* API: org.hashtree.stringmetric.similarity.JaroWinklerMetric
	* CLI: jaroWinklerMetric
* __[Levenshtein](http://en.wikipedia.org/wiki/Levenshtein_distance)__
	* API: org.hashtree.stringmetric.similarity.LevenshteinMetric
	* CLI: levenshteinMetric
* __[Metaphone](http://en.wikipedia.org/wiki/Metaphone)__
	* API: org.hashtree.stringmetric.phonetic.MetaphoneMetric and org.hashtree.stringmetric.phonetic.MetaphoneAlgorithm
	* CLI: metaphoneMetric and metaphoneAlgorithm
* __[NYSIIS](http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System)__
	* API: org.hashtree.stringmetric.phonetic.NysiisMetric and org.hashtree.stringmetric.phonetic.NysiisAlgorithm
	* CLI: nysiisMetric and nysiisAlgorithm
* __[Refined Soundex](http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html)__
	* API: org.hashtree.stringmetric.phonetic.RefinedSoundexMetric and org.hashtree.stringmetric.phonetic.RefinedSoundexAlgorithm
	* CLI: refinedSoundexMetric and refinedSoundexAlgorithm
* __[Soundex](http://en.wikipedia.org/wiki/Soundex)__
	* API: org.hashtree.stringmetric.phonetic.SoundexMetric and org.hashtree.stringmetric.phonetic.SoundexAlgorithm
	* CLI: soundexMetric and soundexAlgorithm

## Filters
Filters, which can optionally be applied, clean up arguments prior to evaluation. Filtering rules can be composed via trait decoration.

* __Ensures only ASCII control characters matter__
	* API: org.hashtree.stringmetric.filter.AsciiControlOnlyStringFilter
* __Ensures ASCII controls do not matter__
	* API: org.hashtree.stringmetric.filter.AsciiControlStringFilter
* __Ensures ASCII letter case-sensitivity does not matter__
	* API: org.hashtree.stringmetric.filter.AsciiLetterCaseStringFilter
* __Ensures only ASCII letters and numbers matter__
	* API: org.hashtree.stringmetric.filter.AsciiLetterNumberOnlyStringFilter
* __Ensures ASCII letters and numbers do not matter__
	* API: org.hashtree.stringmetric.filter.AsciiLetterNumberStringFilter
* __Ensures only ASCII letters matter__
	* API: org.hashtree.stringmetric.filter.AsciiLetterOnlyStringFilter
* __Ensures ASCII letters do not matter__
	* API: org.hashtree.stringmetric.filter.AsciiLetterStringFilter
* __Ensures only ASCII numbers matter__
	* API: org.hashtree.stringmetric.filter.AsciiNumberOnlyStringFilter
* __Ensures ASCII numbers do not matter__
	* API: org.hashtree.stringmetric.filter.AsciiNumberStringFilter
* __Ensures ASCII spaces do not matter__
	* API: org.hashtree.stringmetric.filter.AsciiSpaceStringFilter
* __Ensures only ASCII symbols matter__
	* API: org.hashtree.stringmetric.filter.AsciiSymbolOnlyStringFilter
* __Ensures ASCII symbols do not matter__
	* API: org.hashtree.stringmetric.filter.AsciiSymbolStringFilter

## Building the API
```shell
gradle :stringmetric-core:jar
```

## Building the CLI
```shell
gradle :stringmetric-cli:tar
```

## Using the API
The absolute easiest example is to use the StringMetric convenience object.
```scala
import org.hashtree.stringmetric.StringMetric
  
if (StringMetric.compareJaroWinkler("string1", "string2") >= 0.9) 
    println("It's likely you're a match!")
```

The absolute easiest example with one filter is to use the StringMetric and StringFilter convenience objects.
```scala
import org.hashtree.stringmetric.{ StringFilter, StringMetric }
  
if (StringMetric.compareJaroWinkler("string1", "string2")(StringFilter.asciiLetterCase) >= 0.9) 
    println("It's likely you're a match!")
```

Simple example. Import metric, compare, do something with result. 
```scala
import org.hashtree.stringmetric.similarity.JaroWinklerMetric  
  
val distance = JaroWinklerMetric.compare("string1", "string2")

if (distance >= 0.9) println("It's likely you're a match!")
```

One filter example. Import metric, compare with one filter, do something with result.
```scala
import org.hashtree.stringmetric.similarity.{ JaroWinklerMetric, StringFilterDelegate }
import org.hashtree.stringmetric.filter.AsciiLetterCaseStringFilter

val distance = JaroWinklerMetric.compare("string1", "string2")
    (new StringFilterDelegate with AsciiLetterCaseStringFilter)

if (distance >= 0.9) println("It's likely you're a match!")
```

Compound filter example. Import metric, compare with two filters, do something with result. Filters are applied in reverse order!
```scala
import org.hashtree.stringmetric.similarity.{ JaroWinklerMetric, StringFilterDelegate }
import org.hashtree.stringmetric.filter.{ AsciiLetterCaseStringFilter, AsciiLetterOnlyStringFilter }

val distance = JaroWinklerMetric.compare("string1", "string2")
    (new StringFilterDelegate with AsciiLetterCaseStringFilter with AsciiLetterOnlyStringFilter)

if (distance >= 0.9) println("It's likely you're a match!")`
```

All string metrics and algorithms have overloaded methods which accept character arrays.
```scala
import org.hashtree.stringmetric.similarity.JaroWinklerMetric
  
val distance = JaroWinklerMetric.compare("string1".toCharArray, "string2".toCharArray)

if (distance >= 0.9) println("It's likely you're a match!")
```

## Using the CLI
Uncompress the built tar and ensure you have ability to execute the commands. Execute the metric of choice via the command line:

The help option prints command syntax and usage.
```shell
jaroWinklerMetric --help
metaphoneMetric --help
metaphoneAlgorithm --help
```

Compare "abc" to "xyz" using the Jaro-Winkler metric.
```shell
jaroWinklerMetric abc xyz
```

Compare "abc "to "xyz" using the Metaphone metric.
```shell
metaphoneMetric abc xyz
```

Get the phonetic representation of "abc" via the metaphone phonetic algorithm.
```shell 
metaphoneAlgorithm abc
```

## Requirements
* Scala 2.9.2
* Gradle 1.0 or above

## Versioning
[Semantic Versioning 2.0.0](http://semver.org/)

## License
[Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)