# stringmetric
A collection of string metrics and phonetic algorithms implemented in Scala. Every phonetic string metric has a standalone algorithm counterpart: the algorithm returns the phonetic representation of the argument passed, whereas the metric evaluates whether two arguments sound the same phonetically. __Each metric and algorithm has a CLI.__

## Metrics and Phonetic Algorithms
* __[Dice / Sorensen](http://en.wikipedia.org/wiki/Dice%27s_coefficient)__
	* API: org.hashtree.stringmetric.similarity.DiceSorensenMetric
	* CLI: diceSorensenMetric
* __[Hamming](http://en.wikipedia.org/wiki/Hamming_distance)__
	* API: org.hashtree.stringmetric.similarity.HammingMetric
	* CLI: hammingMetric
* __[Jaro](http://en.wikipedia.org/wiki/Jaro-Winkler_distance)__
	* API: org.hashtree.stringmetric.similarity.JaroMetric
	* CLI: jaroMetric
* __[Jaro-Winkler](http://en.wikipedia.org/wiki/Jaro-Winkler_distance)__
	* API: org.hashtree.stringmetric.similarity.JaroWinklerMetric
	* CLI: jaroWinklerMetric
* __[Levenshtein](http://en.wikipedia.org/wiki/Levenshtein_distance)__
	* API: org.hashtree.stringmetric.similarity.LevenshteinMetric
	* CLI: levenshteinMetric
* __[Metaphone](http://en.wikipedia.org/wiki/Metaphone)__
	* API: org.hashtree.stringmetric.phonetic.MetaphoneMetric and org.hashtree.stringmetric.phonetic.MetaphoneAlgorithm
	* CLI: metaphoneMetric and metaphoneAlgorithm
* __[NYSIIS](http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System)__
	* API: org.hashtree.stringmetric.phonetic.NysiisMetric and org.hashtree.stringmetric.phonetic.NysiisAlgorithm
	* CLI: nysiisMetric and nysiisAlgorithm
* __[Refined Soundex](http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html)__
	* API: org.hashtree.stringmetric.phonetic.RefinedSoundexMetric and org.hashtree.stringmetric.phonetic.RefinedSoundexAlgorithm
	* CLI: refinedSoundexMetric and refinedSoundexAlgorithm
* __[Soundex](http://en.wikipedia.org/wiki/Soundex)__
	* API: org.hashtree.stringmetric.phonetic.SoundexMetric and org.hashtree.stringmetric.phonetic.SoundexAlgorithm
	* CLI: soundexMetric and soundexAlgorithm
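
To make the metric/algorithm distinction concrete, here is a standalone sketch of a simplified American Soundex. It is written for illustration only and is not the library's implementation: the object name `SoundexSketch`, the `Option` return types, and the decision to drop non-letters are all assumptions. `encode` plays the role of an algorithm (one argument in, phonetic code out), and `compare` plays the role of a metric (two arguments in, do they share a code?).

```scala
// Illustrative only: a simplified American Soundex, not stringmetric's own code.
object SoundexSketch {
  // Digit for a coded consonant; None for vowels, y, h, w and anything else.
  private def code(c: Char): Option[Char] = c match {
    case 'b' | 'f' | 'p' | 'v' => Some('1')
    case 'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => Some('2')
    case 'd' | 't' => Some('3')
    case 'l' => Some('4')
    case 'm' | 'n' => Some('5')
    case 'r' => Some('6')
    case _ => None
  }

  // "Algorithm": the phonetic representation of a single argument.
  def encode(s: String): Option[String] = {
    // Assumption: anything that is not an ASCII letter is dropped.
    val letters = s
      .filter(c => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
      .map(_.toLower)
    if (letters.isEmpty) None
    else {
      val sb = new StringBuilder
      sb.append(letters.head.toUpper)
      // Code of the previous letter, used to collapse adjacent duplicates.
      var last: Option[Char] = code(letters.head)
      for (c <- letters.tail) {
        if (sb.length < 4) {
          code(c) match {
            case Some(d) =>
              if (last != Some(d)) sb.append(d)
              last = Some(d)
            case None =>
              // Vowels (and y) separate duplicates; h and w do not.
              if (c != 'h' && c != 'w') last = None
          }
        }
      }
      // Pad short codes with zeros to the usual length of four.
      Some(sb.toString.padTo(4, '0'))
    }
  }

  // "Metric": do two arguments share the same phonetic representation?
  def compare(a: String, b: String): Option[Boolean] =
    for (x <- encode(a); y <- encode(b)) yield x == y
}
```

With this sketch, `SoundexSketch.encode("Robert")` yields `Some("R163")`, and `SoundexSketch.compare("Robert", "Rupert")` yields `Some(true)` because both names encode to the same value.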

## Filters
Filters, which can optionally be applied, clean up arguments prior to evaluation. Filtering rules can be composed via trait decoration.

* __Ensures only ASCII control characters matter__
	* API: org.hashtree.stringmetric.filter.AsciiControlOnlyStringFilter
* __Ensures ASCII controls do not matter__
	* API: org.hashtree.stringmetric.filter.AsciiControlStringFilter
* __Ensures ASCII letter case-sensitivity does not matter__
	* API: org.hashtree.stringmetric.filter.AsciiLetterCaseStringFilter
* __Ensures only ASCII letters and numbers matter__
	* API: org.hashtree.stringmetric.filter.AsciiLetterNumberOnlyStringFilter
* __Ensures ASCII letters and numbers do not matter__
	* API: org.hashtree.stringmetric.filter.AsciiLetterNumberStringFilter
* __Ensures only ASCII letters matter__
	* API: org.hashtree.stringmetric.filter.AsciiLetterOnlyStringFilter
* __Ensures ASCII letters do not matter__
	* API: org.hashtree.stringmetric.filter.AsciiLetterStringFilter
* __Ensures only ASCII numbers matter__
	* API: org.hashtree.stringmetric.filter.AsciiNumberOnlyStringFilter
* __Ensures ASCII numbers do not matter__
	* API: org.hashtree.stringmetric.filter.AsciiNumberStringFilter
* __Ensures ASCII spaces do not matter__
	* API: org.hashtree.stringmetric.filter.AsciiSpaceStringFilter
* __Ensures only ASCII symbols matter__
	* API: org.hashtree.stringmetric.filter.AsciiSymbolOnlyStringFilter
* __Ensures ASCII symbols do not matter__
	* API: org.hashtree.stringmetric.filter.AsciiSymbolStringFilter
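
Composition via trait decoration can be sketched with Scala's stackable trait pattern. The names below (`StringFilter`, `IdentityFilter`, `LettersOnly`, `TakeThree`, `FilterOrderSketch`) are hypothetical and are not part of stringmetric. The sketch also assumes that each filter transforms its input before delegating to `super`; under that assumption the rightmost trait in the `with` chain runs first, so the order of mixing in matters.

```scala
// Hypothetical names throughout; this only illustrates stackable traits.
trait StringFilter {
  def filter(s: String): String
}

// Base of the stack: returns its input unchanged.
class IdentityFilter extends StringFilter {
  def filter(s: String): String = s
}

// Keeps ASCII letters only, then hands the result to the next filter.
trait LettersOnly extends StringFilter {
  abstract override def filter(s: String): String =
    super.filter(s.filter(c => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')))
}

// Keeps the first three characters, then hands the result to the next filter.
trait TakeThree extends StringFilter {
  abstract override def filter(s: String): String =
    super.filter(s.take(3))
}

object FilterOrderSketch {
  // Rightmost trait runs first: strip non-letters, then truncate.
  val lettersThenTake: StringFilter = new IdentityFilter with TakeThree with LettersOnly

  // Swapping the traits swaps the order: truncate, then strip non-letters.
  val takeThenLetters: StringFilter = new IdentityFilter with LettersOnly with TakeThree
}
```

For the input `"a1b2c3"`, `lettersThenTake.filter` returns `"abc"` while `takeThenLetters.filter` returns `"ab"`, which shows why the mixing order has to be chosen deliberately.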

## Versioning
[Semantic Versioning 2.0.0](http://semver.org/)

## Building the API
    gradle :stringmetric-core:jar

## Building the CLI
    gradle :stringmetric-cli:tar

## Using the API
    // Simple example. Import metric, compare, do something with result. 
    import org.hashtree.stringmetric.similarity.JaroWinklerMetric  
  
    val distance = JaroWinklerMetric.compare("string1", "string2")

    if (distance >= 0.9) println("It's likely you're a match!")

*****

    // One filter example. Import metric, compare with one filter, do something with result.
    import org.hashtree.stringmetric.similarity.{ JaroWinklerMetric, StringFilterDelegate }
    import org.hashtree.stringmetric.filter.AsciiLetterCaseStringFilter

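    // Keep the filter argument list on the same line as the call; Scala ends the statement at the line break otherwise.
    val distance = JaroWinklerMetric.compare("string1", "string2")(new StringFilterDelegate with AsciiLetterCaseStringFilter)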

    if (distance >= 0.9) println("It's likely you're a match!")

*****

    // Compound filter example. Import metric, compare with two filters, do something with result. Filters are applied in reverse order!
    import org.hashtree.stringmetric.similarity.{ JaroWinklerMetric, StringFilterDelegate }
    import org.hashtree.stringmetric.filter.{ AsciiLetterCaseStringFilter, AsciiLetterOnlyStringFilter }

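    // Keep the filter argument list on the same line as the call; Scala ends the statement at the line break otherwise.
    val distance = JaroWinklerMetric.compare("string1", "string2")(new StringFilterDelegate with AsciiLetterCaseStringFilter with AsciiLetterOnlyStringFilter)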

    if (distance >= 0.9) println("It's likely you're a match!")

*****

    // All string metrics and algorithms have overloaded methods which accept character arrays.
    import org.hashtree.stringmetric.similarity.JaroWinklerMetric
  
    val distance = JaroWinklerMetric.compare("string1".toCharArray, "string2".toCharArray)

    if (distance >= 0.9) println("It's likely you're a match!")
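
One way such paired overloads can be organised is for the `String` version to delegate to the character-array version. The following is a minimal sketch using Levenshtein distance; `LevenshteinSketch` and its `Int` return type are assumptions made for illustration and do not describe how stringmetric implements its metrics.

```scala
// Illustrative only: not stringmetric's implementation.
object LevenshteinSketch {
  // Core overload: edit distance between two character arrays,
  // computed row by row so only two rows are held at a time.
  def compare(a: Array[Char], b: Array[Char]): Int = {
    // Row 0: turning an empty prefix of `a` into b's first j characters costs j insertions.
    var prev: Array[Int] = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i // deleting the first i characters of `a`
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        val deletion = prev(j) + 1
        val insertion = curr(j - 1) + 1
        val substitution = prev(j - 1) + cost
        curr(j) = math.min(math.min(deletion, insertion), substitution)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Convenience overload: strings are converted once and delegated.
  def compare(a: String, b: String): Int =
    compare(a.toCharArray, b.toCharArray)
}
```

Both `LevenshteinSketch.compare("kitten", "sitting")` and `LevenshteinSketch.compare("kitten".toCharArray, "sitting".toCharArray)` return `3`.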

## Using the CLI
Uncompress the built tar and make sure you have permission to execute the commands. Then run the metric or algorithm of your choice from the command line:

    // The help option prints command syntax and usage.
    jaroWinklerMetric --help
    metaphoneMetric --help
    metaphoneAlgorithm --help

*****

    // Compare "abc" to "xyz" using the Jaro-Winkler metric.
    jaroWinklerMetric abc xyz

*****

    // Compare "abc" to "xyz" using the Metaphone metric.
    metaphoneMetric abc xyz

*****

    // Get the phonetic representation of "abc" via the Metaphone phonetic algorithm.
    metaphoneAlgorithm abc

## Requirements
* Scala 2.9.2
* Gradle 1.0 or above

## License
[Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)