{"name":"stringmetric","tagline":"String metrics and phonetic algorithms for Scala.","body":"The library provides facilities to perform approximate string matching, measurement of string similarity/distance, indexing by word pronunciation, and sounds-like comparisons. In addition to the core library, each metric and algorithm has a command line interface.\r\n\r\n* __Requirements:__ Scala 2.10+\r\n* __Documentation:__ [Scaladoc](http://rockymadden.com/stringmetric/scaladoc/)\r\n* __Issues:__ [Enhancements](https://github.com/rockymadden/stringmetric/issues?labels=accepted%2Cenhancement&page=1&state=open), [Questions](https://github.com/rockymadden/stringmetric/issues?labels=accepted%2Cquestion&page=1&state=open), [Bugs](https://github.com/rockymadden/stringmetric/issues?labels=accepted%2Cbug&page=1&state=open)\r\n* __Versioning:__ [Semantic Versioning v2.0](http://semver.org/)\r\n\r\n## Metrics and algorithms\r\n* __[Dice / Sorensen](http://en.wikipedia.org/wiki/Dice%27s_coefficient)__ (Similarity metric)\r\n* __[Double Metaphone](http://en.wikipedia.org/wiki/Metaphone)__ ([Queued](https://github.com/rockymadden/stringmetric/issues/6) phonetic metric and algorithm)\r\n* __[Hamming](http://en.wikipedia.org/wiki/Hamming_distance)__ (Similarity metric)\r\n* __[Jaccard](http://en.wikipedia.org/wiki/Jaccard_index)__ (Similarity metric)\r\n* __[Jaro](http://en.wikipedia.org/wiki/Jaro-Winkler_distance)__ (Similarity metric)\r\n* __[Jaro-Winkler](http://en.wikipedia.org/wiki/Jaro-Winkler_distance)__ (Similarity metric)\r\n* __[Levenshtein](http://en.wikipedia.org/wiki/Levenshtein_distance)__ (Similarity metric)\r\n* __[Metaphone](http://en.wikipedia.org/wiki/Metaphone)__ (Phonetic metric and algorithm)\r\n* __[Monge-Elkan](http://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf)__ ([Queued](https://github.com/rockymadden/stringmetric/issues/7) similarity metric)\r\n* __[Match Rating Approach](http://en.wikipedia.org/wiki/Match_rating_approach)__ 
([Queued](https://github.com/rockymadden/stringmetric/issues/8) phonetic metric and algorithm)\r\n* __[Needleman-Wunsch](http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm)__ ([Queued](https://github.com/rockymadden/stringmetric/issues/9) similarity metric)\r\n* __[N-Gram](http://en.wikipedia.org/wiki/N-gram)__ (Similarity metric)\r\n* __[NYSIIS](http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System)__ (Phonetic metric and algorithm)\r\n* __[Overlap](http://en.wikipedia.org/wiki/Overlap_coefficient)__ (Similarity metric)\r\n* __[Ratcliff-Obershelp](http://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html)__ (Similarity metric)\r\n* __[Refined NYSIIS](http://www.markcrocker.com/rexxtipsntricks/rxtt28.2.0482.html)__ (Phonetic metric and algorithm)\r\n* __[Refined Soundex](http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html)__ (Phonetic metric and algorithm)\r\n* __[Tanimoto](http://en.wikipedia.org/wiki/Tanimoto_coefficient)__ ([Queued](https://github.com/rockymadden/stringmetric/issues/10) similarity metric)\r\n* __[Tversky](http://en.wikipedia.org/wiki/Tversky_index)__ ([Queued](https://github.com/rockymadden/stringmetric/issues/16) similarity metric)\r\n* __[Smith-Waterman](http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)__ ([Queued](https://github.com/rockymadden/stringmetric/issues/11) similarity metric)\r\n* __[Soundex](http://en.wikipedia.org/wiki/Soundex)__ (Phonetic metric and algorithm)\r\n* __Weighted Levenshtein__ (Similarity metric)\r\n\r\n\r\n## Depending upon\r\n\r\n__SBT:__\r\n```scala\r\nlibraryDependencies += \"com.rockymadden.stringmetric\" %% \"stringmetric-core\" % \"0.27.4\"\r\n```\r\n\r\n---\r\n\r\n__Gradle:__\r\n```groovy\r\ncompile 
'com.rockymadden.stringmetric:stringmetric-core_2.10:0.27.4'\r\n```\r\n\r\n---\r\n\r\n__Maven:__\r\n```xml\r\n<dependency>\r\n\t<groupId>com.rockymadden.stringmetric</groupId>\r\n\t<artifactId>stringmetric-core_2.10</artifactId>\r\n\t<version>0.27.4</version>\r\n</dependency>\r\n```\r\n\r\n---\r\n\r\n## Similarity package\r\nUseful for approximate string matching and measurement of string distance. Most metrics calculate the similarity of two strings as a double with a value between 0 and 1. A value of 0 means completely different and a value of 1 means completely similar.\r\n\r\n---\r\n\r\n__Dice / Sorensen Metric:__\r\n```scala\r\nDiceSorensenMetric(1).compare(\"night\", \"nacht\") // 0.6\r\nDiceSorensenMetric(1).compare(\"context\", \"contact\") // 0.7142857142857143\r\n```\r\n<sup>Note you must specify the size of the n-gram you wish to use.</sup>\r\n\r\n---\r\n\r\n__Hamming Metric:__\r\n```scala\r\nHammingMetric.compare(\"toned\", \"roses\") // 3\r\nHammingMetric.compare(\"1011101\", \"1001001\") // 2\r\n```\r\n<sup>Note that, as an exception, integers rather than doubles are returned.</sup>\r\n\r\n---\r\n\r\n\r\n__Jaccard Metric:__\r\n```scala\r\nJaccardMetric(1).compare(\"night\", \"nacht\") // 0.3\r\nJaccardMetric(1).compare(\"context\", \"contact\") // 0.35714285714285715\r\n```\r\n<sup>Note you must specify the size of the n-gram you wish to use.</sup>\r\n\r\n\r\n---\r\n\r\n__Jaro Metric:__\r\n```scala\r\nJaroMetric.compare(\"dwayne\", \"duane\") // 0.8222222222222223\r\nJaroMetric.compare(\"jones\", \"johnson\") // 0.7904761904761904\r\nJaroMetric.compare(\"fvie\", \"ten\") // 0.0\r\n```\r\n\r\n---\r\n\r\n__Jaro-Winkler Metric:__\r\n```scala\r\nJaroWinklerMetric.compare(\"dwayne\", \"duane\") // 0.8400000000000001\r\nJaroWinklerMetric.compare(\"jones\", \"johnson\") // 0.8323809523809523\r\nJaroWinklerMetric.compare(\"fvie\", \"ten\") // 0.0\r\n```\r\n\r\n---\r\n\r\n__Levenshtein Metric:__\r\n```scala\r\nLevenshteinMetric.compare(\"sitting\", \"kitten\") 
// 3\r\nLevenshteinMetric.compare(\"cake\", \"drake\") // 2\r\n```\r\n<sup>Note that, as an exception, integers rather than doubles are returned.</sup>\r\n\r\n---\r\n\r\n\r\n__N-Gram Metric:__\r\n```scala\r\nNGramMetric(1).compare(\"night\", \"nacht\") // 0.6\r\nNGramMetric(2).compare(\"night\", \"nacht\") // 0.25\r\nNGramMetric(2).compare(\"context\", \"contact\") // 0.5\r\n```\r\n<sup>Note you must specify the size of the n-gram you wish to use.</sup>\r\n\r\n---\r\n\r\n__Overlap Metric:__\r\n```scala\r\nOverlapMetric(1).compare(\"night\", \"nacht\") // 0.6\r\nOverlapMetric(1).compare(\"context\", \"contact\") // 0.7142857142857143\r\n```\r\n<sup>Note you must specify the size of the n-gram you wish to use.</sup>\r\n\r\n---\r\n\r\n__Ratcliff/Obershelp Metric:__\r\n```scala\r\nRatcliffObershelpMetric.compare(\"aleksander\", \"alexandre\") // 0.7368421052631579\r\nRatcliffObershelpMetric.compare(\"pennsylvania\", \"pencilvaneya\") // 0.6666666666666666\r\n```\r\n\r\n---\r\n\r\n__Weighted Levenshtein Metric:__\r\n```scala\r\nWeightedLevenshteinMetric(10, 0.1, 1).compare(\"book\", \"back\") // 2\r\nWeightedLevenshteinMetric(10, 0.1, 1).compare(\"hosp\", \"hospital\") // 0.4\r\nWeightedLevenshteinMetric(10, 0.1, 1).compare(\"hospital\", \"hosp\") // 40\r\n```\r\n<sup>Note you must specify the weight of each operation, in the order delete, insert, substitute. While a double is returned, it can fall outside the range of 0 to 1, depending on the weights used.</sup>\r\n\r\n---\r\n\r\n## Phonetic package\r\nUseful for indexing by word pronunciation and performing sounds-like comparisons. All metrics return a boolean value indicating if the two strings sound the same, per the algorithm used. 
All metrics have an algorithm counterpart which provides the means to perform indexing by word pronunciation.\r\n\r\n---\r\n\r\n__Metaphone Metric:__\r\n```scala\r\nMetaphoneMetric.compare(\"merci\", \"mercy\") // true\r\nMetaphoneMetric.compare(\"dumb\", \"gum\") // false\r\n```\r\n\r\n---\r\n\r\n__Metaphone Algorithm:__\r\n```scala\r\nMetaphoneAlgorithm.compute(\"dumb\") // tm\r\nMetaphoneAlgorithm.compute(\"knuth\") // n0\r\n```\r\n\r\n---\r\n\r\n__NYSIIS Metric:__\r\n```scala\r\nNysiisMetric.compare(\"ham\", \"hum\") // true\r\nNysiisMetric.compare(\"dumb\", \"gum\") // false\r\n```\r\n\r\n---\r\n\r\n__NYSIIS Algorithm:__\r\n```scala\r\nNysiisAlgorithm.compute(\"macintosh\") // mcant\r\nNysiisAlgorithm.compute(\"knuth\") // nnat\r\n```\r\n\r\n---\r\n\r\n__Refined NYSIIS Metric:__\r\n```scala\r\nRefinedNysiisMetric.compare(\"ham\", \"hum\") // true\r\nRefinedNysiisMetric.compare(\"dumb\", \"gum\") // false\r\n```\r\n\r\n---\r\n\r\n__Refined NYSIIS Algorithm:__\r\n```scala\r\nRefinedNysiisAlgorithm.compute(\"macintosh\") // mcantas\r\nRefinedNysiisAlgorithm.compute(\"westerlund\") // wastarlad\r\n```\r\n\r\n---\r\n\r\n__Refined Soundex Metric:__\r\n```scala\r\nRefinedSoundexMetric.compare(\"robert\", \"rupert\") // true\r\nRefinedSoundexMetric.compare(\"robert\", \"rubin\") // false\r\n```\r\n\r\n---\r\n\r\n__Refined Soundex Algorithm:__\r\n```scala\r\nRefinedSoundexAlgorithm.compute(\"hairs\") // h093\r\nRefinedSoundexAlgorithm.compute(\"lambert\") // l7081096\r\n```\r\n\r\n---\r\n\r\n__Soundex Metric:__\r\n```scala\r\nSoundexMetric.compare(\"robert\", \"rupert\") // true\r\nSoundexMetric.compare(\"robert\", \"rubin\") // false\r\n```\r\n\r\n---\r\n\r\n__Soundex Algorithm:__\r\n```scala\r\nSoundexAlgorithm.compute(\"rupert\") // r163\r\nSoundexAlgorithm.compute(\"lukasiewicz\") // l222\r\n```\r\n\r\n---\r\n\r\n## Convenience 
objects\r\n\r\n__StringAlgorithm:__\r\n```scala\r\nStringAlgorithm.computeWithMetaphone(\"abcdef\")\r\nStringAlgorithm.computeWithNysiis(\"abcdef\")\r\n```\r\n\r\n---\r\n\r\n__StringMetric:__\r\n```scala\r\nStringMetric.compareWithJaccard(1)(\"abcdef\", \"abcxyz\")\r\nStringMetric.compareWithJaroWinkler(\"abcdef\", \"abcxyz\")\r\n```\r\n\r\n---\r\n\r\n## Decorating\r\nIt is possible to decorate algorithms and metrics with additional functionality, which you can mix and match. Decorations include:\r\n\r\n* __[withMemoization](https://en.wikipedia.org/wiki/Memoization):__ Computations and comparisons are cached. Future calls made with identical arguments will be looked up, rather than computed.\r\n\r\n* __withTransform:__ Transform arguments prior to computation/comparison. A handful of pre-built transforms are located in the [transform module](https://github.com/rockymadden/stringmetric/blob/master/core/src/main/scala/com/rockymadden/stringmetric/transform.scala).\r\n\r\n---\r\n\r\nNon-decorated:\r\n```scala\r\nMetaphoneAlgorithm.compute(\"abcdef\")\r\nMetaphoneMetric.compare(\"abcdef\", \"abcxyz\")\r\n```\r\n\r\n---\r\n\r\nUsing memoization:\r\n```scala\r\n(MetaphoneAlgorithm withMemoization).compute(\"abcdef\")\r\n```\r\n\r\n---\r\n\r\nUsing a transform so that we only examine alphabetical characters:\r\n```scala\r\n(MetaphoneAlgorithm withTransform filterAlpha).compute(\"abcdef\")\r\n(MetaphoneMetric withTransform filterAlpha).compare(\"abcdef\", \"abcxyz\")\r\n```\r\n\r\n---\r\n\r\nUsing a functionally composed transform so that we only examine alphabetical characters, but the case will not matter:\r\n```scala\r\nval composedTransform = (filterAlpha andThen ignoreAlphaCase)\r\n\r\n(MetaphoneAlgorithm withTransform composedTransform).compute(\"abcdef\")\r\n(MetaphoneMetric withTransform composedTransform).compare(\"abcdef\", \"abcxyz\")\r\n```\r\n\r\n---\r\n\r\nMaking your own transform:\r\n```scala\r\nval myTransform: StringTransform = (ca) => ca.filter(_ == 
'x')\r\n\r\n(MetaphoneAlgorithm withTransform myTransform).compute(\"abcdef\")\r\n(MetaphoneMetric withTransform myTransform).compare(\"abcdef\", \"abcxyz\")\r\n```\r\n\r\n---\r\n\r\nUsing memoization and a transform:\r\n```scala\r\n((MetaphoneAlgorithm withMemoization) withTransform filterAlpha).compute(\"abcdef\")\r\n```\r\n\r\n---\r\n\r\n## Building the CLIs\r\n```shell\r\n$ git clone https://github.com/rockymadden/stringmetric.git\r\n$ cd stringmetric\r\n$ sbt clean package\r\n$ ./project/build.sh\r\n$ ./target/cli/jarometric abc xyz\r\n```\r\n\r\n---\r\n\r\n## Using the CLIs\r\nGet help:\r\n```shell\r\n$ metaphonemetric --help\r\nCompares two strings to determine if they are phonetically similarly, per the Metaphone algorithm.\r\n\r\nSyntax:\r\n  metaphonemetric [Options] string1 string2...\r\n\r\nOptions:\r\n  -h, --help\r\n    Outputs description, syntax, and options.\r\n```\r\n\r\n---\r\n\r\nGet comparison value with metrics:\r\n```shell\r\n$ jarowinklermetric dog dawg\r\n0.75\r\n```\r\n\r\n---\r\n\r\nGet representation value with phonetic algorithms:\r\n```shell\r\n$ metaphonealgorithm dog\r\ntk\r\n```\r\n\r\n---\r\n\r\n## License\r\n```\r\nThe MIT License (MIT)\r\n\r\nCopyright (c) 2013 Rocky Madden (http://rockymadden.com/)\r\n\r\nPermission is hereby granted, free of charge, to any person obtaining a copy\r\nof this software and associated documentation files (the \"Software\"), to deal\r\nin the Software without restriction, including without limitation the rights\r\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\r\ncopies of the Software, and to permit persons to whom the Software is\r\nfurnished to do so, subject to the following conditions:\r\n\r\nThe above copyright notice and this permission notice shall be included in\r\nall copies or substantial portions of the Software.\r\n\r\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\r\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
MERCHANTABILITY,\r\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\r\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\r\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\r\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\r\nTHE SOFTWARE.\r\n```\r\n","google":"","note":"Don't delete this file! It's used internally to help with page regeneration."}