#java-string-similarity
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-string-similarity/badge.svg)](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-string-similarity) [![Build Status](https://travis-ci.org/tdebatty/java-string-similarity.svg?branch=master)](https://travis-ci.org/tdebatty/java-string-similarity) [![Coverage Status](https://coveralls.io/repos/tdebatty/java-string-similarity/badge.svg?branch=master&service=github)](https://coveralls.io/github/tdebatty/java-string-similarity?branch=master) [![API Documentation](http://api123.web-d.be/api123-head.svg)](http://api123.web-d.be/api/java-string-similarity/head/index.html)
A library implementing different string similarity and distance measures. A dozen of algorithms (including Levenshtein edit distance and sibblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity etc.) are currently implemented. Check the summary table below for the complete list...
* [Download](#download)
* [Overview](#overview)
* [Normalized, metric, similarity and distance](#normalized-metric-similarity-and-distance)
* [Shingles (n-gram) based similarity and distance](#shingles-n-gram-based-similarity-and-distance)
* [Levenshtein](#levenshtein)
* [Normalized Levenshtein](#normalized-levenshtein)
* [Weighted Levenshtein](#weighted-levenshtein)
* [Damerau-Levenshtein](#damerau-levenshtein)
* [Optimal String Alignment](#optimal-string-alignment)
* [Jaro-Winkler](#jaro-winkler)
* [Longest Common Subsequence](#longest-common-subsequence)
* [Metric Longest Common Subsequence](#metric-longest-common-subsequence)
* [N-Gram](#n-gram)
* [Shingle (n-gram) based algorithms](#shingle-n-gram-based-algorithms)
* [Q-Gram](#shingle-n-gram-based-algorithms)
* [Cosine similarity](#shingle-n-gram-based-algorithms)
* [Jaccard index](#shingle-n-gram-based-algorithms)
* [Sorensen-Dice coefficient](#shingle-n-gram-based-algorithms)
* [Experimental](#experimental)
* [SIFT4](#sift4)
* [Users](#users)
## Download
Using maven:
```
<dependency>
<groupId>info.debatty</groupId>
<artifactId>java-string-similarity</artifactId>
<version>RELEASE</version>
</dependency>
```
Or check the [releases](https://github.com/tdebatty/java-string-similarity/releases).
## Overview
The main characteristics of each implemented algorithm are presented below. The "cost" column gives an estimation of the computational cost to compute the similarity between two strings of length m and n respectively.
| | | Normalized? | Metric? | Type | Cost |
|-------- |------- |------------- |---------- | ------ | ---- |
| [Levenshtein](#levenshtein) |distance | No | Yes | | O(m*n) <sup>1</sup> |
| [Normalized Levenshtein](#normalized-levenshtein) |distance<br>similarity | Yes | No | | O(m*n) <sup>1</sup> |
| [Weighted Levenshtein](#weighted-levenshtein) |distance | No | No | | O(m*n) <sup>1</sup> |
| [Damerau-Levenshtein](#damerau-levenshtein) <sup>3</sup> |distance | No | Yes | | O(m*n) <sup>1</sup> |
| [Optimal String Alignment](#optimal-string-alignment) <sup>3</sup> |distance | No | No | | O(m*n) <sup>1</sup> |
| [Jaro-Winkler](#jaro-winkler) |similarity<br>distance | Yes | No | | O(m*n) |
| [Longest Common Subsequence](#longest-common-subsequence) |distance | No | No | | O(m*n) <sup>1,2</sup> |
| [Metric Longest Common Subsequence](#metric-longest-common-subsequence) |distance | Yes | Yes | | O(m*n) <sup>1,2</sup> |
| [N-Gram](#n-gram) |distance | Yes | No | | O(m*n) |
| [Q-Gram](#q-gram) |distance | No | No | Profile | O(m+n) |
| [Cosine similarity](#cosine-similarity) |similarity<br>distance | Yes | No | Profile | O(m+n) |
| [Jaccard index](#jaccard-index) |similarity<br>distance | Yes | Yes | Set | O(m+n) |
| [Sorensen-Dice coefficient](#sorensen-dice-coefficient) |similarity<br>distance | Yes | No | Set | O(m+n) |
[1] In this library, Levenshtein edit distance, LCS distance and their sibblings are computed using the **dynamic programming** method, which has a cost O(m.n). For Levenshtein distance, the algorithm is sometimes called **Wagner-Fischer algorithm** ("The string-to-string correction problem", 1974). The original algorithm uses a matrix of size m x n to store the Levenshtein distance between string prefixes.
If the alphabet is finite, it is possible to use the **method of four russians** (Arlazarov et al. "On economic construction of the transitive closure of a directed graph", 1970) to speedup computation. This was published by Masek in 1980 ("A Faster Algorithm Computing String Edit Distances"). This method splits the matrix in blocks of size t x t. Each possible block is precomputed to produce a lookup table. This lookup table can then be used to compute the string similarity (or distance) in O(nm/t). Usually, t is choosen as log(m) if m > n. The resulting computation cost is thus O(mn/log(m)). This method has not been implemented (yet).
[2] In "Length of Maximal Common Subsequences", K.S. Larsen proposed an algorithm that computes the length of LCS in time O(log(m).log(n)). But the algorithm has a memory requirement O(m.n²) and was thus not implemented here.
[3] There are two variants of Damerau-Levenshtein string distance: Damerau-Levenshtein with adjacent transpositions (also sometimes called unrestricted Damerau–Levenshtein distance) and Optimal String Alignment (also sometimes called restricted edit distance). For Optimal String Alignment, no substring can be edited more than once.
## Normalized, metric, similarity and distance
Although the topic might seem simple, a lot of different algorithms exist to measure text similarity or distance. Therefore the library defines some interfaces to categorize them.
### (Normalized) similarity and distance
- StringSimilarity : Implementing algorithms define a similarity between strings (0 means strings are completely different).
- NormalizedStringSimilarity : Implementing algorithms define a similarity between 0.0 and 1.0, like Jaro-Winkler for example.
- StringDistance : Implementing algorithms define a distance between strings (0 means strings are identical), like Levenshtein for example. The maximum distance value depends on the algorithm.
- NormalizedStringDistance : This interface extends StringDistance. For implementing classes, the computed distance value is between 0.0 and 1.0. NormalizedLevenshtein is an example of NormalizedStringDistance.
Generally, algorithms that implement NormalizedStringSimilarity also implement NormalizedStringDistance, and similarity = 1 - distance. But there are a few exceptions, like N-Gram similarity and distance (Kondrak)...
### Metric distances
The MetricStringDistance interface : A few of the distances are actually metric distances, which means that verify the triangle inequality d(x, y) <= d(x,z) + d(z,y). For example, Levenshtein is a metric distance, but NormalizedLevenshtein is not.
A lot of nearest-neighbor search algorithms and indexing structures rely on the triangle inequality. You can check "Similarity Search, The Metric Space Approach" by Zezula et al. for a survey. These cannot be used with non metric similarity measures.
[Read Javadoc for a detailed description](http://api123.web-d.be/api/java-string-similarity/head/index.html)
## Shingles (n-gram) based similarity and distance
A few algorithms work by converting strings into sets of n-grams (sequences of n characters, also sometimes called k-shingles). The similarity or distance between the strings is then the similarity or distance between the sets.
Some ot them, like jaccard, consider strings as sets of shingles, and don't consider the number of occurences of each shingle. Others, like cosine similarity, work using what is sometimes called the profile of the strings, which takes into
没有合适的资源?快使用搜索试试~ 我知道了~
温馨提示
code for finding similarity between programs the text that has been entered so can find similar items
资源详情
资源评论
资源推荐
收起资源包目录
java-string-similarity-master.zip (49个子文件)
java-string-similarity-master
.travis.yml 162B
pom.xml 6KB
src
test
resources
11328-1.txt 2.33MB
71816-2.txt 5KB
java
info
debatty
java
stringsimilarity
OptimalStringAlignmentTest.java 3KB
experimental
Sift4Test.java 2KB
LongestCommonSubsequenceTest.java 2KB
SorensenDiceTest.java 2KB
JaccardTest.java 2KB
LevenshteinTest.java 2KB
CosineTest.java 3KB
NGramTest.java 2KB
DamerauTest.java 2KB
JaroWinklerTest.java 2KB
QGramTest.java 2KB
utils
SparseIntegerVectorTest.java 3KB
SparseDoubleVectorTest.java 7KB
main
java
info
debatty
java
stringsimilarity
LongestCommonSubsequence.java 3KB
NormalizedLevenshtein.java 2KB
MetricLCS.java 2KB
OptimalStringAlignment.java 4KB
experimental
Sift4.java 7KB
CharacterSubstitutionInterface.java 2KB
QGram.java 3KB
SorensenDice.java 4KB
Damerau.java 4KB
Cosine.java 5KB
Jaccard.java 4KB
examples
MetricLCS.java 2KB
PrecomputedCosine.java 2KB
Examples.java 8KB
SparseDoubleVectorExample.java 3KB
interfaces
NormalizedStringSimilarity.java 1KB
MetricStringDistance.java 2KB
NormalizedStringDistance.java 1KB
StringSimilarity.java 434B
StringDistance.java 1KB
JaroWinkler.java 4KB
NGram.java 4KB
WeightedLevenshtein.java 4KB
Levenshtein.java 3KB
ShingleBased.java 4KB
utils
SparseBooleanVector.java 4KB
SparseIntegerVector.java 8KB
SparseDoubleVector.java 10KB
LICENSE.md 1KB
.gitignore 43B
README.md 22KB
nbactions.xml 2KB
共 49 条
- 1
四散
- 粉丝: 49
- 资源: 1万+
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功
评论0