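Edit Distance (Levenshtein Distance)
Anyone working in natural language processing will probably be familiar with this concept. Edit
distance is the minimum number of insertions, deletions, and substitutions needed to transform a
source string (s) into a target string (t). It is widely used in NLP, for example in evaluation
metrics such as WER and mWER, and it is also commonly used to count the changes made to an
original text.
The algorithm was first proposed by the Russian scientist Levenshtein, which is why edit distance
is also called Levenshtein distance.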
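As an illustration of the evaluation use mentioned above, here is a minimal sketch of a word error rate (WER) computation. It assumes the common definition of WER as the word-level edit distance divided by the number of words in the reference; mWER is not covered. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length (assumed WER definition)."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Only two rows of the distance table are kept at a time, which is enough when just the final distance is needed.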
Levenshtein distance (LD) is a measure of the difference between two strings, which we will
refer to as the source string (s) and the target string (t). The distance is the minimum number of
deletions, insertions, or substitutions required to transform s into t. For example,
• If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The
strings are already identical.
• If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is
sufficient to transform s into t.
The greater the Levenshtein distance, the more different the strings are.
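The definition can be checked directly with a small recursive sketch. This is not the matrix algorithm; it is a memoized recursion over prefixes, chosen only because it mirrors the definition closely. The function name `ld_recursive` is an assumption made for this example, and the approach is only suitable for short strings because of Python's recursion limit.

```python
from functools import lru_cache


def ld_recursive(s: str, t: str) -> int:
    """Levenshtein distance computed from its recursive definition (memoized)."""

    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> int:
        # i and j are the lengths of the prefixes of s and t being compared.
        if i == 0:
            return j  # insert the j characters of t
        if j == 0:
            return i  # delete the i characters of s
        cost = 0 if s[i - 1] == t[j - 1] else 1
        return min(
            go(i - 1, j) + 1,         # deletion
            go(i, j - 1) + 1,         # insertion
            go(i - 1, j - 1) + cost,  # substitution, or match at no cost
        )

    return go(len(s), len(t))


print(ld_recursive("test", "test"))  # 0
print(ld_recursive("test", "tent"))  # 1
```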
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised
the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also
sometimes called edit distance.
The Levenshtein distance algorithm has been used in:
• Spell checking
• Speech recognition
• DNA analysis
• Plagiarism detection
The Algorithm
Steps

Step  Description
1     Set n to be the length of s.
      Set m to be the length of t.
      If n = 0, return m and exit.
      If m = 0, return n and exit.
      Construct a matrix containing 0..m rows and 0..n columns.
2     Initialize the first row to 0..n.
      Initialize the first column to 0..m.
3     Examine each character of s (i from 1 to n).
4     Examine each character of t (j from 1 to m).
5     If s[i] equals t[j], the cost is 0.
      If s[i] doesn't equal t[j], the cost is 1.
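The steps above can be sketched in Python roughly as follows. Steps 1 to 5 follow the table; the cell update and the final lookup are not among the steps listed here, so the sketch assumes the standard Levenshtein recurrence for them. The function name `levenshtein` and the matrix name `d` are chosen for this example, and the 1-based indices of the table are mapped onto Python's 0-based strings.

```python
def levenshtein(s: str, t: str) -> int:
    """Matrix-based Levenshtein distance, following steps 1 to 5 of the table."""
    # Step 1: lengths, trivial cases, and a matrix with rows 0..m and columns 0..n.
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    d = [[0] * (n + 1) for _ in range(m + 1)]

    # Step 2: first row is 0..n, first column is 0..m.
    for i in range(n + 1):
        d[0][i] = i
    for j in range(m + 1):
        d[j][0] = j

    # Steps 3 and 4: examine each character of s and of t.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Step 5: cost is 0 on a match, 1 otherwise (s[i] in the table is s[i - 1] here).
            cost = 0 if s[i - 1] == t[j - 1] else 1
            # Assumed (not in the steps listed above): standard recurrence for the cell.
            d[j][i] = min(
                d[j - 1][i] + 1,         # cell above: insertion
                d[j][i - 1] + 1,         # cell to the left: deletion
                d[j - 1][i - 1] + cost,  # diagonal: substitution or match
            )

    # Assumed: the distance is read from the bottom-right cell.
    return d[m][n]


print(levenshtein("test", "tent"))  # 1
```

The full matrix uses (n + 1) × (m + 1) cells; it is kept here so that the code stays close to the table.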