没有合适的资源?快使用搜索试试~ 我知道了~
The one about moss
4星 · 超过85%的资源 需积分: 14 8 下载量 135 浏览量
2013-07-19
08:24:01
上传
评论
收藏 152KB PDF 举报
温馨提示
试读
10页
moss是斯坦福大学开发的软件,用来查多个program中的相似程度从而避免抄袭现象。
资源推荐
资源详情
资源评论
Winnowing: Local Algorithms for Document Fingerprinting
Saul Schleimer
MSCS
University of Illinois, Chicago
saul@math.uic.edu
Daniel S. Wilkerson
Computer Science Division
UC Berkeley
dsw@cs.berkeley.edu
Alex Aiken
Computer Science Division
UC Berkeley
aiken@cs.berkeley.edu
ABSTRACT
Digital content is for copying: quotation, revision, plagiarism, and
file sharing all create copies. Document fingerprinting is concerned
with accurately identifying copying, including small partial copies,
within large sets of documents.
We introduce the class of local document fingerprinting algo-
rithms, which seems to capture an essential property of any finger-
printing technique guaranteed to detect copies. We prove a novel
lower bound on the performance of any local algorithm. We also
develop winnowing, an efficient local fingerprinting algorithm, and
show that winnowing’s performance is within 33% of the lower
bound. Finally, we also give experimental results on Web data, and
report experience with M
OSS, a widely-used plagiarism detection
service.
1. INTRODUCTION
Digital documents are easily copied. A bit less obvious, perhaps,
is the wide variety of different reasons for which digital documents
are either completely or partially duplicated. People quote from
each other’s email and news postings in their replies. Collaborators
create multiple versions of documents, each of which is closely
related to its immediate predecessor. Important Web sites are mir-
rored. More than a few students plagiarize their homework from
the Web. Many authors of conference papers engage in a similar
but socially more acceptable form of text reuse in preparing journal
versions of their work. Many businesses, notably in the software
and entertainment industries, are based on charging for each digital
copy sold.
Comparing whole document checksums is simple and suffices
for reliably detecting exact copies; however, detecting partial copies
is subtler. Because of its many potential applications, this second
problem has received considerable attention.
Most previous techniques for detecting partial copies, which we
discuss in more detail in Section 2, make use of the following idea.
A k-gram is a contiguous substring of length k. Divide a docu-
ment into k-grams, where k is a parameter chosen by the user. For
example, Figure 1(c) contains all the 5-grams of the string of char-
acters in Figure 1(b). Note that there are almost as many k-grams
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD 2003, June 9-12, 2003, San Diego, CA.
Copyright 2003 ACM 1-58113-634-X/03/06 ...
$5.00.
A do run run run, a do run run
(a) Some text from [7].
adorunrunrunadorunrun
(b) The text with irrelevant features removed.
adoru dorun orunr runru unrun nrunr runru
unrun nruna runad unado nador adoru dorun
orunr runru unrun
(c) The sequence of 5-grams derived from the text.
77 72 42 17 98 50 17 98 8 88 67 39 77 72 42
17 98
(d) A hypothetical sequence of hashes of the 5-grams.
72 8 88 72
(e) The sequence of hashes selected using 0 mod 4.
Figure 1: Fingerprinting some sample text.
as there are characters in the document, as every position in the
document (except for the last k − 1 positions) marks the begin-
ning of a k-gram. Now hash each k-gram and select some subset
of these hashes to be the document’s fingerprints. In all practical
approaches, the set of fingerprints is a small subset of the set of all
k-gram hashes. A fingerprint also contains positional information,
which we do not show, describing the document and the location
within that document that the fingerprint came from. If the hash
function is chosen so that the probability of collisions is very small,
then whenever two documents share one or more fingerprints, it is
extremely likely that they share a k-gram as well.
For efficiency, only a subset of the hashes should retained as
the document’s fingerprints. One popular approach is to choose all
hashes that are 0 mod p, for some fixed p. This approach is easy to
implement and retains only 1/p of all hashes as fingerprints (Sec-
tion 2). Meaningful measures of document similarity can also be
derived from the number of fingerprints shared between documents
[5].
A disadvantage of this method is that it gives no guarantee that
matches between documents are detected: a k-gram shared be-
tween documents is detected only if its hash is 0 mod p. Consider
the sequence of hashes generated by hashing all k-grams of a file
in order. Call the distance between consecutive selected finger-
prints the gap between them. If fingerprints are selected 0 mod p,
the maximum gap between two fingerprints is unbounded and any
资源评论
- 小史哥2013-09-22论文很清晰,就是不好懂
thunderink
- 粉丝: 0
- 资源: 4
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功