Winnowing: Local Algorithms for Document Fingerprinting

Saul Schleimer

MSCS

University of Illinois, Chicago

saul@math.uic.edu

Daniel S. Wilkerson

Computer Science Division

UC Berkeley

dsw@cs.berkeley.edu

Alex Aiken

Computer Science Division

UC Berkeley

aiken@cs.berkeley.edu

ABSTRACT

Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents.

We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing’s performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with MOSS, a widely-used plagiarism detection service.

1. INTRODUCTION

Digital documents are easily copied. A bit less obvious, perhaps, is the wide variety of different reasons for which digital documents are either completely or partially duplicated. People quote from each other’s email and news postings in their replies. Collaborators create multiple versions of documents, each of which is closely related to its immediate predecessor. Important Web sites are mirrored. More than a few students plagiarize their homework from the Web. Many authors of conference papers engage in a similar but socially more acceptable form of text reuse in preparing journal versions of their work. Many businesses, notably in the software and entertainment industries, are based on charging for each digital copy sold.

Comparing whole document checksums is simple and suffices for reliably detecting exact copies; however, detecting partial copies is subtler. Because of its many potential applications, this second problem has received considerable attention.
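As a minimal illustration of the checksum approach, and of why it misses partial copies, consider the sketch below. The choice of SHA-256, the helper name `checksum`, and the sample strings are assumptions made for illustration; the text does not prescribe a particular checksum.

```python
import hashlib


def checksum(document: str) -> str:
    """Whole-document checksum (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(document.encode("utf-8")).hexdigest()


original = "A do run run run, a do run run"
exact_copy = "A do run run run, a do run run"
partial_copy = "She sang: a do run run"

# An exact copy always produces an identical checksum.
print(checksum(original) == checksum(exact_copy))    # True

# A partial copy shares text with the original, yet its checksum differs,
# so the overlap goes unnoticed.
print(checksum(original) == checksum(partial_copy))  # False
```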

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SIGMOD 2003, June 9-12, 2003, San Diego, CA.

$5.00.

A do run run run, a do run run
(a) Some text from [7].

adorunrunrunadorunrun
(b) The text with irrelevant features removed.

adoru dorun orunr runru unrun nrunr runru unrun nruna runad unado nador adoru dorun orunr runru unrun
(c) The sequence of 5-grams derived from the text.

77 72 42 17 98 50 17 98 8 88 67 39 77 72 42 17 98
(d) A hypothetical sequence of hashes of the 5-grams.

72 8 88 72
(e) The sequence of hashes selected using 0 mod 4.

Figure 1: Fingerprinting some sample text.

Most previous techniques for detecting partial copies, which we discuss in more detail in Section 2, make use of the following idea. A k-gram is a contiguous substring of length k. Divide a document into k-grams, where k is a parameter chosen by the user. For example, Figure 1(c) contains all the 5-grams of the string of characters in Figure 1(b). Note that there are almost as many k-grams as there are characters in the document, as every position in the document (except for the last k − 1 positions) marks the beginning of a k-gram. Now hash each k-gram and select some subset of these hashes to be the document’s fingerprints. In all practical approaches, the set of fingerprints is a small subset of the set of all k-gram hashes. A fingerprint also contains positional information, which we do not show, describing the document and the location within that document that the fingerprint came from. If the hash function is chosen so that the probability of collisions is very small, then whenever two documents share one or more fingerprints, it is extremely likely that they share a k-gram as well.
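The k-gram step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `normalize` keeps only letters and lowercases them, which reproduces Figure 1(b) but is just one possible notion of "irrelevant features"; `hash_kgram` uses a truncated SHA-1, chosen only for illustration. All function names here are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    """Drop irrelevant features (assumed here: non-letters and case)."""
    return "".join(ch.lower() for ch in text if ch.isalpha())


def kgrams(text: str, k: int) -> list[str]:
    """All contiguous substrings of length k, in document order."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def hash_kgram(gram: str) -> int:
    """Illustrative hash: the first 8 bytes of SHA-1 as an integer."""
    digest = hashlib.sha1(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


text = normalize("A do run run run, a do run run")
grams = kgrams(text, 5)

print(text)                   # adorunrunrunadorunrun
print(grams[:3])              # ['adoru', 'dorun', 'orunr']
print(len(text), len(grams))  # 21 17

# Pair each hash with its position so a match can be located later.
hashes = [(pos, hash_kgram(g)) for pos, g in enumerate(grams)]
```

A 21-character string yields 21 − 5 + 1 = 17 5-grams, one for every position except the last k − 1 = 4.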

For efficiency, only a subset of the hashes should be retained as the document’s fingerprints. One popular approach is to choose all hashes that are 0 mod p, for some fixed p. This approach is easy to implement and retains only 1/p of all hashes as fingerprints (Section 2). Meaningful measures of document similarity can also be derived from the number of fingerprints shared between documents [5].
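The 0 mod p rule can be sketched as follows, using the hypothetical hash values from Figure 1(d) with p = 4. The function name `select_mod_p` is an assumption for illustration, and the positions it returns stand in for the positional information that the figure omits.

```python
def select_mod_p(hashes: list[int], p: int) -> list[tuple[int, int]]:
    """Keep (position, hash) for every hash that is 0 mod p."""
    return [(pos, h) for pos, h in enumerate(hashes) if h % p == 0]


# Hypothetical hashes of the 5-grams, copied from Figure 1(d).
figure_hashes = [77, 72, 42, 17, 98, 50, 17, 98, 8,
                 88, 67, 39, 77, 72, 42, 17, 98]

fingerprints = select_mod_p(figure_hashes, 4)

print([h for _, h in fingerprints])      # [72, 8, 88, 72]
print([pos for pos, _ in fingerprints])  # [1, 8, 9, 13]
```

The selected hashes match Figure 1(e). Here 4 of the 17 hashes are kept, close to the expected fraction 1/p = 1/4.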

A disadvantage of this method is that it gives no guarantee that matches between documents are detected: a k-gram shared between documents is detected only if its hash is 0 mod p. Consider the sequence of hashes generated by hashing all k-grams of a file in order. Call the distance between consecutive selected fingerprints the gap between them. If fingerprints are selected 0 mod p, the maximum gap between two fingerprints is unbounded and any
