Abstract. We present the tool Kato which is, to the best of our knowledge,
the first tool for plagiarism detection that is directly tailored for answer-set programming
(ASP). Kato aims at finding similarities between (segments of) logic
programs to help detect cases of plagiarism. Currently, the tool is realised for
DLV programs but it is designed to handle various logic-programming syntax
versions. We review basic features and the underlying methodology of the tool.
1 Background
With the rise of the Internet and the easy access to information it affords, plagiarism is a growing
problem not only in academia but also in science and technology in general. In software
development, plagiarism involves copying (parts of) a program without revealing
the source it was copied from. The relevance of plagiarism detection for conventional
program development is well acknowledged [1]: it is not only motivated by an
academic setting to prevent students from violating good academic standards, but also
by the urge to retain the control of program code in industrial software development
projects.
We are concerned with plagiarism detection in the context of answer-set programming
(ASP) [2]. In particular, we deal with disjunctive logic programs under the answer-set
semantics [3]. Answer-set programming is characterised by the feature that problems
are encoded in terms of theories such that the solutions of a problem instance correspond
to certain models (the "answer sets") of the corresponding logic program. It differs from
imperative languages like C++ or Java (and also to some extent from Prolog) because
of its genuine declarative nature: a logic program is a specification rather than an instruction
of how to solve a problem; the order of the rules and the order of the literals
within the heads and bodies of the rules have no effect on the semantics of a program.
Hence, someone who copies code has other means to disguise the deed.
For conventional programming languages, sophisticated tools for plagiarism detection exist
(see, e.g., [4-8]). However, most of these techniques are not adequate for ASP. The reason is the
declarative nature of ASP as well as the lack of a control flow. Especially the fact that
the order of statements (and of the literals within a statement) is not relevant for a program
renders existing techniques unsuitable in general. Moreover, many commonly used
tools work with tokenisation: the source code is translated into a token string in which
code words are replaced by predefined keywords. These token strings are then compared
by searching for common substrings. However, DLV does not have
many built-in predicates, which makes these methods ineffective for detecting copies.
* This work was partially supported by the Austrian Science Fund (FWF) under project P21698.
The need for tools for plagiarism detection in ASP is clearly motivated by its growing
use in academia and industry, but our primary interest in such a tool is
to apply it in the context of a laboratory course at our university. We thus developed the
tool Kato which, to the best of our knowledge, is the first tool for plagiarism detection
that is directly tailored for ASP.1 Kato aims at finding similarities between (segments
of) logic programs to help detect cases of plagiarism. Currently, the tool is realised
for DLV programs but it is designed to handle various logic-programming syntax versions
as well.2 In what follows, we review the basic features of Kato and outline its
underlying methodology.
2 Features and Basic Methodology of Kato
Kato was developed to find suspicious pairs of programs stemming from student assignments
in the context of a course on logic programming at our university. Hence,
the tool can perform pairwise similarity tests on a rather large set of relatively small
programs. In what follows, our intention is to provide the reader with basic insights
concerning the implemented features and how they are realised.
Following a hybrid approach, Kato performs four kinds of tests, realising different
layers of granularity: (i) a comparison of the program comments via the longest
common subsequence (LCS) metric (see [9] for an overview), (ii) an LCS test on the
whole program, (iii) a finger-print test, and (iv) a structure test. The LCS of two strings
is the longest sequence of symbols that occurs in both strings in the same order. Hence, the
LCS metric tolerates injected non-matching symbols. Note that (i) and (ii) are language
independent, while (iii) and (iv) need to be adapted for different language dialects.
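The LCS length underlying tests (i) and (ii) can be computed with the standard dynamic-programming recurrence. The following sketch is illustrative, not Kato's actual code; in particular, normalising by the length of the longer string is our own choice of turning the LCS length into a similarity score in [0, 1]:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, 1):
            # Either extend a match, or carry over the best value so far.
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_similarity(a: str, b: str) -> float:
    """Normalise the LCS length by the longer string (an illustrative choice)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))
```

Because the LCS only requires symbols to appear in the same order, injected non-matching characters (inserted comments, renamed fragments) lower the score gradually rather than destroying the match.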
LCS-Comment Tests. It is surprising how little effort some people spend on masking copied
comments. This test reveals similarities between program comments when interpreted
as simple strings via the LCS metric.
LCS-Program Tests. Analogously to the LCS-comment test, the two programs as wholes
are interpreted as strings which are then tested for their longest common subsequence. This
test represents an efficient method to detect cases of plagiarism where not much time
has been spent to camouflage the plagiarism or parts of it.
Finger-Print Tests. A finger-print of a program is a collection of relevant statistical data
like hash-codes, number of rules, number of predicates, number of constants, program
size, and so on. After finger-prints for all programs are generated, they are compared
pairwise. This gives a simple yet convenient way to collect further evidence for plagiarism.
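A finger-print of this kind can be sketched as follows. The concrete attributes and the naive splitting of rules at `.` are illustrative assumptions, not necessarily what Kato computes:

```python
import hashlib

def fingerprint(program: str) -> dict:
    """Collect simple statistics for a program text (illustrative sketch)."""
    # Strip DLV-style '%' comments and blank lines before counting.
    lines = [line.split('%')[0].strip() for line in program.splitlines()]
    text = ' '.join(line for line in lines if line)
    # Naive rule count: DLV rules end with '.' (good enough for a sketch).
    rules = [r.strip() for r in text.split('.') if r.strip()]
    return {
        'hash': hashlib.sha1(text.encode()).hexdigest(),
        'num_rules': len(rules),
        'size': len(program),
    }
```

Comparing such finger-prints pairwise is cheap, so it scales well over a whole corpus and quickly narrows down the pairs worth inspecting with the more expensive tests.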
1 The name of the tool derives, with all due acknowledgements, from Inspector Clouseau's loyal
servant and sidekick, Kato.
2 See http://www.dlvsystem.com/ for details about DLV.
Structure Tests. This kind of test generally yields the most significant information, as it
takes the structure of the programs into account. The central similarity function
underlying the structure tests is defined as follows: let lit(r) be the set of all literals
occurring in a rule r. Then, for two rules r1 and r2, the rule similarity, sim(r1, r2), is
defined as
$$\mathit{sim}(r_1, r_2) = \frac{|\mathit{lit}(r_1) \cap \mathit{lit}(r_2)|}{\max(|\mathit{lit}(r_1)|, |\mathit{lit}(r_2)|)}.$$
Furthermore, for two programs P1 and P2, the program similarity, S(P1, P2), is given by
$$S(P_1, P_2) = \frac{\sum_{r \in P_1} \max\{\mathit{sim}(r, r') : r' \in P_2\}}{|P_1|}.$$
Note that S is not symmetric in its arguments. For any two programs P1 and P2,
S(P1, P2) yields a value between 0 and 1 which, roughly speaking, expresses the
extent to which P1 is subsumed by similar rules in P2.
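The two similarity functions translate directly into code. In the sketch below, a rule is represented as a set of literal strings and a program as a list of such sets; this encoding is an illustrative assumption, not Kato's internal representation:

```python
def rule_similarity(r1: frozenset, r2: frozenset) -> float:
    """sim(r1, r2): shared literals over the size of the larger rule."""
    if not r1 or not r2:
        return 0.0
    return len(r1 & r2) / max(len(r1), len(r2))

def program_similarity(p1: list, p2: list) -> float:
    """S(P1, P2): for each rule of P1, take its best match in P2, then average.
    Note the asymmetry: program_similarity(p1, p2) != program_similarity(p2, p1)
    in general."""
    if not p1 or not p2:
        return 0.0
    return sum(max(rule_similarity(r, r2) for r2 in p2) for r in p1) / len(p1)
```

Since rules and literals are sets here, permuting rules in a program or literals within a rule leaves both scores unchanged, which is exactly the robustness the definition is after.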
By definition, S thwarts disguising strategies like permuting rules or literals within
rules. However, a more advanced plagiarist could also uniformly rename variables
within rules or rename some auxiliary predicates. Therefore, our similarity test comes
with different levels of abstraction to counter these malicious efforts. Such renamings
are handled by finding and applying suitable substitution functions. Without going into
details, the problem of finding such functions is closely related to the homomorphism
problem which is known to be NP-complete. To circumvent the high complexity, we
use an efficient greedy heuristic to obtain our substitutions.
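A greedy heuristic of this flavour can be sketched as follows. The representation of rules as lists of literal strings, the regular expression for DLV variables, and the first-consistent-match strategy are our own illustrative assumptions; the heuristic trades optimality for speed and may miss the best renaming:

```python
import re

# DLV variables start with an uppercase letter.
VAR = re.compile(r'\b[A-Z]\w*')

def apply_subst(literal: str, subst: dict) -> str:
    """Rename all variables in a single pass so mappings cannot chain."""
    return VAR.sub(lambda m: subst.get(m.group(0), m.group(0)), literal)

def greedy_substitution(rule1: list, rule2: list) -> dict:
    """Greedily grow a variable renaming mapping literals of rule1 onto rule2.
    A cheap stand-in for exact matching, which is related to the NP-complete
    homomorphism problem mentioned in the text."""
    subst = {}
    for lit1 in rule1:
        for lit2 in rule2:
            candidate = dict(zip(VAR.findall(lit1), VAR.findall(lit2)))
            merged = {**subst, **candidate}
            # Accept the pairing only if it is consistent with earlier choices
            # and actually turns lit1 into lit2.
            if all(subst.get(v, w) == w for v, w in candidate.items()) \
                    and apply_subst(lit1, merged) == lit2:
                subst = merged
                break
    return subst
```

Once a substitution is fixed, the renamed rules can be fed to the set-based rule similarity, so uniformly renamed copies score as highly as verbatim ones.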
To make the similarity function sensitive to common rule patterns, we also implemented
a context-dependent extension: a global occurrence table gives additional
information on how specific two rules are. The main idea is that rare rules yield stronger evidence
for a copy than common ones. Therefore, Kato collects and counts all rules in
the considered corpus of programs and stores this information in an occurrence table.
During the comparison, the rule similarity is then weighted depending on the frequency
of the involved rules.
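The weighting idea can be sketched as follows. The damping factor 1/(1 + log n) is an illustrative choice of our own, since the text does not fix a concrete weighting function:

```python
import math
from collections import Counter

def occurrence_table(programs: list) -> Counter:
    """Count how often each (normalised) rule occurs across the whole corpus.
    Each program is given as a list of rule strings."""
    counts = Counter()
    for prog in programs:
        counts.update(prog)
    return counts

def weighted_rule_score(rule: str, base_similarity: float, counts: Counter) -> float:
    """Down-weight matches on rules that are common in the corpus: a rare rule
    is stronger evidence of copying than one everybody writes."""
    n = max(counts[rule], 1)  # guard against unseen rules
    return base_similarity / (1.0 + math.log(n))
```

With this weighting, a perfect match on a boilerplate rule shared by the whole class contributes less than a match on a rule that appears in only two submissions.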
3 Further Information and Discussion
The tool is entirely developed in Java (version 6.0). The results of the program comparisons
are displayed in tabular form with features like sorting and filtering. For the
structure tests, the tool shows program pairs and highlights similar rules. Current