Abstract. We present the tool Kato which is, to the best of our knowledge,
the first tool for plagiarism detection that is directly tailored for answer-set programming
(ASP). Kato aims at finding similarities between (segments of) logic
programs to help detect cases of plagiarism. Currently, the tool is realised for
DLV programs but it is designed to handle various logic-programming syntax
versions. We review basic features and the underlying methodology of the tool.
1 Background
With the rise of the Internet and the easy access to information it affords, plagiarism is a growing
problem not only in academia but also in science and technology in general. In software
development, plagiarism involves copying (parts of) a program without revealing
the source it was copied from. The relevance of plagiarism detection for conventional
program development is well acknowledged [1]: it is not only motivated by an
academic setting to prevent students from violating good academic standards, but also
by the urge to retain the control of program code in industrial software development
projects.
We are concerned with plagiarism detection in the context of answer-set programming
(ASP) [2]. In particular, we deal with disjunctive logic programs under the answer-set
semantics [3]. Answer-set programming is characterised by the feature that problems
are encoded in terms of theories such that the solutions of a problem instance correspond
to certain models (the "answer sets") of the corresponding logic program. It differs from
imperative languages like C++ or Java (and also to some extent from Prolog) because
of its genuine declarative nature: a logic program is a specification rather than an instruction
of how to solve a problem; the order of the rules and the order of the literals
within the heads and bodies of the rules have no effect on the semantics of a program.
Hence, someone who copies code has other means to disguise the deed.
For conventional programming languages, sophisticated tools for plagiarism detection exist
(see, e.g., [4-8]). However, most of these techniques are not adequate for ASP. The reason is the
declarative nature of ASP as well as the lack of a control flow. Especially the fact that
the order of statements (and of the literals within a statement) is not relevant for a program
renders existing techniques unsuitable in general. Moreover, many commonly used
tools work with tokenisation: the source code is translated into a token string in which
code words are replaced by predefined keywords. These token strings are then compared
by searching for common substrings. However, DLV does not have
many built-in predicates, which makes these methods ineffective for detecting copies.
* This work was partially supported by the Austrian Science Fund (FWF) under project P21698.
The need for tools for plagiarism detection in ASP is clearly motivated by its growing
use in academia and industry, but our primary interest in such a tool is
to apply it in the context of a laboratory course at our university. We thus developed the
tool Kato which, to the best of our knowledge, is the first tool for plagiarism detection
that is directly tailored for ASP.1 Kato aims at finding similarities between (segments
of) logic programs to help detect cases of plagiarism. Currently, the tool is realised
for DLV programs but it is designed to handle various logic-programming syntax versions
as well.2 In what follows, we review the basic features of Kato and outline its
underlying methodology.
2 Features and Basic Methodology of Kato
Kato was developed to find suspicious pairs of programs stemming from student assignments
in the context of a course on logic programming at our university. Hence,
the tool can perform pairwise similarity tests on a rather large set of relatively small
programs. In what follows, our intention is to provide the reader with basic insights
concerning the implemented features and how they are realised.
Following a hybrid approach, Kato performs four kinds of tests, realising different
layers of granularity: (i) a comparison of the program comments via the longest
common subsequence (LCS) metric (see [9] for an overview), (ii) an LCS test on the
whole program, (iii) a finger-print test, and (iv) a structure test. The LCS of two strings
is the longest sequence of symbols that occurs in both strings in the same order. Hence, the
LCS metric tolerates injected non-matching symbols. Note that (i) and (ii) are language
independent, while (iii) and (iv) need to be adapted for different language dialects.
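The LCS length underlying tests (i) and (ii) can be computed with the standard dynamic-programming recurrence. The following sketch is illustrative, not Kato's actual code; in particular, normalising by the length of the longer string is our own choice of turning the LCS length into a similarity score in [0, 1]:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, 1):
            # Either extend a match, or carry over the best value so far.
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_similarity(a: str, b: str) -> float:
    """Normalise the LCS length by the longer string (an illustrative choice)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))
```

Because the LCS only requires symbols to appear in the same order, injected non-matching characters (inserted comments, renamed fragments) lower the score gradually rather than destroying the match.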
LCS-Comment Tests. It is surprising how little effort some people spend on masking copied
comments. This test reveals similarities between program comments when interpreted
as simple strings via the LCS metric.
LCS-Program Tests. Analogously to the LCS-comment test, the two programs as wholes
are interpreted as strings which are then tested for their longest common subsequence. This
test represents an efficient method to detect cases of plagiarism where not much time
has been spent to camouflage the plagiarism or parts of it.
Finger-Print Tests. A finger-print of a program is a collection of relevant statistical data
like hash-codes, number of rules, number of predicates, number of constants, program
size, and so on. After finger-prints for all programs are generated, they are compared
pairwise. This gives a simple yet convenient way to collect further evidence for plagiarism.
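A finger-print of this kind can be sketched as follows. The concrete attributes and the naive splitting of rules at `.` are illustrative assumptions, not necessarily what Kato computes:

```python
import hashlib

def fingerprint(program: str) -> dict:
    """Collect simple statistics for a program text (illustrative sketch)."""
    # Strip DLV-style '%' comments and blank lines before counting.
    lines = [line.split('%')[0].strip() for line in program.splitlines()]
    text = ' '.join(line for line in lines if line)
    # Naive rule count: DLV rules end with '.' (good enough for a sketch).
    rules = [r.strip() for r in text.split('.') if r.strip()]
    return {
        'hash': hashlib.sha1(text.encode()).hexdigest(),
        'num_rules': len(rules),
        'size': len(program),
    }
```

Comparing such finger-prints pairwise is cheap, so it scales well over a whole corpus and quickly narrows down the pairs worth inspecting with the more expensive tests.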
1 The name of the tool derives, with all due acknowledgements, from Inspector Clouseau's loyal
servant and sidekick, Kato.
2 See http://www.dlvsystem.com/ for details about DLV.
Structure Tests. This kind of test generally yields the most significant information, as it
takes the structure of the programs into account. The central similarity function
underlying the structure tests is defined as follows: let lit(r) be the set of all literals
occurring in a rule r. Then, for two rules r1 and r2, the rule similarity, sim(r1, r2), is
defined as
$$\mathit{sim}(r_1, r_2) = \frac{|\mathit{lit}(r_1) \cap \mathit{lit}(r_2)|}{\max(|\mathit{lit}(r_1)|, |\mathit{lit}(r_2)|)}.$$
Furthermore, for two programs P1 and P2, the program similarity, S(P1, P2), is given by
$$S(P_1, P_2) = \frac{\sum_{r \in P_1} \max\{\mathit{sim}(r, r') : r' \in P_2\}}{|P_1|}.$$
Note that S is not symmetric in its arguments. For any two programs P1 and P2,
S(P1, P2) yields a value between 0 and 1 which, roughly speaking, expresses the
extent to which P1 is subsumed by similar rules in P2.
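The two similarity functions translate directly into code. In the sketch below, a rule is represented as a set of literal strings and a program as a list of such sets; this encoding is an illustrative assumption, not Kato's internal representation:

```python
def rule_similarity(r1: frozenset, r2: frozenset) -> float:
    """sim(r1, r2): shared literals over the size of the larger rule."""
    if not r1 or not r2:
        return 0.0
    return len(r1 & r2) / max(len(r1), len(r2))

def program_similarity(p1: list, p2: list) -> float:
    """S(P1, P2): for each rule of P1, take its best match in P2, then average.
    Note the asymmetry: program_similarity(p1, p2) != program_similarity(p2, p1)
    in general."""
    if not p1 or not p2:
        return 0.0
    return sum(max(rule_similarity(r, r2) for r2 in p2) for r in p1) / len(p1)
```

Since rules and literals are sets here, permuting rules in a program or literals within a rule leaves both scores unchanged, which is exactly the robustness the definition is after.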
By definition, S thwarts disguising strategies like permuting rules or literals within
rules. However, a more advanced plagiarist could also uniformly rename variables
within rules or rename some auxiliary predicates. Therefore, our similarity test comes
with different levels of abstraction to counter these malicious efforts. Such renamings
are handled by finding and applying suitable substitution functions. Without going into
details, the problem of finding such functions is closely related to the homomorphism
problem which is known to be NP-complete. To circumvent the high complexity, we
use an efficient greedy heuristic to obtain our substitutions.
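A greedy heuristic of this flavour can be sketched as follows. The representation of rules as lists of literal strings, the regular expression for DLV variables, and the first-consistent-match strategy are our own illustrative assumptions; the heuristic trades optimality for speed and may miss the best renaming:

```python
import re

# DLV variables start with an uppercase letter.
VAR = re.compile(r'\b[A-Z]\w*')

def apply_subst(literal: str, subst: dict) -> str:
    """Rename all variables in a single pass so mappings cannot chain."""
    return VAR.sub(lambda m: subst.get(m.group(0), m.group(0)), literal)

def greedy_substitution(rule1: list, rule2: list) -> dict:
    """Greedily grow a variable renaming mapping literals of rule1 onto rule2.
    A cheap stand-in for exact matching, which is related to the NP-complete
    homomorphism problem mentioned in the text."""
    subst = {}
    for lit1 in rule1:
        for lit2 in rule2:
            candidate = dict(zip(VAR.findall(lit1), VAR.findall(lit2)))
            merged = {**subst, **candidate}
            # Accept the pairing only if it is consistent with earlier choices
            # and actually turns lit1 into lit2.
            if all(subst.get(v, w) == w for v, w in candidate.items()) \
                    and apply_subst(lit1, merged) == lit2:
                subst = merged
                break
    return subst
```

Once a substitution is fixed, the renamed rules can be fed to the set-based rule similarity, so uniformly renamed copies score as highly as verbatim ones.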
To make the similarity function sensitive to common rule patterns, we also implemented
a context-dependent extension: a global occurrence table gives additional
information on how specific two rules are. The main idea is that rare rules yield stronger evidence
for a copy than common ones. Therefore, Kato collects and counts all rules in
the considered corpus of programs and stores this information in an occurrence table.
During the comparison, the rule similarity is then weighted depending on the frequency
of the involved rules.
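The weighting idea can be sketched as follows. The damping factor 1/(1 + log n) is an illustrative choice of our own, since the text does not fix a concrete weighting function:

```python
import math
from collections import Counter

def occurrence_table(programs: list) -> Counter:
    """Count how often each (normalised) rule occurs across the whole corpus.
    Each program is given as a list of rule strings."""
    counts = Counter()
    for prog in programs:
        counts.update(prog)
    return counts

def weighted_rule_score(rule: str, base_similarity: float, counts: Counter) -> float:
    """Down-weight matches on rules that are common in the corpus: a rare rule
    is stronger evidence of copying than one everybody writes."""
    n = max(counts[rule], 1)  # guard against unseen rules
    return base_similarity / (1.0 + math.log(n))
```

With this weighting, a perfect match on a boilerplate rule shared by the whole class contributes less than a match on a rule that appears in only two submissions.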
3 Further Information and Discussion
The tool is entirely developed in Java (version 6.0). The results of the program comparisons
are displayed in tabular form with features like sorting and filtering. For the
structure tests, the tool shows program pairs and highlights similar rules. Current