SIM(1) SIM(1)
NAME
sim − find similarities in C, Java, Pascal, Modula-2, Lisp, Miranda, or text files
SYNOPSIS
sim c [ −[adefFiMnOpPRsSTuv] −r N −t N −w N −o F ] file ... [ [ / | ] file ... ]
sim c++ ...
sim java ...
sim pasc ...
sim m2 ...
sim lisp ...
sim mira ...
sim text ...
DESCRIPTION
Sim c reads the C files file ... and looks for segments of text that are similar; two segments of
program text are similar if they differ only in layout, comments, identifiers, and the contents of
numbers, strings and characters. If any runs of sufficient length are found, they are reported on
standard output; the number of significant tokens in the run is given between square brackets.
Sim c++ does the same for C++, sim java for Java, sim pasc for Pascal, sim m2 for Modula-2,
sim mira for Miranda, and sim lisp for Lisp. Sim text works on arbitrary text and is occasionally
useful on shell scripts.
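The notion of similarity described above can be illustrated with a small Python sketch. It is not sim's own lexer: the keyword set, the token classes and the names KEYWORDS, TOKEN_RE and normalize are assumptions made for illustration only. The sketch drops layout and comments and replaces identifiers, numbers, strings and characters by placeholders, so that two segments that differ only in those respects yield the same token sequence.

```python
import re

# Hypothetical, deliberately tiny keyword set; a real C lexer knows many more.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}

# Illustrative token pattern (an assumption, not sim's actual grammar).
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)   # comments are dropped
  | (?P<string>"(?:\\.|[^"\\])*")     # string literal
  | (?P<char>'(?:\\.|[^'\\])*')       # character literal
  | (?P<number>\d+(?:\.\d+)?)         # number
  | (?P<word>[A-Za-z_]\w*)            # keyword or identifier
  | (?P<space>\s+)                    # layout is dropped
  | (?P<punct>.)                      # any other single character
""", re.VERBOSE | re.DOTALL)


def normalize(source):
    """Return the token sequence of source with insignificant detail removed."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        text = match.group()
        if kind in ("comment", "space"):
            continue                      # layout and comments do not count
        if kind == "word":
            # Keywords are kept; every identifier becomes the same placeholder.
            tokens.append(text if text in KEYWORDS else "ID")
        elif kind in ("string", "char", "number"):
            tokens.append(kind.upper())   # contents of literals do not count
        else:
            tokens.append(text)           # punctuation is kept as is
    return tokens


first = "int sum(int a, int b) { return a + b; } /* add */"
second = "int   total(int x,\n int y)\n{ return x + y; } // other"
print(normalize(first) == normalize(second))
```

Under these assumptions the two fragments compare equal, while a fragment that uses a different operator does not.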
The program can be used for finding copied pieces of code in purportedly unrelated programs
(with −s or −S), or for finding accidentally duplicated code in larger projects (with −f or −F).
If a separator / or | is present in the list of input files, the files are divided into a group of "new"
files (before the / or |) and a group of "old" files; if there is no / or |, all files are "new". Old
files are never compared to each other. See also the description of the −s and −S options below.
Since the similarity tester needs file names to pinpoint the similarities, it cannot read from
standard input.
The similarity tester takes ASCII or UTF-8 text as input, and produces a sorted list of runs in
text form (default or with the −d or −n options) or in percentage form (with the −p option).
Input in other formats, e.g. .pdf or .doc, needs to be converted to ASCII or UTF-8 by preprocessing.
Aggregated similarity results can be obtained by postprocessing the output.
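As a preprocessing aid, the input requirement can be checked before sim is run. The following Python sketch is an assumption about how such a check might look and is not part of sim; the function name is hypothetical. It relies on the fact that ASCII is a subset of UTF-8, so one strict UTF-8 decode covers both encodings.

```python
def is_ascii_or_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (which includes ASCII)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Plain ASCII source text passes the check.
print(is_ascii_or_utf8(b"int main(void) { return 0; }"))
# Latin-1 encoded text with an accented letter does not.
print(is_ascii_or_utf8("caf\u00e9".encode("latin-1")))
```

Passing the check only shows that the bytes are valid ASCII or UTF-8; it does not show that the file is meaningful text, so formats such as .pdf or .doc still need a real conversion step.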
The following options are available:
−d The output is in a diff(1)-like format instead of the default 2-column format.
Recommended for text in languages with non-Latin alphabets.
−e Each file is compared to each file in isolation. This will find all similarities between all
texts involved, regardless of repetitive text, but may be slow for large numbers of files.
See also ‘Calculating Percentages’ below.
−f Runs are restricted to segments with balancing parentheses, to isolate potential routine
bodies (not in sim text).
−F The names of routines in calls are required to match exactly (not in sim text).
−i The names of the files to be compared are read from standard input, including a possible
separator / or |; the file names must be one to a line. This option allows a very large
number of file names to be specified; it differs from the @ facility provided by some
compilers in that it handles file names only, and does not recognize option arguments.
−M Memory usage information is displayed on standard error output.
−n Similarities found are summarized by file name, position and size, rather than displayed in
full.
−o F The output is written to the file named F.
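The input expected by the −i option (file names one to a line, with an optional / or | separator between the "new" and "old" groups) can be produced by a small script. The Python sketch below is an illustration only; the function name and its parameters are hypothetical, and placing the separator on a line of its own is an assumption rather than something the text above states.

```python
def build_file_list(new_files, old_files=(), separator="/"):
    """Return text for the -i option: one file name per line.

    Assumption: the separator between "new" and "old" files goes on
    its own line, like a file name.
    """
    if separator not in ("/", "|"):
        raise ValueError("separator must be '/' or '|'")
    lines = list(new_files)
    if old_files:
        lines.append(separator)
        lines.extend(old_files)
    for name in lines:
        if "\n" in name:
            # One name per line cannot represent a name containing a newline.
            raise ValueError("file names must not contain newlines")
    return "".join(line + "\n" for line in lines)


print(build_file_list(["a.c", "b.c"], ["old/x.c"]), end="")
```

The resulting text could then be fed to the standard input of a command such as sim c -i, for instance through a pipe or a temporary file.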
2016/08/01