Word, Character, and Bit Based Compression Using Arithmetic Coding.
-------------------------------------------------------------------
Authors:
John Carpinelli (johnfc@ecr.mu.oz.au)
Wayne Salamonsen (wbs@mundil.cs.mu.oz.au)
Alistair Moffat (alistair@cs.mu.oz.au)
Radford Neal (radford@ai.toronto.edu)
Ian Witten (ihw@waikato.ac.nz)
Based on:
A. Moffat, R. Neal, I.H. Witten, "Arithmetic Coding Revisited", Proc.
IEEE Data Compression Conference, Snowbird, Utah, March 1995.
Radford M. Neal, Low-Precision Arithmetic Coding Implementation.
P.M. Fenwick, "A New Data Structure for Cumulative Probability Tables",
Software--Practice and Experience, 24:327-336, March 1994.
See also:
I.H. Witten, R. Neal, J.G. Cleary, "Arithmetic Coding for Data
Compression", Comm ACM., 30:520-541, June 1987.
Overview:
This package contains C source code for an improved implementation of
arithmetic coding. Three different models are included with the
package: an adaptive zero-order word based model, an adaptive
zero-order character based model and an adaptive fixed-context
bit-based model. The option of using shift/add operations to effect
the necessary integer multiplication and division operations is also
supported. Use of the shift/add option may result in a speed
improvement on some architectures.
The code is based on "Arithmetic Coding Revisited" by Moffat, Neal and
Witten, presented at the IEEE Data Compression Conference in Snowbird,
Utah, March 1995. A more detailed version of that paper is currently
being prepared; contact Alistair Moffat, address below, for
details. The data structure used for storing the frequency counts of
each symbol is based on "A new data structure for cumulative
probability tables", by P.M. Fenwick, Software - Practice and
Experience, March 1994. Much of the arithmetic coding module was based
on a previous low-precision arithmetic coding implementation by Radford
Neal, available from ftp.cs.toronto.edu, file
/pub/radford/lowp_ac.shar. The original CACM coder is available from
ftp.cpsc.ucalgary.ca, file /pub/projects/ar.cod/cacm-87.shar.
Each of the programs can be thought of as consisting of three
sections: model, statistics, and coder. Three different models have
been provided, one statistics module, and one coder module (but with a
compile time option to use hardware or software arithmetic).
Models:
The word model follows a strictly alternating sequence of reading a
word then a non-word. After reading each word/non-word it uses a
hash table to find the corresponding ordinal word number, which it then
passes to the statistics module together with the corresponding
word/non-word context. A context stores the corresponding frequency for
each ordinal symbol. The intention of this model is to illustrate the
advantages of the new coder on large alphabets, and the manipulation
of a number of ``contexts'' within a single program.
The character-based model is broadly equivalent to the adaptive model
presented in the original CACM paper. Compression is relatively
poor.
The bit-based model takes one bit at a time from the input stream and
passes it and the corresponding context into the statistics module.
The correct context is chosen based on the previous n bits, where 0 <=
n <= 20 is specified as a command-line parameter. Relatively good
compression results when n is large, but it runs slowly. The intention
is to illustrate the use of the arithmetic coding routines when
restricted to a binary alphabet.
Statistics:
The statistics module interfaces between the model and the coder. It is
responsible for probability estimation and generation of the parameters
required by the coder. It does so by adaptively counting symbol
occurrences, and uses a tree-based structure devised by Peter Fenwick
to maintain cumulative frequencies. Accessing the structure takes O(log n)
time per symbol, where n is the size of the alphabet, so the cost per
symbol remains small even when n is very large. Just as important, it is
space-economical,
and requires just n words of storage compared to the 3n words required
by the CACM cumfreq array.
Coder:
The default coder uses multiplications and divisions in its
calculations, as described in "Arithmetic Coding Revisited". The
compile time option "-DSHIFT_ADD" forces the use of shift/add integer
arithmetic. Because only low precision is required, this allows
faster execution on some architectures.
One notable feature of the coder is that by using a limited
precision implementation, the total frequency can be up to 30 bits.
That is, the total frequency may reach a value as large as 1,073,741,824
before frequencies need be halved. The number of bits can be set as a
command line argument, allowing one to use fewer bits for more precise
coding. The default number of bits used for frequency counts is 27.
The use of large values for the fbits parameter also allows faster
execution, since much of the arithmetic can be done to low precision.
Very large values for fbits do, however, introduce rounding effects in
the coder that may negate the gain of using more accurate model
probabilities. See the paper for analysis of these rounding effects.
The package:
The package consists of 14 files: 6 source files (.c), 3 header files
(.h), a makefile, 3 man pages (.1), and this README file.
word.c: Driving code for the word based compression model.
char.c: Driving code for the character based compression model.
bits.c: Driving code for the bit based compression model.
stats.c: Functions for maintaining frequency counts.
coder.c: Functions associated with arithmetic coding.
hashtable.c: Functions associated with maintaining the hash table
used to convert words to ordinal word numbers.
coder.h: Function prototypes exported from coder.c, #defines used in
coder.c
stats.h: Function prototypes exported from stats.c, #defines used in
stats.c
hashtable.h: Function prototypes exported from hashtable.c, #defines used
in hashtable.c
word.1: Manual page
char.1: Manual page
bits.1: Manual page
Usage:
The programs share a common interface.
prog [-e [-m n] | -d] [-v] [-f n] [file]
where
prog is the program one wishes to run, i.e. word, char or bits
-e forces the program to encode
-m n limits the program to using n megabytes of memory (word and
bits only). Default value is n = 1.
-d forces the program to decode
-v verbose -- prints information about what the program
is doing, and various statistics once finished.
-f n use n bits of precision. Default is 27.
file specifies the name of the file to encode or decode.
Output is written to stdout. If no input file is specified, input is
taken from stdin. If neither -e nor -d is specified, default is to
encode _unless_ a magic number is detected in the first four bytes, in
which case decoding is assumed.
For more detailed help, refer to the appropriate man page.
Conditions:
These programs are supplied free of charge for research purposes only,
and may not be sold or incorporated into any commercial product. There is
ABSOLUTELY NO WARRANTY of any sort, nor any undertaking that they are
fit for ANY PURPOSE WHATSOEVER. Use them at your own risk. If you do
happen to find a bug, or have modifications to suggest, please report
the same to Alistair Moffat at the address below.
Contact:
Alistair Moffat
Department of Computer Science
University of Melbourne
Victoria, Australia 3052.
alistair@cs.mu.oz.au
March 1995