5 Statistical Methods
Statistical data compression methods employ variable-length codes, with the shorter
codes assigned to symbols or groups of symbols that appear more often in the data
(have a higher probability of occurrence). Designers and implementors of variable-
length codes have to deal with the two problems of (1) assigning codes that can be
decoded unambiguously and (2) assigning codes with the minimum average size. The
first problem is discussed in detail in Chapters 2 through 4, while the second problem
is solved in different ways by the methods described here.
This chapter is devoted to statistical compression algorithms, such as Shannon-
Fano, Huffman, arithmetic coding, and PPM. It is recommended, however, that the
reader start with the short presentation of information theory in the Appendix. This
presentation covers the principles and important terms used by information theory,
especially the terms redundancy and entropy. An understanding of these terms leads
to a deeper understanding of statistical data compression, and also makes it possible to
calculate how redundancy is reduced, or even eliminated, by the various methods.
5.1 Shannon-Fano Coding
Shannon-Fano coding, named after Claude Shannon and Robert Fano, was the first
algorithm to construct a set of the best variable-length codes.
We start with a set of n symbols with known probabilities (or frequencies) of oc-
currence. The symbols are first arranged in descending order of their probabilities. The
set of symbols is then divided into two subsets that have the same (or almost the same)
probabilities. All symbols in one subset get assigned codes that start with a 0, while
the codes of the symbols in the other subset start with a 1. Each subset is then recur-
sively divided into two subsubsets of roughly equal probabilities, and the second bit of
all the codes is determined in a similar way. When a subset contains just two symbols,
their codes are distinguished by adding one more bit to each. The process continues
until no more subsets remain. Table 5.1 illustrates the Shannon-Fano algorithm for a
seven-symbol alphabet. Notice that the symbols themselves are not shown, only their
probabilities.
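The recursive splitting just described is easy to prototype. The following Python sketch is not from the book; it is a minimal illustration that assumes the probabilities are already sorted in descending order and, at each level, places the split where the two subset totals are closest:

def shannon_fano(probs):
    # Return one codeword (a bit string) per symbol, for probabilities
    # that are sorted in descending order.
    codes = [''] * len(probs)

    def split(lo, hi):
        if hi - lo <= 1:                    # a single symbol needs no more bits
            return
        total = sum(probs[lo:hi])
        best_cut, best_diff, acc = lo + 1, float('inf'), 0.0
        for cut in range(lo + 1, hi):       # choose the cut that makes the two
            acc += probs[cut - 1]           # subset totals as equal as possible
            diff = abs(acc - (total - acc))
            if diff < best_diff:
                best_cut, best_diff = cut, diff
        for i in range(lo, hi):             # first subset gets a 1, second a 0
            codes[i] += '1' if i < best_cut else '0'
        split(lo, best_cut)
        split(best_cut, hi)

    split(0, len(probs))
    return codes

# The seven probabilities of Table 5.1:
probs = [0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05]
codes = shannon_fano(probs)
print(codes)   # ['11', '10', '011', '010', '001', '0001', '0000']
print(round(sum(p * len(c) for p, c in zip(probs, codes)), 2))   # 2.7 bits/symbol

Run on the probabilities of Table 5.1, the sketch reproduces the splits described below and an average size of 2.7 bits/symbol.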
Robert M. Fano was Ford Professor of Engineering, in
the Department of Electrical Engineering and Computer Sci-
ence at the Massachusetts Institute of Technology until his
retirement. In 1963 he organized MIT’s Project MAC (now
the Computer Science and Artificial Intelligence Laboratory)
and was its Director until September 1968. He also served as
Associate Head of the Department of Electrical Engineering
and Computer Science from 1971 to 1974.
Professor Fano chaired the Centennial Study Committee
of the Department of Electrical Engineering and Computer
Science whose report, “Lifelong Cooperative Education,” was published in October,
1982.
Professor Fano was born in Torino, Italy, and did most of his undergraduate work
at the School of Engineering of Torino before coming to the United States in 1939. He
received the Bachelor of Science degree in 1941 and the Doctor of Science degree in 1947,
both in Electrical Engineering from MIT. He has been a member of the MIT staff since
1941 and a member of its faculty since 1947.
During World War II, Professor Fano was on the staff of the MIT Radiation Lab-
oratory, working on microwave components and filters. He was also group leader of the
Radar Techniques Group of Lincoln Laboratory from 1950 to 1953. He has worked and
published at various times in the fields of network theory, microwaves, electromagnetism,
information theory, computers and engineering education. He is author of the book enti-
tled Transmission of Information, and co-author of Electromagnetic Fields, Energy and
Forces and Electromagnetic Energy Transmission and Radiation. He is also co-author
of Volume 9 of the Radiation Laboratory Series.
The first step splits the set of seven symbols into two subsets, one with two symbols
and a total probability of 0.45 and the other with the remaining five symbols and a
total probability of 0.55. The two symbols in the first subset are assigned codes that
start with 1, so their final codes are 11 and 10. The second subset is divided, in the
second step, into two symbols (with total probability 0.3 and codes that start with 01)
and three symbols (with total probability 0.25 and codes that start with 00). Step three
divides the last three symbols into one symbol (with probability 0.1 and code 001) and two symbols (with total probability 0.15 and codes that start with 000).
The average size of this code is 0.25×2 + 0.20×2 + 0.15×3 + 0.15×3 + 0.10×3 + 0.10×4 + 0.05×4 = 2.7 bits/symbol. This is a good result because the entropy (the smallest number of bits needed, on average, to represent each symbol) is

−(0.25 log₂ 0.25 + 0.20 log₂ 0.20 + 0.15 log₂ 0.15 + 0.15 log₂ 0.15 + 0.10 log₂ 0.10 + 0.10 log₂ 0.10 + 0.05 log₂ 0.05) ≈ 2.67.
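Both numbers are easy to verify; the following lines (an illustration, not part of the original text) recompute the average code length and the entropy from the probabilities and codeword lengths of Table 5.1:

from math import log2

probs   = [0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05]
lengths = [2, 2, 3, 3, 3, 4, 4]                 # codeword lengths in Table 5.1

average = sum(p * n for p, n in zip(probs, lengths))
entropy = -sum(p * log2(p) for p in probs)
print(round(average, 2), round(entropy, 2))     # 2.7 2.67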
     Prob.   Steps      Final
1.   0.25    1 1        :11
2.   0.20    1 0        :10
3.   0.15    0 1 1      :011
4.   0.15    0 1 0      :010
5.   0.10    0 0 1      :001
6.   0.10    0 0 0 1    :0001
7.   0.05    0 0 0 0    :0000

Table 5.1: Shannon-Fano Example.
Exercise 5.1: Repeat the calculation above but place the first split between the third
and fourth symbols. Calculate the average size of the code and show that it is greater
than 2.67 bits/symbol.
The code in the table in the answer to Exercise 5.1 has longer average size because
the splits, in this case, were not as good as those of Table 5.1. This suggests that the
Shannon-Fano method produces better code when the splits are better, i.e., when the
two subsets in every split have very close total probabilities. Carrying this argument
to its limit suggests that perfect splits yield the best code. Table 5.2 illustrates such a
case. The two subsets in every split have identical total probabilities, yielding a code
with the minimum average size (zero redundancy). Its average size is 0.25×2 + 0.25×2 + 0.125×3 + 0.125×3 + 0.125×3 + 0.125×3 = 2.5 bits/symbol, which is identical to its entropy. This means that it is the theoretical minimum average size.
     Prob.   Steps   Final
1.   0.25    1 1     :11
2.   0.25    1 0     :10
3.   0.125   0 1 1   :011
4.   0.125   0 1 0   :010
5.   0.125   0 0 1   :001
6.   0.125   0 0 0   :000

Table 5.2: Shannon-Fano Balanced Example.
The conclusion is that this method produces the best results when the symbols have
probabilities of occurrence that are (negative) powers of 2.
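A quick numerical check (an illustration, not from the text) makes the point: when every probability is a negative power of 2, the ideal codeword length −log₂ pᵢ is an integer and coincides with the lengths assigned by the perfect splits of Table 5.2:

from math import log2

probs = [0.25, 0.25, 0.125, 0.125, 0.125, 0.125]   # probabilities of Table 5.2
print([int(-log2(p)) for p in probs])              # [2, 2, 3, 3, 3, 3]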
Exercise 5.2: Compute the entropy of the codes of Table 5.2.
The Shannon-Fano method is easy to implement but the code it produces is gener-
ally not as good as that produced by the Huffman method (Section 5.2).
5.2 Huffman Coding
David Huffman (1925–1999)
Being originally from Ohio, it is no wonder that Huffman went to Ohio State University
for his BS (in electrical engineering). What was unusual was his age
(18) when he earned it in 1944. After serving in the United States
Navy, he went back to Ohio State for an MS degree (1949) and then
to MIT, for a PhD (1953, electrical engineering).
That same year, Huffman joined the faculty at MIT. In 1967,
he made his only career move when he went to the University of
California, Santa Cruz as the founding faculty member of the Com-
puter Science Department. During his long tenure at UCSC, Huff-
man played a major role in the development of the department (he
served as chair from 1970 to 1973) and he is known for his motto
“my products are my students.” Even after his retirement, in 1994, he remained active
in the department, teaching information theory and signal analysis courses.
Huffman made significant contributions in several areas, mostly information theory
and coding, signal designs for radar and communications, and design procedures for
asynchronous logical circuits. Of special interest is the well-known Huffman algorithm
for constructing a set of optimal prefix codes for data with known frequencies of occur-
rence. At a certain point he became interested in the mathematical properties of “zero
curvature” surfaces, and developed this interest into techniques for folding paper into
unusual sculptured shapes (the so-called computational origami).
Huffman coding is a popular method for data compression. It serves as the basis
for several popular programs run on various platforms. Some programs use just the
Huffman method, while others use it as one step in a multistep compression process.
The Huffman method [Huffman 52] is somewhat similar to the Shannon-Fano method.
It generally produces better codes, and like the Shannon-Fano method, it produces the
best code when the probabilities of the symbols are negative powers of 2. The main
difference between the two methods is that Shannon-Fano constructs its codes top to
bottom (from the leftmost to the rightmost bits), while Huffman constructs a code tree
from the bottom up (builds the codes from right to left).
Since its development in 1952 by D. Huffman, this method has been the subject of intensive research in data compression. The long discussion in [Gilbert and Moore 59]
proves that the Huffman code is a minimum-length code in the sense that no other
encoding has a shorter average length. An algebraic approach to constructing the Huff-
man code is introduced in [Karp 61]. In [Gallager 74], Robert Gallager shows that the
redundancy of Huffman coding is at most p₁ + 0.086, where p₁ is the probability of the most-common symbol in the alphabet. The redundancy is the difference between the average Huffman codeword length and the entropy. Given a large alphabet, such as the
set of letters, digits and punctuation marks used by a natural language, the largest symbol probability is typically around 15–20%, bringing the value of the quantity p₁ + 0.086 to around 0.1. This means that Huffman codes are at most 0.1 bit longer (per symbol)
than an ideal entropy encoder, such as arithmetic coding.
The Huffman algorithm starts by building a list of all the alphabet symbols in
descending order of their probabilities. It then constructs a tree, with a symbol at every
leaf, from the bottom up. This is done in steps, where at each step the two symbols with
smallest probabilities are selected, added to the top of the partial tree, deleted from the
list, and replaced with an auxiliary symbol representing the two original symbols. When
the list is reduced to just one auxiliary symbol (representing the entire alphabet), the
tree is complete. The tree is then traversed to determine the codes of the symbols.
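A compact way to prototype this bottom-up construction is with a priority queue (min-heap). The Python sketch below is an illustration only, not the book's own code: it repeatedly extracts the two least probable nodes, merges them into an auxiliary node, and finally walks the finished tree to read off the codes. The tie-breaking counter and the demo probabilities are choices of the sketch.

import heapq
from itertools import count

def huffman(probs):
    # Return one codeword (a bit string) per symbol.
    # A heap entry is (probability, tiebreak, tree); a tree is either a
    # symbol index (leaf) or a pair (left, right) of subtrees.
    tiebreak = count()
    heap = [(p, next(tiebreak), i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)     # the two smallest probabilities
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))

    codes = [''] * len(probs)
    def walk(tree, prefix):                 # traverse the finished tree
        if isinstance(tree, tuple):
            walk(tree[0], prefix + '1')     # the bit assignment is arbitrary
            walk(tree[1], prefix + '0')
        else:
            codes[tree] = prefix or '0'     # single-symbol alphabet edge case
    walk(heap[0][2], '')
    return codes

# The five probabilities of the example that follows (Figure 5.3a):
probs = [0.4, 0.2, 0.2, 0.1, 0.1]
codes = huffman(probs)
print(codes)
print(round(sum(p * len(c) for p, c in zip(probs, codes)), 2))   # 2.2 bits/symbol

Because ties between equal probabilities are broken arbitrarily, the sketch may produce codewords that differ from those derived in the example below, but the average size is the same 2.2 bits/symbol.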
This process is best illustrated by an example. Given five symbols with probabilities
as shown in Figure 5.3a, they are paired in the following order:
1. a₄ is combined with a₅ and both are replaced by the combined symbol a₄₅, whose probability is 0.2.
2. There are now four symbols left, a₁, with probability 0.4, and a₂, a₃, and a₄₅, with probabilities 0.2 each. We arbitrarily select a₃ and a₄₅, combine them, and replace them with the auxiliary symbol a₃₄₅, whose probability is 0.4.
3. Three symbols are now left, a₁, a₂, and a₃₄₅, with probabilities 0.4, 0.2, and 0.4, respectively. We arbitrarily select a₂ and a₃₄₅, combine them, and replace them with the auxiliary symbol a₂₃₄₅, whose probability is 0.6.
4. Finally, we combine the two remaining symbols, a₁ and a₂₃₄₅, and replace them with a₁₂₃₄₅ with probability 1.
The tree is now complete. It is shown in Figure 5.3a “lying on its side” with its root
on the right and its five leaves on the left. To assign the codes, we arbitrarily assign a
bit of 1 to the top edge, and a bit of 0 to the bottom edge, of every pair of edges. This
results in the codes 0, 10, 111, 1101, and 1100. The assignment of bits to the edges is arbitrary.
The average size of this code is 0.4×1 + 0.2×2 + 0.2×3 + 0.1×4 + 0.1×4 = 2.2
bits/symbol, but even more importantly, the Huffman code is not unique. Some of
the steps above were chosen arbitrarily, since there were more than two symbols with
smallest probabilities. Figure 5.3b shows how the same five symbols can be combined
differently to obtain a different Huffman code (11, 01, 00, 101, and 100). The average
size of this code is 0.4×2 + 0.2×2 + 0.2×2 + 0.1×3 + 0.1×3 = 2.2 bits/symbol, the
same as the previous code.
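As a quick check (not part of the original text), the average sizes of the two codes just mentioned can be evaluated directly:

probs = [0.4, 0.2, 0.2, 0.1, 0.1]
for code in (['0', '10', '111', '1101', '1100'],     # code of Figure 5.3a
             ['11', '01', '00', '101', '100']):      # code of Figure 5.3b
    print(round(sum(p * len(c) for p, c in zip(probs, code)), 2))   # 2.2 both times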
Exercise 5.3: Given the eight symbols A, B, C, D, E, F, G, and H with probabilities
1/30, 1/30, 1/30, 2/30, 3/30, 5/30, 5/30, and 12/30, draw three different Huffman trees
with heights 5 and 6 for these symbols and calculate the average code size for each tree.
Exercise 5.4: Figure Ans.5d shows another Huffman tree, with height 4, for the eight
symbols introduced in Exercise 5.3. Explain why this tree is wrong.
It turns out that the arbitrary decisions made in constructing the Huffman tree
affect the individual codes but not the average size of the code. Still, we have to answer
the obvious question, which of the different Huffman codes for a given set of symbols
is best? The answer, while not obvious, is simple: The best code is the one with the