5 Statistical Methods
Statistical data compression methods employ variable-length codes, with the shorter
codes assigned to symbols or groups of symbols that appear more often in the data
(have a higher probability of occurrence). Designers and implementors of variable-
length codes have to deal with the two problems of (1) assigning codes that can be
decoded unambiguously and (2) assigning codes with the minimum average size. The
first problem is discussed in detail in Chapters 2 through 4, while the second problem
is solved in different ways by the methods described here.
This chapter is devoted to statistical compression algorithms, such as Shannon-
Fano, Huffman, arithmetic coding, and PPM. It is recommended, however, that the
reader start with the short presentation of information theory in the Appendix. This
presentation covers the principles and important terms used by information theory,
especially the terms redundancy and entropy. An understanding of these terms leads
to a deeper understanding of statistical data compression, and also makes it possible to
calculate how redundancy is reduced, or even eliminated, by the various methods.
5.1 Shannon-Fano Coding
Shannon-Fano coding, named after Claude Shannon and Robert Fano, was the first
algorithm to construct a set of the best variable-length codes.
We start with a set of n symbols with known probabilities (or frequencies) of oc-
currence. The symbols are first arranged in descending order of their probabilities. The
set of symbols is then divided into two subsets that have the same (or almost the same)
probabilities. All symbols in one subset get assigned codes that start with a 0, while
the codes of the symbols in the other subset start with a 1. Each subset is then recur-
sively divided into two subsubsets of roughly equal probabilities, and the second bit of
all the codes is determined in a similar way. When a subset contains just two symbols,
their codes are distinguished by adding one more bit to each. The process continues
until no more subsets remain. Table 5.1 illustrates the Shannon-Fano algorithm for a
seven-symbol alphabet. Notice that the symbols themselves are not shown, only their
probabilities.
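The recursive splitting just described is easy to prototype. The following Python sketch is not from the book; it is a minimal illustration that assumes the probabilities are already sorted in descending order and, at each level, places the split where the two subset totals are closest:

def shannon_fano(probs):
    # Return one codeword (a bit string) per symbol, for probabilities
    # that are sorted in descending order.
    codes = [''] * len(probs)

    def split(lo, hi):
        if hi - lo <= 1:                    # a single symbol needs no more bits
            return
        total = sum(probs[lo:hi])
        best_cut, best_diff, acc = lo + 1, float('inf'), 0.0
        for cut in range(lo + 1, hi):       # choose the cut that makes the two
            acc += probs[cut - 1]           # subset totals as equal as possible
            diff = abs(acc - (total - acc))
            if diff < best_diff:
                best_cut, best_diff = cut, diff
        for i in range(lo, hi):             # first subset gets a 1, second a 0
            codes[i] += '1' if i < best_cut else '0'
        split(lo, best_cut)
        split(best_cut, hi)

    split(0, len(probs))
    return codes

# The seven probabilities of Table 5.1:
probs = [0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05]
codes = shannon_fano(probs)
print(codes)   # ['11', '10', '011', '010', '001', '0001', '0000']
print(round(sum(p * len(c) for p, c in zip(probs, codes)), 2))   # 2.7 bits/symbol

Run on the probabilities of Table 5.1, the sketch reproduces the splits described below and an average size of 2.7 bits/symbol.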
Robert M. Fano was Ford Professor of Engineering, in
the Department of Electrical Engineering and Computer Sci-
ence at the Massachusetts Institute of Technology until his
retirement. In 1963 he organized MIT’s Project MAC (now
the Computer Science and Artificial Intelligence Laboratory)
and was its Director until September 1968. He also served as
Associate Head of the Department of Electrical Engineering
and Computer Science from 1971 to 1974.
Professor Fano chaired the Centennial Study Committee
of the Department of Electrical Engineering and Computer
Science whose report, “Lifelong Cooperative Education,” was published in October,
1982.
Professor Fano was born in Torino, Italy, and did most of his undergraduate work
at the School of Engineering of Torino before coming to the United States in 1939. He
received the Bachelor of Science degree in 1941 and the Doctor of Science degree in 1947,
both in Electrical Engineering from MIT. He has been a member of the MIT staff since
1941 and a member of its faculty since 1947.
During World War II, Professor Fano was on the staff of the MIT Radiation Lab-
oratory, working on microwave components and filters. He was also group leader of the
Radar Techniques Group of Lincoln Laboratory from 1950 to 1953. He has worked and
published at various times in the fields of network theory, microwaves, electromagnetism,
information theory, computers and engineering education. He is author of the book enti-
tled Transmission of Information, and co-author of Electromagnetic Fields, Energy and
Forces and Electromagnetic Energy Transmission and Radiation. He is also co-author
of Volume 9 of the Radiation Laboratory Series.
The first step splits the set of seven symbols into two subsets, one with two symbols
and a total probability of 0.45 and the other with the remaining five symbols and a
total probability of 0.55. The two symbols in the first subset are assigned codes that
start with 1, so their final codes are 11 and 10. The second subset is divided, in the
second step, into two symbols (with total probability 0.3 and codes that start with 01)
and three symbols (with total probability 0.25 and codes that start with 00). Step three
divides the last three symbols into one symbol (with probability 0.1 and code 001) and two symbols (with total probability 0.15 and codes that start with 000).
The average size of this code is 0.25×2 + 0.20×2 + 0.15×3 + 0.15×3 + 0.10×3 + 0.10×4 + 0.05×4 = 2.7 bits/symbol. This is a good result because the entropy (the smallest number of bits needed, on average, to represent each symbol) is

−(0.25 log₂ 0.25 + 0.20 log₂ 0.20 + 0.15 log₂ 0.15 + 0.15 log₂ 0.15 + 0.10 log₂ 0.10 + 0.10 log₂ 0.10 + 0.05 log₂ 0.05) ≈ 2.67.
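Both numbers are easy to verify; the following lines (an illustration, not part of the original text) recompute the average code length and the entropy from the probabilities and codeword lengths of Table 5.1:

from math import log2

probs   = [0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05]
lengths = [2, 2, 3, 3, 3, 4, 4]                 # codeword lengths in Table 5.1

average = sum(p * n for p, n in zip(probs, lengths))
entropy = -sum(p * log2(p) for p in probs)
print(round(average, 2), round(entropy, 2))     # 2.7 2.67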
     Prob.   Steps      Final
1.   0.25    1 1        :11
2.   0.20    1 0        :10
3.   0.15    0 1 1      :011
4.   0.15    0 1 0      :010
5.   0.10    0 0 1      :001
6.   0.10    0 0 0 1    :0001
7.   0.05    0 0 0 0    :0000

Table 5.1: Shannon-Fano Example.
Exercise 5.1: Repeat the calculation above but place the first split between the third
and fourth symbols. Calculate the average size of the code and show that it is greater
than 2.67 bits/symbol.
The code in the table in the answer to Exercise 5.1 has longer average size because
the splits, in this case, were not as good as those of Table 5.1. This suggests that the
Shannon-Fano method produces better code when the splits are better, i.e., when the
two subsets in every split have very close total probabilities. Carrying this argument
to its limit suggests that perfect splits yield the best code. Table 5.2 illustrates such a
case. The two subsets in every split have identical total probabilities, yielding a code
with the minimum average size (zero redundancy). Its average size is 0.25×2 + 0.25×2 + 0.125×3 + 0.125×3 + 0.125×3 + 0.125×3 = 2.5 bits/symbol, which is identical to its entropy. This means that it is the theoretical minimum average size.
     Prob.   Steps   Final
1.   0.25    1 1     :11
2.   0.25    1 0     :10
3.   0.125   0 1 1   :011
4.   0.125   0 1 0   :010
5.   0.125   0 0 1   :001
6.   0.125   0 0 0   :000

Table 5.2: Shannon-Fano Balanced Example.
The conclusion is that this method produces the best results when the symbols have
probabilities of occurrence that are (negative) powers of 2.
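A quick numerical check (an illustration, not from the text) makes the point: when every probability is a negative power of 2, the ideal codeword length −log₂ pᵢ is an integer and coincides with the lengths assigned by the perfect splits of Table 5.2:

from math import log2

probs = [0.25, 0.25, 0.125, 0.125, 0.125, 0.125]   # probabilities of Table 5.2
print([int(-log2(p)) for p in probs])              # [2, 2, 3, 3, 3, 3]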
Exercise 5.2: Compute the entropy of the codes of Table 5.2.
The Shannon-Fano method is easy to implement but the code it produces is gener-
ally not as good as that produced by the Huffman method (Section 5.2).
5.2 Huffman Coding
David Huffman (1925–1999)
Being originally from Ohio, it is no wonder that Huffman went to Ohio State University
for his BS (in electrical engineering). What was unusual was his age
(18) when he earned it in 1944. After serving in the United States
Navy, he went back to Ohio State for an MS degree (1949) and then
to MIT, for a PhD (1953, electrical engineering).
That same year, Huffman joined the faculty at MIT. In 1967,
he made his only career move when he went to the University of
California, Santa Cruz as the founding faculty member of the Com-
puter Science Department. During his long tenure at UCSC, Huff-
man played a major role in the development of the department (he
served as chair from 1970 to 1973) and he is known for his motto
“my products are my students.” Even after his retirement, in 1994, he remained active
in the department, teaching information theory and signal analysis courses.
Huffman made significant contributions in several areas, mostly information theory
and coding, signal designs for radar and communications, and design procedures for
asynchronous logical circuits. Of special interest is the well-known Huffman algorithm
for constructing a set of optimal prefix codes for data with known frequencies of occur-
rence. At a certain point he became interested in the mathematical properties of “zero
curvature” surfaces, and developed this interest into techniques for folding paper into
unusual sculptured shapes (the so-called computational origami).
Huffman coding is a popular method for data compression. It serves as the basis
for several popular programs run on various platforms. Some programs use just the
Huffman method, while others use it as one step in a multistep compression process.
The Huffman method [Huffman 52] is somewhat similar to the Shannon-Fano method.
It generally produces better codes, and like the Shannon-Fano method, it produces the
best code when the probabilities of the symbols are negative powers of 2. The main
difference between the two methods is that Shannon-Fano constructs its codes top to
bottom (from the leftmost to the rightmost bits), while Huffman constructs a code tree
from the bottom up (builds the codes from right to left).
Since its development in 1952 by D. Huffman, this method has been the subject of intensive research in data compression. The long discussion in [Gilbert and Moore 59]
proves that the Huffman code is a minimum-length code in the sense that no other
encoding has a shorter average length. An algebraic approach to constructing the Huff-
man code is introduced in [Karp 61]. In [Gallager 74], Robert Gallager shows that the
redundancy of Huffman coding is at most p₁ + 0.086, where p₁ is the probability of the most-common symbol in the alphabet. The redundancy is the difference between the average Huffman codeword length and the entropy. Given a large alphabet, such as the
set of letters, digits and punctuation marks used by a natural language, the largest symbol probability is typically around 15–20%, bringing the value of the quantity p₁ + 0.086 to around 0.1. This means that Huffman codes are at most 0.1 bit longer (per symbol)
than an ideal entropy encoder, such as arithmetic coding.
The Huffman algorithm starts by building a list of all the alphabet symbols in
descending order of their probabilities. It then constructs a tree, with a symbol at every
leaf, from the bottom up. This is done in steps, where at each step the two symbols with
smallest probabilities are selected, added to the top of the partial tree, deleted from the
list, and replaced with an auxiliary symbol representing the two original symbols. When
the list is reduced to just one auxiliary symbol (representing the entire alphabet), the
tree is complete. The tree is then traversed to determine the codes of the symbols.
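A compact way to prototype this bottom-up construction is with a priority queue (min-heap). The Python sketch below is an illustration only, not the book's own code: it repeatedly extracts the two least probable nodes, merges them into an auxiliary node, and finally walks the finished tree to read off the codes. The tie-breaking counter and the demo probabilities are choices of the sketch.

import heapq
from itertools import count

def huffman(probs):
    # Return one codeword (a bit string) per symbol.
    # A heap entry is (probability, tiebreak, tree); a tree is either a
    # symbol index (leaf) or a pair (left, right) of subtrees.
    tiebreak = count()
    heap = [(p, next(tiebreak), i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)     # the two smallest probabilities
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))

    codes = [''] * len(probs)
    def walk(tree, prefix):                 # traverse the finished tree
        if isinstance(tree, tuple):
            walk(tree[0], prefix + '1')     # the bit assignment is arbitrary
            walk(tree[1], prefix + '0')
        else:
            codes[tree] = prefix or '0'     # single-symbol alphabet edge case
    walk(heap[0][2], '')
    return codes

# The five probabilities of the example that follows (Figure 5.3a):
probs = [0.4, 0.2, 0.2, 0.1, 0.1]
codes = huffman(probs)
print(codes)
print(round(sum(p * len(c) for p, c in zip(probs, codes)), 2))   # 2.2 bits/symbol

Because ties between equal probabilities are broken arbitrarily, the sketch may produce codewords that differ from those derived in the example below, but the average size is the same 2.2 bits/symbol.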
This process is best illustrated by an example. Given five symbols with probabilities
as shown in Figure 5.3a, they are paired in the following order:
1. a₄ is combined with a₅ and both are replaced by the combined symbol a₄₅, whose probability is 0.2.
2. There are now four symbols left, a₁, with probability 0.4, and a₂, a₃, and a₄₅, with probabilities 0.2 each. We arbitrarily select a₃ and a₄₅, combine them, and replace them with the auxiliary symbol a₃₄₅, whose probability is 0.4.
3. Three symbols are now left, a₁, a₂, and a₃₄₅, with probabilities 0.4, 0.2, and 0.4, respectively. We arbitrarily select a₂ and a₃₄₅, combine them, and replace them with the auxiliary symbol a₂₃₄₅, whose probability is 0.6.
4. Finally, we combine the two remaining symbols, a₁ and a₂₃₄₅, and replace them with a₁₂₃₄₅ with probability 1.
The tree is now complete. It is shown in Figure 5.3a “lying on its side” with its root
on the right and its five leaves on the left. To assign the codes, we arbitrarily assign a
bit of 1 to the top edge, and a bit of 0 to the bottom edge, of every pair of edges. This
results in the codes 0, 10, 111, 1101, and 1100. The assignment of bits to the edges is arbitrary.
The average size of this code is 0.4×1 + 0.2×2 + 0.2×3 + 0.1×4 + 0.1×4 = 2.2
bits/symbol, but even more importantly, the Huffman code is not unique. Some of
the steps above were chosen arbitrarily, since there were more than two symbols with
smallest probabilities. Figure 5.3b shows how the same five symbols can be combined
differently to obtain a different Huffman code (11, 01, 00, 101, and 100). The average
size of this code is 0.4×2 + 0.2×2 + 0.2×2 + 0.1×3 + 0.1×3 = 2.2 bits/symbol, the
same as the previous code.
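As a quick check (not part of the original text), the average sizes of the two codes just mentioned can be evaluated directly:

probs = [0.4, 0.2, 0.2, 0.1, 0.1]
for code in (['0', '10', '111', '1101', '1100'],     # code of Figure 5.3a
             ['11', '01', '00', '101', '100']):      # code of Figure 5.3b
    print(round(sum(p * len(c) for p, c in zip(probs, code)), 2))   # 2.2 both times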
Exercise 5.3: Given the eight symbols A, B, C, D, E, F, G, and H with probabilities
1/30, 1/30, 1/30, 2/30, 3/30, 5/30, 5/30, and 12/30, draw three different Huffman trees
with heights 5 and 6 for these symbols and calculate the average code size for each tree.
Exercise 5.4: Figure Ans.5d shows another Huffman tree, with height 4, for the eight
symbols introduced in Exercise 5.3. Explain why this tree is wrong.
It turns out that the arbitrary decisions made in constructing the Huffman tree
affect the individual codes but not the average size of the code. Still, we have to answer
the obvious question, which of the different Huffman codes for a given set of symbols
is best? The answer, while not obvious, is simple: The best code is the one with the