Application of 2D graphic representation of protein sequence based on
Huffman tree method
Zhao-Hui Qi a,*, Jun Feng a, Xiao-Qin Qi a, Ling Li b
a College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei 050043, People’s Republic of China
b Basic Courses Department, Zhejiang Shuren University, Hangzhou, Zhejiang 310015, People’s Republic of China
article info
Article history:
Received 13 May 2011
Accepted 30 January 2012
Keywords:
Protein sequence
Graphic representation
Sequence analysis
Escherichia coli
Huffman tree
abstract
Based on the Huffman tree method, we propose a new 2D graphic representation of protein sequences. This
representation completely avoids loss of information in the transfer of data from a protein sequence
to its graphic representation. The method consists of two parts. The first derives 0–1 codes for the 20 amino
acids from a Huffman tree built on amino acid frequencies, where the frequency of an amino acid is defined as
its count in the analyzed protein sequences. The second constructs the 2D graphic
representation of a protein sequence from these 0–1 codes. Applications of the method to ten
ND5 genes and seven Escherichia coli strains are then presented in detail. The results show that the proposed
model may provide new insights into the evolution patterns determined from
protein sequences and complete genomes.
© 2012 Elsevier Ltd. All rights reserved.
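The first part of the method, deriving 0–1 codes from amino acid frequencies, can be illustrated with a generic Huffman construction. The following is a minimal sketch only: the function names (`huffman_codes`, `encode`, `decode`), the tie-breaking order, and the choice of which branch receives 0 or 1 are assumptions made for illustration, and the codes it produces need not match those reported by the authors.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(sequences):
    """Build 0-1 codes from symbol counts in the given sequences.

    Generic Huffman construction; tie-breaking and the 0/1 branch
    assignment are illustrative assumptions, not the paper's choices.
    """
    freq = Counter("".join(sequences))
    if not freq:
        return {}
    if len(freq) == 1:
        # A single symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so the dicts are never compared
    heap = [(n, next(tiebreak), {sym: ""}) for sym, n in sorted(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, left = heapq.heappop(heap)   # two least frequent subtrees
        n1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n0 + n1, next(tiebreak), merged))
    return heap[0][2]


def encode(seq, codes):
    """Concatenate the 0-1 codes of the residues in seq."""
    return "".join(codes[a] for a in seq)


def decode(bits, codes):
    """Recover the sequence from its bit string (codes are prefix-free)."""
    inverse = {c: s for s, c in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)
```

Because Huffman codes are prefix-free, `decode(encode(seq, codes), codes)` returns `seq` for any sequence whose residues all appear in `codes`, so the bit string itself carries the full sequence.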
1. Introduction
The rapid growth of biological sequence data, such as DNA and
protein sequences, has created many challenges for bioscientists. To cope
with this explosive growth, experimental,
mathematical and graphic approaches have been employed to
study the structure, function, evolution and attribution [1] of
these sequences.
Graphic techniques have emerged as a powerful tool for the
analysis and visualization of long biological sequences. The advantage
of graphic representations of biological sequences is that they provide
a simple way of viewing, sorting, and comparing various gene
structures, which helps in recognizing major differences among similar
DNA and protein sequences. A graphical method for visualizing DNA
sequences was proposed early on by Hamori in 1983 [2]. Afterwards,
Hamori [3] and Jeffrey [4] considered two other graphical representation
methods for DNA sequences. The original plot of a DNA
sequence as a random walk on a 2D grid, using the four cardinal
directions to represent the four bases A (adenine), G (guanine),
T (thymine) and C (cytosine), was done by Gates [5], Nandy [6] and
Leong and Morgenthaler [7]. In the last ten years, authors such
as Bielinska-Waz [8,9], Randić [10–13], Jaklic [14], Novic [15] and
Qi [16–18] have also presented their own graphical representations. These
graphical methods for visualizing DNA sequences provide useful
insights into local and global characteristics along a sequence that
are not easily observed from the raw DNA sequence. Two recent
references, Randić et al. [19] and Ghosh and Nandy [20],
give more detailed introductions to graphical methods for visualizing
DNA sequences, and readers can find there fuller accounts of
the various graphical representations of DNA.
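The random-walk plot mentioned above can be sketched in a few lines. This is an illustration only: the direction assigned to each base in `STEPS` and the function name `dna_walk` are assumptions, since the cited works each use their own axis convention.

```python
# Assumed direction assignment for illustration; the cited papers
# differ in which base is mapped to which cardinal direction.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}


def dna_walk(seq, steps=STEPS):
    """Return the list of 2D grid points visited by the walk for seq."""
    x, y = 0, 0
    path = [(x, y)]
    for base in seq.upper():
        dx, dy = steps[base]  # one unit step per base
        x, y = x + dx, y + dy
        path.append((x, y))
    return path
```

With this assignment, `dna_walk("AG")` returns to the origin, which shows how such a walk can retrace its own path, so that distinct sequences may produce overlapping curves.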
Compared with the graphical representation of DNA, the graphical
representation of proteins is more recent: the first was published in 2004 [21]. It
assumes a unique correspondence between a selected collection
of 20 nucleotide triplets and the 20 amino acids that they
represent. This Virtual Genetic Code converts a protein sequence
into a hypothetical DNA sequence, and allows one to use available
graphical representations of DNA to generate a graphical representation
for proteins [19]. Subsequently, novel graphical approaches were
developed that allow a
direct representation of proteins [22,23]. In addition, to reflect the
differences among the 20 natural amino acids, some graphic representations
of proteins take physicochemical properties into account. For
example, Chou et al. [24] proposed a 2D representation method, the
‘wenxiang diagram’, to characterize the disposition of hydrophobic
and hydrophilic residues. Wen and Zhang [25] proposed a 2D graphic
representation based on the pKa values of different amino acids. Wu
et al. [26] built a web server for creating graphic representations
of protein sequences from two different physicochemical properties of
their constituent amino acids.
In the present study, we propose a new 2D graphic representation
of protein sequences based on the 0–1 codes of the 20 amino
acids obtained from a Huffman tree. These Huffman-tree codes
provide a compressed way
to represent protein sequences in binary units. Further, the use of
Contents lists available at SciVerse ScienceDirect
journal homepage: www.elsevier.com/locate/cbm
Computers in Biology and Medicine
0010-4825/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compbiomed.2012.01.011
* Corresponding author.
E-mail address: zhqi_yh2004@yahoo.com.cn (Z.-H. Qi).
Computers in Biology and Medicine 42 (2012) 556–563