Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies
are not made or distributed for profit or commercial advantage and
that copies bear this notice and the full citation on the first page.
Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To
copy otherwise, or republish, to post on servers or to redistribute to
lists, requires prior specific permission and/or a fee. Request
permissions from Permissions@acm.org.
ASONAM '17, July 31-August 03, 2017, Sydney, Australia
© 2017 Copyright is held by the owner/author(s). Publication rights
licensed to ACM.
ACM ISBN 978-1-4503-4993-2/17/07…$15.00
http://dx.doi.org/10.1145/3110025.3110140
Mapping Whole DNA Sequence on Variant Maps
Yuyuan Mao, Jeffrey Zheng, Wenjia Liu
School of Software, Yunnan University
Kunming, China
yujemao@qq.com, conjugatelogic@yahoo.com
Abstract— Whole DNA sequences are naturally big data streams,
and classifying and visualizing them is a challenging task. This
paper proposes a new mapping method in which a special mapping
scheme transforms a whole DNA sequence into multiple 2D
statistical probability maps. A sample sequence from a South
American night monkey species (Aotus nancymaae) is selected, and
interesting patterns are observed in the resulting maps.
Keywords— Gene sequence, Aotus nancymaae, mapping
method, sequential model, variant map
I. INTRODUCTION
In modern biology, DNA sequences from a wide range of species,
from humans to simple cells, are being sequenced and stored in
DNA data banks as big data streams. It is difficult to process
these streams to classify and identify species from whole
sequences. A main task of current genomic research is to obtain
more biological information by processing and analyzing DNA
sequences from multiple angles and at multiple levels. In recent
years, biological gene data has been processed and used in a
variety of ways, such as gene feature extraction and gene
sequence location.
The variant map is an emerging technique that uses four symbols
as a meta structure to process random sequences, ranging from
cryptographic sequences and DNA sequences to ECG signals.
Multiple statistical probability distributions are generated from
selected sequences and represented as 2D or 3D visual maps.
This scheme makes whole data sequences more compact and
easier to visualize, and the mapping results may be useful for
exploring the non-linear complex behaviors of whole genomes.
In this paper, a special scheme is proposed to show a series
of mapping results from a selected gene sequence of Aotus
nancymaae.
II. PROCESS MODEL
A. Architecture
The architecture of the process model is shown in Figure
1(a). The process model consists of five parts: input, processing,
measurement, projection and output. Three of these are modules:
Processing, Measurement and Projection.
Input: A DNA sequence
Output: A 2D map
Modules: Processing, Measurement, Projection
Process: In the Processing module, the selected DNA sequence
is divided sequentially into multiple segments of fixed length m.
In the Measurement module, the four symbols {A, C, G, T} are
counted in each segment, which transforms the segment sequence
into a measuring sequence of four measures. In the Projection
module, a special combination, X: {AT} and Y: {AG}, maps each
set of four measures to a projection position, and the whole
measuring sequence is projected into a 2D map.
B. Processing module
An input DNA sequence is divided into segments of fixed
length m to generate a sequence of segments.
Input: a DNA sequence
Output: a sequence of segments
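
The segmentation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name split_segments is hypothetical, and discarding a trailing partial segment shorter than m is an assumption, since the text does not say how the remainder is handled.

```python
def split_segments(sequence: str, m: int) -> list[str]:
    """Split a sequence into consecutive, non-overlapping segments of length m."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    # Assumption: a trailing partial segment (fewer than m symbols) is dropped.
    usable = len(sequence) - len(sequence) % m
    return [sequence[i:i + m] for i in range(0, usable, m)]


if __name__ == "__main__":
    # A 10-symbol toy sequence with m = 4 yields two full segments.
    print(split_segments("ACGTACGTAC", 4))
```

For a whole genome, reading the sequence in chunks rather than holding it in a single string would likely be preferable; the list-based form is used here only for clarity.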
C. Measurement module
In this module, shown in Figure 1(b), the numbers of
occurrences of the four symbols {A, G, C, T} are counted in each
segment. Each count is therefore an integer between 0 and m, and
the segment sequence is transformed into a measuring sequence of
four measures.
Input: a sequence of segments
Output: a sequence of four measures
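
A possible sketch of the counting step is given below. The names measure and measure_all are hypothetical, the tuple order (A, C, G, T) is a convention chosen for this example, and ignoring symbols outside {A, C, G, T} (for instance the ambiguity code N) is an assumption not stated in the text.

```python
from collections import Counter


def measure(segment: str) -> tuple[int, int, int, int]:
    """Return the counts of A, C, G and T in one segment, in that order."""
    counts = Counter(segment.upper())
    # Assumption: any symbol other than A, C, G, T is simply ignored.
    return (counts["A"], counts["C"], counts["G"], counts["T"])


def measure_all(segments: list[str]) -> list[tuple[int, int, int, int]]:
    """Transform a sequence of segments into a sequence of four measures."""
    return [measure(segment) for segment in segments]


if __name__ == "__main__":
    print(measure_all(["AACG", "TTTT"]))
```

Each count lies between 0 and the segment length m, and the four counts sum to m whenever the segment contains only the four symbols.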
D. Projection module
The projection module, shown in Figure 1(c), consists of two
units: Position and Projecting. For each set of four measures, the
two axis positions are determined by X(AT) and Y(AG)
respectively. When all measures have been processed, the
resulting 2D histogram forms a statistical distribution that is
output as a 2D map.
Input: a sequence of four measures
Output: a 2D map
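
The projection step might look like the sketch below. The text does not define X(AT) and Y(AG) explicitly, so this example assumes that X is the sum of the A and T counts and Y is the sum of the A and G counts; the function name project and the (A, C, G, T) tuple order are likewise hypothetical.

```python
def project(measures: list[tuple[int, int, int, int]], m: int) -> list[list[int]]:
    """Accumulate a 2D histogram of projection positions on an (m+1) x (m+1) grid."""
    grid = [[0] * (m + 1) for _ in range(m + 1)]
    for a, c, g, t in measures:
        # Assumption: X(AT) = count(A) + count(T), Y(AG) = count(A) + count(G).
        x = a + t
        y = a + g
        grid[y][x] += 1  # rows index Y, columns index X
    return grid


if __name__ == "__main__":
    toy_measures = [(2, 1, 1, 0), (2, 1, 1, 0), (0, 2, 0, 2)]
    for row in project(toy_measures, 4):
        print(row)
```

Because each combined count is at most m, every position falls inside the grid. Dividing each cell by the number of segments would turn the histogram into an empirical probability distribution, if a normalized map is wanted.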
(a)