A Review: Text Detection in Natural Scene Image
Yue Sun, Abdusalam Dawut, Askar Hamdulla*
Institute of Information Science and Engineering, Xinjiang University, Urumqi, 830046, China
*corresponding author’s email: askarhamdulla@sina.com
Abstract—Multimedia files such as images and videos
carry a large amount of textual information, which is an
important route for people to obtain information.
Therefore, the automatic detection and recognition of text
has become an increasingly popular topic in
computer vision and document analysis. Addressing the
problem of scene text detection and recognition, this paper
introduces a variety of text detection methods. It focuses on
the use of machine learning for text detection and compares
these methods. It then briefly introduces text
recognition and some standard datasets. Finally, some
valuable research topics in this field are proposed and
Keywords: Text detection, text recognition, natural image,
machine learning, deep learning.
1. INTRODUCTION
With the popularization of digital devices such as
mobile phones and cameras, detecting and recognizing
objects in images has become an increasingly
important part of people's lives. Among the many carriers
of information, multimedia files such as images
and videos contain a great deal of text, which is an
important way for people to obtain information. The
abundant and accurate information carried by text is
very useful for applications such as image retrieval
and object tracking. Therefore, text detection
and recognition have become hot issues in computer vision.
Text in images can be roughly divided into two
categories: overlay text and scene text. Overlay text is
text that is artificially superimposed on images or video,
such as transcripts and titles in news and
subtitles in movies. Its background is simple, it contrasts
sharply with the background, and its font and size are
uniform. Therefore, overlay text is relatively easy to
locate, extract, and recognize. In contrast, scene text
often has a complex background, its color is sometimes
similar to the background color, and it is easily affected
by hardware devices, illumination intensity, shooting
angle, the natural environment, and human interference.
These factors make the detection, localization, and
recognition of scene text more difficult.
The main factors that affect scene text detection
and recognition results are as follows [1]:
1) Multiple text attributes: Overlay text has uniform
fonts, sizes, colors, and spacing. Scene text, by contrast,
varies in font and color as well as in size and orientation.
Even against the same background, these attributes can differ.
2) Intricate background: In most cases, the
background and text are easily confused, which makes it
easy to make mistakes when distinguishing between text and
background.
3) Other disturbance factors: Noise, blur, distortion,
low resolution, uneven illumination, partial occlusion,
and similar interference increase the error rate of scene
text detection.
To address these problems, researchers have proposed
many methods.
2. RESEARCH PROGRESS ON SCENE TEXT DETECTION
In recent years, with the continuous development of
machine vision, work on scene text detection and recognition can be
roughly divided into text detection, text recognition, and
The general steps for text detection are as follows:
Figure 1. Text detection procedures
Scene text detection methods can be roughly divided
into two categories: text positioning based on text
features (i.e., using traditional methods for text
localization), and text positioning based on machine
learning.
2.1 Text positioning based on text features
Text in natural images has the following characteristic
features: a distinctive texture, uniform color within
characters, dense character edges, and a consistent
stroke width. Based on these features, researchers have
proposed several text localization methods.
2.1.1 Text positioning based on texture
The texture-based method treats characters as a special
kind of texture and uses image texture features to
determine whether pixels belong to text. This is possible
because characters are arranged along a certain direction.
The method proceeds as follows: first, the image is
divided into several non-overlapping sub-regions; next,
the corresponding features are extracted from each
sub-region; finally, each sub-region is judged to be text
or non-text based on these features.
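
The three steps above can be illustrated with a minimal Python sketch. It is not the method of any particular paper: the texture feature (mean absolute horizontal and vertical gradient), the fixed threshold, and the function names split_into_blocks, texture_features and classify_blocks are all assumptions chosen for illustration. The survey does not specify which features or which decision rule are used, and a practical system would typically use a richer descriptor and a trained classifier.

```python
import numpy as np


def split_into_blocks(image, block):
    """Yield (row, col, patch) for non-overlapping block x block sub-regions."""
    h, w = image.shape
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            yield r, c, image[r:r + block, c:c + block]


def texture_features(patch):
    """Crude texture descriptor (an assumption, not from the survey):
    mean absolute gradient along the horizontal and vertical axes."""
    p = patch.astype(np.float64)  # avoid uint8 wrap-around in differences
    gx = np.abs(np.diff(p, axis=1)).mean()
    gy = np.abs(np.diff(p, axis=0)).mean()
    return gx, gy


def classify_blocks(image, block=16, threshold=20.0):
    """Return a boolean grid that is True where a sub-region looks text-like.

    The fixed threshold is a hypothetical stand-in for a real classifier.
    """
    rows, cols = image.shape[0] // block, image.shape[1] // block
    mask = np.zeros((rows, cols), dtype=bool)
    for r, c, patch in split_into_blocks(image, block):
        gx, gy = texture_features(patch)
        mask[r // block, c // block] = (gx + gy) > threshold
    return mask


# Synthetic grayscale image: flat background with a striped, stroke-like
# pattern covering the two top-left sub-regions.
img = np.full((32, 64), 200, dtype=np.uint8)
img[0:16, 0:32:2] = 0
mask = classify_blocks(img)
print(mask)
```

In this toy image only the striped sub-regions exceed the threshold, so only they are marked as text-like. The block size controls the trade-off between localization precision and the stability of the texture statistics.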