Conference on Document Analysis and Recognition (ICDAR), remains challenging
because of the wide variety of scenes, backgrounds, and text appearances. Text
localization methods fall into two categories: heuristics-based and learning-based.
Heuristics-based methods. Heuristics-based methods rely on hand-designed rules,
such as connected component analysis (CCA). Lucas et al. [2, 3] used the Maximally
Stable Extremal Regions (MSER) algorithm to detect candidate text and exploited
specific text component features for localization. In [4], the authors presented a novel
approach based on Oriented Stroke Detection to improve the poor performance on
noisy images. Another widely used method is the Stroke Width Transform [2] and its
extensions such as [3]. Epshtein et al. [2] proposed a novel stroke filter to improve the
detection of candidate text regions. Huang et al. [3] introduced a low-level filter called
the Stroke Feature Transform (SFT) by incorporating color cues of text pixels.
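The stroke-based heuristics above share one observation: text strokes have near-constant width, while background clutter does not. The following sketch is not the exact SWT of [2] but a simplified illustration of that idea, approximating per-pixel stroke width by horizontal run lengths in a binary component; the function names and the 0.5 threshold are illustrative assumptions.

```python
def horizontal_run_lengths(component):
    """Lengths of maximal horizontal runs of foreground (1) pixels."""
    runs = []
    for row in component:
        length = 0
        for px in row:
            if px:
                length += 1
            elif length:
                runs.append(length)
                length = 0
        if length:
            runs.append(length)
    return runs

def is_text_candidate(component, max_cv=0.5):
    """Accept a component when stroke-width variation (std/mean) is small."""
    runs = horizontal_run_lengths(component)
    if not runs:
        return False
    mean = sum(runs) / len(runs)
    var = sum((r - mean) ** 2 for r in runs) / len(runs)
    return (var ** 0.5) / mean <= max_cv

# A letter-like component with mostly uniform 2-px strokes,
# and a noisy blob with widely varying run lengths.
letter = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1], [1, 1, 0, 0]]
blob = [[1, 0, 1, 0, 1], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0]]
```

The real SWT instead shoots rays along gradient directions at edge pixels, but the acceptance criterion is the same in spirit: low stroke-width variance within a component.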
Learning-based methods. Learning-based methods treat text localization as an
object recognition problem. They can be roughly divided into two classes,
supervised and unsupervised learning. Popular supervised learning algorithms,
such as the Support Vector Machine (SVM) and AdaBoost, were used for text
localization in [4-6]. Unsupervised classification methods were explored for text
and non-text classification in [7]. Wang et al. [8, 9] proposed to use a Convolutional
Neural Network (CNN) to learn unsupervised features for text recognition.
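To make the supervised setting concrete, here is a minimal linear SVM trained with the Pegasos sub-gradient method on toy "text vs. non-text" feature vectors. This is not the pipeline of [4-6], which use kernel SVMs or AdaBoost over far richer features; the data and hyper-parameters below are invented for demonstration only.

```python
import random

def train_linear_svm(samples, labels, lam=0.01, epochs=200, seed=0):
    """Pegasos: stochastic sub-gradient descent on the hinge loss."""
    rng = random.Random(seed)
    dim = len(samples[0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(samples)), len(samples)):
            t += 1
            eta = 1.0 / (lam * t)
            x, y = samples[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            w = [(1.0 - eta * lam) * wj for wj in w]  # weight shrinkage
            if margin < 1:  # hinge-loss sub-gradient step
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Toy 2-D features (e.g. stroke-width variance vs. edge density); +1 = text.
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)
```

In practice the feature vectors would come from the character or string descriptors discussed later, and a kernel SVM would replace the linear model.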
Among the state-of-the-art methods of text localization, few have focused on
character segmentation, because character segmentation is considered as challenging
as text localization itself. In this paper, we present a hierarchical approach for text
localization that proceeds from characters to strings to words, in a semantically
bottom-up way. Unlike existing methods, which either rely on a few hand-crafted
features [4, 10, 11] or on heavy learning models [12], our approach offers a
systematic way to integrate various effective features extracted or learned at different
semantic levels. We adopt simple learning models, such as kernel SVMs and CNNs,
and focus instead on designing simple yet effective new features. The framework of
our approach is shown in Fig. 1, and some visual results are given in Fig. 2.
Character localization. For character localization, we explore three complementary
types of features: structure features, gradient-based (HOG) features, and CNN-based
features.
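The gradient-based feature can be sketched as follows: a 9-bin, magnitude-weighted histogram of gradient orientations over one grayscale cell, which is the core of HOG. Full HOG adds block normalization over multiple cells; the patch and bin count below are illustrative assumptions, not the paper's exact configuration.

```python
import math

def hog_cell(patch, bins=9):
    """Orientation histogram (0-180 degrees) for one cell, central differences."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle / (180.0 / bins)), bins - 1)] += mag
    return hist

# A vertical edge: all gradients point horizontally, so the
# energy concentrates in the first orientation bin.
patch = [[0, 0, 255, 255]] * 4
feature = hog_cell(patch)
```

Character strokes produce strong, consistently oriented gradients, which is why such histograms separate characters from texture-like background well.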
String localization. For string localization, we first group characters by their
structure features, since the characters in a word often share similar properties such
as color, aspect ratio, and alignment. We then design a nine-dimensional string
feature to train an SVM model that efficiently filters out non-text strings.
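The grouping step might be sketched as below, under assumed similarity rules: two candidate characters join the same string when their heights, mean gray values, and vertical positions are close, and mutually similar characters are merged with union-find. The thresholds and the (x, y, w, h, gray) box format are illustrative, not the paper's exact criteria.

```python
def similar(a, b, height_ratio=1.5, color_tol=40, y_tol=0.5):
    """Assumed pairwise rule: close heights, colors, and baselines."""
    ax, ay, aw, ah, ac = a
    bx, by, bw, bh, bc = b
    if max(ah, bh) / min(ah, bh) > height_ratio:
        return False
    if abs(ac - bc) > color_tol:
        return False
    return abs(ay - by) <= y_tol * min(ah, bh)

def group_characters(chars):
    """Union-find grouping of mutually similar character boxes."""
    parent = list(range(len(chars)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(len(chars)):
        for j in range(i + 1, len(chars)):
            if similar(chars[i], chars[j]):
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(chars)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Two dark characters on one baseline plus one tall bright outlier.
chars = [(0, 10, 8, 12, 30), (10, 11, 8, 12, 35), (30, 40, 20, 40, 220)]
groups = group_characters(chars)
```

The resulting groups are the candidate strings on which the nine-dimensional string feature and the SVM filter would then operate.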
Word localization. In word localization, we use the interval cues between adjacent
characters in a word to learn the best strategy for splitting candidate strings into
words.
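A simple baseline for the interval cue can be sketched as follows: compare each gap between adjacent character boxes against the median gap of the string, and cut wherever a gap is clearly wider. The 2.0 factor is an illustrative assumption standing in for the learned splitting strategy.

```python
import statistics

def split_string_into_words(boxes, factor=2.0):
    """boxes: (x, width) pairs sorted left-to-right; returns lists of boxes."""
    if len(boxes) < 2:
        return [boxes]
    # gap between the right edge of one box and the left edge of the next
    gaps = [boxes[i + 1][0] - (boxes[i][0] + boxes[i][1])
            for i in range(len(boxes) - 1)]
    threshold = factor * statistics.median(gaps)
    words, current = [], [boxes[0]]
    for box, gap in zip(boxes[1:], gaps):
        if gap > threshold:  # wide gap: start a new word
            words.append(current)
            current = []
        current.append(box)
    words.append(current)
    return words

# Four character boxes: small intra-word gaps, one wide inter-word gap.
boxes = [(0, 8), (10, 8), (30, 8), (40, 8)]
words = split_string_into_words(boxes)
```

A learned strategy would replace the fixed factor with a threshold fitted to training data, but the underlying cue, relative inter-character spacing, is the same.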
Our contributions are as follows: 1) a hierarchical text localization framework
that proceeds from characters to strings and then to words; 2) a group of structure
features combined with HOG and CNN features for character localization; 3)
structure and string features designed for string localization.
The rest of the paper is organized as follows: Character localization is introduced
in Section 2. String localization and word localization are proposed in Section 3 and
Section 4, respectively. Experimental results are described in Section 5. Conclusions
are given in Section 6.