International Journal of Hybrid Information Technology
Vol.8, No.2 (2015)
248 Copyright ⓒ 2015 SERSC
2. Related Work
2.1. Document Object Model Structure
In general, an HTML document is composed of various tags and components, and the order in
which they appear in the document matches their display order. The DOM tree is a tree
structure obtained by parsing the HTML document. It is well suited to describing
semi-structured data such as HTML documents, and it accurately describes the relative
positions and hierarchical relationships of its nodes. From the visual angle, a web page can
be viewed as a recursively defined set of visual blocks: each visual block can in turn be seen
as a set of smaller visual blocks. Each node in the DOM tree is a component object (such as an
element, attribute, or text) of the HTML document. The root node of the DOM tree is the HTML
document itself, while the body text, pictures, hyperlinks, and tags that contain no further
content are leaf nodes. All kinds of HTML document processing can be carried out by operating
on DOM nodes.
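As an illustration of the structure described above, the following minimal sketch builds a simplified DOM-like tree with Python's standard html.parser module and collects its leaf nodes. The names Node, DomBuilder, build_dom, and leaves are hypothetical helpers introduced here for illustration; they are not part of any method cited in this paper, and the sketch ignores many details of real HTML parsing (for example, implied closing tags).

```python
from html.parser import HTMLParser

# Tags that never have children or a closing tag in HTML (partial list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One node of the simplified tree (hypothetical helper type)."""

    def __init__(self, kind, name, attrs=None, parent=None):
        self.kind = kind              # "document", "element", or "text"
        self.name = name              # tag name, or the text itself for text nodes
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []


class DomBuilder(HTMLParser):
    """Builds the tree while the standard-library parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("document", "#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node("element", tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node       # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name, then step to its parent.
        node = self.current
        while node is not self.root and node.name != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:                      # skip whitespace-only text
            self.current.children.append(Node("text", text, parent=self.current))


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def leaves(node):
    """Return the leaf nodes of the tree in document order."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found
```

For a small page containing a heading, a paragraph with a hyperlink, and an image, leaves(build_dom(page)) returns the text nodes and the image element in the same order in which they appear in the document.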
2.2. Page Segmentation Based on Layout
One efficient and widely studied layout-based page segmentation method is VIPS (Vision-based
Page Segmentation) [4]. By detecting useful visual cues on top of the DOM structure, it obtains
a tree-like, vision-based content structure of the web page. Visual cues such as font, color,
and size are used to detect blocks. The algorithm has three steps: 1) extract visual blocks:
the webpage is recursively divided into separate semantic blocks; 2) detect separators: the
visual horizontal and vertical lines that divide the webpage are located; 3) reconstruct the
content structure. Segmentation continues until the coherence of every block exceeds the
threshold.
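The three steps can be sketched as follows. This is a heavily simplified illustration, not the VIPS implementation of [4]: the Block type, the coherence score (the share of leaf blocks with the dominant style), the default threshold of 0.9, and the restriction to horizontal separators are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Block:
    """A rectangular page region (hypothetical type for this sketch)."""
    name: str
    style: tuple = ("serif", 12, "black")   # (font, size, color) visual cues
    top: int = 0
    bottom: int = 0
    children: list = field(default_factory=list)


def leaf_blocks(block):
    if not block.children:
        return [block]
    return [leaf for child in block.children for leaf in leaf_blocks(child)]


def coherence(block):
    """Assumed score in (0, 1]: share of leaf blocks that have the dominant style."""
    styles = [leaf.style for leaf in leaf_blocks(block)]
    return Counter(styles).most_common(1)[0][1] / len(styles)


def extract_blocks(block, threshold):
    """Step 1: split recursively until each block's coherence exceeds the threshold."""
    if not block.children or coherence(block) > threshold:
        return [block]
    result = []
    for child in block.children:
        result.extend(extract_blocks(child, threshold))
    return result


def detect_separators(blocks):
    """Step 2 (horizontal only): vertical gaps between consecutive blocks."""
    ordered = sorted(blocks, key=lambda b: b.top)
    return [(a.bottom, b.top) for a, b in zip(ordered, ordered[1:]) if b.top > a.bottom]


def segment(page, threshold=0.9):
    """Step 3: return a flat content structure of block names and separators."""
    blocks = extract_blocks(page, threshold)
    return {"blocks": [b.name for b in blocks],
            "separators": detect_separators(blocks)}
```

Calling segment on a page whose navigation bar uses a single style and whose body mixes a large title with normal text keeps the navigation bar as one block and splits the body into its title and paragraph. A higher threshold yields finer blocks, which is how the partition granularity is controlled in this sketch.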
VIPS achieves both an appropriate partition granularity and coherent semantic
aggregation. However, its complexity is high, and it is difficult to ensure the
consistency and completeness of its heuristic rules.
Many researchers have drawn inspiration from VIPS. The authors of [5] proposed CTVPS, which
markedly reduces the time and space complexity. However, CTVPS is not suitable for the
“div+css” layout, which is now very popular. Most page segmentation methods are designed for a
specific application. For example, [2] proposed a webpage segmentation method that extracts
subject information by deleting the blocks that are irrelevant to the subject. This method
is not generic, even though it improves retrieval performance.
2.3. Block Importance Model
Although page segmentation methods go one step further by looking into the structure of a
webpage instead of treating it as a single unit, they do not differentiate the function of the blocks in