Research and Implementation of a View Block Partition Method for Theme-oriented Webpages


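Research on a view block partition method for theme-oriented webpages concerns the structured analysis of webpage content, with the goal of improving the efficiency and accuracy of information retrieval. Webpage analysis methods usually treat a page as a whole and parse its content and layout to extract useful information. The VTPS algorithm (Visual-Textual Page Segmentation Algorithm) mentioned in the paper segments a theme-oriented webpage into semantic blocks, each of which serves as a unit of analysis. Because the method considers both the visual layout and the textual content of a page, it extracts and organizes information more effectively. Based on the file content provided, the key points are as follows:

1. **Importance of view block partitioning**:
   - In the Internet age, webpages have become an important way for people to obtain information. As the information spreading on the Web grows rapidly, users increasingly need to pick out useful information from a mass of content.
   - Webpage content has different importance in different application scenarios. Page segmentation, as a means of extracting information from webpages, is therefore receiving more and more attention.

2. **Differences between webpages and traditional text**:
   - Webpages differ markedly from traditional text. Their dynamic and informal nature means that traditional text analysis methods are not fully applicable to webpage analysis.
   - Webpages offer a wide range of content presentation forms and interaction features. Visually, a webpage conveys information through its presentation format, so its layout features are more important than its content features.

3. **Page segmentation based on visual features**:
   - Visual features play a key role in page segmentation. Applications of this technology include information extraction, information retrieval, information storage, webpage classification, and web adaptation.
   - Although vision-based page segmentation methods use visual features to assist information extraction and classification, most of them lack generality because they are designed for a specific application.

4. **VTPS algorithm (Visual-Textual Page Segmentation Algorithm)**:
   - The core of VTPS is to partition a webpage into semantic blocks. The parts of a page are separated according to their content and visual layout, which makes subsequent information extraction and organization more efficient.
   - The spatial and content features of each semantic block are extracted to build a feature vector. Based on these feature vectors, the SVM (support vector machine) learning algorithm is used to train on and classify the theme information in webpages.

5. **Theme-oriented webpage partition model**:
   - A partition model for theme-oriented webpages is proposed. This is one of the main contributions of the paper.
   - The model segments and classifies webpages effectively, improving the quality and efficiency of information retrieval.

6. **Implementation of the method**:
   - The VTPS algorithm is implemented to partition webpages into semantic blocks.
   - A feature vector is constructed for each semantic block, and a machine learning method is used to train on and classify webpage content of different themes.
   - Classification experiments verify the efficiency of the proposed method.

This analysis shows that view block partitioning for theme-oriented webpages matters for extracting and organizing information on the Web. It improves the efficiency of identifying information within a webpage and the accuracy of information retrieval for users. As the core technique for segmenting and classifying theme-oriented webpages, the VTPS algorithm contributes positively to the field of webpage analysis.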


International Journal of Hybrid Information Technology

Vol.8, No.2 (2015), pp.247-256

http://dx.doi.org/10.14257/ijhit.2015.8.2.23

ISSN: 1738-9968 IJHIT

Research and Implementation of View Block Partition Method for Theme-oriented Webpage

Lv Fang, Huang Junheng, Wei Yuliang and Wang Bailing

Harbin Institute of Technology at Weihai, Shandong, 264209

{huangjh, wbl}@hitwh.edu.cn

Abstract

A semantic block is treated as the unit of analysis for a webpage. First, we implement the VTPS algorithm to partition a webpage into semantic blocks. Then, we propose an algorithm to extract the spatial and content features and construct a feature vector for each block. Based on these vectors, the SVM learning algorithm is applied to train on and classify the various theme-oriented webpage blocks. Finally, the classification experiments show the efficiency of this method.

Keywords: Semantic Block, VTPS Algorithm, Feature Vector, Classification

1. Introduction

In the Internet age, the Web has become an important means for people to obtain information online. With the rapid increase of information spreading on the Web, an effective method for users to discern useful information from junk is in great demand. Different information inside a web page has different importance for different applications. Therefore, page segmentation technology, which is useful for extracting information from a webpage, has gained more attention. A web page is radically distinct from traditional text due to its dynamic and informal nature. Meanwhile, a webpage has extensive content presentation and interaction features. Visually, the message of a webpage can be conveyed through its presentation format. The layout features of a webpage are much more important than the content features [1]. Page segmentation is facilitated by visual features and serves applications such as information extraction [2], information retrieval [3], information storage, web page classification, and web adaptation. However, most vision-based page segmentation methods lack generality because they are designed for a specific application.

The main contributions of this paper are: 1) an introduction to the related technology; 2) the VTPS algorithm, which is proposed to partition a web page into semantic blocks; 3) a theme-oriented webpage partition model, which is proposed to automatically assign different function areas in the webpage and takes into account both spatial features and content features; 4) experiments that test the model.
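As a rough illustration of what "taking into account both spatial features and content features" could look like, the sketch below builds one feature vector per block. The `Block` fields, the choice of features, and the page dimensions are all assumptions made for illustration; this excerpt does not list the features the authors actually use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A semantic block: bounding box in pixels plus simple content counts.

    All fields are hypothetical; the paper's real feature set may differ.
    """
    x: float
    y: float
    width: float
    height: float
    text_length: int   # number of characters of visible text
    link_count: int    # number of hyperlinks inside the block
    image_count: int   # number of images inside the block


def feature_vector(block: Block, page_width: float, page_height: float) -> List[float]:
    """Concatenate page-normalized spatial features with content features."""
    spatial = [
        block.x / page_width,
        block.y / page_height,
        block.width / page_width,
        block.height / page_height,
    ]
    content = [
        float(block.text_length),
        float(block.link_count),
        float(block.image_count),
    ]
    return spatial + content


# A wide, short block at the top of the page with many links (e.g. a navigation bar).
nav = Block(x=0, y=0, width=1000, height=100, text_length=80, link_count=10, image_count=1)
vec = feature_vector(nav, page_width=1000, page_height=2000)
print(vec)
```

Vectors of this shape, paired with block labels, are the kind of input an SVM classifier (for example scikit-learn's `sklearn.svm.SVC`) expects. In practice the content counts would likely need scaling before training.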


2. Related Work

2.1. Document Object Model Structure

In general, an HTML document is composed of various tags and components, and the order in which they appear in the document is the same as their display order. A DOM tree is a tree structure obtained by parsing the HTML document. The DOM tree is well suited to describing semi-structured data such as an HTML document, and it accurately describes the relative positions and hierarchical relationships of its nodes. From the visual angle, a web page can be viewed as a set of visual blocks defined recursively: a visual block can be seen as a set of smaller visual blocks. Each node in the DOM tree is a component object (such as an element, attribute, or text) of the HTML document. The root node of the DOM tree is the HTML document, and all of the body text, pictures, hyperlinks and tags are leaf nodes. All kinds of HTML document processing can be completed by operating on DOM nodes.
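To make the tree structure concrete, here is a minimal sketch that parses a small HTML string into a simplified DOM-like tree with Python's standard `html.parser` module and lists its leaf nodes. The `Node` and `TreeBuilder` classes and the sample markup are illustrative assumptions, not part of the paper. A real browser DOM handles many cases (implied tags, malformed markup) that this sketch ignores.

```python
from html.parser import HTMLParser


class Node:
    """A simplified tree node: an element, or a '#text' node carrying text."""
    def __init__(self, name, parent=None, text=None):
        self.name = name
        self.parent = parent
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree; not a full HTML5 parser."""
    VOID = {"br", "img", "hr", "input", "meta", "link"}  # elements with no end tag

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name, then close it.
        node = self.current
        while node is not self.root and node.name != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(
                Node("#text", parent=self.current, text=data.strip()))


def leaves(node):
    """Return the leaf nodes of the subtree in document order."""
    if not node.children:
        return [node]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result


builder = TreeBuilder()
builder.feed("<html><body><h1>News</h1>"
             "<p>Hello <a href='/x'>link</a><img src='a.png'></p></body></html>")
builder.close()
leaf_names = [(n.name, n.text) for n in leaves(builder.root)]
print(leaf_names)
```

In this sketch the text runs and the image end up as leaves, while the `a` element holding the link text is an inner node. The leaves come out in document order, which matches the display-order property described above.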

2.2. Page Segmentations based on Layout

One efficient and widely researched page segmentation method based on webpage layout is VIPS (Vision-based Page Segmentation) [4]. By detecting useful visual cues based on the DOM structure, a tree-like vision-based content structure of the web page is obtained. Visual cues such as font, color and size are used to detect blocks. The algorithm has three steps: 1) extract visible blocks: the webpage is divided recursively into several separate semantic blocks; 2) detect separators: find the visual vertical and horizontal lines that divide the webpage; 3) reconstruct the content structure. The segmentation does not stop until the coherence of each block is greater than the threshold.
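The stopping rule of the recursion can be sketched as follows. This is a simplified illustration, not the VIPS implementation: each block is assumed to carry a precomputed `coherence` score in [0, 1], whereas VIPS derives such a score from visual cues, and the block names and values are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualBlock:
    name: str
    coherence: float  # hypothetical precomputed score in [0, 1]
    children: List["VisualBlock"] = field(default_factory=list)


def segment(block: VisualBlock, threshold: float) -> List[VisualBlock]:
    """Divide recursively; stop at blocks whose coherence exceeds the threshold.

    A block with no children cannot be divided further, so it is returned as-is.
    """
    if block.coherence > threshold or not block.children:
        return [block]
    result = []
    for child in block.children:
        result.extend(segment(child, threshold))
    return result


page = VisualBlock("page", 0.2, [
    VisualBlock("header", 0.9),
    VisualBlock("main", 0.4, [
        VisualBlock("article", 0.8),
        VisualBlock("sidebar", 0.7),
    ]),
    VisualBlock("footer", 0.9),
])

names = [b.name for b in segment(page, threshold=0.6)]
print(names)  # ['header', 'article', 'sidebar', 'footer']
```

Lowering the threshold gives a coarser partition: with `threshold=0.3` the `main` block is already coherent enough and is not divided further.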

VIPS achieves both an appropriate partition granularity and coherent semantic aggregation. However, its complexity is high, and it is difficult to ensure the consistency and integrity of its heuristic rules.

Many researchers have found inspiration in VIPS. The authors of [5] proposed CTVPS, which clearly reduces the time and space complexity. However, CTVPS is not suitable for the "div+css" layout, which is very popular now. Most page segmentation methods target a specific application: [2] proposed a webpage segmentation method that obtains the subject information by deleting the blocks that have no relevance to the subject. This method is not generic, even though the retrieval performance is improved.

2.3. Block Importance Model

Though page segmentation methods go one step further by looking into the structure of a webpage instead of treating it as a unit, they do not differentiate the function of the blocks in
