CHINESE ABSTRACT
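Internet resources contain a great deal of useful information, and the amount of that
information is still growing exponentially, which gives users an extremely valuable
information source. However, because Internet information is massive, heterogeneous,
volatile and non-semantic, it is not easy to obtain the required information quickly and
accurately from the huge number of web pages, and automated tools that help users
acquire information on the Internet effectively are urgently needed.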
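This thesis proposes a new mechanism for automatic information extraction based on the
semantic structure of a website. It approaches web page information extraction from the
semantics of the website itself, as expressed by the logical structure of the website, so
that the computer understands the meaning of the information to a certain extent and
information extraction becomes more effective.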
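This thesis builds an information extraction system based on the semantic structure of a
website. The system consists of three main parts: a website page searcher, a website
semantic structure generator, and a web page information extractor. The website page
searcher searches the target website, provides the link relations of the website so that
the website directed graph can be generated, and provides the collected pages for
information extraction. The website semantic structure generator, building on the
classification that the website administrator has made according to his or her
understanding of page content, converts the website directed graph (the physical
structure of the website) into the semantic structure of the website, that is, a
category relationship graph organized by the semantic classification of the website. The
web page information extractor then extracts the relevant information from the pages on
the basis of these category relationships.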
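This thesis implements a website Spider that traverses and collects a website and
generates the website directed graph, and it describes in detail several key issues in
implementing the Spider. It proposes a web page classification based on website
semantics and, starting from the website directed graph, uses this semantic
classification of pages to generate a website semantic structure that reflects the
semantics of the website. Information extraction is then carried out on the basis of
that semantic structure, and the thesis proposes an algorithm for extracting web page
information that incorporates the visual information of the page and is based on
matching pages of the same category.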
Keywords: web information extraction; web page classification; tag tree; web page noise
reduction; Spider
Classification number: TP391
ABSTRACT
The Internet is an extremely large information repository whose volume of data keeps
growing at an exponential rate, which makes it a valuable source of information for
users. However, information on the Internet is massive, heterogeneous, volatile and
non-semantic, so it is difficult to retrieve relevant data quickly and accurately from
the enormous number of web pages. Robust, flexible and automatic tools that help users
retrieve information from the Internet effectively are therefore urgently needed.
This thesis presents a novel mechanism for automatic web information extraction based
on the semantic structure of a website. It extracts information starting from the
semantics that the logical structure of the website itself expresses, so that the
computer can understand the meaning of the information to a certain extent and
information extraction becomes more effective.
This thesis designs a web information extraction system based on the semantic
structure of a website. The system consists of three main components: a website spider,
a website semantic structure generator, and a web information extractor. The website
spider crawls the target website, supplies the link relations from which the website
directed graph is generated, and supplies the downloaded pages for information
extraction. The website semantic structure generator converts the website directed
graph (the physical structure of the website) into the website semantic structure,
relying on the classification of pages that the website designer has already made
according to his or her understanding of their content; the result is a category
relationship graph that follows the semantic classification of the website. The web
information extractor then extracts the relevant information from the pages on the
basis of this classification.
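
As an illustration of the spider's role described above, the following is a minimal
sketch of how link relations could be turned into a website directed graph. It is not
the thesis's implementation: the breadth-first traversal order, the function name
build_site_graph, the max_pages limit and the fetch_links callable (which stands in for
downloading a page and parsing its links) are all assumptions made for the example.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse


def build_site_graph(start_url, fetch_links, max_pages=1000):
    """Breadth-first traversal of one website.

    fetch_links(url) is a hypothetical callable returning the raw href values
    found on the page at `url`; a real spider would download and parse here.
    Returns a dict mapping each visited URL to the sorted list of same-site
    URLs it links to (the adjacency list of the website directed graph).
    """
    host = urlparse(start_url).netloc
    graph = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(graph) < max_pages:
        url = queue.popleft()
        targets = set()
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments.
            target, _ = urldefrag(urljoin(url, href))
            if urlparse(target).netloc != host:
                continue  # ignore links that leave the site
            targets.add(target)
            if target not in seen:
                seen.add(target)
                queue.append(target)
        graph[url] = sorted(targets)
    return graph


# A tiny in-memory "website" used in place of real downloads (made-up URLs).
site = {
    "http://example.com/": ["/news/", "/about.html", "http://other.org/x"],
    "http://example.com/news/": ["a.html", "/"],
    "http://example.com/news/a.html": ["/news/#top"],
    "http://example.com/about.html": [],
}
graph = build_site_graph("http://example.com/", lambda u: site.get(u, []))
for page, links in graph.items():
    print(page, "->", links)
```

The adjacency list only records the physical link structure; grouping its nodes by
category to obtain the semantic structure is a separate step that this sketch does not
attempt.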
A website spider was implemented in this thesis. It can traverse a website, download
its web pages and generate the website directed graph, and several key issues in its
implementation are described in detail. The thesis also proposes a web page
classification based on the semantics of the website. Once the website directed graph
has been constructed, a website semantic structure that reflects the semantics of the
website is generated from it according to this semantic classification of pages. Once
the semantic structure of the website has been constructed, web information extraction
is carried out on the basis of this structure. The thesis presents a web information
extraction algorithm that is based on matching web pages in the same category and that
incorporates the visual information of the web page.
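
The idea of matching pages in the same category can be illustrated with a deliberately
simplified sketch. It assumes that text which recurs at the same tag path in a sibling
page of the same category is template material (navigation, footer) and that the
remaining text is page-specific content. This is a stand-in technique rather than the
thesis's algorithm: it ignores visual information entirely, and the names
TextPathParser, text_paths and extract_by_matching are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextPathParser(HTMLParser):
    """Collects (tag path, text) pairs, a flattened view of the tag tree."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.items = []   # (path, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_paths(html):
    parser = TextPathParser()
    parser.feed(html)
    parser.close()
    return parser.items


def extract_by_matching(page, sibling):
    """Keep text of `page` that does not recur at the same path in `sibling`."""
    shared = set(text_paths(sibling))
    return [item for item in text_paths(page) if item not in shared]


# Two made-up pages from the same category that share a template.
page_a = ("<html><body><div><a href='/'>Home</a></div><h1>Paper A</h1>"
          "<p>Abstract of A</p><div>Copyright</div></body></html>")
page_b = ("<html><body><div><a href='/'>Home</a></div><h1>Paper B</h1>"
          "<p>Abstract of B</p><div>Copyright</div></body></html>")
content = extract_by_matching(page_a, page_b)
print(content)
```

Comparing against a single sibling keeps the example short; comparing against several
pages of the same category would make the template detection less sensitive to content
that two pages happen to share.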
KEYWORDS: web information extraction; web page classification; tag tree; web page
noise reduction; spider
CLASSNO: TP391
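ACKNOWLEDGEMENTS
The work in this thesis was completed under the careful guidance of my respected
supervisor, Qu Youli (瞿有利). His profound knowledge, rigorous scholarship, tireless
drive to improve, ability to see the essence of a problem both in fine detail and from a
commanding height, and scientific working methods have all helped and influenced me
greatly. I sincerely thank him for his care and guidance over the past three years,
which not only taught me a great deal about research but also gave me excellent
opportunities for practice.
I sincerely thank Professors Huang Houkuan (黄厚宽), Tian Shengfeng (田盛丰), Yu Jian
(于剑) and Wang Zhihai (王志海), and teachers Wei Mingyuan (魏名元), Lin Youfang (林有芳)
and Tian Fengzhan (田凤占), for their teaching and help over the past three years. Their
meticulous approach to scholarship will always be an example for me.
While I was working in the laboratory and writing this thesis, my fellow students Zhang
Xiaofeng (张晓峰), Zhou Cheng (周成), Chen Liang (陈亮) and Che Dejun (车德军), together
with the doctoral students and the senior and junior members of the laboratory, gave
warm help with the thesis and with my research, and I express my gratitude to them here.
I also thank my parents, whose understanding and support allowed me to concentrate on
completing my studies.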