
Hunan University graduation thesis

School of Software, Hunan University

Implementation Techniques for Extracting Text Objects from Internet Web Pages

Abstract

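The Internet holds a vast amount of structured information about real-world objects. To meet the serious challenge posed by the information explosion, and to extract and integrate the many kinds of text object information found on web pages so that object-level search becomes possible, automated techniques are urgently needed to help people quickly find the information they actually need in this mass of data. Web page text object extraction is one way to address this problem.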

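Building on traditional information extraction theory and methods, and targeting the currently popular blog domain, this thesis proposes a blog body-text extraction algorithm based on HTML features and machine learning. Within this algorithm, it studies the features of blog pages, proposes a page-segmentation algorithm based on HTML tag features, trains a decision tree on a blog data set, tests and evaluates the algorithm using WEKA, a dedicated statistical tool, and summarizes the algorithm's strengths and the areas that could be improved. Finally, it presents the system architecture and interface of Geeseek, a blog search engine built on this extraction algorithm. The system is a new type of vertical search engine that can search blogs and blog posts quickly and effectively. As far as the author knows, Geeseek is also the first blog search engine developed at a university in China.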

Keywords: Internet, information explosion, information extraction, blog, HTML, machine learning, decision tree, search engine, Geeseek


Implementation of text object extraction for Internet web pages

Author: Zhang Hui

Supervisor: Lin Yaping

Abstract

A large amount of semi-structured information describing real-world objects is now available on the Internet. To meet the severe challenge posed by the information explosion, and to extract and integrate the many kinds of text object information on web pages in support of object-level search, automated technologies are urgently needed to help people find the information they really need within this mass of data. Text object extraction is one method of solving this problem.

Building on the traditional theory of information extraction and targeting the blog domain, this paper proposes an algorithm that extracts the body text of blog articles using HTML features and machine learning. It analyzes the features of blog pages, introduces a page-partitioning algorithm based on HTML tag features, trains a decision tree on a blog data set, tests and evaluates the algorithm with WEKA, a dedicated statistical tool, and summarizes its advantages as well as the points that need improvement. Finally, it presents the system architecture and interface of Geeseek, a blog search engine that applies this extraction technology to blog pages. The system is a new type of vertical search engine and can search blog home pages and blog article pages quickly and effectively. As far as we know, Geeseek is the first blog search engine developed at a university in China.

Key words: Internet, information explosion, information extraction, blog, HTML, machine learning, search engine, decision tree, Geeseek
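The abstract describes two steps: partitioning a page into blocks by HTML tag features, then deciding which blocks are body text. The Python sketch below illustrates the idea and is not the thesis's implementation. The tag set, the two features (text length and link density), the thresholds, and every function and class name are assumptions made for illustration. The hand-written rule `looks_like_body` only stands in for the decision tree that the thesis trains.

```python
from html.parser import HTMLParser

# Hypothetical choice of block-level tags used as split points.
BLOCK_TAGS = {"div", "p", "td", "li", "table", "ul", "ol", "h1", "h2", "h3"}


class BlockSplitter(HTMLParser):
    """Split a page into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: text plus link character count
        self._text = []        # text fragments of the block being built
        self._link_chars = 0   # non-whitespace characters seen inside <a>
        self._in_link = 0      # nesting depth of <a>
        self._skip = 0         # nesting depth of <script>/<style>

    def _flush(self):
        # Close the current block; whitespace-only blocks are discarded.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append({"text": text, "link_chars": self._link_chars})
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len("".join(data.split()))

    def close(self):
        super().close()
        self._flush()


def block_features(html):
    """Return one feature row per block: text, text_len, link_density."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    rows = []
    for block in splitter.blocks:
        text_len = len("".join(block["text"].split()))  # non-whitespace chars
        rows.append({
            "text": block["text"],
            "text_len": text_len,
            "link_density": block["link_chars"] / text_len,
        })
    return rows


def looks_like_body(row, min_len=40, max_link_density=0.3):
    # Illustrative hand-written rule; a trained decision tree would replace it.
    return row["text_len"] >= min_len and row["link_density"] <= max_link_density


page = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/archive">Archive</a></div>
<div id="post"><p>Today I finished the first version of the crawler and
ran it against a few hundred blog pages to see what it returns.</p></div>
<div id="footer"><a href="/about">About this blog</a></div>
</body></html>
"""
rows = block_features(page)
body = [r["text"] for r in rows if looks_like_body(r)]
```

On this toy page the navigation and footer blocks consist entirely of link text, so the rule rejects them and keeps only the post paragraph. In a trained setting, feature rows like these would be labelled by hand and passed to a decision tree learner in place of the fixed thresholds.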

Declaration of Originality and Authorization for Use of the Graduation Project (Thesis)


1. Introduction
1.1 Background and objectives
1.2 State of research in China and abroad
1.2.1 Research in China
1.2.2 Research abroad
1.3 Research methods
1.4 Thesis structure and research content
2. Overview of Web information extraction and web page text object extraction
2.1 The concept of Web information extraction
2.2 Methods of Web information extraction
2.3 Typical workflow of Web information extraction
2.4 Theory and methods of web page text object extraction
3. Design of the blog body-text extraction system
3.1 Overview of blog search
3.2 The blog body-text extraction process
3.2.1 Classification
3.2.2 Block partitioning
3.2.3 Statistical training to obtain the decision tree
3.3 Testing and evaluation of the algorithm
3.4 Significance of the blog body-text extraction algorithm and reflections
4. The Geeseek search engine based on blog body-text extraction
4.1 Introduction to the Geeseek system
4.2 The blog body-text extraction module
4.2.1 Overview of the blog body-text extraction module
4.2.2 Main data classes of the blog body-text extraction module
4.2.3 Implementation approach of the blog body-text extraction module
4.3 System demonstration
5. Conclusion
Acknowledgements
References
