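Research and Implementation of a Web Text Mining System Based on Spark
Abstract
The rapid development of communication and electronic technology has driven explosive growth in the number of web pages. Websites of every kind have sprung up like mushrooms after rain, and tens of billions of pages are now scattered across the Internet, each with its own structure. An efficient processing method is therefore urgently needed to help people quickly extract valuable information from this huge body of Internet text and to replace traditional manual data processing. In recent years, big data technologies represented by the Hadoop framework and the parallel processing framework Spark have emerged, offering new ideas and technical support for storing and processing massive data. Spark in particular, as a new-generation computing framework built on in-memory computation, processes data more efficiently than Hadoop and also supports real-time computing and interactive data access, overcoming Hadoop's weaknesses in those applications. This system therefore uses Spark as the tool for implementing the text mining process, and on that basis builds a Web text mining system aimed at trending public-opinion topics on Sina Weibo. The main work can be summarized as follows:
1. In the background and technology survey, the basic concepts and general workflow of Web text mining are studied and introduced. The technologies involved in each stage are then presented in the order of the system's processing flow, including web crawlers, mining tools, and data visualization. HDFS and the parallel computing framework Spark receive particular attention.
2. In the algorithm study, the classic feature extraction algorithm TF-IDF is examined in depth, covering its principle, strengths, shortcomings, and possible improvements. The emphasis is on the principle and content of the algorithm, together with ideas for optimizing it.
3. In the design part, the system is divided by function into three main modules: a data collection module, a text mining module, and a data visualization module. The functions and architecture of each module are described, and the technology choices and execution flow of each module are determined.
4. In the implementation part, the detailed setup and deployment of the HDFS and Spark environments are described first. The concrete implementation of the three modules is then presented in the order of the system design, covering both the functional implementation and the interaction implementation. Finally, the system was run and tested: about 1 million microblog messages were crawled for a preliminary test, which confirmed that the system has good usability.
Keywords: Web text mining, Spark, big data, TF-IDF algorithm, online public opinion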
Research and Implementation of a Web Text Mining System Based on Spark
Abstract
The rapid development of communication and electronic technology has led to explosive growth in the number of Internet pages. Sites of all kinds have sprung up like mushrooms, and tens of billions of pages are scattered throughout the Internet. Because these pages all have different structures, there is an urgent need for an efficient means of quickly extracting valuable information from the huge body of Internet text, replacing traditional manual data processing. In recent years, big data technologies represented by the Hadoop framework and the parallel processing framework Spark have begun to rise, providing new ideas and technical support for the storage and processing of massive data. In particular, the new-generation computing framework Spark uses memory-based computing at its core, which gives it higher processing efficiency than Hadoop. It also supports real-time computing and interactive data access, overcoming the shortcomings of Hadoop in these applications. Therefore, this system uses the parallel computing framework Spark as the tool for implementing the text mining process, and builds on it a Web text mining system for public-opinion hotspots on Sina Weibo. The main work can be summarized as follows:
1. In the knowledge preparation and technology survey part, the basic concepts and general workflow of Web text mining are studied and introduced. Then, following the processing flow of the whole system, the technologies involved in each part are introduced separately, including web crawlers, mining tools, and data visualization. HDFS and the parallel computing framework Spark are introduced in particular detail.
2. In the algorithm research part, the classic feature extraction algorithm TF-IDF is studied in depth. Its principle, advantages, shortcomings, and improvements are described in turn. The discussion focuses on the principle and content of the algorithm and also offers ideas for optimizing it.
3. In the design part, the system is divided by function into three main modules: a data acquisition module, a text mining module, and a data visualization module. The function and architecture of each module are introduced, and the technology selection and execution flow of each module are determined.
4. In the implementation part, the detailed deployment of the HDFS and Spark environments is introduced first. Then, following the order of the system design, the implementation of the three modules is described in detail, covering both the functional implementation and the interaction implementation. Finally, the system was run and tested: about 1 million microblog messages were crawled for an initial test, which confirmed that the system has good usability.
Keywords: Web Text Mining, Spark, Big Data, TF-IDF Algorithm, Internet Public Opinion
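As a rough illustration of the TF-IDF feature extraction named in item 2, the following is a minimal single-machine sketch in plain Python. It assumes the common textbook form, in which term frequency is the share of a document's tokens taken up by a term and inverse document frequency is log(N / df), without smoothing. The abstract does not say which variant the thesis uses or how it is distributed on Spark, so the function name `tf_idf` and the sample `posts` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed textbook form (not confirmed by the thesis):
      tf(t, d) = occurrences of t in d / number of tokens in d
      idf(t)   = log(N / df(t)), df(t) = number of documents containing t
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document it appears in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-tokenized posts; real microblog text would
# first need word segmentation and stop-word removal.
posts = [
    ["spark", "memory", "fast"],
    ["hadoop", "disk", "batch"],
    ["spark", "hadoop", "cluster"],
]
scores = tf_idf(posts)
```

In this form a term that appears in every document gets a weight of zero, and a term confined to a single document is weighted most heavily. That is the intuition behind using TF-IDF to surface distinctive keywords for hot topics.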
Contents
Research and Implementation of a Web Text Mining System Based on Spark
Abstract
Chapter 1 Introduction
1.1 Research background
1.2 State of research
1.2.1 Text mining technology
1.2.2 Big data processing technology
1.3 Content and significance of this thesis
Chapter 2 Related knowledge and technologies
2.1 Web crawlers
2.2 Text mining
2.3 Distributed storage system: HDFS
2.4 Big data processing framework: Spark
2.4.1 Spark core components
2.4.2 Resilient Distributed Datasets (RDD)
2.4.3 Spark workflow
2.4.4 Advantages of Spark
2.5 Data visualization
Chapter 3 Mining algorithm research
3.1 The TF-IDF algorithm
3.1.1 Introduction to the TF-IDF algorithm
3.1.2 Theoretical basis and shortcomings of the TF-IDF algorithm
3.1.3 Ideas for improving the algorithm
Chapter 4 Overall system design
Text collection
Data storage
Text analysis
Result visualization
4.2 System design overview
4.2.1 System architecture design
4.2.2 Module division and technology selection
4.2.3 Data processing flow
Chapter 5 System implementation
5.2 System environment setup
5.2.1 Setting up HDFS
5.2.2 Installing the Spark framework
5.3 Concrete implementation of the module design
5.3.1 Data collection module
Chapter 6 Summary and outlook
6.1 Summary of this thesis
6.2 Outlook
References
Acknowledgements