人工智能-数据分析-基于Spark计算的实时数据分析的应用研究.pdf
Uploaded 2022-07-08 15:43:55, 4.34 MB PDF, 89 pages
Application Research on Real-Time Data Analysis Based on Spark Computing
Abstract
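With the rapid development of the Internet, data of all kinds is growing
explosively. The steady accumulation of massive data places greater demands on
data storage and computation, and distributed computing frameworks and
distributed storage models have emerged one after another. Among them, the
distributed file storage system HDFS has been widely adopted for its
practicality; meanwhile, the Spark computing framework has drawn broad attention
from academia and society for the high availability of its in-memory computing.
Using these two frameworks sensibly to process log data, and presenting the
results of log analysis with visualization tools, is a pressing problem today.
Reaching that goal requires a data analysis solution tailored to the
corresponding business scenario.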
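This thesis designs and develops a website log data analysis system based on the
Hadoop platform, in which the components of the Hadoop ecosystem provide offline
analysis and computation of log data. The system uses the Spark Streaming
framework for the real-time log computation application and the MapReduce
framework for the offline computation application. The front-end presentation is
designed and developed on the mainstream Java EE platform, with back-end
frameworks such as SpringMVC providing better maintainability and extensibility.
The system also provides a web application built on HTML5 pages, giving users
multidimensional statistics on the analysis results. For data presentation,
interactive charting libraries such as ECharts and Highcharts provide flexible
customization and visualization of the analysis results.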
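The work is divided into two parts: real-time data analysis based on Spark
computing, and offline data analysis based on the Hadoop platform. The thesis
first introduces the relevant background and key technologies, then describes
the platform architecture design, application requirements, module
implementation, and visualization design for real-time and offline data
processing respectively, and finally covers the setup of the test environment
and the analysis of test results.
Keywords: Hadoop, Spark, HDFS, log data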
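The real-time half of the system can be pictured as counting log events over short, fixed intervals. The following is a minimal plain-Python sketch of that micro-batch idea, not the thesis's Spark Streaming code: the `(timestamp, url)` event shape, the 5-second interval, and the name `micro_batch_counts` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def micro_batch_counts(events, batch_seconds=5):
    """Group (timestamp, url) events into fixed intervals and count hits per URL.

    `events` is an iterable of (timestamp_in_seconds, url) pairs; this shape is
    a hypothetical stand-in for parsed web log lines.
    """
    batches = defaultdict(Counter)
    for ts, url in events:
        # Align each event to the start of the interval that contains it.
        batch_start = (ts // batch_seconds) * batch_seconds
        batches[batch_start][url] += 1
    # Return plain dicts, ordered by interval start time.
    return {start: dict(counts) for start, counts in sorted(batches.items())}


# Hypothetical sample events: two 5-second intervals of page hits.
events = [(0, "/index"), (1, "/cart"), (4, "/index"), (5, "/index"), (9, "/pay")]
result = micro_batch_counts(events)
```

In a real streaming job the intervals would be produced continuously as data arrives; here the whole event list is available up front, which keeps the grouping logic easy to inspect.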
Application Research on Real-Time Data Analysis
Based on Spark Computing
Abstract
With the rapid development of the Internet, all kinds of data are growing
explosively. The continuous accumulation of massive data places higher demands
on data storage and computation, and distributed computing frameworks and
distributed storage models keep emerging. The distributed file storage system
HDFS has been widely adopted for its practicality. At the same time, the Spark
computing framework has attracted wide attention from academia and society for
the high availability of its in-memory computing. Making sound use of these two
frameworks to process log data, and displaying the log analysis results with
visualization tools, is an urgent problem. Achieving this goal requires data
analysis solutions tailored to the corresponding business scenarios.
In this paper, we design and develop a website log data analysis system based on
the Hadoop platform, in which the components of the Hadoop ecosystem provide
offline analysis and computation of log data. The application system uses the
Spark Streaming framework to implement real-time log computation and the
MapReduce framework to implement offline computation. The front-end display is
designed and developed on the current mainstream Java EE platform. Back-end
development frameworks such as Spring MVC provide better maintainability and
scalability. The system also provides a web application built on HTML5 pages, so
that users can obtain multidimensional statistics on the analysis results. For
data display, we use interactive charting libraries such as ECharts and
Highcharts to provide flexible customization and visualization of the analysis
results.
The work of this paper is divided into two parts: real-time data analysis based
on Spark computing and offline data analysis based on the Hadoop platform. The
paper first introduces the relevant background and key technologies, then
presents the platform architecture design, application requirements, module
implementation, and visualization design for real-time and offline data
processing respectively, and finally sets up the test environment and analyzes
the test results.
Keywords: Hadoop, Spark, HDFS, log data
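The offline half relies on the MapReduce model: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. A minimal single-process sketch in plain Python follows, assuming hypothetical records with a `browser` field; the record layout and the function names are illustrative and are not taken from the thesis.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (browser, 1) pair per log record (hypothetical record layout)."""
    yield (record["browser"], 1)


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence inside a single process."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical parsed log records.
records = [{"browser": "Chrome"}, {"browser": "Firefox"}, {"browser": "Chrome"}]
browser_counts = run_job(records)
```

On a cluster the map and reduce steps run in parallel on different nodes and the shuffle moves data across the network; the sketch only mirrors the data flow.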
Contents
Chapter 1 Introduction ...... 1
  1.1 Research Background and Significance ...... 1
  1.2 Research Status at Home and Abroad ...... 2
  1.3 Main Work ...... 3
  1.4 Thesis Structure ...... 3
Chapter 2 Background Knowledge and Related Technologies ...... 5
  2.1 Distributed Computing Frameworks ...... 5
    2.1.1 Hadoop Overview ...... 5
    2.1.2 Introduction to Flume ...... 6
    2.1.3 Introduction to Kafka ...... 7
  2.2 Spark Overview ...... 8
    2.2.1 Spark Streaming ...... 9
  2.3 Data Storage Layer ...... 10
    2.3.1 HBase ...... 10
    2.3.2 MySQL Database ...... 10
  2.4 Visualization Technologies ...... 11
    2.4.1 Highcharts ...... 11
    2.4.2 Spring MVC + MyBatis ...... 11
Chapter 3 Real-Time Analysis of Website Log Data ...... 12
  3.1 Requirements Analysis ...... 12
    3.1.1 Data Collection Module Requirements ...... 12
    3.1.2 Data Storage Module Requirements ...... 12
    3.1.3 Data Analysis Module Requirements ...... 13
    3.1.4 Data Presentation Module Requirements ...... 14
  3.2 Application Platform Architecture Design ...... 15
    3.2.1 Application Platform Architecture Design ...... 15
    3.2.2 Cluster Resource Planning ...... 16
    3.2.3 System Data Flow Design ...... 17
  3.3 Implementation of Real-Time Log Data Processing ...... 17
    3.3.1 Data Collection Module Implementation ...... 17
    3.3.2 Data Storage Module Implementation ...... 24
    3.3.3 Data Analysis Module Implementation ...... 25
    3.3.4 Data Presentation Module Implementation ...... 28
Chapter 4 Offline Analysis of Website Log Data ...... 31
  4.1 Requirements Analysis ...... 31
    4.1.1 Collection Requirements ...... 31
    4.1.2 Data Analysis Requirements ...... 32
    4.1.3 Presentation Requirements ...... 32
  4.2 Offline Analysis Platform Architecture Design ...... 33
  4.3 Implementation of Offline Log Data Processing ...... 34
    4.3.1 Collection Module Implementation ...... 34
    4.3.2 Analysis Module Implementation ...... 36
    4.3.3 Presentation Module Implementation ...... 43
  4.4 Database Table Design ...... 44
    4.4.1 Visitor Basic Data Analysis Module Tables ...... 45
    4.4.2 Browser Information Analysis Module Tables ...... 45
    4.4.3 Visitor Region Analysis Module Tables ...... 45
    4.4.4 Visitor Browsing Depth Analysis Module Tables ...... 46
    4.4.5 Visitor Source Analysis Module Tables ...... 46
    4.4.6 Purchase Event Analysis Module Tables ...... 47
Chapter 5 Testing ...... 48
  5.1 Environment Setup ...... 48
    5.1.1 Hadoop 2.X Distributed Cluster Deployment ...... 48
    5.1.2 Flume Installation ...... 48
    5.1.3 HBase Installation ...... 48
    5.1.4 Nginx Server Configuration ...... 48
    5.1.5 Spark 2.2.0 Environment Preparation ...... 49
    5.1.6 Hive Installation and Configuration ...... 49
    5.1.7 Other Environment Preparation ...... 49