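High-scoring graduation project: source code, deployment documentation, and all data files for a news log analysis, processing, and visualization system built on a Hadoop + Kafka + Spark big data platform.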

239 files in total

class: 54

xml: 34

properties: 30


Uploaded 2024-04-23 17:33:28 · 19.13MB ZIP


Archive contents (239 files)

$CACHE_FILE$ 1KB

$PRODUCT_WORKSPACE_FILE$ 1KB

zoo.cfg 1KB

container-executor.cfg 318B

TestHBaseSink.class 19KB

HBaseSink.class 19KB

AsyncHBaseSink.class 19KB

TestAsyncHBaseSink.class 15KB

TestRegexHbaseEventSerializer.class 8KB

RegexHbaseEventSerializer.class 7KB

StructuredStreamingKafka$.class 6KB

JDBCSink.class 5KB

KfkAsyncHbaseEventSerializer.class 4KB

SimpleAsyncHbaseEventSerializer.class 4KB

StructuredStreamingKafka$Weblog.class 4KB

SimpleHbaseEventSerializer.class 4KB

TestHBaseSink$CoalesceValidator.class 4KB

Test$.class 3KB

GenerateLogs.class 3KB

StructuredStreamingKafkaTest$.class 3KB

WeblogService.class 3KB

Test$$typecreator3$1.class 3KB

AsyncHBaseSink$FailureCallback.class 3KB

IncrementHBaseSerializer.class 3KB

StructuredStreamingKafka$Weblog$.class 3KB

StructuredStreamingKafka.class 3KB

HBaseSink$4.class 2KB

AsyncHBaseSink$SuccessCallback.class 2KB

WeblogSocket.class 2KB

IncrementAsyncHBaseSerializer.class 2KB

HBaseSink$3.class 2KB

TestHBaseSinkCreation.class 2KB

StructuredStreamingKafka$$typecreator5$1.class 2KB

HBaseSinkConfigurationConstants.class 2KB

SimpleRowKeyGenerator.class 2KB

StructuredStreamingKafkaTest$$anonfun$1.class 1KB

HBaseSink$2.class 1KB

HBaseSink$1.class 1KB

SimpleHbaseEventSerializer$KeyType.class 1KB

Test$$anonfun$main$1.class 1KB

AsyncHBaseSink$2.class 1KB

StructuredStreamingKafka$$anonfun$2.class 1KB

AsyncHBaseSink$3.class 1KB

AsyncHBaseSink$1.class 1KB

AsyncHBaseSink$CellIdentifier.class 1KB

Test$$anonfun$main$2.class 1KB

AsyncHBaseSink$4.class 1KB

StructuredStreamingKafka$$anonfun$1.class 1KB

SimpleAsyncHbaseEventSerializer$1.class 1KB

StructuredStreamingKafkaTest.class 865B

MockSimpleHbaseEventSerializer.class 856B

Test.class 815B

HBaseSink$DebugIncrementsCallback.class 680B

AsyncHbaseEventSerializer.class 564B

HbaseEventSerializer.class 535B

BatchAware.class 158B

hadoop-env.cmd 4KB

yarn-env.cmd 2KB

mapred-env.cmd 918B

log.conf 2KB

Dockerfile 2KB

.dockerignore 82B

ssl-client.xml.example 2KB

ssl-server.xml.example 2KB

.gitignore 38B

index.html 4KB

flume-ng-hbase-sink.iml 15KB

sparkWeb.iml 2KB

generateLogs.iml 664B

sparkScala.iml 80B

hue.ini 41KB

word.input 78B

hbase-protocol-0.98.6-cdh5.3.0.jar 3.39MB

hadoop-common-2.5.0.jar 2.83MB

zookeeper-3.4.5-cdh5.10.0.jar 1.29MB

hbase-client-0.98.6-cdh5.3.0.jar 939KB

mysql-connector-java-5.1.27-bin.jar 852KB

commons-collections-3.2.2.jar 575KB

hbase-common-0.98.6-cdh5.3.0.jar 441KB

fastjson-1.1.33.jar 343KB

commons-httpclient-3.1.jar 298KB

commons-lang-2.6.jar 278KB

commons-beanutils-1.7.0.jar 184KB

json-lib-2.2.3-jdk13.jar 145KB

ezmorph-1.0.6.jar 84KB

commons-logging-1.1.jar 52KB


# Hadoop-logs

A log storage, analysis, and visualization system built on a distributed big data platform. For progress, see the [project diary](./project-diary.md).

## Requirements

- Build a distributed big data platform (Hadoop, Yarn, ZooKeeper, Docker).
- Collect website log data with Flume (tentatively the news-site data from Sogou Labs).
- Process the data in real time (Kafka, Spark) and display it in the front end with Echarts (for example, real-time popularity TopK).
- Store the cleaned data in the data warehouse (Hbase, Hive).
- Process the data offline (MapReduce) and provide interactive reports through Hue (for example, visit counts for different time periods).

## Architecture

![Architecture](http://pic.xuecq.cc/hadoopNews-structures.png)

## Nodes

| Service | Machine 1 | Machine 2 | Machine 3 |
| :-------: | :-------------------------: | :-------------------------: | :----------: |
| HDFS | NameNode/DataNode | NameNode/DataNode | DataNode |
| YARN | ResourceManager/NodeManager | ResourceManager/NodeManager | NodeManager |
| ZooKeeper | ZooKeeper | ZooKeeper | ZooKeeper |
| Kafka | Kafka | Kafka | Kafka |
| Hbase | master/RegionServer | RegionServer | RegionServer |
| Flume | Flume | Flume | Flume |
| Hive | | | Hive |
| Mysql | | | Mysql |
| Spark | | Spark | |
| Hue | | | Hue |

## Usage

1. Configure passwordless ssh login

    ``` shell
    # Only needed at initialization, and only once. Run on both hadoop-1 and hadoop-2
    bash /opt/tools/autoSsh.sh
    ```

2. Start the ZooKeeper cluster

    ``` shell
    # Start the cluster from hadoop-1 alone; unless noted otherwise, all commands below run on hadoop-1
    bash /opt/tools/zoo.sh
    ```

3. Start the JournalNode cluster
4. Pick one node as the NameNode, format it, and start it
5. Sync the NameNode on the standby name node
6. Pick one node and format ZooKeeper
7. Start HDFS
8. Start zkfc

    ``` bash
    # Steps 3 - 8 are already bundled in first.sh to simplify initialization after the cluster is created
    bash /opt/tools/first.sh
    # If this is not the first run, no formatting is needed; just start HDFS and zkfc
    cd /opt/modules/hadoop-2.5.0-cdh5.3.6
    ./sbin/start-dfs.sh
    # zkfc must be started on both hadoop-1 and hadoop-2
    cd /opt/modules/hadoop-2.5.0-cdh5.3.6
    ./sbin/hadoop-daemon.sh start zkfc
    ```

9. Start the Yarn cluster (the standby ResourceManager must be started manually)

    ``` bash
    cd /opt/modules/hadoop-2.5.0-cdh5.3.6
    ./sbin/start-yarn.sh
    # On the other node (hadoop-2)
    cd /opt/modules/hadoop-2.5.0-cdh5.3.6
    ./sbin/yarn-daemon.sh start resourcemanager
    ```

10. Start the log aggregation service

    ``` shell
    cd /opt/modules/hadoop-2.5.0-cdh5.3.6
    ./sbin/mr-jobhistory-daemon.sh start historyserver
    ```

11. Start Hbase

    ``` shell
    cd /opt/modules/hbase-0.98.6-cdh5.3.0
    ./bin/start-hbase.sh
    # Create the table
    ./bin/hbase shell
    create 'weblogs','info'
    ```

12. Start Kafka

    ``` shell
    cd /opt/tools
    ./kafka.sh
    ```

13. Start Flume

    ``` shell
    # Run this on each node separately, because the Flume agent plays a different role on each node
    cd /opt/modules/flume-1.7.0-bin
    ./flume.sh start
    ```

14. Start generating and recording logs

    ``` shell
    # hadoop-2 and hadoop-3 generate the logs
    cd /opt/tools
    ./generateLog.sh
    # On hadoop-1, verify the result with the Kafka consumer or the Hbase shell
    cd /opt/modules/kafka_2.11-0.9.0.0
    ./kfk-weblogs-consumer.sh
    cd /opt/modules/hbase-0.98.6-cdh5.3.0
    ./bin/hbase shell
    ```

15. Create the Hive external table

    ``` bash
    # Hive is configured to depend on Mysql; check that Mysql is running, then initialize Hive on hadoop-3
    cd /opt/modules/hive-0.13.1-cdh5.3.6
    ./bin/hive
    # Create a table structure that matches weblogs
    CREATE EXTERNAL TABLE weblogs(
      id string,
      datetime string,
      userid string,
      searchname string,
      retorder string,
      cliorder string,
      cliurl string
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES("hbase.columns.mapping" = ":key,info:datetime,info:userid,info:searchname,info:retorder,info:cliorder,info:cliurl")
    TBLPROPERTIES("hbase.table.name" = "weblogs");
    # You can now run SQL statements to check the result
    ```

16. Start the Hive background service

    ``` shell
    # To use Hive from Hue, the corresponding Hive process must be running in the background
    cd /opt/modules/hive-0.13.1-cdh5.3.6
    nohup ./bin/hiveserver2 &
    ```

17. Build and start Hue

    ``` shell
    # Make sure the build dependencies are in order before compiling; they are in the conf directory
    cd /opt/modules/hue-3.7.0-cdh5.3.6
    make apps
    # Change the permissions and owner of the desktop.db file
    chmod o+w desktop/desktop.db
    # Start the Hue service
    cd /opt/modules/hue-3.9.0-cdh5.5.0/build/env/bin
    ./supervisor
    # Then open http://hadoop-3:8888 in a browser
    ```

18. Deploy the real-time data processing module

    ``` shell
    # Enter the machine where spark is installed
    docker-compose exec --user kfk kfk2 bash
    # Go to the spark directory
    cd /opt/modules/spark-2.2.0-bin
    # Submit the jar
    ./bin/spark-submit --master local[2] /opt/data/sparkScala.jar
    # If mysql and tomcat are deployed correctly, the site is then accessible; see the web module code for details
    ```
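The hadoop-1 commands for a non-first launch (steps 2 and 7 through 12) always run in the same order, so they can be collected into one helper. The following is a minimal sketch, not part of the project: the names `run_in`, `start_cluster`, and `DRY_RUN` are hypothetical, while the paths and scripts are the ones quoted in the steps above. By default it only prints the plan; it assumes it would be run on hadoop-1 with `DRY_RUN=0` to actually execute.

``` shell
#!/usr/bin/env bash
# Hypothetical helper: replays the hadoop-1 startup order for a non-first launch.
# DRY_RUN defaults to 1, so commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
HADOOP_DIR="/opt/modules/hadoop-2.5.0-cdh5.3.6"   # path from the steps above
HBASE_DIR="/opt/modules/hbase-0.98.6-cdh5.3.0"    # path from the steps above

# run_in DIR CMD...: print the command; execute it inside DIR unless DRY_RUN=1.
run_in() {
  _dir="$1"
  shift
  echo "[${_dir}] $*"
  if [ "$DRY_RUN" != "1" ]; then
    (cd "$_dir" && "$@")
  fi
}

# The && chain stops at the first command that fails.
start_cluster() {
  run_in /opt/tools bash /opt/tools/zoo.sh &&                               # step 2
  run_in "$HADOOP_DIR" ./sbin/start-dfs.sh &&                               # step 7
  run_in "$HADOOP_DIR" ./sbin/hadoop-daemon.sh start zkfc &&                # step 8
  run_in "$HADOOP_DIR" ./sbin/start-yarn.sh &&                              # step 9
  run_in "$HADOOP_DIR" ./sbin/mr-jobhistory-daemon.sh start historyserver && # step 10
  run_in "$HBASE_DIR" ./bin/start-hbase.sh &&                               # step 11
  run_in /opt/tools ./kafka.sh                                              # step 12
}

start_cluster
```

The parts that the steps above assign to other nodes stay manual in this sketch: zkfc and the standby ResourceManager on hadoop-2, and Flume on every node.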
