## News Portal Big Data Real-Time System Based on Spark 2.2
### 1. Overview
The [project code](https://github.com/pkeropen/BigData-News) is based on [基于Spark2.x新闻网大数据实时分析可视化系统项目](https://blog.csdn.net/u011254180/article/details/80172452) and [大数据项目实战之新闻话题的实时统计分析](http://www.raincent.com/content-10-11077-1.html); thanks to the authors for sharing their write-ups!
### 2. Environment Setup
##### 2.1 CDH 5.14.2 (installation steps: see [this guide](https://blog.51cto.com/kaliarch/2122467)). Use whichever version matches your setup; CDH releases have good cross-version compatibility.
|Service | hadoop01 | hadoop02 | hadoop03
|:----------|:----------|:----------|:---------
|HDFS | NameNode | DataNode | DataNode
|HBase | HMaster, HRegionServer | HRegionServer | HRegionServer
|Hive | Hive | |
|Flume | Flume | Flume | Flume
|Kafka | Kafka | |
|YARN | ResourceManager | NodeManager | NodeManager
|Oozie | Oozie | |
|Hue | Hue | |
|Spark2 | Spark | |
|Zookeeper | Zookeeper | |
|MySQL | MySQL | |
##### 2.2 Host Specifications
```
1. Hadoop01: 4 cores, 16 GB RAM, CentOS 7.2
2. Hadoop02: 2 cores, 8 GB RAM, CentOS 7.2
3. Hadoop03: 2 cores, 8 GB RAM, CentOS 7.2
```
##### 2.3 Project Architecture
![Architecture diagram](https://github.com/pkeropen/BigData-News/blob/master/pic/Architecture.png)
##### 2.4 Install Dependencies
```
# yum -y install psmisc MySQL-python at bc bind-libs bind-utils cups-client cups-libs cyrus-sasl-gssapi cyrus-sasl-plain ed fuse fuse-libs httpd httpd-tools keyutils-libs-devel krb5-devel libcom_err-devel libselinux-devel libsepol-devel libverto-devel mailcap mailx mod_ssl openssl-devel pcre-devel postgresql-libs python-psycopg2 redhat-lsb-core redhat-lsb-submod-security spax time zlib-devel wget
# chmod +x /etc/rc.d/rc.local
# echo "echo 0 > /proc/sys/vm/swappiness" >>/etc/rc.d/rc.local
# echo "echo never > /sys/kernel/mm/transparent_hugepage/defrag" >>/etc/rc.d/rc.local
# echo 0 > /proc/sys/vm/swappiness
# echo never > /sys/kernel/mm/transparent_hugepage/defrag
# yum -y install rpcbind
# systemctl start rpcbind
# echo "systemctl start rpcbind" >> /etc/rc.d/rc.local
# Install Perl support
yum install perl*      # Perl and related packages
yum install cpan       # CPAN, needed for Perl library modules
```
### 3. Building the Data Generation Simulator
##### 3.1 Simulate the log stream that nginx would produce. The data come from the Sogou Labs user query log ([download](https://www.sogou.com/labs/resource/q.php)), a collection of web query logs covering roughly one month (June 2008) of query requests and user clicks on the Sogou search engine.
##### 3.2 Data Cleaning
The record format is: access time\tuser ID\t[query terms]\trank of the URL in the returned results\tsequence number of the user's click\tclicked URL. The user ID is assigned automatically from the cookie the browser presents to the search engine, so different queries issued from the same browser session share the same user ID.
1. Replace the tabs in the file with commas
```
cat weblog.log|tr "\t" "," > weblog2.log
```
2. Replace the spaces in the file with commas (a combined one-pass version is sketched after these steps)
```
cat weblog2.log|tr " " "," > weblog.log
```
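The two passes can also be collapsed into a single pipeline. The sketch below is a minimal equivalent of the steps above; the file names come from those steps, and the sample record in the comment is purely illustrative, not real data.
```
# One-pass equivalent of the two steps above: turn both tabs and spaces into commas.
cat weblog.log | tr "\t" "," | tr " " "," > weblog2.log
mv weblog2.log weblog.log

# After cleaning, a record has the following shape (values are illustrative only):
# 00:00:00,0123456789012345,[example query],3,1,www.example.com/news/article.html
```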
##### 3.3 Core Code
```
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public static void readFileByLines(String fileName) {
    System.out.println("Reading the file line by line:");
    // The Sogou export is GBK-encoded; the reader decodes each line into a Java String.
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(fileName), "GBK"))) {
        String line;
        int count = 0;
        while ((line = br.readLine()) != null) {
            count++;
            // Throttle output so the log grows gradually, simulating a live stream.
            Thread.sleep(300);
            System.out.println("row:" + count + ">>>>>>>>" + line);
            // writeFile / writeFileName are defined elsewhere in the class;
            // they append the line to the output log that Flume tails.
            writeFile(writeFileName, line);
        }
    } catch (IOException | InterruptedException e) {
        e.printStackTrace();
    }
}
```
##### 3.4 Package the code as weblogs.jar ([packaging steps](https://blog.csdn.net/xuemengrui12/article/details/74984731)) and write the shell script weblog-shell.sh
```
#!/bin/bash
echo "start log......"
# arg 1: source log file, arg 2: output file for the generated log
java -jar /opt/jars/weblogs.jar /opt/datas/weblog.log /opt/datas/weblog-flume.log
```
##### 3.5 Make weblog-shell.sh executable
```
chmod 777 weblog-shell.sh
```
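To keep data flowing while Flume tails the output, the simulator is typically left running in the background on the nodes whose agents tail /opt/datas/weblog-flume.log (hadoop02 and hadoop03). A minimal sketch, assuming the script is kept under /opt/shell/ like the Flume start scripts later on:
```
# Run the log simulator in the background (the /opt/shell location is an assumption):
nohup sh /opt/shell/weblog-shell.sh > /dev/null 2>&1 &
```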
### 4. Flume Data Collection Configuration
##### 4.1 The Flume agents on hadoop02 and hadoop03 collect the data and forward it to hadoop01; the Flume configuration on hadoop02 and hadoop03 is essentially identical (copying it to both nodes is sketched after the config below).
```
flume-collect-conf.properties
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type =exec
a1.sources.r1.command= tail -F /opt/datas/weblog-flume.log
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop01
a1.sinks.k1.port = 5555
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
a1.channels.c1.keep-alive = 5
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
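Since both collector nodes use the same file, it can simply be pushed out from wherever it was written. A sketch, assuming the Flume configuration directory is /opt/flume/conf (the destination path is an assumption):
```
# Distribute the collector configuration to both collector nodes
scp flume-collect-conf.properties hadoop02:/opt/flume/conf/
scp flume-collect-conf.properties hadoop03:/opt/flume/conf/
```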
##### 4.2 The Flume agent on hadoop01 receives the data sent by the hadoop02 and hadoop03 agents and delivers it to both HBase and Kafka (creating the HBase table and Kafka topic it writes to is sketched after the config). The configuration is as follows:
```
a1.sources = r1
a1.channels = kafkaC hbaseC
a1.sinks = kafkaSink hbaseSink
a1.sources.r1.type = avro
a1.sources.r1.channels = hbaseC kafkaC
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 5555
a1.sources.r1.threads = 5
#****************************flume + hbase******************************
a1.channels.hbaseC.type = memory
a1.channels.hbaseC.capacity = 10000
a1.channels.hbaseC.transactionCapacity = 10000
a1.channels.hbaseC.keep-alive = 20
a1.sinks.hbaseSink.type = asynchbase
## HBase table name
a1.sinks.hbaseSink.table = weblogs
## HBase column family
a1.sinks.hbaseSink.columnFamily = info
## Custom serializer for asynchronous writes to HBase
a1.sinks.hbaseSink.serializer = main.hbase.KfkAsyncHbaseEventSerializer
a1.sinks.hbaseSink.channel = hbaseC
## HBase column names
a1.sinks.hbaseSink.serializer.payloadColumn = datetime,userid,searchname,retorder,cliorder,cliurl
#****************************flume + kafka******************************
a1.channels.kafkaC.type = memory
a1.channels.kafkaC.capacity = 10000
a1.channels.kafkaC.transactionCapacity = 10000
a1.channels.kafkaC.keep-alive = 20
a1.sinks.kafkaSink.channel = kafkaC
a1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.kafkaSink.brokerList = hadoop01:9092
a1.sinks.kafkaSink.topic = webCount
a1.sinks.kafkaSink.zookeeperConnect = hadoop01:2181
a1.sinks.kafkaSink.requiredAcks = 1
a1.sinks.kafkaSink.batchSize = 1
a1.sinks.kafkaSink.serializer.class = kafka.serializer.StringEncoder
```
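The hbaseSink expects the weblogs table with column family info to exist, and the kafkaSink expects the webCount topic. A sketch of creating both and checking the topic with a console consumer; the exact CLI names and options depend on the Kafka/HBase distribution (CDH wrappers are assumed here), and the partition and replication counts are assumptions:
```
# Create the HBase table and column family expected by hbaseSink:
echo "create 'weblogs', 'info'" | hbase shell

# Create the Kafka topic expected by kafkaSink:
kafka-topics --create --zookeeper hadoop01:2181 \
  --replication-factor 1 --partitions 1 --topic webCount

# Sanity check once data is flowing:
kafka-console-consumer --zookeeper hadoop01:2181 --topic webCount --from-beginning
```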
##### 4.3 Flume Startup Shell Scripts
```
# flume-collect-start.sh, distributed to hadoop02 and hadoop03 under /opt/shell/
#!/bin/bash
echo "flume-collect start ......"
sh /bin/flume-ng agent --conf c
```
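A complete flume-ng invocation names the agent (a1 in both configs) and points at the properties file. The sketch below assumes the configuration files live under /opt/flume/conf/ and that the hadoop01 agent uses flume-hbase-kafka-conf.properties, the file name used in the project repository; the paths are assumptions.
```
# On hadoop02 / hadoop03: start the collector agent
flume-ng agent --conf /opt/flume/conf \
  --conf-file /opt/flume/conf/flume-collect-conf.properties \
  --name a1 -Dflume.root.logger=INFO,console

# On hadoop01: start the aggregator agent that feeds HBase and Kafka
flume-ng agent --conf /opt/flume/conf \
  --conf-file /opt/flume/conf/flume-hbase-kafka-conf.properties \
  --name a1 -Dflume.root.logger=INFO,console
```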