Design and Implementation of a Componentized Real-Time Feature Processing Platform Based on Flink

Uploaded 2023-09-10 15:55:03, 3.64 MB, DOCX

摘要 (Abstract)

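In recent years, driven by the rapid development of the Internet, cloud computing, and related technologies, every industry has been producing massive data on the scale of hundreds of millions of records each day. Big data technology makes it possible to process and analyze this data and to derive considerable value from it, so how to process and use the data efficiently has become the foremost challenge facing engineers today. Big data processing currently suffers from three main problems: 1) the technology stack is complex and hard to learn; 2) a large amount of code is written repeatedly, which lowers processing efficiency; 3) real-time processing capability is lacking, and batch computing cannot serve real-time scenarios.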

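To address these problems, we take e-commerce search and recommendation as the example scenario and build an end-to-end real-time big data processing platform. The work consists of three parts. 1) A one-stop big data processing platform that closes the loop from data collection through data processing and distributed storage to data management. The system must also be efficient, low-latency, and highly fault-tolerant so that tasks execute strictly and correctly. 2) Componentized abstraction of big data functions. Through secondary development on Flink, we redefine the JobGraph on top of Flink's computation graph, the StreamGraph, and abstract each independent big data function into a node of the JobGraph. At execution time the independent component modules are combined into a single task, which reduces unnecessary repeated development. 3) Support for real-time stream computing, together with an optimized dual-stream Join. To handle the mismatch between the rates of the left and right streams in real-time stream association, we propose a dual-stream Join and Watermark scheme that coordinates the rates of multiple real-time streams and raises the association success rate.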
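The component idea in point 2 can be illustrated with a minimal sketch, written in plain Python rather than against the Flink API. The names `Component` and `ComponentGraph` and the three example steps are hypothetical and do not come from the thesis; the sketch only shows how independently written steps might be registered as graph nodes and combined into one job at run time.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Dict, Iterable, List

Record = dict


@dataclass
class Component:
    """One reusable processing step, i.e. one node of the graph (hypothetical)."""
    name: str
    fn: Callable[[Iterable[Record]], Iterable[Record]]


@dataclass
class ComponentGraph:
    """Hypothetical stand-in for the redefined job graph: nodes plus upstream edges."""
    nodes: Dict[str, Component] = field(default_factory=dict)
    upstream: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, component: Component, after: Iterable[str] = ()) -> "ComponentGraph":
        self.nodes[component.name] = component
        self.upstream[component.name] = list(after)
        return self

    def run(self, source: Iterable[Record]) -> Dict[str, List[Record]]:
        """Execute every component once, in dependency order, as a single job."""
        outputs: Dict[str, List[Record]] = {}
        for name in TopologicalSorter(self.upstream).static_order():
            parents = self.upstream[name]
            if parents:
                # A node with upstream edges reads the concatenated parent outputs.
                records = [r for p in parents for r in outputs[p]]
            else:
                # A node without upstream edges reads the raw source.
                records = list(source)
            outputs[name] = list(self.nodes[name].fn(records))
        return outputs


# Three independent steps, each written once and reusable in other graphs.
def parse(records):
    for r in records:
        yield {"user": r["user"], "price": float(r["price"])}


def drop_free(records):
    return (r for r in records if r["price"] > 0)


def add_bucket(records):
    for r in records:
        yield {**r, "bucket": "high" if r["price"] >= 100 else "low"}


graph = (
    ComponentGraph()
    .add(Component("parse", parse))
    .add(Component("filter", drop_free), after=["parse"])
    .add(Component("bucket", add_bucket), after=["filter"])
)

raw = [
    {"user": "u1", "price": "120"},
    {"user": "u2", "price": "0"},
    {"user": "u3", "price": "15.5"},
]
result = graph.run(raw)["bucket"]
```

Because each step only agrees on the record shape, a different job can reuse `parse` or `drop_free` by wiring them into another graph instead of rewriting them.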

Keywords: big data processing, real-time computing, distributed computing, dual-stream Join, componentization

Abstract

In recent years, with the rapid development of the Internet, cloud computing, and other technologies, industries of every kind generate massive data on the scale of hundreds of millions of records every day. Big data technology allows this data to be processed and analyzed, which can yield a great deal of value. How to process and use the data efficiently has therefore become the number-one problem for today's engineers. At present, big data processing has three main problems: 1) the technology is complex and hard to get started with; 2) there is a lot of repetitive code, so processing efficiency is low; 3) real-time data processing capability is lacking, so scenarios with strict real-time requirements cannot be served.

To solve these problems, we take the e-commerce search and recommendation scenario as an example and build an end-to-end real-time big data processing platform, which mainly includes the following. 1) Building a one-stop big data processing platform that forms a closed loop covering data collection, data processing, distributed storage, and data management. At the same time, the system must be efficient, low-latency, and highly fault-tolerant to ensure that tasks are executed strictly and correctly. 2) Componentized abstraction of big data functions. Through secondary development on Flink, we redefine the JobGraph on top of Flink's StreamGraph, abstract each independent big data function into a node of the JobGraph, and combine the independent component modules into a single task when the task is executed. 3) Supporting real-time stream computing and optimizing the implementation of the dual-stream join. To solve the problem of mismatched rates between the left and right streams in real-time stream association, a dual-stream join and watermark scheme is proposed to coordinate the rates of multiple real-time streams and improve the association success rate.
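The abstract does not specify how the dual-stream join and watermark scheme works, so the following is only a generic sketch of the underlying idea, in plain Python rather than the Flink API. It assumes an event-time interval join, a watermark equal to the smaller of the two streams' highest timestamps minus an out-of-orderness allowance, and hypothetical names (`Event`, `IntervalJoiner`, `bound_ms`, `max_out_of_order_ms`). Taking the minimum means the slower stream holds the watermark back, so records from the faster stream stay buffered until their partners can still arrive.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    side: str   # "L" (left stream) or "R" (right stream)
    key: str    # join key
    ts: int     # event time in milliseconds
    value: str


class IntervalJoiner:
    """Buffers both streams by key and joins events within bound_ms of each other."""

    def __init__(self, bound_ms: int, max_out_of_order_ms: int):
        self.bound = bound_ms
        self.delay = max_out_of_order_ms
        self.buffers = {"L": defaultdict(list), "R": defaultdict(list)}
        self.max_ts = {"L": None, "R": None}

    def watermark(self) -> Optional[int]:
        """Combined watermark: driven by the slower of the two streams."""
        if None in self.max_ts.values():
            return None
        return min(self.max_ts.values()) - self.delay

    def buffered(self) -> int:
        return sum(len(v) for side in self.buffers.values() for v in side.values())

    def process(self, ev: Event) -> List[Tuple[str, str]]:
        other = "R" if ev.side == "L" else "L"
        matches = []
        for cand in self.buffers[other].get(ev.key, []):
            if abs(cand.ts - ev.ts) <= self.bound:
                left, right = (ev, cand) if ev.side == "L" else (cand, ev)
                matches.append((left.value, right.value))
        self.buffers[ev.side][ev.key].append(ev)
        prev = self.max_ts[ev.side]
        self.max_ts[ev.side] = ev.ts if prev is None else max(prev, ev.ts)
        self._evict()
        return matches

    def _evict(self) -> None:
        """Drop events that can no longer match anything at or after the watermark."""
        wm = self.watermark()
        if wm is None:
            return
        for side in self.buffers.values():
            for key in list(side):
                side[key] = [e for e in side[key] if e.ts + self.bound >= wm]
                if not side[key]:
                    del side[key]


joiner = IntervalJoiner(bound_ms=1000, max_out_of_order_ms=500)
arrivals = [
    Event("L", "k1", 1000, "click-a"),
    Event("L", "k2", 2000, "click-b"),
    Event("R", "k1", 1500, "order-a"),
    Event("L", "k3", 9000, "click-c"),   # left stream runs far ahead
    Event("R", "k2", 2500, "order-b"),   # still joins: watermark follows the right stream
    Event("R", "k9", 8000, "order-z"),   # right stream catches up, old state is evicted
]
matches = [pair for ev in arrivals for pair in joiner.process(ev)]
```

In this run `click-b` survives the jump of the left stream to `ts=9000` because the right stream has only reached `ts=1500`; a watermark based on the left stream alone would have evicted it before `order-b` arrived.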

Keywords: big data processing, real-time computing, distributed computing, dual-stream join, componentization
