An Analysis of the Stage Division Algorithm

Uploaded: 2019-10-15 13:51:27

Let's start with a demo and follow the code step by step to see what happens inside.

    import org.apache.spark.{SparkConf, SparkContext}

    // The code below builds 8 RDDs and 4 stages:
    // ShuffleMapStage0, ShuffleMapStage1, ShuffleMapStage2, ResultStage3
    object UserViewCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setAppName("UserViewCount").setMaster("local[2]")
        val sc = new SparkContext(conf)
        // Backslashes must be escaped inside a Scala string literal
        val path = "C:\\empcl\\data\\1.txt"

        val lineRDD = sc.textFile(path) // rdd1
        val dateRDD = lineRDD.map(line => { // rdd2
          val date = line.split(" ")(0)
          (date, 1)
        })

        val reduceRDD = dateRDD.reduceByKey(_ + _) // rdd3
        val addOneRDD = reduceRDD.map(x => (x._1, x._2 + 1)) // rdd4
        val groupRDD = addOneRDD.groupByKey() // rdd5
          .map(x => { // rdd6
            val key = x._1
            (key, 1)
          })
        val resRDD = groupRDD.reduceByKey(_ + _) // rdd7
        resRDD
          .map(x => x) // rdd8
          .collect().foreach(println)
        sc.stop()
      }
    }
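To make the "8 RDDs, 4 stages" count in the demo's comment concrete, here is a minimal sketch that models the lineage as a plain list. `Step`, `lineage` and `stageCount` are hypothetical names made up for this illustration and are not part of Spark. The sketch assumes, as the demo's comment does, that each `reduceByKey` and `groupByKey` here really introduces a shuffle; in a real job Spark can skip a shuffle when the data is already partitioned appropriately.

```scala
object StageCountSketch {
  // Hypothetical toy model: one entry per RDD in the demo, flagged when
  // the transformation that produced it is assumed to need a shuffle.
  final case class Step(name: String, shuffle: Boolean)

  val lineage: List[Step] = List(
    Step("textFile", shuffle = false),    // rdd1
    Step("map", shuffle = false),         // rdd2
    Step("reduceByKey", shuffle = true),  // rdd3
    Step("map", shuffle = false),         // rdd4
    Step("groupByKey", shuffle = true),   // rdd5
    Step("map", shuffle = false),         // rdd6
    Step("reduceByKey", shuffle = true),  // rdd7
    Step("map", shuffle = false)          // rdd8
  )

  // In a linear lineage every shuffle closes one ShuffleMapStage;
  // whatever follows the last shuffle forms the ResultStage.
  def stageCount(steps: List[Step]): Int = steps.count(_.shuffle) + 1

  def main(args: Array[String]): Unit =
    println(s"RDDs: ${lineage.size}, stages: ${stageCount(lineage)}")
}
```

The three shuffle steps give three ShuffleMapStages, and the final `map` plus `collect` run in the ResultStage.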

Calling the action operator collect triggers the execution of a job. Let's follow the code and see.

1 defcollect():Array[T]=withScope{

2 valresults=sc.runJob(this,(iter:Iterator[T])=>iter.toArray)

3 Array.concat(results:_*)

4 }

5 //RunajobonallpartitionsinanRDDandreturntheresultsinanarr

ay.

6 defrunJob[T,U:ClassTag](rdd:RDD[T],func:Iterator[T]=>U):Array[U]

={

7 runJob(rdd,func,0untilrdd.partitions.length)

8 }

9 ...

10 defrunJob[T,U:ClassTag](...):Unit={

11 ...

12 dagScheduler.runJob(rdd,cleanedFunc,partitions,callSite,resultHandl

er,localProperties.get)

13 ...

14 }

16 //dagScheduler.runJob(...)

17 defrunJob[T,U](...):Unit={

18 ...

19 valwaiter=submitJob(rdd,func,partitions,callSite,resultHandler,

properties)

20 ...

21 }

23 defsubmitJob[T,U](...):JobWaiter[U]={

24 valwaiter=newJobWaiter(this,jobId,partitions.size,resultHandler)

25 //eventProcessLoop实例在构造DAGScheduler对象的时候创建，

26 //并且调用eventProcessLoop.start()方法以启动线程不断的读取eventQueue中的数

据

27 eventProcessLoop.post(JobSubmitted(

28 jobId,rdd,func2,partitions.toArray,callSite,waiter,

29 SerializationUtils.clone(properties)))

30 waiter

31 }

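From the code above we can see that collect calls a chain of runJob(...) overloads, the last of which calls DAGScheduler's runJob(...). That method in turn calls submitJob(...), which wraps the job information (jobId, the RDD, the function to run, the partitions and so on) in a JobSubmitted object and puts it on the blocking queue eventQueue, where the event thread picks it up later.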
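The shape of submitJob(...), building a waiter, posting an event and returning at once, can be sketched outside Spark as follows. This is only an illustration under stated assumptions: `Waiter`, `Scheduler`, `markDone` and `isDone` are hypothetical stand-ins invented here, and the `JobSubmitted` case class is a cut-down toy, not Spark's event of the same name.

```scala
import java.util.concurrent.LinkedBlockingDeque
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.Promise

object SubmitJobSketch {
  // Hypothetical stand-in for a job waiter: the caller holds it while
  // some other thread eventually marks the job as done.
  final class Waiter {
    private val promise = Promise[Unit]()
    def markDone(): Unit = { promise.trySuccess(()); () }
    def isDone: Boolean = promise.isCompleted
  }

  // Toy event carrying the job description plus the waiter to notify.
  final case class JobSubmitted(jobId: Int, partitions: Seq[Int], waiter: Waiter)

  final class Scheduler {
    private val nextJobId = new AtomicInteger(0)
    val eventQueue = new LinkedBlockingDeque[JobSubmitted]()

    // Does no work itself: it only enqueues the event and hands back the waiter.
    def submitJob(partitions: Seq[Int]): Waiter = {
      val waiter = new Waiter
      eventQueue.put(JobSubmitted(nextJobId.getAndIncrement(), partitions, waiter))
      waiter
    }
  }

  def main(args: Array[String]): Unit = {
    val scheduler = new Scheduler
    val waiter = scheduler.submitJob(Seq(0, 1))
    println(s"done right after submit? ${waiter.isDone}")
    // Here the main thread plays the role of the event thread.
    scheduler.eventQueue.take().waiter.markDone()
    println(s"done after the event was handled? ${waiter.isDone}")
  }
}
```

The point of the design is that the submitting thread never runs scheduling logic; it only produces an event and keeps a handle it can wait on.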

Let's first look at the type of eventProcessLoop, DAGSchedulerEventProcessLoop. It extends the abstract class EventLoop and overrides some of its methods.

1 private[spark]abstractclassEventLoop[E](name:String)extendsLogging

{

3 //阻塞队列用于保存相关对象，以供线程从队列中取出对象执行

4 privatevaleventQueue:BlockingQueue[E]=newLinkedBlockingDeque[E]()

6 //不断的从阻塞队列中取数据，并且交由onReceive()进行执行。

7 privatevaleventThread=newThread(name){

8 setDaemon(true)

9 overridedefrun():Unit={

10 while(!stopped.get){

11 valevent=eventQueue.take()

12 onReceive(event)

13 }

14 }

15 }

17 overridedefonReceive(event:DAGSchedulerEvent):Unit={

18 doOnReceive(event)

19 }

21 //doOnReceive(...)内部使用的是模式匹配，选择合适的方法进行执行

22 privatedefdoOnReceive(event:DAGSchedulerEvent):Unit=eventmatch{

23 caseJobSubmitted(jobId,rdd,func,partitions,callSite,listener,pro

perties)=>

24 dagScheduler.handleJobSubmitted(jobId,rdd,func,partitions,callSite,

listener,properties)

26 casecompletion:CompletionEvent=>

27 dagScheduler.handleTaskCompletion(completion)

28 }

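From the code above we can see that once the JobSubmitted event has been placed on the blocking queue, the daemon thread in eventProcessLoop takes it off the queue and calls DAGScheduler's handleJobSubmitted() method to process it.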
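The queue-plus-daemon-thread pattern described above can be tried in isolation. The sketch below is a simplified imitation, not Spark's EventLoop: `SimpleEventLoop`, `RecordingLoop` and the two toy event types are hypothetical names, and the interrupt-based `stop()` is one possible way to unblock a pending `take()`.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingDeque, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

object EventLoopSketch {
  sealed trait Event
  final case class JobSubmitted(jobId: Int) extends Event
  final case class Completion(taskId: Int) extends Event

  // Hypothetical minimal event loop: one daemon thread drains a blocking queue.
  abstract class SimpleEventLoop[E](name: String) {
    private val queue = new LinkedBlockingDeque[E]()
    private val stopped = new AtomicBoolean(false)
    private val thread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit =
        try {
          while (!stopped.get) onReceive(queue.take())
        } catch {
          case _: InterruptedException => () // stop() interrupts a blocked take()
        }
    }
    def start(): Unit = thread.start()
    def stop(): Unit = if (stopped.compareAndSet(false, true)) thread.interrupt()
    def post(event: E): Unit = queue.put(event)
    protected def onReceive(event: E): Unit
  }

  // Dispatches by pattern matching and records what it handled.
  final class RecordingLoop(expected: Int) extends SimpleEventLoop[Event]("sketch-event-loop") {
    private var log = Vector.empty[String]
    val allHandled = new CountDownLatch(expected)
    def handled: List[String] = log.toList
    override protected def onReceive(event: Event): Unit = {
      log :+= (event match {
        case JobSubmitted(id) => s"job $id submitted"
        case Completion(id)   => s"task $id completed"
      })
      allHandled.countDown()
    }
  }

  def run(): List[String] = {
    val loop = new RecordingLoop(expected = 2)
    loop.start()
    loop.post(JobSubmitted(0))
    loop.post(Completion(7))
    loop.allHandled.await(5, TimeUnit.SECONDS) // wait until both events are handled
    loop.stop()
    loop.handled
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

Because a single thread consumes the queue in FIFO order, handlers never run concurrently with each other, which is why the handler side needs no locking of its own in this sketch.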

1 private[scheduler]defhandleJobSubmitted(...){

2 varfinalStage:ResultStage=null

3 //首先根据触发操作的最后一个RDD创建一个ResultStage。

4 finalStage=createResultStage(finalRDD,func,partitions,jobId,callSi

te)

5 ...

6 submitStage(finalStage)

7 }

From the code above we can see that finalStage is created first and is then passed as the argument to submitStage().
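Creating the final stage involves resolving its parent stages before the stage itself is built. A toy sketch of that ancestors-first order, assuming each stage takes its id only after its parents exist, shows why the demo ends up with ShuffleMapStage0 through ShuffleMapStage2 and ResultStage3. `Rdd`, `Stage`, `Builder` and `chain` are hypothetical names for this illustration; the sketch handles only a linear lineage and is not Spark's implementation.

```scala
object StageIdSketch {
  // Hypothetical toy RDD: at most one parent; viaShuffle marks a shuffle dependency on it.
  final case class Rdd(name: String, parent: Option[Rdd] = None, viaShuffle: Boolean = false)
  final case class Stage(id: Int, kind: String, lastRdd: String, parents: List[Stage])

  final class Builder {
    private var nextStageId = 0
    private def newId(): Int = { val id = nextStageId; nextStageId += 1; id }

    // Walk back over narrow dependencies until the nearest shuffle boundary, if any.
    private def parentStages(rdd: Rdd): List[Stage] = rdd.parent match {
      case Some(p) if rdd.viaShuffle => List(shuffleMapStage(p))
      case Some(p)                   => parentStages(p)
      case None                      => Nil
    }

    private def shuffleMapStage(rdd: Rdd): Stage = {
      val parents = parentStages(rdd) // parents are created first ...
      Stage(newId(), "ShuffleMapStage", rdd.name, parents) // ... then this stage gets its id
    }

    def resultStage(finalRdd: Rdd): Stage = {
      val parents = parentStages(finalRdd)
      Stage(newId(), "ResultStage", finalRdd.name, parents)
    }
  }

  // The demo's lineage, rdd1 .. rdd8.
  val rdd1 = Rdd("rdd1")
  val rdd2 = Rdd("rdd2", Some(rdd1))
  val rdd3 = Rdd("rdd3", Some(rdd2), viaShuffle = true)
  val rdd4 = Rdd("rdd4", Some(rdd3))
  val rdd5 = Rdd("rdd5", Some(rdd4), viaShuffle = true)
  val rdd6 = Rdd("rdd6", Some(rdd5))
  val rdd7 = Rdd("rdd7", Some(rdd6), viaShuffle = true)
  val rdd8 = Rdd("rdd8", Some(rdd7))

  // Labels from the final stage back to the earliest one.
  def chain(stage: Stage): List[String] =
    s"${stage.kind}${stage.id}" :: stage.parents.flatMap(chain)

  def main(args: Array[String]): Unit =
    chain(new Builder().resultStage(rdd8)).foreach(println)
}
```

In this model the earliest stage is numbered first, so the final stage always carries the highest id.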

    // As the code below shows, createResultStage first obtains the parent stages of
    // the final stage, passes them to the ResultStage constructor, and returns the
    // newly created ResultStage
    private def createResultStage(...): ResultStage = {
      val parents = getOrCreateParentStages(rdd, jobId)
      val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
      stageIdToStage(id) = stage
      updateJobIdStageIdMaps(jobId, stage)
      stage
    }

    // getShuffleDependencies(rdd) => uses a Stack to find the shuffle dependencies of the final stage's RDD
    // getOrCreateShuffleMapStage(shuffleDep, jobId) => gets the ShuffleMapStage for that dependency, creating it if needed
    private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
      getShuffleDependencies(rdd).map { shuffleDep =>
        getOrCreateShuffleMapStage(shuffleDep, firstJobId)
      }.toList
    }

    /*
     * Returns shuffle dependencies that are immediate parents of the given RDD.
     *
     * This function will not return more distant ancestors. For example, if C has a shuffle
     * dependency on B which has a shuffle dependency on A:
     *
     * A <-- B <-- C
     *
     * calling this function with rdd C will only return the B <-- C dependency.
     *
     * This function is scheduler-visible for the purpose of unit testing.
     */
    private[scheduler] def getShuffleDependencies(
        rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
      val parents = new HashSet[ShuffleDependency[_, _, _]]
      val visited = new HashSet[RDD[_]]
      val waitingForVisit = new Stack[RDD[_]]
      waitingForVisit.push(rdd)
      while (waitingForVisit.nonEmpty) {
        val toVisit = waitingForVisit.pop()

Uploaded by: java1234_小锋
