A Rookie's Guide to the Spark Source Code - 4: DAGScheduler - Part 3
In the previous post (DAGScheduler - part 2) we looked at how the DAGScheduler submits and manages jobs. Next we examine an important component inside DAGScheduler: DAGSchedulerEventProcessLoop, the component that handles its messages. As we saw earlier, job and task submission and management mostly work by calling this component's post method to send an event.
Let's first look at its definition and initialization inside DAGScheduler:
private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
Looking at the definition of DAGSchedulerEventProcessLoop: it lives in the same file as DAGScheduler and takes a DAGScheduler as a constructor parameter.
Its structure is quite compact, consisting mainly of the onReceive/doOnReceive handlers shown below.
Looking at EventLoop, the base class it extends, its purpose is clear at a glance:
it receives events posted by callers and processes them all on a single, dedicated thread.
The event queue can in principle grow without bound, so subclasses must make sure events are handled promptly to avoid a potential OOM.
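To make the mechanics concrete, here is a minimal sketch of the producer/consumer pattern EventLoop implements. This is an illustrative simplification, not Spark's actual class; the real EventLoop also handles stop semantics and error callbacks, though it does use a LinkedBlockingDeque in the same way.

import java.util.concurrent.LinkedBlockingDeque
import java.util.concurrent.atomic.AtomicBoolean

// Minimal sketch of the EventLoop pattern: post() enqueues an event and
// returns immediately; a single daemon thread dequeues and dispatches.
abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped.get) {
          onReceive(eventQueue.take())  // blocks until an event is available
        }
      } catch {
        case _: InterruptedException => // stop() interrupts the thread; exit quietly
      }
    }
  }

  def start(): Unit = { onStart(); eventThread.start() }
  def stop(): Unit = { stopped.set(true); eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event)

  protected def onStart(): Unit = {}
  protected def onReceive(event: E): Unit  // subclasses must drain events promptly
}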
Now look at onReceive, the entry point for event handling; it receives each event
and simply delegates to doOnReceive():
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

  case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
    dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)

  case StageCancelled(stageId, reason) =>
    dagScheduler.handleStageCancellation(stageId, reason)

  case JobCancelled(jobId, reason) =>
    dagScheduler.handleJobCancellation(jobId, reason)

  case JobGroupCancelled(groupId) =>
    dagScheduler.handleJobGroupCancelled(groupId)

  case AllJobsCancelled =>
    dagScheduler.doCancelAllJobs()

  case ExecutorAdded(execId, host) =>
    dagScheduler.handleExecutorAdded(execId, host)

  case ExecutorLost(execId, reason) =>
    val workerLost = reason match {
      case SlaveLost(_, true) => true
      case _ => false
    }
    dagScheduler.handleExecutorLost(execId, workerLost)

  case WorkerRemoved(workerId, host, message) =>
    dagScheduler.handleWorkerRemoved(workerId, host, message)

  case BeginEvent(task, taskInfo) =>
    dagScheduler.handleBeginEvent(task, taskInfo)

  case SpeculativeTaskSubmitted(task) =>
    dagScheduler.handleSpeculativeTaskSubmitted(task)

  case GettingResultEvent(taskInfo) =>
    dagScheduler.handleGetTaskResult(taskInfo)

  case completion: CompletionEvent =>
    dagScheduler.handleTaskCompletion(completion)

  case TaskSetFailed(taskSet, reason, exception) =>
    dagScheduler.handleTaskSetFailed(taskSet, reason, exception)

  case ResubmitFailedStages =>
    dagScheduler.resubmitFailedStages()
}
This method handles every kind of event: it pattern-matches on the event type and dispatches to the corresponding handler.
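The producer side of this loop is equally simple: DAGScheduler's public API methods just post an event and return. Job cancellation, for instance, looks roughly like this in Spark 2.x (quoted from memory, so minor details may differ between versions):

def cancelJob(jobId: Int, reason: Option[String]): Unit = {
  logInfo("Asked to cancel job " + jobId)
  eventProcessLoop.post(JobCancelled(jobId, reason))
}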
def start(): Unit = {
  if (stopped.get) {
    throw new IllegalStateException(name + " has already been stopped")
  }
  // Call onStart before starting the event thread to make sure it happens before onReceive
  onStart()
  eventThread.start()
}
DAGScheduler invokes this start method at the end of its own initialization to kick off event processing.
start() first calls onStart() to let the handler know that event processing is about to begin, and then launches the dedicated thread that processes events.
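For reference, the call site is the last statement in the DAGScheduler class body and is a single line:

eventProcessLoop.start()  // last statement in DAGScheduler's constructor body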
Now let's return to doOnReceive() and see how individual events are handled, taking job submission as the example:
case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
  dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
Here dagScheduler.handleJobSubmitted is invoked to handle the job-submission event:
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  // Build the final (result) stage
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }

  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))

  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  submitStage(finalStage)
}
Here the stage graph is built backwards, starting from the final stage and working up through its parent stages; the main entry point is the createResultStage method.
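To make this concrete, consider a hypothetical driver program (assuming an existing SparkContext named sc): one shuffle dependency in the lineage means exactly two stages.

// Hypothetical word-count job: one ShuffleDependency => two stages.
val counts = sc.textFile("hdfs:///tmp/words.txt")
  .flatMap(_.split(" "))   // narrow dependency: stays in the same stage
  .map(word => (word, 1))  // narrow dependency: stays in the same stage
  .reduceByKey(_ + _)      // ShuffleDependency: cuts a stage boundary here

counts.count()  // action: posts JobSubmitted; createResultStage then builds
                // the ResultStage with one parent ShuffleMapStage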
/**
 * Create a ResultStage associated with the provided jobId.
 */
private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  // Get or create the parent stages
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
  // Update the corresponding bookkeeping structures
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}
Next is the getOrCreateParentStages step:
/**
 * Get or create the list of parent stages for a given RDD. The new Stages will be created with
 * the provided firstJobId.
 */
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  getShuffleDependencies(rdd).map { shuffleDep =>
    getOrCreateShuffleMapStage(shuffleDep, firstJobId)
  }.toList
}
The core work here is resolving shuffle dependencies: shuffle boundaries are exactly where stages are divided.
/**
 * Returns shuffle dependencies that are immediate parents of the given RDD.
 * (Note: only the directly attached parent shuffle dependencies are returned.)
 *
 * This function will not return more distant ancestors. For example, if C has a shuffle
 * dependency on B which has a shuffle dependency on A:
 *
 * A <-- B <-- C
 *
 * calling this function with rdd C will only return the B <-- C dependency.
 *
 * This function is scheduler-visible for the purpose of unit testing.
 */
private[scheduler] def getShuffleDependencies(
    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
  val parents = new HashSet[ShuffleDependency[_, _, _]]
  val visited = new HashSet[RDD[_]]
  val waitingForVisit = new ArrayStack[RDD[_]]
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      toVisit.dependencies.foreach {
        case shuffleDep: ShuffleDependency[_, _, _] =>
          parents += shuffleDep
        case dependency =>
          waitingForVisit.push(dependency.rdd)
      }
    }
  }
  parents
}
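To tie this back to the A <-- B <-- C diagram in the scaladoc, here is a hypothetical lineage (again assuming a SparkContext named sc) and what the traversal would find:

// Hypothetical lineage mirroring A <-- B <-- C:
val a = sc.parallelize(1 to 100).map(x => (x % 10, x))
val b = a.groupByKey()                               // B: shuffle dependency on A
val c = b.mapValues(_.sum).map(_.swap).groupByKey()  // C: shuffle dependency on B

// Conceptually, getShuffleDependencies(c) returns only C's dependency on B;
// the A <-- B dependency is discovered later, when B's own stage is built.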
For each parent shuffle dependency, a stage is obtained or created:
/**
 * Gets a shuffle map stage if one exists in shuffleIdToMapStage. Otherwise, if the
 * shuffle map stage doesn't already exist, this method will create the shuffle map stage in
 * addition to any missing ancestor shuffle map stages.
 * (If a stage for this shuffle is already registered in shuffleIdToMapStage, it is
 * returned; otherwise a new one is created.)
 */
private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
    case Some(stage) =>
      stage

    case None =>
      // Create stages for all missing ancestor shuffle dependencies.
      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        // Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
        // that were not already in shuffleIdToMapStage, it's possible that by the time we
        // get to a particular dependency in the foreach loop, it's been added to
        // shuffleIdToMapStage by the stage creation process for an earlier dependency. See
        // SPARK-13902 for more information.
        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
          createShuffleMapStage(dep, firstJobId)
        }
      }

      // Finally, create a stage for the given shuffle dependency.
      createShuffleMapStage(shuffleDep, firstJobId)
  }
}
It first looks for ancestor shuffle dependencies whose stages have not been registered yet:
/** Find ancestor shuffle dependencies that are not registered in shuffleToMapStage yet */
private def getMissingAncestorShuffleDependencies(
    rdd: RDD[_]): ArrayStack[ShuffleDependency[_, _, _]] = {
  val ancestors = new ArrayStack[ShuffleDependency[_, _, _]]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new ArrayStack[RDD[_]]
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      // Back to dependency resolution again -- this method and
      // getShuffleDependencies effectively call each other in a loop
      getShuffleDependencies(toVisit).foreach { shuffleDep =>
        if (!shuffleIdToMapStage.contains(shuffleDep.shuffleId)) {
          ancestors.push(shuffleDep)
          waitingForVisit.push(shuffleDep.rdd)
        }
        // Otherwise, the dependency and its ancestors have already been registered.
      }
    }
  }
  ancestors
}
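A hypothetical walk-through of these two methods cooperating, for the A <-- B <-- C chain with nothing registered yet in shuffleIdToMapStage:

// getOrCreateShuffleMapStage(dep(B <-- C)) misses the cache, so it calls
// getMissingAncestorShuffleDependencies(B.rdd), which returns dep(A <-- B).
// createShuffleMapStage(dep(A <-- B)) therefore runs first and registers
// its stage; then createShuffleMapStage(dep(B <-- C)) runs, and its own
// getOrCreateParentStages(B.rdd) call now finds that stage in the cache.
// Ancestor stages thus always exist before the stages that depend on them.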
Then it creates the shuffle map stage:
/**
 * Creates a ShuffleMapStage that generates the given shuffle dependency's partitions. If a
 * previously run stage generated the same shuffle data, this function will copy the output
 * locations that are still available from the previous shuffle to avoid unnecessarily
 * regenerating data.
 */
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length
  // This kicks off the parent-stage construction process again
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ShuffleMapStage(
    id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)

  stageIdToStage(id) = stage
  shuffleIdToMapStage(shuffleDep.shuffleId) = stage
  updateJobIdStageIdMaps(jobId, stage)

  if (!mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // Kind of ugly: need to register RDDs with the cache and map output tracker here
    // since we can't do it in the RDD constructor because # of partitions is unknown
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}
Along the way it registers the shuffle with the MapOutputTracker and updates the scheduler's bookkeeping state.
At this point the whole process of job submission and stage resolution/creation should be fairly clear.
Finally, let's recap:
Internally, all of DAGScheduler's work is triggered by passing events.
When it is created, DAGScheduler starts the event loop, and a dedicated event thread handles all the events.
Submitting a job (or anything else) just sends an event to that loop; in essence it calls eventProcessLoop.post to append an event to the event queue. When the event thread dequeues an event, it calls onReceive(), which pattern-matches the event in doOnReceive() and dispatches it to the appropriate handler; the concrete handling then varies by event type.
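Putting it all together, the call chain for a single action looks roughly like this (a simplified trace, not literal source):

rdd.count()                                        // driver-side action
//   -> SparkContext.runJob(...)
//   -> dagScheduler.submitJob(...)                // returns a JobWaiter
//   -> eventProcessLoop.post(JobSubmitted(...))   // enqueue and return
// ...later, on the event thread:
//   -> onReceive -> doOnReceive                   // pattern match on the event
//   -> handleJobSubmitted(...)                    // build the stage graph
//   -> createResultStage -> getOrCreateParentStages -> createShuffleMapStage
//   -> submitStage(finalStage)                    // run stages as parents complete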
This wraps up our exploration of DAGScheduler's message handling, and with it our look at DAGScheduler's internal structure and mechanics for now. Next time we will shift our attention to the execution side: the Executor.