An analysis of how Spark tasks run on the executor side

CoarseGrainedExecutorBackend

In the previous article we analyzed how a job is submitted, strictly speaking the driver-side part of it. After a job is submitted, the DAGScheduler splits it into multiple stages according to the shuffle dependencies and submits the stages one by one; for each stage it creates as many Tasks as the stage has partitions, wraps them into a task set and hands it to TaskSchedulerImpl for assignment. TaskSchedulerImpl assigns tasks in a round-robin fashion over the compute resources (executors) offered by the SchedulerBackend, taking task locality, blacklists and the scheduling order of the pools into account, and wraps each task-to-executor assignment in a TaskDescription which it returns to the SchedulerBackend. The SchedulerBackend then serializes each task again according to the TaskDescriptions it received and sends it to the corresponding executor for execution. In this article we analyze how a task runs on the executor.

Task execution entry point: Executor.launchTask

First, recall that CoarseGrainedExecutorBackend is the executor implementation class used in YARN mode, and that it is an RPC server. Starting from the message sent by the RPC client, i.e. CoarseGrainedSchedulerBackend, we can find the method on the server side that handles that message and follow the trail to the entry point of task execution. From the previous article we know that when dispatching a task, CoarseGrainedSchedulerBackend sends a LaunchTask message. Looking at CoarseGrainedExecutorBackend.receive, the handling of the LaunchTask message is as follows:

case LaunchTask(data) =>
  if (executor == null) {
    exitExecutor(1, "Received LaunchTask command but executor was null")
  } else {
    val taskDesc = TaskDescription.decode(data.value)
    logInfo("Got assigned task " + taskDesc.taskId)
    executor.launchTask(this, taskDesc)
  }

As we can see, the task is actually handed over to the internal executor object, which carries almost all of the executor-side logic. CoarseGrainedExecutorBackend can be thought of as nothing more than the endpoint in Spark's RPC framework that relays messages, while the actual task execution logic lives in the Executor object.
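The relay role is just as visible in the opposite direction: when the Executor reports a task state change, CoarseGrainedExecutorBackend simply wraps it into a StatusUpdate message and forwards it to the driver endpoint. A simplified sketch, abridged from memory of the Spark 2.x source (not a verbatim quote):

override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer): Unit = {
  // wrap the executor-side state change into an RPC message
  val msg = StatusUpdate(executorId, taskId, state, data)
  driver match {
    // forward it to the driver endpoint if we are connected
    case Some(driverRef) => driverRef.send(msg)
    case None => logWarning(s"Drop $msg because has not yet connected to driver")
  }
}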

Executor overview

Let us first look at the class comment of Executor:

/**
 * Spark executor, backed by a threadpool to run tasks.
 *
 * This can be used with Mesos, YARN, and the standalone scheduler.
 * An internal RPC interface is used for communication with the driver,
 * except in the case of Mesos fine-grained mode.
 */

Executor runs tasks on an internal thread pool, and the Mesos, YARN and standalone modes all use this class for the task-running logic. In addition, the Executor object holds a reference to SparkEnv, through which it can use Spark's infrastructure, including RPC references.
As before, we follow the code of this class along the thread of running a task.

Executor.launchTask

def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
  val tr = new TaskRunner(context, taskDescription)
  runningTasks.put(taskDescription.taskId, tr)
  threadPool.execute(tr)
}

This code speaks for itself: a TaskRunner is created for the task and submitted to the thread pool, so the class we look at next is TaskRunner.
It also shows that on the executor side, each task runs in its own thread.
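For context, the threadPool that launchTask submits to is an unbounded cached pool of daemon threads, so an idle executor keeps no worker threads running. A simplified sketch of how it is built, abridged from the Spark source (the UninterruptibleThread wrapper used in newer versions is omitted here, so treat this as an approximation rather than the exact definition):

private val threadPool = {
  // Guava ThreadFactoryBuilder: daemon threads with a recognizable name pattern
  val threadFactory = new com.google.common.util.concurrent.ThreadFactoryBuilder()
    .setDaemon(true)
    .setNameFormat("Executor task launch worker-%d")
    .build()
  // cached pool: threads are created on demand and reclaimed when idle
  java.util.concurrent.Executors
    .newCachedThreadPool(threadFactory)
    .asInstanceOf[java.util.concurrent.ThreadPoolExecutor]
}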

TaskRunner.run

This method is very long and takes some patience to read through.
I will skip the various statistics it maintains, such as task run time, CPU time and GC time. One technique worth remembering is the MXBean: both CPU time and GC time are obtained from the JVM's built-in MXBeans, with ManagementFactory as the entry class. The details are worth a closer look, but I will not expand on them here.
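As an aside, this MXBean pattern is plain JMX and can be reproduced outside Spark. A minimal standalone example of reading the current thread's CPU time and the JVM's total GC time through ManagementFactory (this mirrors what the timing code below and computeTotalGcTime do, but is not Spark code):

import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

object MxBeanDemo {
  def main(args: Array[String]): Unit = {
    val threadMXBean = ManagementFactory.getThreadMXBean
    // CPU time (nanoseconds) consumed by the current thread, if the JVM supports it
    val cpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime
    } else 0L
    // total GC time (milliseconds) accumulated across all garbage collectors
    val gcTime = ManagementFactory.getGarbageCollectorMXBeans.asScala
      .map(_.getCollectionTime)
      .sum
    println(s"cpuTime=$cpuTime ns, gcTime=$gcTime ms")
  }
}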

To summarize, the main steps of this method are:

  • First send a status-update message to the driver, notifying it that this task is now in the RUNNING state.
  • Set task properties and update the dependent files and jars, adding new jars to the class loader's search path; note that all of this information arrives from the driver along with the TaskDescription.
  • Deserialize the task into a Task object; depending on the task type this is either a ShuffleMapTask or a ResultTask.
  • Check whether the task has been killed and throw an exception if so (the driver may send a kill-task message at any time).
  • Call Task.run to execute the task's actual logic.
  • After the task finishes, clean up memory and block locks that were not properly released, logging a resource-leak warning or throwing an exception where required.
  • Check once more whether the task has been killed.
  • Serialize the task's result data.
  • Update some task statistics (accumulators) as well as the related counters in the metrics system.
  • Collect all accumulators associated with the task (both the built-in metric accumulators and user-registered ones).
  • Wrap the accumulator data and the result data into one object and serialize it again.
  • Check the serialized size against two thresholds, maxResultSize and maxDirectResultSize. If it exceeds maxResultSize the result is simply dropped: nothing is written to the BlockManager, so when the driver later tries to fetch the data remotely through the BlockManager it finds nothing and knows the task's result was too large and failed. If the size only exceeds maxDirectResultSize, the result is written through the BlockManager to local memory and disk, and the block information is sent to the driver, which then pulls the data from this node. If the size is below maxDirectResultSize, the result is sent to the driver directly over the RPC interface.
  • Finally, the various kinds of task failures and exceptions are handled.

override def run(): Unit = {
  threadId = Thread.currentThread.getId
  Thread.currentThread.setName(threadName)
  // MXBean for monitoring the current thread
  val threadMXBean = ManagementFactory.getThreadMXBean
  // memory manager for this task
  val taskMemoryManager = new TaskMemoryManager(env.memoryManager, taskId)
  // record the deserialization start time; recall that the deserialization time
  // shown in the Spark UI is the value measured here
  val deserializeStartTime = System.currentTimeMillis()
  // CPU time spent on deserialization
  val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime
  } else 0L
  Thread.currentThread.setContextClassLoader(replClassLoader)
  val ser = env.closureSerializer.newInstance()
  logInfo(s"Running $taskName (TID $taskId)")
  // send a task status update to the driver through the executor backend
  execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
  var taskStart: Long = 0
  var taskStartCpu: Long = 0
  // total GC time so far, again obtained via an MXBean
  startGCTime = computeTotalGcTime()

  try {
    // Must be set before updateDependencies() is called, in case fetching dependencies
    // requires access to properties contained within (e.g. for access control).
    Executor.taskDeserializationProps.set(taskDescription.properties)

    // update the dependent files and jars: fetch them from the driver and cache them locally
    updateDependencies(taskDescription.addedFiles, taskDescription.addedJars)
    // deserialize the task itself (no separate timing is recorded at this point)
    task = ser.deserialize[Task[Any]](
      taskDescription.serializedTask, Thread.currentThread.getContextClassLoader)
    // the property set is also shipped from the driver along with the TaskDescription
    task.localProperties = taskDescription.properties
    // set the memory manager
    task.setTaskMemoryManager(taskMemoryManager)

    // If this task has been killed before we deserialized it, let's quit now. Otherwise,
    // continue executing the task.
    // check whether the task has been killed
    val killReason = reasonIfKilled
    if (killReason.isDefined) {
      // Throw an exception rather than returning, because returning within a try{} block
      // causes a NonLocalReturnControl exception to be thrown. The NonLocalReturnControl
      // exception will be caught by the catch block, leading to an incorrect ExceptionFailure
      // for the task.
      throw new TaskKilledException(killReason.get)
    }

    // The purpose of updating the epoch here is to invalidate executor map output status cache
    // in case FetchFailures have occurred. In local mode `env.mapOutputTracker` will be
    // MapOutputTrackerMaster and its cache invalidation is not based on epoch numbers so
    // we don't need to make any special calls here.
    if (!isLocal) {
      logDebug("Task " + taskId + "'s epoch is " + task.epoch)
      // update the epoch and the map output status
      env.mapOutputTracker.asInstanceOf[MapOutputTrackerWorker].updateEpoch(task.epoch)
    }

    // Run the actual task and measure its runtime.
    taskStart = System.currentTimeMillis()
    // CPU time of the current thread at task start
    taskStartCpu = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime
    } else 0L
    var threwException = true
    val value = try {
      // run the task by calling Task.run
      val res = task.run(
        // task id
        taskAttemptId = taskId,
        // attempt number of this task
        attemptNumber = taskDescription.attemptNumber,
        // metrics system
        metricsSystem = env.metricsSystem)
      threwException = false
      res
    } finally {
      // release all locks held by this task: the read/write locks on its blocks
      val releasedLocks = env.blockManager.releaseAllLocksForTask(taskId)
      // free all memory allocated to this task
      val freedMemory = taskMemoryManager.cleanupAllAllocatedMemory()

      // threwException == false means the task finished normally.
      // If the task finished normally but memory could still be freed here,
      // the task did not release the memory it used, i.e. a memory leak occurred.
      if (freedMemory > 0 && !threwException) {
        val errMsg = s"Managed memory leak detected; size = $freedMemory bytes, TID = $taskId"
        if (conf.getBoolean("spark.unsafe.exceptionOnMemoryLeak", false)) {
          throw new SparkException(errMsg)
        } else {
          logWarning(errMsg)
        }
      }

      // The check for lock leaks follows the same logic as the memory-leak check above.
      // The Spark authors take the view that a task is responsible for releasing the
      // resources it acquires (memory and locks) once it is done with them, rather than
      // relying on this later cleanup; if anything is released here, a resource leak
      // has occurred.
      if (releasedLocks.nonEmpty && !threwException) {
        val errMsg =
          s"${releasedLocks.size} block locks were not released by TID = $taskId:\n" +
            releasedLocks.mkString("[", ", ", "]")
        if (conf.getBoolean("spark.storage.exceptionOnPinLeak", false)) {
          throw new SparkException(errMsg)
        } else {
          logInfo(errMsg)
        }
      }
    }
    // Reaching this point means user code did not rethrow a fetch failure, yet the
    // framework detected one: the user code swallowed the exception, which is clearly
    // wrong, so an error is logged to alert the user.
    task.context.fetchFailed.foreach { fetchFailure =>
      // uh-oh.  it appears the user code has caught the fetch-failure without throwing any
      // other exceptions.  Its *possible* this is what the user meant to do (though highly
      // unlikely).  So we will log an error and keep going.
      logError(s"TID ${taskId} completed successfully though internally it encountered " +
        s"unrecoverable fetch failures!  Most likely this means user code is incorrectly " +
        s"swallowing Spark's internal ${classOf[FetchFailedException]}", fetchFailure)
    }
    // task finish time
    val taskFinish = System.currentTimeMillis()
    // CPU time used by the task thread
    val taskFinishCpu = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime
    } else 0L

    // If the task has been killed, let's fail it.
    // check again whether the task has been killed
    task.context.killTaskIfInterrupted()

    // serialize the task result
    val resultSer = env.serializer.newInstance()
    val beforeSerialization = System.currentTimeMillis()
    val valueBytes = resultSer.serialize(value)
    val afterSerialization = System.currentTimeMillis()

    // Deserialization happens in two parts: first, we deserialize a Task object, which
    // includes the Partition. Second, Task.run() deserializes the RDD and function to be run.
    task.metrics.setExecutorDeserializeTime(
      (taskStart - deserializeStartTime) + task.executorDeserializeTime)
    task.metrics.setExecutorDeserializeCpuTime(
      (taskStartCpu - deserializeStartCpuTime) + task.executorDeserializeCpuTime)
    // We need to subtract Task.run()'s deserialization time to avoid double-counting
    task.metrics.setExecutorRunTime((taskFinish - taskStart) - task.executorDeserializeTime)
    task.metrics.setExecutorCpuTime(
      (taskFinishCpu - taskStartCpu) - task.executorDeserializeCpuTime)
    task.metrics.setJvmGCTime(computeTotalGcTime() - startGCTime)
    task.metrics.setResultSerializationTime(afterSerialization - beforeSerialization)

    // Expose task metrics using the Dropwizard metrics system.
    // Update task metrics counters
    executorSource.METRIC_CPU_TIME.inc(task.metrics.executorCpuTime)
    executorSource.METRIC_RUN_TIME.inc(task.metrics.executorRunTime)
    executorSource.METRIC_JVM_GC_TIME.inc(task.metrics.jvmGCTime)
    executorSource.METRIC_DESERIALIZE_TIME.inc(task.metrics.executorDeserializeTime)
    executorSource.METRIC_DESERIALIZE_CPU_TIME.inc(task.metrics.executorDeserializeCpuTime)
    executorSource.METRIC_RESULT_SERIALIZE_TIME.inc(task.metrics.resultSerializationTime)
    executorSource.METRIC_SHUFFLE_FETCH_WAIT_TIME
      .inc(task.metrics.shuffleReadMetrics.fetchWaitTime)
    executorSource.METRIC_SHUFFLE_WRITE_TIME.inc(task.metrics.shuffleWriteMetrics.writeTime)
    executorSource.METRIC_SHUFFLE_TOTAL_BYTES_READ
      .inc(task.metrics.shuffleReadMetrics.totalBytesRead)
    executorSource.METRIC_SHUFFLE_REMOTE_BYTES_READ
      .inc(task.metrics.shuffleReadMetrics.remoteBytesRead)
    executorSource.METRIC_SHUFFLE_REMOTE_BYTES_READ_TO_DISK
      .inc(task.metrics.shuffleReadMetrics.remoteBytesReadToDisk)
    executorSource.METRIC_SHUFFLE_LOCAL_BYTES_READ
      .inc(task.metrics.shuffleReadMetrics.localBytesRead)
    executorSource.METRIC_SHUFFLE_RECORDS_READ
      .inc(task.metrics.shuffleReadMetrics.recordsRead)
    executorSource.METRIC_SHUFFLE_REMOTE_BLOCKS_FETCHED
      .inc(task.metrics.shuffleReadMetrics.remoteBlocksFetched)
    executorSource.METRIC_SHUFFLE_LOCAL_BLOCKS_FETCHED
      .inc(task.metrics.shuffleReadMetrics.localBlocksFetched)
    executorSource.METRIC_SHUFFLE_BYTES_WRITTEN
      .inc(task.metrics.shuffleWriteMetrics.bytesWritten)
    executorSource.METRIC_SHUFFLE_RECORDS_WRITTEN
      .inc(task.metrics.shuffleWriteMetrics.recordsWritten)
    executorSource.METRIC_INPUT_BYTES_READ
      .inc(task.metrics.inputMetrics.bytesRead)
    executorSource.METRIC_INPUT_RECORDS_READ
      .inc(task.metrics.inputMetrics.recordsRead)
    executorSource.METRIC_OUTPUT_BYTES_WRITTEN
      .inc(task.metrics.outputMetrics.bytesWritten)
    executorSource.METRIC_OUTPUT_RECORDS_WRITTEN
      .inc(task.metrics.inputMetrics.recordsRead)
    executorSource.METRIC_RESULT_SIZE.inc(task.metrics.resultSize)
    executorSource.METRIC_DISK_BYTES_SPILLED.inc(task.metrics.diskBytesSpilled)
    executorSource.METRIC_MEMORY_BYTES_SPILLED.inc(task.metrics.memoryBytesSpilled)

    // Note: accumulator updates must be collected after TaskMetrics is updated
    // collect the accumulator updates for this task
    val accumUpdates = task.collectAccumulatorUpdates()
    // TODO: do not serialize value twice
    val directResult = new DirectTaskResult(valueBytes, accumUpdates)
    val serializedDirectResult = ser.serialize(directResult)
    val resultSize = serializedDirectResult.limit()

    // directSend = sending directly back to the driver
    val serializedResult: ByteBuffer = {
      if (maxResultSize > 0 && resultSize > maxResultSize) {
        logWarning(s"Finished $taskName (TID $taskId). Result is larger than maxResultSize " +
          s"(${Utils.bytesToString(resultSize)} > ${Utils.bytesToString(maxResultSize)}), " +
          s"dropping it.")
        ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))
      } else if (resultSize > maxDirectResultSize) {
        val blockId = TaskResultBlockId(taskId)
        env.blockManager.putBytes(
          blockId,
          new ChunkedByteBuffer(serializedDirectResult.duplicate()),
          StorageLevel.MEMORY_AND_DISK_SER)
        logInfo(
          s"Finished $taskName (TID $taskId). $resultSize bytes result sent via BlockManager)")
        ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
      } else {
        logInfo(s"Finished $taskName (TID $taskId). $resultSize bytes result sent to driver")
        serializedDirectResult
      }
    }

    setTaskFinishedAndClearInterruptStatus()
    execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)

  } catch {
    case t: Throwable if hasFetchFailure && !Utils.isFatalError(t) =>
      val reason = task.context.fetchFailed.get.toTaskFailedReason
      if (!t.isInstanceOf[FetchFailedException]) {
        // There was a fetch failure in the task, but some user code wrapped that exception
        // and threw something else.  Regardless, we treat it as a fetch failure.
        val fetchFailedCls = classOf[FetchFailedException].getName
        logWarning(s"TID ${taskId} encountered a ${fetchFailedCls} and " +
          s"failed, but the ${fetchFailedCls} was hidden by another " +
          s"exception.  Spark is handling this like a fetch failure and ignoring the " +
          s"other exception: $t")
      }
      setTaskFinishedAndClearInterruptStatus()
      execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))

    case t: TaskKilledException =>
      logInfo(s"Executor killed $taskName (TID $taskId), reason: ${t.reason}")
      setTaskFinishedAndClearInterruptStatus()
      execBackend.statusUpdate(taskId, TaskState.KILLED, ser.serialize(TaskKilled(t.reason)))

    case _: InterruptedException | NonFatal(_) if
        task != null && task.reasonIfKilled.isDefined =>
      val killReason = task.reasonIfKilled.getOrElse("unknown reason")
      logInfo(s"Executor interrupted and killed $taskName (TID $taskId), reason: $killReason")
      setTaskFinishedAndClearInterruptStatus()
      execBackend.statusUpdate(
        taskId, TaskState.KILLED, ser.serialize(TaskKilled(killReason)))

    case CausedBy(cDE: CommitDeniedException) =>
      val reason = cDE.toTaskCommitDeniedReason
      setTaskFinishedAndClearInterruptStatus()
      execBackend.statusUpdate(taskId, TaskState.KILLED, ser.serialize(reason))

    case t: Throwable =>
      // Attempt to exit cleanly by informing the driver of our failure.
      // If anything goes wrong (or this was a fatal exception), we will delegate to
      // the default uncaught exception handler, which will terminate the executor.
      logError(s"Exception in $taskName (TID $taskId)", t)

      // SPARK-20904: Do not report failure to driver if if happened during shut down. Because
      // libraries may set up shutdown hooks that race with running tasks during shutdown,
      // spurious failures may occur and can result in improper accounting in the driver (e.g.
      // the task failure would not be ignored if the shutdown happened because of premption,
      // instead of an app issue).
      if (!ShutdownHookManager.inShutdown()) {
        // Collect latest accumulator values to report back to the driver
        val accums: Seq[AccumulatorV2[_, _]] =
          if (task != null) {
            task.metrics.setExecutorRunTime(System.currentTimeMillis() - taskStart)
            task.metrics.setJvmGCTime(computeTotalGcTime() - startGCTime)
            task.collectAccumulatorUpdates(taskFailed = true)
          } else {
            Seq.empty
          }

        val accUpdates = accums.map(acc => acc.toInfo(Some(acc.value), None))

        val serializedTaskEndReason = {
          try {
            ser.serialize(new ExceptionFailure(t, accUpdates).withAccums(accums))
          } catch {
            case _: NotSerializableException =>
              // t is not serializable so just send the stacktrace
              ser.serialize(new ExceptionFailure(t, accUpdates, false).withAccums(accums))
          }
        }
        setTaskFinishedAndClearInterruptStatus()
        execBackend.statusUpdate(taskId, TaskState.FAILED, serializedTaskEndReason)
      } else {
        logInfo("Not reporting error to driver during JVM shutdown.")
      }

      // Don't forcibly exit unless the exception was inherently fatal, to avoid
      // stopping other tasks unnecessarily.
      if (!t.isInstanceOf[SparkOutOfMemoryError] && Utils.isFatalError(t)) {
        uncaughtExceptionHandler.uncaughtException(Thread.currentThread(), t)
      }
  } finally {
    runningTasks.remove(taskId)
  }
}

Task.run

final def run(
    taskAttemptId: Long,
    attemptNumber: Int,
    metricsSystem: MetricsSystem): T = {
  SparkEnv.get.blockManager.registerTask(taskAttemptId)
  context = new TaskContextImpl(
    stageId,
    stageAttemptId, // stageAttemptId and stageAttemptNumber are semantically equal
    partitionId,
    taskAttemptId,
    attemptNumber,
    taskMemoryManager,
    localProperties,
    // the metrics system is the one held by SparkEnv
    metricsSystem,
    metrics)
  TaskContext.setTaskContext(context)
  // remember the thread that runs this task
  taskThread = Thread.currentThread()

  // mainly sets the kill-reason flag in the TaskContext
  // and optionally interrupts the thread once
  if (_reasonIfKilled != null) {
    kill(interruptThread = false, _reasonIfKilled)
  }

  new CallerContext(
    "TASK",
    SparkEnv.get.conf.get(APP_CALLER_CONTEXT),
    appId,
    appAttemptId,
    jobId,
    Option(stageId),
    Option(stageAttemptId),
    Option(taskAttemptId),
    Option(attemptNumber)).setCurrentContext()

  try {
    runTask(context)
  } catch {
    case e: Throwable =>
      // Catch all errors; run task failure callbacks, and rethrow the exception.
      try {
        context.markTaskFailed(e)
      } catch {
        case t: Throwable =>
          e.addSuppressed(t)
      }
      context.markTaskCompleted(Some(e))
      throw e
  } finally {
    try {
      // Call the task completion callbacks. If "markTaskCompleted" is called twice, the second
      // one is no-op.
      context.markTaskCompleted(None)
    } finally {
      try {
        Utils.tryLogNonFatalError {
          // Release memory used by this thread for unrolling blocks.
          // This goes through the memory manager and essentially just updates the
          // bookkeeping of memory usage; reclaiming the memory itself is still up to GC.
          SparkEnv.get.blockManager.memoryStore.releaseUnrollMemoryForThisTask(MemoryMode.ON_HEAP)
          SparkEnv.get.blockManager.memoryStore.releaseUnrollMemoryForThisTask(
            MemoryMode.OFF_HEAP)
          // Notify any tasks waiting for execution memory to be freed to wake up and try to
          // acquire memory again. This makes impossible the scenario where a task sleeps forever
          // because there are no other tasks left to notify it. Since this is safe to do but may
          // not be strictly necessary, we should revisit whether we can remove this in the
          // future.
          val memoryManager = SparkEnv.get.memoryManager
          // once memory is released, wake up other threads waiting for memory
          memoryManager.synchronized { memoryManager.notifyAll() }
        }
      } finally {
        // Though we unset the ThreadLocal here, the context member variable itself is still
        // queried directly in the TaskRunner to check for FetchFailedExceptions.
        TaskContext.unset()
      }
    }
  }
}
  • Create a TaskContextImpl and store it in a ThreadLocal variable
  • Check whether the task has been killed
  • Call runTask to execute the actual task logic
  • Finally, release the memory acquired during shuffle for unrolling data

So the next thing to analyze is obviously the runTask method, which is abstract. ResultTask is simple enough that a short sketch (below) will do; here I will focus on ShuffleMapTask.
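For reference, ResultTask.runTask essentially just deserializes the (rdd, func) pair shipped in the task binary and applies the function to the partition's iterator. A sketch with the timing bookkeeping stripped out (abridged, not the verbatim source):

override def runTask(context: TaskContext): U = {
  // deserialize the RDD and the user function carried by the task binary
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  // compute the partition through the RDD's iterator and apply the user function;
  // its return value becomes the task result sent back to the driver
  func(context, rdd.iterator(partition, context))
}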

ShuffleMapTask.runTask

override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val threadMXBean = ManagementFactory.getThreadMXBean
  val deserializeStartTime = System.currentTimeMillis()
  val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime
  } else 0L
  val ser = SparkEnv.get.closureSerializer.newInstance()
  // deserialize the RDD and the shuffle dependency -- the key step
  // (it is worth thinking about how the SparkContext reference inside the RDD
  // is handled during this deserialization)
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
  _executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
  } else 0L

  var writer: ShuffleWriter[Any, Any] = null
  try {
    // the shuffle manager
    val manager = SparkEnv.get.shuffleManager
    // obtain a shuffle writer
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    // The core of RDD computation is the iterator method.
    // SortShuffleWriter.write works roughly as follows:
    // the records produced by the upstream RDD (via rdd.iterator) are written into an
    // in-memory buffer; whenever the buffer exceeds the memory threshold it is spilled
    // to disk, possibly producing several spill files.
    // Finally the spill files and the remaining in-memory data are merge-sorted and
    // written into a single large data file, sorted first by partition and then by key.
    // During this final merge, the writer flushes after each partition and records the
    // offset of that partition's data in the file, so once a task has finished writing
    // there are two files on disk: the data file and an index file recording the offset
    // of each reduce-side partition.
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    // mainly deletes the intermediate spill files and returns the acquired memory
    // to the memory manager
    writer.stop(success = true).get
  } catch {
    case e: Exception =>
      try {
        if (writer != null) {
          writer.stop(success = false)
        }
      } catch {
        case e: Exception =>
          log.debug("Could not stop writer", e)
      }
      throw e
  }
}

The overall logic of this method is fairly simple: it obtains the computed result of the partition assigned to this task through the RDD's iterator method (the result comes back as an iterator), has the shuffle manager write it through the BlockManager into block files, and then sends the block information back to the driver to be reported to BlockManagerMaster.
So there are really two important steps: computing the result through the RDD's computation chain, and writing the result, sorted and partitioned, to a file.
Here I will analyze the second step first.

SortShuffleWriter.write

Since Spark 2.0 the shuffle manager is the sort-based SortShuffleManager, so the writer obtained from it here is, in the common case, a SortShuffleWriter; when the bypass conditions are met (no map-side combine and fewer than 200 partitions) a BypassMergeSortShuffleWriter is used instead.
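The bypass check itself lives in SortShuffleWriter.shouldBypassMergeSort; roughly, quoted from memory of the 2.x source (the threshold comes from spark.shuffle.sort.bypassMergeThreshold and defaults to 200, so treat this as a sketch rather than an exact quote):

def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  // We cannot bypass sorting if we need to do map-side aggregation.
  if (dep.mapSideCombine) {
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
    false
  } else {
    // bypass only when the number of reduce partitions is small enough
    val bypassMergeThreshold = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}

Back to the common case, SortShuffleWriter.write looks like this: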

override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    // map-side combine: the user must have supplied an aggregator (and possibly an ordering)
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  // feed all map-side records into the sorter;
  // this may produce several spill files on disk
  sorter.insertAll(records)

  // Don't bother including the time to open the merged output file in the shuffle write time,
  // because it just opens a single file, so is typically too fast to measure accurately
  // (see SPARK-3570).
  // mapId is the partitionId of the map-side RDD;
  // resolve the shuffle output data file for this map partition
  val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
  // append a UUID suffix to get a temporary file
  val tmp = Utils.tempFileWith(output)
  try {
    val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
    // merge-sort the spill files on disk together with the in-memory data
    // and write them into a single file, still under the temporary name
    val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
    // write the index file, then atomically rename (move) the temporary index and
    // data files to their final names
    shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
    // return a status object containing the shuffle server id and the offsets of
    // each partition's data within the file
    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
  } finally {
    if (tmp.exists() && !tmp.delete()) {
      logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
    }
  }
}

To summarize the main logic of this method:

  • First obtain a sorter, checking whether a map-side combiner is present
  • Write the RDD's computed records into the sorter, which may spill several files to disk along the way
  • Merge-sort the small spill files together with the data remaining in the in-memory buffer and write everything into a single file
  • Write the offset of each partition's data within that file into an index file, which the reduce side uses when fetching data (see the sketch after this list)
  • Return a MapStatus object wrapping the id of this executor's BlockManager and the offsets of each partition in the data file
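To make the index file concrete: it is essentially the cumulative offsets derived from partitionLengths, so the bytes for reduce partition i live in the range [offsets(i), offsets(i + 1)) of the data file. A small illustrative helper (not Spark code) that computes those offsets:

// Illustrative only: given the per-partition lengths returned by
// writePartitionedFile, compute the byte offsets a reducer would use.
def partitionOffsets(partitionLengths: Array[Long]): Array[Long] =
  partitionLengths.scanLeft(0L)(_ + _)

val lengths = Array(120L, 0L, 340L)      // hypothetical lengths for 3 reduce partitions
val offsets = partitionOffsets(lengths)  // Array(0, 120, 120, 460)
// reduce partition 2 reads bytes [offsets(2), offsets(3)) = [120, 460) of the data file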

Summary

That is as far as this article goes. The remaining code belongs to the sorter's internal logic for sorting data and spilling files, which deserves an article of its own.
To summarize the execution flow of a task on the executor side:

  • First, the RPC endpoint on the executor side receives the LaunchTask message and deserializes the task data it carries into a TaskDescription
  • The task is handed to the Executor object to run
  • The Executor creates a TaskRunner from the TaskDescription and submits it to the thread pool. The pool is a cached thread pool (in the style of Executors.newCachedThreadPool), so no threads are kept running while the executor is idle
  • The TaskRunner further deserializes the task and calls Task.run to execute the task's logic
  • A ShuffleMapTask sorts and merges the RDD's computed records, writes them into a single data file, and writes an index file alongside it
  • After the task finishes, task statistics and the related counters in the metrics system are updated
  • Finally, depending on the size of the serialized result, one of several ways is chosen to send the result back to the driver