Delta Lake - The Grand Finale of Insert, Delete, and Update Transactions

程序员文章站 2022-07-14 20:38:28


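In "Delta Lake - The Journey of a Data Write" and "Delta Lake - The Journey of a Data Update", we walked through the source code behind Delta Lake writes and updates and worked through hands-on examples, so readers should already have a solid understanding of both.

Data that is no longer needed, or that turns out to be faulty, has to be deleted. So how does Delta Lake implement deletion?

In this installment I trace a Delta Lake delete from start to finish through the source code.

This is the last part of the series on Delta inserts, deletes, and updates. After finishing it, readers should have a working grasp of the basics of Delta Lake.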

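A Delete Example

I run the delete against the table produced by the updates in "Delta Lake - The Journey of a Data Update".

The examples are in Scala. First, here is the current Delta data: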


scala> import io.delta.tables._
scala> val deltaTable = DeltaTable.forPath(spark,"/spark/datasets/delta/")
scala> deltaTable.toDF.show()
+---+-------+
|age|   name|
+---+-------+
| 40|Michael|
| 30|   Andy|
| 19| Justin|
+---+-------+

Run the Delta delete:


scala> deltaTable.delete("name = 'Justin'")

scala> deltaTable.toDF.show()
+---+-------+
|age|   name|
+---+-------+
| 40|Michael|
| 30|   Andy|
+---+-------+

Delta Delete History

The Delta Lake HISTORY command shows the operation history, returned in reverse chronological order:



scala> deltaTable.history().show()
+-------+--------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+
|version|           timestamp|userId|userName|operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|
+-------+--------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+
|      3|2019-12-02 22:30:...|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          2|          null|        false|
|      2|2019-11-21 16:38:...|  null|    null|   UPDATE|[predicate -> (na...|null|    null|     null|          1|          null|        false|
|      1|2019-11-21 16:29:...|  null|    null|   UPDATE|[predicate -> (na...|null|    null|     null|          0|          null|        false|
|      0|2019-11-21 16:18:...|  null|    null|    WRITE|[mode -> ErrorIfE...|null|    null|     null|       null|          null|         true|
+-------+--------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+

In the HISTORY output, the record with version 3 is the delete we just ran.

To see the full record of the delete:


scala> deltaTable.history(1).show(false)
+-------+-----------------------+------+--------+---------+--------------------------------------+----+--------+---------+-----------+--------------+-------------+
|version|timestamp              |userId|userName|operation|operationParameters                   |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|
+-------+-----------------------+------+--------+---------+--------------------------------------+----+--------+---------+-----------+--------------+-------------+
|3      |2019-12-02 22:30:49.969|null  |null    |DELETE   |[predicate -> ["(`name` = 'Justin')"]]|null|null    |null     |2          |null          |false        |
+-------+-----------------------+------+--------+---------+--------------------------------------+----+--------+---------+-----------+--------------+-------------+

Analyzing the Delta Transaction Log

Once the delete above succeeds, it produces a new transaction log file:


$ hdfs dfs -cat /spark/datasets/delta/_delta_log/00000000000000000003.json

{"commitInfo":{"timestamp":1575297049841,"operation":"DELETE","operationParameters":{"predicate":"[\"(`name` = 'Justin')\"]"},"readVersion":2,"isBlindAppend":false}}
{"remove":{"path":"part-00000-9c1da674-7c4d-4061-ba3c-0ae3926bd593-c000.snappy.parquet","deletionTimestamp":1575297049819,"dataChange":true}}
{"add":{"path":"part-00000-240bcdd5-b087-4696-bdc4-f0ce64dcc7ae-c000.snappy.parquet","partitionValues":{},"size":641,"modificationTime":1575297049785,"dataChange":true}}

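The meaning of each transaction log entry was covered in detail in the earlier articles, so I will not repeat it here. In short, this commit removes one Parquet file and adds one rewritten file.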
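To make the effect of such a commit concrete, here is a minimal sketch of how replaying `add` and `remove` actions in order yields the set of live data files. It is not Delta Lake's implementation: `LogAction`, `Add`, `Remove`, and `LogReplay.liveFiles` are hypothetical names loosely modeled on the JSON entries above, and a real log carries much more metadata.

```scala
// Simplified, hypothetical model of two Delta log actions (not the real classes).
sealed trait LogAction
final case class Add(path: String) extends LogAction
final case class Remove(path: String) extends LogAction

object LogReplay {
  // Replays actions in commit order: an add makes a file live, a remove drops it.
  def liveFiles(actions: Seq[LogAction]): Set[String] =
    actions.foldLeft(Set.empty[String]) {
      case (live, Add(path))    => live + path
      case (live, Remove(path)) => live - path
    }
}
```

For example, replaying `Seq(Add("a"), Remove("a"), Add("b"))` leaves only `b` live, which matches the shape of version 3 above: the old file is retired and its rewritten replacement takes over.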

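A few additional points:

1. At the time of writing, the latest open-source release of Delta Lake supports Delete through the Scala, Java, and Python APIs but not through SQL. SQL DELETE is available only in the commercial Databricks Runtime.
2. After a Delete succeeds, the underlying data files are not physically removed. They are only marked as removed in the transaction log. The files are actually deleted only when the vacuum command runs, and only once they are older than the retention period.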
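The second point can be pictured with a small conceptual model. This is only a sketch of the retention idea, not how vacuum is implemented: `Tombstone`, `VacuumSketch`, and `eligibleForCleanup` are hypothetical names, and the real command works against the files in the table directory rather than an in-memory list.

```scala
// Conceptual model only: a file that some commit marked as removed.
final case class Tombstone(path: String, deletionTimestamp: Long)

object VacuumSketch {
  private val MillisPerHour = 60L * 60L * 1000L

  // Returns the paths whose tombstones are older than the retention window.
  def eligibleForCleanup(
      tombstones: Seq[Tombstone],
      nowMs: Long,
      retentionHours: Double): Seq[String] = {
    val cutoff = nowMs - (retentionHours * MillisPerHour).toLong
    tombstones.filter(_.deletionTimestamp < cutoff).map(_.path)
  }
}
```

In the real API the corresponding entry points are `deltaTable.vacuum()` and `deltaTable.vacuum(retentionHours)`. Until one of them runs, the removed Parquet file is still on storage, which is what keeps older table versions readable.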

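With that, on to the main topic: understanding the Delta Lake Delete operation through its source code.

The Journey of a Data Delete

As we saw with the update operation, the Delete API is also implemented in io.delta.tables.DeltaTable. Three methods deal with deletion: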


// Delete data from the table that match the given `condition`.
def delete(condition: String): Unit = {
  delete(functions.expr(condition))
}

// Delete data from the table that match the given `condition`.
def delete(condition: Column): Unit = {
  executeDelete(Some(condition.expr))
}

// Delete data from the table.
def delete(): Unit = {
  executeDelete(None)
}

All three methods end up calling executeDelete, which is defined in io.delta.tables.execution.DeltaTableOperations, the trait that holds the actual implementations of the DeltaTable operations.

executeDelete is implemented as follows:


// Expression is a Catalyst expression
protected def executeDelete(condition: Option[Expression]): Unit = {
  // Delete is defined as case class Delete(child: LogicalPlan, condition: Option[Expression])
  // child is the analyzed logical plan of the Delta Lake table
  // condition is the predicate expression for the delete
  val delete = Delete(self.toDF.queryExecution.analyzed, condition)

  // The current version of DELETE does not support subqueries
  subqueryNotSupportedCheck(condition, "DELETE")

  // Build the query execution and resolve the Delete plan
  val qe = sparkSession.sessionState.executePlan(delete)
  val resolvedDelete = qe.analyzed.asInstanceOf[Delete]
  // DeleteCommand is analyzed in detail below
  val deleteCommand = DeleteCommand(resolvedDelete)
  deleteCommand.run(sparkSession)
}

Next, I focus on these two lines:


val deleteCommand = DeleteCommand(resolvedDelete)
deleteCommand.run(sparkSession)

The DeleteCommand case class has a companion object, DeleteCommand, which defines an apply method:


object DeleteCommand {
  def apply(delete: Delete): DeleteCommand = {
    val index = EliminateSubqueryAliases(delete.child) match {
      case DeltaFullTable(tahoeFileIndex) =>
        tahoeFileIndex
      case o =>
        throw DeltaErrors.notADeltaSourceException("DELETE", Some(o))
    }
    DeleteCommand(index, delete.child, delete.condition)
  }

  val FILE_NAME_COLUMN = "_input_file_name_"
}
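The shape of this factory, strip aliases, pattern-match on the plan, and reject anything that is not a Delta table, can be sketched in plain Scala. Everything below is a hypothetical stand-in: `Plan`, `Alias`, `DeltaRelation`, `OtherRelation`, `DeleteCmd`, and `fromPlan` are simplified illustrations, not Catalyst or Delta classes.

```scala
// Hypothetical, simplified stand-ins for Catalyst plans and a Delta relation.
sealed trait Plan
final case class Alias(name: String, child: Plan) extends Plan
final case class DeltaRelation(path: String) extends Plan
final case class OtherRelation(format: String) extends Plan

final case class DeleteCmd(tablePath: String, condition: Option[String])

object DeleteCmd {
  // Strips alias wrappers, loosely mirroring what EliminateSubqueryAliases does.
  @annotation.tailrec
  private def stripAliases(plan: Plan): Plan = plan match {
    case Alias(_, child) => stripAliases(child)
    case other           => other
  }

  // Factory: accepts only plans that resolve to a Delta relation.
  def fromPlan(plan: Plan, condition: Option[String]): DeleteCmd =
    stripAliases(plan) match {
      case DeltaRelation(path) => DeleteCmd(path, condition)
      case other =>
        throw new IllegalArgumentException(
          s"DELETE is only supported on Delta tables, got: $other")
    }
}
```

Validating in the factory means a command object can only ever exist for a Delta table, so the run method does not need to re-check the source type.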

DeleteCommand(resolvedDelete) calls this apply method to construct a DeleteCommand, which is defined as:


case class DeleteCommand(
    tahoeFileIndex: TahoeFileIndex,
    target: LogicalPlan,
    condition: Option[Expression])
  extends RunnableCommand with DeltaCommand

As shown, DeleteCommand extends Spark's RunnableCommand trait. Here is RunnableCommand:


trait RunnableCommand extends Command {

  // The map used to record the metrics of running the command. This will be passed to
  // `ExecutedCommand` during query planning.
  lazy val metrics: Map[String, SQLMetric] = Map.empty

  def run(sparkSession: SparkSession): Seq[Row]
}

Because it extends RunnableCommand, DeleteCommand must implement the run method. The update and merge commands we studied earlier extend the same trait.

DeleteCommand implements run as follows:


final override def run(sparkSession: SparkSession): Seq[Row] = {
  // Records the duration and the success or failure of the operation
  recordDeltaOperation(tahoeFileIndex.deltaLog, "delta.dml.delete") {
    // Get the transaction log object
    val deltaLog = tahoeFileIndex.deltaLog
    // Check that the Delta table allows deletes.
    // A Delta Lake table can be configured as appendOnly, so this must be verified
    deltaLog.assertRemovable()
    // Start a new transaction and run the delete inside it to guarantee atomicity
    deltaLog.withNewTransaction { txn =>
      performDelete(sparkSession, deltaLog, txn)
    }
    // Re-cache all cached plans(including this relation itself, if it's cached) that refer to
    // this data source relation.
    sparkSession.sharedState.cacheManager.recacheByPlan(sparkSession, target)
  }
  Seq.empty[Row]
}
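The append-only guard is easy to picture in isolation. The sketch below assumes the flag is read from the `delta.appendOnly` table property. `TableConfig` and `RemovableCheck` are hypothetical names used only for illustration, and the real check lives inside DeltaLog.

```scala
// Hypothetical stand-in for a table's configuration map.
final case class TableConfig(properties: Map[String, String]) {
  // Treats the table as append-only when delta.appendOnly is set to "true".
  def isAppendOnly: Boolean =
    properties.get("delta.appendOnly").exists(_.equalsIgnoreCase("true"))
}

object RemovableCheck {
  // Fails fast, before any transaction starts, if deletes are not allowed.
  def assertRemovable(config: TableConfig): Unit =
    if (config.isAppendOnly) {
      throw new UnsupportedOperationException(
        "This table is append-only; existing data cannot be deleted or updated.")
    }
}
```

Running the check before opening the transaction means a rejected delete never touches the log.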

Let's revisit how withNewTransaction is implemented:


def withNewTransaction[T](thunk: OptimisticTransaction => T): T = {
  try {
    // Apply any new delta files to bring the log up to date,
    // refreshing the snapshot of the table's transaction log
    update()
    // Instantiate the optimistic transaction
    val txn = new OptimisticTransaction(this)
    // Register it as the active transaction
    OptimisticTransaction.setActive(txn)
    // For a delete, thunk is performDelete(sparkSession, deltaLog, txn),
    // the core method of Delta Delete
    thunk(txn)
  } finally {
    // Clear the active transaction
    OptimisticTransaction.clearActive()
  }
}
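The try/finally pattern here guarantees that the active transaction is cleared even when the delete fails. A minimal sketch of the same idea follows, assuming the active transaction is tracked per thread. `Txn` and `ActiveTxn` are hypothetical names, and the real OptimisticTransaction does far more.

```scala
// Hypothetical minimal transaction type, for illustration only.
final class Txn(val readVersion: Long)

object ActiveTxn {
  // One slot per thread; None means no transaction is active.
  private val active = new ThreadLocal[Option[Txn]] {
    override def initialValue(): Option[Txn] = None
  }

  def current: Option[Txn] = active.get()

  // Runs `body` with a fresh transaction registered as active, and always
  // clears the registration afterwards, even if `body` throws.
  def withNewTransaction[T](latestVersion: Long)(body: Txn => T): T = {
    val txn = new Txn(latestVersion)
    active.set(Some(txn))
    try body(txn)
    finally active.set(None)
  }
}
```

Inside the body, `ActiveTxn.current` returns the running transaction. Once the call returns or throws, it is `None` again, so a failed delete cannot leave a stale transaction behind.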

The core of Delta Lake's delete logic lives in the performDelete method. The source is analyzed below, with comments added for easier reading:


private def performDelete(
    sparkSession: SparkSession, deltaLog: DeltaLog, txn: OptimisticTransaction) = {
  import sparkSession.implicits._

  // Statistics
  var numTouchedFiles: Long = 0
  var numRewrittenFiles: Long = 0
  var scanTimeMs: Long = 0
  var rewriteTimeMs: Long = 0

  // Start time
  val startTime = System.nanoTime()
  val numFilesTotal = deltaLog.snapshot.numOfFiles

  val deleteActions: Seq[Action] = condition match {
    // A Delta delete falls into one of several cases, explained in turn below
    // If delete is called without a condition, all data in the Delta table is deleted (Case 1)
    case None =>
      // Case 1: the condition is effectively true, so every file of the Delta table can be removed
      // Get all AddFile entries from the in-memory snapshot
      val allFiles = txn.filterFiles(Nil)

      // Number of files
      numTouchedFiles = allFiles.size
      scanTimeMs = (System.nanoTime() - startTime) / 1000 / 1000

      val operationTimestamp = System.currentTimeMillis()
      // Turn each AddFile into a RemoveFile to mark it as deleted
      allFiles.map(_.removeWithTimestamp(operationTimestamp))
    // If a condition was passed, Case 2 or Case 3 applies
    case Some(cond) =>
      // Split the condition into partition predicates and other (data) predicates
      val (metadataPredicates, otherPredicates) =
        DeltaTableUtils.splitMetadataAndDataPredicates(
          cond, txn.metadata.partitionColumns, sparkSession)

      // No data predicates: the condition uses partition columns only
      if (otherPredicates.isEmpty) {
        // Case 2: pick the files to delete from the in-memory snapshot and remove them directly, with no data rewrite
        val operationTimestamp = System.currentTimeMillis()
        // Select the AddFile entries in the snapshot that match the partition predicates
        val candidateFiles = txn.filterFiles(metadataPredicates)

        scanTimeMs = (System.nanoTime() - startTime) / 1000 / 1000
        numTouchedFiles = candidateFiles.size

        // Mark the selected files as RemoveFile
        candidateFiles.map(_.removeWithTimestamp(operationTimestamp))
      } else {
        // Case 3 is the last and slightly more involved case:
        // the delete condition includes predicates on non-partition columns.
        // It splits further into Case 3.1 and Case 3.2, described below

        // Use the partition and data predicates to find the files that may contain rows to delete
        val candidateFiles = txn.filterFiles(metadataPredicates ++ otherPredicates)

        // Record the number of candidate files
        numTouchedFiles = candidateFiles.size
        // Build a map from file name to AddFile, used by operations that rewrite files (delete, merge, update). File names in the map are unique because each contains a UUID
        val nameToAddFileMap = generateCandidateFileMap(deltaLog.dataPath, candidateFiles)

        // Build a file index restricted to the candidate files
        val fileIndex = new TahoeBatchFileIndex(
          sparkSession, "delete", candidateFiles, deltaLog, tahoeFileIndex.path, txn.snapshot)

        // target is the logical plan whose file index is replaced
        // fileIndex is the new file index
        // replaceFileIndex swaps the file index inside the logical plan and returns the updated plan
        val newTarget = DeltaTableUtils.replaceFileIndex(target, fileIndex)
        // Dataset over the candidate files
        val data = Dataset.ofRows(sparkSession, newTarget)
        val filesToRewrite =
          withStatusCode("DELTA", s"Finding files to rewrite for DELETE operation") {
            // No candidate files
            if (numTouchedFiles == 0) {
              Array.empty[String]
            } else {
              // Keep only the files that actually contain rows to delete
              // and collect their distinct names into an Array
              data.filter(new Column(cond)).select(new Column(InputFileName())).distinct()
                .as[String].collect()
            }
          }

        scanTimeMs = (System.nanoTime() - startTime) / 1000 / 1000
        if (filesToRewrite.isEmpty) {
          // Case 3.1: no file contains rows to delete, so nothing needs to be done. Return Nil; no transaction log entry is written
          Nil
        } else {
          // Case 3.2: some files need an update to remove the deleted rows.
          // Do the second pass and just read the affected files, writing the rows
          // that must be kept into new files
          val baseRelation = buildBaseRelation(
            sparkSession, txn, "delete", tahoeFileIndex.path, filesToRewrite, nameToAddFileMap)
          // Keep everything from the resolved target except a new TahoeFileIndex
          // that only involves the affected files instead of all files.
          val newTarget = DeltaTableUtils.replaceFileIndex(target, baseRelation.location)

          // Dataset over the files that contain rows to delete
          val targetDF = Dataset.ofRows(sparkSession, newTarget)
          // Negate the delete condition to select the rows to keep; these are written to new files
          val filterCond = Not(EqualNullSafe(cond, Literal(true, BooleanType)))
          // Rows that must not be deleted
          val updatedDF = targetDF.filter(new Column(filterCond))

          // rewrittenFiles is the list of newly added AddFile entries
          val rewrittenFiles = withStatusCode(
            "DELTA", s"Rewriting ${filesToRewrite.size} files for DELETE operation") {
            // Within the transaction, write the rows to keep into new files
            txn.writeFiles(updatedDF)
          }

          numRewrittenFiles = rewrittenFiles.size
          rewriteTimeMs = (System.nanoTime() - startTime) / 1000 / 1000 - scanTimeMs

          val operationTimestamp = System.currentTimeMillis()
          // RemoveFile entries for the rewritten files plus the newly written AddFile entries
          removeFilesFromPaths(deltaLog, nameToAddFileMap, filesToRewrite, operationTimestamp) ++
            rewrittenFiles
        }
      }
  }
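  // If there are any actions to commit
  if (deleteActions.nonEmpty) {
    // Commit the transaction, which writes a new file under _delta_log recording the operation
    txn.commit(deleteActions, DeltaOperations.Delete(condition.map(_.sql).toSeq))
  }

  // Record an event with the statistics of the Delta Delete operation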
  recordDeltaEvent(
    deltaLog,
    "delta.dml.delete.stats",
    data = DeleteMetric(
      condition = condition.map(_.sql).getOrElse("true"),
      numFilesTotal,
      numTouchedFiles,
      numRewrittenFiles,
      scanTimeMs,
      rewriteTimeMs)
  )
}
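One detail in Case 3.2 deserves a closer look: the keep-condition is Not(EqualNullSafe(cond, true)) rather than a plain Not(cond). Under SQL three-valued logic a predicate can evaluate to NULL, and a filter keeps a row only when its predicate is TRUE. The sketch below models this with Option[Boolean], where None stands for NULL. `NullSafeKeep` and its methods are hypothetical illustrations, not Spark code.

```scala
object NullSafeKeep {
  // SQL three-valued logic modeled as Option[Boolean]: None stands for NULL.
  type SqlBool = Option[Boolean]

  // Plain NOT: NULL stays NULL.
  def not(value: SqlBool): SqlBool = value.map(!_)

  // Null-safe equality against TRUE (SQL's <=>): never returns NULL.
  def equalNullSafeTrue(value: SqlBool): Boolean = value.contains(true)

  // A filter keeps a row only when its predicate evaluates to TRUE.
  def passesFilter(value: SqlBool): Boolean = value.contains(true)

  // Naive keep-condition NOT(cond): drops rows where cond is NULL.
  def keptByNaiveNot(cond: SqlBool): Boolean = passesFilter(not(cond))

  // Keep-condition used in the source above: NOT(cond <=> TRUE).
  def keptByNullSafeNot(cond: SqlBool): Boolean = !equalNullSafeTrue(cond)
}
```

For a row where the delete condition is NULL (for instance, name is NULL in `name = 'Justin'`), the naive form would drop the row from the rewritten file even though the delete never matched it. The null-safe form keeps it.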

That is a fair amount of code, so I annotated most of it in the hope that readers can follow along more easily.
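To see the two-pass flow of Case 3 without Spark in the way, here is a small in-memory sketch. It is a simplification under stated assumptions: `Person`, `DataFile`, `DeletePlan`, and `RewriteSketch` are hypothetical names, rewritten files get a made-up `rewritten-` prefix instead of UUID-based names, and the real code writes the kept rows as one DataFrame rather than file by file.

```scala
// Hypothetical in-memory model: each data file holds a list of rows.
final case class Person(name: String, age: Int)
final case class DataFile(path: String, rows: Seq[Person])

// Outcome of a delete: files to mark as removed and files newly written.
final case class DeletePlan(removed: Seq[String], added: Seq[DataFile])

object RewriteSketch {
  def planDelete(files: Seq[DataFile], shouldDelete: Person => Boolean): DeletePlan = {
    // First pass: find the files containing at least one matching row.
    val touched = files.filter(_.rows.exists(shouldDelete))
    // Second pass: rewrite only those files, keeping the non-matching rows
    // and skipping files that would end up empty.
    val rewritten = touched
      .map(f => DataFile(s"rewritten-${f.path}", f.rows.filterNot(shouldDelete)))
      .filter(_.rows.nonEmpty)
    DeletePlan(touched.map(_.path), rewritten)
  }
}
```

With a single file holding Michael, Andy, and Justin, deleting `name == "Justin"` yields one removed path and one rewritten file, the same remove-plus-add shape seen in the transaction log earlier. When nothing matches, both lists are empty, which corresponds to Case 3.1.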

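Finally, it is worth stressing that the official recommendation is to include a partition filter when deleting data. This avoids scanning the whole table, unless you really do need to delete everything.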
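Why a partition filter helps follows directly from how performDelete chooses between its cases. The sketch below models that decision in plain Scala. It is only an illustration: `Conjunct`, `DeleteStrategy`, and `DeletePlanner` are hypothetical names, and the real split is done by DeltaTableUtils.splitMetadataAndDataPredicates on Catalyst expressions.

```scala
// Hypothetical simplified predicate: one conjunct plus the columns it references.
final case class Conjunct(sql: String, columns: Set[String])

sealed trait DeleteStrategy
case object DropAllFiles extends DeleteStrategy           // Case 1
case object DropMatchingPartitions extends DeleteStrategy // Case 2
case object ScanAndRewrite extends DeleteStrategy         // Case 3

object DeletePlanner {
  // Splits conjuncts into those that touch only partition columns and the rest.
  def split(conjuncts: Seq[Conjunct], partitionColumns: Set[String]): (Seq[Conjunct], Seq[Conjunct]) =
    conjuncts.partition(_.columns.subsetOf(partitionColumns))

  def choose(condition: Option[Seq[Conjunct]], partitionColumns: Set[String]): DeleteStrategy =
    condition match {
      case None => DropAllFiles
      case Some(conjuncts) =>
        val (_, dataPredicates) = split(conjuncts, partitionColumns)
        if (dataPredicates.isEmpty) DropMatchingPartitions else ScanAndRewrite
    }
}
```

A condition on partition columns alone resolves to Case 2, which works purely on log metadata. The `name = 'Justin'` delete from this article runs against a table with no partition columns, so it has to scan and rewrite. Even in Case 3, any partition predicates still narrow the candidate files before the scan.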

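The Grand Finale

This completes the source-level walk-through of the whole Delta Lake delete flow. Readers can go back through the source to reinforce what they have learned.

This also wraps up the material on Delta Lake inserts, deletes, and updates. Readers can review the three articles in this series together, or get involved in developing the Delta Lake project itself, for example by contributing SQL support for insert, delete, and update operations or features such as access control.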
