spark shuffer介绍，和操作

程序员文章站 2022-06-07 23:15:21

...

一.序言

简单copy下来的，记录一下，翻译有问题请指出。

Shuffle operations

Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

操作 spark 触发的事件里面，包含shuffle，shuffle是spark 通过跨分区操作来新打乱数据的一种方式。

通常会包含 executors和machines 之间拷贝数据，导致shuffle是一件非常昂贵的操作。

Background

To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.

要了解shuffle 过程中发生了什么，我们可以参考reduceByKey 的例子。reduceByKey 操作会产生一个新的RDD，并且按key 进行合并到一个tuple(类似:map), 按key 执行reduce函数能得到执行结果。面临的挑战是不是所有的key 都分布在同一个分区，甚至同一台机器。但是他们必须合并才能得到结果。

In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.

在spark里面，数据通常都不跨分区，在一个必要的地方执行具体的操作。在计算期间，一个单一的任务将在单一的分区上操作，因此整理的数据都会在一个 reduceByKey reduce 任务执行。spark 需要去执行所有的这些操作。它必须从所有的分区找到所有的keys，然后汇集这些数据根据每个key进行合并，得到最终结果。这就是shuffle。

Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:

mapPartitions to sort each partition using, for example, .sorted
repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
sortBy to make a globally ordered RDD

尽管shuffled之后每个新的分区的元素在都是确定的，但是这些元素本身是没有顺序的，如果需要获得排序后的shuffle数据，可以使用:

mapPartitions:每个分区使用排序，例如.sorted

repartitionAndSortWithinPartitions:从新分区的时候排序

sortBy to make a globally ordered RDD：做一个全局排序的RDD

Operations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations (except for counting) likegroupByKey and reduceByKey, and join operations like cogroup and join.

会导致shuffle操作的分区操作有 repartition and coalesce, “ByKey” 的操作如：groupByKey and reduceByKey, 和 join 操作如： cogroup and join.

上一篇：西汉的巅峰时期是什么时候？为何将西汉推上巅峰的汉宣帝知名度却不高？

下一篇：汉水之战中为什么两人会因为先去而吵起来呢难道都想表现自己吗

spark shuffer介绍，和操作

Hadoop和Spark的Shuffer过程对比解析

HTML5的自定义属性data-*详细介绍和JS操作实例

git和vscode操作指令介绍

MongoDB入门教程之聚合和游标操作介绍

SQL语句的数据操作语言 (DML) 和数据定义语言 (DDL)使用介绍

队列的介绍和使用（用数组实现队列操作实例）

MySQL left join操作中on和where放置条件的区别介绍

PHP获取和操作配置文件php.ini的几个函数介绍

Linux操作系统简介和流行的厂商版本介绍

Spark源码分析集群架构介绍和SparkContext源码解析

spark shuffer介绍，和操作

Hadoop和Spark的Shuffer过程对比解析

HTML5的自定义属性data-*详细介绍和JS操作实例

git和vscode操作指令介绍

MongoDB入门教程之聚合和游标操作介绍

SQL语句的数据操作语言 (DML) 和 数据定义语言 (DDL)使用介绍

队列的介绍和使用（用数组实现队列操作实例）

MySQL left join操作中on和where放置条件的区别介绍

PHP获取和操作配置文件php.ini的几个函数介绍

Linux操作系统简介和流行的厂商版本介绍

Spark源码分析 集群架构介绍和SparkContext源码解析

SQL语句的数据操作语言 (DML) 和数据定义语言 (DDL)使用介绍

Spark源码分析集群架构介绍和SparkContext源码解析