Bcache Instead of Flashcache for Ceph Object Storage
Fast SSDs are getting cheaper every year, but they are still smaller and more expensive than traditional HDD drives. HDDs, on the other hand, have much higher latency and are easily saturated. We want the storage system to have both low latency and high capacity. There’s a well-known practice for optimizing the performance of big, slow devices: caching. As most of the data on a disk is not accessed most of the time, while some percentage of it is accessed frequently, we can achieve a higher quality of service by using a small cache.
Server hardware and operating systems have a lot of caches working on different levels. Linux has a page cache for block devices, a dirent cache and an inode cache on the filesystem layer. Disks have their own cache inside. CPUs have caches. So, why not add one more persistent cache layer for a slow disk?
In this article, we’ll explain what we used, what problems we’ve had and how we solved them by replacing the block device caching software. We’ll start with an issue we’ve been having with flashcache in our Ceph cluster with an HDD backend.
The Environment
Ceph is a modern software-defined object store. It can be used in different ways, including storing virtual machine disks and providing an S3 API. We use it in several cases:
- RBD devices for virtual machines.
- CephFS for some internal applications.
- Plain RADOS object storage with a self-written client.
The last one is the one we’ll be talking about here.
Every 24-HDD server has an NVMe SSD, which is split into 48 partitions (actually, it was a RAID-1 array of two NVMes, but now we just use the two NVMe devices separately). So there’s one partition for the OSD journal and one partition for the cache of each OSD daemon.
The Flashcache
What is flashcache? It is a kernel module initially developed by Facebook, which allows caching of one drive on another drive. It is built using the device mapper and can work in four different modes: write-through, write-around, write-back and write-only. The documentation can be found here. Write-around and write-through caches are not persistent across device removals and reboots. As we need to cache read and write workloads most of the time, we used the write-back mode.
Flashcache can be added to a device that is already in use, which is one of its benefits. All you need to do is stop the service, create a cache device and start the service using the newly created virtual flashcache device. Like all device mapper-based devices, flashcache devices are named dm-[0-9] in the system.
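As a rough illustration, creating a write-back flashcache device on top of an HDD could look like this (a sketch; the device names and the "cachedev" label are placeholders, not our actual layout):

# Build a write-back flashcache device named "cachedev" from an NVMe partition
# (the cache) and an HDD (the backing device), then point the service at
# /dev/mapper/cachedev instead of the raw HDD.
flashcache_create -p back cachedev /dev/nvme0n1p3 /dev/sdb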
We have been using flashcache for a long time as a caching layer for Ceph with virtual machine disks. As described in the documentation, it was developed for random read/write workloads. We’re able to configure the «sequential threshold», though, which is the maximum size of a request in kilobytes that will be cached. All requests greater than the specified size will be passed through the cache to the slow device.
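For illustration, that threshold is a per-device sysctl; a sketch, assuming the skip_seq_thresh_kb knob and using <cachedev> as a placeholder for the generated flashcache device name:

# Cache requests up to 1 MB; larger requests go straight to the slow device.
sysctl -w dev.flashcache.<cachedev>.skip_seq_thresh_kb=1024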
Because our experience with it had been good, we tried to start using it under a different workload: with Ceph and self-written clients over RADOS.
The Problem
As soon as we started using flashcache in a different way, things changed. The first issue was with flashcache’s locking behavior. There was a bug, which led to a deadlock. Under high memory pressure, when flashcache needed to free some memory, it had to start a new thread to do so. But to start a new thread, it needed to allocate memory, which was impossible. The result of this bug was a sudden host hang.
Another problem we faced was a high rate of HDD utilization, which grew as the cluster became larger. Different disks were saturated at up to 100% for dozens of seconds at a time. To understand what was happening, we started studying the workload profile. We were actively using a Ceph cache tier at that time, and it was extremely hard to predict what kind of workload its internal mechanisms were generating, but we knew that at the very least it could promote and evict objects.
So we knew that we had cache-tier eviction and promotion, flashcache dirty block cleaning and Ceph recovery operations. All of these operations could potentially cause high HDD utilization. We started tracing HDD operations to better understand what was going on. The first result was as expected: there were a lot of recovery operations passing through the cache.
Understandably, a recovery process causes a high rate of HDD utilization because of its sequential read pattern, but we saw the same problem when the cluster was in a stable state. We then seized the opportunity to trace block requests while the utilization rate was high and no recovery was happening. We used blktrace for tracing, and btt with the bno_plot script to build 3D graphs:
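Reconstructing that pipeline, the commands would look roughly like this (a sketch; the device name and file prefixes are placeholders):

# Trace the HDD for 60 seconds, merge the per-CPU traces, dump per-process
# block-number data with btt and feed it to bno_plot for a 3D gnuplot graph.
blktrace -d /dev/sdb -o sdb -w 60
blkparse -i sdb -d sdb.blktrace.bin
btt -i sdb.blktrace.bin -B sdb_bno
bno_plot.py sdb_bno*.dat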
There are three main threads working with the HDD at the same time in this picture:
- The flashcache dirty block cleaning thread (kworker on the image), which was writing to the disk.
- The Ceph OSD filestore thread, which was reading from and asynchronously writing to the disk.
- The filestore sync thread, which was sending fdatasync() to the dirty blocks when the OSD journal had to be cleared.
What does all this mean? It means that sometimes we had a sequential workload from the Ceph OSD daemon. And sometimes we had flashcache dirty block cleaning operations. When these operations happened in the same period of time, we faced HDD device saturation.
Tuning Flashcache
Flashcache’s cache structure is a set-associative hash. The cache device is divided into parts called sets. Each set is divided into blocks. Slow devices are divided into parts, each of which has an associated set. As the caching disk is much smaller, every set is associated with more than one slow device part.
When flashcache tries to put some data on the disk, it has to find a set for it. Then it has to find a place inside that set on the cache device where the new block can be stored. While the cache set lookup is fast because it’s mapped using the block number, block lookup inside the set is just a linear search for a free (or clean) slot.
As you can see from the graphs, we have to make flashcache deal with dirty blocks more gently. But the problem here is that all the settings around dirty block cleaning are related to sets. There’s no IO feedback control mechanism that could help us. According to the documentation:
dev.flashcache.<cachedev>.fallow_delay = 900
In seconds. Clean dirty blocks that have been "idle" (not read or written) for fallow_delay seconds. Default is 15 minutes. Setting this to 0 disables idle cleaning completely.
dev.flashcache.<cachedev>.fallow_clean_speed = 2
The maximum number of "fallow clean" disk writes per set per second. Defaults to 2.
...
(There is little reason to tune these)
dev.flashcache.<cachedev>.max_clean_ios_set = 2
Maximum writes that can be issued per set when cleaning blocks.
dev.flashcache.<cachedev>.max_clean_ios_total = 4
Maximum writes that can be issued when syncing all blocks.
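For reference, this is roughly how those knobs would be changed (a sketch; the values are only illustrative and <cachedev> again stands for the generated device name):

# Try to make fallow cleaning start sooner but write less aggressively.
sysctl -w dev.flashcache.<cachedev>.fallow_delay=60
sysctl -w dev.flashcache.<cachedev>.fallow_clean_speed=1
sysctl -w dev.flashcache.<cachedev>.max_clean_ios_set=1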
Changing all these settings didn’t help us to reduce the load flashcache generated on the HDDs while cleaning dirty blocks.
Another problem we mentioned earlier was that we had a high sequential load passing through the cache (green X marks on the last image). It is possible to set a higher sequential threshold, which should have helped us at least to cache more sequential writes and reduce HDD utilization.
Unfortunately, it didn’t help because of flashcache’s architecture. If there are a lot of writes to a cache set, it will be overfilled with dirty blocks. And if there’s no place to cache a new data block, for example, if all blocks in the current set are dirty, then this block is placed on the HDD device. This behavior is called a conflict miss or a collision miss and is a known problem of set-associative caches.
Flashcache Operations Issues
Flashcache configuration management is hard. In fact, it uses sysctl for configuration. For example, if we have md2p9 as a cache device for ata-HGST_HUS726040ALE614_N8G8TN1T-part2, all Flashcache device options look like:
…
Imagine you are going to change the caching or the cached device: a new flashcache device will be created, and you’ll have to change your sysctl.d configuration file. You need to remove the previously used configuration and add new options for the new flashcache device. So it’s easier to have a custom script that deletes the old file and creates a new one, getting device names from the /proc/flashcache directory.
Then, to reduce the human factor, there should be a udev rule, which calls the script on a «dm» device «add» or «change» event. If you want to change some settings for some hosts, or some devices on a host (which is a rare case, but possible), you have to customize the script using a configuration management system. All these manipulations become complex and unintuitive.
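A sketch of what such a udev rule could look like (the rule file and script path are hypothetical, not our actual names):

# /etc/udev/rules.d/99-flashcache-sysctl.rules
# Regenerate the flashcache sysctl.d file whenever a device-mapper device
# appears or changes.
ACTION=="add|change", KERNEL=="dm-*", RUN+="/usr/local/sbin/update-flashcache-sysctl.sh"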
Another drawback of flashcache is that it’s not in the kernel mainline. When the kernel API changes, it’s time to add one more
#if LINUX_VERSION_CODE < KERNEL_VERSION (your version here)
to the sources, add some code and rebuild the module. All of these problems forced us to look for another solution.
Bcache
Bcache is different from flashcache. It was initially developed by Kent Overstreet and is now actively maintained by Coly Li. It is merged into the kernel mainline, so we don’t need to worry too much about testing it with newer kernels.
Bcache doesn’t use the device mapper; it is a separate virtual device. Like flashcache, it consists of three devices:
- backing device — a slow, cached device;
- cache device — your fast NVMe;
- bcache device — the device used by an application.
For more information on this, refer to the documentation.
It supports write-back, write-through and write-around caching. As we mentioned earlier, good write-back caching is what we were looking for. Changing the cache mode is easy: we just have to write a new mode name to the sysfs file. These sysfs trees are created when new bcache devices are created.
echo writeback > /sys/class/block/bcache0/bcache/cache_mode
It looks much easier than configuring every device via a dynamically changed sysctl configuration file. And it is much easier, because it’s possible to use udev rules to change settings, the same way as for a standard block device!
It is possible to create a bcache device with more than one backing device, but we don’t use this functionality. Instead, we split NVMe drives for every cache.
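Setting up one bcache device is simple; a minimal sketch with placeholder device names (the cache-set UUID is whatever make-bcache reports):

# Format the NVMe partition as a cache and the HDD as a backing device,
# then attach the cache set to the resulting /dev/bcache0.
make-bcache -C /dev/nvme0n1p3
make-bcache -B /dev/sdb
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach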
How Bcache Works
Bcache doesn’t use standard cache allocation policies such as direct-mapped or set-associative. Instead, it operates on a B+tree. In this tree, it stores keys and pointers to the allocated data blocks. When it has some data to write, it prepares a key and allocates space for that data.
After the data is written, the keys are inserted into the btree. When there’s a read request to be served, bcache looks for a key in the btree. If there’s no such key, it just returns zeros; if a key is found, the data is gathered via the pointers. This request processing method makes conflict miss events impossible. The documentation can be viewed here. It’s not comprehensive, but it explains the basic principles of how bcache operates.
We mentioned earlier that we want to control the write-back rate, and bcache makes this possible. It uses a PI controller to calculate the write-back rate. The rate of write-back operations depends on the difference between the current amount of dirty data and the threshold: the more the dirty data exceeds the threshold, the higher the write-back rate becomes. It also depends on how long the dirty data has exceeded the threshold: the longer it stays over the threshold, the higher the write-back rate becomes. You may find basic information on how PID controllers work on Wikipedia.
With modern kernels, we can set the minimum write-back rate, which is the lowest value the rate can take even if the amount of dirty blocks has not reached the threshold yet. And there are two more settings: writeback_rate_p_term_inverse and writeback_rate_i_term_inverse. Both of these settings affect how fast the PI controller responds to the write load.
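All of these knobs sit in sysfs next to the other per-device settings; a sketch with illustrative values (bcache6 as in our later examples):

# Floor for the write-back rate even while dirty data is below the threshold.
echo 4096 > /sys/class/block/bcache6/bcache/writeback_rate_minimum
# Inverse proportional and integral terms of the PI controller.
echo 40 > /sys/class/block/bcache6/bcache/writeback_rate_p_term_inverse
echo 10000 > /sys/class/block/bcache6/bcache/writeback_rate_i_term_inverse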
These features make bcache a really interesting solution, so we decided to test it a bit and try to run it in production.
Bcache Testing and Tuning
In this section we’ll be showing a few tuning options, some test results and fdatasync() behavior. Let’s start with basic tuning.
We already said that we need bcache to cache both write and read requests. So the first thing to do is to set the write-back mode. Then turn sequential_cutoff off. This option tells bcache to pass requests with a size greater than the specified value straight through to the backing device; a value of 0 turns this mechanism off.
~$ echo writeback > /sys/class/block/bcache6/bcache/cache_mode
~$ echo 0 > /sys/class/block/bcache6/bcache/sequential_cutoff
There are also congestion thresholds in the bcache/cache directory, but we don’t change these because they are high enough for us.
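They can be inspected (or disabled by writing 0) through the cache set directory; a sketch, assuming the congested_*_threshold_us names:

cat /sys/class/block/bcache6/bcache/cache/congested_read_threshold_us
cat /sys/class/block/bcache6/bcache/cache/congested_write_threshold_us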
We used fio for testing:
[test]
ioengine=libaio
rw=randwrite
bs=4k
filename=/dev/bcache6
direct=1
fdatasync=10
write_iops_log=default
log_avg_msec=1000
filesize=8G
As you may see, we used more or less default options. The direct=1 option prevents the use of the page cache and enables the libaio engine to work properly. An fdatasync system call will be issued after every 10 write operations to flush the device’s cache. And a few options ask fio to write IOPS results to a file, averaged over a second. This workload, of course, looks like a corner case, because in real life we don’t have this constant single-thread random write load all the time. But it’ll show us something about bcache behavior.
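Assuming the job file is saved as test.fio, running it and collecting the per-second IOPS log would look roughly like this (the log name follows fio's <name>_iops.<job>.log convention):

fio test.fio
# default_iops.1.log now holds one averaged IOPS sample per second,
# which is convenient for plotting the rate over time.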
We’ve done a few tests with two types of starting cache state: a clean cache, and a cache that had slowed its write-back rate to the minimum after the previous test.
We cleaned the cache, and stats before testing were as follows:
rate: 4.0k/sec
dirty: 0.0k
target: 5.3G
proportional: -136.2M
integral: -64.5k
change: 0.0k/sec
next io: -4035946ms
These stats are from cat /sys/class/block/bcache6/bcache/writeback_rate_debug, and we’ll be showing them in this section. We started the test with default bcache settings. Let’s look at the results:
This is interesting. In the beginning, the cache was empty and we had about 25K write operations per second. But when the amount of dirty data reached the threshold, the write request rate slowed down. How did that happen? The answer is fdatasync().
All flush requests are sent to the backing device too. When the number of dirty blocks rises above the threshold, bcache increases the write-back rate and writes data to the backing device. When the backing device is loaded with writes, it starts responding to the flush requests more slowly. You may run the same test and check it with iostat. It’ll show that a lot of the write requests to the backing device are the flushes, unless you use kernel >= 5.5, which has a specific counter for flush requests.
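A quick way to watch this on a live host is iostat; a sketch, assuming a sysstat recent enough to show the flush columns (f/s, f_await) on kernel >= 5.5:

iostat -x 1 /dev/sdb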
The write-back rate increased fast and the backing device was heavily loaded.
OK, let’s start this test once more, but now with different starting conditions:
rate: 4.0k/sec
dirty: 5.1G
target: 5.3G
proportional: -6.4M
integral: 6.3M
change: -26.0k/sec
next io: 552ms
The number of dirty blocks is close to the threshold, so we expect it to reach the threshold quickly; bcache should react faster and the slowdown shouldn’t be so aggressive.
And so it is! OK, we’ve got the idea. Let’s change something that should have an effect on the write-back behavior. We mentioned the writeback_rate_p_term_inverse and writeback_rate_i_term_inverse options earlier. As you might have guessed, these options are the inverse proportional and inverse integral terms.
~$ echo 180 > /sys/class/block/bcache6/bcache/writeback_rate_p_term_inverse
The default «p» is 40, so this change should significantly affect the write-back behavior, making it less reactive. We started the test again with a clean cache.
State of write-back after it slowed the rate to the minimum:
rate: 4.0k/sec
dirty: 4.7G
target: 5.3G
proportional: -3.1M
integral: -39.5k
change: 0.0k/sec
next io: 357ms
And test again.
It’s easy to see that bcache was much more gentle with its write-back rate. By the way, it was gentle in both directions, increasing and decreasing. Take a look at the number of dirty bytes after the first test with default settings and after the first test with the changed «p» term. There were fewer dirty blocks left when the minimum rate was reached in the latter one; it just took more time with a high inverse «p» term to decrease the write-back rate to the default.
Let’s play with the integral term a bit, too. The integral term shouldn’t play such a big role in the current conditions. The initial conditions are the same as for the previous tests.
echo 30000 > /sys/class/block/bcache6/bcache/writeback_rate_i_term_inverse
State of write-back after it slowed down the rate to the minimum:
rate: 4.0k/sec
dirty: 5.2G
target: 5.3G
proportional: -2.3M
integral: -0.5k
change: 0.0k/sec
next io: 738ms
And the second start:
It can be seen that the integral term doesn’t play such a big role here. In fact, the integral term affects the PI controller’s speed and helps to reduce noise from the input, for example short bursts of write operations on the bcache device. Playing with the PI controller is really interesting; try it out with your real workload.
The good thing is that it’s possible to cache all the write requests, whether sequential or random, and the write-back rate depends on the load and on how full the cache is. This is exactly what we were looking for.
Conclusion
We decided to use bcache in production and it didn’t let us down. Bcache in write-back mode works much better than flashcache did. We are also able to use relatively smaller cache devices than flashcache allowed us to.
We’ve started to set up all new hosts with bcache and have seen many improvements in the system’s behavior. We’ve never seen those long stretches of HDD saturation with bcache. When the whole system was under a high load, hosts with bcache showed much better latency than hosts with flashcache, which suffered from HDD saturation.