Aerospike buffering and caching mechanisms

Buffering and Caching in Aerospike
https://discuss.aerospike.com/t/buffering-and-caching-in-aerospike/5623

current write block

When a record is written and needs to be stored on disk,
it is put into an in-RAM buffer that holds the current write block,
the write block that asd is currently filling up.
When the current write block is full, then it is persisted to disk and asd starts a new write block.
Thus, all writes to a write block are coalesced into a single, write-block-size device write.
This leads to low write IOPS.

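Summary: the current write block is an in-RAM write buffer. Once it is full, it is written to disk and a new write block is started.
All writes to one write block are coalesced into a single device write,
which lowers write IOPS.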
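
The coalescing described above can be illustrated with a toy model. This is only a sketch, not asd's implementation: the class `WriteBlockBuffer` and its methods are hypothetical names, the device is modeled as a Python list, and unlike asd the model does not pad a block to the full write-block-size.

```python
class WriteBlockBuffer:
    """Toy model of the current write block (hypothetical, not asd code).

    Records accumulate in RAM and reach the "device" only as whole blocks.
    """

    def __init__(self, block_size: int):
        self.block_size = block_size
        self.current = bytearray()  # the current write block, held in RAM
        self.device_writes = []     # each entry models one device write

    def write_record(self, record: bytes) -> None:
        if len(record) > self.block_size:
            raise ValueError("record larger than a write block")
        # If the record does not fit, persist the block and start a new one.
        if len(self.current) + len(record) > self.block_size:
            self.flush()
        self.current += record

    def flush(self) -> None:
        # Persist whatever is buffered as a single device write.
        if self.current:
            self.device_writes.append(bytes(self.current))
            self.current = bytearray()


buf = WriteBlockBuffer(block_size=1024)
for _ in range(100):
    buf.write_record(b"x" * 100)  # 100 record writes of 100 bytes each
buf.flush()
print(len(buf.device_writes), "device writes for 100 record writes")
```

In this model, many record writes turn into far fewer device writes, which is the effect the text refers to as low write IOPS.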

post-write queue

The most recently persisted write blocks are kept in RAM after being written to disk.
The idea is that subsequent reads are more likely to hit recently written records than older records.
In particular, XDR would hit recently written records.
If such a record is read, its data can be retrieved from the write block in the post-write-queue
that contains the record’s data rather than having to be retrieved from the device.

Summary: the post-write queue caches recently written data in RAM, so reads of
such data are served directly from the queue instead of from the device.

If data-in-memory is set to true, this feature is not needed.
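
A minimal sketch of the idea, assuming a fixed number of retained blocks: the class `PostWriteQueue`, its methods, and the `read_from_device` callback are hypothetical names for illustration and do not correspond to asd internals.

```python
from collections import deque


class PostWriteQueue:
    """Toy model: keep the most recently persisted write blocks in RAM."""

    def __init__(self, max_blocks: int):
        # A deque with maxlen silently drops the oldest block when full.
        self.blocks = deque(maxlen=max_blocks)

    def block_persisted(self, block_id: int, records: dict) -> None:
        # Called after a block has already been written to the device.
        self.blocks.append((block_id, records))

    def read(self, key, read_from_device):
        # Search the newest blocks first; fall back to the device on a miss.
        for _, records in reversed(self.blocks):
            if key in records:
                return records[key], "post-write-queue"
        return read_from_device(key), "device"


device = {"a": b"1", "b": b"2", "c": b"3"}
pwq = PostWriteQueue(max_blocks=2)
pwq.block_persisted(1, {"a": b"1"})
pwq.block_persisted(2, {"b": b"2"})
pwq.block_persisted(3, {"c": b"3"})  # block 1 falls out of the queue
print(pwq.read("c", device.__getitem__))
print(pwq.read("a", device.__getitem__))
```

The queue only ever holds blocks that are already on the device, so dropping a block from it never loses data.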

page cache
In general, the Linux kernel does any device I/O via the page cache.
When a process writes data, the data goes to the page cache,
i.e., to RAM, and the Linux kernel takes care of writing it to the device asynchronously at a later point in time.
The page cache operates with a granularity of 4-KiB pages.
So, if two small writes hit the same 4-KiB page in rapid succession,
these two writes will be coalesced into one 4-KiB write later,
when the Linux kernel decides to asynchronously write the page to the underlying device.
Between data in a page getting modified (in RAM)
and the page actually getting written to disk, the page is said to be dirty.

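Summary: Linux generally performs device I/O through the page cache, which works in 4-KiB pages.
Written data first lands in the page cache, and the kernel writes it to the device asynchronously later.
Two small writes that hit the same 4-KiB page in quick succession are therefore coalesced into one 4-KiB device write.

A page that has been modified in RAM but not yet written to the underlying device is called a dirty page.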
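
The page-granular coalescing can be sketched as follows. This is a simplified model under stated assumptions, not kernel code: `DirtyPageTracker` is a hypothetical name, a new page starts out zero-filled (the real kernel would first read the existing page contents when a write covers only part of a page), and writeback is triggered by hand.

```python
PAGE_SIZE = 4096  # page cache granularity named in the text (4 KiB)


class DirtyPageTracker:
    """Toy model of dirty pages: writes modify pages in RAM only."""

    def __init__(self):
        self.dirty = {}          # page number -> page contents in RAM
        self.device_writes = 0   # number of page-sized device writes

    def write(self, offset: int, data: bytes) -> None:
        # Copy the data byte by byte into the page(s) it falls on.
        for i, byte in enumerate(data):
            page_no, page_off = divmod(offset + i, PAGE_SIZE)
            page = self.dirty.setdefault(page_no, bytearray(PAGE_SIZE))
            page[page_off] = byte

    def writeback(self) -> None:
        # One device write per dirty page, however often it was modified.
        self.device_writes += len(self.dirty)
        self.dirty.clear()


tracker = DirtyPageTracker()
tracker.write(0, b"a" * 100)    # first small write
tracker.write(200, b"b" * 100)  # second small write, same 4-KiB page
tracker.writeback()
print(tracker.device_writes)    # both writes share one page write
```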

Reads also go through the page cache.
When a process reads data from a device,
the Linux kernel reads the 4-KiB pages that hold the data into the page cache, i.e., to RAM.
From there, the Linux kernel copies the data to the read buffer provided by the process.

Summary: reads go through the page cache too.
The kernel reads the 4-KiB pages that hold the data into RAM,
then copies the data into the read buffer provided by the process.

The page cache uses least-recently-used eviction.
The pages that contain data that was recently read or written are thus kept in RAM,
so that subsequent reads hopefully won’t have to go to the underlying device,
but will find their data already in the page cache.

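Summary: the page cache uses LRU eviction.
Recently read or written pages stay in RAM, so subsequent reads can hit the page cache instead of going to the underlying device.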
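
Least-recently-used eviction can be sketched with an ordered dictionary. This is a plain textbook LRU for illustration; the kernel's real page replacement is more involved and only approximates it. `LruPageCache` and the `read_page_from_device` callback are hypothetical names.

```python
from collections import OrderedDict


class LruPageCache:
    """Toy LRU cache of pages, keyed by page number."""

    def __init__(self, capacity_pages: int, read_page_from_device):
        self.capacity = capacity_pages
        self.read_page_from_device = read_page_from_device
        self.pages = OrderedDict()  # least recently used entry comes first
        self.device_reads = 0

    def read_page(self, page_no: int):
        if page_no in self.pages:
            # Cache hit: mark the page as most recently used.
            self.pages.move_to_end(page_no)
            return self.pages[page_no]
        # Cache miss: go to the device, then cache the page.
        self.device_reads += 1
        data = self.read_page_from_device(page_no)
        self.pages[page_no] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used
        return data


cache = LruPageCache(capacity_pages=2,
                     read_page_from_device=lambda n: f"page-{n}")
for page_no in (1, 2, 1, 3, 2):
    cache.read_page(page_no)
print(cache.device_reads, list(cache.pages))
```

In this run, the second read of page 1 is a hit, while page 2 has been evicted by the time it is read again and must be fetched from the device a second time.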

The page cache’s lifetime is bounded by the Linux kernel’s lifetime.
The data will be safe even if a process crashes after writing data to the page cache
but before the Linux kernel actually writes the data to disk.
The page cache is system-wide and not bound to a process.
The page cache only loses data when the kernel panics
or if there is a sudden power loss, in other terms,
when the kernel doesn’t get cleanly shut down.
A clean OS shutdown will write all dirty pages to disk.

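Summary: the page cache is system-wide.
It loses data only on a kernel panic or a sudden power loss.
Data that a process wrote to the page cache stays safe even if that process then crashes.
A clean OS shutdown writes all dirty pages to disk.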

hardware caches
Disk devices and controllers can also contain caches.
Again, the idea is to coalesce writes and to keep recently read or written data in RAM.
The difference is just that for hardware caches, the RAM sits on the disk device or the controller.
How exactly these caches work and which guarantees they come with differs from device to device.
Sometimes the cache of a device can also be configured,
i.e., it allows selecting from a set of different behaviors.
Some of the caches are battery-backed,
so that a sudden power loss would not cause data loss, others aren’t.

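Summary: disk devices and controllers can contain caches too.
They coalesce writes and keep recently read or written data in RAM
that sits on the disk device or the controller.
How a hardware cache works depends on the specific device.

Some device caches can be configured to select among different behaviors.
Some caches are battery-backed, so a sudden power loss does not cause data loss.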

Cache hierarchy:
The cache hierarchy uses the following 3 layers:

the current write block
the page cache
the hardware caches

All three of them can delay the persistence of written record data,
temporarily keeping the data in RAM,
where it can be affected by unexpected events such as a sudden loss of power.
The post-write-queue doesn’t factor into this,
as it only keeps already written data around,
but doesn’t delay the data on its way to persistent storage.

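Summary: all three layers can delay writes on their way to persistent storage.
The post-write-queue causes no delay; it only caches recently written data to speed up reads.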

These three buffers and caches form three layers of a hierarchy that written data moves through.

As the first layer, asd keeps data in the current write block
(unless commit-to-device is set to true).

When asd decides to actually persist a write block,
the page cache comes into play as a second layer and may further delay persistence.

Once the Linux kernel decides to write the data from the page cache to the underlying device,
the hardware caches come into play as a third layer and may further delay persistence.

Finally, the device’s firmware will decide to move the data from the hardware cache to persistent media.
Only then will the data be safe from any unexpected events such as a sudden power loss.

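Summary: data is safe from unexpected events such as a sudden power loss
only after the device's firmware has moved it from the hardware cache to persistent media.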

Related configuration

Aerospike server version 4.3.1 and above:

By default, reads from and writes to devices use O_DIRECT and O_DSYNC.
This bypasses the latter two layers in the three-layer cache hierarchy,
the page cache and the hardware caches.
However, data can still be lost in the first layer, the current write block.
This buffer loss window is bounded, though,
by how often asd writes partial write blocks to the underlying device.
Refer to the flush-max-ms configuration directive.
Therefore, the buffer loss window (in case of a crash or power loss) can be quantified and controlled.

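Summary: by default, device reads and writes use O_DIRECT and O_DSYNC,
which bypasses the page cache and the hardware caches.
Data can still be lost in the current write block, but that loss window is bounded
by how often asd flushes partial write blocks (see flush-max-ms),
so it can be quantified and controlled.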
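
As a rough illustration of what "using O_DIRECT and O_DSYNC" means at the system-call level, the flags are passed when the device or file is opened. The helper `storage_open_flags` below is a hypothetical name, not something asd exposes; the sketch only composes the flag value with Python's `os` constants and does not perform any I/O, since direct I/O additionally requires suitably aligned buffers.

```python
import os


def storage_open_flags(direct: bool) -> int:
    """Compose open(2) flags for a storage device or file (illustrative)."""
    flags = os.O_RDWR
    if direct:
        # O_DIRECT and O_DSYNC are the two flags the text names.
        # getattr() falls back to 0 on platforms that lack the constant,
        # which is an accommodation of this sketch only.
        flags |= getattr(os, "O_DIRECT", 0)
        flags |= getattr(os, "O_DSYNC", 0)
    return flags


print(oct(storage_open_flags(direct=False)))
print(oct(storage_open_flags(direct=True)))
```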

By default, reads from and writes to files don’t use any flags.
Therefore, caching applies to both. All reads are cached in the page cache,
and for writes data can theoretically be lost in the page cache or in the hardware caches.
As with devices, data can also be lost in the first layer.
However, while the loss window in the first layer is bounded
(refer to the flush-max-ms configuration directive),
there are no definitive bounds regarding the page cache or the hardware caches.
So, this default has slightly worse guarantees than the default for devices.
As mentioned previously in this article,
though, data loss in the second and third layer requires a kernel panic or sudden power loss.
If asd crashes, the data will be preserved in these layers,
even with these somewhat lesser guarantees.

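Summary: by default, reads from and writes to files use no flags, so all three cache layers are in effect.
Written data can theoretically be lost in the page cache or in the hardware caches
and, as with devices, in the first layer.
The first-layer loss window is bounded (see flush-max-ms),
but the page cache and the hardware caches have no definite bounds,
so this default gives slightly weaker guarantees than the default for devices.
Data loss in the second and third layers requires a kernel panic or sudden power loss, though.
If only asd crashes, the data in those layers is preserved.

Note the difference between devices and files: with devices, the latter two cache layers are bypassed by default.
With files, all three layers are in effect, so the risk of data loss is higher than with devices.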

direct-files
This configuration directive enables O_DIRECT and O_DSYNC for files,
i.e., it brings the parameters for files in line with the defaults for devices.
Reads and writes now bypass the page cache and the hardware caches.
Data can still be lost in the first layer, though,
just like for devices, which the next configuration directive addresses.

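Summary: direct-files enables O_DIRECT and O_DSYNC for files, matching the default for devices.
Reads and writes then bypass the page cache and the hardware caches.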

flush-max-ms

This directive configures the interval at which asd writes a
partially filled current write block to device (in milliseconds).
This bounds the buffer loss window in layer one of the cache hierarchy,
the current write block, to the given millisecond interval.
Note that this only applies to the first layer of the cache hierarchy.
If it is used with files without also enabling direct-files,
then caching still happens in the page cache and in the hardware caches.

Summary: flush-max-ms is the flush interval of the write buffer, in milliseconds. It affects only the first cache layer.
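
To make "quantified" concrete, here is a back-of-the-envelope bound on how much data can sit unflushed in the first layer. It rests on simplifying assumptions that are not from the source: a single current write block, a steady write rate, and a flush of the partial block at least every flush-max-ms. The function name and the example numbers are hypothetical.

```python
def max_unflushed_bytes(write_rate_bytes_per_s: int,
                        flush_max_ms: int,
                        write_block_size: int) -> int:
    """Rough upper bound on data held only in the current write block."""
    # Data that can accumulate between two timed flushes.
    by_time = write_rate_bytes_per_s * flush_max_ms // 1000
    # A full block is persisted immediately, so the block size caps it.
    return min(by_time, write_block_size)


KIB = 1024
MIB = 1024 * KIB

# Slow writer: the flush interval is the limiting factor.
print(max_unflushed_bytes(200 * KIB, 1000, 1 * MIB))
# Fast writer: the block fills up before the interval expires.
print(max_unflushed_bytes(5 * MIB, 1000, 1 * MIB))
```

Under these assumptions, lowering flush-max-ms shrinks the window for slow writers, while for fast writers the write block size is what limits it.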

commit-to-device
This configuration directive takes flush-max-ms to its logical conclusion:
synchronously write record data to the underlying device during a write transaction.
In contrast to flush-max-ms, though, this affects all three layers of the cache hierarchy:
if O_DIRECT and O_DSYNC aren’t enabled yet, this will enable them.
For devices, O_DIRECT and O_DSYNC are enabled by default,
so this aspect of commit-to-device only applies to files.
This also means that the direct-files directive is not needed
when using commit-to-device with files.
In any case, this configuration directive disables caching in all three layers of the hierarchy.

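Summary: with commit-to-device, record data is written synchronously to the underlying device during the write transaction.
Unlike flush-max-ms, it affects all three layers of the cache hierarchy:
if O_DIRECT and O_DSYNC are not enabled yet, it enables them.
Devices have O_DIRECT and O_DSYNC enabled by default, so that aspect only matters for files.

Setting it to true disables caching in all layers.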
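
The rules stated so far for server 4.3.1 and above can be condensed into a small decision function. This is only a restatement of the text as code, not anything asd provides; the function name, the parameter names, and the layer labels are chosen for illustration.

```python
def layers_delaying_persistence(storage: str,
                                direct_files: bool = False,
                                commit_to_device: bool = False) -> list:
    """Return the cache layers that may still delay a write.

    storage is "device" or "file"; the flags mirror the configuration
    directives direct-files and commit-to-device described in the text.
    """
    if storage not in ("device", "file"):
        raise ValueError("storage must be 'device' or 'file'")

    layers = []
    # commit-to-device writes synchronously, bypassing the write buffer.
    if not commit_to_device:
        layers.append("current write block")

    # O_DIRECT and O_DSYNC: default for devices, opt-in for files,
    # and implied by commit-to-device.
    direct_io = storage == "device" or direct_files or commit_to_device
    if not direct_io:
        layers += ["page cache", "hardware caches"]
    return layers


print(layers_delaying_persistence("device"))
print(layers_delaying_persistence("file"))
print(layers_delaying_persistence("file", direct_files=True))
print(layers_delaying_persistence("file", commit_to_device=True))
```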

read-page-cache
This configuration directive removes O_DIRECT and O_DSYNC for record reads done by transactions.
This means that the read data will not only go to asd, but also into the page cache.
If the same record gets read again by a subsequent transaction,
it will not need to go to the device to get it,
but the read will be satisfied from the page cache.
This configuration directive doesn’t change anything about writes,
i.e., it doesn’t affect any write guarantees established
by the above configuration options, say, commit-to-device.

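Summary: with read-page-cache, read data goes not only to asd but also into the page cache.
This directive does not change anything about writes.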

So, if reads are cached in the page cache, but writes bypass the page cache,
won’t reads potentially read stale cached data?
This is not an issue, because the Linux kernel guarantees page cache coherence.
Even though in our scenario writes aren’t cached,
a write of some data invalidates a cached copy of that data in the page cache.
Therefore, writes don’t overwrite data in the page cache,
but they do invalidate it, if needed.
A subsequent read of the written data is thus forced to hit the device again
and read the fresh data, not a stale version of it.

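Summary: if reads use the page cache while writes bypass it, can a read return stale cached data?
No, because the Linux kernel guarantees page cache coherence.
Even though the writes are not cached, a write invalidates any cached copy of that data in the page cache.
Writes do not overwrite data in the page cache, but they do invalidate it when needed,
so the next read of that data is forced to hit the device and gets the fresh version.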
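
The invalidate-on-write behavior can be modeled in a few lines. This is a toy model of the guarantee as the text describes it, not of how the kernel implements it; `CoherentStorage` and its methods are hypothetical names, and pages are plain dictionary entries.

```python
class CoherentStorage:
    """Toy model: cached reads combined with cache-bypassing writes."""

    def __init__(self):
        self.device = {}      # page number -> data on the device
        self.page_cache = {}  # page number -> cached copy in RAM
        self.device_reads = 0

    def direct_write(self, page_no: int, data: bytes) -> None:
        # The write goes straight to the device and is not cached...
        self.device[page_no] = data
        # ...but any cached copy of that page is invalidated.
        self.page_cache.pop(page_no, None)

    def cached_read(self, page_no: int) -> bytes:
        if page_no not in self.page_cache:
            self.device_reads += 1
            self.page_cache[page_no] = self.device[page_no]
        return self.page_cache[page_no]


storage = CoherentStorage()
storage.direct_write(7, b"v1")
storage.cached_read(7)           # miss: read from the device, then cached
storage.cached_read(7)           # hit: served from the page cache
storage.direct_write(7, b"v2")   # invalidates the cached copy
print(storage.cached_read(7), storage.device_reads)
```

After the second write, the read returns the new value and costs one more device read, which matches the behavior described above.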
