Buffering and Caching in Aerospike
current write block
When a record is written and needs to be stored on disk,
it is put into an in-RAM buffer that holds the current write block,
the write block that asd is currently filling up.
When the current write block is full, it is persisted to disk and asd starts a new write block.
Thus, all writes to a write block are coalesced into a single, write-block-size device write.
This leads to low write IOPS.
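The coalescing behavior can be sketched as follows. This is an illustrative model only, not Aerospike's implementation; the class name, block size, and `device_write` callback are hypothetical:

```python
class WriteBlock:
    """Sketch: records are appended to an in-RAM block; the device
    only ever sees one large write per full block."""

    def __init__(self, device_write, block_size=1024 * 1024):
        self.device_write = device_write  # invoked once per full block
        self.block_size = block_size
        self.buf = bytearray()            # the current write block, in RAM

    def write_record(self, record: bytes) -> None:
        # If the record doesn't fit, persist the block and start a new one.
        if len(self.buf) + len(record) > self.block_size:
            self.flush()
        self.buf += record                # coalesce the record into the block

    def flush(self) -> None:
        if self.buf:
            # One block-sized device write for many record writes.
            self.device_write(bytes(self.buf))
            self.buf = bytearray()
```

Many small record writes thus turn into few large device writes, which is what keeps the write IOPS low.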
The most recently persisted write blocks are kept in RAM after being written to disk.
The idea is that subsequent reads are more likely to hit recently written records than older records.
In particular, XDR would hit recently written records.
If such a record is read, its data can be retrieved from the write block in the post-write-queue
that contains the record’s data rather than having to be retrieved from the device.
If data-in-memory is set to true, this feature is not needed.
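The lookup logic can be sketched like this. Again, this is an illustrative model, not asd's actual data structures; the class, capacity, and `device_read` callback are hypothetical:

```python
from collections import OrderedDict

class PostWriteQueue:
    """Sketch: keep the most recently persisted write blocks in RAM so
    that reads of freshly written records avoid device I/O."""

    def __init__(self, capacity: int = 256):
        self.capacity = capacity
        self.blocks: "OrderedDict[int, bytes]" = OrderedDict()

    def block_persisted(self, block_id: int, data: bytes) -> None:
        # Called after a full write block has been flushed to the device.
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the oldest block

    def read(self, block_id: int, offset: int, size: int, device_read):
        # Serve the read from RAM if the block is still queued,
        # otherwise fall back to the device.
        if block_id in self.blocks:
            return self.blocks[block_id][offset:offset + size]
        return device_read(block_id, offset, size)
```

Note that the queue only caches data that is already on the device; evicting a block never loses data.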
In general, the Linux kernel does any device I/O via the page cache.
When a process writes data, the data goes to the page cache,
i.e., to RAM, and the Linux kernel takes care of writing it to the device asynchronously at a later point in time.
The page cache operates with a granularity of 4-KiB pages.
So, if two small writes hit the same 4-KiB page in rapid succession,
these two writes will be coalesced into one 4-KiB write later,
when the Linux kernel decides to asynchronously write the page to the underlying device.
Between data in a page getting modified (in RAM)
and the page actually getting written to disk, the page is said to be dirty.
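This behavior is easy to observe from user space. In the sketch below, the write() system call returns as soon as the data is in the page cache; fsync() is what actually forces the dirty pages to the device (the file name is arbitrary):

```python
import os

# write() only copies the data into the page cache (RAM); the kernel
# writes the resulting dirty pages to the device asynchronously later.
fd = os.open("pagecache_demo.bin", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"hello, page cache")

# Up to this point the data may exist only in dirty pages. fsync()
# blocks until the kernel has handed the pages to the device, closing
# the window in which a kernel panic or power loss could drop them.
os.fsync(fd)
os.close(fd)
```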
Reads also go through the page cache.
When a process reads data from a device,
the Linux kernel reads the 4-KiB pages that hold the data into the page cache, i.e., to RAM.
From there, the Linux kernel copies the data to the read buffer provided by the process.
The page cache uses least-recently-used eviction.
The pages that contain data that was recently read or written are thus kept in RAM,
so that subsequent reads hopefully won’t have to go to the underlying device,
but will find their data already in the page cache.
The page cache’s lifetime is bounded by the Linux kernel’s lifetime.
The data will be safe even if a process crashes after writing data to the page cache
but before the Linux kernel actually writes the data to disk.
The page cache is system-wide and not bound to a process.
The page cache only loses data when the kernel panics
or if there is a sudden power loss, in other terms,
when the kernel doesn’t get cleanly shut down.
A clean OS shutdown will write all dirty pages to disk.
Disk devices and controllers can also contain caches.
Again, the idea is to coalesce writes and to keep recently read or written data in RAM.
The difference is just that for hardware caches, the RAM sits on the disk device or the controller.
How exactly these caches work and which guarantees they come with differs from device to device.
Sometimes the cache of a device can also be configured,
i.e., it allows selecting from a set of different behaviors.
Some of these caches are battery-backed,
so that a sudden power loss would not cause data loss; others aren't.
The cache hierarchy uses the following 3 layers:

1. the current write block
2. the page cache
3. the hardware caches
All three of them can delay the persistence of written record data,
temporarily keeping the data in RAM,
where it can be affected by unexpected events such as a sudden loss of power.
The post-write-queue doesn’t factor into this,
as it only keeps already written data around,
but doesn’t delay the data on its way to persistent storage.
These three buffers and caches form three layers of a hierarchy that written data moves through.
As the first layer, asd keeps data in the current write block
(unless commit-to-device is set to true).
When asd decides to actually persist a write block,
the page cache comes into play as a second layer and may further delay persistence.
Once the Linux kernel decides to write the data from the page cache to the underlying device,
the hardware caches come into play as a third layer and may further delay persistence.
Finally, the device’s firmware will decide to move the data from the hardware cache to persistent media.
Only then will the data be safe from any unexpected events such as a sudden power loss.
Aerospike server version 4.3.1 and above:
By default, reads from and writes to devices use O_DIRECT and O_DSYNC.
This bypasses the latter two layers in the three-layer cache hierarchy,
the page cache and the hardware caches.
However, data can still be lost in the first layer, the current write block.
This buffer loss window is bounded, though,
by how often asd writes partial write blocks to the underlying device.
Refer to the flush-max-ms configuration directive.
Therefore, the buffer loss window (in case of a crash or power loss) can be quantified and controlled.
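What O_DIRECT | O_DSYNC entails can be sketched from user space. This is a simplified illustration, assuming Linux; the function name and file path are hypothetical, and the fallback exists because some filesystems (e.g. tmpfs) reject O_DIRECT:

```python
import mmap
import os

BLOCK = 4096  # O_DIRECT transfers must be a multiple of the block size

def write_direct(path: str, data: bytes) -> str:
    """Write `data` with O_DIRECT | O_DSYNC, bypassing the page cache
    and forcing the write through the hardware cache. Falls back to a
    buffered write + fsync on filesystems that reject O_DIRECT."""
    assert len(data) % BLOCK == 0
    # O_DIRECT also requires an aligned user buffer; an anonymous mmap
    # is always page-aligned, which satisfies the 4-KiB requirement.
    buf = mmap.mmap(-1, len(data))
    buf.write(data)
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT | os.O_DSYNC)
        os.write(fd, buf)      # returns only once the device has the data
        mode = "direct"
    except OSError:            # filesystem without O_DIRECT support
        fd = os.open(path, os.O_WRONLY | os.O_CREAT)
        os.write(fd, data)
        os.fsync(fd)           # best-effort equivalent of O_DSYNC
        mode = "buffered"
    os.close(fd)
    return mode
```

The alignment requirement is why asd can use these flags naturally: write blocks are already large, block-aligned buffers.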
By default, reads from and writes to files don’t use any flags.
Therefore, caching applies to both. All reads are cached in the page cache,
and for writes data can theoretically be lost in the page cache or in the hardware caches.
As with devices, data can also be lost in the first layer.
However, while the loss window in the first layer is bounded
(refer to the flush-max-ms configuration directive),
there are no definitive bounds regarding the page cache or the hardware caches.
So, this default has slightly worse guarantees than the default for devices.
As mentioned previously in this article,
though, data loss in the second and third layer requires a kernel panic or sudden power loss.
If asd crashes, the data will be preserved in these layers,
even with these somewhat lesser guarantees.
The direct-files configuration directive enables O_DIRECT and O_DSYNC for files,
i.e., it brings the parameters for files in line with the defaults for devices.
Reads and writes now bypass the page cache and the hardware caches.
Data can still be lost in the first layer,
though, just like for devices; the next configuration directive addresses this.
The flush-max-ms directive configures the interval (in milliseconds) at which asd writes a
partially filled current write block to the device.
This bounds the buffer loss window in layer one of the cache hierarchy,
the current write block, to the given interval.
Note that this only applies to the first layer of the cache hierarchy.
If it is used with files without also setting direct-files,
then caching still happens in the page cache and in the hardware caches.
The commit-to-device configuration directive takes flush-max-ms to its logical conclusion:
synchronously write record data to the underlying device during a write transaction.
In contrast to flush-max-ms, though, this affects all three layers of the cache hierarchy:
if O_DIRECT and O_DSYNC aren’t enabled yet, this will enable them.
For devices, O_DIRECT and O_DSYNC are enabled by default,
so this aspect of commit-to-device only applies to files.
This also means that the direct-files directive is not needed
when using commit-to-device with files.
In any case, this configuration directive disables caching in all three layers of the hierarchy.
The read-page-cache configuration directive removes O_DIRECT and O_DSYNC for record reads done by transactions.
This means that the read data will not only go to asd, but also into the page cache.
If the same record gets read again by a subsequent transaction,
the read can be satisfied from the page cache
instead of having to go to the device.
This configuration directive doesn’t change anything about writes,
i.e., it doesn’t affect any write guarantees established
by the above configuration options, say, commit-to-device
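For orientation, the directives discussed above all live in a namespace's storage-engine stanza. The following is a hedged sketch, not a recommended configuration; the namespace name, file path, and sizes are hypothetical, and the exact syntax should be checked against the Aerospike configuration reference:

```
namespace test {
    storage-engine device {
        file /opt/aerospike/data/test.dat
        filesize 16G
        direct-files true        # O_DIRECT/O_DSYNC for file-backed storage
        flush-max-ms 1000        # flush partial write blocks at this interval
        # commit-to-device true  # strongest guarantee; subsumes direct-files
        read-page-cache true     # let transaction reads use the page cache
    }
}
```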
So, if reads are cached in the page cache, but writes bypass the page cache,
won’t reads potentially read stale cached data?
This is not an issue, because the Linux kernel guarantees page cache coherence.
Even though in our scenario writes aren’t cached,
a write of some data invalidates a cached copy of that data in the page cache.
Therefore, writes don’t overwrite data in the page cache,
but they do invalidate it, if needed.
A subsequent read of the written data is thus forced to hit the device again
and read the fresh data, not a stale version of it.
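This coherence can be demonstrated directly. The sketch below (assuming Linux; the file name is arbitrary) warms the page cache with a buffered read, overwrites the page bypassing the page cache where the filesystem permits O_DIRECT, and then checks that a buffered read sees the fresh data:

```python
import mmap
import os

PAGE = 4096
path = "coherence_demo.bin"

# Seed the file with a buffered write and warm the page cache with a read.
with open(path, "wb") as f:
    f.write(b"a" * PAGE)
with open(path, "rb") as f:
    assert f.read() == b"a" * PAGE       # served via the page cache

# Overwrite the page, bypassing the page cache where possible
# (falling back to a buffered write on filesystems without O_DIRECT).
buf = mmap.mmap(-1, PAGE)                # page-aligned buffer for O_DIRECT
buf.write(b"b" * PAGE)
try:
    fd = os.open(path, os.O_WRONLY | os.O_DIRECT | os.O_DSYNC)
    os.write(fd, buf)
except OSError:
    fd = os.open(path, os.O_WRONLY)
    os.write(fd, b"b" * PAGE)
os.close(fd)

# The kernel invalidated the cached copy of the page, so a buffered
# read sees the fresh data, never the stale "a" page.
with open(path, "rb") as f:
    assert f.read() == b"b" * PAGE
```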