Use bulk requests
Bulk requests will yield much better performance than single-document index requests.
In order to find the optimal size of a bulk request, you should run a benchmark on a single node with a single shard: first try to index 100 documents at once, then 200, then 400, and so on, doubling the number of documents in each run. When the indexing speed starts to plateau, you know you have reached the optimal size of a bulk request for your data. Beware that very large bulk requests can put the cluster under memory pressure when many of them are sent concurrently, so it is advisable to avoid going beyond a couple of tens of megabytes per request, even if larger requests seem to perform better.
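A rough sketch of such a benchmark, assuming the official Python client (elasticsearch-py 8.x) against a single-node, single-shard test index; the index name "bulk-bench" and the generated documents are placeholders:

    from time import perf_counter
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    def docs(n):
        # Placeholder documents; benchmark with a sample of your real data instead.
        for i in range(n):
            yield {"_index": "bulk-bench", "_source": {"value": i}}

    batch = 100
    while batch <= 12800:
        start = perf_counter()
        # chunk_size=batch keeps this as a single bulk request of `batch` documents.
        helpers.bulk(es, docs(batch), chunk_size=batch)
        rate = batch / (perf_counter() - start)
        print(f"{batch} docs/request -> {rate:.0f} docs/s")
        batch *= 2  # 100, 200, 400, ... stop once the rate plateaus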
Use multiple workers/threads to send data to Elasticsearch
A single thread sending bulk requests is unlikely to be able to max out the indexing capacity of an Elasticsearch cluster.
In order to use all resources of the cluster,
you should send data from multiple threads or processes.
In addition to making better use of the resources of the cluster,
this should help reduce the cost of each fsync.
Make sure to watch for TOO_MANY_REQUESTS (429) response codes (EsRejectedExecutionException with the Java client), which is the way that Elasticsearch tells you that it cannot keep up with the current indexing rate. When this happens, you should pause indexing briefly before trying again, ideally with randomized exponential backoff.
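A minimal sketch of such a retry loop with the Python client (elasticsearch-py 8.x); the helper name and retry limits are illustrative, not part of any API:

    import random
    import time
    from elasticsearch import ApiError, Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    def bulk_with_backoff(actions, max_retries=5):
        # Pass `actions` as a list (not a generator) so a retry can resend them.
        for attempt in range(max_retries):
            try:
                return helpers.bulk(es, actions)
            except ApiError as e:
                if e.meta.status != 429:
                    raise
                # Randomized exponential backoff: ~1s, 2s, 4s, ... plus jitter,
                # so that parallel workers do not all retry in lockstep.
                time.sleep(2 ** attempt + random.random())
        raise RuntimeError("bulk request kept being rejected with 429")

Note that rejections can also surface per document inside the bulk response rather than as a top-level 429, so production code should inspect item-level errors as well.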
As with sizing bulk requests, only testing can tell what the optimal number of workers is: progressively increase the number of workers until either I/O or CPU is saturated on the cluster.
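For example, with the Python client, helpers.parallel_bulk sends bulk requests from a pool of threads; the thread count and chunk size below are starting points to tune, and the document stream is a placeholder:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    def docs():
        # Placeholder stream of documents to index.
        for i in range(1_000_000):
            yield {"_index": "bulk-bench", "_source": {"value": i}}

    # Re-run with a higher thread_count until I/O or CPU on the cluster is
    # saturated, watching for 429 rejections along the way.
    for ok, item in helpers.parallel_bulk(es, docs(), thread_count=4, chunk_size=500):
        if not ok:
            print("failed:", item)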
Increase the refresh interval
The default index.refresh_interval is 1s, which forces Elasticsearch to create a new segment every second. Increasing this value (to, say, 30s) allows larger segments to flush and decreases future merge pressure.
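Since index.refresh_interval is a dynamic setting, it can be changed on a live index; a sketch with the Python client (the index name "my-index" is a placeholder):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    es.indices.put_settings(
        index="my-index",
        settings={"index": {"refresh_interval": "30s"}},
    )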
Disable refresh and replicas for initial loads
If you need to load a large amount of data at once,
you should disable refresh by setting index.refresh_interval to -1, and disable replicas by setting index.number_of_replicas to 0.
This will temporarily put your index at risk since the loss of any shard will cause data loss,
but at the same time indexing will be faster since documents will be indexed only once.
Once the initial loading is finished,
you can set index.refresh_interval and index.number_of_replicas back to their original values.
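A sketch of the whole sequence with the Python client, assuming the index previously used the defaults (a 1s refresh interval and one replica):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Before the initial load: disable refresh and replicas.
    es.indices.put_settings(
        index="my-index",
        settings={"index": {"refresh_interval": "-1", "number_of_replicas": 0}},
    )

    # ... run the bulk load here ...

    # After the load: restore the original values.
    es.indices.put_settings(
        index="my-index",
        settings={"index": {"refresh_interval": "1s", "number_of_replicas": 1}},
    )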
Disable swapping
You should make sure that the operating system is not swapping out the Java process by disabling swapping, for example by disabling swap at the operating-system level or by setting bootstrap.memory_lock to true in elasticsearch.yml so that Elasticsearch locks its memory with mlockall.
Give memory to the filesystem cache
The filesystem cache is used to buffer I/O operations. You should make sure to leave at least half the memory of the machine running Elasticsearch to the filesystem cache; in practice, this means setting the JVM heap to no more than half of the machine's physical memory.
Use auto-generated ids
When indexing a document that has an explicit id,
Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation that gets even more costly as the index grows.
By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster.
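The difference is simply whether you pass an id when indexing; with the Python client (the index name and documents are placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Explicit id: Elasticsearch must first check whether id "42" already exists.
    es.index(index="my-index", id="42", document={"message": "hello"})

    # No id: Elasticsearch auto-generates one and can skip that check.
    es.index(index="my-index", document={"message": "hello"})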
Use faster hardware
If indexing is I/O-bound, you should investigate giving more memory to the filesystem cache (see above) or buying faster drives. In particular, SSD drives are known to perform better than spinning disks.
Always use local storage; remote filesystems such as NFS or SMB should be avoided.
Also beware of virtualized storage such as Amazon’s Elastic Block Storage.
Virtualized storage works very well with Elasticsearch,
and it is appealing since it is so fast and simple to set up,
but it is also unfortunately inherently slower on an ongoing basis
when compared to dedicated local storage.
If you put an index on EBS, be sure to use provisioned IOPS; otherwise operations could quickly be throttled.
Stripe your index across multiple SSDs by configuring a RAID 0 array.
Remember that it will increase the risk of failure
since the failure of any one SSD destroys the index.
However, this is typically the right tradeoff to make:
optimize single shards for maximum performance,
and then add replicas across different nodes so there’s redundancy for any node failures.
You can also use snapshot and restore to back up the index for further insurance.
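As a sketch, taking such a snapshot with the Python client could look like this; the repository name "my_backup" is a placeholder and must already be registered on the cluster:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    es.snapshot.create(
        repository="my_backup",
        snapshot="snapshot_1",
        wait_for_completion=True,
    )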
Indexing buffer size
If your node is doing only heavy indexing,
be sure indices.memory.index_buffer_size is large enough
to give at most 512 MB indexing buffer per shard doing heavy indexing
(beyond that indexing performance does not typically improve).
Elasticsearch takes that setting (a percentage of the Java heap or an absolute byte size),
and uses it as a shared buffer across all active shards.
Very active shards will naturally use this buffer more than shards
that are performing lightweight indexing.
The default is 10%, which is often plenty: for example, if you give the JVM 10 GB of memory, it will give 1 GB to the index buffer, which is enough to host two shards that are heavily indexing.
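Note that indices.memory.index_buffer_size is a static, node-level setting, so it cannot be changed through the index settings API; if the default is not enough, set it in elasticsearch.yml (for example, indices.memory.index_buffer_size: 20%) and restart the node.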