General recommendations for Elasticsearch 5.0
Translated and summarized from the original documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/5.0/general-recommendations.html



Don’t return large result sets
Elasticsearch is designed as a search engine, 
which makes it very good at getting back the top documents that match a query. 
However, it is not as good for workloads that fall into the database domain, 
such as retrieving all documents that match a particular query. 
If you need to do this, make sure to use the Scroll API.

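In short: do not return large result sets.
A search engine is good at returning the top few documents that match a query.
Returning every document that matches a particular query is the strength of a traditional database.
If you really need to do this, use the Scroll API.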
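
The Scroll API advice above can be sketched as a small generator. This is a minimal illustration, not a client implementation: `send` is a hypothetical transport callable that you would back with your own HTTP client, and `make_fake_transport` is an in-memory stand-in added only so the sketch can run without a cluster. The request paths and bodies follow the scroll requests described in the 5.0 reference.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple

# Hypothetical transport: send(method, path, body) -> parsed JSON response.
Transport = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def scroll_all(send: Transport, index: str, query: Dict[str, Any],
               page_size: int = 1000,
               keep_alive: str = "1m") -> Iterator[Dict[str, Any]]:
    """Yield every hit for `query`, fetching one scroll page at a time."""
    # The first request opens the scroll context and returns the first page.
    page = send("POST", f"/{index}/_search?scroll={keep_alive}",
                {"size": page_size, "query": query})
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            yield from page["hits"]["hits"]
            # Each follow-up request passes the most recent scroll id back.
            page = send("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the search context rather than waiting for it to expire.
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": [scroll_id]})


def make_fake_transport(docs: List[Dict[str, Any]],
                        calls: List[Tuple[str, str]]) -> Transport:
    """In-memory stand-in for a cluster, for illustration only."""
    state = {"pos": 0, "size": 0}

    def send(method: str, path: str, body: Dict[str, Any]) -> Dict[str, Any]:
        calls.append((method, path))
        if method == "DELETE":
            return {"succeeded": True}
        if "scroll=" in path:  # initial search request carries the page size
            state["size"] = body["size"]
        start = state["pos"]
        state["pos"] = start + state["size"]
        return {"_scroll_id": "fake-id",
                "hits": {"hits": docs[start:state["pos"]]}}

    return send
```

Because the function is a generator, the caller never holds more than one page of hits in memory, which is the point of scrolling instead of asking for one huge result set.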


Avoid large documents

Given that the default http.max_content_length is set to 100MB, 
Elasticsearch will refuse to index any document that is larger than that. 
You might decide to increase that particular setting, 
but Lucene still has a limit of about 2GB.
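
As a rough illustration of the limit above, the following sketch estimates the size of a document once serialized as JSON and compares it with the default http.max_content_length. The helper names are hypothetical, and the real request size depends on how your client serializes the body, so treat the result as an estimate.

```python
import json

# Default http.max_content_length in Elasticsearch 5.0 (100mb).
DEFAULT_MAX_CONTENT_LENGTH = 100 * 1024 * 1024


def request_size_in_bytes(document: dict) -> int:
    """Size of the document when sent as a UTF-8 encoded JSON body."""
    return len(json.dumps(document, ensure_ascii=False).encode("utf-8"))


def fits_in_one_request(document: dict,
                        limit: int = DEFAULT_MAX_CONTENT_LENGTH) -> bool:
    """True if the serialized document stays within the HTTP size limit."""
    return request_size_in_bytes(document) <= limit
```

A document that passes this check can still be impractically large, for the reasons given in the rest of this section.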

Large documents put more stress on the network, memory and disk, 
even for search requests that do not request the _source 
since Elasticsearch needs to fetch the _id of the document in all cases, 
and the cost of getting this field is bigger for large documents due to how the filesystem cache works. 
 
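In short: even a search request that does not return _source has to fetch the document's _id, and because of how the filesystem cache works, fetching it costs more for large documents.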
Indexing this document can use an amount of memory that is a multiplier of the original size of the document.
In short: indexing a document can need several times its original size in memory.

Proximity search (phrase queries for instance) and highlighting also become more expensive 
since their cost directly depends on the size of the original document.

In short: proximity search (phrase queries, for example) and highlighting get more expensive as documents grow.

For instance, the fact you want to make books searchable doesn’t necessarily mean 
that a document should consist of a whole book. 
It might be a better idea to use chapters or even paragraphs as documents, 
and then have a property in these documents that identifies which book they belong to. 

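In short: to make a book searchable, index it by chapter or paragraph, and record in each document which book it belongs to.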
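
The book example can be sketched as follows. The field names (`book_id`, `book_title`, `chapter`, `text`) are assumptions chosen for illustration. The source only says that each document should carry a property identifying its book.

```python
from typing import Dict, Iterator, List


def chapter_documents(book_id: str, title: str,
                      chapters: List[str]) -> Iterator[Dict[str, object]]:
    """Turn one book into one small document per chapter."""
    for number, text in enumerate(chapters, start=1):
        yield {
            "book_id": book_id,     # hypothetical field linking back to the book
            "book_title": title,
            "chapter": number,      # 1-based chapter number
            "text": text,
        }
```

Searching then returns the matching chapters, and the `book_id` field lets you filter by book or group the hits by book.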



Avoid sparsity


The data-structures behind Lucene, which Elasticsearch relies on in order to index and store data, 
work best with dense data, i.e. when all documents have the same fields. 
This is especially true for fields that have norms enabled 
(which is the case for text fields by default) or 
doc values enabled (which is the case for numerics, date, ip and keyword by default).

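In short: Lucene prefers dense data, meaning that all documents have the same fields.
This matters most for fields that have norms or doc values enabled.

Lucene uses an inverted index by default. Doc values add a column-oriented store for working on a single field, for example for sorting or aggregations.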


Lucene identifies documents internally with doc ids, which are integers between 0 and the total number of documents in the index. 
For instance, searching on a term with a match query produces an iterator of doc ids, 
and these doc ids are then used to retrieve the value of the norm 
in order to compute a score for these documents. 

In short: a term query yields an iterator of doc ids, and each doc id is used to look up the norm that goes into the document's score.

The way this norm lookup is implemented currently is by reserving one byte for each document. 
The norm value for a given doc id can then be retrieved by reading the byte at index doc_id.
In short: one byte is reserved per document.
The norm for a given doc id is read from the byte at index doc_id.

While this is very efficient and helps Lucene quickly have access to the norm values of every document, 
this has the drawback that documents that do not have a value will also require one byte of storage.


In short: the lookup is very fast, but a document that has no value for the field still costs one byte of storage.


In practice, this means that if an index has M documents, 
norms will require M bytes of storage per field, 
even for fields that only appear in a small fraction of the documents of the index. 

In short: norms cost one byte per document for each field, so M documents need M bytes per field.
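
The one-byte-per-document layout can be modelled with a plain byte array. This is a toy model of the idea described above, not Lucene's actual code. The function names are made up for illustration, and storing a zero byte for documents without the field is an assumption of the sketch.

```python
from typing import Dict


def build_norms(doc_count: int, norms_by_doc: Dict[int, int]) -> bytearray:
    """Reserve one byte per doc id for a single field.

    Documents that do not have the field still occupy a (zero) byte.
    """
    norms = bytearray(doc_count)
    for doc_id, norm in norms_by_doc.items():
        norms[doc_id] = norm
    return norms


def norm_for(norms: bytearray, doc_id: int) -> int:
    """Look up a norm by reading the byte at index doc_id."""
    return norms[doc_id]
```

With 1,000 documents of which only two have the field, `len(build_norms(1000, {7: 120, 42: 95}))` is still 1000: the storage follows the number of documents in the index, not the number of documents that have the field.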

doc values have multiple ways that they can be encoded depending on the type of field 
and on the actual data that the field stores

In short: doc values can be encoded in several ways, depending on the field type and on the data actually stored.
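fielddata, which was used in Elasticsearch pre-2.0 before being replaced with doc values, also had this issue, except that
the impact was only on the memory footprint since fielddata was not explicitly materialized on disk.
In short: before version 2.0, Elasticsearch used fielddata. Because fielddata was not explicitly materialized on disk, sparsity only affected its memory footprint.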

Note that even though the most notable impact of sparsity is on storage requirements, 
it also has an impact on indexing speed and search speed since these bytes for documents 
that do not have a field still need to be written at index time 
and skipped over at search time.

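In short: storage takes the biggest hit from sparsity, but indexing and search slow down too,
because the bytes for documents that lack a field still have to be written at index time and skipped at search time.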

It is totally fine to have a minority of sparse fields in an index. 
But beware that if sparsity becomes the rule rather than the exception, 
then the index will not be as efficient as it could be.

In short: a few sparse fields in an index are fine, but once sparsity is the rule rather than the exception, the index becomes less efficient.
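
One way to judge whether sparsity is the exception or the rule is to measure, on a sample of documents, the fraction that contains each field. The sketch below does this for top-level fields only. The function names and the 0.5 threshold are arbitrary assumptions for illustration, not values taken from the Elasticsearch documentation.

```python
from collections import Counter
from typing import Dict, List, Mapping, Sequence


def field_density(documents: Sequence[Mapping[str, object]]) -> Dict[str, float]:
    """Fraction of documents that contain each top-level field."""
    counts: Counter = Counter()
    for doc in documents:
        counts.update(doc.keys())
    total = len(documents)
    if total == 0:
        return {}
    return {field: seen / total for field, seen in counts.items()}


def sparse_fields(documents: Sequence[Mapping[str, object]],
                  threshold: float = 0.5) -> List[str]:
    """Fields present in fewer than `threshold` of the documents."""
    density = field_density(documents)
    return sorted(field for field, value in density.items()
                  if value < threshold)
```

If most fields come back from `sparse_fields`, the recommendations in the rest of this section are worth applying.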


This section mostly focused on norms and doc values 
because those are the two features that are most affected by sparsity. 
Sparsity also affects the efficiency of the inverted index (used to index text/keyword fields) 
and dimensional points (used to index geo_point and numerics) but to a lesser extent.


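In short: norms and doc values are the two features that suffer most from sparsity.
The inverted index (text/keyword fields) and dimensional points (geo_point and numeric fields) are affected too, but to a lesser extent.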



Some recommendations for avoiding sparsity

Avoid putting unrelated data in the same index
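Do not put unrelated documents in the same index.
This advice does not apply when you need parent/child relations between documents, because that feature only works for documents in the same index.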

Normalize document structures
Even if you really need to put different kinds of documents in the same index, 
maybe there are opportunities to reduce sparsity. 
For instance if all documents in the index have a timestamp field but some call it timestamp 
and others call it creation_date, it would help to rename it so 
that all documents have the same field name for the same data.

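In short: normalize document structures.
If the same data is called timestamp in some documents and creation_date in others,
rename it so that every document uses one field name.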
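
The renaming step can be done before indexing. The following is a minimal sketch that assumes a hand-maintained alias table. `FIELD_ALIASES` and `normalize_fields` are illustrative names, and only top-level fields are handled.

```python
from typing import Dict, Mapping

# Hypothetical alias table: variant field name -> canonical field name.
FIELD_ALIASES = {"creation_date": "timestamp"}


def normalize_fields(document: Mapping[str, object],
                     aliases: Mapping[str, str] = FIELD_ALIASES) -> Dict[str, object]:
    """Return a copy of the document with variant field names renamed."""
    normalized: Dict[str, object] = {}
    for name, value in document.items():
        canonical = aliases.get(name, name)
        # Refuse to silently overwrite when both spellings carry different data.
        if canonical in normalized and normalized[canonical] != value:
            raise ValueError(f"conflicting values for field {canonical!r}")
        normalized[canonical] = value
    return normalized
```

After this step, documents that used `creation_date` and documents that used `timestamp` populate the same field, so that field is dense instead of two fields each being sparse.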

Avoid types
Types might sound like a good way to store multiple tenants in a single index. 
They are not: given that types store everything in a single index, 
having multiple types that have different fields in a single index 
will also cause problems due to sparsity as described above. 
If your types do not have very similar mappings, 
you might want to consider moving them to a dedicated index.

In short: avoid using multiple types in one index.
If the type mappings differ a lot, put each type in its own index.

Disable norms and doc_values on sparse fields
If none of the above recommendations apply in your case, 
you might want to check whether you actually need norms and doc_values on your sparse fields. 
norms can be disabled if producing scores is not necessary on a field, 
this is typically true for fields that are only used for filtering. 
doc_values can be disabled on fields that are neither used for sorting nor for aggregations. 
Beware that this decision should not be made lightly since these parameters cannot be changed on a live index, 
so you would have to reindex if you realize that you need norms or doc_values.


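In short: check whether your sparse fields really need norms and doc_values.
norms can be disabled when a field never contributes to scoring, which is typically the case for fields used only for filtering.
doc_values can be disabled on fields that are used for neither sorting nor aggregations.
Do not make this decision lightly: these parameters cannot be changed on a live index,
so if you later need norms or doc_values you have to reindex.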
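
As an illustration of the last recommendation, the sketch below builds an index-creation body that disables norms on a text field and doc_values on a keyword field, using the `norms` and `doc_values` mapping parameters. The type name and the field names `internal_note` and `legacy_code` are invented for the example.

```python
import json


def sparse_field_mapping(type_name: str = "doc") -> dict:
    """Index-creation body with norms and doc_values disabled on two fields."""
    return {
        "mappings": {
            type_name: {
                "properties": {
                    # Text field used only for filtering, so scores are not needed.
                    "internal_note": {"type": "text", "norms": False},
                    # Keyword field that is never sorted or aggregated on.
                    "legacy_code": {"type": "keyword", "doc_values": False},
                }
            }
        }
    }


def as_request_body(mapping: dict) -> str:
    """Serialize the mapping as the JSON body of an index-creation request."""
    return json.dumps(mapping, indent=2)
```

The resulting JSON would be sent when the index is created. As noted above, these settings should be decided up front, because getting norms or doc_values back later means reindexing.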