Published 2019-01-29, last modified 2019-01-29
Don’t return large result sets
Elasticsearch is designed as a search engine,
which makes it very good at getting back the top documents that match a query.
However, it is not as good for workloads that fall into the database domain,
such as retrieving all documents that match a particular query.
If you really need to do this, make sure to use the Scroll API.
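As a rough illustration of how a scroll loop works, here is a minimal Python sketch. The `post` callable is a hypothetical transport (it is not part of any Elasticsearch client) that sends a JSON body to a path and returns the parsed response; `scroll_all`, the page size and the keep-alive value are likewise names and defaults chosen for this example.

```python
from typing import Any, Callable, Dict, Iterator

# Hypothetical transport: sends a JSON body to an Elasticsearch path and
# returns the parsed JSON response. Plug in your own HTTP client here.
PostJson = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def scroll_all(post: PostJson, index: str, query: Dict[str, Any],
               page_size: int = 1000,
               keep_alive: str = "1m") -> Iterator[Dict[str, Any]]:
    """Yield every hit matching `query`, one scroll page at a time."""
    # The first request opens the scroll context and returns the first page.
    response = post(f"/{index}/_search?scroll={keep_alive}",
                    {"size": page_size, "query": query})
    # An empty page means the scroll is exhausted.
    while response["hits"]["hits"]:
        yield from response["hits"]["hits"]
        # Each follow-up request passes the most recent scroll id back.
        response = post("/_search/scroll",
                        {"scroll": keep_alive,
                         "scroll_id": response["_scroll_id"]})
```

Because the function is a generator, the caller only ever holds one page of hits in memory, which is the point of scrolling instead of requesting one very large result set.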
Avoid large documents
Given that the default http.max_content_length is set to 100MB,
Elasticsearch will refuse to index any document that is larger than that.
You might decide to increase that particular setting,
but Lucene still has a limit of about 2GB.
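A client can check the size of a document before sending it. The sketch below is only an approximation: it assumes that 100MB means 100 * 1024 * 1024 bytes and that the document is sent alone as UTF-8 JSON, and the names `MAX_CONTENT_LENGTH` and `fits_in_request` are made up for this example.

```python
import json
from typing import Any, Dict

# Assumption: the 100MB default corresponds to this many bytes.
MAX_CONTENT_LENGTH = 100 * 1024 * 1024


def fits_in_request(doc: Dict[str, Any], limit: int = MAX_CONTENT_LENGTH) -> bool:
    """Return True if the JSON-serialized document is within the byte limit."""
    size_in_bytes = len(json.dumps(doc).encode("utf-8"))
    return size_in_bytes <= limit
```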
Large documents put more stress on the network, memory usage and disk,
even for search requests that do not request the _source,
since Elasticsearch needs to fetch the _id of the document in all cases,
and the cost of getting this field is bigger for large documents due to how the filesystem cache works.
Indexing a large document can use an amount of memory that is a multiple of the original size of the document.
Proximity search (phrase queries for instance) and highlighting also become more expensive
since their cost directly depends on the size of the original document.
For instance, the fact that you want to make books searchable doesn’t necessarily mean
that a document should consist of a whole book.
It might be a better idea to use chapters or even paragraphs as documents,
and then have a property in these documents that identifies which book they belong to.
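One way to picture this is the following Python sketch, which turns a book into one small document per chapter. The field names (`book_id`, `book_title`, `chapter`, `text`) and the function name are assumptions made for the example, not a required schema.

```python
from typing import Dict, Iterator, List


def chapter_documents(book_id: str, title: str,
                      chapters: List[str]) -> Iterator[Dict[str, object]]:
    """Turn one book into one small document per chapter."""
    for number, text in enumerate(chapters, start=1):
        yield {
            "book_id": book_id,    # identifies which book the chapter belongs to
            "book_title": title,
            "chapter": number,     # 1-based chapter number
            "text": text,
        }


docs = list(chapter_documents("book-1", "Example Book",
                              ["First chapter.", "Second chapter."]))
```

Searching then returns the matching chapters, and the `book_id` property can be used to group or filter results by book.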
The data structures behind Lucene, which Elasticsearch relies on in order to index and store data,
work best with dense data, i.e. when all documents have the same fields.
This is especially true for fields that have norms enabled
(which is the case for text fields by default) or
doc values enabled (which is the case for numerics, date, ip and keyword by default).
Note: Lucene uses an inverted index by default; to operate on the values of a single field, enable the column-oriented doc values.
The reason is that Lucene internally identifies documents with doc ids,
which are integers between 0 and the total number of documents in the index.
For instance, searching on a term with a match query produces an iterator of doc ids,
and these doc ids are then used to retrieve the value of the norm
in order to compute a score for these documents.
The way this norm lookup is implemented currently is by reserving one byte for each document.
The norm value for a given doc id can then be retrieved by reading the byte at index doc_id.
While this is very efficient and helps Lucene quickly have access to the norm values of every document,
this has the drawback that documents that do not have a value will also require one byte of storage.
In practice, this means that if an index has M documents,
norms will require M bytes of storage per field,
even for fields that only appear in a small fraction of the documents of the index.
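The storage cost can be illustrated with a toy model in Python. This is not Lucene's actual code: it only mimics the layout described above (one byte per document, looked up by doc id), and the function name and sample values are invented for the example.

```python
from typing import Dict


def build_norms(doc_count: int, norm_by_doc_id: Dict[int, int]) -> bytearray:
    """Reserve one byte per document, whether or not it has the field."""
    norms = bytearray(doc_count)  # zero-filled, one slot per doc id
    for doc_id, norm in norm_by_doc_id.items():
        norms[doc_id] = norm      # lookup and store are both by doc id
    return norms


# Only three of one million documents have the field,
# yet the array still occupies one million bytes.
norms = build_norms(1_000_000, {0: 12, 41: 7, 999_999: 3})
```

Reading `norms[41]` is a single array access, which is why the layout is fast, while `len(norms)` stays at the total document count regardless of how sparse the field is.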
Doc values can be encoded in multiple ways depending on the type of field
and on the actual data that the field stores, but they are affected by sparsity as well.
The situation was similar with fielddata, which was used in Elasticsearch pre-2.0 before being replaced with doc values,
except that the impact was only on the memory footprint since fielddata was not explicitly materialized on disk.
Note that even though the most notable impact of sparsity is on storage requirements,
it also has an impact on indexing speed and search speed since these bytes for documents
that do not have a field still need to be written at index time
and skipped over at search time.
It is totally fine to have a minority of sparse fields in an index.
But beware that if sparsity becomes the rule rather than the exception,
then the index will not be as efficient as it could be.
This section mostly focused on norms and doc values
because those are the two features that are most affected by sparsity.
Sparsity also affects the efficiency of the inverted index (used to index text/keyword fields)
and dimensional points (used to index geo_point and numerics) but to a lesser extent.
Avoid putting unrelated data in the same index
Normalize document structures
Even if you really need to put different kinds of documents in the same index,
maybe there are opportunities to reduce sparsity.
For instance if all documents in the index have a timestamp field but some call it timestamp
and others call it creation_date, it would help to rename it so
that all documents have the same field name for the same data.
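A minimal Python sketch of such a rename, applied to documents before they are indexed, might look like this. The function name is hypothetical, and the choice to keep an existing timestamp value when both fields are present is an assumption of the example.

```python
from typing import Any, Dict


def normalize_timestamp(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `doc` that uses `timestamp` instead of `creation_date`."""
    normalized = dict(doc)  # shallow copy, the input is left untouched
    if "creation_date" in normalized:
        value = normalized.pop("creation_date")
        # Keep an existing `timestamp` if the document already has one.
        normalized.setdefault("timestamp", value)
    return normalized
```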
Types might sound like a good way to store multiple tenants in a single index.
They are not: given that types store everything in a single index,
having multiple types that have different fields in a single index
will also cause problems due to sparsity as described above.
If your types do not have very similar mappings,
you might want to consider moving them to dedicated indices.
Disable norms and doc_values on sparse fields
If none of the above recommendations apply in your case,
you might want to check whether you actually need norms and doc_values on your sparse fields.
norms can be disabled if producing scores is not necessary on a field,
which is typically true for fields that are only used for filtering.
doc_values can be disabled on fields that are neither used for sorting nor for aggregations.
Beware that this decision should not be made lightly since these parameters cannot be changed on a live index,
so you would have to reindex if you realize that you need norms or doc_values.
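As a sketch, the field definitions could be built like this in Python and sent as part of the mapping when the index is created. The field names `comment` and `session_id` are hypothetical, and where the `properties` object sits inside the create-index request depends on the Elasticsearch version in use.

```python
import json

# Hypothetical sparse fields, shown only to illustrate the two parameters.
sparse_field_properties = {
    # Used only for filtering, so no scoring is needed: norms disabled.
    "comment": {"type": "text", "norms": False},
    # Never sorted or aggregated on: doc_values disabled.
    "session_id": {"type": "keyword", "doc_values": False},
}

mapping_fragment = {"properties": sparse_field_properties}
print(json.dumps(mapping_fragment, indent=2))
```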