Elasticsearch 5.0 Reindex API
Translated and condensed from the original documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/5.0/docs-reindex.html



Reindex does not attempt to set up the destination index, 
and it does not copy the settings of the source index. 
Set up the destination index before running a _reindex action, 
including its mappings, shard count, replica count, etc.
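As a minimal sketch of that preparation step, the Python below only builds the JSON body for creating the destination index; it does not contact a cluster. The helper name build_create_index_body, the shard and replica counts, and the field mappings are illustrative assumptions, not values taken from the original documentation.

```python
import json


def build_create_index_body(shards, replicas, doc_type, properties):
    """Build a body for `PUT <index>` in the Elasticsearch 5.x layout (mappings keyed by type)."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            doc_type: {"properties": properties},
        },
    }


# Hypothetical settings and mappings for new_twitter; adjust to your own data.
body = build_create_index_body(
    shards=3,
    replicas=1,
    doc_type="tweet",
    properties={"user": {"type": "keyword"}, "date": {"type": "date"}},
)

print("PUT new_twitter")
print(json.dumps(body, indent=2))
```

Sending this body before POST _reindex means the copied documents land in an index whose mappings and shard layout were chosen deliberately rather than created dynamically.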

The most basic form of _reindex just copies documents from one index to another. 
This will copy documents from the twitter index into the new_twitter index:

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}


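Just like _update_by_query, _reindex gets a snapshot of the source index, 
but its target must be a different index, so version conflicts are unlikely (though not impossible).
The dest element can be configured like the index API to control optimistic concurrency control.
Leaving version_type out, or setting it to internal, makes Elasticsearch blindly write documents 
into the target, overwriting any that happen to have the same type and id:

"dest": {
    "index": "new_twitter",
    "version_type": "internal"
  }

Setting version_type to external instead preserves the version from the source, creates any missing documents, 
and updates any documents that have an older version in the destination index than they do in the source index.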

Setting op_type to create will cause _reindex to only create missing documents in the target index. 
All existing documents will cause a version conflict:

"dest": {
    "index": "new_twitter",
    "op_type": "create"
  }

By default version conflicts abort the _reindex process, 
but you can just count them instead by setting "conflicts": "proceed" in the request body:

POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}
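The interaction of op_type create with conflicts can be pictured with a toy in-memory model. This is only an illustration of the counting behavior described above: plain dicts stand in for indices, the function name simulate_create_only is hypothetical, and real Elasticsearch processes documents in bulk batches rather than one at a time.

```python
def simulate_create_only(source_docs, dest_docs, conflicts="abort"):
    """Toy model of op_type=create: copy only ids that are missing from dest_docs."""
    stats = {"created": 0, "version_conflicts": 0, "aborted": False}
    for doc_id, doc in source_docs.items():
        if doc_id in dest_docs:
            # An existing id counts as a version conflict.
            stats["version_conflicts"] += 1
            if conflicts != "proceed":
                stats["aborted"] = True  # default behavior in this model: stop at the first conflict
                break
            continue
        dest_docs[doc_id] = doc
        stats["created"] += 1
    return stats


source = {"1": {"user": "kimchy"}, "2": {"user": "alice"}, "3": {"user": "bob"}}

# Document "2" already exists in the destination.
print(simulate_create_only(source, {"2": {"user": "old"}}, conflicts="proceed"))
print(simulate_create_only(source, {"2": {"user": "old"}}))
```

With "proceed" the existing document is left untouched and merely counted; without it the run stops as soon as the conflict is seen.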

You can limit the documents by adding a type to the source or by adding a query. 
This will only copy tweets made by kimchy into new_twitter:

POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "tweet",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

A type and a query condition can be combined in the same source element, as in the request above.

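index and type in source can both be lists, so a single request can copy from several sources. 
The request below copies the tweet and post types from the twitter and blog indices. 
Note that this also includes the post type in the twitter index and the tweet type in the blog index; 
use a query if you need to be more specific.
POST _reindex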
{
  "source": {
    "index": ["twitter", "blog"],
    "type": ["tweet", "post"]
  },
  "dest": {
    "index": "all_together"
  }
}

You can limit the number of processed documents by setting size. 
This copies a single document from twitter to new_twitter:
POST _reindex
{
  "size": 1,
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}
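To get a particular set of documents, sort and set a maximum number of documents. 
Sorting makes the scroll less efficient, so prefer a more selective query where possible. 
This copies 10000 documents from twitter into new_twitter, newest first by the date field:
POST _reindex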
{
  "size": 10000,
  "source": {
    "index": "twitter",
    "sort": { "date": "desc" }
  },
  "dest": {
    "index": "new_twitter"
  }
}

Only a subset of the fields from the original documents can be reindexed by using source filtering. 
Add the following inside the source element:
"_source": ["user", "tweet"]

Like _update_by_query, _reindex supports a script that modifies the document. 
Unlike _update_by_query, the script is allowed to modify the document’s metadata. 
This example bumps the version of the source document:
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {
    "inline": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}

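The script can also set ctx.op to change the operation that is executed on the destination index:

ctx.op = "noop"    do not index the document; it is reported in the noop counter in the response body
ctx.op = "delete"  delete the document from the destination index; it is reported in the deleted counter in the response body
  
Setting _version to null or clearing it from the ctx map is 
just like not sending the version in an indexing request: 
the document is overwritten in the target index regardless of the version on the target 
or the version_type used in the _reindex request.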

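By default if _reindex sees a document with routing then the routing is preserved 
unless it’s changed by the script. You can set routing on the dest request to change this:

keep           keep the routing of the source document (the default)
discard        set the routing to null
=<some text>   set the routing to the text after the =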

By default _reindex uses scroll batches of 1000. 
You can change the batch size with the size field in the source element (source.size). 
This is separate from the top-level size, which limits the total number of documents processed.
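Because the two size settings are easy to confuse, here is a small sketch that builds a request body with both. The helper name build_reindex_body and its parameter names max_docs and batch_size are hypothetical; only the resulting JSON keys come from the API described above.

```python
import json


def build_reindex_body(source_index, dest_index, max_docs=None, batch_size=None):
    """Build a _reindex body; max_docs maps to top-level size, batch_size to source.size."""
    body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    if max_docs is not None:
        body["size"] = max_docs  # total number of documents to process
    if batch_size is not None:
        body["source"]["size"] = batch_size  # documents fetched per scroll batch
    return body


# Copy at most 10000 documents, fetching them 100 at a time.
body = build_reindex_body("twitter", "new_twitter", max_docs=10000, batch_size=100)
print("POST _reindex")
print(json.dumps(body, indent=2))
```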

Reindex can also use the Ingest Node feature by specifying a pipeline like this:
POST _reindex
{
  "source": {
    "index": "source"
  },
  "dest": {
    "index": "dest",
    "pipeline": "some_ingest_pipeline"
  }
}

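Reindex supports reindexing from a remote Elasticsearch cluster. 
The host parameter must contain a scheme, host, and port; username and password are optional.
POST _reindex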

{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}


Remote hosts have to be explicitly whitelisted in elasticsearch.yml 
using the reindex.remote.whitelist property. 
It can be set to a comma-delimited list of allowed remote host and port combinations 
(e.g. otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*). 
The scheme is ignored by the whitelist; only host and port are used.
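To make the host-and-port matching concrete, the sketch below approximates a whitelist check in Python. It is an assumption-laden illustration: the function is_whitelisted is hypothetical, and fnmatch-style wildcards are used as a stand-in for whatever pattern matching Elasticsearch applies internally.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def is_whitelisted(remote_host_url, whitelist):
    """Return True if the URL's host:port matches any comma-separated whitelist pattern."""
    parts = urlsplit(remote_host_url)
    # The scheme is deliberately dropped: only host and port take part in the match.
    host_port = "{}:{}".format(parts.hostname, parts.port)
    patterns = [p.strip() for p in whitelist.split(",")]
    return any(fnmatchcase(host_port, pattern) for pattern in patterns)


whitelist = "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*"

for url in ("http://otherhost:9200", "https://otherhost:9200",
            "http://127.0.10.7:9200", "http://elsewhere:9200"):
    print(url, is_whitelisted(url, whitelist))
```

The http and https variants of otherhost:9200 give the same answer, which mirrors the statement that the scheme plays no role.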


Reindexing from a remote server uses an on-heap buffer that defaults to a maximum size of 200mb. 
If the remote index includes very large documents you’ll need to use a smaller batch size, 
i.e. a smaller source.size.

URL parameters

In addition to the standard parameters like pretty, the Reindex API also supports 
refresh, wait_for_completion, wait_for_active_shards, timeout, and requests_per_second.

requests_per_second
Throttles the number of requests per second that the reindex issues. 
It can be set to any decimal number (1.4, 6, 1000, etc.).

The throttling is done by waiting between bulk batches so that it can manipulate the scroll timeout. 
In other words, work proceeds one scroll batch at a time, and the pause between batches enforces the rate.

Since the batch isn’t broken into multiple bulk requests, 
large batch sizes will cause Elasticsearch to create many requests 
and then wait for a while before starting the next set, so the load is bursty rather than smooth.
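One way to picture the pause between batches is as the time a batch should take at the target rate minus the time it actually took to write. The formula and the function name throttle_wait_seconds below are assumptions made for illustration, not a statement of Elasticsearch's exact implementation.

```python
def throttle_wait_seconds(batch_size, requests_per_second, write_seconds):
    """Seconds to pause after a batch so the average rate stays at or below the target."""
    if requests_per_second <= 0:
        # Assumption for this sketch: a non-positive rate means "no throttling".
        return 0.0
    target_seconds = batch_size / requests_per_second  # time the batch should occupy
    return max(0.0, target_seconds - write_seconds)


# A 1000-document batch at 500 requests per second that took 0.5 s to write.
print(throttle_wait_seconds(1000, 500, 0.5))  # 1.5
```

Under this model a larger batch size produces a longer burst of writes followed by a longer pause, which is the bursty behavior described above.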


Response body

The JSON response looks like this:
{
  "took" : 639,
  "updated": 0,
  "created": 123,
  "batches": 1,
  "version_conflicts": 2,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "failures" : [ ]
}
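A client will usually want to turn that response into a quick verdict. The sketch below parses a response with the same fields and summarizes it; the function name summarize_reindex and the shape of the summary are hypothetical choices for illustration.

```python
import json

# Sample response with the fields listed above.
response_text = """
{
  "took": 639,
  "updated": 0,
  "created": 123,
  "batches": 1,
  "version_conflicts": 2,
  "retries": {"bulk": 0, "search": 0},
  "throttled_millis": 0,
  "failures": []
}
"""


def summarize_reindex(response):
    """Condense a _reindex response into written/conflict counts and a success flag."""
    return {
        "written": response["created"] + response["updated"],
        "conflicts": response["version_conflicts"],
        "ok": not response["failures"],  # an empty failures list means nothing failed
    }


summary = summarize_reindex(json.loads(response_text))
print(summary)
```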

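wait_for_completion=false
With this parameter the request runs as a background task: Elasticsearch performs some preflight checks, 
launches the request, and returns a task that can be used with the Tasks APIs to cancel it or get its status. 
Elasticsearch will also create a record of this task as a document at .tasks/task/${taskId}. 
This is yours to keep or remove as you see fit.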


You can fetch the status of all running reindex requests with the Task API:

GET _tasks?detailed=true&actions=*reindex

Each reindex task in the response carries "action" : "indices:data/write/reindex". 
With the task id you can look up the task directly:

GET /_tasks/taskId:1


The advantage of this API is that it integrates with wait_for_completion=false 
to transparently return the status of completed tasks. 
If the task is completed and wait_for_completion=false was set on it, 
then it’ll come back with a results or an error field. 
The cost of this feature is the document that wait_for_completion=false 
creates at .tasks/task/${taskId}. It is up to you to delete that document.

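Cancelling a task
Any reindex can be canceled using the Task Cancel API:
POST _tasks/task_id:1/_cancel

Rethrottling
The value of requests_per_second can be changed on a running reindex using the _rethrottle API. 
Rethrottling that speeds up the request takes effect immediately, 
while rethrottling that slows it down takes effect after the current batch completes.

POST _reindex/task_id:1/_rethrottle?requests_per_second=-1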
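Since the status, cancel, and rethrottle calls all hang off the same task id, a thin helper can keep them consistent. The function names below are hypothetical, and the sketch only formats request lines in the console style used in these notes; it does not send anything.

```python
def task_status_request(task_id):
    """Request line for fetching the status of one task."""
    return "GET /_tasks/{}".format(task_id)


def task_cancel_request(task_id):
    """Request line for cancelling a running task."""
    return "POST _tasks/{}/_cancel".format(task_id)


def reindex_rethrottle_request(task_id, requests_per_second):
    """Request line for changing requests_per_second on a running reindex."""
    return "POST _reindex/{}/_rethrottle?requests_per_second={}".format(
        task_id, requests_per_second
    )


task_id = "task_id:1"  # placeholder id, as in the examples above
print(task_status_request(task_id))
print(task_cancel_request(task_id))
print(reindex_rethrottle_request(task_id, 500))
```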

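Renaming a field

_reindex can be used to build a copy of an index with renamed fields. 
Say you create an index containing documents that look like this: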

POST test/test/1?refresh
{
  "text": "words words",
  "flag": "foo"
}

To rename the flag field to tag, let _reindex create the new index and rename the field with a script:

POST _reindex
{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test2"
  },
  "script": {
    "inline": "ctx._source.tag = ctx._source.remove(\"flag\")"
  }
}
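The script relies on remove returning the removed value, which is then assigned to the new key. The same move can be mimicked on a plain Python dict as a rough analogy; the function rename_field is hypothetical and is not part of any Elasticsearch client.

```python
def rename_field(source, old_name, new_name):
    """Return a copy of source with old_name renamed to new_name."""
    renamed = dict(source)
    # pop() removes the key and hands back its value (None if the key is absent),
    # which is the role ctx._source.remove("flag") plays in the script above.
    renamed[new_name] = renamed.pop(old_name, None)
    return renamed


doc = {"text": "words words", "flag": "foo"}
print(rename_field(doc, "flag", "tag"))
```

After the reindex, fetching document 1 from test2 should show the value under tag instead of flag, with the text field unchanged.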

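Manual slicing
Reindex supports Sliced Scroll, which lets you parallelize the process manually. 
Each request handles one slice; the request below handles slice 0 of 2: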
 
POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 0,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

Issuing one such request per slice id (0 and 1 here) parallelizes the process manually.
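Writing one body per slice by hand gets tedious as the slice count grows, so here is a sketch that generates them. The helper name build_sliced_bodies is hypothetical, and the code only constructs the request bodies; submitting them concurrently is left to the caller.

```python
import json


def build_sliced_bodies(source_index, dest_index, slices):
    """Return one _reindex body per slice, with ids 0 .. slices - 1."""
    return [
        {
            "source": {"index": source_index, "slice": {"id": slice_id, "max": slices}},
            "dest": {"index": dest_index},
        }
        for slice_id in range(slices)
    ]


for body in build_sliced_bodies("twitter", "new_twitter", 2):
    print("POST _reindex")
    print(json.dumps(body, indent=2))
```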

You can use _reindex in combination with Painless to reindex daily indices 
so that a new template is applied to the existing documents.

Reindex can be used to extract a random subset of an index for testing:

POST _reindex
{
  "size": 10,
  "source": {
    "index": "twitter",
    "query": {
      "function_score" : {
        "query" : { "match_all": {} },
        "random_score" : {}
      }
    },
    "sort": "_score"    
  },
  "dest": {
    "index": "random_twitter"
  }
}


Reindex defaults to sorting by _doc, so random_score won’t have any effect 
unless you override the sort to _score, as the request above does with "sort": "_score".
