
Elasticsearch 5.0 Term Vectors API     Category: elasticsearch
Translated and summarized from the original documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/5.0/docs-termvectors.html


Term Vectors
Returns information and statistics on terms in the fields of a particular document.
The document can be stored in the index or supplied artificially by the user.
Term vectors are realtime by default, not near realtime.
This can be changed by setting the realtime parameter to false.

GET http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true

The fields to retrieve can be given as a URL parameter or in the request body:

GET http://localhost:9200/twitter/tweet/1/_termvectors?fields=message
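
The two URLs above differ only in their query parameters. As a minimal sketch, a small helper can assemble such URLs. The function name termvectors_url and its arguments are made up for illustration and are not part of any Elasticsearch client.

```python
from urllib.parse import quote, urlencode


def termvectors_url(base, index, doc_type, doc_id=None, **params):
    """Build a _termvectors URL; doc_id is left out for artificial documents."""
    parts = [quote(index), quote(doc_type)]
    if doc_id is not None:
        parts.append(quote(str(doc_id)))
    parts.append("_termvectors")
    url = base.rstrip("/") + "/" + "/".join(parts)
    # Render Python booleans as the lowercase strings used in the URLs above.
    query = {k: str(v).lower() if isinstance(v, bool) else v
             for k, v in params.items()}
    return url + ("?" + urlencode(query) if query else "")


print(termvectors_url("http://localhost:9200", "twitter", "tweet", 1,
                      fields="message", realtime=False))
```

Only the URL is built here; nothing is sent to a cluster.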


Three types of values can be requested: term information, 
term statistics and field statistics. By default, all term information 
and field statistics are returned for all fields but no term statistics.

Term information
term frequency in the field (always returned)
term positions (positions : true)
start and end offsets (offsets : true)
term payloads (payloads : true), as base64 encoded bytes

If the requested information wasn't stored in the index,
it will be computed on the fly if possible.



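Start and end offsets assume UTF-16 encoding is being used.
To use these offsets to recover the original text of a term, take the substring from a string that is also encoded as UTF-16.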
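
The UTF-16 point matters in languages whose strings are not indexed by UTF-16 code units. Python, for example, indexes strings by code point. The following is a sketch, with the hypothetical helper slice_utf16, of applying returned offsets correctly:

```python
def slice_utf16(text, start_offset, end_offset):
    """Return the substring covered by UTF-16 code-unit offsets."""
    units = text.encode("utf-16-le")  # exactly 2 bytes per UTF-16 code unit
    return units[2 * start_offset:2 * end_offset].decode("utf-16-le")


# U+1F600 is one code point but two UTF-16 code units,
# so "test" starts at UTF-16 offset 3 while its Python index is 2.
text = "\U0001F600 test"
print(slice_utf16(text, 3, 7))  # test
print(text[3:7])                # est
```

For text without characters outside the Basic Multilingual Plane, both ways of slicing give the same result.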

Term statistics
Setting term_statistics to true (default is false) will return:

total term frequency (how often a term occurs in all documents)
document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.

Field statistics
Setting field_statistics to false (default is true) will omit:

document count (how many documents contain this field)
sum of document frequencies (the sum of document frequencies for all terms in this field)
sum of total term frequencies (the sum of total term frequencies of each term in this field)
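
To make the definitions of the term and field statistics concrete, here is a small sketch that computes them by hand for a toy two-document corpus. The second document's text is invented for illustration, and the function corpus_statistics is hypothetical. The dictionary keys mirror the names used in term vector responses (doc_count, sum_doc_freq, sum_ttf).

```python
from collections import Counter


def corpus_statistics(docs):
    """docs: one token list per document that contains the field."""
    ttf = Counter()       # total term frequency: occurrences across all documents
    doc_freq = Counter()  # document frequency: number of documents containing the term
    for tokens in docs:
        ttf.update(tokens)
        doc_freq.update(set(tokens))
    field_stats = {
        "doc_count": len(docs),
        "sum_doc_freq": sum(doc_freq.values()),
        "sum_ttf": sum(ttf.values()),
    }
    return ttf, doc_freq, field_stats


docs = [
    "twitter test test test".split(),
    "another twitter test".split(),  # hypothetical second document
]
ttf, doc_freq, field_stats = corpus_statistics(docs)
print(ttf["test"], doc_freq["test"])  # 4 2
print(field_stats)
```

In this corpus "test" occurs four times in total (ttf) but in only two documents (doc_freq), which is the distinction between the two term statistics.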


Terms Filtering
With the parameter filter, the terms returned can also be filtered based on their tf-idf scores.
This can be useful in order to find a good characteristic vector of a document.
This feature works in a similar manner to the second phase of the More Like This Query.

The following sub-parameters are supported:

max_num_terms
Maximum number of terms that must be returned per field. Defaults to 25.

min_term_freq
Ignore words with less than this frequency in the source doc. Defaults to 1.

max_term_freq
Ignore words with more than this frequency in the source doc. Defaults to unbounded.

min_doc_freq
Ignore terms which do not occur in at least this many docs. Defaults to 1.

max_doc_freq
Ignore words which occur in more than this many docs. Defaults to unbounded.

min_word_length
The minimum word length below which words will be ignored. Defaults to 0.

max_word_length
The maximum word length above which words will be ignored. Defaults to unbounded (0).
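
How these sub-parameters interact can be sketched in a few lines of Python. This is an illustration only: the function filter_terms and the tf-idf formula are assumptions made for the example, not Elasticsearch's actual scoring code. Only the parameter names and defaults come from the list above.

```python
import math


def filter_terms(terms, num_docs, max_num_terms=25, min_term_freq=1,
                 max_term_freq=None, min_doc_freq=1, max_doc_freq=None,
                 min_word_length=0, max_word_length=0):
    """terms: {term: (term_freq, doc_freq)}; returns [(term, score)], best first."""
    scored = []
    for term, (tf, df) in terms.items():
        if tf < min_term_freq or (max_term_freq is not None and tf > max_term_freq):
            continue
        if df < min_doc_freq or (max_doc_freq is not None and df > max_doc_freq):
            continue
        # max_word_length == 0 means unbounded, as in the parameter list above.
        if len(term) < min_word_length or (max_word_length and len(term) > max_word_length):
            continue
        # Simplified tf-idf (an assumption, not the exact formula used by Elasticsearch).
        score = tf * math.log((num_docs + 1) / (df + 1))
        scored.append((term, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:max_num_terms]


# Hypothetical (term_freq, doc_freq) values in an index of 100 documents.
terms = {"hello": (1, 50), "es": (1, 5), "lucene": (1, 2), "the": (1, 100)}
print([t for t, _ in filter_terms(terms, num_docs=100, max_num_terms=3)])
```

Rare terms score highest, so a term that appears in every document ("the") is the first to be cut when max_num_terms limits the result.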

The term and field statistics are not accurate. Deleted documents are not taken into account.
The information is only retrieved for the shard the requested document resides in,
so the statistics are useful only as relative measures; the absolute numbers have no meaning here.

By default, when requesting term vectors of artificial documents,
a shard to get the statistics from is randomly selected.
Use routing only to hit a particular shard.


First, create an index that stores term vectors, payloads, etc.:

PUT http://localhost:9200/twitter/

{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
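
To get a feel for what the fulltext_analyzer above would store per term, here is a rough Python imitation of a whitespace tokenizer followed by the lowercase and type_as_payload filters. It is a sketch under assumptions, not the Lucene implementation. The function analyze is hypothetical, the payload assumes the token type "word", and the offsets are Python code-point offsets, which match UTF-16 offsets only for text within the Basic Multilingual Plane.

```python
import base64
import re


def analyze(text):
    """Approximate whitespace tokenization with lowercasing and a type payload."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "term": match.group().lower(),
            "position": position,
            "start_offset": match.start(),
            "end_offset": match.end(),
            # Assumed token type "word", base64-encoded as payloads are in responses.
            "payload": base64.b64encode(b"word").decode("ascii"),
        })
    return tokens


for token in analyze("twitter test test test"):
    print(token)
```

Each occurrence of a term gets its own position and offsets, which is why "test" yields three token entries for this text.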


After adding documents, request all information and statistics for the text field of document 1:

GET http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}
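
The response to such a request nests the results under term_vectors, then the field name, then field_statistics and terms. The sketch below walks that structure. The sample_response is a hand-written, abbreviated illustration rather than captured output, and the helper summarize is hypothetical.

```python
import base64

# Abbreviated, hand-written sample in the shape of a term vectors response.
sample_response = {
    "_id": "1",
    "found": True,
    "term_vectors": {
        "text": {
            "field_statistics": {"doc_count": 2, "sum_doc_freq": 6, "sum_ttf": 8},
            "terms": {
                "test": {
                    "doc_freq": 2, "ttf": 4, "term_freq": 3,
                    "tokens": [
                        {"position": 1, "start_offset": 8, "end_offset": 12, "payload": "d29yZA=="},
                        {"position": 2, "start_offset": 13, "end_offset": 17, "payload": "d29yZA=="},
                        {"position": 3, "start_offset": 18, "end_offset": 22, "payload": "d29yZA=="},
                    ],
                },
                "twitter": {
                    "doc_freq": 2, "ttf": 2, "term_freq": 1,
                    "tokens": [
                        {"position": 0, "start_offset": 0, "end_offset": 7, "payload": "d29yZA=="},
                    ],
                },
            },
        }
    },
}


def summarize(response, field):
    """Return (field_statistics, per-term rows) for one field of a response."""
    vector = response["term_vectors"][field]
    rows = []
    for term, info in sorted(vector["terms"].items()):
        tokens = info.get("tokens", [])
        rows.append({
            "term": term,
            "term_freq": info["term_freq"],
            "ttf": info.get("ttf"),            # present only with term_statistics
            "doc_freq": info.get("doc_freq"),  # present only with term_statistics
            "positions": [t["position"] for t in tokens if "position" in t],
            # Payloads are base64 encoded bytes; decode them for display.
            "payloads": [base64.b64decode(t["payload"]).decode("utf-8")
                         for t in tokens if "payload" in t],
        })
    return vector.get("field_statistics"), rows


stats, rows = summarize(sample_response, "text")
print(stats)
for row in rows:
    print(row)
```

The .get calls keep the helper usable when term_statistics or field_statistics were not requested and the corresponding keys are absent.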

Term vectors which are not explicitly stored in the index are automatically computed on the fly. 

Artificial documents
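Term vectors can also be generated for a document supplied in the request body (doc) instead of one stored in the index.

GET http://localhost:9200/twitter/tweet/_termvectors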
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  }
}
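
Since the shard used for statistics is otherwise picked at random for artificial documents, a client may want to add a routing value. A minimal sketch using only the Python standard library follows. The helper artificial_termvectors_request and the routing value "user1" are hypothetical, and POST is used here because sending a body with GET is awkward in urllib.

```python
import json
import urllib.request
from urllib.parse import quote


def artificial_termvectors_request(doc, fields=None, routing=None,
                                   base="http://localhost:9200/twitter/tweet"):
    """Prepare (but do not send) a term vectors request for an artificial document."""
    url = base + "/_termvectors"
    if routing is not None:
        url += "?routing=" + quote(str(routing))
    body = {"doc": doc}
    if fields:
        body["fields"] = fields
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = artificial_termvectors_request(
    {"fullname": "John Doe", "text": "twitter test test test"},
    routing="user1",  # hypothetical routing value
)
print(req.get_method(), req.full_url)
# Sending requires a running cluster: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes the URL and body easy to inspect before anything touches the cluster.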

The fields of the supplied document are analyzed and their term statistics are computed.

Per-field analyzer

A different analyzer than the one configured for the field can be provided by using the per_field_analyzer parameter.
For example, with the keyword analyzer the whole fullname value is returned as a single term:

{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  },
  "fields": ["fullname"],
  "per_field_analyzer" : {
    "fullname": "keyword"
  }
}


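For comparison, the same request with the standard analyzer, which splits the value into the lowercase terms john and doe:

{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  },
  "fields": ["fullname"],
  "per_field_analyzer" : {
    "fullname": "standard"
  }
}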

The terms returned can be filtered based on their tf-idf scores, here keeping at most the three highest-scoring terms:

{
    "doc": {
      "message": "hello  es  lucene"
    },
    "term_statistics" : true,
    "field_statistics" : true,
    "positions": false,
    "offsets": false,
    "filter" : {
      "max_num_terms" : 3,
      "min_term_freq" : 1,
      "min_doc_freq" : 1
    }
}
