Reposted

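Introduction to the Elasticsearch API

Elasticsearch provides many REST APIs for managing Clusters, Indices, and Documents. Several parameters are common to these APIs:

pretty: returns pretty-printed JSON that is easier to read.
human: adds human-readable units to values such as sizes and times in the response.
source: carries what would otherwise be the POST request body, so that the request can be sent with GET instead.

The endpoint descriptions below all omit the server address.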
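
Because the endpoints are written without a server address, it can help to see how a full URL is assembled. The following is a minimal sketch, assuming the server listens on localhost:9200 (the address used in the explain example later in this article); `es_url` is a hypothetical helper, not part of any Elasticsearch client.

```python
from urllib.parse import urlencode

# Assumed server address; adjust it for your own cluster.
BASE_URL = "http://localhost:9200"

def es_url(path, flags=(), **params):
    """Join the base address, an API path, value-less flags, and key=value parameters."""
    parts = list(flags)                   # flags such as pretty and human carry no value
    if params:
        parts.append(urlencode(params))   # percent-encodes key=value pairs
    query = "&".join(parts)
    return BASE_URL + path + ("?" + query if query else "")

print(es_url("/_cluster/stats", flags=("pretty", "human")))
```

The printed URL can then be passed to curl or any HTTP client.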

Cluster API

The Cluster API is used to manage the Cluster. A commonly used endpoint:

GET /_cluster/stats?pretty&human : shows Cluster statistics; the pretty and human parameters make the response easier to read.

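Index API

The Index API is used to create, update, and delete Indices. Commonly used endpoints:

PUT /blogs : creates an index named blogs.
GET /blogs?pretty : retrieves information about the blogs index.
DELETE /blogs : deletes the blogs index.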

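Document API

The Document API is used to add, update, and delete documents in an Index. Using the blogs index created above and an article type as the example, the commonly used endpoints are:

POST /blogs/article : adds a new document of type article to the blogs index, with an automatically generated ID.
PUT /blogs/article/1 : adds or replaces the document with ID 1.
POST /blogs/article/1/_update : updates the document with ID 1.
GET /blogs/article/1?pretty : retrieves the document with ID 1.
DELETE /blogs/article/1 : deletes the document with ID 1.
POST /blogs/article/_bulk : performs add, update, delete, and other document operations in a batch.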
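
The _bulk endpoint expects a newline-delimited body: one JSON line describing the action, followed (for every action except delete) by one JSON line with the document or partial update. The sketch below only builds such a body; `bulk_body` and the sample documents are illustrative assumptions, and nothing is sent to a server.

```python
import json

def bulk_body(actions):
    """Serialize (action, metadata, source) tuples into the newline-delimited bulk format."""
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))   # action line, e.g. {"index": {"_id": "1"}}
        if source is not None:                     # delete has no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"                 # the body must end with a newline

body = bulk_body([
    ("index",  {"_id": "1"}, {"title": "hello world"}),
    ("update", {"_id": "1"}, {"doc": {"title": "hello again"}}),
    ("delete", {"_id": "1"}, None),
])
print(body)
```

The resulting string would be the request body of POST /blogs/article/_bulk.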

Other APIs

Besides the APIs above, Elasticsearch provides a number of auxiliary APIs, such as the cat API and the explain API.

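cat API

The cat API shows basic information in a compact form. Commonly used parameters:

v: includes the column headers in the output.
help: lists the available columns.
h: selects which columns to return.

Commonly used endpoints:

GET /_cat/indices?v : lists all indices; the v parameter adds the column headers.
GET /_cat/count : shows the number of documents in the cluster.
GET /_cat/count/blogs : shows the number of documents in the blogs index.
GET /_cat/nodes : lists all nodes.
GET /_cat/plugins : lists all plugins.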

explain API

The explain API shows the score breakdown for a given Document under a specific query, for example:

```
curl -XGET 'localhost:9200/blogs/article/1/_explain' -d '{
      "query" : {
        "term" : { "title" : "hello world" }
      }
}'
```

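Search API

The Search API retrieves the documents that have been added to an Index. It covers a lot of ground, so it gets its own section here.

During a search, Elasticsearch matches the search keywords against the Fields of each Document. If a Field is of type string, the Field is analyzed (split into terms) when a Document is added to the Index, and the keywords are analyzed in the same way at search time.

A search request can use the simple URI Search, or the slightly more complex but more flexible Query DSL.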

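URI Search

URI Search uses a single q parameter to carry the keywords.

For example, to find documents of type article in the blogs index whose title field contains the keyword hello:

GET /blogs/article/_search?pretty&q=title:hello

To search several Indices or several Types at once, separate them with commas (,):

GET /blogs,blogs2/article/_search?pretty&q=title:hello

You can also use _all to stand for all Indices or Types, or omit the Index and Type altogether to search across every Index and Type.

To match several Fields at once, leave the Field name out of q. Every Field that has not disabled include_in_all in the Index definition is then searched.

The following request matches against both the title and content Fields:

GET /blogs/article/_search?pretty&q=hello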
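
The path rules above (comma-separated Indices and Types, optional Field prefix in q) can be sketched as a small path builder. `uri_search_path` is a hypothetical helper written for this illustration; it only builds the request path and percent-encodes the keywords.

```python
from urllib.parse import quote

def uri_search_path(q, indices=(), types=()):
    """Build a URI Search path for the given Indices, Types, and q expression."""
    segments = []
    if indices:
        segments.append(",".join(indices))
    elif types:
        segments.append("_all")            # a Type needs an Index segment before it
    if types:
        segments.append(",".join(types))
    segments.append("_search")
    # Keep ':' readable so that field:keyword stays intact; encode everything else.
    return "/" + "/".join(segments) + "?pretty&q=" + quote(q, safe=":")

print(uri_search_path("title:hello", ["blogs", "blogs2"], ["article"]))
print(uri_search_path("hello world"))
```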

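Query DSL

With Query DSL, the search parameters are placed in the HTTP request body. If sending a request body is inconvenient, the same Query DSL can be passed in the source parameter with the same effect.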
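
As a rough illustration of the source parameter, the sketch below serializes a query to JSON and percent-encodes it into the query string, then decodes it again to confirm that nothing is lost. The match query used here is only an example, and no request is actually sent.

```python
import json
from urllib.parse import urlencode, parse_qs

# Example query; any Query DSL body could be used in its place.
query = {"query": {"match": {"title": "hello"}}}

# Put the JSON body into the source parameter instead of the request body.
path = "/blogs/article/_search?" + urlencode({"source": json.dumps(query)})
print(path)

# Round trip: decoding the parameter yields the original query again.
decoded = json.loads(parse_qs(path.split("?", 1)[1])["source"][0])
print(decoded == query)
```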

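Commonly used Query DSL queries include match, multi_match, and term. Besides using them on their own, you can combine them with compound queries such as bool.

match and multi_match perform full-text matching on Fields of type string, while term performs exact matching on a Field. match and term operate on a single Field; multi_match can match several Fields at once.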

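match query

The match query comes in three types: boolean (the default), phrase, and phrase_prefix. Set type to change the type of a match query.

A boolean match uses the or operator by default, so a Field matches as soon as it contains any one of the analyzed keywords. Set minimum_should_match to change the minimum number of terms that must match, or set operator to and to require every term.

For example, suppose the search keywords are 中国交通银行 and the analyzer splits them into the three terms 中国 (China), 交通 (communications), and 银行 (bank). The default match query succeeds if the analyzed Field contains any one of those three terms. With minimum_should_match set to 2, at least two terms must match. With operator set to and, all three terms must be present, but their order within the Field does not matter.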
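
The three variants described above differ only in the options passed alongside the query text. A minimal sketch of the request bodies, where `match_query` is a hypothetical helper and title is an assumed Field name:

```python
import json

def match_query(field, text, **options):
    """Build a match query body; options are passed through unchanged."""
    return {"query": {"match": {field: {"query": text, **options}}}}

any_term  = match_query("title", "中国交通银行")                           # default: or
two_terms = match_query("title", "中国交通银行", minimum_should_match=2)   # at least two terms
all_terms = match_query("title", "中国交通银行", operator="and")           # every term

print(json.dumps(all_terms, ensure_ascii=False, indent=2))
```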

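match_phrase query

The match_phrase query is a match query of type phrase. It is stricter than the boolean type: by default the analyzed Field must contain every term of the analyzed keywords, and the terms must appear in the Field in exactly the same order as in the keywords. Set slop to relax the ordering requirement.

For example, suppose the search keywords are 中国银行 and the analyzer splits them into the two terms 中国 and 银行. The default match_phrase query matches neither 中国的好银行 nor 银行在中国. Setting slop to 1 allows a match on 中国的交通银行: the analyzer splits it into 中国, 交通, and 银行 (的 is usually treated as a stopword and removed from the analyzed Field), so shifting 中国 one position to the right (中国 银行) makes the term order in the Field agree with the term order in the keywords. By the same reasoning, a larger slop is needed to match 银行在中国: reversing two terms costs a slop of 2 even when they are adjacent, and the intervening 在 (assuming it is not configured as a stopword) adds one more position.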
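
The slop arithmetic can be checked with a simplified position model. The sketch below is an assumption-laden illustration, not Elasticsearch's implementation: it takes already-analyzed term lists, assumes each term occurs at most once in the Field, and assumes removed stopwords leave no position gap. For each query term it subtracts the term's position in the query from its position in the Field; the spread of those shifted positions is the slop needed.

```python
def required_slop(query_terms, field_terms):
    """Smallest slop that lets the phrase match, or None if a term is missing.

    Simplified model: every term is assumed to occur at most once in the Field.
    """
    shifted = []
    for offset, term in enumerate(query_terms):
        if term not in field_terms:
            return None
        shifted.append(field_terms.index(term) - offset)
    return max(shifted) - min(shifted)

query = ["中国", "银行"]
print(required_slop(query, ["中国", "银行"]))           # exact phrase
print(required_slop(query, ["中国", "交通", "银行"]))    # one term in between
print(required_slop(query, ["银行", "在", "中国"]))      # reversed, one term in between
```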

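In practice, the phrase type of match is extremely useful. The usual assumption is that the closer the terms are to each other, the higher the relevance ("Closer is Better"), and the example later in this article uses match_phrase for that reason.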

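match_phrase_prefix query

A match query of type phrase_prefix. It behaves like match_phrase, except that the last term is matched as a prefix. It is typically used for autocomplete suggestions in a search box.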

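multi_match query

Similar to the match query, but able to match several Fields at once. multi_match comes in five types: best_fields (the default), most_fields, cross_fields, phrase, and phrase_prefix.

best_fields: runs a match query against each Field with the search keywords and uses the score of the best-scoring Field as the Document's score.
most_fields: like best_fields, but uses the average score across all Fields as the Document's score.
cross_fields: treats all the Fields as one combined Field and runs match against it with the search keywords.
phrase: like best_fields, but runs a match_phrase query against each Field.
phrase_prefix: like best_fields, but runs a match_phrase_prefix query against each Field.

In the fields list of multi_match you can also assign a weight to a Field. In the query below, for example, the title Field is given a weight of 3:

```
{
    "query": {
        "multi_match" : {
            "query":      "hello world",
            "type":       "best_fields",
            "fields":     [ "title^3", "content" ],
            "operator":   "and"
        }
    }
}
```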

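term query

The term query performs exact matching on a Field. A Field of type string that has been split into terms by an analyzer usually cannot be matched with term using its original, unanalyzed value.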
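
A toy model shows why term struggles with analyzed Fields. The sketch assumes a stand-in analyzer that lowercases and splits on whitespace (real analyzers do more) and treats a term query as an exact lookup among the indexed terms; both functions are hypothetical.

```python
def toy_analyze(text):
    """Stand-in analyzer (assumption): lowercase, then split on whitespace."""
    return text.lower().split()

def term_matches(term, indexed_terms):
    """A term query compares its value to the indexed terms as-is, without analysis."""
    return term in indexed_terms

indexed = toy_analyze("Hello World")          # what ends up in the index
print(indexed)
print(term_matches("Hello World", indexed))   # the original value is not an indexed term
print(term_matches("hello", indexed))         # a single analyzed term is
```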

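bool query

The bool query combines the queries described above. It has four clause types: must, filter, should, and must_not.

must: the enclosed queries must match, and their scores contribute to the total score.
filter: like must, but the scores do not contribute to the total score.
should: if the bool query has no must or filter clause, at least one of the queries under should must match; otherwise should clauses are optional and only add to the score.
must_not: the enclosed queries must not match.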

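The more queries inside the bool query a Document matches, the higher its score. A common pattern is to use must to define the broad set of results and should to raise the ranking of Documents that match additional criteria, for example:

```
{
    "query": {
        "bool" : {
            "must" : {
                "match" : { "title" : "hello" }
            },
            "should" : [
                {
                    "match" : { "content" : "first" }
                },
                {
                    "match" : { "content" : "cat" }
                }
            ]
        }
    }
}
```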

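Other request parameters

The Search API has a few other important parameters, for example:

highlight: keyword highlighting, used in Query DSL. A highlighted Field must appear in the query, for example:

```
{
    "query" : {...},
    "highlight" : {
        "fields" : {
            "title": {},
            "content" : {}
        }
    }
}
```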

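from, size: pagination of search results, available in both URI Search and Query DSL. from is the zero-based offset of the first result to return (not a page number) and defaults to 0; size is the number of results per page and defaults to 10.
sort: ordering of search results, used in Query DSL. Results are sorted by _score in descending order by default. Sorting can combine several Fields, for example by score descending first, and then by date descending for equal scores: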
```
{
    "query" : {...},
    "sort": [
        {"_score" : "desc"},
        {"date" : "desc"}
    ]
}
```
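
Because from is an offset rather than a page number, a page number has to be converted before it is sent. A minimal sketch, assuming 1-based page numbers; `page_params` is a hypothetical helper:

```python
def page_params(page, size=10):
    """Convert a 1-based page number into the from/size search parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"from": (page - 1) * size, "size": size}

print(page_params(1))        # first page with the default size
print(page_params(3, 20))    # third page, 20 results per page
```

The returned dictionary can be merged into a Query DSL body or appended to a URI Search as from and size parameters.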

Response format

A search request returns JSON that looks roughly like this:

```
{
    "took" : 226,
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    },
    "hits" : {
        "total" : 1,
        "max_score" : 1.0,
        "hits" : [ {
            "_index" : "blogs",
            "_type" : "article",
            "_id" : "AVaHSgLT4Fm3JNywIamW",
            "_score" : 1.0,
            "_source" : {
                "title" : "hello world",
                "date" : "2016-08-14T09:03:20",
                "content" : "This is my first blog post."
            }
        } ]
    }
}
```

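The more important fields have the following meanings:

took: elapsed time, in milliseconds.
timed_out: whether the request timed out.
hits: the Documents matched by this request. If the request does not set the pagination parameters from and size, the first 10 Documents are returned by default.
- total: the total number of matching Documents.
- hits: the list of Documents on the current page.
  - _index: the Index that contains the Document.
  - _type: the Type of the Document.
  - _score: the relevance score of the Document; the higher the score, the higher the relevance.
  - _id: the ID of the Document. If no ID was set when the Document was added, a random string is generated automatically.
  - _source: the Fields of the Document. Use the _source_include request parameter to limit which Fields are returned.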
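
Reading these fields from a response is plain JSON handling. The sketch below parses a trimmed copy of the sample response above and pulls out the total and a summary of each hit; `summarize` is a hypothetical helper, and it assumes that total is a plain number as in the sample.

```python
import json

# Trimmed copy of the sample response shown above.
response_text = """
{
  "took": 226,
  "timed_out": false,
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [{
      "_index": "blogs",
      "_type": "article",
      "_id": "AVaHSgLT4Fm3JNywIamW",
      "_score": 1.0,
      "_source": {"title": "hello world",
                  "date": "2016-08-14T09:03:20",
                  "content": "This is my first blog post."}
    }]
  }
}
"""

def summarize(response):
    """Return the total hit count and (id, score, title) for each hit on the page."""
    hits = response["hits"]
    rows = [(h["_id"], h["_score"], h["_source"]["title"]) for h in hits["hits"]]
    return hits["total"], rows

total, rows = summarize(json.loads(response_text))
print(total, rows)
```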