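For Elasticsearch, when using a match query, two measures matter: recall and precision.
Recall here = number of matched documents / total number of documents, so the more documents a query matches, the higher the recall. (Strictly speaking, recall is measured against the set of relevant documents rather than the whole index; this post uses the looser definition.)
Precision here means how well the documents we actually want are ranked among the matches: the higher their relevance scores and the earlier they appear in the results, the higher the precision.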
With match, if the search text is java spark, every document that contains either java or spark is returned. Using match alone therefore gives very high recall but low precision.
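A match_phrase query, by contrast, requires every term to appear in the document's field, and the terms must lie within the distance allowed by slop. If the search text is java spark, documents that contain only java or only spark are not returned, and a document that contains both is still not returned if the terms are farther apart than slop allows. Precision is high, but recall can be too low: the query may return very few documents, or none at all.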
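The difference between the two queries can be illustrated with a toy model in Python. This is only a simplified sketch, not how Lucene implements matching: the tokenizer is a plain regex split, and the functions `match_any`, `phrase_distance`, and `match_phrase` are hypothetical helpers written for this illustration. The distance used here is the spread of the term positions after shifting each one by its offset in the query, which is an assumption that approximates slop for queries without repeated terms.

```python
import re
from itertools import product

DOC_1 = ("spark is best big data solution based on scala ,"
         "an programming language similar to java spark")
DOC_2 = "i think java is the best programming language"


def tokenize(text):
    """Lowercase and split on non-word characters (a stand-in for an analyzer)."""
    return re.findall(r"\w+", text.lower())


def match_any(query, doc):
    """Toy `match`: true if the document contains at least one query term."""
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))


def phrase_distance(query, doc):
    """Smallest positional spread of the query terms, or None if a term is missing."""
    query_terms = tokenize(query)
    doc_tokens = tokenize(doc)
    positions = [[i for i, tok in enumerate(doc_tokens) if tok == term]
                 for term in query_terms]
    if any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        # Shift each position by the term's offset in the query; an exact
        # in-order phrase then collapses to a single value (spread 0).
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        spread = max(shifted) - min(shifted)
        best = spread if best is None else min(best, spread)
    return best


def match_phrase(query, doc, slop=0):
    """Toy `match_phrase`: all terms present and within `slop`."""
    distance = phrase_distance(query, doc)
    return distance is not None and distance <= slop


for name, doc in (("doc 1", DOC_1), ("doc 2", DOC_2)):
    print(name, match_any("java spark", doc), match_phrase("java spark", doc, slop=10))
```

In this model the second document passes `match_any` because it contains java, but fails `match_phrase` at any slop because spark is missing entirely.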
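Sometimes we want a document to be returned when it matches only some of the terms, which raises recall. At the same time we want to keep the proximity scoring of match_phrase, so that the closer the terms are to each other, the higher the score and the earlier the document is returned. In other words, if the search text is java spark, any document containing java or spark is returned, but a document that contains both java and spark very close together gets a much higher score and appears near the top of the results.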
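This can be done by combining match and match_phrase in a bool query: put match in the must clause so that as many documents as possible are matched, and put match_phrase in the should clause to raise the relevance score of the documents we want, so that they are returned first.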
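As a minimal sketch, the request body for this combination can be assembled in Python as a plain dictionary. The helper name `build_balanced_query` and its default `slop` of 10 are assumptions made for illustration; only the `bool`, `must`, `should`, `match`, `match_phrase`, `query`, and `slop` keys come from the Elasticsearch query DSL.

```python
import json


def build_balanced_query(field, text, slop=10):
    """Build a bool query: `match` in must for recall, `match_phrase` in should for ranking."""
    return {
        "query": {
            "bool": {
                # must: a document has to match at least one of the terms
                "must": [{"match": {field: text}}],
                # should: optional clause that adds score when the terms are close together
                "should": [
                    {"match_phrase": {field: {"query": text, "slop": slop}}}
                ],
            }
        }
    }


body = build_balanced_query("test_field", "java spark")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `GET /test_index/_search` request. Because the should clause sits next to a must clause, it is optional: it never removes documents from the result, it only raises the score of those that also satisfy the phrase.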
Examples:
Using match only
GET /test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "test_field": "java spark"
          }
        }
      ]
    }
  }
}
Output:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.031828,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.031828,
        "_source" : {
          "test_field" : "spark is best big data solution based on scala ,an programming language similar to java spark"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.21110919,
        "_source" : {
          "test_field" : "i think java is the best programming language"
        }
      }
    ]
  }
}
Using match_phrase only
GET /test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "test_field": {
              "query": "java spark",
              "slop": 10
            }
          }
        }
      ]
    }
  }
}
Output:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.7704125,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.7704125,
        "_source" : {
          "test_field" : "spark is best big data solution based on scala ,an programming language similar to java spark"
        }
      }
    ]
  }
}
Combining match with proximity matching to balance recall and precision
GET /test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "test_field": "java spark"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "test_field": {
              "query": "java spark",
              "slop": 10
            }
          }
        }
      ]
    }
  }
}
Output:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.8022406,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.8022406,
        "_source" : {
          "test_field" : "spark is best big data solution based on scala ,an programming language similar to java spark"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.21110919,
        "_source" : {
          "test_field" : "i think java is the best programming language"
        }
      }
    ]
  }
}
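The scores in the three outputs are consistent with a bool query adding up the scores of the clauses a document matches. The short check below uses only the `_score` values reported above; the variable names are illustrative, and the small tolerance is an assumption to allow for float rounding in the reported scores.

```python
import math

# _score values copied from the three outputs above
match_score_doc1 = 1.031828      # match only, document 1
phrase_score_doc1 = 0.7704125    # match_phrase only, document 1
combined_score_doc1 = 1.8022406  # must(match) + should(match_phrase), document 1

match_score_doc2 = 0.21110919     # match only, document 2
combined_score_doc2 = 0.21110919  # combined query, document 2

# Document 1 satisfies both clauses, so its combined score is the sum of the two.
doc1_adds_up = math.isclose(
    match_score_doc1 + phrase_score_doc1, combined_score_doc1, abs_tol=1e-6
)

# Document 2 satisfies only the must clause, so its score is unchanged.
doc2_unchanged = combined_score_doc2 == match_score_doc2

print(doc1_adds_up, doc2_unchanged)
```

Document 2 still appears in the results, which preserves recall, while document 1 moves further ahead because of the phrase bonus, which improves the ranking.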