Suppose we have two sentences:
java is my favourite programming language, and I also think spark is a very good big data system.
java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.
Search for java spark with a match query:
{ "query": { "match": { "content": "java spark" } } }
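A match query only finds documents that contain java or spark (with the default or operator, either term is enough); it knows nothing about whether java and spark are close to each other.
If we want documents in which java and spark are close together to be returned first, they need a higher relevance score, and that is where proximity matching comes in.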
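Here are the two requirements we want to implement:
(1) Search for java spark where the two words must be adjacent, with no other term between them.
(2) Search for java spark where the closer together java and spark are, the higher the document's score and the higher it ranks.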
Prepare the data:
PUT /test_index/_create/1 { "content": "java is my favourite programming language, and I also think spark is a very good big data system." }
PUT /test_index/_create/2 { "content": "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java." }
Requirement 1: search for java spark where the two words must be adjacent, with no other term between them.
A match query cannot do this:
GET /test_index/_search { "query": { "match": { "content": "java spark" } } }
Result (both documents are returned, including document 1, where java and spark are far apart):
{ "took" : 16, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 2, "relation" : "eq" }, "max_score" : 0.4255141, "hits" : [ { "_index" : "test_index", "_type" : "_doc", "_id" : "2", "_score" : 0.4255141, "_source" : { "content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java." } }, { "_index" : "test_index", "_type" : "_doc", "_id" : "1", "_score" : 0.37266707, "_source" : { "content" : "java is my favourite programming language, and I also think spark is a very good big data system." } } ] } }
A match_phrase query can:
GET /test_index/_search { "query": { "match_phrase": { "content": "java spark" } } }
Result (only document 2, where java and spark are adjacent, is returned):
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 0.35695744, "hits" : [ { "_index" : "test_index", "_type" : "_doc", "_id" : "2", "_score" : 0.35695744, "_source" : { "content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java." } } ] } }
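To see how this works, suppose we have two documents:
doc1: hello world, java spark
doc2: hi, spark java
The index stores, for each term, the documents it occurs in and its position in each:
hello   doc1(0)
world   doc1(1)
java    doc1(2) doc2(2)
spark   doc1(3) doc2(1)
The position details are as follows: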
GET /_analyze { "text": ["hello world, java spark"], "analyzer": "standard" }
{ "tokens" : [ { "token" : "hello", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "world", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "java", "start_offset" : 13, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "spark", "start_offset" : 18, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 3 } ] }
GET /_analyze { "text": ["hi, spark java"], "analyzer": "standard" }
{ "tokens" : [ { "token" : "hi", "start_offset" : 0, "end_offset" : 2, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "spark", "start_offset" : 4, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "java", "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 2 } ] }
match_phrase works from these positions stored in the index:
hello   doc1(0)
world   doc1(1)
java    doc1(2) doc2(2)
spark   doc1(3) doc2(1)
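As a rough illustration of where these positions come from, here is a minimal Python sketch that builds a positional inverted index for the two documents. The `tokenize` helper is a hypothetical stand-in that only approximates the standard analyzer (lowercasing and splitting on non-alphanumeric characters); it is not how Elasticsearch is implemented.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumption: a crude approximation of the standard analyzer,
    # good enough for these simple ASCII sentences.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    # term -> doc id -> list of positions at which the term occurs
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term][doc_id].append(position)
    return index


docs = {
    "doc1": "hello world, java spark",
    "doc2": "hi, spark java",
}
index = build_positional_index(docs)

for term in ("hello", "world", "java", "spark"):
    print(term, dict(index[term]))
```

The printed positions for java and spark agree with the `_analyze` output above: java is at position 2 in both documents, while spark is at position 3 in doc1 and position 1 in doc2.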
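A match_phrase query first finds the documents that every term has in common: a document must contain every term of the query, and the term positions must also satisfy the position arithmetic.
doc1 --> java and spark --> java's position is 2 and spark's is 3, so spark's position is exactly 1 greater than java's --> doc1 matches
doc2 --> java and spark --> java's position is 2 and spark's is 1, so spark's position is 1 less than java's rather than 1 greater --> the positions alone rule it out, so doc2 does not match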
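The position check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Elasticsearch's implementation; `phrase_matches` is a hypothetical helper, and the position dictionaries are copied by hand from the `_analyze` output.

```python
def phrase_matches(term_positions, phrase):
    """Return True if the terms of `phrase` occur at consecutive positions.

    term_positions maps each term of one document to its list of positions.
    """
    # Every query term must be present in the document.
    if any(term not in term_positions for term in phrase):
        return False

    following = [set(term_positions[term]) for term in phrase[1:]]
    for start in term_positions[phrase[0]]:
        # The i-th following term must sit exactly i + 1 positions after the first.
        if all(start + i + 1 in positions for i, positions in enumerate(following)):
            return True
    return False


doc1 = {"hello": [0], "world": [1], "java": [2], "spark": [3]}
doc2 = {"hi": [0], "spark": [1], "java": [2]}

print(phrase_matches(doc1, ["java", "spark"]))  # True: spark is at java + 1
print(phrase_matches(doc2, ["java", "spark"]))  # False: spark is at java - 1
```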
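What slop means: the terms of the query string may have to be moved a number of times before they line up with a document; that number of moves is the slop.
A concrete example:
Take the text hello world, java is very good, spark is also very good. and suppose we want match_phrase to match java spark against it. A plain match_phrase query finds nothing: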
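# Start from an empty index: _create fails if ID 1 already exists, and the earlier document 2 would match the phrase
DELETE /test_index
PUT /test_index/_create/1 { "content": "hello world, java is very good, spark is also very good." }
GET /test_index/_search { "query": { "match_phrase": { "content": "java spark" } } }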
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 0, "relation" : "eq" }, "max_score" : null, "hits" : [ ] } }
Now analyze the text to see the positions:
GET /_analyze { "text": ["hello world, java is very good, spark is also very good."], "analyzer": "standard" }
Result:
{ "tokens" : [ { "token" : "hello", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "world", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "java", "start_offset" : 13, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "is", "start_offset" : 18, "end_offset" : 20, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "very", "start_offset" : 21, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "good", "start_offset" : 26, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 }, { "token" : "spark", "start_offset" : 32, "end_offset" : 37, "type" : "<ALPHANUM>", "position" : 6 }, { "token" : "is", "start_offset" : 38, "end_offset" : 40, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "also", "start_offset" : 41, "end_offset" : 45, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "very", "start_offset" : 46, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "good", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 } ] }
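java    is      very    good    spark   is
java    spark
java    -->     spark
java            -->     spark
java                    -->     spark
java is at position 2 and spark is at position 6. In the query, spark sits directly after java, so it has to move 3 times to reach position 6. Setting slop to 3 or more is therefore enough for the document to be found: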
GET /test_index/_search { "query": { "match_phrase": { "content": { "query": "java spark", "slop": 3 } } } }
Result:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 0.21824157, "hits" : [ { "_index" : "test_index", "_type" : "_doc", "_id" : "1", "_score" : 0.21824157, "_source" : { "content" : "hello world, java is very good, spark is also very good." } } ] } }
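The slop of 3 used above can be double-checked with a small Python sketch. It rests on a simplified model, stated here as an assumption: shift each matched position back by the term's offset in the query, and take the spread between the largest and smallest shifted value as the number of moves. `min_slop` is a hypothetical helper, and the model ignores corner cases such as the same term appearing twice in the query.

```python
from itertools import product


def min_slop(term_positions, phrase):
    """Smallest slop at which `phrase` would match, under the simplified model.

    term_positions maps each term of one document to its list of positions.
    Returns None if some query term is missing from the document.
    """
    if any(not term_positions.get(term) for term in phrase):
        return None

    best = None
    # Try every way of picking one position per query term.
    for combo in product(*(term_positions[term] for term in phrase)):
        # Remove each term's offset within the query; an exact phrase
        # match leaves all shifted values equal (spread 0).
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        spread = max(shifted) - min(shifted)
        if best is None or spread < best:
            best = spread
    return best


# Positions copied from the _analyze output above.
doc = {
    "hello": [0], "world": [1], "java": [2], "is": [3, 7],
    "very": [4, 9], "good": [5, 10], "spark": [6], "also": [8],
}

print(min_slop(doc, ["java", "spark"]))  # 3
```

Under the same model, a document in which spark comes directly before java needs a slop of 2 for the query java spark, because swapping two adjacent terms counts as two moves.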
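A match_phrase query with slop is a proximity match. With slop set, matching is looser and many more documents can be returned, but documents in which the terms are closer together get a higher relevance score and are returned first.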
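To get a feel for why closer terms rank higher, the sketch below uses a weight of 1 / (1 + distance), the form of Lucene's classic sloppy-frequency formula. Treat this as an assumption for illustration only: the actual `_score` also depends on the similarity in use (BM25 by default), term statistics, and field length, so the numbers here are not Elasticsearch scores.

```python
def sloppy_weight(distance):
    # Assumption: each phrase match contributes 1 / (1 + distance),
    # where distance is the number of moves needed to line the terms up.
    return 1.0 / (1.0 + distance)


# distance 0 is an exact phrase match; larger distances contribute less.
for distance in range(4):
    print(distance, sloppy_weight(distance))
```

An exact phrase match contributes a weight of 1.0, while a match that needs all 3 moves contributes only 0.25, which is why the document with adjacent terms is ranked first.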