https://github.com/weixsong/elasticlunr.js
A JavaScript full-text search tool. If you are interested, please help star the project.
Elasticlunr.js is a lightweight full-text search engine in JavaScript for browser search and offline search. Elasticlunr.js is developed based on Lunr.js, but is more flexible than lunr.js: it provides query-time boosting and field search. Elasticlunr.js is a bit like Solr, but much smaller and less powerful, while still providing flexible configuration and query-time boosting.
Example
Open your browser's developer tools on this page to follow along, or use Node.js to try it in a different way.
A very simple search index can be created with the following script:
var index = elasticlunr(function () {
  this.addField('title');
  this.addField('body');
  this.setRef('id');
});
Adding documents to the index is as simple as:
var doc1 = {
  "id": 1,
  "title": "Oracle released its latest database Oracle 12g",
  "body": "Yesterday Oracle released its new database Oracle 12g. This would make more money for the company and lead to a nice annual profit report."
};

var doc2 = {
  "id": 2,
  "title": "Oracle released its profit report of 2015",
  "body": "As expected, Oracle released its profit report of 2015. Thanks to good sales of databases and hardware, Oracle's profit in 2015 reached 12.5 billion."
};

index.addDoc(doc1);
index.addDoc(doc2);
Then searching is as simple as:
index.search("Oracle database");
Also, you could do query-time boosting by passing in a configuration:
index.search("Oracle database profit", { fields: { title: {boost: 2}, body: {boost: 1} } });
Elasticlunr.js is developed based on lunr.js, but is more flexible than lunr.js. The main features are described below.
Because elasticlunr.js has a well-designed scoring mechanism, a simple search will meet most requirements:
index.search("Oracle database profit");
It's easy to set up which fields to search in by passing in a JSON configuration, and to set a boost for each search field. With this configuration, elasticlunr.js will only search the query string in the specified fields, weighted by their boosts. If no fields are specified, elasticlunr.js will search all the fields that you configured when you created the index.
The scoring mechanism used in elasticlunr.js is fairly involved; please see the details page for more information.
index.search("Oracle database", { fields: { title: {boost: 2}, body: {boost: 1} } });
Elasticlunr.js also supports a boolean logic setting. If no boolean logic is set, elasticlunr.js uses "OR" logic by default, and with the default "OR" logic elasticlunr.js can reach a high recall.
index.search("Oracle database profit", { fields: { title: {boost: 2}, body: {boost: 1} }, boolean: "OR" });
If you want to increase recall, you can configure the search to expand query tokens. For example, if the user queries "micro" and both "microwave" and "microscope" are in the index, then documents containing "microwave" or "microscope" will also be returned. The results for each expanded query token are penalized, because an expanded token is not the original user query token.
index.search("micro", { fields: { title: {boost: 2, bool: "AND"}, body: {boost: 1} }, bool: "OR", expand: true });
Every document and search query that enters lunr is passed through a text processing pipeline. The pipeline is simply a stack of functions that perform some processing on the text. Pipeline functions act on the text one token at a time, and what they return is passed to the next function in the pipeline.
By default lunr adds a stop word filter and stemmer to the pipeline. You can also add your own processors or remove the default ones depending on your requirements. The stemmer currently used is an English language stemmer, which could be replaced with a non-English language stemmer if required, or a metaphone processor could be added.
var index = lunr(function () {
  this.pipeline.add(function (token, tokenIndex, tokens) {
    // text processing in here
  });

  this.pipeline.after(lunr.stopWordFilter, function (token, tokenIndex, tokens) {
    // text processing in here
  });
});
Tokenization is how lunr converts documents and searches into individual tokens, ready to be run through the text processing pipeline and entered or looked up in the index.
The default tokenizer included with lunr is designed to handle general English text well, although application- or language-specific tokenizers can be used instead.
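As a rough illustration, the default tokenizer splits the input on whitespace (and hyphens) and lowercases each piece; the output shown below is a sketch and may differ slightly between versions:

var tokens = elasticlunr.tokenizer("Oracle released its latest database");
// Roughly: ["oracle", "released", "its", "latest", "database"]
console.log(tokens);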
Stemming increases the recall of the search index by reducing related words down to their stem, so that non-exact search terms still match relevant documents. For example 'search', 'searching' and 'searched' all get reduced to the stem 'search'.
lunr automatically includes a stemmer based on Martin Porter's algorithms.
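For a quick sanity check, the stemmer can be applied to a single token directly (a sketch, assuming elasticlunr exposes the Porter stemmer as elasticlunr.stemmer, as lunr does):

console.log(elasticlunr.stemmer("searching"));  // expected: "search"
console.log(elasticlunr.stemmer("searched"));   // expected: "search"
console.log(elasticlunr.stemmer("searches"));   // expected: "search"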
Stop words are words that are very common and are not useful in differentiating between documents. These are automatically removed by lunr. This helps to reduce the size of the index and improve search speed and accuracy.
The default stop word filter contains a large list of very common words in English. For best results a corpus specific stop word filter can also be added to the pipeline. The search algorithm already penalises more common words, but preventing them from entering the index at all can be very beneficial for both space and speed performance.
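A corpus-specific filter can be written as an ordinary pipeline function, as shown in the pipeline section above. The sketch below assumes a hypothetical list of corpus-specific stop words and registers the filter right after the default stop word filter; returning undefined from a pipeline function drops the token:

var corpusStopWords = { "oracle": true, "database": true };  // hypothetical corpus-specific list

var corpusStopWordFilter = function (token) {
  if (corpusStopWords[token]) return undefined;  // drop very common corpus terms
  return token;                                  // keep everything else
};

var index = elasticlunr(function () {
  this.addField('title');
  this.addField('body');
  this.setRef('id');
  // Run the custom filter right after the default stop word filter.
  this.pipeline.after(elasticlunr.stopWordFilter, corpusStopWordFilter);
});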