How to Code Relevance-Weighted Search with Node & MongoDB

Tags: meta, javascript, node.js, mongodb, reference, search, how to code

By AstroMacGuffin, dated Sat Jul 23 2022 19:14:09 GMT+0000 (Coordinated Universal Time), last updated Thu Aug 18 2022 14:17:35 GMT+0000 (Coordinated Universal Time)

![A magnifying glass](/static/img/120px-Crystal_Clear_action_viewmag.png)

In this post I teach you how to implement a search that sorts results by relevance. An article's relevance is the number of times each search keyword shows up in the article, added together across all keywords. To accomplish this, you'll rebuild a search index whenever an article is saved. Regular expressions are used to match the search keywords against the keywords stored in MongoDB, so the search keyword "search" will match the database keyword "searching", but the search keyword "searching" won't match the database keyword "search".

### Ingredients

- some kind of CRUD app already written for articles using Node & MongoDB
- an `articles_search` collection for storing strength of keyword relevance per word per article
- a method for removing punctuation and special characters from the content you want to index
- database methods for deleting and inserting into `articles_search`
- a mongo query that matches search terms against a field in `articles_search`
- a loop to add up total keyword strength per article
- a sorting algorithm to sort articles by total keyword strength

Whenever you add or update an article:

- delete all old keyword entries for that article from `articles_search`
- parse the contents you want to index from scratch, then insert the new entries
- update the article content normally
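
To make the relevance rule concrete before any database is involved, here is a minimal in-memory sketch. The function names `buildIndexEntries` and `scoreArticles` are hypothetical, and the tokenizer is a simplification (it splits on anything that isn't a letter or digit) rather than the utility method used in this post.

```javascript
// In-memory sketch of indexing and scoring; no database involved.
// buildIndexEntries and scoreArticles are hypothetical names.

// Count how often each lower-cased word appears in a piece of text.
function buildIndexEntries(articleId, text) {
  const counts = Object.create(null);
  for (const word of text.toLowerCase().split(/[^a-z0-9]+/)) {
    if (word !== '') counts[word] = (counts[word] || 0) + 1;
  }
  return Object.keys(counts).map((term) => ({ articleId, term, count: counts[term] }));
}

// Add up the counts of every index entry whose term matches any search keyword.
function scoreArticles(entries, searchString) {
  const patterns = searchString
    .split(' ')
    .filter((word) => word !== '')
    .map((word) => new RegExp(word, 'i'));
  const totals = Object.create(null);
  for (const entry of entries) {
    if (patterns.some((pattern) => pattern.test(entry.term))) {
      totals[entry.articleId] = (totals[entry.articleId] || 0) + entry.count;
    }
  }
  return totals;
}

const entries = [
  ...buildIndexEntries('a1', 'Searching is fun. Search, search, search!'),
  ...buildIndexEntries('a2', 'Nothing about that topic here.'),
];

console.log(scoreArticles(entries, 'search').a1);    // 4 ("searching" once + "search" three times)
console.log(scoreArticles(entries, 'searching').a1); // 1 ("searching" does not match "search")
```

Article `a2` never appears in the totals because none of its words match, which is the same reason unrelated articles drop out of the real search.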

For the search, it's actually two find operations (probably because I haven't looked into how MongoDB handles join operations yet). First you get the entries from `articles_search` so you can have the matching `articles._id` values; second you get the article content.

Some of my code to get you through the relatively hard parts:

### First, a Useful Utility Method

```js
_renderSearchTerms(theString) {
  return theString
    .replace(/\(/gi, ' ')
    .replace(/\)/gi, ' ')
    .replace(/\./gi, ' ')
    .replace(/!/gi, ' ')
    .replace(/@/gi, ' ')
    .replace(/#/gi, ' ')
    .replace(/\$/gi, ' ')
    .replace(/\%/gi, ' ')
    .replace(/\^/gi, ' ')
    .replace(/&/gi, ' ')
    .replace(/\*/gi, ' ')
    .replace(/-/gi, ' ')
    .replace(/_/gi, ' ')
    .replace(/=/gi, ' ')
    .replace(/\+/gi, ' ')
    .replace(/\{/gi, ' ')
    .replace(/\[/gi, ' ')
    .replace(/\}/gi, ' ')
    .replace(/\]/gi, ' ')
    .replace(/:/gi, ' ')
    .replace(/;/gi, ' ')
    .replace(/"/gi, ' ')
    .replace(/'/gi, ' ')
    .replace(/`/gi, ' ')
    .replace(/,/gi, ' ')
    .replace(/>/gi, ' ')
    .replace(/</gi, ' ')
    .replace(/\//gi, ' ')
    .replace(/\?/gi, ' ')
    .replace(/\|/gi, ' ')
    .replace(/\\/gi, ' ')
    .replace(/\n/gi, ' ')
    .split(' ');
}
```

The above utility method:

- removes all the unwanted characters from a string you intend to index for search, replacing them with spaces
- splits the result into an array, using a space as the split delimiter (runs of spaces produce empty strings, which the caller skips)
- returns it

### Whenever you Add or Update an Article, Set the Keywords Info in `articles_search`

Next is the method that sets the `articles_search` entries for an article.

```js
async setSearchTerms(arg) {
  const d = mdb.db("amgdotcom");
  const t = d.collection("articles_search");
  // accept either a raw database document (_id) or a rendered article (id)
  let id = (arg._id !== undefined) ? arg._id : arg.id;
  if (id === undefined) return false;
  await this.deleteSearchTerms(id);
```

We don't want duplicate entries, so we've deleted all the old search terms for this article.
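
The `deleteSearchTerms` method isn't shown in this post. A minimal sketch of what such a helper could look like, assuming it is a thin wrapper around the driver's `deleteMany()`; the standalone function signature (taking the collection as a parameter) is hypothetical and chosen only so the sketch runs on its own:

```javascript
// Hypothetical sketch; the post's actual deleteSearchTerms is not shown.
// `collection` is assumed to be the driver's handle for `articles_search`.
// `articleId` must have the same type as the stored `article_id` values
// (an ObjectId in this post's schema), or the filter won't match anything.
async function deleteSearchTerms(collection, articleId) {
  // deleteMany removes every document that matches the filter
  const result = await collection.deleteMany({ article_id: articleId });
  return result.deletedCount;
}
```

Awaiting the delete before inserting the new entries is what keeps a re-saved article from ending up with two sets of counts.
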
```js
  let terms = [
    ...this._renderSearchTerms(arg.title),
    ...this._renderSearchTerms(arg.content)
  ];
```

This creates a merged array. There may very well be words in the `title` that are also in the `content`; that's fine, because right now the `terms` array contains the entire article broken into words. No sums yet, just a raw list of words exactly as they appeared in the original article. Not until we...

```js
  // no prototype, so article words like "constructor" can't collide with inherited properties
  let termCount = Object.create(null);
  terms.forEach((item) => {
    item = item.trim().toLowerCase();
    if (item !== '') {
      if (termCount[item] === undefined) termCount[item] = 1;
      else termCount[item]++;
    }
  });
```

*Now* we have the sums. Next we simply collate a `docs` array full of objects compatible with MongoDB's node driver method, `[db].[collection].insertMany()`.

```js
  let docs = [];
  // rewriting terms: from here on it holds the unique keywords
  terms = Object.keys(termCount);
  terms.forEach((item) => {
    docs.push({
      article_id: new ObjectId(id),
      term: item,
      count: termCount[item]
    });
  });
```

Finally we insert the `articles_search` entries into the collection (`t` was defined all the way on the 2nd line of this function).

```js
  // insertMany rejects an empty array, so skip it when nothing was indexed
  if (docs.length > 0) await t.insertMany(docs);
}
```

### And Finally, the Search Method

```js
async searchArticles(term, type='blog') {
```

The `term` parameter needs to be a space-separated string of keywords.

```js
  try {
    let r = [];
    let article_weights = {};
    const d = mdb.db("amgdotcom");
    let t = d.collection("articles_search");
```

Just setting up variables to be used later. The collection `t` will be reassigned later, so it's not a `const` like it otherwise would be.

```js
    let query = { $or: [] };
    // drop empty strings: an empty pattern would match every stored keyword
    let terms = term.split(' ').filter((word) => word !== '');
    if (terms.length === 0) return r;
    for (let i = 0; i < terms.length; i++) {
      query['$or'].push({term: new RegExp(terms[i], 'i')});
    }
```

We're dynamically building a `query` object, above: one `$or` clause per keyword, each a case-insensitive regular expression matched against the `term` field. Note that the keywords are handed to `RegExp` as-is, so a keyword containing regex metacharacters such as `+` or `(` is interpreted as a pattern rather than as literal text.
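
If you'd rather treat every keyword as literal text, one option is to escape the regex metacharacters before building the query. This is a minimal sketch, not part of the original code; `escapeRegExp` and `buildTermQuery` are hypothetical helper names.

```javascript
// Hypothetical helpers, not part of the original code.

// Backslash-escape every character that has a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build the same { $or: [...] } shape as the search method, but from escaped keywords.
function buildTermQuery(searchString) {
  const keywords = searchString.split(' ').filter((word) => word !== '');
  return {
    $or: keywords.map((word) => ({ term: new RegExp(escapeRegExp(word), 'i') })),
  };
}

const query = buildTermQuery('c++ node.js');
console.log(query.$or[0].term.source);          // c\+\+
console.log(query.$or[1].term.test('nodexjs')); // false: the dot is now literal
```

Because punctuation is stripped at indexing time, an escaped keyword like `c++` won't match anything stored; the helper only keeps such input from throwing or acting as a pattern. Passing the search string through `_renderSearchTerms` instead of `split(' ')` would be another option.
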
```js
    let options = {
      sort: { count: -1 },
      projection: { article_id: 1, count: 1 },
    };
```

All we need from the `articles_search` collection is the `articles._id` value stored in `articles_search.article_id`, and the `count` of how relevant the keyword was to that article. We don't care about the keyword, but we do sort by `count`, descending.

```js
    const cursor1 = t.find(query, options);
    await cursor1.forEach((item) => {
      // article_id is an ObjectId; as an object key it becomes its hex string
      if (!article_weights.hasOwnProperty(item.article_id))
        article_weights[item.article_id] = { count: item.count };
      else
        article_weights[item.article_id].count += item.count;
    });
```

Above, we're filling the `article_weights` object, adding up the counts of every matching keyword entry per article. The goal is to know how relevant the article is when you take all keywords in the search term into account at once.

```js
    let article_ids = Object.keys(article_weights);
    // nothing matched: return early, since MongoDB rejects an empty $or array
    if (article_ids.length === 0) return r;
    // resetting query
    if (type === 'all') query = { $or: [] };
    else query = { type: type, $or: [] };
    for (let i = 0; i < article_ids.length; i++) {
      query['$or'].push({_id: new ObjectId(article_ids[i])});
    }
```

Once again, we're building a dynamic query above. This time we want every article that's relevant at all, limited to the requested `type` unless `type` is `'all'`.

```js
    // resetting t
    t = d.collection('articles');
    // resetting options
    options = {
      projection: {
        _id: 1,
        title: 1,
        content: 1,
        tags: 1,
        username: 1,
        creation: 1,
        updated: 1,
        type: 1,
      },
    };
```

Now we're preparing to do a `.find()` on the `articles` collection. I didn't bother sorting on the lazy assumption that I have to do that manually. (Seriously, I should just look up how MongoDB joins work, if at all.)

```js
    const cursor2 = t.find(query, options);
```

I love the MongoDB Node.js driver so far.

```js
    await cursor2.forEach((item) => {
      r.push({
        id: item._id,
        title: item.title,
        content: item.content,
        tags: item.tags,
        username: item.username,
        creation: item.creation,
        updated: item.updated,
        type: item.type,
      });
    });
```

The `r` array is the return value.
Here we're packing it full of everything needed to render each article that was relevant to the search.

```js
    // highest total keyword count first
    r.sort((a, b) => {
      if (article_weights[a.id].count > article_weights[b.id].count) return -1;
      else if (article_weights[a.id].count === article_weights[b.id].count) return 0;
      else return 1;
    });
    return r;
```

Above, we sort the return value so that the more total keyword matches an article has, the earlier in the array it'll be sorted.

```js
  } catch (e) {
    console.log(`There was an error: ${e}`);
  }
}
```

If anything throws, the error is logged and the method returns `undefined`, so callers should be ready for that case.

The other methods are easy, basic CRUD stuff. Have fun with your fancy new relevance-weighted search!
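
On the join question raised earlier: MongoDB's aggregation framework has a `$lookup` stage that could fold the two find operations into a single round trip. The sketch below only builds the pipeline so it can run without a database. `buildSearchPipeline` is a hypothetical helper, the collection and field names follow this post, and the pipeline has not been run against the original data.

```javascript
// Hypothetical sketch: an aggregation pipeline meant to be run on articles_search.
// termClauses is the array of { term: RegExp } clauses built in the search method.
function buildSearchPipeline(termClauses, type = 'blog') {
  const pipeline = [
    // keep only index entries whose term matches at least one keyword
    { $match: { $or: termClauses } },
    // total relevance per article
    { $group: { _id: '$article_id', weight: { $sum: '$count' } } },
    // most relevant first
    { $sort: { weight: -1 } },
    // join each total with its article document
    { $lookup: { from: 'articles', localField: '_id', foreignField: '_id', as: 'article' } },
    // $lookup yields an array; flatten it to one article per result
    { $unwind: '$article' },
  ];
  // mirror the search method's type filter
  if (type !== 'all') pipeline.push({ $match: { 'article.type': type } });
  return pipeline;
}

const pipeline = buildSearchPipeline([{ term: /search/i }]);
console.log(pipeline.map((stage) => Object.keys(stage)[0]).join(' '));
// $match $group $sort $lookup $unwind $match
```

With a live connection, the pipeline would be passed to something like `d.collection('articles_search').aggregate(pipeline).toArray()`, and each result would carry its `weight` alongside the joined `article`.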