Tuning Site Search for Covid-19
The recent spike in usage of the term covid-19 introduced some inconsistent results in Imaginary’s custom site search engine, iScraper, a tool which utilizes Elasticsearch as its indexing engine. We’ll take a closer look at these results and show how we corrected the problem with an Elasticsearch configuration change.
iScraper is a SaaS search engine tool created by Imaginary Landscape to allow clients to better tune search results on their websites.
The inconsistency we discovered was that searches for “covid” were returning results containing “covid” but not “covid-19.” However, searches for “covid19” were returning results containing “covid-19.” Oddly enough, searches for “covid-19” were returning results containing “19” but not “covid.”
In Plain(ish) English - The part is not a whole
Elasticsearch has a dizzying array of methodologies for breaking up search terms into component parts. Called tokenizers, these rules quickly get into the weeds with partial word and structured text tokenizers (see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html).
That said, in almost all cases, a hyphen acts as a natural token boundary and should have broken “covid-19” into two component parts, “covid” and “19.” A transition from letters to numbers is also a natural boundary. So, what was causing this belt-and-suspenders tokenization to fail?
It turns out the default behavior was colliding with an earlier ad hoc customization, one we had made to fix a different inconsistency.
This configuration allowed hyphenated terms to be smooshed together so that, for example, a search for “xray” would get results for “x-ray,” but it did not preserve each portion of the hyphenated term as a standalone word, valid in its own right. A search for “ray” would not pull up results for “x-ray,” just as we were finding with searches for “covid” not finding “covid-19” results.
This customization was just for hyphenation and not for number breaks. The combination of the ad hoc customization with the normal number tokenizer was the cause of “covid-19” searches returning “19” but not “covid.”
Examining the problem, we discovered another customization that was better equipped to manage this new bloom of pandemic-related searches: Custom Synonym Sets.
Custom Synonym Sets
Custom Synonym Sets allowed us to create a set of related words. We created a new covid set that includes covid, covid-19, covid19 and, for good measure, corona. Now a search using any one of these terms returns results that contain any of the others.
On a highly scientific epidemiological site there might be a reason to keep “corona” search results separate from “covid” results. Terms that have significant distinctions in some contexts are synonymous in others. In fact, many industries have their own vocabulary and often interchange multiple words for the same thing. Custom Synonym Sets are an elegant way to address these situations. Having this covid synonym set gets the user to content relating to the current situation without forcing search term precision.
Technically Speaking
Looking at this again but with a more technical lens, our Elasticsearch configuration has an analyzer that produces the following searchable tokens from a hyphenated word like "COVID-19":
covid-19 covid19 19
The relevant piece of that analyzer is a word_delimiter token filter, configured like this:
"concatenate_on_hyphens": { "type": "word_delimiter", "preserve_original": True, "generate_word_parts": False, "catenate_all": True }
Word Parts
The reason the "covid" token is missing from the token list is this setting:
"generate_word_parts": False,
Because generate_word_parts is set to False, a "covid" token is not generated from "covid-19".
But if this is the case, why is the second half "19" included?
Number Parts
Numbers are handled distinctly with a “generate_number_parts” setting which is True by default. That default setting is not contradicted by our configurations, so the numeral "19" is preserved as a token.
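The interplay of these settings can be imitated in a few lines of plain Python. This is a toy model, not Elasticsearch's implementation: it handles only ASCII letters and digits, the function names are made up for this example, and the "any token matches" search rule is an assumption. It does reproduce the token list above and the odd search behavior we saw.

```python
import re


def word_delimiter_tokens(token, preserve_original=True,
                          generate_word_parts=False,
                          generate_number_parts=True,
                          catenate_all=True):
    """Toy imitation of a word_delimiter-style filter for a single token."""
    token = token.lower()
    # Split on non-alphanumerics and on letter/digit transitions (ASCII only).
    parts = re.findall(r"[a-z]+|[0-9]+", token)
    if len(parts) <= 1:
        # Nothing to split: pass the token through unchanged.
        return [token]
    tokens = []
    if preserve_original:
        tokens.append(token)
    if catenate_all:
        tokens.append("".join(parts))
    for part in parts:
        if part.isdigit() and generate_number_parts:
            tokens.append(part)
        elif part.isalpha() and generate_word_parts:
            tokens.append(part)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(tokens))


def analyze(text):
    """Whitespace-split the text and run every word through the toy filter."""
    tokens = []
    for word in text.split():
        tokens.extend(word_delimiter_tokens(word))
    return tokens


def matches(query, document):
    """Assumed rule: a document matches if it shares any token with the query."""
    doc_tokens = set(analyze(document))
    return any(tok in doc_tokens for tok in analyze(query))


print(word_delimiter_tokens("COVID-19"))
# ['covid-19', 'covid19', '19']
print(word_delimiter_tokens("COVID-19", generate_word_parts=True))
# ['covid-19', 'covid19', 'covid', '19']
print(matches("covid", "covid-19 update"))    # False
print(matches("covid19", "covid-19 update"))  # True
print(matches("covid-19", "top 19 tips"))     # True
print(matches("covid-19", "covid update"))    # False
```

In this model, flipping generate_word_parts to True is enough to bring the "covid" token back, which is the part our original configuration left out.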
Synonyms
Elasticsearch has a "synonym" token filter whose "synonyms" option allows an administrator to provide a list of words that Elasticsearch treats as equivalent. The following configuration was added to our custom analyzer.
"covid_19": { "type": "synonym", "synonyms": ["corona, covid, covid-19, covid19”] }
With the synonym option in place, searches now pull appropriate results for content referencing the current pandemic by whichever term the searcher uses.
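To make the effect concrete, here is a small pure-Python sketch of how a comma-separated equivalence rule like ours can be expanded at search time. It illustrates the idea only; it is not how Elasticsearch implements the synonym filter, and the function names are hypothetical.

```python
# The same rule string used in the synonym configuration.
SYNONYM_RULES = ["corona, covid, covid-19, covid19"]


def parse_rule(rule):
    """Split one comma-separated rule into a set of equivalent terms."""
    return {term.strip().lower() for term in rule.split(",") if term.strip()}


def build_lookup(rules):
    """Map every term to the full set of terms it is equivalent to."""
    lookup = {}
    for rule in rules:
        group = parse_rule(rule)
        for term in group:
            lookup.setdefault(term, set()).update(group)
    return lookup


def expand(term, lookup):
    """Return the term's synonym group, or just the term if it has none."""
    term = term.lower()
    return lookup.get(term, {term})


lookup = build_lookup(SYNONYM_RULES)
print(sorted(expand("covid", lookup)))
# ['corona', 'covid', 'covid-19', 'covid19']
print(sorted(expand("xray", lookup)))
# ['xray']
```

Because every member of the group maps to the whole group, it does not matter which of the four terms the searcher types.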
Final Thoughts
Working through this issue provided a couple of takeaways. One is that language is complex and wonderful, and making a change like ignoring hyphens to correct one issue can easily cause another.
The issue also led to the discovery of synonym sets, which have been useful and will help make search better, especially for industry-specific vocabulary and current events. Next time, however, we would prefer the event not be a pandemic!