Website Search using Django and PostgreSQL Trigrams
Recently updated on
Over the years I've become increasingly wary of the word "easy" in software documentation. Pick a software project at random, and there's a good chance the documentation will lead off with something like "Booloogent makes the process of frobnifying your wakalixes easy!"
And then you try to use the package or feature or whatever, and you find that it is not easy. It might not work at all, and you might grow angry with the author of Booloogent for lying, or with yourself for failing at something so easy, or maybe both at once.
So it's refreshing - even exhilarating - when a thing really does turn out to be easy. And such is the case with Django's "trigram_similar" lookup.
One of Imaginary Landscape’s healthcare clients needed a search function (in their Django-based website) that would allow users to find medical service facilities by entering either the provider's name or some component of their address, such as a street name or a city. The search widget also had to provide auto-complete suggestions in real time, so getting results had to be snappy.
Because I had recently been playing with PostgreSQL's "full text search" capabilities, I started there. I added a new search vector field to the "Facility" model and populated it with data from each Facility's "name," "address," etc. This worked great. Results came back lightning quick when the user's input matched any of the values from the targeted fields.
But it wasn't good enough. The client wanted partial matches too. For example, one location was in "Rockford, IL." Shouldn't entering "Rock" be enough to bring back a suggestion of "Rockford"? Unfortunately, the Postgres full-text search doesn't work on substrings, so I had to keep looking. Next stop: trigrams.
And what are trigrams? To quote the PostgreSQL docs:
"A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share."
You can see this in action by firing up a Postgres shell:
$ psql <my-database>
If your database doesn't have the "pg_trgm" extension, you have to create it:
=> CREATE EXTENSION pg_trgm;
That's the only extra step you might need to take. Now you can see how the extension breaks a string into trigrams. Let's use the earlier example of "Rockford":
=> SELECT show_trgm('Rockford'); show_trgm --------------------------------------------- {" r"," ro",ckf,for,kfo,ock,ord,"rd ",roc} (1 row)
And you can also see how the trigram-similarity operator "%" can help to find partial matches:
=> SELECT DISTINCT city FROM facility WHERE city % 'Rock'; city ------------- Rockford Rockton Rock Island Rock Falls (4 rows)
Great! But how do we use it in a Django context? Well, you have to create the "pg_trgm" extension as described above, then add "django.contrib.postgres" to your `INSTALLED_APPS` setting. Alternatively, you can also make Django create the "pg_trgm" extension by running the "TrigramExtension" migration operation.
Once those steps are done, you can use the "trigram_similar" filter to fetch objects using partial matching:
$ ./manage.py shell >>> from my_app.models import Facility >>> Facility.objects.filter(city__trigram_similar="Rock") <GeoQuerySet [<Facility: 1060 Addison Lane, Rockford, IL, 61107>, ...
Easy, right?
But it's not perfect: You can't tweak the sensitivity of the matching. Although the string "Rock" is enough to bring back facilities in those four cities, a string of "Linc" -- for example -- is not enough to bring back a Facility with an address of, say, "212 E. Lincoln Road." "Linco" doesn't work either. You have to get all the way to "Lincol" before you start seeing results:
>>> Facility.objects.filter(street__trigram_similar='Lincol') <GeoQuerySet [<Facility: 212 E. Lincoln Road, Ann Arbor, Michigan, 48103>, ...
And you need still more characters to find Facilities on "Lincoln Street" or "Lincoln Avenue."
The reason is that although "Rock" is half of "Rockford" (and somewhere around half of the other "Rock" cities), "Linc" is a much smaller fraction of a whole address like "212 E. Lincoln Road," and an even smaller fraction of an address like "10015 N. Lincoln Avenue." Therefore the "trigram_similar" lookup decides that the match is not close enough and withholds the result.
If we want to match "Linc" more easily, we have to lower the threshold for calling something a match. The threshold has to be a number between 0 and 1. The default is 0.3, as can be seen in the Postgres shell with:
=> select show_limit(); show_limit ------------ 0.3 (1 row)
"show_limit" is one of the functions brought in when the trigram extension is created. See the full list.
So we need to lower the limit for “street.” On the other hand, the default value worked well for "city." Lowering the threshold too much could negatively affect queries such as our successful “Rock” example. Can we adjust the limit on a per-query basis?
Tuning starts by accessing the Postgres functions directly with wrapper classes:
>>> from django.db import models >>> class Similarity(models.Func): >>> function = "similarity"
Now we can call the Postgres "similarity" function from Django. We can reproduce the behavior of "filter(city__trigram_similar='Rock')" like this:
>>> Facility.objects.annotate( >>> match=Similarity("city", models.Value("Rock")) >>> ).filter(match__gt=0.3) <GeoQuerySet [<Facility: 1060 Addison Lane, Rockford, IL, 61107>, ...
But we can also lower the threshold for "address" if we want to:
>>> Facility.objects.annotate( >>> match=Similarity("street", models.Value("Linc")) >>> ).filter(match__gt=0.15) <GeoQuerySet [<Facility: 212 E. Lincoln Road, Ann Arbor, Michigan, 48103>, ...
This technique was based on a solution I found on Reddit.
So far so good! Now the question becomes, how can you set up a query that will try to match *either* "street" *or* "city"? I managed this with:
>>> from django.db.models.functions import Greatest >>> Facility.objects.annotate(match=Greatest( >>> Similarity("facility__street", models.Value(query)), >>> Similarity("facility__city", models.Value(query)) >>> )).filter(match__gt=0.2)
Unfortunately we've now left "easy" behind. The code is starting to become messy, as it now requires a custom class, queryset annotations, and the 'Greatest' function. And we're not even using "trigram_similar" anymore. It also uses a somewhat arbitrarily chosen threshold value that splits the difference between "0.15," which was helpful for the "street" query, and "0.3," which seemed right for the "city" query.
I tried using a wrapper class to access the Postgres "set_limit" function, so I could set the threshold to whatever I wanted. Then maybe I could have gone back to using the "trigram_similar" lookup. I could never get this to work though, and "set_limit" is deprecated in any case.
Ironically, at the end of all these experiments, I didn't end up using trigrams in my solution. The limitations on adjusting the sensitivity threshold, at least through Django, proved too great. I suppose that can be the downside of true easiness. Still, in another situation, trigrams could prove the perfect easy-to-use tool.
Further Reading
If you want to make trigram-matching faster, you can create a "GIN," or possibly a "GIST" index. This might not be necessary for tables that don't have a massive number of rows, but here are two good articles about it: