Some time ago I worked on a small autocomplete web service for fun and curiosity. Part of it is pretty much what I’m going to talk about in this post.
We’re gonna build a small completion suggester in Go.
A couple of assumptions I’ll have for this experiment are:
- we don’t care about sub-word matching
- we want a little typo tolerance
- we’ll do case-insensitive matching
The prefix-based map
We’re gonna take advantage of PrefixMap, a small map implementation of mine that helps with prefix-based key matching. You may have seen this kind of map referred to as a Radix Tree, Trie or Prefix Tree on Wikipedia, but I couldn’t find an established name for this kind of tree in Go land. Maybe you can suggest a better name in the comments.
Anyway, the prefix-based map will be our main data source for suggestions. We’ll use it to store all the suggestions we want to match through our engine.
The Levenshtein distance
We’ll use the Levenshtein distance as a metric to compute how far the typed in string is from the matches that we’ve found. In particular, we’ll define our own similarity metric as:
[code]
similarity = 1.0 - levenshtein(match, substr) / max(|match|, |substr|)
[/code]
Where substr is the typed-in string and match is the candidate match we have found. A similarity of 1.0 means that substr and match are equal.
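To make that concrete, take the misspelled input itlaly and the candidate match Italy, a pair we’ll meet again in the example output further down. The Levenshtein distance between the two strings is 2 (one substitution for the capital I, one insertion), so:
[code]
similarity = 1.0 - 2 / max(5, 6) ≈ 0.67
[/code]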
The problem
We have a list of strings that the input will potentially match against. We want to find a few candidate matches that honor a certain similarity threshold we define.
This is a very simplified version of what happens when you start typing text into an auto-completion-enabled input field. Every time you type, the string you’ve entered so far gets evaluated against a data source to find the most relevant suggestions for you. And all of this has to happen pretty fast to be effective: you want your auto-completion service to be faster than the user who’s typing, so that they can save time by selecting a suggestion rather than typing it all out.
For this particular reason, the PrefixMap implementation we’re gonna use can efficiently find all the values stored under a given prefix. This saves us from having to populate a more traditional map, in advance, with every possible prefix of every key. Instead, thanks to the tree-like structure the values are stored in, we can simply traverse the part of the tree whose keys share a common prefix.
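To give a feel for what that means, here’s a tiny standalone sketch. I’m assuming the package is imported as prefixmap (the import path below is my assumption) and exposes New, Insert and GetByPrefix; GetByPrefix is the method we’ll rely on later.
[code]
package main

import (
	"fmt"

	"github.com/alediaferia/prefixmap" // import path assumed
)

func main() {
	m := prefixmap.New()
	m.Insert("united kingdom", "United Kingdom")
	m.Insert("united states", "United States")
	m.Insert("uruguay", "Uruguay")

	// One traversal of the subtree shared by keys starting with "united"
	// returns both values; no need to precompute every possible prefix.
	fmt.Println(m.GetByPrefix("united"))
}
[/code]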
An example solution
For this specific example, our data source is going to be a list of world countries. Our target users will have to select the right country, so they’ll start typing it in and we’ll provide a few suggestions to save them some typing.
The autocomplete code
First of all, let’s start with the low-hanging fruit.
We’ve defined our concept of similarity so we’ll start by writing a function for it.
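The original snippet isn’t embedded here, but a minimal sketch that matches the way the function is called later on, similarity(len(value), len(input), ld), could look like this (the parameter names are mine):
[code]
// similarity implements the metric defined above: 1.0 means the two
// strings are identical, lower values mean more edits are needed.
func similarity(lenA, lenB, ld int) float64 {
	maxLen := lenA
	if lenB > maxLen {
		maxLen = lenB
	}
	if maxLen == 0 {
		return 1.0
	}
	return 1.0 - float64(ld)/float64(maxLen)
}
[/code]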
As you can see, the function accepts the ld parameter as one of its inputs. That’s the Levenshtein distance computed between the two words we want to know the similarity of.
Now, let’s populate our PrefixMap with the country list we have. For the purpose of this exercise I’m gonna use the list of country names in English provided by this online service, which I’ve turned into a Go slice. You can find a gist for it here.
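Assuming the countries slice from the gist, and the same New and Insert calls as in the earlier sketch, populating the data source could look roughly like this:
[code]
...
datasource := prefixmap.New()
for _, country := range countries {
	// lower-cased key so prefix lookups can be case-insensitive,
	// original spelling kept as the value we show back to the user
	datasource.Insert(strings.ToLower(country), country)
}
...
[/code]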
Now that we have everything we need, let’s work on the main component of our little program: the matching code.
[code]
...
values := datasource.GetByPrefix(strings.ToLower(input))

results := make([]*Match, 0, len(values))
for _, v := range values {
	value := v.(string)
	s := similarity(len(value), len(input), LevenshteinDistance(value, input))
	if s >= similarityInput {
		m := &Match{value, s}
		results = append(results, m)
	}
}

fmt.Printf("Result for target similarity: %.2f\n", similarityInput)
PrintMatches(results)
...
[/code]
We’re taking advantage of the GetByPrefix method from PrefixMap. GetByPrefix returns a flattened collection of all the values in the map that belong to the specified prefix. Pretty handy, isn’t it?
A further filtering step I’m applying there, as you can see, is the similarity check: I go through the list of matches we’ve retrieved and keep only those that meet the similarity threshold the program received as input.
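For completeness, here’s a hypothetical sketch of the Match type and PrintMatches helper the snippet above assumes; the real ones live in the full implementation linked below.
[code]
// Match pairs a suggestion with its similarity score.
type Match struct {
	Value      string
	Similarity float64
}

// PrintMatches prints each match in the "match: ... similarity: ..."
// format shown in the example output further down.
func PrintMatches(matches []*Match) {
	for _, m := range matches {
		fmt.Printf("match: %s\tsimilarity: %.2f\n", m.Value, m.Similarity)
	}
}
[/code]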
You can find the full example implementation on GitHub.
The output
This is really it. Here are a few example invocations of the little program we’ve just written:
[code]
$ go run autocompleter.go -similarity 0.3 itlaly
Result for target similarity: 0.30
match: Italy            similarity: 0.67

$ go run autocompleter.go -similarity 0.2 united
Result for target similarity: 0.20
match: United Arab Emirates    similarity: 0.25
match: United Kingdom          similarity: 0.36
match: United States           similarity: 0.38

$ go run autocompleter.go France
Result for target similarity: 0.30
match: France                  similarity: 1.00
match: Metropolitan France     similarity: 0.32
[/code]
Further improvements
As you have seen, our little program tolerates small typos in some situations, except when the typo occurs at the beginning of the input. This is because we’re using a PrefixMap, which of course won’t match anything at all if we start with the wrong prefix. An improvement in this sense would probably be to fall back to a full-text search when no matches have been found for the specified similarity in the first pass.
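A rough sketch of the simplest version of that fallback, assuming we keep the original countries slice around alongside the PrefixMap: when the prefix lookup yields nothing, scan the whole list and apply the same similarity filter.
[code]
...
if len(results) == 0 {
	// The prefix lookup found nothing, so the typo is probably in the
	// first characters: check every entry with the same similarity filter.
	for _, country := range countries {
		s := similarity(len(country), len(input), LevenshteinDistance(country, input))
		if s >= similarityInput {
			results = append(results, &Match{country, s})
		}
	}
}
...
[/code]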
Hopefully this post has been somewhat useful for you, or at least entertaining. Let me know if you need any clarification or if you have a better approach to recommend for achieving this.