Why aren’t you using pretrained models?
Pretrained neural networks have reached the point where they are good enough for many applications without further training. Many models with billions of parameters are freely available. However, not every company has people with machine learning experience, and it is true that domain knowledge and care are required to build and deploy robust ML pipelines for end users.
There is, however, huge potential in applying simple ML solutions to internal or personal challenges. To show how simple this can be, let’s build a semantic search function that could be useful to anyone tasked with writing (English) text.
A simple matter of programming, machine learning, pretraining
A dictionary is a valuable tool for writers. However, the modern dictionaries that come with popular operating systems have been found lacking.
Another issue with dictionaries is that they are one-directional. To find better words and expressions, you need to think of a word, look it up, and then chase references to explore the possibilities. What if, in addition to this forward search, a computer could also look in the other direction, from the definitions to the words?
Building a reverse dictionary
We’ll use a technique called sentence embedding to make the definitions and examples from Webster’s dictionary searchable. This is the semantic part of semantic search. Essentially, the meaning of a phrase is encoded into a vector of numbers, the output of a neural network.
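As a rough sketch of this step, here is how the dictionary entries could be encoded with the sentence-transformers library. The model name, the `entries` list, and the way Webster’s dictionary is parsed into (word, definition) pairs are assumptions for illustration, not the exact setup used here.

```python
# A minimal sketch of encoding dictionary entries, assuming the
# sentence-transformers library and a hypothetical `entries` list of
# (word, definition) pairs parsed from Webster's dictionary.
from sentence_transformers import SentenceTransformer

# Any pretrained sentence-embedding model works; this one is a common default.
model = SentenceTransformer("all-MiniLM-L6-v2")

entries = [
    ("serendipity", "The faculty of making fortunate discoveries by accident."),
    ("laconic", "Using few words; expressing much in few words; concise."),
    # ... roughly 270,000 definitions and examples in the full dictionary
]

words = [word for word, _ in entries]
definitions = [definition for _, definition in entries]

# Each definition becomes a fixed-length vector of floats.
vectors = model.encode(definitions, normalize_embeddings=True)
print(vectors.shape)  # (number of entries, embedding dimension)
```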
With the Webster vectors in memory (we get about 270,000 from the dictionary we use), we can now query this dataset by encoding a search phrase into a query vector. To search for words, we measure how close the query vector is to the vectors in the dataset.
To find similar vectors, we run a nearest neighbor algorithm. It takes the query vector as input, looks through our dataset, and returns, for example, the ten closest results. All that’s left for us to do is return the words associated with these neighbor vectors. This will (ideally) produce a list of words close in meaning to the search phrase.
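A minimal sketch of the search step, building on the `model`, `vectors`, and `words` from the snippet above: it uses a brute-force nearest neighbor search, and since the embeddings are normalized, cosine similarity reduces to a dot product. The function name and the example query are illustrative.

```python
import numpy as np

def reverse_lookup(query: str, top_k: int = 10):
    # Encode the search phrase into a query vector of the same kind.
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    # Cosine similarity against every dictionary vector (dot product, since normalized).
    similarities = vectors @ query_vector
    # Indices of the top_k closest vectors, most similar first.
    nearest = np.argsort(-similarities)[:top_k]
    return [(words[i], float(similarities[i])) for i in nearest]

print(reverse_lookup("finding good things by chance"))
```

For a dataset of this size, brute force is fast enough in practice; for much larger collections, an approximate nearest neighbor library would be the usual next step.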