enterprise search
How to choose the right embedding model
The most important component of a RAG system.
Logically, this post should follow the one titled "Stop talking about Vector DB / start talking about Embedding Models", but we're reversing the order here.

A key component of a system that answers queries using "knowledge" from documents, documentation, KBs, tickets, internal comms, etc. is a natural-language search that finds the relevant content. Such a system is an example of RAG (Retrieval Augmented Generation), illustrated below.
Natural-language search is enabled by "embedding models", which are able (quite magically!) to determine how similar passages of text are or, closer to our use case, which documents contain the information relevant to a given query. To appreciate what embedding models are doing, consider the 2D graph below, where distances between points correspond to how similar different passages of text are, and think about what a regular keyword search would do.
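To make this concrete, here is a minimal sketch of measuring text similarity with an embedding model via the sentence-transformers library. The model name and the example passages are purely illustrative choices, not a recommendation.

```python
# A minimal sketch of what an embedding model does, using sentence-transformers.
# The model name is just a small, commonly available choice for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "How do I reset my VPN password?",
    "Steps to recover access to the corporate VPN",
    "Quarterly revenue grew 12% year over year",
]

# Each passage becomes a vector; similar meanings end up close together.
embeddings = model.encode(passages, normalize_embeddings=True)

# Cosine similarity between all pairs (a dot product, since the vectors are normalized).
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
# The first two passages score high despite sharing almost no keywords;
# the third scores low against both. A keyword search would miss that.
```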
Getting the embeddings right is the key to the success of a RAG system. Without the right content returned in the top-k results, we're in a garbage-in / garbage-out scenario: the LLM will be responding using insufficient or irrelevant information.

Hugging Face maintains a leaderboard of embedding models at https://huggingface.co/spaces/mteb/leaderboard, but how do you choose which one is best for your use case, and how do you know whether the chosen model works well enough? Read on.

Suppose we have 1,000 documents that we'd like to search over. Take one of them and think of 5 questions that the document answers. When you ask any of these questions, you'd expect that document to come up near the top of the search results. Do this for each document. Now you have 5,000 questions, and you can evaluate a search (i.e., an embedding model) by giving it a point whenever the document a question was written for is returned within the first, say, 10 results. This metric is called recall@10.

How do we do this in practice? Luckily, someone solved language for us. You can ask GPT-4 (or your favorite LLM, though mileage may vary) something like this for every document that you have:
«For the document below, please create 10 queries that the document has a good complete answer for. Use diverse vocabulary making sure to use query terms that are not used in the document.
[text of the document]»
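Wiring this prompt into code might look roughly like the sketch below. It assumes the OpenAI Python client with an API key in the environment; the model name, the line-based parsing of the response, and the documents dictionary are illustrative assumptions.

```python
# A hedged sketch of generating evaluation queries per document with an LLM.
# Assumes the OpenAI Python client and OPENAI_API_KEY in the environment;
# the model name and the simple line-based parsing are illustrative choices.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "For the document below, please create 10 queries that the document has a good "
    "complete answer for. Use diverse vocabulary making sure to use query terms "
    "that are not used in the document.\n\n{document}"
)

def generate_queries(document_text: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(document=document_text)}],
    )
    # Assume the model returns one query per line; strip numbering and blanks.
    lines = response.choices[0].message.content.splitlines()
    return [line.lstrip("0123456789. -").strip() for line in lines if line.strip()]

# eval_set maps each generated query to the document it was generated from.
# `documents` is assumed to be a dict of {doc_id: text}.
# eval_set = {q: doc_id for doc_id, text in documents.items() for q in generate_queries(text)}
```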
recall@k can be computed for different values of k: e.g., recall@1 measures how often the target document is returned as the top result.
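With the generated queries in hand, scoring a search is a few lines. This sketch assumes a search(query, k) function that returns a ranked list of document ids and an eval_set mapping each query to its source document (both hypothetical names).

```python
# A minimal recall@k computation. `search(query, k)` is assumed to return a
# ranked list of document ids; `eval_set` maps each query to its source document.
def recall_at_k(eval_set: dict[str, str], search, k: int = 10) -> float:
    hits = 0
    for query, target_doc_id in eval_set.items():
        top_k = search(query, k)   # ranked doc ids from the embedding-based search
        if target_doc_id in top_k:
            hits += 1              # one point if the source document is retrieved
    return hits / len(eval_set)

# recall_at_k(eval_set, search, k=1) measures how often the target document is
# the top result; k=10 gives recall@10.
```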

Our favorite model is e5-large-v2, which in our tests outperforms OpenAI's embeddings (ada-002). The values we get for recall@8 on different data sets are above 90%.
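For reference, here is a sketch of using e5-large-v2 for retrieval with sentence-transformers. The e5 family expects "query: " and "passage: " prefixes on the inputs; the documents and query below are made up for illustration.

```python
# A sketch of embedding-based retrieval with intfloat/e5-large-v2.
# Note the "query: " / "passage: " prefixes expected by the e5 models.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-large-v2")

docs = ["passage: " + d for d in [
    "To reset your VPN password, open the self-service portal and ...",
    "Expense reports must be filed within 30 days of travel ...",
]]
doc_embeddings = model.encode(docs, normalize_embeddings=True)

query_embedding = model.encode("query: how do I get back into the VPN?",
                               normalize_embeddings=True)

# Rank documents by cosine similarity and keep the top k (here k=8, as in our tests).
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
top_k = scores.argsort(descending=True)[:8]
print(top_k, scores[top_k])
```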