Getting the embeddings right is key to the success of a RAG system. If the right content isn't returned in the top-k results, we're in a garbage-in / garbage-out scenario: the LLM will be answering from insufficient or irrelevant information.
Hugging Face has a leaderboard of embedding models at
https://huggingface.co/spaces/mteb/leaderboard, but how do you choose which one is best for your use case, and how do you know whether the chosen model works well enough? Read on.
Suppose we have 1,000 documents that we'd like to search over. Take one of them and think of 5 questions that the document answers. When you ask any of these questions, you'd expect the document to come up near the top of the search results. Do this for each document and you have 5,000 questions. Now you can evaluate a search (i.e., an embedding model) by giving it a point whenever the document a question was written for is returned within the first, say, 10 results. This metric is called recall@10 (see the sketch below).
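Here is a minimal sketch of what that evaluation loop might look like. The `search_top_k` function is a hypothetical stand-in for whatever embedding model and vector store you're testing; the questions are (question, expected document id) pairs built as described above.

```python
from typing import Callable

def recall_at_k(
    questions: list[tuple[str, str]],               # (question, expected_doc_id) pairs
    search_top_k: Callable[[str, int], list[str]],  # returns doc ids for a query (hypothetical)
    k: int = 10,
) -> float:
    """Fraction of questions whose source document appears in the top k results."""
    hits = 0
    for question, expected_doc_id in questions:
        results = search_top_k(question, k)
        if expected_doc_id in results:
            hits += 1
    return hits / len(questions)

# Usage: recall_at_k(questions, search_top_k, k=10) -> e.g. 0.87
```

Run this once per candidate embedding model and compare the scores; the model with the higher recall@10 on your own documents is the better fit, whatever the leaderboard says.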
How do we get the questions in practice? Luckily, someone solved language for us. You can ask GPT-4 (or your favorite LLM, but mileage may vary) something like this for every document that you have: