One of the most popular and effective approaches to integrating artificial intelligence into business operations today is Retrieval-Augmented Generation (RAG). This architecture combines retrieving relevant data from external sources with generating meaningful responses based on that information. It is especially useful in scenarios that require current, domain-specific, or extended-context information.
Depending on the system's design and goals, RAG systems can extract information from a wide range of sources, including document repositories, databases, websites, APIs, search engines, and more. This article focuses on using vector indexes, which enable information retrieval based on semantic similarity rather than simple keyword matching. This approach is especially beneficial for organizations dealing with complex, unstructured data, helping to improve the quality and relevance of AI-generated results.
For businesses aiming to implement AI in real-world processes, RAG provides a flexible and scalable solution that does not require fine-tuning large models. It is a compelling option for enhancing internal tools, automating customer interactions, or building intelligent assistants that reference up-to-date knowledge sources.
This article will be especially relevant for CTOs and business owners seeking effective ways to apply AI in their organizations, particularly where handling large volumes of information and delivering accurate, contextual responses are essential.
Here’s how it works:
Preparation / Indexing:
Document Embedding: all documents are split into text chunks and converted into high-dimensional vectors using an embedding model.
Vector Storage: the generated vectors are stored in a vector database or in memory.
Querying:
Query Embedding: the embedding model transforms the user’s query into a vector.
Retrieval: the system searches for top-K semantically similar chunks, comparing the user’s input vector with vectors in the index.
Augmented Generation: the retrieved chunks are passed along with the query to a large language model (LLM), which uses this context to generate a well-informed answer.
Today, there are many embedding models, built on different language models and embedding algorithms. This article focuses on applying RAG to intelligent search over a repository of employee CVs and analyzes how the choice of embedding model affects the final accuracy of the system.
Tech Stack and Model Selection
The application was implemented using the LlamaIndex framework for Python, which lets you build a RAG pipeline with just a few lines of code.
For testing and experiments, we selected the smallest self-hosted embedding models and the most affordable ones offered by service providers. To run the self-hosted models, we used Ollama.
Let’s review the chosen embedding models (Table 1).
Table 1. Comparable Embedding Models
As the LLM for testing, we used a self-hosted Llama3.2 model with 3 billion parameters. It is Meta’s instruction-tuned, text-only model, optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks (definition from https://ollama.com/library/llama3.2:3b). Since this article evaluates the impact of the embedding models, the choice of LLM is not critical; Llama 3.2 could, for example, be replaced with one of the OpenAI models.
Key Python Code
Let’s build a simple RAG solution.
!pip install llama-index llama-index-embeddings-ollama llama-index-llms-ollama llama-index-llms-azure-openai
import nest_asyncio
nest_asyncio.apply()
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
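Next, load the CVs, split them into chunks, and build the in-memory vector index. Below is a minimal sketch of this step; the folder path and model names are illustrative assumptions, so substitute whichever embedding model is under test.
# Connect to the locally running Ollama models (model names are examples)
embed_model = OllamaEmbedding(model_name="nomic-embed-text")
llm = Ollama(model="llama3.2:3b", request_timeout=120.0)

# Load the CV files from a local folder (path is an assumption)
documents = SimpleDirectoryReader("./cv_documents").load_data()

# Split the documents into chunks of up to 2048 tokens each
parser = SimpleNodeParser.from_defaults(chunk_size=2048)
nodes = parser.get_nodes_from_documents(documents)

# Build the in-memory vector index with the selected embedding model
vector_index = VectorStoreIndex(nodes, embed_model=embed_model)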
Setting chunk_size to 2048 means that one chunk can hold up to 2048 tokens. In our case, each chunk contains all the data from one CV file.
Next, build a query engine. This is the central abstraction that defines how a user’s query is processed over a knowledge base: it coordinates retrieval, response synthesis, and, in some cases, prompt construction to return an answer.
query_engine = vector_index.as_query_engine(
    llm=llm,
    similarity_top_k=2,
    system_prompt="""
You are a specialist in sourcing candidates for various IT-related positions.
You have access to a resume database of available developers, designers,
testers, and other IT professionals.
Your task is to analyze the user's input and extract the key requirements
for the vacancy, such as technology stack, certifications, years of experience,
candidate expertise, expected role in the team.
Based on this information and the provided context, select the most suitable
candidates from the database.
If you cannot find any, reply that you have no information about such a profile.
Your response must include the name of the selected candidate and a brief
explanation highlighting the key reasons why this candidate was chosen, such as
matching skills, relevant experience, certifications, and fit for the role.
""",
)
The similarity_top_k parameter controls how many of the most similar chunks are retrieved from the index and passed along with the user's request to the LLM.
Start querying and check the response
response = query_engine.query("Do we have dotnet developer?")
print(response)
>>>Yes, there is a Middle Umbraco Developer with a strong background in .NET technologies, including experience in building and maintaining content management systems using Umbraco.
Test 1. Model Comparison by a Human
As part of this evaluation, we investigated how different embedding models impact LLM-generated responses by asking the same set of questions across an identical group of CVs. It is important to emphasize that the observations and interpretations presented here reflect the author’s personal perspective. Let’s take a closer look at the specifics of the experiment!
We created five CV documents representing IT professionals with diverse levels of experience and areas of specialization. These served as the external dataset for the RAG pipeline. Each CV includes a summary of key skills, previous work experience, educational background, and certifications where applicable. Any resemblance to real people is unintentional, and the names used are entirely fictional.
A separate query engine was created for each of the selected embedding models. Each engine was asked the same three questions derived from the CV content:
Do we have Umbraco developers?
Do we have Lead Front-end developers?
Do we have any certified candidates with extensive Docker experience?
Each question was asked three times per engine to ensure more reliable and consistent results; a simplified sketch of this loop is shown below. We don't persist the vector index to disk: it exists entirely in memory, so every run of the program generates a new set of vectors.
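Here is a simplified sketch of that evaluation loop, reusing the nodes and llm defined earlier. Only the Ollama-hosted models are shown; the OpenAI and Hugging Face models are created analogously with their respective LlamaIndex integrations, and the system prompt is omitted for brevity.
# Self-hosted embedding models served by Ollama (illustrative subset)
embedding_models = {
    "nomic-embed-text": OllamaEmbedding(model_name="nomic-embed-text"),
    "bge-m3:567m": OllamaEmbedding(model_name="bge-m3:567m"),
}

questions = [
    "Do we have Umbraco developers?",
    "Do we have Lead Front-end developers?",
    "Do we have any certified candidates with extensive Docker experience?",
]

for model_name, embed_model in embedding_models.items():
    # A fresh in-memory index per model, so a new set of vectors on every run
    vector_index = VectorStoreIndex(nodes, embed_model=embed_model)
    query_engine = vector_index.as_query_engine(llm=llm, similarity_top_k=2)
    for question in questions:
        for run in range(3):  # each question is asked three times
            print(f"[{model_name}] run {run + 1}: {question}")
            print(query_engine.query(question))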
Let’s review some results (Table 2-3).
Table 2. Most Incorrect Answers
Table 3. Most Correct Answers
Based on this experiment, we can conclude that the embedding models text-embedding-3-small and BAAI/bge-small-en-v1.5 provide the most accurate and explainable responses. However, because these results are subjective, the models should also be compared using more precise metrics.
Test 2. Model Evaluation Using LlamaIndex
LlamaIndex offers several tools, called evaluators, for automatically assessing the quality of outputs generated by RAG systems. The framework also provides a variety of metrics for evaluating the efficiency of the embedding model, which was the focus of our analysis. Specifically, we calculated the following values:
MRR (Mean Reciprocal Rank): measures how high the first relevant chunk is ranked in the list of retrieved chunks. For each query, the reciprocal rank is 1 divided by the position of the first relevant chunk in the ordered list, and the final MRR score is the average of these reciprocal ranks across all queries. A higher MRR means the system tends to return relevant results earlier in the ranking, which indicates better search efficiency.
Hit Rate (Top-k Accuracy): the fraction of queries for which at least one relevant chunk appears among the top k returned chunks, usually expressed as Hit@k (e.g., Hit@5 or Hit@10). A higher hit rate means the model is more likely to return at least one correct chunk near the top of the result list, which matters in practice because users tend to examine the results at the top.
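To make the two metrics concrete, here is a small, library-independent sketch that computes MRR and Hit@k from ranked retrieval results; the sample data is made up.
def mrr(ranked_ids_per_query, relevant_ids_per_query):
    """Mean Reciprocal Rank: average of 1 / (position of the first relevant chunk)."""
    scores = []
    for ranked_ids, relevant_ids in zip(ranked_ids_per_query, relevant_ids_per_query):
        score = 0.0
        for position, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id in relevant_ids:
                score = 1.0 / position
                break
        scores.append(score)
    return sum(scores) / len(scores)

def hit_rate(ranked_ids_per_query, relevant_ids_per_query, k=5):
    """Hit@k: fraction of queries with at least one relevant chunk in the top k results."""
    hits = [
        any(chunk_id in relevant_ids for chunk_id in ranked_ids[:k])
        for ranked_ids, relevant_ids in zip(ranked_ids_per_query, relevant_ids_per_query)
    ]
    return sum(hits) / len(hits)

# Toy example: two queries with their top-3 retrieved chunk ids and the truly relevant ids
ranked = [["cv_3", "cv_1", "cv_5"], ["cv_2", "cv_4", "cv_1"]]
relevant = [{"cv_1"}, {"cv_5"}]
print(mrr(ranked, relevant))            # (1/2 + 0) / 2 = 0.25
print(hit_rate(ranked, relevant, k=3))  # 1 hit out of 2 queries = 0.5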
First, we created an array of queries to be used throughout the test. For this, we used another LLM, gpt-4o-mini, deployed in Azure AI Foundry (Azure OpenAI Service).
Create a question dataset
from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.llms.azure_openai import AzureOpenAI
question_llm = AzureOpenAI(
    model="gpt-4o-mini",
    temperature=0.0,
    engine="AZURE_DEPLOYMENT_NAME",
    api_key="AZURE_API_KEY",
    api_version="AZURE_API_VERSION",
    azure_endpoint="AZURE_ENDPOINT",
)
question_dataset = generate_question_context_pairs(
    nodes=nodes,
    llm=question_llm,
    num_questions_per_chunk=5,
    qa_generate_prompt_tmpl="""\
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, generate only questions based on the query below.
You are a Teacher/Professor. Your task is to set up {num_questions_per_chunk} questions for an upcoming quiz. The questions should be diverse in nature across the document. Restrict the questions to the context information provided. The response must be information from the context provided.
""",
)
It generated five questions per chunk (or CV file in our case). Here are some of the questions:
What are the core skills that Alex Taylor possesses as a Middle Umbraco Developer, and how do they relate to his experience in building content management systems?
Describe the purpose of Vuex in Alice's projects and provide an example of a project where it was utilized.
What are the core technologies that John Lee specializes in as a Senior Vue.js Developer?
List two certifications that John Smith has obtained related to his professional expertise.
What educational qualification does John Smith hold, and from which institution did he graduate?
Initialize RetrieverEvaluator and calculate the metrics
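A minimal sketch of this step, assuming the question_dataset generated above and the vector_index built earlier; the final averaging of per-query values is our own summary step.
import asyncio

from llama_index.core.evaluation import RetrieverEvaluator

# Retriever over the same vector index that backs the query engine
retriever = vector_index.as_retriever(similarity_top_k=2)

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

# Run every generated question against the retriever and collect per-query metrics
eval_results = asyncio.run(retriever_evaluator.aevaluate_dataset(question_dataset))

# Average the per-query values to get the final scores
mrr_score = sum(r.metric_vals_dict["mrr"] for r in eval_results) / len(eval_results)
hit_rate_score = sum(r.metric_vals_dict["hit_rate"] for r in eval_results) / len(eval_results)
print(f"MRR: {mrr_score:.3f}  Hit Rate: {hit_rate_score:.3f}")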
As in the previous section, each query engine was evaluated three times to obtain more accurate data. Let’s review the results (Table 4).
Table 4. Retriever Evaluation results
text-embedding-3-small is the best-performing embedding model, followed closely by BAAI/bge-small-en-v1.5. nomic-embed-text and bge-m3:567m are significantly weaker on both metrics and less stable across iterations.
Conclusion
In our specific evaluation, using a basic RAG setup with test CVs, the embedding model text-embedding-3-small showed the highest effectiveness when combined with the LLM Llama3.2. The model BAAI/bge-small-en-v1.5 ranked second, delivering slightly lower performance. The remaining self-hosted models produced less favorable outcomes. Still, it's essential to understand that these results can vary depending on multiple factors, such as the size of the CV dataset, the chunk configuration, and the number of nodes retrieved during query execution.
We were also satisfied with the effectiveness of text-embedding-3-small when used with OpenAI models such as gpt-4o-mini, although those results are not included in this article.
If you have questions or want to learn how Retrieval-Augmented Generation (RAG) can enhance your AI adoption, we recommend reading about our AI Assistants service. It shows how RAG-based systems enable assistants to deliver context-aware responses. Understanding intelligent retrieval and context integration is essential when starting with AI or running large language models in production. These elements help you get the most value from your data.
Prepared by a Certified Microsoft Developer, this article reflects our team's high standards and deep expertise. UKAD has talented developers on board and can help you assemble a dedicated team by selecting the most suitable experts from our talent pool. We assist with conducting technical interviews, validating culture fit, and finalizing all necessary agreements. Partner with us to effectively integrate AI into your business processes and achieve practical, measurable results!
Alexey Zhadan
Certified .NET developer
Alexey is a highly skilled .NET developer with over 6 years of experience in backend development, Azure cloud solutions, and API integration. He holds several key certifications including Microsoft Certified: Azure Developer Associate (AZ-204), Azure Administrator Associate (AZ-104), and Azure Fundamentals (AZ-900). With strong proficiency in ASP.NET Core, Azure DevOps, and modern CMS platforms like Umbraco, Alexey combines technical excellence with a proactive, problem-solving mindset.