Retrieval-Augmented Generation (RAG): From the Basics to Implementation

Understand and implement a RAG system from scratch

A RAG-based QA assistant can serve a wide range of domains, from e-commerce customer support and public-service chatbots to research assistants, healthcare information systems, and legal precedent analysis.

Why RAG?

While LLMs are becoming more powerful every day, they still have key limitations: they are trained on enormous datasets that are nonetheless only a partial and outdated snapshot of available knowledge. “Partial” means that the model’s training corpus includes public websites, books, and codebases, but not internal documents such as corporate manuals, proprietary research data, or user-specific data. “Outdated” means that once a model is released, it has no awareness of newer events or evolving facts.

When faced with a question beyond its training data, the model is prone to a well-known failure mode: hallucination. That is, the LLM may guess, fabricating an answer that appears plausible but is actually false. RAG is one technology that helps address this problem, by grounding the LLM’s response in retrieved, factual source material.

In a nutshell, an LLM is a ‘brain with a memory of the outside world that never gets updated’, and RAG empowers it to expand its vision.

What is RAG?

Retrieval-Augmented Generation is still a form of generation, performed by an LLM. When you chat with an intelligent QA assistant, the user experience is not much different from chatting with OpenAI’s ChatGPT: on stage, in both scenarios, you ask a question and get a response from the chatbot.

Behind the scenes, however, something more happens: relevant information is retrieved to augment the LLM’s context. For example, when answering real-time questions about current weather, stock prices, or exchange rates, a RAG system retrieves information from live sources before generating a response. And to supplement the model with an understanding of internal or private materials, it retrieves from the database where the knowledge base is stored, as illustrated below.

 

Similar to how we store and retrieve data using SQL databases (like PostgreSQL) and NoSQL databases (like MongoDB), we still store our data in a pool, except that this data is massive, highly unstructured, and cannot be searched by keywords. Here come the concepts of vectors and embeddings.
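Before diving into vectors, the retrieve-then-generate loop itself can be sketched in a few lines. The snippet below is a minimal toy sketch, not a real implementation: the knowledge base is three hard-coded strings, the retriever is a naive keyword-overlap scorer (a real system would use embedding similarity over a vector database), and the final LLM call is stubbed out as returning the assembled prompt.

```python
# Toy retrieve-then-generate sketch. Everything here is illustrative:
# the documents, the keyword retriever, and the stubbed LLM call.
knowledge_base = [
    "Order #1042 ships from the Rotterdam warehouse within 2 business days.",
    "Refunds are processed within 5 business days of receiving the return.",
    "Premium members get free express shipping on all orders.",
]

def retrieve(question, documents, top_k=1):
    """Score each document by how many words it shares with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def answer(question):
    # Inject the retrieved chunks into the prompt so the LLM can ground
    # its answer in them instead of guessing.
    context = "\n".join(retrieve(question, knowledge_base))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return prompt  # in a real system: return llm(prompt)

print(answer("How long do refunds take?"))
```

The key design point is that the model never answers from memory alone: whatever the retriever returns becomes part of the prompt, and the generation step is constrained to it.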

 

Everything Can Be Embedded

As we say in OOP, “Everything can be an object”, now we say, “Everything can be embedded.” That is, everything can be represented as a vector.

In this context, embeddings are vectors with meaning. Since neural networks operate purely on numbers, performing operations such as multiplication and addition, we have to transform text, images, and other material that we humans comprehend into a numeric form that neural networks can process. This form is called an embedding: a vector representing the meaning of a piece of content, such as natural language. The pool storing embeddings is called a vector database.

Representing Meaning in Multiple Dimensions

These vectors are high-dimensional, normally hundreds or thousands of dimensions (512d, 1024d, and so on). While we can visualize a 2D vector in a plane coordinate system as (x, y) and then add a third axis to represent a 3D vector as (x, y, z), a vector with more than three dimensions is difficult to visualize. But we can think of it this way: each dimension represents one trait of a piece of text. Take the following as an example.

All of “cat”, “pine”, “table”, “bee”, “bat”, and “whale” represent a tangible entity.

Are they creatures? cat, pine, bee, bat, whale=[1,1], table=[1,0]

Are they animals? cat, bee, bat, whale=[1,1,1], pine=[1,1,0], table=[1,0,0]

Are they mammals? cat, bat, whale=[1,1,1,1], bee=[1,1,1,0], pine=[1,1,0,0], table=[1,0,0,0]

Do they have wings? cat, whale=[1,1,1,1,0], bat=[1,1,1,1,1], bee=[1,1,1,0,1], pine=[1,1,0,0,0], table=[1,0,0,0,0]

Do they live in water? whale=[1,1,1,1,0,1], cat=[1,1,1,1,0,0], bat=[1,1,1,1,1,0], bee=[1,1,1,0,1,0], pine=[1,1,0,0,0,0], table=[1,0,0,0,0,0]

……

When searching the above-mentioned words for the one most similar to “tiger”, “cat” will win.

With the same logic, a high-dimensional vector represents hundreds of traits of a piece of text (what we call a “chunk”), an image, or anything else. In practice, the values in the array are continuous rather than discrete 0s and 1s.
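The similarity search above can be made concrete with cosine similarity, a standard measure of how aligned two vectors are. The six trait vectors below are the hand-crafted ones from the example (tangible, creature, animal, mammal, wings, lives-in-water); the vector for “tiger” is an assumption added for illustration, following the same traits.

```python
import math

# Hand-crafted trait vectors from the example above:
# [tangible, creature, animal, mammal, wings, lives-in-water]
vectors = {
    "cat":   [1, 1, 1, 1, 0, 0],
    "pine":  [1, 1, 0, 0, 0, 0],
    "table": [1, 0, 0, 0, 0, 0],
    "bee":   [1, 1, 1, 0, 1, 0],
    "bat":   [1, 1, 1, 1, 1, 0],
    "whale": [1, 1, 1, 1, 0, 1],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Assumed trait vector for "tiger": a tangible creature, an animal,
# a mammal, no wings, does not live in water.
tiger = [1, 1, 1, 1, 0, 0]

best = max(vectors, key=lambda word: cosine_similarity(vectors[word], tiger))
print(best)  # cat
```

With these toy traits, “tiger” and “cat” share every dimension, so their cosine similarity is 1.0 and “cat” wins, exactly as claimed above.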

How to Embed Documents

How do we embed them, i.e., transform documents into vectors? Fortunately, a wide collection of trained models is ready for use (the Hugging Face Hub is a good place to find them). We’ll explore the internal mechanisms later; for now, let’s focus on the workflow.

The following code is an example of how to use openai/clip-vit-base-patch32 to embed an image.

# transformers is Hugging Face's framework for using pretrained models.
# Install the dependencies first: pip install transformers torch Pillow
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load the CLIP model and its preprocessor
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_image_embedding(image_path):
    image = Image.open(image_path)
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        image_features = clip_model.get_image_features(**inputs)
    return image_features[0].numpy()

'''
Call the function with an image and print the returned result,
you will get an array similar to
[-0.0132809 0.04572856 0.08792043 -0.03914355 -0.02482264
0.05938271 0.0719912 -0.00396497 0.02297862 -0.0803934
0.01876233 0.05230444 -0.03180214 0.0940239 -0.06793992
...
0.01547239 -0.01027456 0.04728566]
It contains 512 values because openai/clip-vit-base-patch32 transforms an image into a 512d vector.
'''
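Once documents or images are embedded into such 512d vectors, “searching” the vector database boils down to a nearest-neighbor lookup. The sketch below brute-forces cosine similarity with NumPy over random vectors standing in for real embeddings; the data is synthetic and purely illustrative, and production systems would use an approximate-nearest-neighbor library or a dedicated vector database instead of a full scan.

```python
import numpy as np

# Synthetic stand-ins for real embeddings: 1,000 chunks, 512 dimensions each.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 512))
# A query that is a slightly noisy copy of chunk 42, so it should rank first.
query_embedding = doc_embeddings[42] + rng.normal(scale=0.01, size=512)

def top_k(query, docs, k=3):
    # Normalize rows so the dot product equals cosine similarity
    docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = docs_n @ query_n
    # Indices of the k most similar chunks, best first
    return np.argsort(scores)[::-1][:k]

print(top_k(query_embedding, doc_embeddings))  # chunk 42 ranks first
```

This brute-force scan is O(number of chunks) per query; it is fine for thousands of vectors, which is why larger deployments switch to approximate indexes.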

Please read the entire article in my Medium stories. Here is the link: 👉 Understand and implement an RAG system from scratch