Retrieval-Augmented Generation (RAG): From the Basics to Implementation

Understand and implement an RAG system from scratch

A RAG-based QA assistant can serve a wide range of domains — from e-commerce customer support and public service chatbots to research assistants, healthcare information systems, and legal precedent analysis.

Why RAG?

While LLMs are becoming more powerful every day, they still have key limitations: they are trained on enormous datasets that are nonetheless only a partial and outdated snapshot of available knowledge. “Partial” means that the model’s training corpus includes public websites, books, and codebases, but never internal documents such as corporate manuals, proprietary research data, or user-specific data. “Outdated” means that once a model is released, it has no awareness of new events or evolving facts.

When faced with a question beyond its training data, the model is prone to a typical problem — hallucination: the LLM may guess and fabricate an answer that appears plausible but is actually false. One of the technologies that can help address this problem is RAG, which grounds the LLM’s response in retrieved facts. In a nutshell, an LLM is a brain with a memory of the outside world but no updates, while RAG empowers it to expand its vision.

What Is RAG?

Retrieval-Augmented Generation is still a form of generation, performed by LLMs. When you chat with an intelligent QA assistant, the user experience is not much different from chatting with OpenAI’s ChatGPT. What is seen on stage in both scenarios is that you ask a question and get a response from the chatbot.

Behind the scenes, however, something more has happened — the retrieval of relevant information to augment the LLM’s context. For example, when answering real-time questions such as current weather, stock prices, or exchange rates, the RAG system retrieves information from live sources before generating a response.
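The retrieve-then-generate flow just described can be sketched in a few lines. Note that `search_knowledge_base` and `call_llm` below are hypothetical stand-ins for a real vector-database query and a real chat-completion API call, not any specific library.

```python
# A minimal sketch of the retrieve-then-generate flow, assuming
# hypothetical stand-ins for retrieval and for the LLM call.

def search_knowledge_base(question, top_k=3):
    # A real system embeds the question and returns the top_k chunks whose
    # embeddings are most similar to it; here we return canned chunks.
    return [
        "The theme park opens at 9:00 and closes at 21:00.",
        "Ticket prices are reduced on weekdays.",
    ][:top_k]

def call_llm(prompt):
    # Stand-in for an LLM call; a real system would send `prompt` to a model.
    return "Answer grounded in the retrieved context."

def answer(question):
    # Retrieval happens before generation: the retrieved chunks travel to
    # the model inside the prompt.
    context = "\n".join(search_knowledge_base(question))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```

The essential point is the order of operations: retrieve first, then generate with the retrieved text in the prompt.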
And to supplement the model with an understanding of internal or private materials, the system retrieves from the database where the knowledge base is stored, as illustrated below. Similar to how we store and retrieve data using SQL (like PostgreSQL) and NoSQL (like MongoDB) databases, we still store our data in a pool, except that the data is highly unstructured, massive, and cannot be searched by keywords. Here comes the concept of vectors and embeddings.

Everything Can Be Embedded

As we say about OOP, “Everything can be an object,” now we say, “Everything can be embedded.” That is, everything can be represented as a vector.

In this context, embeddings are vectors. They are vectors with meanings. Since neural networks operate purely on numbers, performing operations such as multiplication and addition, we have to transform texts, images, and so on, which we human beings can comprehend, into a numeric form that neural networks can process. This form is called an embedding: a vector representing the meaning of a piece of natural language. The pool storing embeddings is called a vector database.

Representing Meaning in Multiple Dimensions

These vectors are high-dimensional, normally hundreds of dimensions: 512d, 1024d, something like that. While we can visualize a 2D vector in a plane coordinate system as (x, y) and then add a third axis to represent a 3D vector as (x, y, z), it is difficult to visualize a vector with more than three dimensions. But we can think of it this way: one dimension represents one trait of a piece of text. Take the following as an example.

All of “cat”, “pine”, “table”, “bee”, “bat”, and “whale” represent a tangible entity.

Are they creatures? cat, pine, bee, bat, whale = [1,1]; table = [1,0]
Are they animals? cat, bee, bat, whale = [1,1,1]; pine = [1,1,0]; table = [1,0,0]
Are they mammals? cat, bat, whale = [1,1,1,1]; bee = [1,1,1,0]; pine = [1,1,0,0]; table = [1,0,0,0]
Do they have wings? cat, whale = [1,1,1,1,0]; bat = [1,1,1,1,1]; bee = [1,1,1,0,1]; pine = [1,1,0,0,0]; table = [1,0,0,0,0]
Do they live in water? whale = [1,1,1,1,0,1]; cat = [1,1,1,1,0,0]; bat = [1,1,1,1,1,0]; bee = [1,1,1,0,1,0]; pine = [1,1,0,0,0,0]; table = [1,0,0,0,0,0]
…

When searching for similarity with “tiger” among the above words, “cat” will win.

With the same logic, a high-dimensional vector represents hundreds of traits of a piece of text (what we call a “chunk”), an image, or anything else. Values in the array are continuous rather than discrete (0s and 1s).

How to Embed Documents

How do we embed documents, i.e., transform them into vectors? Fortunately, a wide collection of trained models (click and find available resources) is ready for use. We’ll explore the internal mechanisms later — for now, let’s focus on the workflow.

The following code is an example of how to use clip-vit-base-patch32 to embed an image.

```python
# transformers is a framework for model usage, developed by Hugging Face.
# Install it with "pip install transformers"; Pillow with "pip install Pillow".
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model and processor
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_image_embedding(image_path):
    image = Image.open(image_path)
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        image_features = clip_model.get_image_features(**inputs)
    return image_features[0].numpy()

'''
Call the function with an image and print the returned result; you will get
an array similar to
[-0.0132809   0.04572856  0.08792043 -0.03914355 -0.02482264  0.05938271
  0.0719912  -0.00396497  0.02297862 -0.0803934   0.01876233  0.05230444
 -0.03180214  0.0940239  -0.06793992 ...  0.01547239 -0.01027456  0.04728566]
It contains 512 values because openai/clip-vit-base-patch32 transforms an
image into a 512d vector.
'''
```
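The claim above that “cat” wins for “tiger” can be checked with a few lines of cosine similarity over the toy trait vectors. The tiger vector here is my own assumption, built with the same six traits (tangible, creature, animal, mammal, wings, lives in water).

```python
import math

# Toy 6d "trait" vectors from the example above.
words = {
    "cat":   [1, 1, 1, 1, 0, 0],
    "pine":  [1, 1, 0, 0, 0, 0],
    "table": [1, 0, 0, 0, 0, 0],
    "bee":   [1, 1, 1, 0, 1, 0],
    "bat":   [1, 1, 1, 1, 1, 0],
    "whale": [1, 1, 1, 1, 0, 1],
}

def cosine_similarity(a, b):
    # Similarity of direction: dot product divided by the vectors' norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Assumed tiger vector: a tangible creature, an animal, a mammal,
# without wings, not living in water -- exactly the traits of "cat".
tiger = [1, 1, 1, 1, 0, 0]
best = max(words, key=lambda w: cosine_similarity(tiger, words[w]))
print(best)  # cat
```

Real embedding models do the same kind of comparison, only over hundreds of continuous dimensions instead of six binary ones.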
Please read the entire article in my Medium stories. Here is the link: 👉 Understand and implement an RAG system from scratch

Read More

Function Calling + ReAct — From Generation to Agent

Enhancing Q&A Assistants with Tools

In the previous article, I introduced how to build an intelligent Q&A assistant based on the RAG (Retrieval-Augmented Generation) architecture. While this approach helps reduce hallucinations and grounds responses in domain-specific knowledge, it has limitations. For instance, consider questions like: “When should I visit the theme park to avoid peak seasons?” “I plan to visit the park next Sunday. What will the weather be like?” In such cases, the assistant may query the vector database but still respond: “I’m happy to help, but I don’t have information on the visit flow or weather forecast.” Clearly, more capabilities are needed. This article will introduce function calling and the ReAct pattern, demonstrating how to empower the assistant to reason, act, and utilize external tools effectively. We will explore technologies like OpenAI, LangChain, Text-to-SQL, and Qwen-Agent.

Please read the full article in my Medium stories. Here is the link: 👉 Function Calling + ReAct — From Generation to Agent
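As a taste of the ReAct pattern mentioned above, here is a minimal sketch of its Reason–Act–Observe loop. The `get_weather` tool and the `react_step` decision stub are hypothetical placeholders, not the OpenAI, LangChain, or Qwen-Agent APIs covered in the article.

```python
# A minimal sketch of a ReAct-style loop, assuming hypothetical tools
# and a hard-coded stand-in for the model's reasoning step.

def get_weather(city):
    # Stand-in for a real weather API call.
    return f"Sunny, 24°C in {city}"

TOOLS = {"get_weather": get_weather}

def react_step(question, observations):
    # A real implementation would ask the LLM to emit either an action
    # ("call tool X with argument Y") or a final answer; here the choice
    # is hard-coded for illustration.
    if not observations:
        return ("action", "get_weather", "Shanghai")
    return ("final", f"Based on '{observations[-1]}', it is a good day to visit.")

def run_agent(question):
    observations = []
    while True:
        step = react_step(question, observations)  # Reason
        if step[0] == "final":
            return step[1]
        _, tool, arg = step
        observations.append(TOOLS[tool](arg))      # Act, then Observe
```

The point of the pattern is the loop itself: the model reasons about what it is missing, calls a tool, observes the result, and only then answers.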

Read More

From CNNs to YOLO: Understanding Object Detection

Fundamentals Before Training a Steel Defect Detection Model

This article is a learning reflection. I share how I came to understand the workflow of CNNs for image analysis, supported by visualization tools and a few sketches I drew during the learning process. With this foundation, I then look at how YOLO builds on CNNs to reframe object detection.

Object Detection and Image Analysis

When I first started learning about neural networks, machine learning, PyTorch, etc., my understanding of image analysis was rather vague. I knew that images could be “analyzed” by models, but I did not clearly distinguish between different types of tasks. Over time, I realized that image analysis involves various tasks, and object detection is one of them, different from classification, image segmentation, etc. Instead of labelling the whole image with a single category, such as “shirt”, “dress”, or “coat”, as in image classification, object detection refers to analysing:

Please read the full article in my Medium stories. Here is the link: 👉 From CNNs to YOLO: Understanding Object Detection
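To make the classification-versus-detection distinction concrete, here is a toy sketch of what each task's output looks like. The labels, boxes, and confidence scores are invented for illustration, not output from any real model.

```python
# Hypothetical model outputs, to contrast the two task types.

# Image classification: one label for the whole image.
classification_output = "coat"

# Object detection: a list of objects, each with a class label, a bounding
# box (x_min, y_min, x_max, y_max in pixels), and a confidence score.
detection_output = [
    {"label": "shirt", "box": (34, 50, 120, 200), "confidence": 0.91},
    {"label": "coat", "box": (150, 40, 260, 220), "confidence": 0.47},
]

def keep_confident(detections, threshold=0.5):
    # Typical detection post-processing: drop low-confidence boxes.
    return [d for d in detections if d["confidence"] >= threshold]

print([d["label"] for d in keep_confident(detection_output)])  # ['shirt']
```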

Read More

Steel Defect Detection: From Theory to Practice

Train a Surface Defect Detection Model Based on YOLO

Understanding the theory behind CNNs and YOLO was only half of the journey. The real question for me was how these ideas can be applied in a real industrial setting. Steel defect detection turned out to be a good test case.

Unlike natural images, industrial images are often repetitive, visually subtle, and far less forgiving. Defects may occupy only a small region of the surface, while the background remains nearly uniform. Minor variations in texture can easily blur the boundary between “normal” and “defective”.

From a modeling perspective, steel defect detection fits well into the object detection paradigm:

Please read the full article in my Medium stories. Here is the link: 👉 Steel Defect Detection: From Theory to Practice

Read More