RAG Explained for Beginners

Lately, a few buzzwords have been floating around everywhere in the industry: RAG, Applied AI, Retrieval Augmented Generation. Today, let's sit back, relax, and actually understand what all of this means.


What is RAG?

RAG stands for Retrieval Augmented Generation.

Break it down and it literally tells you what it does. First you retrieve relevant information, then you augment the prompt with it, and then you generate a response. Or you can say it another way: first you retrieve context, then you generate output from it.

Simple enough, right? But then why do we even need it?


Why Do We Need RAG?

Why can't we just query an LLM like ChatGPT or Claude directly? What's the issue?

Here's the thing. Most LLMs are trained on massive amounts of data. We are talking about a huge slice of the public internet. And because of that, they are highly generic. They are great at a wide range of tasks, but the moment you need something very specific, they start to fall short.

Let's take an example. And we will carry this example throughout the entire blog.


Our Running Example: Dog Knowledge Assistant

Assume you want to build a system that answers questions about dogs. Deep, specific, expert-level questions about dogs.

Now, ChatGPT or any general-purpose LLM might know something about dogs. But it won't know dogs the way a dedicated dog expert would. It might hallucinate facts, give vague answers, or just not go deep enough.

So what do we do? We build a RAG pipeline specifically for dogs.

We take every Wikipedia article about dogs, every research paper, every guide we can find, and we feed all of that into our system. Now our system has actual, reliable, specific knowledge about dogs. And it can answer questions from that knowledge, not from guessing.

That is RAG in a nutshell. Now let's go through it step by step.


The RAG Pipeline: High Level

Before we dive in, here is a bird's eye view of the full flow so you have a mental map:

Documents (PDFs, Word files, text files)
            |
     Context Ingestion
            |
       Text Chunking
            |
    Embedding Generation
            |
      Vector Database
            |
      (User asks a question)
            |
      Query Embedding
            |
      Similarity Search
            |
    Retrieved Chunks (this becomes the context)
            |
    System Prompt + Context + Question
            |
           LLM
            |
          Answer

Okay. Now let's go through each of these steps one by one.


Step 1: Context Ingestion

This is the starting point. You gather all the data you want your system to know about and you feed it in.

In our dog example, this means uploading every Wikipedia article about dogs. You can feed in Word documents, PDFs, plain text files, basically any text document you have. This becomes your system's knowledge base. Everything your RAG will ever know comes from here.


Step 2: Text Chunking

Alright, so now you have your documents. The next step is chunking, also called text splitting.

Now you might be wondering, why do we even need to chunk? Why not just feed the whole document directly?

Because we don't want to dump a massive wall of text at the LLM every time someone asks a question. We want to be precise. We want to retrieve only the relevant pieces of information, not the entire Wikipedia library on dogs. So we split the documents into smaller pieces called chunks.

Chunking Strategies

There are a few ways to do this. Some notable ones:

  • Sentence Chunking: Split after every full stop. Each sentence becomes its own chunk.
  • Fixed Size Chunking: Split after every N characters, say every 500 characters, regardless of where the sentence ends.
  • Paragraph Chunking: Split by paragraphs. Each paragraph is one chunk.
  • Recursive Chunking: This is arguably the most effective approach and here is how it works. First, the text is split by paragraphs. If a paragraph chunk is still too big, it is split further by sentences. If it is still too big, it is split by words. And if it is still too big, by characters. Basically it keeps breaking things down until the chunk is small enough.

Recursive chunking is smart because it tries to keep meaningful units together for as long as possible. It only breaks things down further when it absolutely has to.
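
To make the recursive idea concrete, here is a minimal Python sketch. The names recursive_chunk and SPLIT_PATTERNS, and the regex patterns themselves, are my own illustrative choices and not any library's API. One simplification to be aware of: production splitters usually also merge small neighbouring pieces back together up to the size limit, and this sketch skips that step.

```python
import re

# Coarse-to-fine split patterns: paragraphs, then sentences, then words.
# These patterns are illustrative assumptions, not a standard.
SPLIT_PATTERNS = (r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+")


def recursive_chunk(text, max_chars, patterns=SPLIT_PATTERNS):
    """Split text with progressively finer patterns until each piece fits."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if not patterns:
        # Nothing finer left to try: fall back to a hard split by characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    chunks = []
    for piece in re.split(patterns[0], text):
        # Pieces that are still too big get split again with the next pattern.
        chunks.extend(recursive_chunk(piece, max_chars, patterns[1:]))
    return chunks


text = "Dogs are loyal. They bark loudly.\n\nCats nap."
print(recursive_chunk(text, max_chars=20))
```

With max_chars=20, the first paragraph is too long, so it gets split into its two sentences, while the short second paragraph is kept whole.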

Chunk Size

Chunk size is basically how big you want each chunk to be. For example, 200 characters per chunk, or 300 tokens per chunk. This depends on your use case and the kinds of questions you expect users to ask.

Chunk Overlap

This is an important one. Chunk overlap means: how many characters or sentences from the previous chunk should also appear at the start of the next chunk?

Why does this matter? Context. If a sentence at the end of one chunk is directly connected to the idea at the beginning of the next chunk, you don't want to lose that thread. A little overlap keeps things connected and makes sure no important context gets cut off at the boundary.

Think of it like reading a book. If you rip out every page individually and read each page in isolation, you lose the flow of the story. A bit of overlap between pages keeps things making sense.
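
Chunk size and overlap are easiest to see with fixed size chunking. The sketch below is a minimal version, assuming sizes are measured in characters; the function name chunk_with_overlap and its parameters are hypothetical, chosen just for this example.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Fixed size chunking where each chunk repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Notice how every chunk starts with the last two characters of the one before it. That shared tail is the overlap doing its job.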


Step 3: Embeddings

Okay, so now you have all your chunks ready. The next step is converting those chunks into embeddings.

Now what exactly is an embedding?

What Are Embeddings?

Embeddings are numerical vector representations of text. In simple terms, they are just a way of converting words and sentences into numbers so that a machine can understand and compare them mathematically.

Why do we need this? Because machines and LLMs cannot naturally understand human language the way we do. They need a numerical representation that captures the meaning of the text, not just the words.

For example, the word "dog" might be represented as something like [0.111, 0.952, 0.334, ...]. Just a list of numbers. And here is the key idea: words or sentences with similar meanings will have similar vectors. That is the whole point. We are coming to why that matters very soon.

How exactly are these numbers calculated? That is honestly beyond the scope of this blog and also beyond the scope of RAG engineering in general. Because when you are working in this space, you are using these technologies, not building them from scratch. You don't need to know how a car engine is built to drive a car.

Embedding Models

Just like there are models for generating text, like Claude or GPT, there are models specifically built for generating embeddings. These are called embedding models. You feed your chunks into an embedding model, and it outputs a vector for each chunk. That's it.

During training, these embedding models learn patterns from massive amounts of text. Words and sentences that frequently appear in similar contexts end up with similar vector representations. That is why "dog" and "puppy" end up close together, while "dog" and "airplane" end up far apart.
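
To show the shape of the thing (text goes in, a fixed-length list of numbers comes out), here is a toy stand-in written in plain Python. To be loud about it: this is NOT how a real embedding model works. It is a hashed bag-of-words that I made up for illustration, and the name toy_embed and the dims parameter are hypothetical. It only notices shared words, so it will not place "dog" and "puppy" close together the way a trained model does.

```python
import math
import re
import zlib


def toy_embed(text, dims=16):
    """Toy stand-in for an embedding model: hashed word counts, unit length."""
    vector = [0.0] * dims
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 gives the same bucket on every run (unlike Python's hash()).
        bucket = zlib.crc32(word.encode("utf-8")) % dims
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


vector = toy_embed("Saint Bernards thrive in cold climates.")
print(len(vector))  # always dims numbers, no matter how long the text is
```

In a real pipeline you would swap toy_embed for a call to an actual embedding model. The rest of the pipeline only cares that every chunk comes back as a vector of the same length.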


Step 4: Vector Databases

Now you have all these embeddings. You need somewhere to store them. That is what vector databases are for.

A few popular ones:

  • ChromaDB
  • FAISS
  • Pinecone

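These are tools specifically designed to store embeddings and search through them quickly, usually along with the original text chunks and their metadata. (Strictly speaking, FAISS is a similarity search library rather than a full database, so you typically keep the text and metadata alongside it yourself.) Metadata means things like which document a chunk came from, what page it was on, stuff like that.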
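
If it helps to picture what one stored entry looks like, here is a minimal in-memory sketch. The Record and InMemoryVectorStore names are hypothetical and this is not the API of ChromaDB, FAISS, or Pinecone. It only shows the three things that travel together: the vector, the chunk text, and the metadata.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """One stored chunk: its embedding, its text, and where it came from."""
    vector: list
    text: str
    metadata: dict = field(default_factory=dict)


class InMemoryVectorStore:
    """Toy stand-in for a vector database; everything lives in a Python list."""

    def __init__(self):
        self.records = []

    def add(self, vector, text, metadata: Optional[dict] = None):
        # Copy the inputs so later changes by the caller don't alter the store.
        self.records.append(Record(list(vector), text, dict(metadata or {})))

    def __len__(self):
        return len(self.records)


store = InMemoryVectorStore()
store.add(
    [0.111, 0.952, 0.334],  # made-up numbers, just like the "dog" example
    "Saint Bernards are originally from the Swiss Alps.",
    {"source": "saint_bernard.txt", "page": 1},
)
print(len(store), store.records[0].metadata)
```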

Persistence

Here is something practical. You don't want to recalculate all your embeddings every single time you run the system. That would be incredibly slow and wasteful.

Vector databases support something called persistence. Think of it like a cache. You calculate the chunk embeddings once, store them, and the next time someone queries the system, it just reads them from the database instead of regenerating everything from scratch. Only the incoming question needs a fresh embedding. Calculate once, reuse until your documents or your embedding model change. Simple.
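
Here is the "calculate once, reuse" idea as a small sketch. Real vector databases handle persistence for you. The JSON file and the names save_index and load_or_build_index are assumptions I am using only to show the pattern.

```python
import json
from pathlib import Path


def save_index(records, path):
    """Write the records (vectors, text, metadata) to disk as JSON."""
    Path(path).write_text(json.dumps(records), encoding="utf-8")


def load_or_build_index(path, build):
    """Reuse the saved index if it exists; otherwise build it once and save it."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    records = build()  # the slow part: chunking and embedding everything
    save_index(records, path)
    return records
```

The build argument is whatever function does your chunking and embedding. It only runs when the file is missing, so the second run of your program skips the expensive work entirely.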


The System Prompt

Before we get to the querying part, there is one more important thing to set up: the system prompt.

The system prompt is basically the instruction you give to the LLM about what this RAG is supposed to do. Think of it as a template. It defines the role of your assistant.

For our dog example, it might look something like this:

You are a senior dog expert. You provide information about 
various dog-related queries based on the context provided below.

Context: {context}

Question: {question}

Notice the two variables here:

  • {context}: this gets filled in with the relevant chunks retrieved from the vector database
  • {question}: this is whatever the user asked

The system prompt is what gives your RAG its purpose. Change it and you change what the assistant does.
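
Since changing the template changes the assistant, a small safety check can be handy: make sure any template you swap in still has both variables. This is just a sketch using Python's standard string.Formatter. The names REQUIRED_FIELDS, template_fields, and validate_template are hypothetical.

```python
from string import Formatter

REQUIRED_FIELDS = {"context", "question"}


def template_fields(template):
    """Return the placeholder names, e.g. {context}, found in a template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def validate_template(template):
    """Raise if the template lost one of the variables the pipeline fills in."""
    missing = REQUIRED_FIELDS - template_fields(template)
    if missing:
        raise ValueError(f"template is missing placeholders: {sorted(missing)}")
    return template


dog_template = validate_template(
    "You are a senior dog expert. You provide information about\n"
    "various dog-related queries based on the context provided below.\n"
    "\n"
    "Context: {context}\n"
    "\n"
    "Question: {question}"
)
print(template_fields(dog_template))
```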


Choosing the LLM

Now comes the fun part. You get to choose which LLM actually generates the final response.

This can be an open source model like DeepSeek or Mistral, or it can be a proprietary model like Claude Opus or GPT-4, as long as you have the API key. You have full liberty here.

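You can also set the max tokens for the response, which puts an upper limit on how long the answer can be. It is a cap, not a target: the model can stop earlier, and an answer that hits the cap simply gets cut off. Do you want a short, snappy answer or a detailed explanation? That is totally up to you.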


Step 5: Retrieval

Okay, so everything is set up. Now a user comes along and asks:

"What are the climatic conditions where a Saint Bernard thrives?"

Now what?

Query Embedding

First, the RAG system takes this question and converts it into an embedding. Just like it did for all the chunks. Why? Because now it can mathematically compare this question against all the chunks sitting inside the vector database.

Similarity Search

Now the system runs a similarity search. It goes into the vector database and finds the chunks whose embeddings are most similar to the query embedding.

Here is the intuition. A Saint Bernard's embedding is going to be much closer to a dog's embedding than it is to an airplane's embedding. That is the whole point. Similar concepts live close together in this vector space. So when you ask about Saint Bernards, the system finds chunks about dogs, about cold climates, about large breeds. Not chunks about aircraft engines.

A common way to measure this similarity is cosine similarity, which looks at the angle between two vectors: the more they point in the same direction, the closer the score gets to 1, and the more similar the meaning. The full math behind it is out of scope for this blog, but you get the idea, right?
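
For the curious, cosine similarity fits in a few lines: take the dot product of the two vectors and divide by the product of their lengths. The three vectors below are numbers I made up to mimic the dog, puppy, and airplane intuition. They do not come from a real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Hand-picked toy vectors (assumptions, for illustration only).
dog = [0.9, 0.8, 0.1]
puppy = [0.8, 0.9, 0.2]
airplane = [0.1, 0.2, 0.9]

print(cosine_similarity(dog, puppy))     # high: nearly the same direction
print(cosine_similarity(dog, airplane))  # much lower: different direction
```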

Here is what that whole retrieval flow looks like for our question:

Question:
What are the climatic conditions where a Saint Bernard thrives?

            |

Embedding generated for the question

            |

Vector search finds chunks about:
- Saint Bernards
- Cold climates
- Swiss Alps

            |

Those chunks become the context

            |

LLM generates the answer

Top-K Retrieval

You can also set how many similar chunks you want retrieved. The top 1 result, top 3, top 5. Whatever you want. This is called Top-K retrieval. More chunks means more context for the LLM, but it also means more tokens being sent. It is a tradeoff you tune based on your use case.
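
Top-K retrieval is really just "score every chunk, keep the best K". Here is a brute-force sketch. Real vector databases use smarter indexes to avoid scoring everything, and the two-dimensional vectors below are made up so the ranking is easy to follow. The function name top_k is my own.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def top_k(query_vector, records, k=3):
    """Return the k best (score, text) pairs, highest score first."""
    scored = (
        (cosine_similarity(query_vector, vector), text)
        for vector, text in records
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Made-up vectors: the first axis loosely means "about Saint Bernards".
records = [
    ([0.0, 1.0], "Jet engines compress air."),
    ([0.9, 0.1], "Saint Bernards come from the Swiss Alps."),
    ([1.0, 0.0], "Saint Bernards thrive in cold climates."),
]

for score, text in top_k([1.0, 0.0], records, k=2):
    print(round(score, 3), text)
```

With k=2, the two Saint Bernard chunks come back and the jet engine chunk is left out. Raise k and it gets pulled in too, which is exactly the tradeoff described above.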


Step 6: Prompt Construction

This is where it all comes together.

The chunks that were retrieved from the similarity search become the context. This context, plus the user's original question, gets injected into the system prompt template we set up earlier:

You are a senior dog expert. You provide information about 
various dog-related queries based on the context provided below.

Context:
- Saint Bernards are originally from the Swiss Alps.
- They thrive in cold climates and are well adapted to snow.
- They are not suited for hot or humid environments.
[...other retrieved chunks...]

Question: What are the climatic conditions where a Saint Bernard thrives?

This is the final prompt. The complete, filled-in thing. And this is what goes to the LLM.
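
In code, this step is little more than string formatting. The sketch below assumes the retrieved chunks arrive as a plain list of strings. The helper names format_context and build_final_prompt are hypothetical, and the template puts the context on its own lines so the bullets line up like the example above.

```python
PROMPT_TEMPLATE = (
    "You are a senior dog expert. You provide information about\n"
    "various dog-related queries based on the context provided below.\n"
    "\n"
    "Context:\n{context}\n"
    "\n"
    "Question: {question}"
)


def format_context(chunks):
    """Turn the retrieved chunks into one bullet per chunk, skipping empties."""
    return "\n".join(f"- {chunk.strip()}" for chunk in chunks if chunk.strip())


def build_final_prompt(chunks, question):
    """Fill the template with the retrieved context and the user's question."""
    return PROMPT_TEMPLATE.format(
        context=format_context(chunks),
        question=question.strip(),
    )


retrieved = [
    "Saint Bernards are originally from the Swiss Alps.",
    "They thrive in cold climates and are well adapted to snow.",
    "They are not suited for hot or humid environments.",
]
print(build_final_prompt(
    retrieved,
    "What are the climatic conditions where a Saint Bernard thrives?",
))
```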


Final Answer Generation

The LLM gets this prompt, with all the relevant context already baked in, and generates a response. Because it has actual, specific, reliable information to work from and not just its generic training, the answer it gives is way more accurate and grounded.

And that's it. That is a full RAG pipeline from start to finish.


Complete End-to-End Flow (Quick Recap)

1. Load documents (PDFs, Word files, text files)
            |
2. Split into chunks
            |
3. Generate embeddings for each chunk
            |
4. Store embeddings + chunks + metadata in vector database
            |
      (User asks a question)
            |
5. Convert query into an embedding
            |
6. Run similarity search, retrieve top-K chunks
            |
7. Build the final prompt: system prompt + context + question
            |
8. Send to LLM, get the answer
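
And here is the same recap as one tiny runnable sketch covering steps 1 through 7. Big caveats: toy_embed is a sparse word-count vector that I am using as a stand-in for a real embedding model, the "vector database" is a plain list, and step 8 is left out because the LLM call depends entirely on which provider you pick. Every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Steps 1-2: sentence chunking, the simplest strategy from earlier."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_embed(text):
    """Step 3 stand-in: word counts as a sparse vector, NOT a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents):
    """Step 4: keep each vector together with its chunk text and metadata."""
    return [
        {"vector": toy_embed(chunk), "text": chunk, "source": source}
        for source, text in documents.items()
        for chunk in split_sentences(text)
    ]


def retrieve(index, question, k=2):
    """Steps 5-6: embed the query, rank chunks by similarity, keep the top k."""
    query_vector = toy_embed(question)
    ranked = sorted(
        index,
        key=lambda record: cosine_similarity(query_vector, record["vector"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(chunks, question):
    """Step 7: system prompt + context + question."""
    context = "\n".join(f"- {chunk['text']}" for chunk in chunks)
    return (
        "You are a senior dog expert. You provide information about\n"
        "various dog-related queries based on the context provided below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


documents = {
    "saint_bernard.txt": (
        "Saint Bernards are originally from the Swiss Alps. "
        "They thrive in cold climates and are well adapted to snow."
    ),
    "aircraft.txt": "Jet engines compress air before combustion.",
}
question = "What are the climatic conditions where a Saint Bernard thrives?"

index = build_index(documents)
print(build_prompt(retrieve(index, question), question))
```

The two Saint Bernard chunks make it into the prompt and the jet engine chunk does not. Keep in mind the toy embedding only matches on shared words, which is exactly the weakness a real embedding model fixes.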

A Few Common Misconceptions

"RAG eliminates hallucinations."

Not entirely. RAG significantly reduces hallucinations because the LLM is working from real, retrieved content. But it can still hallucinate if the retrieved chunks happen to be irrelevant, or if the LLM just ignores the context. It is a major improvement, not a perfect fix.

"The entire document is the context."

Nope. Only the chunks retrieved from the similarity search become the context. Not your entire document library. That is the whole point of chunking and retrieval in the first place.

"You need to recalculate embeddings every time."

No, that is what persistence is for. Calculate once, store it, reuse it. That is the whole idea.


Final Thoughts

RAG is genuinely one of the most practical and powerful patterns in applied AI right now. It takes a generic LLM and turns it into a domain-specific expert by giving it access to real, curated knowledge at the exact moment it needs it.

You don't need to understand all the math behind embeddings or similarity search to build with RAG. You just need to understand the pipeline. And now you do.

Go build something.