Understanding Retrieval-Augmented Generation (RAG): Architecture, Variants, and Best Practices
Retrieval-Augmented Generation (RAG) is a hybrid approach that combines large language models (LLMs) with external knowledge retrieval systems. Instead of relying solely on the parametric knowledge embedded within LLM weights, RAG enables dynamic, non-parametric access to external sources—most commonly via vector databases—allowing LLMs to generate factually grounded and context-rich responses.
The simplest form of RAG can be seen when a user of generative AI includes specific domain knowledge—such as a URL or a PDF document—along with their prompt to get more accurate responses. In this case, the user manually attaches external references to help the AI generate answers based on specialized information.
A RAG system automates this process. It stores various domain-specific documents in a database and, whenever a user asks a question, it retrieves relevant information and appends it to the prompt sent to the generative AI. This enables the AI to provide more accurate, context-aware, and fact-based answers tailored to the domain.
Now, let's take a more structured look at what RAG is, when it can be used, and how to implement a simple RAG system using Gemini, FAISS, and LangChain.
Core Architecture of RAG
A standard RAG pipeline consists of two primary components:
- Retriever: This component fetches relevant external documents based on the user's query. Instead of traditional lexical search (e.g., BM25), modern retrievers use dense vector representations and perform approximate nearest neighbor (ANN) search against a vector database such as FAISS, Pinecone, Weaviate, or Chroma.
- Generator: Typically a large pre-trained transformer model (e.g., GPT, T5, FLAN, or Gemini), the generator conditions on both the user query and the retrieved documents to produce a final response.
The most basic RAG model architecture can be described as follows:
User Query → Dense Retriever → Top-k Documents → LLM Generator → Final Answer
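Before breaking these components down, here is a minimal, self-contained sketch of that flow in plain Python. The corpus, the bag-of-words "embedding", and the generate() stub are all toy stand-ins introduced only for illustration; a real system would use a dense encoder, a vector database, and an LLM call instead.
# Toy, illustrative RAG flow: query -> retrieve top-k -> assemble prompt -> answer.
# embed(), CORPUS, and generate() are hypothetical stand-ins, not a real model or DB.
import math
from collections import Counter

CORPUS = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Pinecone is a fully managed vector database service.",
    "BM25 is a classical lexical ranking function used in search engines.",
]

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words counter (a real system would use a dense encoder).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(CORPUS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def generate(query: str, context: list[str]) -> str:
    # Stand-in for an LLM call: here we just return the augmented prompt.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

print(generate("What is FAISS used for?", retrieve("What is FAISS used for?")))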
Detailed Component Breakdown
1. Vector Store / Vector Database
Vector databases serve as the foundation for storing and indexing document embeddings. Popular choices include:
- FAISS: Facebook AI Similarity Search, efficient for in-memory ANN search.
- Pinecone: Fully managed vector DB with scalable cloud infrastructure.
- Weaviate: Semantic vector search with schema-aware filtering and hybrid search capabilities.
- Chroma: Lightweight vector DB often used with LangChain and open-source setups.
These stores hold embeddings generated from sentences, paragraphs, or whole documents using transformer-based encoders such as Sentence-BERT or OpenAI's embedding APIs. Retrieval is typically performed with cosine similarity or inner-product search.
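As a rough sketch of how such a store works under the hood, the snippet below builds an exact inner-product index directly with FAISS; the random vectors stand in for real document and query embeddings, and the faiss-cpu and numpy packages are assumed to be installed.
# Sketch: cosine-similarity retrieval with FAISS (assumes `faiss-cpu` and `numpy`).
# Random vectors stand in for real document/query embeddings.
import faiss
import numpy as np

dim = 384                                    # embedding dimensionality (model-dependent)
doc_vecs = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(doc_vecs)                 # L2-normalize so inner product == cosine similarity

index = faiss.IndexFlatIP(dim)               # exact inner-product index
index.add(doc_vecs)

query_vec = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 5)     # top-5 most similar documents
print(ids[0], scores[0])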
2. Embedding Model
Embedding models convert textual documents and queries into dense vector representations. Common models include the following (a short encoding sketch follows the list):
- Sentence-BERT (SBERT)
- OpenAI text-embedding-ada-002
- Instructor-XL and other cross-domain models
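For illustration, the snippet below encodes a few documents and a query with Sentence-BERT via the sentence-transformers package; the all-MiniLM-L6-v2 checkpoint is just one example model and is not used elsewhere in this article.
# Sketch: encoding documents and a query with Sentence-BERT
# (assumes the `sentence-transformers` package; 'all-MiniLM-L6-v2' is one example model).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "RAG combines retrieval with text generation.",
    "FAISS performs approximate nearest neighbor search.",
]
doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode("What is retrieval-augmented generation?", normalize_embeddings=True)

# Cosine similarity between the query and each document
print(util.cos_sim(query_embedding, doc_embeddings))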
3. Retrieval Mechanism
ANN search algorithms used in vector databases include the following (a FAISS example follows the list):
- HNSW (Hierarchical Navigable Small World)
- IVF (Inverted File Index)
- ScaNN, Annoy, or proprietary algorithms in Pinecone and Milvus
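The sketch below shows how HNSW and IVF indexes can be constructed in FAISS; the parameter values (M, efSearch, nlist, nprobe) are illustrative starting points, not tuned recommendations.
# Sketch: building HNSW and IVF indexes in FAISS (assumes `faiss-cpu` and `numpy`).
import faiss
import numpy as np

dim = 128
vectors = np.random.rand(10000, dim).astype("float32")

# HNSW: graph-based ANN; M controls graph connectivity, efSearch trades recall for speed.
hnsw = faiss.IndexHNSWFlat(dim, 32)          # M = 32
hnsw.hnsw.efSearch = 64
hnsw.add(vectors)

# IVF: clusters vectors into nlist inverted lists; nprobe lists are scanned per query.
nlist = 100
quantizer = faiss.IndexFlatL2(dim)
ivf = faiss.IndexIVFFlat(quantizer, dim, nlist)
ivf.train(vectors)                           # IVF needs a training pass to learn centroids
ivf.add(vectors)
ivf.nprobe = 10

query = np.random.rand(1, dim).astype("float32")
print(hnsw.search(query, 5)[1], ivf.search(query, 5)[1])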
4. RAG Generator Model
LLMs used in RAG include:
- BART: Used in the original RAG paper by Facebook AI.
- GPT models: OpenAI's GPT-3.5, GPT-4, etc.
- T5: Encoder-decoder model, often used with FiD-style (Fusion-in-Decoder) RAG.
Types of RAG Implementations
- Vanilla RAG: The retriever provides k documents that are concatenated and passed to the LLM for generation.
- RAG-Token: The generator marginalizes over the retrieved documents at each decoding step, so different tokens of the answer can draw on different documents (as in the original RAG paper).
- FiD (Fusion-in-Decoder): Each retrieved document is encoded separately together with the query, and the evidence is fused only in the decoder.
- Multi-hop RAG: Iterative retrieval and reasoning across multiple documents, with follow-up queries issued between hops (see the sketch after this list).
- LangChain-based RAG: Modular framework that allows chain-based control over retriever-generator logic with memory, filters, and control flow.
- Haystack and LlamaIndex: End-to-end RAG orchestration frameworks with support for various retrievers, generators, and pipelines.
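To make the contrast between vanilla and multi-hop RAG concrete, the sketch below frames multi-hop retrieval as a retrieve-then-refine loop; retrieve() and llm() are hypothetical stubs standing in for a real retriever and LLM client, not part of any of the frameworks listed above.
# Sketch: multi-hop RAG as an iterative retrieve-then-refine loop.
# retrieve() and llm() are hypothetical stubs, not real library calls.
def retrieve(query: str) -> list[str]:
    return [f"(passage retrieved for: {query})"]              # stub: would return top-k passages

def llm(prompt: str) -> str:
    return "Final answer based on the accumulated context."   # stub: would return a completion

def multi_hop_answer(question: str, max_hops: int = 3) -> str:
    context: list[str] = []
    query = question
    for _ in range(max_hops):
        context.extend(retrieve(query))                       # hop: gather more evidence
        prompt = (
            "Context:\n" + "\n".join(context)
            + f"\n\nQuestion: {question}\n"
            + "If the context is sufficient, answer; otherwise reply 'SEARCH: <follow-up query>'."
        )
        reply = llm(prompt)
        if not reply.startswith("SEARCH:"):
            return reply                                      # model is confident: stop hopping
        query = reply[len("SEARCH:"):].strip()                # otherwise retrieve with the follow-up query
    return llm("Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}\nAnswer:")

print(multi_hop_answer("Which regulation applies to avatar purchases?"))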
Advantages of RAG
- Factual Accuracy: By accessing up-to-date external sources, RAG reduces hallucinations.
- Cost-Efficient: A smaller LLM paired with an accurate retriever can match or outperform a much larger standalone LLM on factual QA tasks.
- Domain Adaptability: RAG can quickly adapt to new knowledge bases without re-training the LLM.
- Interpretability: Retrieved documents can be surfaced for inspection, aiding transparency.
Disadvantages of RAG
- Retrieval Quality Sensitivity: Inaccurate or irrelevant retrieval leads to degraded output.
- Latency: The retrieve-then-generate round trip adds delay, which can be problematic for latency-sensitive, real-time applications.
- Context Limit: Concatenating many documents may exceed the model’s input token limit.
- Engineering Complexity: Managing a retriever, generator, vector DB, and pipelines increases system complexity.
Best Practices for Deploying RAG
- Use domain-specific embedding models for encoding documents and queries.
- Continuously evaluate retriever recall (e.g., with R@k metrics).
- Apply document chunking strategies (e.g., overlapping windows) so that embeddings cover the source text well (see the chunking sketch after this list).
- Consider prompt engineering to guide LLMs in referencing retrieved evidence explicitly.
- Tune vector DB indexing parameters (e.g., efConstruction, M, nprobe) to balance ANN recall against latency.
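One of these practices, overlapping-window chunking, is easy to show in code. The sketch below uses LangChain's RecursiveCharacterTextSplitter (the same splitter used in the implementation further down); the chunk_size and chunk_overlap values are illustrative, not tuned recommendations.
# Sketch: overlapping-window chunking with LangChain's RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " "],
    chunk_size=1000,      # max characters per chunk
    chunk_overlap=100,    # overlap so sentences spanning a boundary appear in both chunks
)

# Stand-in document text; replace with your own source material.
long_text = "Retrieval-Augmented Generation combines a retriever with a generator. " * 200
chunks = splitter.split_text(long_text)
print(f"{len(chunks)} chunks produced")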
Example of a Simple RAG System Implementation
The sample RAG (Retrieval-Augmented Generation) system we'll implement consists of three main components. The first component sets up the embedding pipeline by initializing the embedding model. This model converts the source documents (and, at query time, the user's question) into dense vector representations; the document vectors are stored in a FAISS vector database. In this example, we instantiate the Gemini 2.5 Pro language model via the ChatGoogleGenerativeAI class. For embeddings, we use Google's text-embedding-004 model, initialized via the GoogleGenerativeAIEmbeddings class.
import os
from langchain_community.document_loaders import TextLoader, UnstructuredExcelLoader, PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.messages import SystemMessage
from langchain_core.tools import tool
from langgraph.graph import END, MessagesState, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.checkpoint.memory import MemorySaver
from dotenv import load_dotenv
load_dotenv(".env")
llm = ChatGoogleGenerativeAI(model='gemini-2.5-pro-exp-03-25', temperature=0.9)
# Set up the embedding model as GoogleGenerativeAIEmbeddings.
# The 'google_api_key' parameter is read from the 'GOOGLE_API_KEY' environment variable.
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
db_path = "./faiss-doc-db"
The second component involves storing a domain-specific PDF document in the FAISS vector database. (In this example, we used Meta Korea’s 2025 Terms of Service document.) We chose this document because it contains up-to-date information that is not yet widely available on the internet, making it unlikely that the GenAI model (Gemini, in this case) has seen it during training. In the actual implementation, we pass a function that searches the vector DB as a parameter to the llm.bind_tools() method. This function plays an important role: if the GenAI model can confidently answer the user’s question based solely on its pre-trained knowledge, it generates a response directly without querying the database (i.e., RAG is not used). Otherwise, it retrieves relevant content from the vector DB and uses that as context to generate the answer. This approach is intended to demonstrate how the RAG system works by using questions related to information the GenAI model likely hasn’t seen during training. For a more detailed explanation of how to create and store a vector database using LangChain, FAISS, and Gemini embeddings, refer to the guide: “How to Save and Retrieve a Vector Database using LangChain, FAISS, and Gemini Embeddings.”
# Load the existing vector DB if one is present; otherwise create a new one, then add the documents from the given file.
def create_vector_database(db_path, txt_path, type="text"):
    if os.path.exists(db_path):
        db = FAISS.load_local(db_path, embeddings=embeddings, allow_dangerous_deserialization=True)
    else:
        # Seed a fresh index with a placeholder document ("RAG자료" = "RAG material")
        documents = [Document(page_content='RAG자료')]
        db = FAISS.from_documents(documents, embeddings)
    # Chunking settings for splitting the source document before embedding
    separators = ['\n\n', '\n', ' ', '\t']
    chunk_size = 1000
    chunk_overlap = 100
    # Pick a loader based on the file type
    if type == "excel":
        loader = UnstructuredExcelLoader(txt_path)
    elif type == "pdf":
        loader = PyPDFLoader(txt_path)
    else:
        loader = TextLoader(txt_path)
    docs = loader.load()
    documents = RecursiveCharacterTextSplitter(
        separators=separators,
        chunk_size=chunk_size,
        is_separator_regex=False,
        chunk_overlap=chunk_overlap
    ).split_documents(docs)
    # Add chunks to the index in batches so each embedding request stays under a fixed size
    MAX_BUFFER_SIZE = 100000
    IDX_DELTA = MAX_BUFFER_SIZE // chunk_size
    doc_size = len(documents)
    remainder = doc_size % IDX_DELTA
    last_idx = doc_size - remainder
    print(f"Total documents: {doc_size}")
    print(f"Last index: {last_idx}")
    print(f"Remainder: {remainder}")
    for idx in range(0, last_idx, IDX_DELTA):
        db.add_documents(documents=documents[idx:idx + IDX_DELTA])
    if last_idx < doc_size:
        db.add_documents(documents=documents[last_idx:])
    db.save_local(db_path)
# Save custom PDF document as vector database
create_vector_database(db_path, "./Meta 서비스 약관.pdf", type="pdf")
The third part of the code loads the stored vector database and uses a LangGraph StateGraph to construct a basic FAQ-style processing pipeline. The pipeline is designed so that if the LLM cannot answer sufficiently from its pre-trained knowledge, it searches the predefined vector database, retrieves relevant information, and generates a response based on that data. The code also includes an example query about Meta Korea's 2025 Terms of Service to demonstrate how the system responds using the retrieved context.
vectorstore_faiss = FAISS.load_local("./faiss-doc-db", embeddings, allow_dangerous_deserialization=True)
faiss_retriever = vectorstore_faiss.as_retriever(search_type="similarity", search_kwargs={"k": 2})
# Tool the LLM can call to search the vector DB.
# response_format="content_and_artifact" returns a text summary for the model plus the raw Documents as an artifact.
@tool(response_format="content_and_artifact")
def retrieve(query: str):
    """Retrieve information related to a query."""
    retrieved_docs = faiss_retriever.get_relevant_documents(query)
    print('-----Retrieved document: ', retrieved_docs)
    serialized = "\n\n".join(
        f"Source: {doc.metadata}\nContent: {doc.page_content}"
        for doc in retrieved_docs
    )
    return serialized, retrieved_docs
tools = ToolNode([retrieve])
# The LLM decides whether to answer directly or call the retrieve tool.
def query_or_respond(state: MessagesState):
    llm_with_tools = llm.bind_tools([retrieve])
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}
# Generate the final answer using the content returned by the retrieve tool as context.
def generate(state: MessagesState):
    # Collect the most recent consecutive tool messages (the retrieved chunks)
    recent_tool_messages = []
    for message in reversed(state["messages"]):
        if message.type == "tool":
            recent_tool_messages.append(message)
        else:
            break
    tool_messages = recent_tool_messages[::-1]
    docs_content = "\n\n".join(doc.content for doc in tool_messages)
    system_message_content = (
        "You are a Meta employee. Answer the given question by referencing the attached document."
        "\n\n"
        f"{docs_content}"
    )
    # Keep only the human/system turns and AI turns that are not tool calls
    conversation_messages = [
        message for message in state["messages"]
        if message.type in ('human', 'system')
        or (message.type == 'ai' and not message.tool_calls)
    ]
    prompt = [SystemMessage(system_message_content)] + conversation_messages
    response = llm.invoke(prompt)
    return {"messages": [response]}
# Build graph
graph_builder = StateGraph(MessagesState)
graph_builder.add_node("query_or_respond", query_or_respond)
graph_builder.add_node("tools", tools)
graph_builder.add_node("generate", generate)
graph_builder.set_entry_point("query_or_respond")
graph_builder.add_conditional_edges("query_or_respond", tools_condition, {END: END, "tools": "tools"})
graph_builder.add_edge("tools", "generate")
graph_builder.add_edge("generate", END)
memory = MemorySaver()
graph = graph_builder.compile(checkpointer=memory)
# Specify an ID for the thread
config = {"configurable": {"thread_id": "abc123"}}
input_message = "Tell me about Meta Korea's 2025 Terms of Service in English."
for step in graph.stream(
    {"messages": [{"role": "user", "content": input_message}]},
    stream_mode="values",
    config=config,
):
    step["messages"][-1].pretty_print()
GenAI Response:
================================ Human Message =================================
Tell me about Meta Korea's 2025 Terms of Service in English.
================================== Ai Message ==================================
Tool Calls:
retrieve (f5ba2c2c-815e-41ba-80c9-b0ba77d8dcd5)
Call ID: f5ba2c2c-815e-41ba-80c9-b0ba77d8dcd5
Args:
query: Meta Korea 2025 Terms of Service
-----Retrieved document: [Document(id='f4d13a5c-0883-4d3d-add6-9176488e351d', metadata={'producer': 'Skia/PDF m135', 'creator': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36', 'creationdate': '2025-05-07T23:56:14+00:00', 'title': 'Meta 서비스 약관', 'moddate': '2025-05-07T23:56:14+00:00', 'source': './Meta 서비스 약관.pdf', 'total_pages': 11, 'page': 10, 'page_label': '11'}, page_content='다.\nMeta 브랜드 자료: 이 가이드라인은 Meta의 상표, 로고 및 스크린샷의 이용에 적용되는 정책\n을 설명합니다.\n권장 가이드라인: Facebook 권장 가이드라인 및 Instagram 권장 가이드라인에서는 콘텐츠를\n권장하거나 권장하지 않는 것에 관한 규정을 설명합니다.\nLive 정책: 이 정책은 Facebook Live를 통해 방송되는 모든 콘텐츠에 적용됩니다.\n아바타 약관: 본 약관은 아바타 스토어에서 아바타 의상을 구매하고 취득하는 것을 포함하여\n아바타 및 아바타 기능 이용에 적용됩니다.\nMeta AI 약관: 본 약관은 생성형 AI 제품 및 기능의 이용에 적용됩니다.\n25. 5. 8. 오전 8:56 Meta 서비스 약관\nhttps://mbasic.facebook.com/legal/terms/plain_text_terms/ 11/11'), Document(id='470a3590-750e-4da7-8962-9d942a5e8753', metadata={'producer': 'Skia/PDF m135', 'creator': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36', 'creationdate': '2025-05-07T23:56:14+00:00', 'title': 'Meta 서비스 약관', 'moddate': '2025-05-07T23:56:14+00:00', 'source': './Meta 서비스 약관.pdf', 'total_pages': 11, 'page': 7, 'page_label': '8'}, page_content='희 제품 및 서비스에 대한 안전한 경험을 도모하며, 관련 법률을 준수하기 위해 본 약관을 수시로\n개정해야 할 수 있습니다. 저희는 해당 조항이 더 이상 적절하지 않거나 불완전한 경우, 변경 사항\n이 합리적이고 회원님의 이익을 적절히 고려하거나 안전 및 보안 목적으로 또는 관련 법률을 준수\n하기 위해 변경이 필요한 경우에만 본 약관을 개정합니다.\n개정이 법률에 의해 요구된 경우가 아닌 한 Meta는 본 약관을 변경하기 최소 30일 전에 회원님에\n게 통지하여(예: 이메일 또는 저희의 제품을 통해), 약관의 효력이 발생하기 전에 검토할 기회를\n제공합니다. 개정된 약관의 효력이 발생한 후 회원님이 계속 저희 제품을 액세스하거나 이용하실\n경우 회원님은 개정된 약관의 적용을 받게 됩니다.\n저희는 회원님이 저희 제품을 계속 이용하기를 희망하지만, 회원님은 개정된 약관에 동의하지 않\n거나 본 계약에 대한 동의를 철회하려는 경우 언제든지 계정을 삭제할 수 있으며 Facebook 및 기\n타 Meta 제품의 액세스 또는 사용을 중지해야 합니다.\n4.2 계정 일시 차단 또는 해지\n저희는 Facebook이 사람들이 자신을 표현하고 생각과 아이디어를 공유하는 것이 환영받고 안전\n하다고 느끼는 공간이 되기를 바랍니다.\n회원님이 특히 커뮤니티 규정을 포함하여 저희 약관 또는 정책을 명백하거나 심각하게 또는 반복\n해서 위반하였다고 판단될 경우, 저희는 Meta Company 제품에 대한 회원님의 액세스를 일시 차\n단하거나 영구적으로 비활성화할 수 있으며, 회원님의 계정을 영구적으로 비활성화 또는 삭제할\n수도 있습니다. 또한 저희는 회원님이 다른 사람의 지식재산권을 반복해서 침해하였거나 법률상\n25. 5. 8. 오전 8:56 Meta 서비스 약관\nhttps://mbasic.facebook.com/legal/terms/plain_text_terms/ 8/11')]
================================= Tool Message =================================
Name: retrieve
Source: {'producer': 'Skia/PDF m135', 'creator': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36', 'creationdate': '2025-05-07T23:56:14+00:00', 'title': 'Meta 서비스 약관', 'moddate': '2025-05-07T23:56:14+00:00', 'source': './Meta 서비스 약관.pdf', 'total_pages': 11, 'page': 10, 'page_label': '11'}
Content: 다.
Meta 브랜드 자료: 이 가이드라인은 Meta의 상표, 로고 및 스크린샷의 이용에 적용되는 정책
을 설명합니다.
권장 가이드라인: Facebook 권장 가이드라인 및 Instagram 권장 가이드라인에서는 콘텐츠를
권장하거나 권장하지 않는 것에 관한 규정을 설명합니다.
Live 정책: 이 정책은 Facebook Live를 통해 방송되는 모든 콘텐츠에 적용됩니다.
아바타 약관: 본 약관은 아바타 스토어에서 아바타 의상을 구매하고 취득하는 것을 포함하여
아바타 및 아바타 기능 이용에 적용됩니다.
Meta AI 약관: 본 약관은 생성형 AI 제품 및 기능의 이용에 적용됩니다.
25. 5. 8. 오전 8:56 Meta 서비스 약관
https://mbasic.facebook.com/legal/terms/plain_text_terms/ 11/11
...
* By using Meta Products, you agree that Meta can show you ads it believes may be relevant to your interests.
* Meta uses your personal information to help determine which personalized ads to show you.
As the trace shows, the model judged that it could not answer sufficiently from its pre-trained knowledge, so it called the retrieve tool, pulled the most relevant chunks from the vector database, and used them as context to generate the final response for the user.
Conclusion
Retrieval-Augmented Generation (RAG) represents a significant paradigm shift in large-scale language modeling by combining the power of retrieval systems with the generative capabilities of LLMs. For enterprise AI, scientific research, and complex domain-specific applications, RAG provides a scalable, transparent, and cost-effective way to integrate structured and unstructured knowledge into intelligent systems.
References
- Lewis, Patrick, et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020. [arXiv:2005.11401]
- Izacard, Gautier, and Edouard Grave. "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering." arXiv preprint (2020). [arXiv:2007.01282]
- Johnson, Jeff, et al. "Billion-scale similarity search with GPUs." IEEE Transactions on Big Data (2021). [arXiv:1702.08734]
- LangChain Documentation. [docs.langchain.com]
- Pinecone Documentation. [docs.pinecone.io]
- Weaviate Docs. [weaviate.io]
- Chroma GitHub. [github.com/chroma-core/chroma]