How to Save and Retrieve a Vector Database using LangChain, FAISS, and Gemini Embeddings
Efficient storage and retrieval of vector databases is foundational for building retrieval-augmented generation (RAG) systems on top of large language models (LLMs). In this guide, we'll walk through a Python implementation that uses LangChain with FAISS and Google Gemini embeddings to store document embeddings and retrieve semantically similar information. The setup is well suited to machine learning (ML) and deep learning (DL) engineers working on semantic search and retrieval pipelines.
Why Vector Databases Matter in LLM Applications
Traditional keyword-based search systems fall short when it comes to understanding semantic meaning. Vector databases store high-dimensional embeddings of text, enabling approximate nearest-neighbor (ANN) searches based on semantic similarity (a toy similarity sketch follows the list below). These capabilities are critical in applications such as:
- Question Answering Systems
- Enterprise Knowledge Retrieval
- Legal or Medical Document Search
- LLM-powered Chat Assistants with Context Memory
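To make "semantic similarity" concrete, here is a tiny, self-contained sketch that ranks two hand-made vectors against a query vector by cosine similarity; FAISS performs this same kind of comparison, approximately and at scale, over real Gemini embeddings. The vectors and names below are purely illustrative, not outputs of any embedding model.
import numpy as np
# Toy 3-dimensional vectors; real Gemini embeddings have hundreds of dimensions
query_vec = np.array([0.9, 0.1, 0.3])
doc_vecs = {
    "doc_a": np.array([0.8, 0.2, 0.25]),   # semantically close to the query
    "doc_b": np.array([-0.5, 0.9, 0.1]),   # semantically distant
}
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Rank documents by similarity to the query, highest first
ranked = sorted(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b']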
Benefits of Using LangChain in this Workflow
LangChain offers abstraction layers and integrations that simplify the orchestration of complex pipelines involving document loading, chunking, embedding, storing, and retrieval. Specifically, in this setup:
- It abstracts various document loaders (PDF, Excel, text).
- Offers robust text-splitting strategies via RecursiveCharacterTextSplitter (a standalone splitting example follows this list).
- Seamlessly connects to FAISS for fast similarity search.
- Integrates with Google's Gemini embedding models for powerful semantic understanding.
- Supports retrieval interfaces that can be used directly with LLM chains for contextual question answering.
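As a quick, standalone illustration of the splitter mentioned above, the sketch below uses a deliberately small chunk_size so the chunking effect is visible; the sample sentence is arbitrary.
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    separators=['\n\n', '\n', ' ', '\t'],
    chunk_size=100,        # tiny on purpose, just for the demo
    chunk_overlap=20,
    is_separator_regex=False,
)
sample_text = "LangChain splits long documents into overlapping chunks for retrieval. " * 10
chunks = splitter.split_text(sample_text)
print(f"{len(chunks)} chunks, first chunk: {chunks[0][:60]}...")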
Implementation Strategy
The provided Python implementation performs the following steps (a quick sanity check for the first step is sketched after the list):
- Environment Setup: Loads the Gemini embedding model using credentials from environment variables.
- Database Initialization: Loads an existing FAISS vector database if it exists or creates a new one.
- Document Loading: Supports different formats (PDF, Excel, text) via LangChain's loaders.
- Text Splitting: Splits large documents into manageable chunks with overlap for better context preservation.
- Batch Embedding: Embeds and adds chunks to the FAISS database in batches to manage memory efficiently.
- Persistent Storage: Saves the updated FAISS vector store locally to disk.
- Semantic Retrieval: Reloads the database and performs a similarity search against the embedded vectors.
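Before running the full pipeline, it can help to verify step 1 in isolation. The following sketch assumes a .env file containing a GOOGLE_API_KEY entry and embeds a throwaway probe string to confirm that the credentials and model name work; the probe text is arbitrary.
import os
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings

load_dotenv(".env")
assert os.getenv("GOOGLE_API_KEY"), "GOOGLE_API_KEY is missing from the environment / .env file"

embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
probe = embeddings.embed_query("sanity check")   # single network call to Gemini
print(f"Embedding dimension: {len(probe)}")      # text-embedding-004 typically returns 768 values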
Scalability Considerations
This implementation adds documents to the index in batches (controlled by IDX_DELTA) and sizes an ingestion buffer to handle large corpora efficiently; the short calculation below shows how the batch size is derived. By dividing document embeddings into manageable batches, it avoids memory overflows and speeds up ingestion for large datasets, making it practical for enterprise-scale settings.
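To see the batching arithmetic concretely, here is a small worked example; the constants match the script below, while the 250-chunk corpus size is an invented figure for illustration.
MAX_BUFFER_SIZE = 100000
chunk_size = 1000
IDX_DELTA = MAX_BUFFER_SIZE // chunk_size     # 100 chunks per add_documents() call

doc_size = 250                                # hypothetical number of chunks in a corpus
remainder = doc_size % IDX_DELTA              # 50
last_idx = doc_size - remainder               # 200
batches = [(i, i + IDX_DELTA) for i in range(0, last_idx, IDX_DELTA)]
if last_idx < doc_size:
    batches.append((last_idx, doc_size))      # final partial batch
print(batches)                                # [(0, 100), (100, 200), (200, 250)]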
Use Case: Retrieval-Augmented Generation (RAG)
With the saved vector database, you can now enhance LLM responses by injecting semantically retrieved chunks into prompts. This is a core pattern in modern RAG systems, allowing you to ground model outputs with trusted context.
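A minimal sketch of that pattern is shown below. It reuses the retrieve() function defined in the full script that follows; the prompt wording and the "gemini-1.5-flash" chat model are assumptions for illustration, not part of the original implementation.
from langchain_google_genai import ChatGoogleGenerativeAI

def answer_with_context(question: str) -> str:
    # Pull semantically similar chunks out of the FAISS store (retrieve() is defined below)
    context, _ = retrieve(question)
    prompt = (
        "Answer the question using only the context provided.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
    return llm.invoke(prompt).content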
Full Python Code
You can download the 'reciprocam.pdf' file used below from the arXiv page of the paper 'Recipro-CAM: Fast gradient-free visual explanations for convolutional neural networks'.
import os
from langchain_community.document_loaders import TextLoader, UnstructuredExcelLoader, PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from dotenv import load_dotenv
load_dotenv(".env")
# Setup embedding model as GoogleGenerativeAIEmbeddings
# 'google_api_key' parameter will be assigned by 'GOOGLE_API_KEY' environment variable
embeddings = GoogleGenerativeAIEmbeddings(model = "models/text-embedding-004")
db_path = "./faiss-doc-db"
# Load the existing FAISS DB if one is already on disk; otherwise create a new one, then add the new document's chunks
def create_vector_database(db_path, txt_path, type="text"):
    if os.path.exists(db_path):
        db = FAISS.load_local(db_path, embeddings=embeddings, allow_dangerous_deserialization=True)
    else:
        # Seed a brand-new index with a single placeholder document
        documents = [Document(page_content='RAG Document')]
        db = FAISS.from_documents(documents, embeddings)
    separators = ['\n\n', '\n', ' ', '\t']
    chunk_size = 1000
    chunk_overlap = 100
    # Pick the loader that matches the source format
    if type == "excel":
        loader = UnstructuredExcelLoader(txt_path)
    elif type == "pdf":
        loader = PyPDFLoader(txt_path)
    else:
        loader = TextLoader(txt_path)
    docs = loader.load()
    # Split the loaded documents into overlapping chunks
    documents = RecursiveCharacterTextSplitter(
        separators=separators,
        chunk_size=chunk_size,
        is_separator_regex=False,
        chunk_overlap=chunk_overlap
    ).split_documents(docs)
    # Add chunks to the index in batches of IDX_DELTA to keep memory usage bounded
    MAX_BUFFER_SIZE = 100000
    IDX_DELTA = MAX_BUFFER_SIZE // chunk_size
    doc_size = len(documents)
    remainder = doc_size % IDX_DELTA
    last_idx = doc_size - remainder
    print(f"Total documents: {doc_size}")
    print(f"Last index: {last_idx}")
    print(f"Remainder: {remainder}")
    for idx in range(0, last_idx, IDX_DELTA):
        db.add_documents(documents=documents[idx:idx + IDX_DELTA])
    if last_idx < doc_size:
        # Add the final partial batch
        db.add_documents(documents=documents[last_idx:])
    # Persist the updated index to disk
    db.save_local(db_path)
# Save custom PDF document as vector database
#create_vector_database(db_path, "./reciprocam.pdf", type="pdf")
# Retrieve the documents most related to a given query from the vector DB
def retrieve(query: str):
    """Retrieve information related to a query."""
    vectorstore_faiss = FAISS.load_local(db_path, embeddings, allow_dangerous_deserialization=True)
    faiss_retriever = vectorstore_faiss.as_retriever(search_type="similarity", search_kwargs={"k": 2})
    print(f"Query: {query}")
    retrieved_docs = faiss_retriever.invoke(query)
    serialized = "\n\n".join(
        f"Source: {doc.metadata}\nContent: {doc.page_content}"
        for doc in retrieved_docs
    )
    return serialized, retrieved_docs
serial_doc, ret_doc = retrieve("Let me know what is a CAM.")
print(f"Result: {serial_doc}.")
Result:
Query: Let me know what is a CAM.
Result: Source: {'source': './reciprocam.pdf', 'page': 1}
Content: The first solution suggested to address this issue is CAM Zhou et al. [2016]. This method produces a map that highlights
the important regions of an image for a particular class by multiplying a global average pooling activation vector with a
fully connected weight vector specific to the class. Essentially, the saliency map S_c for a given class c is obtained by
S_c = Σ_k w_{k,c} Σ_{u,v} f_k(u,v)    (1)
where w_{k,c} is the last FC layer's weight between channel k and class c and f_k(u,v) is the activation at (u,v) of
channel k. CAM allows AI practitioners not only to analyze the capacity of their neural network architecture but also
to understand how the network reacts to specific classes of input data. However, this method has a limitation in that
it requires the presence of a global average or max pooling layer in the architecture. This means that certain neural
network architectures may not be compatible with CAM method.
Source: {'source': './reciprocam.pdf', 'page': 5}
Content: arXiv A PREPRINT
Table 1: Comparison of different CAM-based approaches using existing metrics on six different backbones. The
evaluation scores for other CAM methods were obtained from Poppi et al. [2021].
[Flattened rows of Table 1 follow in the raw chunk: Drop and related metric scores for Score-CAM, Recipro-CAM, and Grad-CAM on the VGG-16, ResNet-18, ResNet-50, and ResNet-101 backbones.]