This comprehensive guide details how to create a Python notebook for preprocessing documents in Retrieval Augmented Generation (RAG) systems with improved context handling. The notebook focuses on enhancing chunk quality by adding summaries of surrounding content and implementing strategic overlap between chunks.
Retrieval Augmented Generation has emerged as a powerful technique for combining the knowledge retrieval capabilities of vector databases with the generative abilities of large language models. However, standard chunking approaches often fail to maintain semantic continuity between document segments, leading to context loss and diminished response quality[6]. Creating intelligent chunks with appropriate context is essential for effective RAG systems.
When documents are chunked without considering their broader context, the retrieval process can surface relevant text fragments that lack the necessary surrounding information for complete understanding[2]. This limitation becomes particularly problematic for complex documents where concepts build upon previous sections or refer to upcoming content.
Our RAG preprocessor addresses these limitations through two key strategies:
- Implementing strategic chunk overlap to maintain continuity between adjacent segments
- Augmenting each chunk with concise summaries of preceding and following pages
This approach ensures that when a chunk is retrieved, it carries with it a snapshot of contextual information from surrounding document sections, significantly improving the coherence and accuracy of generated responses[8].
We begin by setting up our Python environment with the necessary libraries for document processing, language model integration, and vector storage:
```python
# Install required libraries
!pip install pypdf litellm sentence-transformers faiss-cpu pandas numpy tqdm
```

```python
# Import dependencies
import os
import json
import time
import numpy as np
import pandas as pd
from typing import Dict, List, Tuple
from tqdm.auto import tqdm
from pathlib import Path

# Document processing
import pypdf

# LLM and embeddings
import litellm
from sentence_transformers import SentenceTransformer

# Vector store
import faiss
```
Next, we'll configure our Gemini API key for generating summaries:
```python
# Set your Gemini API key
os.environ["GEMINI_API_KEY"] = "your-api-key-here"
```
The first step in our preprocessing pipeline is to load documents and extract text by page:
```python
def load_pdf_by_pages(pdf_path: str) -> Dict[int, str]:
    """
    Load a PDF and extract text by page.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary mapping page numbers (0-indexed) to text content
    """
    pages = {}
    pdf_reader = pypdf.PdfReader(pdf_path)
    for i, page in enumerate(pdf_reader.pages):
        # extract_text() can return an empty result for image-only pages
        pages[i] = page.extract_text() or ""
    return pages
```
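As a quick sanity check, you can run the loader on a sample file; `example.pdf` below is just a placeholder, so substitute a local document of your own:

```python
# Minimal usage sketch -- "example.pdf" is a placeholder path
pages = load_pdf_by_pages("example.pdf")
print(f"Loaded {len(pages)} pages")
print(pages[0][:200])  # first 200 characters of the first page
```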
A crucial component of our system is the ability to generate concise summaries of document pages without redundant processing. The SummaryGenerator class utilizes Gemini through LiteLLM[7] to create 4-line summaries of each page and implements caching to avoid regenerating summaries for previously processed pages:
```python
class SummaryGenerator:
    def __init__(self, cache_file: str = "summary_cache.json"):
        self.cache_file = cache_file
        self.cache = self._load_cache()

    def _load_cache(self) -> Dict:
        """Load existing summaries from the cache file."""
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        return {}

    def _save_cache(self):
        """Save summaries to the cache file."""
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f, indent=2)

    def generate_summary(self, text: str, doc_id: str, page_num: int) -> str:
        """
        Generate a 4-line summary of the given text using Gemini.

        Args:
            text: Text to summarize
            doc_id: Document identifier
            page_num: Page number

        Returns:
            Four-line summary of the text
        """
        cache_key = f"{doc_id}_{page_num}"

        # Return cached summary if available
        if cache_key in self.cache:
            return self.cache[cache_key]

        # Generate summary using Gemini via LiteLLM
        prompt = f"""
Create a concise 4-line summary of the following text that captures the essential information:

{text}
"""
        try:
            response = litellm.completion(
                model="gemini/gemini-pro",
                messages=[{"role": "user", "content": prompt}]
            )
            summary = response.choices[0].message.content.strip()

            # Ensure the summary is no more than 4 lines
            summary_lines = summary.split('\n')
            if len(summary_lines) > 4:
                summary = '\n'.join(summary_lines[:4])

            # Cache the summary
            self.cache[cache_key] = summary
            self._save_cache()
            return summary
        except Exception as e:
            print(f"Error generating summary: {e}")
            return "Summary generation failed."
```
Research indicates that optimal chunking strategies involve both appropriate chunk sizes and strategic overlap between adjacent chunks[4]. Our implementation allows for configurable chunk size and overlap parameters:
```python
def chunk_with_overlap(pages: Dict[int, str], chunk_size: int = 1000, overlap: int = 200) -> List[Dict]:
    """
    Split document pages into chunks with overlap.

    Args:
        pages: Dictionary mapping page numbers to text content
        chunk_size: Target size for each chunk in characters
        overlap: Overlap size in characters

    Returns:
        List of chunk dictionaries with metadata
    """
    chunks = []
    current_chunk = ""
    current_pages = []
    sorted_pages = sorted(pages.items())

    for page_num, page_text in sorted_pages:
        # If adding this page would exceed the chunk size, store the current chunk
        if len(current_chunk) + len(page_text) > chunk_size and current_chunk:
            chunks.append({
                "text": current_chunk,
                "pages": current_pages,
                "start_page": current_pages[0],
                "end_page": current_pages[-1]
            })

            # Start the new chunk with the trailing overlap of the previous chunk.
            # Guard against overlap == 0, where text[-0:] would return the whole string.
            overlap_chars = min(len(current_chunk), overlap)
            current_chunk = current_chunk[-overlap_chars:] if overlap_chars > 0 else ""

            # The carried-over text comes from the last page of the previous chunk
            if current_pages:
                current_pages = [current_pages[-1]] if overlap_chars > 0 else []

        # Add the current page to the chunk
        current_chunk += page_text
        current_pages.append(page_num)

    # Add the last chunk if there's anything left
    if current_chunk:
        chunks.append({
            "text": current_chunk,
            "pages": current_pages,
            "start_page": current_pages[0],
            "end_page": current_pages[-1]
        })

    return chunks
```
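To see how pages are grouped, a short inspection loop over the `pages` dictionary loaded earlier prints each chunk's size and page span:

```python
# Inspect chunk boundaries produced from the `pages` dict loaded earlier
chunks = chunk_with_overlap(pages, chunk_size=1000, overlap=200)

for i, chunk in enumerate(chunks):
    print(
        f"Chunk {i}: {len(chunk['text'])} chars, "
        f"pages {chunk['start_page'] + 1}-{chunk['end_page'] + 1}"
    )
```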
The heart of our preprocessor is the enhancement of chunks with summaries of surrounding content. This function adds summaries of preceding and following pages to each chunk, providing crucial context for the retrieval process:
```python
def enhance_chunks_with_summaries(
    chunks: List[Dict],
    pages: Dict[int, str],
    summary_generator: SummaryGenerator,
    doc_id: str,
    context_pages: int = 10
) -> List[Dict]:
    """
    Enhance chunks with summaries of preceding and following pages.

    Args:
        chunks: List of chunk dictionaries
        pages: Dictionary mapping page numbers to text content
        summary_generator: Instance of SummaryGenerator
        doc_id: Document identifier for caching
        context_pages: Number of pages before and after to summarize

    Returns:
        List of enhanced chunk dictionaries
    """
    max_page = max(pages.keys())
    enhanced_chunks = []

    for chunk in tqdm(chunks, desc="Enhancing chunks"):
        start_page = chunk["start_page"]
        end_page = chunk["end_page"]

        # Generate summaries for preceding pages
        preceding_summaries = []
        for page_num in range(max(0, start_page - context_pages), start_page):
            if page_num in pages:
                summary = summary_generator.generate_summary(pages[page_num], doc_id, page_num)
                preceding_summaries.append(f"Page {page_num + 1} Summary: {summary}")

        # Generate summaries for following pages
        following_summaries = []
        for page_num in range(end_page + 1, min(max_page + 1, end_page + 1 + context_pages)):
            if page_num in pages:
                summary = summary_generator.generate_summary(pages[page_num], doc_id, page_num)
                following_summaries.append(f"Page {page_num + 1} Summary: {summary}")

        # Assemble the enhanced text
        preceding_context = "\n\n".join(preceding_summaries)
        following_context = "\n\n".join(following_summaries)

        enhanced_text = ""
        if preceding_context:
            enhanced_text += f"PRECEDING CONTEXT:\n{preceding_context}\n\n"
        enhanced_text += f"MAIN CONTENT:\n{chunk['text']}\n\n"
        if following_context:
            enhanced_text += f"FOLLOWING CONTEXT:\n{following_context}"

        # Attach the enhanced text to a copy of the original chunk
        enhanced_chunk = chunk.copy()
        enhanced_chunk["enhanced_text"] = enhanced_text
        enhanced_chunks.append(enhanced_chunk)

    return enhanced_chunks
```
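To verify the layout of the augmented text, you can print the first enhanced chunk; this sketch assumes the `chunks`, `pages`, and `summarizer` objects created in the previous steps, and uses a small `context_pages` value to keep the demo inexpensive:

```python
# Inspect the structure of an enhanced chunk (assumes chunks, pages, and summarizer from above)
enhanced = enhance_chunks_with_summaries(
    chunks, pages, summarizer, doc_id="example", context_pages=2
)

print(enhanced[0]["enhanced_text"][:500])  # PRECEDING CONTEXT / MAIN CONTENT / FOLLOWING CONTEXT
```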
For a complete RAG preprocessor, we need to generate embeddings for our enhanced chunks and store them in a vector database:
```python
class EmbeddingGenerator:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """
        Generate embeddings for a list of texts.

        Args:
            texts: List of text strings

        Returns:
            Array of embedding vectors
        """
        return self.model.encode(texts, show_progress_bar=True)


class VectorStore:
    def __init__(self, embedding_dim: int = 384):
        self.index = faiss.IndexFlatL2(embedding_dim)
        self.documents = []

    def add_documents(self, documents: List[Dict], embeddings: np.ndarray):
        """
        Add documents and their embeddings to the vector store.

        Args:
            documents: List of document dictionaries
            embeddings: Array of embedding vectors
        """
        # Wrap the flat index so we can store explicit IDs (only on the first insert)
        if self.index.ntotal == 0 and not isinstance(self.index, faiss.IndexIDMap):
            self.index = faiss.IndexIDMap(self.index)

        # Add documents to storage
        start_id = len(self.documents)
        for i, doc in enumerate(documents):
            doc_with_id = doc.copy()
            doc_with_id["id"] = start_id + i
            self.documents.append(doc_with_id)

        # Add embeddings to the index (FAISS expects float32 vectors and int64 IDs)
        ids = np.arange(start_id, start_id + len(documents)).astype(np.int64)
        self.index.add_with_ids(np.asarray(embeddings, dtype=np.float32), ids)

    def search(self, query_embedding: np.ndarray, k: int = 5) -> List[Dict]:
        """
        Search for similar documents.

        Args:
            query_embedding: Embedding vector of the query
            k: Number of results to return

        Returns:
            List of similar document dictionaries with distance scores
        """
        k = min(k, self.index.ntotal)
        if k == 0:
            return []

        query = np.asarray(query_embedding, dtype=np.float32).reshape(1, -1)
        distances, indices = self.index.search(query, k)

        results = []
        for i, idx in enumerate(indices[0]):
            if idx != -1:
                doc = self.documents[idx].copy()
                doc["score"] = float(distances[0][i])
                results.append(doc)
        return results
```
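Before wiring the store into the full pipeline, a tiny self-contained test with random vectors confirms that insertion and search round-trip correctly (the dimensions and dummy documents here are arbitrary):

```python
# Stand-alone round-trip test of VectorStore with random vectors (values are arbitrary)
rng = np.random.default_rng(0)
dummy_docs = [{"text": f"doc {i}"} for i in range(10)]
dummy_embeddings = rng.random((10, 384), dtype=np.float32)

store = VectorStore(embedding_dim=384)
store.add_documents(dummy_docs, dummy_embeddings)

hits = store.search(dummy_embeddings[3], k=3)
print([h["id"] for h in hits])  # the first hit should be document 3 itself (distance 0)
```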
Finally, we integrate all components into a comprehensive preprocessing pipeline:
```python
def preprocess_document_for_rag(
    pdf_path: str,
    doc_id: str = None,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    context_pages: int = 10,
    embedding_model: str = "all-MiniLM-L6-v2"
) -> VectorStore:
    """
    Preprocess a document for RAG with enhanced context.

    Args:
        pdf_path: Path to the PDF file
        doc_id: Document identifier (defaults to the filename)
        chunk_size: Target size for each chunk in characters
        chunk_overlap: Overlap size in characters
        context_pages: Number of pages before and after to summarize
        embedding_model: Name of the embedding model to use

    Returns:
        VectorStore with processed document chunks
    """
    # Set document ID if not provided
    if doc_id is None:
        doc_id = Path(pdf_path).stem

    # Load document
    print(f"Loading document: {pdf_path}")
    pages = load_pdf_by_pages(pdf_path)

    # Initialize summary generator with a per-document cache
    summary_generator = SummaryGenerator(cache_file=f"{doc_id}_summaries.json")

    # Create chunks with overlap
    print(f"Creating chunks with size {chunk_size} and overlap {chunk_overlap}")
    chunks = chunk_with_overlap(pages, chunk_size=chunk_size, overlap=chunk_overlap)

    # Enhance chunks with summaries
    print(f"Enhancing chunks with {context_pages} pages of context")
    enhanced_chunks = enhance_chunks_with_summaries(
        chunks, pages, summary_generator, doc_id, context_pages=context_pages
    )

    # Generate embeddings
    print("Generating embeddings")
    embedding_generator = EmbeddingGenerator(model_name=embedding_model)
    embeddings = embedding_generator.generate_embeddings(
        [chunk["enhanced_text"] for chunk in enhanced_chunks]
    )

    # Store in vector database
    print("Storing in vector database")
    vector_store = VectorStore(embedding_dim=embeddings.shape[1])
    vector_store.add_documents(enhanced_chunks, embeddings)

    return vector_store
```
The effectiveness of our RAG preprocessor depends on selecting appropriate parameters. Based on community discussions of real-world implementations[4], chunk sizes of roughly 500-2,000 tokens with 10-20% overlap work well for most applications. Note that the functions above measure `chunk_size` and `overlap` in characters, so convert accordingly (English text averages roughly four characters per token):
```python
# Example of using the pipeline with recommended parameters
vector_store = preprocess_document_for_rag(
    pdf_path="example.pdf",
    chunk_size=1500,    # measured in characters in this implementation
    chunk_overlap=150,  # 10% of chunk size is a good starting point
    context_pages=10    # summarize 10 pages before and after each chunk
)

# Example query
query = "What is the main topic of the document?"
embedding_generator = EmbeddingGenerator()
query_embedding = embedding_generator.generate_embeddings([query])[0]
results = vector_store.search(query_embedding, k=3)

for i, result in enumerate(results):
    print(f"Result {i+1} (Score: {result['score']:.4f}):")
    print(f"Pages: {result['start_page']+1}-{result['end_page']+1}")
    print(f"Text (excerpt): {result['enhanced_text'][:300]}...")
    print("-" * 80)
```
Our implementation includes several optimizations to enhance performance:
- Summary caching to avoid regenerating summaries for the same pages, significantly reducing API calls to Gemini[7] (see the cache-inspection snippet after this list)
- Configurable context window to balance comprehensive context with computational efficiency
- Flexible embedding model selection to accommodate different performance requirements
- Document metadata preservation to maintain contextual information for retrieval
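Because the summary cache is a plain JSON file keyed by document and page, it persists across notebook sessions and can be inspected directly. A small sketch for checking what is cached, assuming the pipeline was run with `doc_id` "example" so the file follows the `f"{doc_id}_summaries.json"` naming convention:

```python
# Inspect the persisted summary cache (filename assumes doc_id == "example")
with open("example_summaries.json") as f:
    cache = json.load(f)

print(f"{len(cache)} page summaries cached")
for key in list(cache)[:3]:
    print(key, "->", cache[key].splitlines()[0])  # first line of each cached summary
```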
Studies have shown that recursive Python-specific splitting techniques with a context overlap of 15 tokens and a chunk size of around 100 tokens can be particularly effective for certain types of technical content[2]. However, optimal parameters vary based on document type, content complexity, and retrieval objectives.
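For Python source and similarly structured content, that style of splitting is available off the shelf. The sketch below uses LangChain's text splitters, an extra dependency not installed in this notebook; note that `chunk_size` and `chunk_overlap` are counted in characters unless you supply a token-based length function, so the values only mirror the cited figures approximately:

```python
# Sketch only: requires `pip install langchain-text-splitters` (not part of this notebook's setup)
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=100,    # counted in characters by default; pass a token-based length_function for tokens
    chunk_overlap=15,
)

sample_code = open("rag_preprocessor.py").read()  # hypothetical source file to split
code_chunks = python_splitter.split_text(sample_code)
print(f"{len(code_chunks)} code-aware chunks")
```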
This RAG preprocessor notebook demonstrates a comprehensive approach to enhancing document chunks with contextual summaries for improved retrieval and generation. By implementing strategic overlap and augmenting chunks with summaries of surrounding content, we address key limitations of traditional chunking methods.
The effectiveness of this approach lies in its ability to provide retrieval systems with not just isolated text fragments but contextually rich segments that preserve semantic connections across the document. The result is more coherent, accurate, and comprehensive responses from RAG systems.
For further improvements, consider experimenting with different chunking strategies, such as semantic chunking based on document structure[8], or implementing multimodal RAG capabilities for documents containing images, tables, and other non-text elements[9].
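As a starting point for the semantic-chunking direction, the sketch below reuses the SentenceTransformer model already required by this notebook and starts a new chunk wherever the cosine similarity between adjacent sentences drops below a threshold; the naive sentence split and the 0.5 threshold are illustrative assumptions rather than tuned values.

```python
# Rough semantic-chunking sketch: split where adjacent sentences are dissimilar.
# The naive sentence split and the 0.5 threshold are illustrative assumptions.
def semantic_chunks(text: str, model: SentenceTransformer, threshold: float = 0.5) -> List[str]:
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if len(sentences) < 2:
        return sentences

    # Cosine similarity between each sentence and the next (embeddings are L2-normalized)
    embeddings = model.encode(sentences, normalize_embeddings=True)
    sims = (embeddings[:-1] * embeddings[1:]).sum(axis=1)

    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:  # likely topic shift -> start a new chunk
            chunks.append(". ".join(current))
            current = []
        current.append(sentence)
    chunks.append(". ".join(current))
    return chunks
```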