This comprehensive guide details how to create a Python notebook for preprocessing documents in Retrieval Augmented Generation (RAG) systems with improved context handling. The notebook focuses on enhancing chunk quality by adding summaries of surrounding content and implementing strategic overlap between chunks.
Retrieval Augmented Generation has emerged as a powerful technique for combining the knowledge retrieval capabilities of vector databases with the generative abilities of large language models. However, standard chunking approaches often fail to maintain semantic continuity between document segments, leading to context loss and diminished response quality[6]. Creating intelligent chunks with appropriate context is essential for effective RAG systems.
When documents are chunked without considering their broader context, the retrieval process can surface relevant text fragments that lack the necessary surrounding information for complete understanding[2]. This limitation becomes particularly problematic for complex documents where concepts build upon previous sections or refer to upcoming content.
Our RAG preprocessor addresses these limitations through two key strategies:
- Implementing strategic chunk overlap to maintain continuity between adjacent segments
- Augmenting each chunk with concise summaries of preceding and following pages
This approach ensures that when a chunk is retrieved, it carries with it a snapshot of contextual information from surrounding document sections, significantly improving the coherence and accuracy of generated responses[8].
We begin by setting up our Python environment with the necessary libraries for document processing, language model integration, and vector storage:
```python
# Install required libraries
!pip install pypdf litellm sentence-transformers faiss-cpu pandas numpy tqdm
```

```python
# Import dependencies
import os
import json
import time
import numpy as np
import pandas as pd
from typing import Dict, List, Tuple
from tqdm.auto import tqdm
from pathlib import Path

# Document processing
import pypdf

# LLM and embeddings
import litellm
from sentence_transformers import SentenceTransformer

# Vector store
import faiss
```
Next, we'll configure our Gemini API key for generating summaries:
```python
# Set your Gemini API key
os.environ["GEMINI_API_KEY"] = "your-api-key-here"
```
The first step in our preprocessing pipeline is to load documents and extract text by page:
```python
def load_pdf_by_pages(pdf_path: str) -> Dict[int, str]:
    """
    Load a PDF and extract text by page.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary mapping page numbers (0-indexed) to text content
    """
    pages = {}
    pdf_reader = pypdf.PdfReader(pdf_path)
    for i, page in enumerate(pdf_reader.pages):
        # extract_text() can return an empty result for image-only pages
        pages[i] = page.extract_text() or ""
    return pages
```
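As a quick sanity check, you can run the loader on a sample file; `example.pdf` below is just a placeholder, so substitute a local document of your own:

```python
# Minimal usage sketch -- "example.pdf" is a placeholder path
pages = load_pdf_by_pages("example.pdf")
print(f"Loaded {len(pages)} pages")
print(pages[0][:200])  # first 200 characters of the first page
```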
A crucial component of our system is the ability to generate concise summaries of document pages without redundant processing. The SummaryGenerator class utilizes Gemini through LiteLLM[7] to create 4-line summaries of each page and implements caching to avoid regenerating summaries for previously processed pages:
```python
class SummaryGenerator:
    def __init__(self, cache_file: str = "summary_cache.json"):
        self.cache_file = cache_file
        self.cache = self._load_cache()

    def _load_cache(self) -> Dict:
        """Load existing summaries from the cache file."""
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        return {}

    def _save_cache(self):
        """Save summaries to the cache file."""
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f, indent=2)

    def generate_summary(self, text: str, doc_id: str, page_num: int) -> str:
        """
        Generate a 4-line summary of the given text using Gemini.

        Args:
            text: Text to summarize
            doc_id: Document identifier
            page_num: Page number

        Returns:
            Four-line summary of the text
        """
        cache_key = f"{doc_id}_{page_num}"

        # Return cached summary if available
        if cache_key in self.cache:
            return self.cache[cache_key]

        # Generate summary using Gemini via LiteLLM
        prompt = f"""
Create a concise 4-line summary of the following text that captures the essential information:

{text}
"""
        try:
            response = litellm.completion(
                model="gemini/gemini-pro",
                messages=[{"role": "user", "content": prompt}]
            )
            summary = response.choices[0].message.content.strip()

            # Ensure the summary is no more than 4 lines
            summary_lines = summary.split('\n')
            if len(summary_lines) > 4:
                summary = '\n'.join(summary_lines[:4])

            # Cache the summary
            self.cache[cache_key] = summary
            self._save_cache()
            return summary
        except Exception as e:
            print(f"Error generating summary: {e}")
            return "Summary generation failed."
```
Research indicates that optimal chunking strategies involve both appropriate chunk sizes and strategic overlap between adjacent chunks[4]. Our implementation allows for configurable chunk size and overlap parameters:
```python
def chunk_with_overlap(pages: Dict[int, str], chunk_size: int = 1000, overlap: int = 200) -> List[Dict]:
    """
    Split document pages into chunks with overlap.

    Args:
        pages: Dictionary mapping page numbers to text content
        chunk_size: Target size for each chunk in characters
        overlap: Overlap size in characters

    Returns:
        List of chunk dictionaries with metadata
    """
    chunks = []
    current_chunk = ""
    current_pages = []
    sorted_pages = sorted(pages.items())

    for page_num, page_text in sorted_pages:
        # If adding this page would exceed the chunk size, store the current chunk
        if len(current_chunk) + len(page_text) > chunk_size and current_chunk:
            chunks.append({
                "text": current_chunk,
                "pages": current_pages,
                "start_page": current_pages[0],
                "end_page": current_pages[-1]
            })

            # Start the new chunk with the trailing overlap of the previous chunk.
            # Guard against overlap == 0, where text[-0:] would return the whole string.
            overlap_chars = min(len(current_chunk), overlap)
            current_chunk = current_chunk[-overlap_chars:] if overlap_chars > 0 else ""

            # The carried-over text comes from the last page of the previous chunk
            if current_pages:
                current_pages = [current_pages[-1]] if overlap_chars > 0 else []

        # Add the current page to the chunk
        current_chunk += page_text
        current_pages.append(page_num)

    # Add the last chunk if there's anything left
    if current_chunk:
        chunks.append({
            "text": current_chunk,
            "pages": current_pages,
            "start_page": current_pages[0],
            "end_page": current_pages[-1]
        })

    return chunks
```
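To see how pages are grouped, a short inspection loop over the `pages` dictionary loaded earlier prints each chunk's size and page span:

```python
# Inspect chunk boundaries produced from the `pages` dict loaded earlier
chunks = chunk_with_overlap(pages, chunk_size=1000, overlap=200)

for i, chunk in enumerate(chunks):
    print(
        f"Chunk {i}: {len(chunk['text'])} chars, "
        f"pages {chunk['start_page'] + 1}-{chunk['end_page'] + 1}"
    )
```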
The heart of our preprocessor is the enhancement of chunks with summaries of surrounding content. This function adds summaries of preceding and following pages to each chunk, providing crucial context for the retrieval process:
```python
def enhance_chunks_with_summaries(
    chunks: List[Dict],
    pages: Dict[int, str],
    summary_generator: SummaryGenerator,
    doc_id: str,
    context_pages: int = 10
) -> List[Dict]:
    """
    Enhance chunks with summaries of preceding and following pages.

    Args:
        chunks: List of chunk dictionaries
        pages: Dictionary mapping page numbers to text content
        summary_generator: Instance of SummaryGenerator
        doc_id: Document identifier for caching
        context_pages: Number of pages before and after to summarize

    Returns:
        List of enhanced chunk dictionaries
    """
    max_page = max(pages.keys())
    enhanced_chunks = []

    for chunk in tqdm(chunks, desc="Enhancing chunks"):
        start_page = chunk["start_page"]
        end_page = chunk["end_page"]

        # Generate summaries for preceding pages
        preceding_summaries = []
        for page_num in range(max(0, start_page - context_pages), start_page):
            if page_num in pages:
                summary = summary_generator.generate_summary(pages[page_num], doc_id, page_num)
                preceding_summaries.append(f"Page {page_num + 1} Summary: {summary}")

        # Generate summaries for following pages
        following_summaries = []
        for page_num in range(end_page + 1, min(max_page + 1, end_page + 1 + context_pages)):
            if page_num in pages:
                summary = summary_generator.generate_summary(pages[page_num], doc_id, page_num)
                following_summaries.append(f"Page {page_num + 1} Summary: {summary}")

        # Assemble the enhanced text
        preceding_context = "\n\n".join(preceding_summaries)
        following_context = "\n\n".join(following_summaries)

        enhanced_text = ""
        if preceding_context:
            enhanced_text += f"PRECEDING CONTEXT:\n{preceding_context}\n\n"
        enhanced_text += f"MAIN CONTENT:\n{chunk['text']}\n\n"
        if following_context:
            enhanced_text += f"FOLLOWING CONTEXT:\n{following_context}"

        # Attach the enhanced text to a copy of the original chunk
        enhanced_chunk = chunk.copy()
        enhanced_chunk["enhanced_text"] = enhanced_text
        enhanced_chunks.append(enhanced_chunk)

    return enhanced_chunks
```
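To verify the layout of the augmented text, you can print the first enhanced chunk; this sketch assumes the `chunks`, `pages`, and `summarizer` objects created in the previous steps, and uses a small `context_pages` value to keep the demo inexpensive:

```python
# Inspect the structure of an enhanced chunk (assumes chunks, pages, and summarizer from above)
enhanced = enhance_chunks_with_summaries(
    chunks, pages, summarizer, doc_id="example", context_pages=2
)

print(enhanced[0]["enhanced_text"][:500])  # PRECEDING CONTEXT / MAIN CONTENT / FOLLOWING CONTEXT
```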
For a complete RAG preprocessor, we need to generate embeddings for our enhanced chunks and store them in a vector database:
```python
class EmbeddingGenerator:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """
        Generate embeddings for a list of texts.

        Args:
            texts: List of text strings

        Returns:
            Array of embedding vectors
        """
        return self.model.encode(texts, show_progress_bar=True)


class VectorStore:
    def __init__(self, embedding_dim: int = 384):
        self.index = faiss.IndexFlatL2(embedding_dim)
        self.documents = []

    def add_documents(self, documents: List[Dict], embeddings: np.ndarray):
        """
        Add documents and their embeddings to the vector store.

        Args:
            documents: List of document dictionaries
            embeddings: Array of embedding vectors
        """
        # Wrap the flat index so we can store explicit IDs (only on the first insert)
        if self.index.ntotal == 0 and not isinstance(self.index, faiss.IndexIDMap):
            self.index = faiss.IndexIDMap(self.index)

        # Add documents to storage
        start_id = len(self.documents)
        for i, doc in enumerate(documents):
            doc_with_id = doc.copy()
            doc_with_id["id"] = start_id + i
            self.documents.append(doc_with_id)

        # Add embeddings to the index (FAISS expects float32 vectors and int64 IDs)
        ids = np.arange(start_id, start_id + len(documents)).astype(np.int64)
        self.index.add_with_ids(np.asarray(embeddings, dtype=np.float32), ids)

    def search(self, query_embedding: np.ndarray, k: int = 5) -> List[Dict]:
        """
        Search for similar documents.

        Args:
            query_embedding: Embedding vector of the query
            k: Number of results to return

        Returns:
            List of similar document dictionaries with distance scores
        """
        k = min(k, self.index.ntotal)
        if k == 0:
            return []

        query = np.asarray(query_embedding, dtype=np.float32).reshape(1, -1)
        distances, indices = self.index.search(query, k)

        results = []
        for i, idx in enumerate(indices[0]):
            if idx != -1:
                doc = self.documents[idx].copy()
                doc["score"] = float(distances[0][i])
                results.append(doc)
        return results
```
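Before wiring the store into the full pipeline, a tiny self-contained test with random vectors confirms that insertion and search round-trip correctly (the dimensions and dummy documents here are arbitrary):

```python
# Stand-alone round-trip test of VectorStore with random vectors (values are arbitrary)
rng = np.random.default_rng(0)
dummy_docs = [{"text": f"doc {i}"} for i in range(10)]
dummy_embeddings = rng.random((10, 384), dtype=np.float32)

store = VectorStore(embedding_dim=384)
store.add_documents(dummy_docs, dummy_embeddings)

hits = store.search(dummy_embeddings[3], k=3)
print([h["id"] for h in hits])  # the first hit should be document 3 itself (distance 0)
```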
Finally, we integrate all components into a comprehensive preprocessing pipeline:
```python
def preprocess_document_for_rag(
    pdf_path: str,
    doc_id: str = None,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    context_pages: int = 10,
    embedding_model: str = "all-MiniLM-L6-v2"
) -> VectorStore:
    """
    Preprocess a document for RAG with enhanced context.

    Args:
        pdf_path: Path to the PDF file
        doc_id: Document identifier (defaults to the filename)
        chunk_size: Target size for each chunk in characters
        chunk_overlap: Overlap size in characters
        context_pages: Number of pages before and after to summarize
        embedding_model: Name of the embedding model to use

    Returns:
        VectorStore with processed document chunks
    """
    # Set document ID if not provided
    if doc_id is None:
        doc_id = Path(pdf_path).stem

    # Load document
    print(f"Loading document: {pdf_path}")
    pages = load_pdf_by_pages(pdf_path)

    # Initialize summary generator with a per-document cache
    summary_generator = SummaryGenerator(cache_file=f"{doc_id}_summaries.json")

    # Create chunks with overlap
    print(f"Creating chunks with size {chunk_size} and overlap {chunk_overlap}")
    chunks = chunk_with_overlap(pages, chunk_size=chunk_size, overlap=chunk_overlap)

    # Enhance chunks with summaries
    print(f"Enhancing chunks with {context_pages} pages of context")
    enhanced_chunks = enhance_chunks_with_summaries(
        chunks, pages, summary_generator, doc_id, context_pages=context_pages
    )

    # Generate embeddings
    print("Generating embeddings")
    embedding_generator = EmbeddingGenerator(model_name=embedding_model)
    embeddings = embedding_generator.generate_embeddings(
        [chunk["enhanced_text"] for chunk in enhanced_chunks]
    )

    # Store in vector database
    print("Storing in vector database")
    vector_store = VectorStore(embedding_dim=embeddings.shape[1])
    vector_store.add_documents(enhanced_chunks, embeddings)

    return vector_store
```
The effectiveness of our RAG preprocessor depends on selecting appropriate parameters. Based on community discussions of real-world implementations[4], chunk sizes of roughly 500-2,000 tokens with 10-20% overlap work well for most applications. Note that the functions above measure `chunk_size` and `overlap` in characters, so convert accordingly (English text averages roughly four characters per token):
```python
# Example of using the pipeline with recommended parameters
vector_store = preprocess_document_for_rag(
    pdf_path="example.pdf",
    chunk_size=1500,    # measured in characters in this implementation
    chunk_overlap=150,  # 10% of chunk size is a good starting point
    context_pages=10    # summarize 10 pages before and after each chunk
)

# Example query
query = "What is the main topic of the document?"
embedding_generator = EmbeddingGenerator()
query_embedding = embedding_generator.generate_embeddings([query])[0]
results = vector_store.search(query_embedding, k=3)

for i, result in enumerate(results):
    print(f"Result {i+1} (Score: {result['score']:.4f}):")
    print(f"Pages: {result['start_page']+1}-{result['end_page']+1}")
    print(f"Text (excerpt): {result['enhanced_text'][:300]}...")
    print("-" * 80)
```
Our implementation includes several optimizations to enhance performance:
- Summary caching to avoid regenerating summaries for the same pages, significantly reducing API calls to Gemini[7] (see the cache-inspection snippet after this list)
- Configurable context window to balance comprehensive context with computational efficiency
- Flexible embedding model selection to accommodate different performance requirements
- Document metadata preservation to maintain contextual information for retrieval
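Because the summary cache is a plain JSON file keyed by document and page, it persists across notebook sessions and can be inspected directly. A small sketch for checking what is cached, assuming the pipeline was run with `doc_id` "example" so the file follows the `f"{doc_id}_summaries.json"` naming convention:

```python
# Inspect the persisted summary cache (filename assumes doc_id == "example")
with open("example_summaries.json") as f:
    cache = json.load(f)

print(f"{len(cache)} page summaries cached")
for key in list(cache)[:3]:
    print(key, "->", cache[key].splitlines()[0])  # first line of each cached summary
```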
Studies have shown that recursive Python-specific splitting techniques with a context overlap of 15 tokens and a chunk size of around 100 tokens can be particularly effective for certain types of technical content[2]. However, optimal parameters vary based on document type, content complexity, and retrieval objectives.
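For Python source and similarly structured content, that style of splitting is available off the shelf. The sketch below uses LangChain's text splitters, an extra dependency not installed in this notebook; note that `chunk_size` and `chunk_overlap` are counted in characters unless you supply a token-based length function, so the values only mirror the cited figures approximately:

```python
# Sketch only: requires `pip install langchain-text-splitters` (not part of this notebook's setup)
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=100,    # counted in characters by default; pass a token-based length_function for tokens
    chunk_overlap=15,
)

sample_code = open("rag_preprocessor.py").read()  # hypothetical source file to split
code_chunks = python_splitter.split_text(sample_code)
print(f"{len(code_chunks)} code-aware chunks")
```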
This RAG preprocessor notebook demonstrates a comprehensive approach to enhancing document chunks with contextual summaries for improved retrieval and generation. By implementing strategic overlap and augmenting chunks with summaries of surrounding content, we address key limitations of traditional chunking methods.
The effectiveness of this approach lies in its ability to provide retrieval systems with not just isolated text fragments but contextually rich segments that preserve semantic connections across the document. The result is more coherent, accurate, and comprehensive responses from RAG systems.
For further improvements, consider experimenting with different chunking strategies, such as semantic chunking based on document structure[8], or implementing multimodal RAG capabilities for documents containing images, tables, and other non-text elements[9].
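As a starting point for the semantic-chunking direction, the sketch below reuses the SentenceTransformer model already required by this notebook and starts a new chunk wherever the cosine similarity between adjacent sentences drops below a threshold; the naive sentence split and the 0.5 threshold are illustrative assumptions rather than tuned values.

```python
# Rough semantic-chunking sketch: split where adjacent sentences are dissimilar.
# The naive sentence split and the 0.5 threshold are illustrative assumptions.
def semantic_chunks(text: str, model: SentenceTransformer, threshold: float = 0.5) -> List[str]:
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if len(sentences) < 2:
        return sentences

    # Cosine similarity between each sentence and the next (embeddings are L2-normalized)
    embeddings = model.encode(sentences, normalize_embeddings=True)
    sims = (embeddings[:-1] * embeddings[1:]).sum(axis=1)

    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:  # likely topic shift -> start a new chunk
            chunks.append(". ".join(current))
            current = []
        current.append(sentence)
    chunks.append(". ".join(current))
    return chunks
```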