PDF Question Answering, Fully Local

This tool lets you ask questions about your PDF documents, entirely on your own machine. Under the hood, a Python script reads PDFs from a knowledge base folder, generates embeddings with OllamaEmbeddings, stores them in a FAISS vector store, and runs similarity searches over them, using Langchain's text splitting and a local LLM to answer your queries.

🚀 Quick Start Guide

Step 1: Set Up Your Documents

  1. Create a folder named knowledge_base in the same location as the program
  2. Put all the PDF files you want to ask questions about into the knowledge_base folder (example commands below)
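
For example, on macOS or Linux (the PDF filename here is just a placeholder):

mkdir knowledge_base
cp ~/Documents/example.pdf knowledge_base/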

Step 2: Install Required AI Models

Open your Terminal (Command Prompt on Windows) and run these commands:

ollama pull nomic-embed-text
ollama pull gemma3:1b

Wait for each command to complete before running the next one. This might take a few minutes depending on your internet speed.
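
You can confirm both models are installed by listing them:

ollama list

You should see nomic-embed-text and gemma3:1b in the output.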

Step 3: Ask Questions

  1. Open the file named rag.py in a text editor
  2. Find this section at the bottom of the file:
    def main():
        config = RAGConfig()
        rag_system = RAGSystem(config)
        response = rag_system.query("what is mcp-solver")  # Change this line
        print(response)
  3. Replace "what is mcp-solver" with your own question in quotes
  4. Save the file
  5. Run the program (see the commands below)
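
The script also needs a few Python packages. The package names below are a reasonable guess for a typical Langchain setup (the script imports PyPDF2, langchain, langchain_community, and FAISS); adjust them to match your environment, then run the script:

pip install PyPDF2 langchain langchain-community faiss-cpu
python rag.py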

📁 File Organization

Your folder structure should look like this:

your_folder/
│
├── knowledge_base/
│   ├── document1.pdf
│   ├── document2.pdf
│   └── ...
│
├── rag.py
└── embeddings/  (this will be created automatically)

❗ Important Notes

  • Make sure your PDFs contain selectable text, not scanned images (a quick way to check is shown below)
  • The first time you run a query, it might take longer as the system needs to process the PDFs
  • Subsequent queries will be faster as the system saves its processed data
  • Keep your questions clear and specific for better results
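
Not sure whether a PDF contains real text? A quick check with PyPDF2, the same library the script uses, looks like this (the file path is just an example):

from PyPDF2 import PdfReader

reader = PdfReader("knowledge_base/document1.pdf")  # example path, use one of your own PDFs
first_page = reader.pages[0].extract_text() or ""
if first_page.strip():
    print("Text found: this PDF should work.")
else:
    print("No text found: this PDF is probably a scanned image.")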

🔍 Example Questions

Instead of:

response = rag_system.query("what is mcp-solver")

You can try:

response = rag_system.query("What are the main points in Chapter 1?")
# or
response = rag_system.query("What does the document say about project management?")
# or
response = rag_system.query("Can you summarize the conclusion?")

💡 Tips

  • The system works best with clear, specific questions
  • If you get unclear answers, try rephrasing your question
  • Add new PDFs anytime by putting them in the knowledge_base folder
  • The system will automatically process new PDFs the next time you run a query (see the note on caching below)
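
Processed embeddings are cached in the embeddings/ folder, keyed by a hash of each PDF's contents, so only new or changed files get reprocessed. If you ever want to force a full rebuild, delete the cache folder (it will be recreated on the next run); on Windows, delete the folder in File Explorer instead:

rm -rf embeddings/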

⚠️ Troubleshooting

If you encounter issues:

  1. Make sure all your PDFs are in the knowledge_base folder
  2. Verify that you've installed the AI models correctly
  3. Check that your question is properly formatted with quotes
  4. Ensure your PDFs are text-based and not scanned images

rag.py

from pathlib import Path
import hashlib
import time
from dataclasses import dataclass
from typing import Optional

from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import Ollama


@dataclass
class RAGConfig:
    knowledge_base_path: str = "knowledge_base"
    embeddings_path: str = "embeddings"
    embedding_model: str = "nomic-embed-text"
    llm_model: str = "gemma3:1b"
    chunk_size: int = 1000
    chunk_overlap: int = 50
    top_k: int = 2        # chunks retrieved per PDF
    batch_size: int = 10  # chunks embedded per batch


class DocumentProcessor:
    def __init__(self, config: RAGConfig):
        self.config = config
        self.text_splitter = RecursiveCharacterTextSplitter(
            separators=['.\n', '.', '\n'],
            chunk_size=config.chunk_size,
            chunk_overlap=config.chunk_overlap
        )
        self.embeddings = OllamaEmbeddings(model=config.embedding_model)

    def process_pdf(self, pdf_path: Path) -> Optional[FAISS]:
        """Extract text from a PDF, split it into chunks, and embed it into a FAISS index."""
        try:
            pdf_reader = PdfReader(pdf_path)
            text = ""
            for page in pdf_reader.pages:
                # extract_text() can return None for pages with no extractable text
                text += page.extract_text() or ""
            documents = self.text_splitter.create_documents([text])
            if not documents:
                print(f"No text extracted from {pdf_path.name}")
                return None
            # Embed in batches so one failed batch doesn't lose the whole document
            all_batches = [documents[i:i + self.config.batch_size]
                           for i in range(0, len(documents), self.config.batch_size)]
            vector_db = FAISS.from_documents(all_batches[0], self.embeddings)
            for i, batch in enumerate(all_batches[1:], 1):
                try:
                    vector_db.add_documents(batch)
                    print(f"Processed batch {i}/{len(all_batches) - 1} for {pdf_path.name}")
                    time.sleep(0.5)  # brief pause between batches to avoid overloading Ollama
                except Exception as e:
                    print(f"Error processing batch {i} for {pdf_path.name}: {e}")
                    continue
            return vector_db
        except Exception as e:
            print(f"Error processing {pdf_path.name}: {e}")
            return None


class RAGSystem:
    def __init__(self, config: RAGConfig):
        self.config = config
        self.processor = DocumentProcessor(config)
        self.llm = Ollama(model=config.llm_model)
        # Create necessary directories
        Path(config.embeddings_path).mkdir(exist_ok=True)
        Path(config.knowledge_base_path).mkdir(exist_ok=True)

    def get_vector_db(self, pdf_path: Path) -> Optional[FAISS]:
        """Load the cached FAISS index for this PDF, or build and cache a new one."""
        file_hash = hashlib.md5(pdf_path.read_bytes()).hexdigest()
        embedding_file = Path(self.config.embeddings_path) / f"{file_hash}.faiss"
        if embedding_file.exists():
            try:
                return FAISS.load_local(str(embedding_file), self.processor.embeddings,
                                        allow_dangerous_deserialization=True)
            except Exception as e:
                print(f"Error loading embeddings for {pdf_path.name}: {e}")
                return None
        else:
            vector_db = self.processor.process_pdf(pdf_path)
            if vector_db:
                vector_db.save_local(str(embedding_file))
            return vector_db

    def query(self, query_text: str) -> str:
        """Retrieve the most similar chunks from every PDF and ask the LLM to answer."""
        pdf_files = list(Path(self.config.knowledge_base_path).glob("*.pdf"))
        if not pdf_files:
            return "No PDF files found in knowledge base directory!"
        all_documents = []
        for pdf_path in pdf_files:
            vector_db = self.get_vector_db(pdf_path)
            if vector_db:
                try:
                    result = vector_db.similarity_search(query_text, k=self.config.top_k)
                    all_documents.extend([doc.page_content for doc in result])
                except Exception as e:
                    print(f"Error during similarity search for {pdf_path.name}: {e}")
        if not all_documents:
            return "No relevant documents found or all processing failed!"
        prompt = f"Answer the query '{query_text}' based on the following context:\n{all_documents}"
        return self.llm.invoke(prompt)


def main():
    config = RAGConfig()
    rag_system = RAGSystem(config)
    response = rag_system.query("what is mcp-solver")
    print(response)


if __name__ == "__main__":
    main()