To process a PDF document into organized section text files, first extract the PDF to plain
text with `pdftotext`, which writes all of the document's content to a single text file with
formatting stripped away. Then analyze the extracted text to identify the pattern used for
section headers (e.g., "1. Title", "Chapter 1: Title", "# Header"), and run the
`pdf_splitter_generic.py` script with the matching regex pattern and capture group.
For example:

    PDF_FILE="my.pdf"
    EXTRACTED_TEXT="extracted.txt"
    OUTPUT_DIR="sections"

    # Extract the PDF to plain text
    pdftotext "$PDF_FILE" "$EXTRACTED_TEXT"

    # Split the extracted text into one file per section
    python3 pdf_splitter_generic.py \
        --input "$EXTRACTED_TEXT" \
        --output "$OUTPUT_DIR" \
        --pattern "^(\d+)\.\s+(.+)$" \
        --group 2 \
        --verbose

The script reads the text file line by line, identifies section headers that match your regex pattern, takes the selected capture group as the heading, and writes each section to a numbered text file in the output directory. The result is a set of per-section files that are easy to process further.
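
The line-by-line splitting described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual source of `pdf_splitter_generic.py`: the function names `split_sections` and `write_sections`, the choice to drop text before the first header, and the `NNN_heading.txt` file naming are all hypothetical.

```python
import re
from pathlib import Path


def split_sections(text, pattern, group=0):
    """Split text into (heading, body) pairs at lines matching pattern.

    Hypothetical helper: lines before the first header are dropped.
    """
    header_re = re.compile(pattern)
    sections = []
    for line in text.splitlines():
        match = header_re.match(line)
        if match:
            # A header line starts a new section
            sections.append((match.group(group), []))
        elif sections:
            # Any other line belongs to the most recent section
            sections[-1][1].append(line)
    return [(heading, "\n".join(body)) for heading, body in sections]


def write_sections(sections, output_dir):
    """Write each section to a numbered file (assumed naming scheme)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (heading, body) in enumerate(sections, start=1):
        # Replace characters that are awkward in file names
        safe = re.sub(r"[^A-Za-z0-9]+", "_", heading).strip("_")
        path = out / f"{index:03d}_{safe}.txt"
        path.write_text(f"{heading}\n\n{body}\n", encoding="utf-8")
        paths.append(path)
    return paths


sample = "1. Introduction\nSome text.\n2. Methods\nMore text."
sections = split_sections(sample, r"^(\d+)\.\s+(.+)$", group=2)
print([heading for heading, _ in sections])  # ['Introduction', 'Methods']
```

Calling `write_sections(sections, "sections")` would then create one file per section in the `sections` directory.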

    python3 pdf_splitter_generic.py \
        --input /path/to/extracted.txt \
        --output ./output_sections \
        --pattern "^(\d+)\.\s+(.+)$" \
        --group 2

| Argument | Short | Required | Description |
|---|---|---|---|
| `--input` | `-i` | ✅ Yes | Path to extracted PDF text file |
| `--output` | `-o` | ✅ Yes | Output directory for section files |
| `--pattern` | `-p` | ✅ Yes | Regex pattern to match headers |
| `--group` | `-g` | ❌ No | Regex capture group to use (default: 0) |
| `--encoding` | `-e` | ❌ No | File encoding (default: utf-8) |
| `--verbose` | `-v` | ❌ No | Verbose output |

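
For readers who want to see how such an interface could be declared, here is a sketch of an `argparse` parser that mirrors the table. It is an assumption about how the options might be wired up, not the script's real code; the function name `build_parser` and the description string are invented for the example.

```python
import argparse


def build_parser():
    """Build a parser mirroring the argument table (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Split extracted PDF text into section files."
    )
    parser.add_argument("-i", "--input", required=True,
                        help="Path to extracted PDF text file")
    parser.add_argument("-o", "--output", required=True,
                        help="Output directory for section files")
    parser.add_argument("-p", "--pattern", required=True,
                        help="Regex pattern to match headers")
    parser.add_argument("-g", "--group", type=int, default=0,
                        help="Regex capture group to use (default: 0)")
    parser.add_argument("-e", "--encoding", default="utf-8",
                        help="File encoding (default: utf-8)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Verbose output")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration
args = build_parser().parse_args(
    ["-i", "extracted.txt", "-o", "sections", "-p", r"^(\d+)\.\s+(.+)$", "-g", "2"]
)
print(args.group, args.encoding, args.verbose)  # 2 utf-8 False
```

Declaring `--group` with `type=int` means the value can be passed straight to `match.group()` without further conversion.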

Example: "1. Introduction"

    python3 pdf_splitter_generic.py \
        --input pdf.txt \
        --output ./sections \
        --pattern "^(\d+)\.\s+(.+)$" \
        --group 2

Regex breakdown:

- `^` = Start of line
- `(\d+)` = Group 1: One or more digits
- `\.` = Literal period
- `\s+` = One or more whitespace characters
- `(.+)$` = Group 2: Rest of line (heading)

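
To see what this pattern accepts and rejects, the following sketch runs it against a few sample lines. The sample lines are made up for illustration. It shows that a header must begin at the very start of the line and needs whitespace directly after the period.

```python
import re

header_re = re.compile(r"^(\d+)\.\s+(.+)$")

samples = [
    "1. Introduction",      # matches
    "12. Related Work",     # matches: \d+ allows several digits
    "1.1 Background",       # no match: "1." must be followed by whitespace
    "  3. Indented title",  # no match: ^ anchors the digits to line start
    "Introduction",         # no match: no leading number
]

for line in samples:
    match = header_re.match(line)
    result = match.groups() if match else None
    print(f"{line!r:24} -> {result}")
```

If your extracted text indents its headers, one option is to allow leading whitespace with `^\s*(\d+)\.\s+(.+)$`.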

Example: "Chapter 1: Introduction"

    python3 pdf_splitter_generic.py \
        --input pdf.txt \
        --output ./sections \
        --pattern "^Chapter\s+(\d+):\s+(.+)$" \
        --group 2

Example: "# Main Header", "## Subheader", "### Sub-subheader"

    python3 pdf_splitter_generic.py \
        --input pdf.txt \
        --output ./sections \
        --pattern "^(#+)\s+(.+)$" \
        --group 2

Example: "INTRODUCTION", "METHODOLOGY"

    python3 pdf_splitter_generic.py \
        --input pdf.txt \
        --output ./sections \
        --pattern "^([A-Z][A-Z\s]+)$" \
        --group 1

Example: "Introduction and Background"

    python3 pdf_splitter_generic.py \
        --input pdf.txt \
        --output ./sections \
        --pattern "^([A-Z][A-Za-z\s]+)$" \
        --group 1

Example: "1.1 Introduction", "1.2 Background"

    python3 pdf_splitter_generic.py \
        --input pdf.txt \
        --output ./sections \
        --pattern "^(\d+\.\d+)\s+(.+)$" \
        --group 2

`--group 0` uses the entire matched text as the heading.

    --pattern "^(\d+)\.\s+(.+)$" --group 0
    # Heading: "1. Introduction" (full match)

`--group 1` uses the first capture group (the first `(...)` in the pattern).

    --pattern "^(\d+)\.\s+(.+)$" --group 1
    # Heading: "1" (just the number)

`--group 2` uses the second capture group (the second `(...)` in the pattern).

    --pattern "^(\d+)\.\s+(.+)$" --group 2
    # Heading: "Introduction" (just the title)

Complete invocations for several document types:

    # Fixed all-caps section names (e.g. a research paper)
    python3 pdf_splitter_generic.py \
        --input paper.txt \
        --output ./sections \
        --pattern "^(ABSTRACT|INTRODUCTION|METHODOLOGY|RESULTS|DISCUSSION|CONCLUSION|REFERENCES)$" \
        --group 1 \
        --verbose

    # Chapter headings, keeping the full line as the heading
    python3 pdf_splitter_generic.py \
        --input textbook.txt \
        --output ./chapters \
        --pattern "^Chapter\s+(\d+):\s+(.+)$" \
        --group 0 \
        --verbose

    # Markdown-style headers, levels 1 to 3
    python3 pdf_splitter_generic.py \
        --input docs.txt \
        --output ./docs \
        --pattern "^(#{1,3})\s+(.+)$" \
        --group 2 \
        --verbose

    # Decimal-numbered headings, keeping the number in the heading
    python3 pdf_splitter_generic.py \
        --input report.txt \
        --output ./sections \
        --pattern "^(\d+\.\d+\s+[A-Z][A-Za-z\s]+)$" \
        --group 1 \
        --verbose

    # Case-insensitive chapter headings, using the (?i) flag
    python3 pdf_splitter_generic.py \
        --input pdf.txt \
        --output ./sections \
        --pattern "(?i)^(chapter\s+\d+:.+)$" \
        --group 1

    # Headings with an optional leading number
    python3 pdf_splitter_generic.py \
        --input pdf.txt \
        --output ./sections \
        --pattern "^(\d*\.?\s*[A-Z][A-Za-z\s]+)$" \
        --group 1

    # Either Markdown-style headers or single all-caps words
    python3 pdf_splitter_generic.py \
        --input pdf.txt \
        --output ./sections \
        --pattern "^(#{1,3}\s+.+|^[A-Z]+\s*$)" \
        --group 0

To test a pattern in Python before running the splitter:

    import re

    # Your pattern
    pattern = r"^(\d+)\.\s+(.+)$"

    # Test line
    line = "1. Introduction"

    match = re.match(pattern, line)
    if match:
        print(f"Match! Groups: {match.groups()}")
        print(f"Group 0 (full): {match.group(0)}")
        print(f"Group 1: {match.group(1)}")
        print(f"Group 2: {match.group(2)}")
    else:
        print("No match")

| Operation | Time |
|---|---|
| Read file | ~100ms |
| Process lines | ~200ms |
| Write output | ~150ms |
| Total | ~450ms |

For larger PDFs (100K+ lines), expect processing time to scale linearly with the number of lines.
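
To check how matching time grows with input size on your own machine, a rough measurement like the one below can help. It is only a sketch: it times the regex pass over synthetic lines, not file I/O or the real script, and the helper names `count_headers` and `make_lines` and the line counts are arbitrary choices for the example.

```python
import re
import time

header_re = re.compile(r"^(\d+)\.\s+(.+)$")


def count_headers(lines):
    """Count lines matching the header pattern (the per-line work)."""
    return sum(1 for line in lines if header_re.match(line))


def make_lines(n):
    """Generate n synthetic lines with one header every 100 lines."""
    return [
        f"{i // 100 + 1}. Section" if i % 100 == 0 else "body text line"
        for i in range(n)
    ]


for n in (1_000, 10_000, 100_000):
    lines = make_lines(n)
    start = time.perf_counter()
    found = count_headers(lines)
    elapsed = time.perf_counter() - start
    print(f"{n:>7} lines: {found:>5} headers in {elapsed * 1000:.1f} ms")
```

If the reported times grow by about a factor of ten between rows, the matching step is scaling linearly on your data.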