Created May 3, 2024 00:46
Extract text from a PDF with OCR when the text isn't easy to copy (for example, a scanned document)
import pytesseract
from pdf2image import convert_from_path

# Convert the PDF to a list of PIL images, one per page
# (pdf2image needs Poppler installed; pytesseract needs the Tesseract binary)
images = convert_from_path('path_to_pdf.pdf')

# Run Tesseract OCR on each page and write the text to its own file
for i, img in enumerate(images):
    text = pytesseract.image_to_string(img, lang='eng')
    with open(f'page_{i+1}.txt', 'w', encoding='utf-8') as f:
        f.write(text)
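If one combined output file is more convenient than one file per page, the same approach can be wrapped in a small helper. The following is only a sketch and not part of the original gist: the names `join_pages` and `pdf_to_text_file`, the page-marker format, and the `dpi=300` default are all assumptions made for illustration. A higher DPI often helps OCR on small print, at the cost of speed and memory.

```python
from pathlib import Path


def join_pages(page_texts, separator="\n\n"):
    """Join per-page OCR strings into one document, labelling each page.

    Hypothetical helper; the '--- Page N ---' marker is an arbitrary choice.
    """
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return separator.join(parts) + "\n"


def pdf_to_text_file(pdf_path, out_path, lang="eng", dpi=300):
    """OCR every page of pdf_path and write the combined text to out_path.

    Returns the number of pages processed. Hypothetical wrapper around the
    same pdf2image and pytesseract calls used in the snippet above.
    """
    # Imported lazily so join_pages stays usable without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path

    # Render each page to a PIL image at the requested resolution.
    images = convert_from_path(pdf_path, dpi=dpi)
    # Run Tesseract on every page image.
    texts = [pytesseract.image_to_string(img, lang=lang) for img in images]
    Path(out_path).write_text(join_pages(texts), encoding="utf-8")
    return len(images)
```

Calling `pdf_to_text_file('path_to_pdf.pdf', 'output.txt')` would then write every page into `output.txt`, with a marker line before each page's text.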