Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable
#!/bin/sh
# Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable.
# Hacked together using tips from these websites:
# http://www.jlaundry.com/2012/ocr-a-scanned-pdf-with-tesseract/
# http://askubuntu.com/questions/27097/how-to-print-a-regular-file-to-pdf-from-command-line
# Dependencies: pdftk, tesseract, imagemagick, enscript, ps2pdf
# Would be nice to use hocr2pdf instead so that the text lines up with the PDF image.
# http://www.exactcode.com/site/open_source/exactimage/hocr2pdf/
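# Keep a backup: the final step overwrites the input file.
cp "$1" "$1.bak"
# Split into one PDF per page; four-digit numbering keeps the glob in page order past 99 pages.
pdftk "$1" burst output tesspage_%04d.pdf
for file in tesspage_*.pdf
do
    PAGE=$(basename "$file" .pdf)
    # Convert the PDF page into a TIFF file (-density precedes the input so it sets the rasterization resolution)
    convert -density 600 "$file" -monochrome "$PAGE.tif"
    # OCR the TIFF file and save text to output.txt
    tesseract "$PAGE.tif" output
    # Turn the text file output by tesseract into a PDF, then put it in the background of the original page
    enscript output.txt -B -o - | ps2pdf - output.pdf && pdftk "$file" background output.pdf output "new-$file"
    # Clean up
    rm -f output.txt output.pdf
    rm "$file"
    rm "$PAGE.tif"
done
# Reassemble the pages over the original, then remove leftovers so a later run does not pick up stale pages.
pdftk new-tesspage_*.pdf cat output "$1" && rm -f new-tesspage_*.pdf
# pdftk burst also writes a doc_data.txt report into the working directory.
rm -f doc_data.txt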
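
The script assumes every tool in its dependency list is installed, and it only fails partway through if one is missing. Below is a minimal sketch of an up-front check. It assumes the commands are named pdftk, tesseract, convert (from ImageMagick), enscript, and ps2pdf. `check_deps` is a hypothetical helper, not part of the original script.

```shell
#!/bin/sh
# check_deps (hypothetical helper): report every command in the argument
# list that cannot be found on PATH. Returns 0 if all are present and 1
# if at least one is missing.
check_deps() {
    missing=0
    for cmd in "$@"; do
        # `command -v` is the POSIX way to test whether a command resolves.
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing dependency: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_deps pdftk tesseract convert enscript ps2pdf || exit 1` near the top of the script would stop it before any backup or per-page files are created.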