@jackrusher
Created July 26, 2012 14:49

Revisions

  1. jackrusher created this gist Jul 26, 2012.
    15 changes: 15 additions & 0 deletions gistfile1.clj
    @@ -0,0 +1,15 @@
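    (ns pdfbox.core
      (:import [org.apache.pdfbox.pdmodel PDDocument]
               [org.apache.pdfbox.util PDFMarkedContentExtractor TextPosition]
               [java.util ArrayList]))

    ;; Collects every TextPosition found in the PDF at `filename`, in the order
    ;; the pages are processed. Uses PDFBox 1.x calls (getAllPages, findResources).
    (defn parse-pdf [filename]
      (let [pages (.getAllPages (.getDocumentCatalog (PDDocument/load filename)))
            textpool (ArrayList.)
            ;; Override processTextPosition so each TextPosition is recorded.
            extract-text (proxy [PDFMarkedContentExtractor] []
                           (processTextPosition [text]
                             (.add textpool text)))]
        (doseq [page pages]
          ;; Skip pages without a content stream; some-> guards a nil getContents.
          (when-let [contents (some-> page .getContents .getStream)]
            (.processStream extract-text page (.findResources page) contents)))
        textpool))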