@kenmlee
Created June 13, 2019 06:14
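// Config, ConfigBool and TikaImageHelper are project classes; their imports are omitted here.
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.util.List;
import javax.imageio.ImageIO;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.tika.sax.XHTMLContentHandler;

// Runs OCR over every picture embedded in a .doc file and appends the recognized text to xhtml.
private void extractImageText(XHTMLContentHandler xhtml, HWPFDocument document) {
    if (!Config.inst().getProp(ConfigBool.ENABLE_IMAGE_OCR)) {
        return; // OCR is disabled in the configuration
    }
    TikaImageHelper helper = new TikaImageHelper(metadata);
    try {
        List<Picture> pictures = document.getPicturesTable().getAllPictures();
        for (Picture picture : pictures) {
            ByteArrayInputStream imageData = new ByteArrayInputStream(picture.getContent());
            // ImageIO.read returns null when no registered reader can decode the data
            BufferedImage image = ImageIO.read(imageData);
            if (image != null) {
                helper.addImage(image);
            }
        }
        // TODO: find out page number
        helper.addTextToHandler(xhtml);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // helper is assigned before the try block, so no null check is needed
        helper.close();
    }
}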
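
The decoding step in the loop above can be tried on its own, without POI or Tika. What follows is a minimal sketch, not part of the original gist: the class name PictureDecoder and the methods decodeAll and encodePng are hypothetical names introduced only for illustration. It assumes the picture bytes are already available as byte arrays (as Picture.getContent() returns them) and shows how entries that ImageIO cannot decode might be skipped rather than passed on as null.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

// Hypothetical helper, for illustration only.
public class PictureDecoder {

    // Decodes each byte array into an image; entries that cannot be decoded are skipped.
    public static List<BufferedImage> decodeAll(List<byte[]> rawPictures) {
        List<BufferedImage> images = new ArrayList<>();
        for (byte[] raw : rawPictures) {
            try {
                // ImageIO.read returns null when no registered reader claims the data
                BufferedImage image = ImageIO.read(new ByteArrayInputStream(raw));
                if (image != null) {
                    images.add(image);
                }
            } catch (IOException e) {
                // A recognized but corrupt image: skip it and keep the rest
            }
        }
        return images;
    }

    // Encodes an image as PNG bytes; used here only to build sample input.
    public static byte[] encodePng(BufferedImage image) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(image, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Skipping undecodable entries keeps one bad picture from aborting OCR for the whole document; whether that is the right trade-off depends on how the caller wants to report failures.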