@benmarwick
Created January 17, 2017 10:59
Convert Microsoft Word documents (.doc, MS Word 97) into plain text, using R and LibreOffice
# Assuming the working directory is the one containing the documents,
# get a list of the .doc and .docx files
the_docs <- list.files(pattern = "\\.docx?$")
# replace spaces in the file names with underscores
file.rename(the_docs, gsub(" ", "_", the_docs))
# get the file names again, now without spaces
the_docs <- list.files(pattern = "\\.docx?$")
# set the location of the libreoffice program
libreoffice <- "C:\\Program Files (x86)\\LibreOffice 5\\program\\soffice.exe"
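# The path above is specific to one Windows installation. As a hedged alternative
# (an assumption, not part of the original script), look for soffice on the PATH
# and use it when found; otherwise the hard-coded path above is kept.
# `soffice_on_path` is a name introduced here for illustration only.
soffice_on_path <- Sys.which("soffice")
if (nzchar(soffice_on_path)) libreoffice <- unname(soffice_on_path)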
# loop over each file and convert to txt, output is in the same directory
for (i in seq_along(the_docs)) {
  # construct the command...
  x <- paste0("\"",
              libreoffice,
              # The part after "txt:" names the LibreOffice output filter; "Text" writes plain text.
              # The input format (.doc or .docx) is detected automatically.
              # See https://cgit.freedesktop.org/libreoffice/core/tree/filter/source/config/fragments/filters for
              # other filter names.
              "\" --headless --convert-to txt:Text",
              " --outdir",
              " \"",
              getwd(),
              "\" ",
              "\"",
              file.path(getwd(), the_docs[i]),
              "\"")
  # run the command
  system(x)
}
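
# A minimal sketch (not part of the original gist) of reading the converted
# files back into R once the loop has finished. `read_txt_files` is a
# hypothetical helper name. It assumes every .txt file in the directory is one
# you want, and that the files are in an encoding readLines() can handle on
# your system; pass an `encoding` argument to readLines() if they are not.
read_txt_files <- function(dir = ".") {
  # full paths of all .txt files in `dir`
  txt_files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  # one character vector of lines per file
  texts <- lapply(txt_files, readLines, warn = FALSE)
  # name each element after its file
  names(texts) <- basename(txt_files)
  texts
}
# usage: a named list with one element per converted document
the_texts <- read_txt_files()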