@bxie
Last active September 6, 2024 16:46
Script to calculate word count for ICER papers
# ICER word count script (from Jean Salac. Original script by Seth Poulsen). March 2023
# Put the PDF (named paper.pdf), the pdfbox-app JAR, and this script in the same folder.
# Download the latest PDFBox standalone app from https://pdfbox.apache.org/download.html
# and change the JAR filename below to match the version you downloaded. The ExtractText
# subcommand is the PDFBox 2.x syntax; PDFBox 3.x uses a different command-line syntax.
# Set executable permissions with `chmod +x icer_word_count.sh`, then run `./icer_word_count.sh`.
# More info: https://icer2023.acm.org/track/icer-2023-papers#Submission-Instructions
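# Extract the PDF text into paper.txt
java -jar pdfbox-app-2.0.27.jar ExtractText paper.pdf paper.txt
# Drop lines that contain only digits (e.g. page or line numbers)
grep -v -E '^[0-9]+$' paper.txt > paper_no_nums.txt
# Keep everything before the first line containing REFERENCES
sed -n '/REFERENCES/q;p' paper_no_nums.txt > paper_no_refs.txt
# Count the remaining words
wc -w paper_no_refs.txt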
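
To see what the two filters do without running PDFBox, here is a minimal sketch that applies the same `grep` and `sed` expressions to text read from stdin. The function name `count_body_words` and the sample text are made up for illustration and are not part of the original script.

```shell
# Hypothetical helper (not in the original script): count the words in
# already-extracted text from stdin, using the same two filters as above.
count_body_words() {
  grep -v -E '^[0-9]+$' |       # drop lines that contain only digits
    sed -n '/REFERENCES/q;p' |  # print lines until the first one containing REFERENCES
    wc -w |
    tr -d ' '                   # strip the padding some wc implementations add
}

# Made-up sample standing in for paper.txt
sample='Title of the paper
1
Some body text here.
2
REFERENCES
[1] A cited work.'

printf '%s\n' "$sample" | count_body_words   # prints 8
```

The digit-only lines and everything from REFERENCES onward are excluded, so only the eight words of the title and body are counted. Note that the `sed` pattern matches any line containing REFERENCES, so the count stops early if that word appears in all caps in the body.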

bxie commented Mar 24, 2023

Notes on oddities in word counts

  1. Equations are often split into many separate tokens, so they count as more words than you'd expect
  2. In tables, words separated by newlines are counted separately
  3. \ldots (an ellipsis) counts as 3 words, but "…" counts as zero words (h/t Mara K-R for finding this)
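
The ellipsis oddity can be reproduced with `wc -w` alone. This is a rough sketch that assumes the extracted text renders \ldots as three space-separated periods and leaves "…" attached to the preceding word; that assumption is not confirmed by the notes above. `count_words` is a hypothetical helper name.

```shell
# Hypothetical helper: count whitespace-separated words in a string, as wc -w does
count_words() {
  printf '%s\n' "$1" | wc -w | tr -d ' '
}

# Assumed extraction of \ldots: three spaced periods, each counted as a word
count_words 'and so on . . . until done'   # prints 8

# Single ellipsis character attached to "on": it adds no extra word
count_words 'and so on… until done'        # prints 5
```

Both strings contain the same five real words; the spaced periods account for the three extra ones.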
