On GNU-like systems,
sed -f rm-en-wordlist-on-lines.sed corpus.list > output.file
should suffice. On OS X (macOS), however, it may be necessary to pass the script to sed inline,
like this:
sed -e "$(cat rm-en-wordlist-anywhere.sed)" corpus.txt > output.file
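The two invocation styles can be compared side by side. This is a minimal sketch, assuming a POSIX shell; the demo-* file names and the sample words are hypothetical and only serve the illustration.

```shell
# Hypothetical two-command sed script and a tiny corpus, one word per line.
printf '%s\n' 's|^the$||g' 's|^of$||g' > demo-script.sed
printf '%s\n' the theory of sed > demo-corpus.list

# Style 1: let sed read the script from a file.
sed -f demo-script.sed demo-corpus.list > demo-out-f.file

# Style 2: expand the script file into a single -e argument;
# the newlines inside the argument separate the commands.
sed -e "$(cat demo-script.sed)" demo-corpus.list > demo-out-e.file

# Both runs should produce the same result.
cmp demo-out-f.file demo-out-e.file && echo "identical output"
```

If the two outputs differ on a given system, that points to a difference in how its sed reads script files rather than to a problem in the generated commands.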
To remove words from a wordlist in which each word is alone on a line, we generate a sed script of regexes that expect the word between the start and the end of a line, like ^word$:
sed 's/.*/s|^&$||g/' word.list > rm-en-wordlist-on-lines.sed
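To see what the generated script looks like and what it does, here is a small sketch assuming a POSIX shell. The demo-* file names and the sample words are hypothetical stand-ins for word.list and the corpus.

```shell
# Hypothetical inputs: two words to remove, and a corpus with one word per line.
printf '%s\n' the of > demo-word.list
printf '%s\n' the theory of sed > demo-corpus.list

# Turn every word into one substitution command of the form s|^word$||g
sed 's/.*/s|^&$||g/' demo-word.list > demo-on-lines.sed
cat demo-on-lines.sed
# prints:
#   s|^the$||g
#   s|^of$||g

# Apply the generated script. Only whole-line matches are affected,
# so "theory" survives even though it starts with "the".
sed -f demo-on-lines.sed demo-corpus.list > demo-output.file
cat demo-output.file
```

Matched lines are emptied rather than deleted, so the output keeps the same number of lines as the input. Words containing regex metacharacters or the `|` delimiter would need escaping first, which this sketch does not attempt.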
To remove words from a file in which each word is embedded in text, we generate a sed script of regexes that expect the word between two word boundaries, like \<word\>:
sed 's/.*/s|\\<&\\>||g/' word.list > rm-en-wordlist-anywhere.sed
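The word-boundary variant can be tried on a small sample as well. This sketch assumes GNU sed, where \< and \> match the start and end of a word; the demo-* file names and sample text are hypothetical. The generator is written with `\\<` and `\\>` so that each backslash in the replacement is escaped explicitly.

```shell
# Hypothetical inputs: two words to remove, and two lines of running text.
printf '%s\n' the of > demo-word.list
printf '%s\n' 'the theory of sed' 'other words' > demo-corpus.txt

# Turn every word into one substitution command of the form s|\<word\>||g
sed 's/.*/s|\\<&\\>||g/' demo-word.list > demo-anywhere.sed
cat demo-anywhere.sed

# Apply the generated script. "the" and "of" are removed as whole words,
# while "theory" and "other" are left alone.
sed -f demo-anywhere.sed demo-corpus.txt > demo-output.txt
cat demo-output.txt
```

The whitespace around a removed word stays in place, so the first line comes out with a leading space and a double space. A follow-up pass to squeeze spaces may be wanted, depending on how the output is used.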
The Google 10000 list contains the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus. It was found on first20hours/google-10000-english.
The Stop Words list was found on Alir3z4/stop-words.