Created
December 4, 2014 18:15
Get count of all French unigrams in the Google Books corpus
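#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function

from google_ngram_downloader import readline_google_store

# Stream every French 1-gram record from the Google Books corpus.
all_records = readline_google_store(ngram_len=1, lang="fre")

# Seeding with the column names makes the first flush print the header row.
this_ngram = "WORDS"
this_count = "COUNT"

for (fname, url, records) in all_records:
    for r in records:
        if r.year >= 1990:
            if r.ngram == this_ngram:
                this_count += r.match_count
            else:
                # A new ngram begins: print the finished total, then reset.
                print(u'{}\t{}'.format(this_ngram, this_count))
                this_ngram = r.ngram
                this_count = r.match_count

# Print the total for the final ngram, which the loop itself never flushes.
print(u'{}\t{}'.format(this_ngram, this_count))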
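This script prints tab-separated values to standard output (redirect it to a file to save the result). The first column is the ngram, and the second column is that ngram's total match count over all years from 1990 onward.
Ngrams are enumerated in alphabetical order, so all records for a given ngram are adjacent. That lets us stream through the entire corpus while keeping only the current ngram and its running count, without building up large data structures in memory.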
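The same streaming idea can be sketched with `itertools.groupby`, which also relies on records for one ngram being adjacent. This is only an illustration, not the script above: `Record` and `count_since` are hypothetical names, and `Record` is a stand-in that models just the three fields the script reads (`ngram`, `year`, `match_count`), so the sketch runs without downloading anything.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for the downloader's records; only the fields
# used by the script are modelled.
Record = namedtuple("Record", ["ngram", "year", "match_count"])


def count_since(records, min_year=1990):
    """Yield (ngram, total match_count) over records with year >= min_year.

    Assumes all records for one ngram are adjacent in the input, so only
    one group is held in flight at a time.
    """
    recent = (r for r in records if r.year >= min_year)
    for ngram, group in groupby(recent, key=lambda r: r.ngram):
        yield ngram, sum(r.match_count for r in group)


# Made-up sample data, purely for illustration.
sample = [
    Record("chat", 1985, 7),
    Record("chat", 1995, 10),
    Record("chat", 2001, 5),
    Record("chien", 1999, 3),
]
totals = list(count_since(sample))
for ngram, count in totals:
    print(u"{}\t{}".format(ngram, count))
```

Because `groupby` yields each group as soon as the key changes, the last ngram is emitted without a separate flush after the loop.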