Created
August 15, 2018 19:38
Me doing some data parse-munge-dumping in a Jupyter Notebook.
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"toc": "true" | |
}, | |
"source": [ | |
"# Table of Contents\n", | |
" <p><div class=\"lev1 toc-item\"><a href=\"#Word-part-Frequencies\" data-toc-modified-id=\"Word-part-Frequencies-1\"><span class=\"toc-item-num\">1 </span>Word-part Frequencies</a></div><div class=\"lev2 toc-item\"><a href=\"#The-List\" data-toc-modified-id=\"The-List-1.1\"><span class=\"toc-item-num\">1.1 </span>The List</a></div><div class=\"lev2 toc-item\"><a href=\"#The-format-of-the-list\" data-toc-modified-id=\"The-format-of-the-list-1.2\"><span class=\"toc-item-num\">1.2 </span>The format of the list</a></div><div class=\"lev3 toc-item\"><a href=\"#We-got-bytes!\" data-toc-modified-id=\"We-got-bytes!-1.2.1\"><span class=\"toc-item-num\">1.2.1 </span>We got bytes!</a></div><div class=\"lev3 toc-item\"><a href=\"#Format-so-far\" data-toc-modified-id=\"Format-so-far-1.2.2\"><span class=\"toc-item-num\">1.2.2 </span>Format so far</a></div><div class=\"lev3 toc-item\"><a href=\"#Rows-with-extra-columns?\" data-toc-modified-id=\"Rows-with-extra-columns?-1.2.3\"><span class=\"toc-item-num\">1.2.3 </span>Rows with extra columns?</a></div><div class=\"lev3 toc-item\"><a href=\"#Format-addition\" data-toc-modified-id=\"Format-addition-1.2.4\"><span class=\"toc-item-num\">1.2.4 </span>Format addition</a></div><div class=\"lev1 toc-item\"><a href=\"#Finish-Parsing\" data-toc-modified-id=\"Finish-Parsing-2\"><span class=\"toc-item-num\">2 </span>Finish Parsing</a></div><div class=\"lev1 toc-item\"><a href=\"#Munge.\" data-toc-modified-id=\"Munge.-3\"><span class=\"toc-item-num\">3 </span>Munge.</a></div><div class=\"lev2 toc-item\"><a href=\"#Accumulating-frequencies\" data-toc-modified-id=\"Accumulating-frequencies-3.1\"><span class=\"toc-item-num\">3.1 </span>Accumulating frequencies</a></div><div class=\"lev2 toc-item\"><a href=\"#Pretty-printing\" data-toc-modified-id=\"Pretty-printing-3.2\"><span class=\"toc-item-num\">3.2 </span>Pretty printing</a></div><div class=\"lev2 toc-item\"><a href=\"#More-frequency-counts\" 
data-toc-modified-id=\"More-frequency-counts-3.3\"><span class=\"toc-item-num\">3.3 </span>More frequency counts</a></div><div class=\"lev1 toc-item\"><a href=\"#Dump\" data-toc-modified-id=\"Dump-4\"><span class=\"toc-item-num\">4 </span>Dump</a></div><div class=\"lev1 toc-item\"><a href=\"#End-note\" data-toc-modified-id=\"End-note-5\"><span class=\"toc-item-num\">5 </span>End note</a></div>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Word-part Frequencies\n", | |
"Or me playing with a word list.\n", | |
"\n", | |
"## The List\n", | |
"I had revisited a certain [Rosetta Code task](http://rosettacode.org/wiki/I_before_E_except_after_C) that used two word lists. A little later, while looking for word frequencies on-line, I happened upon [a page](http://www.oxfordmathcenter.com/drupal7/node/353) that gave not only word frequencies, but also first-letter frequencies; last-letter frequencies; bigram (two-letter combination) frequencies; and trigram frequencies.\n",
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"One of the lists mentioned on RC is from [Word Frequencies in Written and Spoken English: based on the British National Corpus](http://ucrel.lancs.ac.uk/bncfreq/) by Geoffrey Leech, Paul Rayson, and Andrew Wilson; \n", | |
"specifically this [\"word\" list](http://ucrel.lancs.ac.uk/bncfreq/lists/1_2_all_freq.txt).\n", | |
"\n", | |
"## The format of the list\n", | |
"Well, let's just slurp the whole list and take a look at the first few lines:"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[b'\\tWord\\tPoS\\tFreq\\n',\n", | |
" b'\\tthe\\tDet\\t61847\\n',\n", | |
" b'\\tof\\tPrep\\t29391\\n',\n", | |
" b'\\tand\\tConj\\t26817\\n']" | |
] | |
}, | |
"execution_count": 37, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import urllib.request\n",
"\n", | |
"url='http://ucrel.lancs.ac.uk/bncfreq/lists/1_2_all_freq.txt'\n", | |
"\n", | |
"everyline = list(urllib.request.urlopen(url))\n", | |
"everyline[:4]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### We got bytes!\n", | |
"We'll need to `.decode()` those."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"ename": "UnicodeDecodeError", | |
"evalue": "'utf-8' codec can't decode byte 0xa3 in position 1: invalid start byte", | |
"output_type": "error", | |
"traceback": [ | |
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", | |
"\u001b[1;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)", | |
"\u001b[1;32m<ipython-input-38-c3601358a091>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0meveryline\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[0mline\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mline\u001b[0m \u001b[1;32min\u001b[0m \u001b[0murllib\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrequest\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0murlopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", | |
"\u001b[1;32m<ipython-input-38-c3601358a091>\u001b[0m in \u001b[0;36m<listcomp>\u001b[1;34m(.0)\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0meveryline\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[0mline\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mline\u001b[0m \u001b[1;32min\u001b[0m \u001b[0murllib\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrequest\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0murlopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", | |
"\u001b[1;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xa3 in position 1: invalid start byte" | |
] | |
} | |
], | |
"source": [ | |
"everyline = [line.decode() for line in urllib.request.urlopen(url)]\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Okay, a bit of trial and error points me at a successful decode using **latin_1**" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"everyline = [line.decode('latin_1') for line in urllib.request.urlopen(url)]\n", | |
"everyline[:4]" | |
] | |
}, | |
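{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on that guess, the `0xa3` byte that broke the utf-8 decode should be the pound sign in latin_1 - fitting for a British word list:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The byte that utf-8 choked on decodes to the pound sign in latin_1.\n",
"b'\\xa3'.decode('latin_1')"
]
},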
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"print(''.join(everyline[:4]))"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Format so far\n", | |
"* The first row holds the headers: **Word**; **PoS**, which stands for **P**art **o**f **S**peech; and then **Freq**uency.\n",
"* An initial tab, then tab-separated fields in each row.\n", | |
"\n", | |
"Just scrolling in the [word list](http://ucrel.lancs.ac.uk/bncfreq/lists/1_2_all_freq.txt) I noted that a few rows seem to have more columns than others. Let's take a deeper look:\n",
"\n", | |
"### Rows with extra columns?\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"extra = [line for line in everyline if len(line.strip().split()) > 3]\n", | |
"extra[:4]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Oh good. It looks as if the list has a notion of phrases, using spaces to separate the words of a phrase.\n",
"\n", | |
"### Format addition\n", | |
"* The Word field may contain the space-separated words of a common phrase.\n",
"\n", | |
"# Finish Parsing\n", | |
"Let's discard the header line, then separate the fields and convert the Frequency to a number so the data is ready for transformation. To separate the fields I will split on any whitespace character and then deal with the rows that have extra fields, as I suspect there may be hand-edited columns that don't stick to the rules."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"textfields = [line.strip().split() for line in everyline[1:] ]\n", | |
"word2part2freq =[(' '.join(word).lower(), part_of_sentence, int(frequency))\n", | |
" for *word, part_of_sentence, frequency in textfields]\n", | |
"word2part2freq[:4]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Note the use of extended iterable unpacking, which allows me to use `*word` to gather any extra word fields together and then join them into a space-separated string."
] | |
}, | |
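{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal illustration of that starred unpacking, on a made-up row:\n",
"*word, part_of_sentence, frequency = ['lay', 'down', 'Verb', '42']\n",
"' '.join(word), part_of_sentence, frequency"
]
},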
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"[w for w,p,f in word2part2freq if ' ' in w ][:2]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Munge.\n", | |
"Now that the data is parsed into a form we can work with, let's see about creating those overall frequency counts.\n",
"\n",
"I'll print the ten most frequent items in each category, which means Python's Counter objects are going to be needed."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from collections import Counter" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Accumulating frequencies\n", | |
"How to accumulate frequencies?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"Counter([('a', 3), ('b', 4), ('c', 5)])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Well, that didn't work: it counted each (item, count) tuple once rather than using the numbers as counts. Let's try something else..."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"c = Counter()\n", | |
"c['a']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Ah! If the key isn't in the counter it returns a count of zero. I should therefore be able to just accumulate the appropriate frequencies into counters. Let's try one:"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"wordfreq = Counter()\n", | |
"for word, pos, freq in word2part2freq:\n", | |
" wordfreq[word] += freq\n", | |
"wordfreq.most_common(10)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Looks about right. Let's tidy up the output.\n",
"## Pretty printing" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def ppfreq(comment, counter, top=10):\n", | |
" print(comment)\n", | |
" print(' {0:^10} {1}'.format('ITEM', 'FREQ%'))\n", | |
" total = sum(counter.values())\n", | |
" for item, freq in counter.most_common(top):\n", | |
" print(f' {item:^10} {freq * 100. / total:5.2f}%')\n", | |
" print(f' From a total count of {total} amongst {len(counter)} items\\n')\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 40, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Words\n", | |
" ITEM FREQ%\n", | |
" the 7.04%\n", | |
" of 3.34%\n", | |
" and 3.05%\n", | |
" to 2.92%\n", | |
" a 2.46%\n", | |
" in 2.14%\n", | |
" that 1.27%\n", | |
" it 1.24%\n", | |
" is 1.14%\n", | |
" was 1.05%\n", | |
" From a total count of 878717 amongst 7062 items\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"ppfreq('Words', wordfreq)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## More frequency counts" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"posfreq, wordfreq, charfirst, charlast, \\\n", | |
" bigrams, trigrams, charfreq = [Counter() for i in range(7)]\n", | |
"for word, pos, freq in word2part2freq:\n", | |
" wordfreq[word] += freq\n", | |
" posfreq[pos] += freq\n", | |
" charfirst[word[0]] += freq\n", | |
" charlast[word[-1]] += freq\n", | |
" for bg in zip(word, word[1:]):\n", | |
" if ' ' not in bg:\n", | |
" bigrams[''.join(bg)] += freq\n", | |
" for tg in zip(word, word[1:], word[2:]):\n", | |
" if ' ' not in tg:\n", | |
" trigrams[''.join(tg)] += freq\n", | |
" for ch in word:\n", | |
" if ch != ' ':\n", | |
" charfreq[ch] += freq\n" | |
] | |
}, | |
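{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, the bigram and trigram loops above generalize to any *n*; a sketch (not used below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def ngram_freqs(records, n):\n",
"    # Count letter n-grams, skipping any that span a phrase's space.\n",
"    counts = Counter()\n",
"    for word, pos, freq in records:\n",
"        for gram in zip(*(word[i:] for i in range(n))):\n",
"            if ' ' not in gram:\n",
"                counts[''.join(gram)] += freq\n",
"    return counts\n",
"\n",
"assert ngram_freqs(word2part2freq, 2) == bigrams\n",
"assert ngram_freqs(word2part2freq, 3) == trigrams"
]
},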
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Dump\n", | |
"Let's dump the calculated data:"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 44, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Most common words:\n", | |
" ITEM FREQ%\n", | |
" the 7.04%\n", | |
" of 3.34%\n", | |
" and 3.05%\n", | |
" to 2.92%\n", | |
" a 2.46%\n", | |
" in 2.14%\n", | |
" that 1.27%\n", | |
" it 1.24%\n", | |
" is 1.14%\n", | |
" was 1.05%\n", | |
" From a total count of 878717 amongst 7062 items\n", | |
"\n", | |
"Most common letters:\n", | |
" ITEM FREQ%\n", | |
" e 12.44%\n", | |
" t 9.64%\n", | |
" a 7.93%\n", | |
" o 7.83%\n", | |
" i 6.90%\n", | |
" n 6.84%\n", | |
" s 6.14%\n", | |
" h 5.82%\n", | |
" r 5.77%\n", | |
" l 3.83%\n", | |
" From a total count of 3794407 amongst 52 items\n", | |
"\n", | |
"Most common Parts of Speech:\n",
" ITEM FREQ%\n", | |
" NoC 19.43%\n", | |
" Verb 16.98%\n", | |
" Prep 12.78%\n", | |
" Det 11.75%\n", | |
" Pron 6.69%\n", | |
" Conj 6.44%\n", | |
" Adv 6.25%\n", | |
" Adj 6.24%\n", | |
" DetP 3.43%\n", | |
" Inf 1.87%\n", | |
" From a total count of 878717 amongst 23 items\n", | |
"\n", | |
"Most common first letters of words:\n", | |
" ITEM FREQ%\n", | |
" t 16.61%\n", | |
" a 11.74%\n", | |
" i 7.65%\n", | |
" o 7.01%\n", | |
" w 6.56%\n", | |
" s 6.23%\n", | |
" h 4.80%\n", | |
" b 4.51%\n", | |
" c 4.18%\n", | |
" f 3.72%\n", | |
" From a total count of 878717 amongst 41 items\n", | |
"\n", | |
"Most common last letters of 'words':\n", | |
" ITEM FREQ%\n", | |
" e 19.02%\n", | |
" s 11.58%\n", | |
" t 10.40%\n", | |
" d 9.56%\n", | |
" n 8.34%\n", | |
" y 5.70%\n", | |
" r 5.41%\n", | |
" o 4.84%\n", | |
" f 4.12%\n", | |
" l 3.08%\n", | |
" From a total count of 878717 amongst 45 items\n", | |
"\n", | |
"Most common bigrams:\n", | |
" ITEM FREQ%\n", | |
" th 4.22%\n", | |
" he 3.73%\n", | |
" in 2.35%\n", | |
" er 2.04%\n", | |
" an 2.02%\n", | |
" re 1.86%\n", | |
" on 1.57%\n", | |
" nd 1.45%\n", | |
" at 1.41%\n", | |
" en 1.39%\n", | |
" From a total count of 2902237 amongst 642 items\n", | |
"\n", | |
"Most common trigrams:\n", | |
" ITEM FREQ%\n", | |
" the 4.07%\n", | |
" and 1.49%\n", | |
" ing 1.03%\n", | |
" ion 0.76%\n", | |
" her 0.75%\n", | |
" ent 0.72%\n", | |
" hat 0.70%\n", | |
" for 0.66%\n", | |
" tha 0.65%\n", | |
" tio 0.60%\n", | |
" From a total count of 2048446 amongst 3268 items\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"ppfreq('Most common words:', wordfreq)\n", | |
"ppfreq('Most common letters:', charfreq)\n", | |
"ppfreq('Most common Parts of Speech:', posfreq)\n",
"ppfreq('Most common first letters of words:', charfirst)\n", | |
"ppfreq('Most common last letters of \\'words\\':', charlast)\n", | |
"ppfreq('Most common bigrams:', bigrams)\n", | |
"ppfreq('Most common trigrams:', trigrams)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# End note\n", | |
"One could dig deeper into other peculiarities of the data, such as finding what all the other \"letters\" are - it turns out that non-space, non-tab characters including numbers, brackets, the pound sign, etc. appear at lower frequencies.\n",
"But that is a lot like real-world, messy data!\n",
"\n", | |
"END." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python [default]", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.6" | |
}, | |
"nav_menu": {}, | |
"toc": { | |
"navigate_menu": true, | |
"number_sections": true, | |
"sideBar": true, | |
"threshold": 6, | |
"toc_cell": true, | |
"toc_section_display": "block", | |
"toc_window_display": false | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |