@muziejus
Created May 29, 2018 00:16
Alice Analysis
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> What are the four most frequent words used in this version of Alice in Wonderland? How often does the word “Alice” appear on either side of an adjective, and which adjectives are they? Use any method, language, or library you like. Send the results alongside a brief interpretation. (Hint: this uses something called “Part of Speech tagging” or “PoS tagging”, and NLTK or spaCy are great resources. Stanford’s PoS Tagger and TreeTagger may also be useful if you don’t already know python.) Bonus points for a more sophisticated analysis of ways Alice is described in the book."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two tasks here: a basic word-frequency count over the text as a whole, and then an analysis informed by POS tagging.\n",
"\n",
"# First analysis\n",
"\n",
"I start by pulling in the Project Gutenberg text, removing the metadata, and running a frequency analysis."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from urllib import request\n",
"import nltk\n",
"from nltk.tokenize import word_tokenize\n",
"alice_url = \"http://www.gutenberg.org/files/11/11-0.txt\"\n",
"text_from_web = request.urlopen(alice_url).read().decode('utf8')\n",
"text_start = text_from_web.find(\"CHAPTER I\")\n",
"text_end = text_from_web.find(\"End of Project Gutenberg\")\n",
"raw_text = text_from_web[text_start:text_end]\n",
"tokens = word_tokenize(raw_text)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(',', 2418), ('’', 1755), ('the', 1522), ('‘', 1115)]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nltk.FreqDist(tokens).most_common(4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The four most common tokens, then, include three non-words. One solution would be to simply list the top 50 _tokens_ and pick out the four most common _words_ by eye, but that sidesteps the real task: deciding, as a human, what a “word” is, and then telling the computer. For the purposes of this exercise, and considering that the tokenizer splits something like “Alice’s” into three tokens, a word should be something made up exclusively of letters of the alphabet."
]
},
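{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat, as a hedged aside rather than part of the original analysis: the regular-expression class `\\w` used in the next cell also matches digits and underscores, so it is slightly broader than “letters only.” A minimal sketch of the stricter reading uses `str.isalpha()`. The names `sample_tokens` and `is_word` are made up for illustration, and the tokens are invented rather than taken from the tokenizer output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# Invented tokens for illustration only.\n",
"sample_tokens = ['Alice', '’', 's', 'THAT', '1865', 'white_rabbit', ',']\n",
"\n",
"def is_word(token):\n",
"    \"\"\"Return True when the token consists solely of letters.\"\"\"\n",
"    return token.isalpha()\n",
"\n",
"w_pattern = re.compile(r'^\\w+$')\n",
"# \\w keeps '1865' and 'white_rabbit'; isalpha() drops them.\n",
"print([t for t in sample_tokens if w_pattern.match(t)])\n",
"print([t for t in sample_tokens if is_word(t)])"
]
},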
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('the', 1522), ('and', 796), ('to', 720), ('a', 614)]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"word_pattern = re.compile(r\"^\\w+$\")\n",
"just_words = [word for word in tokens if word_pattern.match(word)]\n",
"nltk.FreqDist(just_words).most_common(4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These results are… unsurprising. Let’s try once more by using stopwords."
]
},
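{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat for this step, offered as a sketch and not as part of the original analysis: NLTK’s English stopword list is all lowercase, so a case-sensitive membership test lets capitalized forms such as “I” or “The” through. A minimal case-insensitive filter might look like the following. `sample_tokens` and `sample_stop_words` are small made-up stand-ins so the cell runs without the corpus or any NLTK data, and `count_content_words` is a hypothetical helper name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Made-up stand-ins for the real token list and stopword set.\n",
"sample_tokens = ['I', 'said', 'the', 'Alice', 'The', 'little', 'i', 'Alice', ',', '’']\n",
"sample_stop_words = {'i', 'the', 'and', 'to', 'a'}\n",
"\n",
"def count_content_words(tokens, stop_words):\n",
"    \"\"\"Count alphabetic tokens whose lowercase form is not a stopword.\"\"\"\n",
"    kept = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]\n",
"    return Counter(kept)\n",
"\n",
"counts = count_content_words(sample_tokens, sample_stop_words)\n",
"print(counts.most_common(2))"
]
},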
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('I', 543), ('said', 456), ('Alice', 395), ('little', 125)]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nltk.corpus import stopwords\n",
"stop_words = set(stopwords.words(\"english\"))\n",
"unstopped_words = [word for word in just_words if word not in stop_words]\n",
"nltk.FreqDist(unstopped_words).most_common(4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This list is a bit more useful, and the presence of “little” foreshadows the second analysis.\n",
"\n",
"# Second analysis\n",
"\n",
"For this analysis, I start by POS-tagging the original list of tokens, without filtering out punctuation or stopwords. The POS-tagger’s tags for adjectives all begin with “JJ”, so I’ll build up a list of all such tagged words that come immediately before or after an instance of “Alice.” Looking after “Alice” seems like overkill, because English doesn’t really work that way. It also means running a sanity check ahead of time to make sure that a construction like “Alice {adjective} Alice” never occurs in the text, since a single adjective would otherwise be counted twice."
]
},
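{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a toy illustration of the neighbour lookup described above, here is a minimal sketch on a hand-tagged fragment. It is not the notebook’s own implementation: `toy_tagged` and `adjacent_adjectives` are hypothetical names, and the tags were assigned by hand rather than produced by the tagger. Indexing by position, instead of zipping three offset copies of the list, also covers a term that happens to be the first or last token."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hand-tagged toy fragment; the tags are illustrative, not tagger output.\n",
"toy_tagged = [('thought', 'VBD'), ('poor', 'JJ'), ('Alice', 'NNP'),\n",
"              (',', ','), ('the', 'DT'), ('wise', 'JJ'),\n",
"              ('little', 'JJ'), ('Alice', 'NNP'), ('was', 'VBD')]\n",
"\n",
"def adjacent_adjectives(tagged, term):\n",
"    \"\"\"Collect JJ-tagged tokens immediately before or after each `term`.\"\"\"\n",
"    found = []\n",
"    for i, (token, _) in enumerate(tagged):\n",
"        if token != term:\n",
"            continue\n",
"        for j in (i - 1, i + 1):\n",
"            if 0 <= j < len(tagged) and tagged[j][1].startswith('JJ'):\n",
"                found.append(tagged[j][0])\n",
"    return found\n",
"\n",
"print(adjacent_adjectives(toy_tagged, 'Alice'))"
]
},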
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"pos_tagged = nltk.pos_tag(tokens)\n",
"def sanity_check(sanity_words):\n",
"    for prev_word, word, next_word in zip(sanity_words,\n",
"                                          sanity_words[1:],\n",
"                                          sanity_words[2:]):\n",
"        if \"JJ\" in word[1]:\n",
"            if prev_word[0] == \"Alice\" and next_word[0] == \"Alice\":\n",
"                return \"Found “Alice {adj} Alice” for %s\" % word[0]\n",
"sanity_check(pos_tagged)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As suspected, “Alice {adjective} Alice” never occurs in the text, which means I can build a list of adjectives one “Alice” at a time. That is, I can look ahead and back at once and not worry about duplicating an adjective."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('’', 28),\n",
" ('poor', 10),\n",
" ('thought', 6),\n",
" ('little', 3),\n",
" ('upon', 2),\n",
" ('Rabbit-Hole', 1),\n",
" ('noticed', 1),\n",
" ('foolish', 1),\n",
" ('pleaded', 1),\n",
" ('interrupted', 1)]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def commonest_adjectives(tagged_tokens, term):\n",
"    adjective_list = []\n",
"    for prev_word, word, next_word in zip(tagged_tokens,\n",
"                                          tagged_tokens[1:],\n",
"                                          tagged_tokens[2:]):\n",
"        if word[0] == term:\n",
"            if \"JJ\" in prev_word[1]:\n",
"                adjective_list.append(prev_word[0])\n",
"            if \"JJ\" in next_word[1]:\n",
"                adjective_list.append(next_word[0])\n",
"    return nltk.FreqDist(adjective_list).most_common()\n",
"commonest_adjectives(pos_tagged, \"Alice\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These results are nonsense, but they also indicate a fundamental problem with the tagger and with the original text. Namely, that top hit, “’,” should not be marked as a “JJ,” but, rather, as “POS.” The tagger is getting tripped up by the typography, then, which uses the typographically correct “‘” and “’” throughout instead of “'”. So let’s try again…"
]
},
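{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fix this calls for is to swap the curly quotes for straight ones before tokenizing. As a minimal sketch (an alternative to chained `replace` calls, and not necessarily how the next cell does it), `str.translate` handles every mapping in a single pass. `QUOTE_MAP` and `straighten_quotes` are hypothetical names, and the double-quote entries are an assumption added for completeness."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Map curly quotation marks to straight ASCII ones.\n",
"QUOTE_MAP = str.maketrans({'‘': \"'\", '’': \"'\", '“': '\"', '”': '\"'})\n",
"\n",
"def straighten_quotes(text):\n",
"    \"\"\"Return text with curly quotes replaced by straight ones.\"\"\"\n",
"    return text.translate(QUOTE_MAP)\n",
"\n",
"print(straighten_quotes('‘Oh, you foolish Alice!’ she answered herself.'))"
]
},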
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('thought', 26),\n",
" ('poor', 10),\n",
" ('cried', 7),\n",
" ('little', 3),\n",
" ('exclaimed', 3),\n",
" ('Rabbit-Hole', 1),\n",
" ('noticed', 1),\n",
" ('foolish', 1),\n",
" ('replied', 1),\n",
" ('pleaded', 1),\n",
" ('upon', 1),\n",
" ('shouted', 1),\n",
" ('interrupted', 1),\n",
" ('inquired', 1)]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw_dumb_quotes = raw_text.replace(\"‘\", \"'\").replace(\"’\", \"'\")\n",
"dumb_quote_pos_tagged = nltk.pos_tag(word_tokenize(raw_dumb_quotes))\n",
"commonest_adjectives(dumb_quote_pos_tagged, \"Alice\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At least a punctuation mark is no longer considered an adjective, but here “thought” is most likely _not_ an adjective. The tagger seems to be reading past-tense verbs as past participles used adjectivally. My suspicion is that this happens only in a construction like “‘blah blah,’ thought Alice.” English is SVO, after all, and that kind of construction is OVS, which might trip up the tagger. In fact, as noted above, I doubt there are many adjectives following Alice."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adjectives_after_alice = []\n",
"for alice, adjective in zip(dumb_quote_pos_tagged,\n",
"                            dumb_quote_pos_tagged[1:]):\n",
"    if \"JJ\" in adjective[1] and alice[0] == \"Alice\":\n",
"        adjectives_after_alice.append(adjective[0])\n",
"len(adjectives_after_alice)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As suspected. English positions adjectives, typically, in front of the modified noun or pronoun. But the fact that so few of the words in the list (“poor,” “little,” and “foolish”) seem to be adjectives makes me wonder if the tagger isn’t missing adjectives. Given that the corpus here is a shortish novel and that a “sophisticated analysis” of the descriptions of Alice is also called for, maybe it makes sense just to read the thing with a highlighter.\n",
"\n",
"(comes back after reading the novel)\n",
"\n",
"My list is: “poor Alice,” “wise little Alice,” “poor Alice,” “poor Alice,” “Poor Alice,” “poor Alice,” “poor Alice,” “poor Alice”, “poor Alice,” “foolish Alice,” “ill.’ Alice,” “poor Alice,” “poor Alice,” “anxious.) Alice,” “red. Alice,” “Alice’s great,” “poor Alice,” “twelfth?’ Alice,” “unimportant.’ Alice,” “Alice dear,” “little Alice,” “little Alice.”\n",
"\n",
"First, then, I caught an instance of “poor Alice” missed by the tagger, because the sentence is “Poor Alice!” What does the tagger say for that?"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('Poor', 'NNP'), ('Alice', 'NNP')]\n"
]
}
],
"source": [
"for poor, alice in zip(dumb_quote_pos_tagged,\n",
"                       dumb_quote_pos_tagged[1:]):\n",
"    if poor[0] == \"Poor\" and alice[0] == \"Alice\":\n",
"        print([poor, alice])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tagger saw that instance of “poor” as a proper, singular noun. A few of my other noted words are edge cases where a certain amount of punctuation (including paragraph breaks) sits between the adjective and Alice. Yet “Alice dear” is a different edge case."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('Alice', 'NNP'), ('dear', 'NN')]\n",
"[('dear', 'JJ'), ('Alice', 'NNP')]\n"
]
}
],
"source": [
"print(nltk.pos_tag([\"Alice\", \"dear\"]))\n",
"print(nltk.pos_tag([\"dear\", \"Alice\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tagger presumably reads the “dear” in “Alice dear” as a noun. I do not completely disagree with it, but it’s something I noted in reading. Effectively, then, the list of adjectives right before or after Alice is:\n",
"\n",
"* “poor,” 11 times\n",
"* “little,” 3 times\n",
"* “foolish,” once\n",
"\n",
"Let’s look at those in context."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"would go through , ' thought poor Alice , 'it would be of very; me , '\n",
"but the wise little Alice was not going to do THAT; once ; but , alas\n",
"for poor Alice ! when she got to the; no use now , ' thought poor\n",
"Alice , 'to pretend to be two; the right words , ' said poor Alice ,\n",
"and her eyes filled with; n't mean it ! ' pleaded poor Alice . 'But\n",
"you 're so easily; any more ! ' And here poor Alice began to cry again\n",
", for; pleasanter at home , ' thought poor Alice , 'when one wasn't\n",
"always growing; THAT ! ' 'Oh , you foolish Alice ! ' she answered\n",
"herself .; used to it ! ' pleaded poor Alice in a piteous tone . And;\n",
". ' This answer so confused poor Alice , that she let the Dormouse;\n",
"both sat silent and looked at poor Alice , who felt ready to sink;\n",
"setting sun , and thinking of little Alice and all her wonderful\n",
"Adventures ,; -- First , she dreamed of little Alice herself , and\n",
"once again the\n"
]
}
],
"source": [
"alice_text = nltk.Text(word_tokenize(raw_dumb_quotes))\n",
"# A triple-quoted raw string spans lines safely; whitespace in the pattern is ignored.\n",
"alice_text.findall(r\"\"\"(<.*> <.*> <.*> <.*> <.*> <.*>\n",
"                      <poor|foolish|little> <Alice>\n",
"                      <.*> <.*> <.*> <.*> <.*> <.*>)\"\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is not very useful, and NLTK’s `findall` prints its matches instead of returning them, so I’ll rewrite it from [here](http://www.nltk.org/_modules/nltk/text.html)."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"would go through,' thought poor Alice,'it would be of very\n",
"me,' but the wise little Alice was not going to do THAT\n",
"once ; but, alas for poor Alice! when she got to the\n",
"no use now,' thought poor Alice,'to pretend to be two\n",
"off to the garden door. Poor Alice! It was as much as\n",
"the right words,' said poor Alice, and her eyes filled with\n",
"n't mean it!' pleaded poor Alice.'But you're so easily\n",
"any more!' And here poor Alice began to cry again, for\n",
"pleasanter at home,' thought poor Alice,'when one wasn't always growing\n",
"THAT!''Oh, you foolish Alice!' she answered herself.\n",
"used to it!' pleaded poor Alice in a piteous tone. And\n",
".' This answer so confused poor Alice, that she let the Dormouse\n",
"both sat silent and looked at poor Alice, who felt ready to sink\n",
"setting sun, and thinking of little Alice and all her wonderful Adventures,\n",
"-- First, she dreamed of little Alice herself, and once again the\n"
]
}
],
"source": [
"def my_findall(nltktext, regexp):\n",
"    # Serialize the tokens as <tok1><tok2>... so one regex can span token boundaries.\n",
"    nltk_raw = ''.join('<'+w+'>' for w in nltktext)\n",
"    regexp = re.sub(r'\\s', '', regexp)\n",
"    regexp = re.sub(r'<', '(?:<(?:', regexp)\n",
"    regexp = re.sub(r'>', ')>)', regexp)\n",
"    regexp = re.sub(r'(?<!\\\\)\\.', '[^>]', regexp)\n",
"    hits = re.findall(regexp, nltk_raw)\n",
"    hits = [h[1:-1].split('><') for h in hits]\n",
"    # Reattach common punctuation to the preceding word for readability.\n",
"    hits = [re.sub(r\" ([,'!.])\", r\"\\1\", \" \".join(hit)) for hit in hits]\n",
"    return hits\n",
"described_alices = my_findall(alice_text, r\"\"\"(<.*> <.*> <.*> <.*> <.*> <.*>\n",
"    <[Pp]oor|foolish|little> <Alice>\n",
"    <.*> <.*> <.*> <.*> <.*> <.*>)\"\"\")\n",
"for described_alice in described_alices:\n",
"    print(described_alice)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Close enough to be readable, and there are 15 hits, meaning it matches my count\n",
"from above.\n",
"\n",
"The only time Alice refers to herself with an adjective is to call herself\n",
"foolish. This happens while she is filling up Rabbit’s home and wondering\n",
"about whether she will grow any more. She chastises herself for being\n",
"foolish for suspecting that she can learn lessons inside a house in which she\n",
"barely fits. Notably, this follows by only a few paragraphs the part of the\n",
"novel where Alice thinks, “There ought to be a book written about me, that\n",
"there ought! And when I grow up, I’ll write one—but I’m grown up now.” This\n",
"fits Alice’s repeated confusion in the novel about what being “grown\n",
"up” entails. That is, she conflates size and maturity, yet her own shrinking\n",
"and expanding is not accompanied by a comparable change in her own maturity.\n",
"She remains the bratty child throughout, a bit too pleased with herself and\n",
"her achievements, critical of everyone around her, and adept at putting her\n",
"foot in her mouth. Her foolishness, then, suggests that she is not actually old\n",
"enough to write a book about herself, at least not yet.\n",
"\n",
"It is the narrator, then, who refers to Alice as “poor” throughout. The effect\n",
"is somewhat double-edged. As I mention above, Alice does little to draw\n",
"sympathy from the reader, and the narrator could be reading through Alice’s\n",
"own self-pity in referring to her as “poor.” On the other hand, the narrator\n",
"could also be wryly mocking Alice’s state. After all, how poor can she be, if\n",
"she’s to have a book, the book we’re reading, written about her? She comes out\n",
"of her adventures in Wonderland intact and full of an experience to relate to\n",
"her sister. This, then, leads us to “little,” the adjective used to describe\n",
"Alice when seen through her sister’s eyes. It is she, the sister, who is “thinking of\n",
"little Alice” and then dreams “of little Alice herself.”\n",
"\n",
"Again, the “little”ing of Alice, as with her “poor”ing, achieves two effects.\n",
"First, it recasts her Adventures as the province of childhood. To her sister,\n",
"Alice is invariably “little,” still a child and not the grown-up she imagines\n",
"herself to be whenever her size seems to suggest it. The sister, in other words, also conflates size with maturity. But\n",
"as I noted back in the first analysis, “little” is the only adjective among\n",
"the four most common non-stopwords, making me wonder how it ranks among all the\n",
"adjectives."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(('little', 'JJ'), 124),\n",
" (('other', 'JJ'), 40),\n",
" (('great', 'JJ'), 39),\n",
" (('much', 'JJ'), 34),\n",
" (('large', 'JJ'), 33),\n",
" (('more', 'JJR'), 32),\n",
" (('last', 'JJ'), 32),\n",
" (('first', 'JJ'), 31),\n",
" (('thought', 'JJ'), 26),\n",
" (('such', 'JJ'), 26)]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nltk.FreqDist([adj for adj in dumb_quote_pos_tagged if \"JJ\" in adj[1]]).most_common(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"“Little,” then, is the most common adjective in the novel as a whole, and\n",
"though it only refers to Alice a single time (as “little Alice”) until her\n",
"sister is thinking of her, its ubiquity in the text as a whole reminds the\n",
"reader that Wonderland is a world of smallness, where Alice is similarly\n",
"sized to birds, rodents, caterpillars, and even playing cards. For the most\n",
"part, Wonderland is made up of things smaller than Alice, and her own being\n",
"out of scale trips up her ability to be inside Wonderland unimpeded. When\n",
"she’s too large, she causes natural disasters or home destruction, for\n",
"example.\n",
"\n",
"The second effect, however, recasts the novel with an optimistic nod toward\n",
"the ability of the wonder of childhood to persist even into adulthood. After\n",
"all, Alice believes that she will write a book about herself when she’s a\n",
"grown-up, but the reader also knows that the novel was written by an adult who\n",
"imagined the young girl’s Adventures, a fact that is then recursively\n",
"presented to the reader in how Alice’s sister dreams the stories Alice\n",
"dreamed, stories that are subsequently imagined, by her sister, as told by\n",
"Alice, as a grown-up, to her own children.\n",
"\n",
"That is, an underlying question of the novel, namely how Alice knows that\n",
"she is the “right” size at any moment, given the unreliable frame of reference that is\n",
"Wonderland’s scale, is challenged anew in the novel’s closing paragraphs.\n",
"There, Alice’s being little is presented, paradoxically, both as the reason why\n",
"she was able to have the Adventures she did in the first place (young and imaginative and also small enough to get into tiny places) and as immaterial to turning\n",
"those Adventures into a story that can appeal to future audiences, be they her\n",
"sister, Alice’s own future progeny, or, of course, we readers. It’s a little\n",
"curious."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}