Created
April 23, 2020 22:43
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**CREATING A DICTIONARY OF WORD MAPPINGS AND A BAG-OF-WORDS**\n", | |
"\n", | |
"Creating a dictionary sounds complex, but it simply assigns an integer ID to each unique word in the corpus so that the text can be used for modeling.\n", | |
"\n", | |
"Once we have a dictionary, we create a bag-of-words for each document: a list of (token_id, frequency) pairs, where token_id is the integer assigned when the dictionary was built and frequency is the number of times that token appears in the document." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 3), (7, 1), (8, 9), (9, 1), (10, 1), (11, 5), (12, 2), (13, 3), (14, 1), (15, 2), (16, 3), (17, 1), (18, 1), (19, 1), (20, 4), (21, 1), (22, 2), (23, 3), (24, 1), (25, 1), (26, 1), (27, 1), (28, 16), (29, 3), (30, 14), (31, 15), (32, 5), (33, 1), (34, 4), (35, 2), (36, 1), (37, 3), (38, 1), (39, 1), (40, 11), (41, 1), (42, 2), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 10), (52, 8), (53, 1), (54, 5), (55, 1), (56, 3), (57, 1), (58, 4), (59, 1), (60, 1), (61, 1), (62, 5), (63, 1), (64, 1), (65, 1), (66, 10), (67, 1), (68, 1), (69, 1), (70, 5), (71, 5), (72, 1), (73, 1), (74, 1), (75, 6), (76, 4), (77, 2), (78, 2)]]\n" | |
] | |
} | |
], | |
"source": [ | |
"from gensim import corpora\n", | |
"\n", | |
"# Dictionary: maps each unique token to an integer id\n", | |
"# (data_lemmatized is the list of lemmatized token lists built earlier)\n", | |
"id2word = corpora.Dictionary(data_lemmatized)\n", | |
"\n", | |
"# Bag-of-words corpus: one list of (token_id, frequency) pairs per document\n", | |
"corpus = [id2word.doc2bow(text) for text in data_lemmatized]\n", | |
"\n", | |
"# Print the bag-of-words for the first document only\n", | |
"print(corpus[:1])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**TEXT REPRESENTED AS A BAG OF WORDS DISREGARDING GRAMMAR**" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[[('action', 1),\n", | |
" ('ai', 1),\n", | |
" ('amazing', 1),\n", | |
" ('arena', 1),\n", | |
" ('army', 2),\n", | |
" ('available', 1),\n", | |
" ('battle', 3),\n", | |
" ('battlefield', 1),\n", | |
" ('battleground', 9),\n", | |
" ('best', 1),\n", | |
" ('challenging', 1),\n", | |
" ('commando', 5),\n", | |
" ('control', 2),\n", | |
" ('crazy', 3),\n", | |
" ('cross', 1),\n", | |
" ('dangerous', 2),\n", | |
" ('death', 3),\n", | |
" ('depend', 1),\n", | |
" ('detect', 1),\n", | |
" ('download', 1),\n", | |
" ('enemy', 4),\n", | |
" ('environment', 1),\n", | |
" ('exciting', 2),\n", | |
" ('face', 3),\n", | |
" ('feature', 1),\n", | |
" ('feel', 1),\n", | |
" ('fierce', 1),\n", | |
" ('fill', 1),\n", | |
" ('fire', 16),\n", | |
" ('firing_squad', 3),\n", | |
" ('free', 14),\n", | |
" ('game', 15),\n", | |
" ('good', 5),\n", | |
" ('graphic', 1),\n", | |
" ('gun', 4),\n", | |
" ('journey', 2),\n", | |
" ('last', 1),\n", | |
" ('legend', 3),\n", | |
" ('lot', 1),\n", | |
" ('mind_blowing', 1),\n", | |
" ('mission', 11),\n", | |
" ('missionsdozen', 1),\n", | |
" ('modern', 2),\n", | |
" ('offline', 1),\n", | |
" ('other', 1),\n", | |
" ('play', 2),\n", | |
" ('player', 1),\n", | |
" ('position', 1),\n", | |
" ('ready', 1),\n", | |
" ('real', 1),\n", | |
" ('see', 1),\n", | |
" ('shoot', 10),\n", | |
" ('shooting', 8),\n", | |
" ('show', 1),\n", | |
" ('skill', 5),\n", | |
" ('smooth', 1),\n", | |
" ('sniper', 3),\n", | |
" ('soldier', 1),\n", | |
" ('squad', 4),\n", | |
" ('squadreal', 1),\n", | |
" ('start', 1),\n", | |
" ('store', 1),\n", | |
" ('strike', 5),\n", | |
" ('strikesimple', 1),\n", | |
" ('surgical', 1),\n", | |
" ('surgical_strike', 1),\n", | |
" ('survival', 10),\n", | |
" ('system', 1),\n", | |
" ('terrorist', 1),\n", | |
" ('totally', 1),\n", | |
" ('training', 5),\n", | |
" ('unknown', 5),\n", | |
" ('unknown_battleground', 1),\n", | |
" ('variety', 1),\n", | |
" ('wait', 1),\n", | |
" ('war', 6),\n", | |
" ('weapon', 4),\n", | |
" ('where', 2),\n", | |
" ('world', 2)]]" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Human-readable form of the corpus: (token, frequency) instead of (token_id, frequency)\n", | |
"\n", | |
"[[(id2word[token_id], freq) for token_id, freq in doc] for doc in corpus[:1]]" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.4" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
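
The two steps described in the notebook (building a dictionary, then turning each document into a bag-of-words) can be sketched in plain Python without gensim. This is only an illustration under stated assumptions: the toy documents and the names `docs`, `token2id`, `id2token`, and `to_bow` are hypothetical, and assigning ids in alphabetical order is a choice made here to resemble the ordering visible in the notebook output, not a statement about how gensim assigns ids internally.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized and lemmatized.
docs = [
    ["game", "fire", "game", "mission"],
    ["fire", "squad", "fire"],
]

# Dictionary step: one integer id per unique token.
# Sorting is an assumption made here so the ids are deterministic.
vocabulary = sorted({token for doc in docs for token in doc})
token2id = {token: idx for idx, token in enumerate(vocabulary)}
id2token = {idx: token for token, idx in token2id.items()}


def to_bow(doc):
    """Return a sorted list of (token_id, frequency) pairs for one document."""
    counts = Counter(doc)
    return sorted((token2id[token], freq) for token, freq in counts.items())


# Bag-of-words step: word order and grammar are discarded, only counts remain.
bow_corpus = [to_bow(doc) for doc in docs]

# Human-readable view: swap each id back to its token.
readable = [[(id2token[i], freq) for i, freq in bow] for bow in bow_corpus]

print(bow_corpus)
print(readable)
```

Note that a token absent from a document simply has no pair in that document's list, which is why the second document has no entry for `game` or `mission`.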