@SandieIJ
Created April 23, 2020 22:43
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**CREATING A DICTIONARY OF WORD MAPPINGS AND A BAG-OF-WORDS**\n",
"\n",
"Creating a dictionary sounds complex, but it simply assigns a unique integer ID to each distinct word in the corpus so that the text can be modeled numerically.\n",
"\n",
"Once we have the dictionary, we convert each document into a bag-of-words: a list of (token_id, frequency) pairs, where token_id is the integer assigned by the dictionary and frequency is the number of times that token appears in the document."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 3), (7, 1), (8, 9), (9, 1), (10, 1), (11, 5), (12, 2), (13, 3), (14, 1), (15, 2), (16, 3), (17, 1), (18, 1), (19, 1), (20, 4), (21, 1), (22, 2), (23, 3), (24, 1), (25, 1), (26, 1), (27, 1), (28, 16), (29, 3), (30, 14), (31, 15), (32, 5), (33, 1), (34, 4), (35, 2), (36, 1), (37, 3), (38, 1), (39, 1), (40, 11), (41, 1), (42, 2), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 10), (52, 8), (53, 1), (54, 5), (55, 1), (56, 3), (57, 1), (58, 4), (59, 1), (60, 1), (61, 1), (62, 5), (63, 1), (64, 1), (65, 1), (66, 10), (67, 1), (68, 1), (69, 1), (70, 5), (71, 5), (72, 1), (73, 1), (74, 1), (75, 6), (76, 4), (77, 2), (78, 2)]]\n"
]
}
],
"source": [
"from gensim import corpora\n",
"\n",
"# Build a mapping between each unique token and an integer id\n",
"id2word = corpora.Dictionary(data_lemmatized)\n",
"\n",
"# Convert each document into a bag-of-words: a list of (token_id, frequency) pairs\n",
"corpus = [id2word.doc2bow(text) for text in data_lemmatized]\n",
"\n",
"# Show the bag-of-words for the first document\n",
"print(corpus[:1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TEXT REPRESENTED AS A BAG OF WORDS DISREGARDING GRAMMAR**"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[('action', 1),\n",
" ('ai', 1),\n",
" ('amazing', 1),\n",
" ('arena', 1),\n",
" ('army', 2),\n",
" ('available', 1),\n",
" ('battle', 3),\n",
" ('battlefield', 1),\n",
" ('battleground', 9),\n",
" ('best', 1),\n",
" ('challenging', 1),\n",
" ('commando', 5),\n",
" ('control', 2),\n",
" ('crazy', 3),\n",
" ('cross', 1),\n",
" ('dangerous', 2),\n",
" ('death', 3),\n",
" ('depend', 1),\n",
" ('detect', 1),\n",
" ('download', 1),\n",
" ('enemy', 4),\n",
" ('environment', 1),\n",
" ('exciting', 2),\n",
" ('face', 3),\n",
" ('feature', 1),\n",
" ('feel', 1),\n",
" ('fierce', 1),\n",
" ('fill', 1),\n",
" ('fire', 16),\n",
" ('firing_squad', 3),\n",
" ('free', 14),\n",
" ('game', 15),\n",
" ('good', 5),\n",
" ('graphic', 1),\n",
" ('gun', 4),\n",
" ('journey', 2),\n",
" ('last', 1),\n",
" ('legend', 3),\n",
" ('lot', 1),\n",
" ('mind_blowing', 1),\n",
" ('mission', 11),\n",
" ('missionsdozen', 1),\n",
" ('modern', 2),\n",
" ('offline', 1),\n",
" ('other', 1),\n",
" ('play', 2),\n",
" ('player', 1),\n",
" ('position', 1),\n",
" ('ready', 1),\n",
" ('real', 1),\n",
" ('see', 1),\n",
" ('shoot', 10),\n",
" ('shooting', 8),\n",
" ('show', 1),\n",
" ('skill', 5),\n",
" ('smooth', 1),\n",
" ('sniper', 3),\n",
" ('soldier', 1),\n",
" ('squad', 4),\n",
" ('squadreal', 1),\n",
" ('start', 1),\n",
" ('store', 1),\n",
" ('strike', 5),\n",
" ('strikesimple', 1),\n",
" ('surgical', 1),\n",
" ('surgical_strike', 1),\n",
" ('survival', 10),\n",
" ('system', 1),\n",
" ('terrorist', 1),\n",
" ('totally', 1),\n",
" ('training', 5),\n",
" ('unknown', 5),\n",
" ('unknown_battleground', 1),\n",
" ('variety', 1),\n",
" ('wait', 1),\n",
" ('war', 6),\n",
" ('weapon', 4),\n",
" ('where', 2),\n",
" ('world', 2)]]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Human-readable view of the corpus: (token, frequency) instead of (token_id, frequency)\n",
"\n",
"[[(id2word[token_id], freq) for token_id, freq in doc] for doc in corpus[:1]]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
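
The notebook above relies on gensim to build the dictionary and the bag-of-words. To make the idea concrete, here is a minimal sketch in plain Python of what those two steps amount to. It is not gensim's implementation: the helper names `build_dictionary` and `to_bow` and the toy documents are hypothetical, and the ID assignment order (new tokens numbered in sorted order, document by document) is an assumption chosen to resemble the output shown in the notebook.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign an integer ID to each distinct token (hypothetical helper)."""
    token2id = {}
    for doc in documents:
        # Assumption: unseen tokens get the next free IDs in sorted order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Convert a list of tokens into sorted (token_id, frequency) pairs."""
    # Tokens missing from the dictionary are ignored.
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], freq) for token, freq in counts.items())


# Toy, already-tokenized documents (made up for illustration).
docs = [["fire", "game", "fire", "battle"], ["game", "sniper"]]

token2id = build_dictionary(docs)
bows = [to_bow(doc, token2id) for doc in docs]

print(token2id)
print(bows)
```

Each inner list in `bows` corresponds to one document, and word order within the document is discarded; only the counts survive, which is why the representation is called a bag of words.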