SandieIJ · April 23, 2020 22:43
diff --git a/BoW.ipynb b/BoW.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**CREATING A DICTIONARY OF WORD MAPPINGS AND A BAG-OF-WORDS**\n",
    "\n",
    "Creating a dictionary sounds complex, but it simply assigns a unique integer ID to each distinct word in the documents so that the text can be used for modeling.\n",
    "\n",
    "Once we have the dictionary, we convert each document into a bag-of-words: a list of (token_id, frequency) pairs, where token_id is the integer assigned when the dictionary was created and frequency is the number of times that token appears in the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 3), (7, 1), (8, 9), (9, 1), (10, 1), (11, 5), (12, 2), (13, 3), (14, 1), (15, 2), (16, 3), (17, 1), (18, 1), (19, 1), (20, 4), (21, 1), (22, 2), (23, 3), (24, 1), (25, 1), (26, 1), (27, 1), (28, 16), (29, 3), (30, 14), (31, 15), (32, 5), (33, 1), (34, 4), (35, 2), (36, 1), (37, 3), (38, 1), (39, 1), (40, 11), (41, 1), (42, 2), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 10), (52, 8), (53, 1), (54, 5), (55, 1), (56, 3), (57, 1), (58, 4), (59, 1), (60, 1), (61, 1), (62, 5), (63, 1), (64, 1), (65, 1), (66, 10), (67, 1), (68, 1), (69, 1), (70, 5), (71, 5), (72, 1), (73, 1), (74, 1), (75, 6), (76, 4), (77, 2), (78, 2)]]\n"
     ]
    }
   ],
   "source": [
    "from gensim import corpora\n",
    "\n",
    "# Build a mapping between each unique word and its integer id\n",
    "id2word = corpora.Dictionary(data_lemmatized)\n",
    "\n",
    "# Convert each document to bag-of-words form: a list of (word_id, word_frequency) pairs\n",
    "corpus = [id2word.doc2bow(text) for text in data_lemmatized]\n",
    "\n",
    "# Show the bag-of-words for the first document\n",
    "print(corpus[:1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**TEXT REPRESENTED AS A BAG OF WORDS DISREGARDING GRAMMAR**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[('action', 1),\n",
       "  ('ai', 1),\n",
       "  ('amazing', 1),\n",
       "  ('arena', 1),\n",
       "  ('army', 2),\n",
       "  ('available', 1),\n",
       "  ('battle', 3),\n",
       "  ('battlefield', 1),\n",
       "  ('battleground', 9),\n",
       "  ('best', 1),\n",
       "  ('challenging', 1),\n",
       "  ('commando', 5),\n",
       "  ('control', 2),\n",
       "  ('crazy', 3),\n",
       "  ('cross', 1),\n",
       "  ('dangerous', 2),\n",
       "  ('death', 3),\n",
       "  ('depend', 1),\n",
       "  ('detect', 1),\n",
       "  ('download', 1),\n",
       "  ('enemy', 4),\n",
       "  ('environment', 1),\n",
       "  ('exciting', 2),\n",
       "  ('face', 3),\n",
       "  ('feature', 1),\n",
       "  ('feel', 1),\n",
       "  ('fierce', 1),\n",
       "  ('fill', 1),\n",
       "  ('fire', 16),\n",
       "  ('firing_squad', 3),\n",
       "  ('free', 14),\n",
       "  ('game', 15),\n",
       "  ('good', 5),\n",
       "  ('graphic', 1),\n",
       "  ('gun', 4),\n",
       "  ('journey', 2),\n",
       "  ('last', 1),\n",
       "  ('legend', 3),\n",
       "  ('lot', 1),\n",
       "  ('mind_blowing', 1),\n",
       "  ('mission', 11),\n",
       "  ('missionsdozen', 1),\n",
       "  ('modern', 2),\n",
       "  ('offline', 1),\n",
       "  ('other', 1),\n",
       "  ('play', 2),\n",
       "  ('player', 1),\n",
       "  ('position', 1),\n",
       "  ('ready', 1),\n",
       "  ('real', 1),\n",
       "  ('see', 1),\n",
       "  ('shoot', 10),\n",
       "  ('shooting', 8),\n",
       "  ('show', 1),\n",
       "  ('skill', 5),\n",
       "  ('smooth', 1),\n",
       "  ('sniper', 3),\n",
       "  ('soldier', 1),\n",
       "  ('squad', 4),\n",
       "  ('squadreal', 1),\n",
       "  ('start', 1),\n",
       "  ('store', 1),\n",
       "  ('strike', 5),\n",
       "  ('strikesimple', 1),\n",
       "  ('surgical', 1),\n",
       "  ('surgical_strike', 1),\n",
       "  ('survival', 10),\n",
       "  ('system', 1),\n",
       "  ('terrorist', 1),\n",
       "  ('totally', 1),\n",
       "  ('training', 5),\n",
       "  ('unknown', 5),\n",
       "  ('unknown_battleground', 1),\n",
       "  ('variety', 1),\n",
       "  ('wait', 1),\n",
       "  ('war', 6),\n",
       "  ('weapon', 4),\n",
       "  ('where', 2),\n",
       "  ('world', 2)]]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Human-readable view of the first document: (word, frequency) pairs\n",
    "# (word_id is used instead of id to avoid shadowing the built-in)\n",
    "[[(id2word[word_id], freq) for word_id, freq in doc] for doc in corpus[:1]]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
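
For anyone reviewing this notebook without gensim installed, the two steps in the cells above (word-to-integer mapping, then per-document `(id, frequency)` pairs) can be approximated in plain Python. This is only a rough sketch and not gensim's implementation: `build_vocab`, `to_bow`, and the toy `docs` list are hypothetical names made up for illustration, and assigning ids in sorted order per document is an assumption chosen so the ids come out alphabetical, as they do in the printed output.

```python
from collections import Counter


def build_vocab(documents):
    """Assign an integer id to every distinct token (hypothetical helper)."""
    token2id = {}
    for tokens in documents:
        # Assumption: new tokens get ids in sorted order within each document.
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(tokens)
    return sorted((token2id[t], n) for t, n in counts.items() if t in token2id)


# Toy stand-in for data_lemmatized: a list of tokenized documents.
docs = [["fire", "game", "fire", "free"], ["game", "war", "war"]]

token2id = build_vocab(docs)
bow = [to_bow(doc, token2id) for doc in docs]

# Reverse mapping to get the human-readable (word, frequency) view.
id2token = {i: t for t, i in token2id.items()}
readable = [[(id2token[i], n) for i, n in doc] for doc in bow]

print(bow[:1])
print(readable[:1])
```

Word order within each document is discarded; only the counts per token id survive, which is why the representation is called a bag of words.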