VibhuJawa · April 7, 2022 08:39
diff --git a/Gutenburg_NLP_Analysis.ipynb b/Gutenburg_NLP_Analysis.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Gutenburg NLP Analysis using Rapids\n",
    "By: Vibhu Jawa, Data Scientist at Rapids (https://www.linkedin.com/in/jawavibhu/)\n",
    "\n",
    "\n",
    "#### Rapids AI: https://rapids.ai/\n",
    "\n",
    "\n",
    "### Blog Link:\n",
    "https://medium.com/rapids-ai/show-me-the-word-count-3146e1173801\n",
    "\n",
    "\n",
    "### Objective: Show case nlp capabilties of cudf\n",
    "\n",
    "### Pre-Processing :\n",
    "* filter punctuation\n",
    "* to_lower\n",
    "* remove stop words (from nltk corpus)\n",
    "* remove multiple spaces with one\n",
    "* remove leading and trailing spaces    \n",
    "    \n",
    "### Word Count: \n",
    "* Get Frequency count for the whole dataset\n",
    "* Compare word count for two authors (Albert Einstein vs Charles Dickens )\n",
    "* Get Word counts for all the authors\n",
    "\n",
    "### Encode the word-count for all authors into a count-vector\n",
    "\n",
    "We do this in two steps:\n",
    "\n",
    "1. Encode the string Series using `top 20k` most used `words` in the Dataset which we calculated earlier.\n",
    "    * We encode anything not in the series to string_id = `20_000` (`threshold`)\n",
    "\n",
    "\n",
    "2. With the encoded count series for all authors, we  create an aligned word-count vector for them, where:\n",
    "    * Where each column corresponds to a `word_id` from the the `top 20k words`\n",
    "    * Each row corresponds to the `count vector` for that author\n",
    "    \n",
    "    \n",
    "### Find the nearest authors using the count-vector:\n",
    "* Fit a knn\n",
    "* Find the authors nearest to each other in the count vector space\n",
    "* Decrease dimunitonality using UMAP\n",
    "* Find the authors nearest to each other in the latent space"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Download Links:\n",
    "\n",
    "Download the data from: https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html\n",
    "\n",
    "You can also run below commands"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install gdown\n",
    "# !gdown https://drive.google.com/uc?id=0B2Mzhc7popBga2RkcWZNcjlRTGM\n",
    "# !apt update\n",
    "# !apt install unzip\n",
    "# !unzip Gutenberg.zip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/nvme/0/vjawa/conda/envs/cudf_15_june_25/lib/python3.7/site-packages/treelite/gallery/__init__.py:7: FutureWarning: treelite.gallery.sklearn has been moved to treelite.sklearn. treelite.gallery.sklearn will be removed in version 1.1.\n",
      "  FutureWarning)\n",
      "/nvme/0/vjawa/conda/envs/cudf_15_june_25/lib/python3.7/site-packages/treelite/gallery/sklearn/__init__.py:9: FutureWarning: treelite.gallery.sklearn has been moved to treelite.sklearn. treelite.gallery.sklearn will be removed in version 1.1.\n",
      "  FutureWarning)\n"
     ]
    }
   ],
   "source": [
    "import cudf\n",
    "import os\n",
    "import numpy as np\n",
    "import cuml\n",
    "try:\n",
    "    import nltk\n",
    "except ModuleNotFoundError:\n",
    "    os.system('pip install nltk')\n",
    "    import nltk\n",
    "from numba import cuda\n",
    "from dask.utils import parse_bytes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setting Rmm Pool \n",
    "Why ?\n",
    "\n",
    "RAPIDS Memory Manager was enhanced to allow sharing a memory pool between RAPIDS libraries and CuPy.\n",
    "As some RAPIDS TPCx-BB queries rely on CuPy, this change allows us to use a single device memory pool on the entire GPU, providing significant performance gains by reducing the cost of dynamically allocating and freeing memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "cudf.set_allocator(pool=True, initial_pool_size=parse_bytes(\"8GB\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set Data Dir "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_dir = 'Gutenberg/txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read Text Frame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Read helper functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_non_empty_lines(lines):\n",
    "    \"\"\"\n",
    "        returns non empty lines from a list of lines\n",
    "    \"\"\"\n",
    "    clean_lines = []\n",
    "    for line in lines:\n",
    "        str_line = line.strip()\n",
    "        if str_line:\n",
    "            clean_lines.append(str_line)            \n",
    "    return clean_lines\n",
    "\n",
    "def get_txt_lines(data_dir):\n",
    "    \"\"\"\n",
    "        Read text lines from gutenberg texts\n",
    "        returns (text_ls,fname_ls) where \n",
    "        text_ls= input_text_lines and fname_ls = list of fnames\n",
    "    \"\"\"\n",
    "    text_ls = []\n",
    "    fname_ls = []\n",
    "    for fn in os.listdir(data_dir):\n",
    "        full_fn = os.path.join(data_dir,fn)\n",
    "        with open(full_fn,encoding=\"utf-8\",errors=\"ignore\") as f:\n",
    "            content = f.readlines()\n",
    "            content = get_non_empty_lines(content)\n",
    "            text_ls += content\n",
    "            ### dont add .txt to the file\n",
    "            fname_ls += [fn[:-4]]*len(content)\n",
    "    \n",
    "    return text_ls, fname_ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read text lines into a cudf dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File Read Time:\n",
      "CPU times: user 5.61 s, sys: 1.45 s, total: 7.06 s\n",
      "Wall time: 7.06 s\n",
      "\n",
      "CUDF  Creation Time:\n",
      "CPU times: user 1.93 s, sys: 1.46 s, total: 3.39 s\n",
      "Wall time: 3.33 s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/nvme/0/vjawa/conda/envs/cudf_15_june_25/lib/python3.7/site-packages/cudf/core/column/string.py:1906: UserWarning: `expand` parameter defatults to True.\n",
      "  warnings.warn(\"`expand` parameter defatults to True.\")\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of lines in the DF = 19,259,957\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>author</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>THE POISON BELT</td>\n",
       "      <td>Sir Arthur Conan Doyle</td>\n",
       "      <td>The Poison Belt</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>BY</td>\n",
       "      <td>Sir Arthur Conan Doyle</td>\n",
       "      <td>The Poison Belt</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ARTHUR CONAN DOYLE</td>\n",
       "      <td>Sir Arthur Conan Doyle</td>\n",
       "      <td>The Poison Belt</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Being an account of another adventure of</td>\n",
       "      <td>Sir Arthur Conan Doyle</td>\n",
       "      <td>The Poison Belt</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Prof. George E. Challenger, Lord John Roxton,</td>\n",
       "      <td>Sir Arthur Conan Doyle</td>\n",
       "      <td>The Poison Belt</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            text                  author  \\\n",
       "0                                THE POISON BELT  Sir Arthur Conan Doyle   \n",
       "1                                             BY  Sir Arthur Conan Doyle   \n",
       "2                             ARTHUR CONAN DOYLE  Sir Arthur Conan Doyle   \n",
       "3       Being an account of another adventure of  Sir Arthur Conan Doyle   \n",
       "4  Prof. George E. Challenger, Lord John Roxton,  Sir Arthur Conan Doyle   \n",
       "\n",
       "             title  \n",
       "0  The Poison Belt  \n",
       "1  The Poison Belt  \n",
       "2  The Poison Belt  \n",
       "3  The Poison Belt  \n",
       "4  The Poison Belt  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(\"File Read Time:\")\n",
    "%time txt_ls,fname_ls = get_txt_lines(data_dir)\n",
    "df = cudf.DataFrame()\n",
    "\n",
    "print(\"\\nCUDF  Creation Time:\")\n",
    "%time df['text'] = cudf.Series(txt_ls,dtype='str')\n",
    "\n",
    "df['label'] = cudf.Series(fname_ls,dtype='str')\n",
    "title_label_df = df['label'].str.split('___')\n",
    "df['author'] = title_label_df[0]\n",
    "df['title'] = title_label_df[1]\n",
    "df = df.drop(labels=['label'])\n",
    "print(\"Number of lines in the DF = {:,}\".format(len(df)))\n",
    "df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NLP Preprocessing\n",
    "\n",
    "In almost every workflow involving textual data, we'll want to do some kind of preprocessing before running our analysis. We might want to remove punctuation, standardize to all lowercase characters, and potentially dozens of other small tasks. RAPIDS makes developing GPU accelerated preprocessing pipelines smooth.\n",
    "\n",
    "Let's start by removing all the punctuation, since we don't want those characters to cloud our analysis. We could replace them one by one in many calls to replace. More realistically, we might generate a large regular expression pattern that looks for `!`, `,`, `%` and all of our other patterns and replaces them. It might look something like this: `(!)|(,)...|(%)`.\n",
    "\n",
    "A longer regex may or may not be less efficient necessarily on the GPU.  If an instruction within the regex fails to match the current character being processed for the string, the rest of the expression does not need to be evaluated and we can move on to the next character.  However, regexes with many alternation as in our case,  may mean evaluating the same character over many more instructions before continuing. An alternation can be explicit like in `(\\bone\\b)|(\\b1\\b)` but also can be implicit like in `[aA]`.\n",
    "\n",
    "\n",
    "This can be tedious, and isn't well suited to the GPU. \n",
    "\n",
    "Overall, avoiding regex can be more efficient since the algorithm is complex due to the richness of its features.  \n",
    "\n",
    "For cases like removing multiple `characters` or `stop words`, a `general regex` can be overkill and `cudf.str` provides some alternative methods which make this computation much faster. \n",
    "\n",
    "In this workflow we use the following `cudf.string` functions:\n",
    "* `str_ser.str.translate`: (Allows passing list To replace the punctuations with a blank space.\n",
    "* `str_ser.replace_tokens`: To replace the tokens with a empty space.\n",
    "\n",
    "Please checkout `https://docs.rapids.ai/api`, we are adding more features everyday. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Now back to our workflow:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Removing Filters:\n",
    "First, we need to define our list of filter characters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove the following punctuations/characters from cudf\n",
    "filters = [ '!', '\"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/',  '\\\\', ':', ';', '<', '=', '>',\n",
    "           '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\\t','\\n',\"'\",\",\",'~' , '—']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we can simply pass `filters` to the string processing methods inside cuDF and apply it to our Series. We'll eventually make a helper function to let us execute this on every column in the DataFrame. But, let's just take a quick look now on a sample of our text data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                                  THE POISON BELT\n",
       "1                                               BY\n",
       "2                               ARTHUR CONAN DOYLE\n",
       "3         Being an account of another adventure of\n",
       "4    Prof. George E. Challenger, Lord John Roxton,\n",
       "Name: text, dtype: object"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_col_sample = df.head(5)\n",
    "text_col_sample['text']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                                  THE POISON BELT\n",
       "1                                               BY\n",
       "2                               ARTHUR CONAN DOYLE\n",
       "3         Being an account of another adventure of\n",
       "4    Prof  George E  Challenger  Lord John Roxton \n",
       "Name: text_clean, dtype: object"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "translation_table = {ord(char): ord(' ') for char in filters}\n",
    "text_col_sample['text_clean'] = text_col_sample['text'].str.translate(translation_table)\n",
    "text_col_sample['text_clean'].to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With one method we removed all of the symbols in our `filters` list. Next, we'll want to convert to lowercase with `str.lower()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### To Lower"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                                  the poison belt\n",
       "1                                               by\n",
       "2                               arthur conan doyle\n",
       "3         being an account of another adventure of\n",
       "4    prof  george e  challenger  lord john roxton \n",
       "Name: text_clean, dtype: object"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_col_sample['text_clean'] = text_col_sample['text_clean'].str.lower()\n",
    "text_col_sample['text_clean'].to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also remove stopwords with `replace_tokens`. We can pass the default list of English stopwords that ships with the `nltk` library. We'll replace each of our stopwords with a single space."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Remove Stop Words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package stopwords to\n",
      "[nltk_data]     /home/nfs/vjawa/nltk_data...\n",
      "[nltk_data]   Package stopwords is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "nltk.download('stopwords')\n",
    "STOPWORDS = nltk.corpus.stopwords.words('english')\n",
    "STOPWORDS = cudf.Series(STOPWORDS)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                                      poison belt\n",
       "1                                                 \n",
       "2                               arthur conan doyle\n",
       "3                    account   another adventure  \n",
       "4    prof  george e  challenger  lord john roxton \n",
       "Name: text_clean, dtype: object"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_col_sample['text_clean'] = text_col_sample['text_clean'].str.replace_tokens(STOPWORDS, ' ')\n",
    "text_col_sample['text_clean'].to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Replacing Multiple White Spaces\n",
    "\n",
    "This looks great, but we'll probably want to replace multiple spaces in a row with a single space and strip leading and trailing spaces. We can do that easily, too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                                  poison belt\n",
       "1                                             \n",
       "2                           arthur conan doyle\n",
       "3                    account another adventure\n",
       "4    prof george e challenger lord john roxton\n",
       "Name: text_clean, dtype: object"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_col_sample['text_clean'] = text_col_sample['text_clean'].str.normalize_spaces( )\n",
    "text_col_sample['text_clean'] = text_col_sample['text_clean'].str.strip(' ')\n",
    "text_col_sample['text_clean'].to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With that, we've finished our basic preprocessing steps on a tiny sample of our text column. We'll wrap this into a function for portability, and run it on the entire data. We'll rewrite our code to create our filter list and stopwords again for clarity."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Full Pre-processing Pipe-Line\n",
    "\n",
    "\n",
    "``` Dask Time Taken``` = ```15.252387285232544 ```  on a  `dual 16-core CPU (64 virtual core)`\n",
    "\n",
    "\n",
    "``` 2.94 s``` on a  ```Tesla T4  GPU ```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "STOPWORDS = nltk.corpus.stopwords.words('english')\n",
    "\n",
    "filters = [ '!', '\"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/',  '\\\\', ':', ';', '<', '=', '>',\n",
    "           '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\\t','\\n',\"'\",\",\",'~' , '—']\n",
    "\n",
    "def preprocess_text(input_strs , filters=None , stopwords=STOPWORDS):\n",
    "    \"\"\"\n",
    "        * filter punctuation\n",
    "        * to_lower\n",
    "        * remove stop words (from nltk corpus)\n",
    "        * remove multiple spaces with one\n",
    "        * remove leading spaces    \n",
    "    \"\"\"\n",
    "    \n",
    "    # filter punctuation and case conversion\n",
    "    translation_table = {ord(char): ord(' ') for char in filters}\n",
    "    input_strs = input_strs.str.translate(translation_table)\n",
    "    input_strs = input_strs.str.lower()\n",
    "        \n",
    "    # remove stopwords\n",
    "    stopwords_gpu = cudf.Series(stopwords)\n",
    "    input_strs =  input_strs.str.replace_tokens(STOPWORDS, ' ')\n",
    "        \n",
    "    # replace multiple spaces with single one and strip leading/trailing spaces\n",
    "    input_strs = input_strs.str.normalize_spaces( )\n",
    "    input_strs = input_strs.str.strip(' ')\n",
    "    \n",
    "    return input_strs\n",
    "\n",
    "def preprocess_text_df(df, text_cols=['text'], **kwargs):\n",
    "    for col in text_cols:\n",
    "        df[col] = preprocess_text(df[col], **kwargs)\n",
    "    return  df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With our function defined, we can execute it to preprocess the entire dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2.15 s, sys: 804 ms, total: 2.95 s\n",
      "Wall time: 2.95 s\n"
     ]
    }
   ],
   "source": [
    "%time df = preprocess_text_df(df, filters=filters)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                                  poison belt\n",
       "1                                             \n",
       "2                           arthur conan doyle\n",
       "3                    account another adventure\n",
       "4    prof george e challenger lord john roxton\n",
       "Name: text, dtype: object"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['text'].head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word Count\n",
    "\n",
    "Lets find the top words used in:\n",
    "* Whole dataset\n",
    "* by Albert Einstein\n",
    "* by Charles dickens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Getting a frequency count for Strings\n",
    "\n",
    "def get_word_count(str_col):\n",
    "    \"\"\"\n",
    "        returns the count of input strings\n",
    "    \"\"\" \n",
    "    ## Tokenize: convert sentences into a long list of words\n",
    "    ## Get counts: Groupby each token to get value counts\n",
    "\n",
    "    df = cudf.DataFrame()\n",
    "    # tokenize sentences into a string using nvtext.tokenize()\n",
    "    # it into a single tall data-frame\n",
    "    df['string'] = str_col.str.tokenize()\n",
    "    \n",
    "    # Using Group by to do a value count for string columns\n",
    "    # This will be natively supported soon\n",
    "    # See: issue https://github.com/rapidsai/cudf/issues/1951\n",
    "    df['counts'] = np.dtype('int32').type(0)\n",
    "    res = df.groupby('string').count()\n",
    "    res = res.reset_index(drop=False).sort_values(by='counts', ascending=False)\n",
    "    return res.rename(columns={'index':'string'})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Top Words  Across the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.13 s, sys: 99.3 ms, total: 1.23 s\n",
      "Wall time: 1.24 s\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>string</th>\n",
       "      <th>counts</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>318634</th>\n",
       "      <td>said</td>\n",
       "      <td>779344</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>260399</th>\n",
       "      <td>one</td>\n",
       "      <td>709102</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>408323</th>\n",
       "      <td>would</td>\n",
       "      <td>632164</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>225934</th>\n",
       "      <td>man</td>\n",
       "      <td>439813</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87499</th>\n",
       "      <td>could</td>\n",
       "      <td>416195</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       string  counts\n",
       "318634   said  779344\n",
       "260399    one  709102\n",
       "408323  would  632164\n",
       "225934    man  439813\n",
       "87499   could  416195"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time \n",
    "\n",
    "count_df = get_word_count(df['text'])\n",
    "count_df.head(5).to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now lets compare Charles Dickens and Albert Einstein"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "####  Albert Einstein"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>string</th>\n",
       "      <th>counts</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3000</th>\n",
       "      <td>theory</td>\n",
       "      <td>248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2515</th>\n",
       "      <td>relativity</td>\n",
       "      <td>223</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2788</th>\n",
       "      <td>space</td>\n",
       "      <td>190</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>413</th>\n",
       "      <td>body</td>\n",
       "      <td>168</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3028</th>\n",
       "      <td>time</td>\n",
       "      <td>160</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          string  counts\n",
       "3000      theory     248\n",
       "2515  relativity     223\n",
       "2788       space     190\n",
       "413         body     168\n",
       "3028        time     160"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "einstein_df = df[df['author'].str.contains('Einstein')]\n",
    "einstein_count_df = get_word_count(einstein_df['text'])\n",
    "einstein_count_df.head(5).to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Charles Dickens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>string</th>\n",
       "      <th>counts</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>27120</th>\n",
       "      <td>mr</td>\n",
       "      <td>34408</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35515</th>\n",
       "      <td>said</td>\n",
       "      <td>32643</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28538</th>\n",
       "      <td>one</td>\n",
       "      <td>17684</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46610</th>\n",
       "      <td>would</td>\n",
       "      <td>15266</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44229</th>\n",
       "      <td>upon</td>\n",
       "      <td>14657</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      string  counts\n",
       "27120     mr   34408\n",
       "35515   said   32643\n",
       "28538    one   17684\n",
       "46610  would   15266\n",
       "44229   upon   14657"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "charles_dickens_df = df[df['author'].str.contains('Charles Dickens')]\n",
    "charles_dickens_count_df = get_word_count(charles_dickens_df['text'])\n",
    "charles_dickens_count_df.head(5).to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "So Einstein is talking about relativity, with words like `relativity`,`theory`,`body` ,\n",
    "while Charles Dickens is telling stories with   `one`, `upon`, `time` , `old`\n",
    "\n",
    "Our Word Count seems to be working :-D"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Word Counts for all the authors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Lets get the list of authors for our dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    Abraham Lincoln\n",
       "1    Agatha Christie\n",
       "2    Albert Einstein\n",
       "3      Aldous Huxley\n",
       "4     Alexander Pope\n",
       "Name: author, dtype: object"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['author'].unique().to_pandas().head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Calculate the  word count for all authors into a list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 4.22 s, sys: 928 ms, total: 5.14 s\n",
      "Wall time: 5.15 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "author_wc_ls = []\n",
    "author_name_ls = []\n",
    "for author_name in df['author'].unique():\n",
    "    df_auth = df[df['author']==author_name]\n",
    "    author_wc = get_word_count(df_auth['text'])\n",
    "    author_wc_ls.append(author_wc)\n",
    "    author_name_ls.append(author_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Encode the word-count `series` list for all authors into a count-vector\n",
    "\n",
    "We do this in two steps:\n",
    "\n",
    "1. Encode the string Series using`top 20k` most used `words` in the Dataset which we calculated earlier.\n",
    "    * We encode anything not in the series to string_id = `20_000` (threshold)\n",
    "\n",
    "\n",
    "2. With the encoded count series for all authors, we create an aligned word-count vector for them, where:\n",
    "    * Where each column corresponds to a `word_id` from the the `top 20k words`\n",
    "    * Each row corresponds to the `count vector` for that author"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Categorize the `string series` from the `word count series` into a `integer series`  for all the authors "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def encode_count_df(auth_wc_df,keys,out_of_dict_id):\n",
    "    \"\"\"\n",
    "        Encode the count series for all authors by using the index provided in keys\n",
    "        All strings not in keys are mapped to out_of_dict_id and their count is summed\n",
    "    \"\"\"\n",
    "    auth_wc_df['encoded_str_id'] = auth_wc_df['string'].astype('category')\n",
    "    auth_wc_df['encoded_str_id'] = auth_wc_df['encoded_str_id'].cat.set_categories(keys)._column.codes\n",
    "    auth_wc_df['encoded_str_id'] = auth_wc_df['encoded_str_id'].fillna(out_of_dict_id)\n",
    "    \n",
    "    # sub df which  contains words that are in the dictionary\n",
    "    in_dict_wc_df = auth_wc_df[auth_wc_df['encoded_str_id']!=out_of_dict_id]\n",
    "    \n",
    "    # sum of `count series` of words not in dictionary \n",
    "    out_of_dict_wcount = auth_wc_df[auth_wc_df['encoded_str_id']==out_of_dict_id]['counts'].sum()\n",
    "    \n",
    "    # mapping out the count of words to -1\n",
    "    out_of_dict_df = cudf.DataFrame({'encoded_str_id':out_of_dict_id,'counts': out_of_dict_wcount,'string':'other'})\n",
    "    \n",
    "    out_of_dict_df['encoded_str_id'] = out_of_dict_df['encoded_str_id'].astype(np.int32)\n",
    "    out_of_dict_df['counts'] = out_of_dict_df['counts'].astype(np.int32)\n",
    "    \n",
    "    return cudf.concat([in_dict_wc_df,out_of_dict_df])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2.63 s, sys: 231 ms, total: 2.86 s\n",
      "Wall time: 2.93 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# keep only top 20k words in the dataset\n",
    "th = 20_000\n",
    "keys = count_df['string'][:th]\n",
    "encoded_wc_ls = []\n",
    "\n",
    "for auth_wc_df in author_wc_ls:\n",
    "    encoded_count_df = encode_count_df(auth_wc_df,keys,th)\n",
    "    encoded_wc_ls.append(encoded_count_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Now lets check if the encoding worked ! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Agatha Christie Counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Agatha Christie\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>string</th>\n",
       "      <th>counts</th>\n",
       "      <th>encoded_str_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6490</th>\n",
       "      <td>said</td>\n",
       "      <td>738</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7862</th>\n",
       "      <td>tuppence</td>\n",
       "      <td>626</td>\n",
       "      <td>13917</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7718</th>\n",
       "      <td>tommy</td>\n",
       "      <td>597</td>\n",
       "      <td>2804</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5159</th>\n",
       "      <td>one</td>\n",
       "      <td>505</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4894</th>\n",
       "      <td>mr</td>\n",
       "      <td>455</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        string  counts  encoded_str_id\n",
       "6490      said     738               0\n",
       "7862  tuppence     626           13917\n",
       "7718     tommy     597            2804\n",
       "5159       one     505               1\n",
       "4894        mr     455               9"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "author_id = author_name_ls.index('Agatha Christie') \n",
    "print(author_name_ls[author_id])\n",
    "encoded_wc_ls[author_id].head(5).to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Charles Dickens Counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Charles Dickens\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>string</th>\n",
       "      <th>counts</th>\n",
       "      <th>encoded_str_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>27120</th>\n",
       "      <td>mr</td>\n",
       "      <td>34408</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35515</th>\n",
       "      <td>said</td>\n",
       "      <td>32643</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28538</th>\n",
       "      <td>one</td>\n",
       "      <td>17684</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46610</th>\n",
       "      <td>would</td>\n",
       "      <td>15266</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44229</th>\n",
       "      <td>upon</td>\n",
       "      <td>14657</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      string  counts  encoded_str_id\n",
       "27120     mr   34408               9\n",
       "35515   said   32643               0\n",
       "28538    one   17684               1\n",
       "46610  would   15266               2\n",
       "44229   upon   14657               8"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "author_id = author_name_ls.index('Charles Dickens') \n",
    "print(author_name_ls[author_id])\n",
    "encoded_wc_ls[author_id].head(5).to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### We can see that the encoded_str_id for `said` is `0` for both `Charles Dickens` and `Agatha Christie`. Yaay! the encoding worked"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a aligned word-count vector for each author:\n",
    "\n",
    "We create a dataframe, where a row represents a `author` and the columnss contain the count of the `words` respresented by that `column`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create a numba nd-array of shape (`num_authors`,`Vocablary Size+1`))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_authors = len(encoded_wc_ls)\n",
    "count_ary = np.zeros(shape = (num_authors,th+1), dtype=np.int32)\n",
    "count_dary = cuda.to_device(count_ary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fill the count array using a numba function:\n",
    "\n",
    "Apply the numba function to fill the `author_count_array` with the count of words used by the `author`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Numba Function`: See https://numba.pydata.org/numba-doc/0.13/CUDAJit.html for more `info` on how to write `cuda-jit` functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 109 ms, sys: 4.31 ms, total: 113 ms\n",
      "Wall time: 123 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "@cuda.jit('void(int32[:], int32[:], int32[:])')\n",
    "def count_vec_func(author_token_id_array,author_token_count_array,author_count_array):\n",
    "    pos = cuda.grid(1)\n",
    "    if pos < author_token_id_array.size:\n",
    "        token_id = author_token_id_array[pos]\n",
    "        token_count = author_token_count_array[pos]\n",
    "        author_count_array[token_id] = token_count        \n",
    "        \n",
    "for author_id,encoded_wc_df in enumerate(encoded_wc_ls):    \n",
    "    count_sr = encoded_wc_df['counts']\n",
    "    token_id_sr =  encoded_wc_df['encoded_str_id']\n",
    "    \n",
    "    count_ar = count_sr._column.data_array_view\n",
    "    token_id_ar = token_id_sr._column.data_array_view\n",
    "    author_ar = count_dary[author_id]\n",
    "    \n",
    "    # See https://numba.pydata.org/numba-doc/0.13/CUDAJit.html\n",
    "    threadsperblock = 36\n",
    "    blockspergrid = (count_ar.size + (threadsperblock - 1)) // threadsperblock\n",
    "    count_vec_func[blockspergrid, threadsperblock](token_id_ar,count_ar,author_ar)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Now, Lets check if creating the count vectors worked !"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Agatha Christie\n",
      "0 : 738\n",
      "13917 : 626\n",
      "2804 : 597\n",
      "1 : 505\n",
      "9 : 455\n"
     ]
    }
   ],
   "source": [
    "author_id = author_name_ls.index('Agatha Christie')  \n",
    "\n",
    "print(author_name_ls[author_id])\n",
    "top_word_ids = encoded_wc_ls[author_id]['encoded_str_id'].head(5).to_pandas()\n",
    "for word_id in top_word_ids:\n",
    "    print(\"{} : {}\".format(word_id,count_dary[author_id][word_id]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lets find the  Nearest Authors \n",
    "\n",
    "Now your count df is ready for ML\n",
    "\n",
    "Let's train a KNN on the count-df and see if we can find any interesting patterns in count_df. Though `euclidian distance` is not the best measure for these higher dimensional spaces but it still works as a small toy example. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Normalize Counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "normalized_count_array = count_dary/np.sum(count_dary,axis=1)[:,None]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Train and find nearest_neighours on the non embedded  space"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 185 ms, sys: 192 ms, total: 377 ms\n",
      "Wall time: 377 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "nn_model = cuml.neighbors.NearestNeighbors(n_neighbors = 5)\n",
    "nn_model.fit(normalized_count_array)\n",
    "ouput_mat,output_indices_count_sp = nn_model.kneighbors(X=normalized_count_array)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Nearest authors to Albert Einstein in the count vector space"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Albert Einstein\n",
      "Thomas Carlyle\n",
      "Lord Byron\n",
      "James Russell Lowell\n",
      "Michael Faraday\n"
     ]
    }
   ],
   "source": [
    "author_id = author_name_ls.index('Albert Einstein') \n",
    "for index in output_indices_count_sp[author_id]:\n",
    "    print(author_name_ls[int(index)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Nearest authors to Charles Dickens in the count vector space"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Charles Dickens\n",
      "Winston Churchill\n",
      "Charlotte Mary Yonge\n",
      "Harriet Elizabeth Beecher Stowe\n",
      "William Dean Howells\n"
     ]
    }
   ],
   "source": [
    "author_id = author_name_ls.index('Charles Dickens') \n",
    "for index in output_indices_count_sp[author_id]:\n",
    "    print(author_name_ls[int(index)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Encode the count vecotrs to a lower dimention using Umap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding_ar_gpu =  cuml.UMAP(n_neighbors=100,n_components=3).fit_transform(normalized_count_array)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### KNN in the lower dimentional space"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 292 µs, sys: 124 ms, total: 124 ms\n",
      "Wall time: 123 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "nn_model = cuml.neighbors.NearestNeighbors(n_neighbors = 5)\n",
    "nn_model.fit(embedding_ar_gpu)\n",
    "ouput_mat,output_indices_umap = nn_model.kneighbors(X=embedding_ar_gpu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Nearest authors to Albert Einstein in the emdedded space"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Albert Einstein\n",
      "Charles Darwin\n",
      "Thomas Carlyle\n",
      "Michael Faraday\n",
      "Lord Byron\n"
     ]
    }
   ],
   "source": [
    "author_id = author_name_ls.index('Albert Einstein') \n",
    "for index in output_indices_umap[author_id]:\n",
    "    print(author_name_ls[int(index)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Nearest authors to Charles Dickens in the emdedded space"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Charles Dickens\n",
      "Lyman Frank Baum\n",
      "D H Lawrence\n",
      "Isaac Asimov\n",
      "Winston Churchill\n"
     ]
    }
   ],
   "source": [
    "author_id = author_name_ls.index('Charles Dickens') \n",
    "for index in output_indices_umap[author_id]:\n",
    "    print(author_name_ls[int(index)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Want to get started with RAPIDS? Check out [`cuDF`](https://github.com/rapidsai/cudf) on Github and let us know what you think! You can download pre-built Docker containers for our 0.8 and newer releases from [NGC](https://ngc.nvidia.com/catalog/landing) or [Dockerhub](https://hub.docker.com/r/rapidsai/rapidsai/) to get started, or install it yourself via Conda. Need something even easier? You can quickly get started with RAPIDS in [Google Colab](https://colab.research.google.com/drive/1XTKHiIcvyL5nuldx0HSL_dUa8yopzy_Y#forceEdit=true&offline=true&sandboxMode=true) and try out all the new things we've added with just a single push of a button.\n",
    "\n",
    "Don't want to wait for the next release to use upcoming features? You can download our nightly containers from [Dockerhub](https://hub.docker.com/r/rapidsai/rapidsai-nightly) or install via [Conda](https://anaconda.org/rapidsai-nightly) to stay at the tip of our development branch."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Whats Next in Text pre-processing in rapids\n",
    "\n",
    "- We will super soon have tf-idf/count vectorizer in cuml (https://github.com/rapidsai/cuml/pull/2267)\n",
    "- Check on click-streams work for cyber log parsing (https://medium.com/rapids-ai/cybert-28b35a4c81c4)\n",
    "- We have an acclerated Bert tokenizer in the works (https://github.com/rapidsai/cudf/issues/4981)\n",
    "\n",
    "### Other Examples of NLP workflows:\n",
    "- https://github.com/rapidsai/tpcx-bb/tree/master/tpcx_bb/queries/q17\n",
    "- https://github.com/rapidsai/tpcx-bb/tree/master/tpcx_bb/queries/q18 \n",
    "- https://github.com/rapidsai/tpcx-bb/tree/master/tpcx_bb/queries/q27\n",
    "- https://github.com/rapidsai/tpcx-bb/tree/master/tpcx_bb/queries/q28"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
 }