palewire · October 15, 2024 16:25
diff --git a/old-school.ipynb b/old-school.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8300886a-a2c3-46d4-848c-67d48d5abd2b",
   "metadata": {},
   "source": [
    "# \"Old School\" Machine Learning Classifier\n",
    "\n",
    "An example of how things were traditionally done"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d780d8a0-bc76-49c4-8ed9-39345ac87149",
   "metadata": {},
   "source": [
    "## Import Python tools"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "9ab4e129-fc5d-4f9b-b09f-df79dbbb61e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Here's a bunch of machine-learning crap from a module called \"scikit-learn\"\n",
    "# This is what most Python people use to write machine-learning code\n",
    "from sklearn import metrics\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.feature_extraction.text import CountVectorizer"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b4edd5c-ba9a-4cb0-b43b-de4546197c8c",
   "metadata": {},
   "source": [
    "## Train\n",
    "\n",
    "The first step is to train a machine learning model using your \"supervised\" sample of classification decisions. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "5d912490-ab83-4ceb-ad2d-76950d3002c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "training_df = pd.read_csv(\"./sample.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "aa34d88c-a6ff-4c49-9fa3-0d876aa0c8db",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>app</th>\n",
       "      <th>headline</th>\n",
       "      <th>is_politics</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NYTimes</td>\n",
       "      <td>‘Shortcuts Everywhere’: Quality issues have pl...</td>\n",
       "      <td>n</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NYTimes</td>\n",
       "      <td>‘We're Going to Do Our Job’: Speaker Mike John...</td>\n",
       "      <td>y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CBS News</td>\n",
       "      <td>\"60 Minutes\" reports: Here's what the rivalry ...</td>\n",
       "      <td>n</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CBS News</td>\n",
       "      <td>\"Face the Nation\": DHS Secretary Alejandro May...</td>\n",
       "      <td>n</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>CBS News</td>\n",
       "      <td>\"History is watching\": President Biden blames ...</td>\n",
       "      <td>y</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        app                                           headline is_politics\n",
       "0   NYTimes  ‘Shortcuts Everywhere’: Quality issues have pl...           n\n",
       "1   NYTimes  ‘We're Going to Do Our Job’: Speaker Mike John...           y\n",
       "2  CBS News  \"60 Minutes\" reports: Here's what the rivalry ...           n\n",
       "3  CBS News  \"Face the Nation\": DHS Secretary Alejandro May...           n\n",
       "4  CBS News  \"History is watching\": President Biden blames ...           y"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "training_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "513f25d1-81ed-44a5-828a-0772ea38ae6a",
   "metadata": {},
   "source": [
    "Then you have to create a \"vectorizer\" to convert your text into numbers, so that the machine-learning system can read the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a7eecb53-4b85-4013-85fe-3dd460e2110b",
   "metadata": {},
   "outputs": [],
   "source": [
    "vectorizer = CountVectorizer(\n",
    "    # Notice all these weird configuration options.\n",
    "    # Yes, you probably have to learn what they are and how to tune them.\n",
    "    # Yes, they are annoying as hell.\n",
    "    min_df=0.1,\n",
    "    max_df=0.9,\n",
    "    ngram_range=(1, 8),\n",
    "    analyzer=\"char\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d5d6abb-10e8-46eb-9b0c-bfbf53a8ac4f",
   "metadata": {},
   "source": [
    "Then you create a special function to pull the classification column out of your sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "7f20f252-e1ba-4691-8c52-b6190cf78a73",
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocessor = ColumnTransformer(\n",
    "    transformers=[\n",
    "        # Our vectorizer goes here...\n",
    "        ('headline', vectorizer, 'headline')\n",
    "    ],\n",
    "    # Here are some other super annoying configuration options\n",
    "    # These took me hours of study and fucking around to figure out.\n",
    "    # I still barely understand them.\n",
    "    sparse_threshold=0,  # Always return a dense array; GaussianNB can't take sparse input\n",
    "    remainder='drop'  # Drop any columns not specified\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ce74a24-bf31-4c6f-915a-e4b2aab49e37",
   "metadata": {},
   "source": [
    "Now you combine your preprocesser with a machine learning algorithm into a \"pipeline\" that will make classification decisions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "9df492df-352e-45ae-b845-a91f9a410024",
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = Pipeline([\n",
    "    ('preprocessor', preprocessor),\n",
    "    # We'll use a \"Gaussian Naive Bayesian\" one. Exciting, right?\n",
    "    # Guess what? There are a dozen other algos you can try, each giving slightly different results depending on your data.\n",
    "    # They're not fun.\n",
    "    ('classifier', GaussianNB())\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f421f9a0-210f-4c34-aa96-a99c59d3bc50",
   "metadata": {},
   "source": [
    "Now we divide up our training sample into a chunk for training and chunk for testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "2612503e-d8ac-4e3f-9e2d-8a42b85b98b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "train, test, train_labels, test_labels = train_test_split(\n",
    "    training_df[['headline']],  # Feed it the sample's input headlines\n",
    "    training_df['is_politics'],  # Feed it the sample's output classifications\n",
    "    test_size=0.33,  # Set the size of the sample to hold aside for testing\n",
    "    random_state=42  # Remember Douglas Adams and Jackie Robinson\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b502bbc-7730-4eb7-9270-7932b5189dfe",
   "metadata": {},
   "source": [
    "Train the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "bdea74db-6da0-4321-859f-32472b8c38ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = pipeline.fit(train, train_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "878dc0e9-e488-4791-a22a-ea0971eb0c3f",
   "metadata": {},
   "source": [
    "Test how well our model performs by having it guess against our the test records held aside"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "e000b03e-f2fb-4334-a9dd-81484f6f1c0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "preds = model.predict(test)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9337331-f0e8-44bd-8c49-50c056d266a1",
   "metadata": {},
   "source": [
    "Give us a report card on how it did"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "df187ba7-8dc2-4fea-9166-2df15e030a46",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "           n       0.79      0.98      0.87        46\n",
      "           y       0.89      0.40      0.55        20\n",
      "\n",
      "    accuracy                           0.80        66\n",
      "   macro avg       0.84      0.69      0.71        66\n",
      "weighted avg       0.82      0.80      0.78        66\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(metrics.classification_report(test_labels, preds))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
 }
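
A note on reading the report card in the last cell: the sketch below recomputes precision, recall, F1 and accuracy by hand. The four counts are not in the notebook. They are an assumption, inferred as one set of whole numbers consistent with the rounded report (20 politics headlines, 46 others), and the helper name `scores` is made up for illustration.

```python
def scores(hits, false_alarms, misses):
    """Return (precision, recall, f1) for one class from raw counts."""
    precision = hits / (hits + false_alarms)  # of everything flagged, how much was right
    recall = hits / (hits + misses)           # of everything that should be flagged, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Assumed counts, inferred from the rounded report rather than taken from the notebook
politics_found = 8     # "y" headlines the model labeled "y"
politics_missed = 12   # "y" headlines the model labeled "n"
false_politics = 1     # "n" headlines the model labeled "y"
other_found = 45       # "n" headlines the model labeled "n"

y_precision, y_recall, y_f1 = scores(politics_found, false_politics, politics_missed)
# For the "n" class the roles swap: a missed "y" is a false alarm for "n"
n_precision, n_recall, n_f1 = scores(other_found, politics_missed, false_politics)

total = politics_found + politics_missed + false_politics + other_found
accuracy = (politics_found + other_found) / total

print(f"y: precision={y_precision:.2f} recall={y_recall:.2f} f1={y_f1:.2f}")
print(f"n: precision={n_precision:.2f} recall={n_recall:.2f} f1={n_f1:.2f}")
print(f"accuracy={accuracy:.2f} on {total} test records")
```

Under those assumed counts, the 0.80 accuracy hides that the model would be missing more than half of the politics headlines, which is what the 0.40 recall on the `y` row is reporting.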