@palewire
Created October 15, 2024 16:25
"Old School" Machine Learning Classifier
{
"cells": [
{
"cell_type": "markdown",
"id": "8300886a-a2c3-46d4-848c-67d48d5abd2b",
"metadata": {},
"source": [
"# \"Old School\" Machine Learning Classifier\n",
"\n",
"An example of how things were traditionally done"
]
},
{
"cell_type": "markdown",
"id": "d780d8a0-bc76-49c4-8ed9-39345ac87149",
"metadata": {},
"source": [
"## Import Python tools"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "9ab4e129-fc5d-4f9b-b09f-df79dbbb61e5",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Here's a bunch of machine-learning crap from a module called \"scikit-learn\"\n",
"# This is what most Python people use to write machine-learning code\n",
"from sklearn import metrics\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.feature_extraction.text import CountVectorizer"
]
},
{
"cell_type": "markdown",
"id": "5b4edd5c-ba9a-4cb0-b43b-de4546197c8c",
"metadata": {},
"source": [
"## Train\n",
"\n",
"The first step is to train a machine learning model using your \"supervised\" sample of classification decisions. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "5d912490-ab83-4ceb-ad2d-76950d3002c0",
"metadata": {},
"outputs": [],
"source": [
"training_df = pd.read_csv(\"./sample.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "aa34d88c-a6ff-4c49-9fa3-0d876aa0c8db",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>app</th>\n",
" <th>headline</th>\n",
" <th>is_politics</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NYTimes</td>\n",
" <td>‘Shortcuts Everywhere’: Quality issues have pl...</td>\n",
" <td>n</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NYTimes</td>\n",
" <td>‘We're Going to Do Our Job’: Speaker Mike John...</td>\n",
" <td>y</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>CBS News</td>\n",
" <td>\"60 Minutes\" reports: Here's what the rivalry ...</td>\n",
" <td>n</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>CBS News</td>\n",
" <td>\"Face the Nation\": DHS Secretary Alejandro May...</td>\n",
" <td>n</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>CBS News</td>\n",
" <td>\"History is watching\": President Biden blames ...</td>\n",
" <td>y</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" app headline is_politics\n",
"0 NYTimes ‘Shortcuts Everywhere’: Quality issues have pl... n\n",
"1 NYTimes ‘We're Going to Do Our Job’: Speaker Mike John... y\n",
"2 CBS News \"60 Minutes\" reports: Here's what the rivalry ... n\n",
"3 CBS News \"Face the Nation\": DHS Secretary Alejandro May... n\n",
"4 CBS News \"History is watching\": President Biden blames ... y"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_df.head()"
]
},
{
"cell_type": "markdown",
"id": "513f25d1-81ed-44a5-828a-0772ea38ae6a",
"metadata": {},
"source": [
"Then you have to create a \"vectorizer\" to convert your text into numbers, so that the machine-learning system can read the data."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a7eecb53-4b85-4013-85fe-3dd460e2110b",
"metadata": {},
"outputs": [],
"source": [
"vectorizer = CountVectorizer(\n",
" # Notice all these weird configuration options.\n",
" # Yes, you probably have to learn what they are and how to tune them.\n",
" # Yes, they are annoying as hell.\n",
" min_df=0.1,\n",
" max_df=0.9,\n",
" ngram_range=(1, 8),\n",
" analyzer=\"char\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5d5d6abb-10e8-46eb-9b0c-bfbf53a8ac4f",
"metadata": {},
"source": [
"Then you create a special function to pull the classification column out of your sample"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "7f20f252-e1ba-4691-8c52-b6190cf78a73",
"metadata": {},
"outputs": [],
"source": [
"preprocessor = ColumnTransformer(\n",
" transformers=[\n",
" # Our vectorizer goes here...\n",
" ('headline', vectorizer, 'headline')\n",
" ],\n",
" # Here are some other super annoying configuration options\n",
" # These took me hours of study and fucking around to figure out.\n",
" # I still barely understand them.\n",
" sparse_threshold=0,\n",
" remainder='drop' # Drop any columns not specified\n",
")"
]
},
{
"cell_type": "markdown",
"id": "7ce74a24-bf31-4c6f-915a-e4b2aab49e37",
"metadata": {},
"source": [
"Now you combine your preprocesser with a machine learning algorithm into a \"pipeline\" that will make classification decisions"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "9df492df-352e-45ae-b845-a91f9a410024",
"metadata": {},
"outputs": [],
"source": [
"pipeline = Pipeline([\n",
" ('preprocessor', preprocessor),\n",
" # We'll use a \"Gaussian Naive Bayesian\" one. Exciting, right?\n",
" # Guess what? There are a dozen other algos you can try, each giving slightly different results depending on your data.\n",
" # They're not fun.\n",
" ('classifier', GaussianNB())\n",
"])"
]
},
{
"cell_type": "markdown",
"id": "f421f9a0-210f-4c34-aa96-a99c59d3bc50",
"metadata": {},
"source": [
"Now we divide up our training sample into a chunk for training and chunk for testing."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "2612503e-d8ac-4e3f-9e2d-8a42b85b98b8",
"metadata": {},
"outputs": [],
"source": [
"train, test, train_labels, test_labels = train_test_split(\n",
" training_df[['headline']], # Feed it the sample's input headlines\n",
" training_df['is_politics'], # Feed it the sample's output classifications\n",
" test_size=0.33, # Set the size of the sample to hold aside for testing\n",
" random_state=42 # Remember Douglas Adams and Jackie Robinson\n",
")"
]
},
{
"cell_type": "markdown",
"id": "2b502bbc-7730-4eb7-9270-7932b5189dfe",
"metadata": {},
"source": [
"Train the model"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "bdea74db-6da0-4321-859f-32472b8c38ac",
"metadata": {},
"outputs": [],
"source": [
"model = pipeline.fit(train, train_labels)"
]
},
{
"cell_type": "markdown",
"id": "878dc0e9-e488-4791-a22a-ea0971eb0c3f",
"metadata": {},
"source": [
"Test how well our model performs by having it guess against our the test records held aside"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "e000b03e-f2fb-4334-a9dd-81484f6f1c0e",
"metadata": {},
"outputs": [],
"source": [
"preds = model.predict(test)"
]
},
{
"cell_type": "markdown",
"id": "b9337331-f0e8-44bd-8c49-50c056d266a1",
"metadata": {},
"source": [
"Give us a report card on how it did"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "df187ba7-8dc2-4fea-9166-2df15e030a46",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" n 0.79 0.98 0.87 46\n",
" y 0.89 0.40 0.55 20\n",
"\n",
" accuracy 0.80 66\n",
" macro avg 0.84 0.69 0.71 66\n",
"weighted avg 0.82 0.80 0.78 66\n",
"\n"
]
}
],
"source": [
"print(metrics.classification_report(test_labels, preds))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
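The report card printed by the notebook's last cell can be checked by hand. Here is a minimal sketch in plain Python, assuming the confusion counts below. They are not in the notebook; they are inferred from the printed supports (46 `n`, 20 `y`) and the rounded scores, so treat them as an illustration. It shows how the precision, recall and f1-score for the `y` class, plus the overall accuracy, come out of four counts.

```python
# Hypothetical confusion counts for the "y" (politics) class.
# These are inferred from the printed report, not taken from the notebook.
true_pos = 8    # politics headlines the model labeled "y"
false_neg = 12  # politics headlines the model labeled "n"
false_pos = 1   # non-politics headlines the model labeled "y"
true_neg = 45   # non-politics headlines the model labeled "n"


def precision(tp: int, fp: int) -> float:
    """Share of the model's "y" guesses that were correct."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of the real "y" headlines that the model found."""
    return tp / (tp + fn)


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


p = precision(true_pos, false_pos)
r = recall(true_pos, false_neg)
total = true_pos + false_neg + false_pos + true_neg
accuracy = (true_pos + true_neg) / total

print(f"precision={p:.2f} recall={r:.2f} f1={f1(p, r):.2f} accuracy={accuracy:.2f}")
```

If those counts are right, the 0.40 recall means the model missed 12 of the 20 politics headlines in the test set, even though it was rarely wrong when it did say `y`.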