Created
October 15, 2024 16:25
"Old School" Machine Learning Classifier
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"id": "8300886a-a2c3-46d4-848c-67d48d5abd2b", | |
"metadata": {}, | |
"source": [ | |
"# \"Old School\" Machine Learning Classifier\n", | |
"\n", | |
"An example of how text classification was traditionally done with scikit-learn" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "d780d8a0-bc76-49c4-8ed9-39345ac87149", | |
"metadata": {}, | |
"source": [ | |
"## Import Python tools" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"id": "9ab4e129-fc5d-4f9b-b09f-df79dbbb61e5", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd\n", | |
"\n", | |
"# Here's a bunch of machine-learning crap from a module called \"scikit-learn\"\n", | |
"# This is what most Python people use to write machine-learning code\n", | |
"from sklearn import metrics\n", | |
"from sklearn.pipeline import Pipeline\n", | |
"from sklearn.naive_bayes import GaussianNB\n", | |
"from sklearn.compose import ColumnTransformer\n", | |
"from sklearn.model_selection import train_test_split\n", | |
"from sklearn.feature_extraction.text import CountVectorizer" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "5b4edd5c-ba9a-4cb0-b43b-de4546197c8c", | |
"metadata": {}, | |
"source": [ | |
"## Train\n", | |
"\n", | |
"The first step is to train a machine-learning model on a \"supervised\" sample: a set of headlines that already carry a human classification decision." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"id": "5d912490-ab83-4ceb-ad2d-76950d3002c0", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"training_df = pd.read_csv(\"./sample.csv\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"id": "aa34d88c-a6ff-4c49-9fa3-0d876aa0c8db", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>app</th>\n", | |
" <th>headline</th>\n", | |
" <th>is_politics</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>NYTimes</td>\n", | |
" <td>‘Shortcuts Everywhere’: Quality issues have pl...</td>\n", | |
" <td>n</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>NYTimes</td>\n", | |
" <td>‘We're Going to Do Our Job’: Speaker Mike John...</td>\n", | |
" <td>y</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>CBS News</td>\n", | |
" <td>\"60 Minutes\" reports: Here's what the rivalry ...</td>\n", | |
" <td>n</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>CBS News</td>\n", | |
" <td>\"Face the Nation\": DHS Secretary Alejandro May...</td>\n", | |
" <td>n</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>CBS News</td>\n", | |
" <td>\"History is watching\": President Biden blames ...</td>\n", | |
" <td>y</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" app headline is_politics\n", | |
"0 NYTimes ‘Shortcuts Everywhere’: Quality issues have pl... n\n", | |
"1 NYTimes ‘We're Going to Do Our Job’: Speaker Mike John... y\n", | |
"2 CBS News \"60 Minutes\" reports: Here's what the rivalry ... n\n", | |
"3 CBS News \"Face the Nation\": DHS Secretary Alejandro May... n\n", | |
"4 CBS News \"History is watching\": President Biden blames ... y" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"training_df.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "513f25d1-81ed-44a5-828a-0772ea38ae6a", | |
"metadata": {}, | |
"source": [ | |
"Then you have to create a \"vectorizer\" to convert your text into numbers, so that the machine-learning system can read the data. This one counts how often each run of one to eight characters appears in a headline." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"id": "a7eecb53-4b85-4013-85fe-3dd460e2110b", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"vectorizer = CountVectorizer(\n", | |
" # Notice all these weird configuration options.\n", | |
" # Yes, you probably have to learn what they are and how to tune them.\n", | |
" # Yes, they are annoying as hell.\n", | |
" min_df=0.1, # Ignore n-grams found in fewer than 10% of headlines\n", | |
" max_df=0.9, # Ignore n-grams found in more than 90% of headlines\n", | |
" ngram_range=(1, 8), # Count sequences from 1 to 8 units long\n", | |
" analyzer=\"char\" # Treat single characters, not words, as the units\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "5d5d6abb-10e8-46eb-9b0c-bfbf53a8ac4f", | |
"metadata": {}, | |
"source": [ | |
"Then you create a \"preprocessor\" that runs the vectorizer on the headline column of your sample and drops every other column" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"id": "7f20f252-e1ba-4691-8c52-b6190cf78a73", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"preprocessor = ColumnTransformer(\n", | |
" transformers=[\n", | |
" # Our vectorizer goes here...\n", | |
" ('headline', vectorizer, 'headline')\n", | |
" ],\n", | |
" # Here are some other super annoying configuration options\n", | |
" # These took me hours of study and fucking around to figure out.\n", | |
" # I still barely understand them.\n", | |
" sparse_threshold=0, # Always return a dense array, which GaussianNB needs\n", | |
" remainder='drop' # Drop any columns not specified\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "7ce74a24-bf31-4c6f-915a-e4b2aab49e37", | |
"metadata": {}, | |
"source": [ | |
"Now you combine your preprocessor with a machine-learning algorithm into a \"pipeline\" that will make classification decisions" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"id": "9df492df-352e-45ae-b845-a91f9a410024", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"pipeline = Pipeline([\n", | |
" ('preprocessor', preprocessor),\n", | |
" # We'll use a \"Gaussian Naive Bayes\" one. Exciting, right?\n", | |
" # Guess what? There are a dozen other algos you can try, each giving slightly different results depending on your data.\n", | |
" # They're not fun.\n", | |
" ('classifier', GaussianNB())\n", | |
"])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f421f9a0-210f-4c34-aa96-a99c59d3bc50", | |
"metadata": {}, | |
"source": [ | |
"Now we divide up our training sample into a chunk for training and a chunk for testing." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"id": "2612503e-d8ac-4e3f-9e2d-8a42b85b98b8", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"train, test, train_labels, test_labels = train_test_split(\n", | |
" training_df[['headline']], # Feed it the sample's input headlines\n", | |
" training_df['is_politics'], # Feed it the sample's output classifications\n", | |
" test_size=0.33, # Hold aside 33% of the sample for testing\n", | |
" random_state=42 # Remember Douglas Adams and Jackie Robinson\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "2b502bbc-7730-4eb7-9270-7932b5189dfe", | |
"metadata": {}, | |
"source": [ | |
"Train the model" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"id": "bdea74db-6da0-4321-859f-32472b8c38ac", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"model = pipeline.fit(train, train_labels)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "878dc0e9-e488-4791-a22a-ea0971eb0c3f", | |
"metadata": {}, | |
"source": [ | |
"Test how well our model performs by having it classify the test records we held aside" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"id": "e000b03e-f2fb-4334-a9dd-81484f6f1c0e", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"preds = model.predict(test)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "b9337331-f0e8-44bd-8c49-50c056d266a1", | |
"metadata": {}, | |
"source": [ | |
"Give us a report card on how it did" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"id": "df187ba7-8dc2-4fea-9166-2df15e030a46", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" precision recall f1-score support\n", | |
"\n", | |
" n 0.79 0.98 0.87 46\n", | |
" y 0.89 0.40 0.55 20\n", | |
"\n", | |
" accuracy 0.80 66\n", | |
" macro avg 0.84 0.69 0.71 66\n", | |
"weighted avg 0.82 0.80 0.78 66\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"print(metrics.classification_report(test_labels, preds))" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.11.9" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
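
The report card printed in the notebook's last cell can be unpacked with plain arithmetic. The sketch below is not part of the original notebook. Its confusion counts are an assumption, back-calculated from the rounded figures in the report (46 "n" and 20 "y" test headlines), and the `report` helper is a hypothetical name introduced only for illustration. It shows how precision, recall, F1 and accuracy come from those counts.

```python
# Confusion counts keyed by (true label, predicted label).
# ASSUMPTION: inferred from the rounded report in the notebook, not printed by it.
counts = {
    ("n", "n"): 45,  # non-politics headlines correctly labeled "n"
    ("n", "y"): 1,   # non-politics headlines wrongly labeled "y"
    ("y", "n"): 12,  # politics headlines the model missed
    ("y", "y"): 8,   # politics headlines correctly labeled "y"
}


def report(counts, labels=("n", "y")):
    """Compute per-label precision, recall, F1 and support, plus overall accuracy."""
    rows = {}
    for label in labels:
        true_positives = counts[(label, label)]
        # Everything the model predicted as this label, right or wrong
        predicted = sum(counts[(true, label)] for true in labels)
        # Everything that really carries this label
        support = sum(counts[(label, pred)] for pred in labels)
        precision = true_positives / predicted
        recall = true_positives / support
        f1 = 2 * precision * recall / (precision + recall)
        rows[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    correct = sum(counts[(label, label)] for label in labels)
    accuracy = correct / sum(counts.values())
    return rows, accuracy


rows, accuracy = report(counts)
for label, row in rows.items():
    print(
        f"{label}  precision={row['precision']:.2f}  recall={row['recall']:.2f}  "
        f"f1={row['f1']:.2f}  support={row['support']}"
    )
print(f"accuracy={accuracy:.2f}")
```

Read this way, the weak spot is recall on "y": under these inferred counts the model finds only 8 of the 20 politics headlines, even though it is rarely wrong when it does say "y".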