Last active
June 21, 2023 06:29
Python companion to the work described here : https://mazieres.gitlab.io/namograph/index.html
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Namograph\n", | |
"\n", | |
"## Large-scale diversity estimates through surname origin inference\n", | |
"\n", | |
"by [Antoine Mazières](https://mazieres.gitlab.io/) and [Camille Roth](http://camilleroth.eu/).\n", | |
"\n", | |
"*First published in 2018. Last edited on June 21st, 2023 using Python 3.9.6.*\n", | |
"\n", | |
"Python companion to the paper « [Large-scale diversity estimation through surname origin inference](https://hal.archives-ouvertes.fr/hal-01766665) » published in the [*Bulletin of Sociological Methodology*](https://journals.sagepub.com/doi/abs/10.1177/0759106318778828). This notebook does not try to show everything that was attempted; it walks through a simple version of what has worked best so far.\n", | |
"\n", | |
"**To simply use this method on your own data**, check [this notebook](https://gist.github.com/mazieres/a04830b959aee40b3b790017154641d2)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import json\n", | |
"import re\n", | |
"from collections import defaultdict, Counter\n", | |
"from math import log\n", | |
"import time\n", | |
"import gzip\n", | |
"import os\n", | |
"import pickle\n", | |
"import joblib\n", | |
"\n", | |
"from glob import glob\n", | |
"from string import ascii_uppercase, ascii_letters\n", | |
"\n", | |
"import pandas as pd\n", | |
"import numpy as np\n", | |
"\n", | |
"from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer, CountVectorizer\n", | |
"from sklearn.model_selection import train_test_split\n", | |
"from sklearn.naive_bayes import MultinomialNB\n", | |
"from sklearn.metrics import f1_score, classification_report, precision_recall_fscore_support, accuracy_score\n", | |
"from sklearn.model_selection import learning_curve\n", | |
"from sklearn.model_selection import ShuffleSplit\n", | |
"\n", | |
"from scipy import cluster\n", | |
"from scipy.cluster.hierarchy import linkage, dendrogram, to_tree\n", | |
"from scipy.spatial.distance import pdist, squareform, euclidean\n", | |
"from scipy.stats.mstats import gmean\n", | |
"\n", | |
"import matplotlib.pyplot as plt\n", | |
"import matplotlib\n", | |
"\n", | |
"matplotlib.style.use('ggplot')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"reverse_clean_letters = {\n", | |
"    \"O\": ['Ó', 'Õ', 'О', 'Ō', 'Ő', 'Ö', 'Ό', 'Ô', 'Ò', 'Ø'],\n", | |
"    \"E\": ['Ě', 'Ê', 'Ȩ', 'Ē', 'Ę', 'È', 'Ĕ', 'Ë', 'É', 'Ė'],\n", | |
"    \"AE\": ['Æ'],\n", | |
"    \"L\": ['Ł', 'Ľ'],\n", | |
"    \"I\": ['Î', 'Ï', 'İ', 'Ī', 'Ì', 'Í'],\n", | |
"    \"U\": ['Ů', 'Ū', 'Ù', 'Ű', 'Ú', 'Û', 'Ü'],\n", | |
"    \"R\": ['Ř'],\n", | |
"    \"C\": ['Ç', 'Ć', 'Č'],\n", | |
"    \"A\": ['Ą', 'Ä', 'Ã', 'Ӓ', 'Å', 'Â', 'À', 'Á', 'Ă'],\n", | |
"    \"T\": ['Ț', 'Т', 'Ť'],\n", | |
"    \"Z\": ['Ź', 'Ż', 'Ž'],\n", | |
"    \"N\": ['Ń', 'Ň', 'Ñ'],\n", | |
"    \"S\": ['Š', 'Ş', 'Ș', 'Ś'],\n", | |
"    \"D\": ['Đ', 'Ď', 'Ð'],\n", | |
"    \"Y\": ['У', 'Ý'],\n", | |
"    \"G\": ['Ğ'],\n", | |
"    \"K\": ['К'],\n", | |
"    \"B\": ['Β'],\n", | |
"    \"'\": ['ʾ']\n", | |
"}\n", | |
"\n", | |
"# Invert the mapping: accented or look-alike letter -> plain ASCII replacement\n", | |
"clean_letters = {}\n", | |
"for k, v in reverse_clean_letters.items():\n", | |
"    for lett in v:\n", | |
"        clean_letters[lett] = k\n", | |
"\n", | |
"def clean_name(name):\n", | |
"    name = name.replace('\\\\', '')\n", | |
"    name = name.replace('•', '')\n", | |
"\n", | |
"    clean = []\n", | |
"    for lett in name.upper():\n", | |
"        if lett in clean_letters:\n", | |
"            lett = clean_letters[lett]\n", | |
"            clean.append(lett)\n", | |
"            continue\n", | |
"        if lett not in ascii_uppercase+\"- '\":\n", | |
"            print(\"clean_name Error:\", lett, name.upper())\n", | |
"            return None\n", | |
"        clean.append(lett)\n", | |
"    return ''.join(clean)\n", | |
"\n", | |
"def flatten(li):\n", | |
"    return [item for sublist in li for item in sublist]\n", | |
"\n", | |
"def chunks(li, n):\n", | |
"    # Yield successive slices of at most n items (iterative, so long lists do not hit the recursion limit)\n", | |
"    for i in range(0, len(li), n):\n", | |
"        yield li[i:i + n]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Data\n", | |
"\n", | |
"To train our model, we used all [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/) data as of March 2017 with the query `1800:2020[dp]`, yielding `26,941,850` records. From each record we extracted all affiliations related to one of the 176 countries present in the [Natural Earth](http://www.naturalearthdata.com/downloads/110m-cultural-vectors/) dataset. This returned `25,628,999` tuples, each pairing an author's surname with the country of affiliation.\n", | |
"\n", | |
"```bash\n", | |
"$ wget -c 'https://api.nakala.fr/data/10.34847/nkl.da5f62h8/bb0efe9a9bb47176bd5951c91d42f9865ab55972' -O pubmed_name_country.csv.gz\n", | |
"$ gzip -d pubmed_name_country.csv.gz\n", | |
"$ gshuf -n5 pubmed_name_country.csv \n", | |
"SETO;Japan\n", | |
"SIMUNOVIC;France\n", | |
"IACOBELLI;Italy\n", | |
"SASAKI;Japan\n", | |
"SCHOEHN;France\n", | |
"```\n", | |
"\n", | |
"From this extraction, we build a matrix $RAW$ indexed by the unique names encountered in the PubMed dataset, with one column per country in which they have been encountered. When an author's surname is composed of two surnames, they are separated by a dash; in this case we count each of the two names separately, assigning to both the country of the original composed surname. In the code below, the threshold keeps only countries with more than 1000 name occurrences in the dataset." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def handle_dashes(name):\n", | |
"    if name.count('-') > 1:\n", | |
"        return [name.replace('-', ' ')]\n", | |
"    li = [y for y in [x.strip(\" -'\") for x in name.split('-')] if len(y) > 1 and not re.fullmatch(\"[A-Z] [A-Z]\", y)]\n", | |
"    if len(li) == 0:\n", | |
"        return None\n", | |
"    # Handle the dash in many Arabic surnames so that they are not treated as composed surnames.\n", | |
"    if li[0] in ['AL', 'EL']:\n", | |
"        return [name.replace('-', ' ')]\n", | |
"    return li\n", | |
"\n", | |
"def load_raw_dataset():\n", | |
"    names_by_countries = defaultdict(lambda: defaultdict(int))\n", | |
"    with open(\"pubmed_name_country.csv\") as fi:\n", | |
"        for line in fi:\n", | |
"            line = line.strip()\n", | |
"            if line == '':\n", | |
"                continue\n", | |
"            name, country = [x.strip(\" -'\") for x in line.split(';')]\n", | |
"            # Getting rid of one-letter names\n", | |
"            if len(name) < 2:\n", | |
"                continue\n", | |
"            # Getting rid of names longer than 30 letters\n", | |
"            if len(name) > 30:\n", | |
"                continue\n", | |
"            # Getting rid of one-letter-space-one-letter names\n", | |
"            two_weird = re.fullmatch(\"[A-Z] [A-Z]\", name)\n", | |
"            if two_weird:\n", | |
"                continue\n", | |
"            # Handling composed surnames\n", | |
"            if '-' in name:\n", | |
"                nn = handle_dashes(name)\n", | |
"                if nn is None or len(nn) == 0: continue\n", | |
"                for n in nn:\n", | |
"                    if len(n) < 2 or len(n) > 30: continue\n", | |
"                    names_by_countries[country][n] += 1\n", | |
"            else:\n", | |
"                names_by_countries[country][name] += 1\n", | |
"    # Keep only countries whose names total more than 1000 occurrences\n", | |
"    thres = {k:v for k, v in names_by_countries.items() if sum(v.values()) > 1000}\n", | |
"    df = pd.DataFrame(thres)\n", | |
"    return df" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_raw = load_raw_dataset()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can look at the number of occurrences of a name per country:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"United States 4.0\n", | |
"France 105.0\n", | |
"Malaysia 1.0\n", | |
"Sweden 2.0\n", | |
"Name: MAZIERES, dtype: float64" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_raw.T['MAZIERES'][df_raw.T['MAZIERES'] > 0]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Normalization\n", | |
"\n", | |
"The PubMed dataset is very biased because, among other things:\n", | |
"\n", | |
"+ We can assume that the scientific population it represents is more nomadic than average.\n", | |
"+ Academic publications strongly over-represent richer countries (e.g. North America) and under-represent poorer ones (e.g. Africa).\n", | |
"+ The data source (PubMed) itself probably indexes English-language publications better and over a longer period.\n", | |
"\n", | |
"To counterbalance these biases, we created the matrix $NORM$ by normalizing the data so that a name's weight in a country depends on the total number of observations made in this country." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_norm = df_raw / df_raw.sum(axis=0)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"United States 5.893463e-07\n", | |
"France 9.312746e-05\n", | |
"Malaysia 1.495640e-05\n", | |
"Sweden 5.399335e-06\n", | |
"Name: MAZIERES, dtype: float64" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_norm.T['MAZIERES'][df_norm.T['MAZIERES'] > 0]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"To represent consistently how a name is distributed among several countries, we created a matrix $NORMNORM$ in which, for every name, the country weights from $NORM$ are rescaled so that they sum to one." | |
] | |
}, | |
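{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a compact restatement of the two normalization steps (the indices $n$ for a name and $c$ for a country are notation introduced here for illustration, not taken from the paper):\n", | |
"\n", | |
"$$NORM_{n,c} = \\frac{RAW_{n,c}}{\\sum_{n'} RAW_{n',c}}, \\qquad NORMNORM_{n,c} = \\frac{NORM_{n,c}}{\\sum_{c'} NORM_{n,c'}}$$\n", | |
"\n", | |
"For `MAZIERES`, the four non-zero $NORM$ weights shown above sum to about $1.1407 \\times 10^{-4}$, so the French share is roughly $9.3127 \\times 10^{-5} / (1.1407 \\times 10^{-4}) \\approx 0.816$." | |
] | |
}, | |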
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_norm_norm = (df_norm.T / df_norm.sum(axis=1)).T" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"United States 0.005166\n", | |
"France 0.816388\n", | |
"Malaysia 0.131113\n", | |
"Sweden 0.047332\n", | |
"Name: MAZIERES, dtype: float64" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_norm_norm.T['MAZIERES'][df_norm_norm.T['MAZIERES'] > 0]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Even assuming that this normalization has reduced the biases inherited from PubMed, the data still carries many biases for a study of surname origins. Indeed, if surnames were usually frozen by various administrative policies a few centuries back (500 years in Europe, a few thousand in Asia), all migration flows since then bias the reconstruction of the geographical origin of a surname, i.e. where a surname was first coined.\n", | |
"\n", | |
"While it may not be possible to infer the origin of all names in the dataset, we can still presume that names whose presence is highly concentrated in one country have a pretty good chance of coming from this country. To measure such concentration we used the [Herfindahl-Hirschman index](https://en.wikipedia.org/wiki/Herfindahl_index) (hhi)." | |
] | |
}, | |
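{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Written out, as a sketch in notation introduced here ($s_{n,c}$ is the share of name $n$ in country $c$, i.e. its $NORMNORM$ entry, and $N$ is the number of countries with a non-missing value for that name), the index and its normalized variant are:\n", | |
"\n", | |
"$$H_n = \\sum_{c} s_{n,c}^{2}, \\qquad H_n^{*} = \\frac{H_n - 1/N}{1 - 1/N}$$\n", | |
"\n", | |
"$H_n$ ranges from $1/N$ (a name spread evenly over $N$ countries) to $1$ (a name seen in a single country). For `MAZIERES`, the shares shown above give $H \\approx 0.005166^2 + 0.816388^2 + 0.131113^2 + 0.047332^2 \\approx 0.686$." | |
] | |
}, | |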
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def hhi(v, norm=True):\n", | |
"    # Ignore countries where the name was never observed (NaN)\n", | |
"    v = v[~v.isnull()]\n", | |
"    h = np.square(v).sum()\n", | |
"    if norm:\n", | |
"        n = len(v)\n", | |
"        if n == 1:\n", | |
"            return 1\n", | |
"        return (h - 1 / n) / (1 - 1 / n)\n", | |
"    return h" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_hhi = df_norm_norm.T.apply(hhi, norm=False)\n", | |
"df_hhi.name = 'hhi'\n", | |
"df_max = df_norm_norm.idxmax(axis=1)\n", | |
"df_max.name = 'max_line'\n", | |
"df_max_freq = df_norm.max(axis=1)\n", | |
"df_max_freq.name = 'freq'\n", | |
"hhifreq = pd.concat([df_max, df_hhi, df_max_freq], axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The matrix $HHIFREQ$ shows, for each surname, its hhi score, the country most represented in $NORM$ and the corresponding relative frequency of the name in this country. For example:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>max_line</th>\n", | |
" <th>hhi</th>\n", | |
" <th>freq</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>HULMAN</th>\n", | |
" <td>Botswana</td>\n", | |
" <td>0.375878</td>\n", | |
" <td>0.000543</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>HUDEC</th>\n", | |
" <td>Slovakia</td>\n", | |
" <td>0.855060</td>\n", | |
" <td>0.000777</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>GASPAROVIC</th>\n", | |
" <td>Croatia</td>\n", | |
" <td>0.790233</td>\n", | |
" <td>0.001955</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>BENA</th>\n", | |
" <td>Slovakia</td>\n", | |
" <td>0.331380</td>\n", | |
" <td>0.000194</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>ARTEMIOU</th>\n", | |
" <td>Slovakia</td>\n", | |
" <td>0.833605</td>\n", | |
" <td>0.000291</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" max_line hhi freq\n", | |
"HULMAN Botswana 0.375878 0.000543\n", | |
"HUDEC Slovakia 0.855060 0.000777\n", | |
"GASPAROVIC Croatia 0.790233 0.001955\n", | |
"BENA Slovakia 0.331380 0.000194\n", | |
"ARTEMIOU Slovakia 0.833605 0.000291" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"hhifreq.head(5)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This allows us to apply a new threshold, selecting surnames that are highly geographically concentrated (hhi > 0.8) and that reach a minimum relative frequency in their main country." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_thres = hhifreq[(hhifreq['hhi'] > 0.8) & (hhifreq['freq'] > 0.000001)]['max_line']" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"HUDEC Slovakia\n", | |
"ARTEMIOU Slovakia\n", | |
"GAZDIKOVA Slovakia\n", | |
"CAPRNDA Slovakia\n", | |
"HOJEROVA Slovakia\n", | |
"Name: max_line, dtype: object" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_thres.head(5)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Clustering\n", | |
"\n", | |
"The assumption that the origin of a name is predictable from the information contained in the name itself (frequencies of letters and of letter combinations, i.e. its \"consonance\") somewhat overlaps with the idea that names in our dataset should have relatively similar consonances when they are geographically close.\n", | |
"\n", | |
"To verify this statement, we used the n-gram decomposition of the names to build a feature vector for each country present in our dataset. The matrix $NG$ has one row per country and one column per n-gram; in the code below, each entry is the count of that n-gram among the country's names divided by the total n-gram count of that country." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def ngrams(seq, depth='max'):\n", | |
"    seq = \"^\" + seq + \"$\"\n", | |
"    if depth == 'max':\n", | |
"        depth = len(seq)\n", | |
"    elif not isinstance(depth, int):\n", | |
"        print(\"ngrams:Depth must be an int.\")\n", | |
"        return False\n", | |
"    res = []\n", | |
"    step = 1\n", | |
"    while step <= depth:\n", | |
"        i = 0\n", | |
"        while i < len(seq) and i+step <= len(seq):\n", | |
"            res.append(seq[i:i+step])\n", | |
"            i += 1\n", | |
"        step += 1\n", | |
"    res.remove(\"^\")\n", | |
"    res.remove(\"$\")\n", | |
"    if \" \" in res:\n", | |
"        res.remove(\" \")\n", | |
"    return res" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here is an example of the n-gram decomposition of the name 'ROTH'. We systematically added markers for the beginning (`^`) and end (`$`) of the string, in order to keep this positional information." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"['R', 'O', 'T', 'H', '^R', 'RO', 'OT', 'TH', 'H$', '^RO', 'ROT', 'OTH', 'TH$']" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"ngrams('ROTH', depth=3)" | |
] | |
}, | |
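{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a quick sanity check on the size of this decomposition (a sketch, with $L$ and $d$ as notation introduced here): a name of $L$ letters without spaces is wrapped into $L+2$ characters, which yields $L+3-k$ substrings of length $k$; the two lone markers `^` and `$` are then dropped. For a depth $d \\le L+2$, the number of n-grams returned is therefore:\n", | |
"\n", | |
"$$\\sum_{k=1}^{d} (L + 3 - k) - 2$$\n", | |
"\n", | |
"For 'ROTH' ($L = 4$, $d = 3$) this gives $(6 + 5 + 4) - 2 = 13$, which matches the 13 items listed above." | |
] | |
}, | |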
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"ngrams_by_countries = defaultdict(lambda: defaultdict(int))\n", | |
"for name, country in zip(df_thres.index, df_thres.values):\n", | |
"    for ng in ngrams(name, depth=4):\n", | |
"        ngrams_by_countries[country][ng] += 1\n", | |
"df_ng = pd.DataFrame(ngrams_by_countries).fillna(0.0).T\n", | |
"df_ng = (df_ng.T / df_ng.sum(axis=1)).T" | |
] | |
}, | |
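{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In symbols, the cell above can be read as follows (a sketch; $k_{c,g}$ is notation introduced here for the number of occurrences of n-gram $g$, up to length 4, among the retained names assigned to country $c$):\n", | |
"\n", | |
"$$NG_{c,g} = \\frac{k_{c,g}}{\\sum_{g'} k_{c,g'}}$$\n", | |
"\n", | |
"Each row of $NG$ thus sums to one, so countries with very different numbers of names remain comparable. The hierarchical clustering that follows compares countries through the Euclidean distance between these rows, which is the default metric of `pdist`." | |
] | |
}, | |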
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Figure size 640x480 with 0 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 1500x500 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"plt.clf()\n", | |
"plt.rcParams['axes.facecolor']='white'\n", | |
"plt.rcParams['savefig.facecolor']='white'\n", | |
"matplotlib.rcParams['lines.linewidth'] = 0.5\n", | |
"L = linkage(pdist(df_ng.values), method=\"ward\")\n", | |
"plt.figure(figsize=(15,5), facecolor='white')\n", | |
"plt.grid(visible=False)\n", | |
"dend = dendrogram(L, labels=[\"{}\".format(x) for x in df_ng.index], orientation='top', color_threshold=0.4*max(L[:,2]))\n", | |
"# Recent matplotlib versions expect booleans here, not the strings 'on'/'off'\n", | |
"plt.tick_params(\n", | |
"    labelsize=5,\n", | |
"    axis='x',\n", | |
"    which='both',\n", | |
"    bottom=False,\n", | |
"    top=False,\n", | |
"    labelbottom=True)\n", | |
"plt.tight_layout()\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"By applying hierarchical clustering on $NG$ we can see that countries group particularly well along major geographical/cultural divisions, such as, from right to left along the index of the hclust plot:\n", | |
"\n", | |
"+ Sub-Saharan African\n", | |
"+ Asian\n", | |
"+ Indian subcontinental\n", | |
"+ Arabian\n", | |
"+ Slavic\n", | |
"+ North European\n", | |
"+ Central and South European\n", | |
"\n", | |
"However, under this geographical interpretation of the clustering, a few classification errors emerge. These countries should either be reclassified (Philippines, Japan, Indonesia -> Asia; Ethiopia -> Sub-Saharan Africa) or deleted if they represent a low number of observations (Papua New Guinea: 385; Madagascar: 664; Jamaica: 346; Chad: 282; Armenia: 482)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"clusters = {\n", | |
"    \"African\": [\"Zimbabwe\",\"Rwanda\",\"Zambia\",\"Malawi\",\"Tanzania\",\"Uganda\",\"Kenya\",\"Dem. Rep. Congo\",\"Congo\",\"South Africa\",\"Gambia\",\"Botswana\",\"Mozambique\",\"Mali\",\"Trinidad and Tobago\",\"Gabon\",\"Cameroon\",\"Benin\",\"Côte d'Ivoire\",\"Burkina Faso\",\"Togo\",\"Senegal\",\"Nigeria\",\"Ghana\",\"Ethiopia\"],\n", | |
"    \"Asian\": [\"Vietnam\",\"China\",\"Thailand\",\"Cambodia\",\"Taiwan\",\"Korea\",\"Lao PDR\",\"Japan\",\"Indonesia\",\"Philippines\"],\n", | |
"    \"Indian\": [\"Nepal\",\"India\",\"Sri Lanka\",\"Mongolia\",\"Pakistan\",\"Malaysia\",\"Bangladesh\",\"Iran\"],\n", | |
"    \"Arabian\": [\"Sudan\",\"Libya\",\"Egypt\",\"Tunisia\",\"Morocco\",\"Algeria\",\"United Arab Emirates\",\"Qatar\",\"Lebanon\",\"Syria\",\"Jordan\",\"Palestine\",\"Saudi Arabia\",\"Kuwait\",\"Iraq\",\"Oman\",\"Yemen\"],\n", | |
"    \"Slavic\": [\"Poland\",\"Macedonia\",\"Ukraine\",\"Belarus\",\"Russia\",\"Kazakhstan\",\"Bulgaria\",\"Slovakia\",\"Czech Rep.\",\"Croatia\",\"Bosnia and Herz.\",\"Serbia\",\"Montenegro\"],\n", | |
"    \"NorthEuropean\": [\"Norway\",\"Denmark\",\"Sweden\",\"Netherlands\",\"Belgium\",\"Germany\",\"Austria\",\"United Kingdom\",\"Australia\",\"Canada\",\"New Zealand\",\"United States\",\"Ireland\",\"Israel\",\"Switzerland\",\"Luxembourg\",\"France\",\"Iceland\"],\n", | |
"    \"CentralSouthEuropean\": [\"Slovenia\",\"Hungary\",\"Turkey\",\"Latvia\",\"Estonia\",\"Finland\",\"Italy\",\"Albania\",\"Romania\",\"Lithuania\",\"Greece\",\"Cyprus\",\"Georgia\",\"Venezuela\",\"Puerto Rico\",\"Costa Rica\",\"Spain\",\"Mexico\",\"Cuba\",\"Colombia\",\"Guatemala\",\"Peru\",\"Chile\",\"Ecuador\",\"Bolivia\",\"Uruguay\",\"Argentina\",\"Panama\",\"Portugal\",\"Brazil\"],\n", | |
"}\n", | |
"\n", | |
"# Invert the mapping: country -> cluster label\n", | |
"reverse_clusters = {}\n", | |
"for k, v in clusters.items():\n", | |
"    for c in v:\n", | |
"        reverse_clusters[c] = k" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Learning\n", | |
"\n", | |
"To learn a model, I tried many classifiers (basically everything that sklearn offers, plus a few others), together with feature selection, etc. In the end, plain naive Bayes works well and fast, while random forest seems to handle tricky names better but with weaker overall performance." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"names, countries = [], []\n", | |
"for k, v in df_thres.items():\n", | |
"    if v not in reverse_clusters:\n", | |
"        continue\n", | |
"    countries.append(reverse_clusters[v])\n", | |
"    names.append(\"^{}$\".format(k))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 4))\n", | |
"Xs_train, Xs_test, y_train, y_test = train_test_split(names, countries, test_size=0.15, random_state=42)\n", | |
"X_train = vectorizer.fit_transform(Xs_train)\n", | |
"X_test = vectorizer.transform(Xs_test)\n", | |
"clf = MultinomialNB(alpha=0.01, fit_prior=True)\n", | |
"clf.fit(X_train, y_train)\n", | |
"y_pred = clf.predict(X_test)" | |
] | |
}, | |
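{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"For reference, here is the decision rule of a multinomial naive Bayes classifier as described in the scikit-learn documentation (the symbols are notation for this sketch, not taken from the paper). For a name with n-gram counts $x_g$, the predicted class is:\n", | |
"\n", | |
"$$\\hat{y} = \\arg\\max_{y} \\left[ \\log P(y) + \\sum_{g} x_g \\log \\theta_{y,g} \\right], \\qquad \\theta_{y,g} = \\frac{N_{y,g} + \\alpha}{N_y + \\alpha \\lvert V \\rvert}$$\n", | |
"\n", | |
"Here $N_{y,g}$ is the count of n-gram $g$ in the training names of class $y$, $N_y = \\sum_g N_{y,g}$, $\\lvert V \\rvert$ is the number of distinct n-grams, and $\\alpha = 0.01$ is the smoothing parameter set above. With `fit_prior=True`, $P(y)$ is estimated from the class frequencies in the training set." | |
] | |
}, | |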
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Checking performance**" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"prfs = precision_recall_fscore_support(y_test, y_pred)\n", | |
"precision, recall, f1, support = (pd.Series({clf.classes_[idx]:val for idx, val in enumerate(x)}) for x in prfs)\n", | |
"precision.name = 'precision'\n", | |
"recall.name = 'recall'\n", | |
"f1.name = 'f1'\n", | |
"support.name = 'support'\n", | |
"prfs_df = pd.concat([precision, recall, f1, support], axis=1).sort_values('f1', ascending=False)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>precision</th>\n", | |
" <th>recall</th>\n", | |
" <th>f1</th>\n", | |
" <th>support</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>CentralSouthEuropean</th>\n", | |
" <td>0.805153</td>\n", | |
" <td>0.709759</td>\n", | |
" <td>0.754453</td>\n", | |
" <td>28528</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Slavic</th>\n", | |
" <td>0.634462</td>\n", | |
" <td>0.830888</td>\n", | |
" <td>0.719509</td>\n", | |
" <td>9745</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>NorthEuropean</th>\n", | |
" <td>0.771076</td>\n", | |
" <td>0.618756</td>\n", | |
" <td>0.686569</td>\n", | |
" <td>32150</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Asian</th>\n", | |
" <td>0.613152</td>\n", | |
" <td>0.775492</td>\n", | |
" <td>0.684833</td>\n", | |
" <td>6757</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Indian</th>\n", | |
" <td>0.630236</td>\n", | |
" <td>0.720121</td>\n", | |
" <td>0.672187</td>\n", | |
" <td>10258</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Arabian</th>\n", | |
" <td>0.525581</td>\n", | |
" <td>0.715945</td>\n", | |
" <td>0.606169</td>\n", | |
" <td>4735</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>African</th>\n", | |
" <td>0.442462</td>\n", | |
" <td>0.605092</td>\n", | |
" <td>0.511153</td>\n", | |
" <td>4753</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" precision recall f1 support\n", | |
"CentralSouthEuropean 0.805153 0.709759 0.754453 28528\n", | |
"Slavic 0.634462 0.830888 0.719509 9745\n", | |
"NorthEuropean 0.771076 0.618756 0.686569 32150\n", | |
"Asian 0.613152 0.775492 0.684833 6757\n", | |
"Indian 0.630236 0.720121 0.672187 10258\n", | |
"Arabian 0.525581 0.715945 0.606169 4735\n", | |
"African 0.442462 0.605092 0.511153 4753" | |
] | |
}, | |
"execution_count": 23, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"prfs_df" | |
] | |
}, | |
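{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a reminder of how these columns relate (standard definitions, with $TP$, $FP$ and $FN$ the per-class true positives, false positives and false negatives on the test set):\n", | |
"\n", | |
"$$P = \\frac{TP}{TP + FP}, \\qquad R = \\frac{TP}{TP + FN}, \\qquad F_1 = \\frac{2 P R}{P + R}$$\n", | |
"\n", | |
"Checking the first row: $2 \\times 0.805153 \\times 0.709759 / (0.805153 + 0.709759) \\approx 0.7545$, in line with the reported f1. The support column is the number of test names whose true label is that class." | |
] | |
}, | |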
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Figure size 640x480 with 0 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 700x500 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"plt.clf()\n", | |
"matplotlib.style.use('ggplot')\n", | |
"plt.figure(figsize=(7,5), facecolor='white')\n", | |
"plt.xlabel('Precision', fontsize=20, labelpad=20, color='black')\n", | |
"plt.ylabel('Recall', fontsize=20, labelpad=20, color='black')\n", | |
"plt.axis([0.0, 1.1, 0.0, 1.1])\n", | |
"for _, row in prfs_df[['precision', 'recall', 'support']].iterrows():\n", | |
"    l, p, r, s = row.name, row['precision'], row['recall'], row['support']\n", | |
"    plt.text(p, r, l, fontsize=log(s)*1.2)\n", | |
"plt.xlim((0, 1))\n", | |
"plt.ylim((0, 1))\n", | |
"plt.xticks(fontsize=15)\n", | |
"plt.yticks(fontsize=15)\n", | |
"plt.tight_layout()\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"That's it, you've got yourself a classifier! However, we used additional steps to improve it, which I'll demonstrate using our data." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Correcting the model inferred distributions\n", | |
"\n", | |
"Let's load the raw counts yielded by our crawler and the confusion matrix for recall scores. Download them here:\n", | |
"\n", | |
"```bash\n", | |
"wget -c 'https://api.nakala.fr/data/10.34847/nkl.da5f62h8/41fb60ec24e7f07494e84aface7ee7745865f147?content-disposition=attachment' -O raw_count_from_crawler.csv\n", | |
"wget -c 'https://api.nakala.fr/data/10.34847/nkl.da5f62h8/c6d0ff34cc2c6cb80ea3d6ddd40fd1a1e6323952?content-disposition=attachment' -O conf_mat_recall.csv\n", | |
"```" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>African</th>\n", | |
" <th>Arabian</th>\n", | |
" <th>Asian</th>\n", | |
" <th>CentralSouthEuropean</th>\n", | |
" <th>Indian</th>\n", | |
" <th>NorthEuropean</th>\n", | |
" <th>Slavic</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>assemblee</th>\n", | |
" <td>290.5</td>\n", | |
" <td>335.0</td>\n", | |
" <td>218.0</td>\n", | |
" <td>1566.0</td>\n", | |
" <td>193.0</td>\n", | |
" <td>5599.0</td>\n", | |
" <td>124.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>avocats</th>\n", | |
" <td>1573.0</td>\n", | |
" <td>2698.0</td>\n", | |
" <td>1035.5</td>\n", | |
" <td>5745.5</td>\n", | |
" <td>1391.5</td>\n", | |
" <td>18564.5</td>\n", | |
" <td>1009.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>bac_pro</th>\n", | |
" <td>4208.5</td>\n", | |
" <td>6689.5</td>\n", | |
" <td>2424.5</td>\n", | |
" <td>16297.5</td>\n", | |
" <td>2862.5</td>\n", | |
" <td>46416.0</td>\n", | |
" <td>1773.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>baccalaureat</th>\n", | |
" <td>20969.0</td>\n", | |
" <td>33511.5</td>\n", | |
" <td>14223.0</td>\n", | |
" <td>90545.5</td>\n", | |
" <td>15485.0</td>\n", | |
" <td>250686.5</td>\n", | |
" <td>10220.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>bep</th>\n", | |
" <td>6548.5</td>\n", | |
" <td>11282.0</td>\n", | |
" <td>3784.0</td>\n", | |
" <td>24626.0</td>\n", | |
" <td>4659.0</td>\n", | |
" <td>63108.0</td>\n", | |
" <td>2803.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>brevet</th>\n", | |
" <td>26964.5</td>\n", | |
" <td>46639.5</td>\n", | |
" <td>17321.0</td>\n", | |
" <td>116343.0</td>\n", | |
" <td>19360.0</td>\n", | |
" <td>321461.0</td>\n", | |
" <td>14859.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>bts</th>\n", | |
" <td>4093.5</td>\n", | |
" <td>6609.0</td>\n", | |
" <td>2820.0</td>\n", | |
" <td>18397.0</td>\n", | |
" <td>3011.5</td>\n", | |
" <td>50964.5</td>\n", | |
" <td>2020.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>cap</th>\n", | |
" <td>4832.5</td>\n", | |
" <td>7510.0</td>\n", | |
" <td>2964.5</td>\n", | |
" <td>21310.0</td>\n", | |
" <td>3338.5</td>\n", | |
" <td>56124.5</td>\n", | |
" <td>2283.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>cnrs</th>\n", | |
" <td>451.5</td>\n", | |
" <td>624.5</td>\n", | |
" <td>456.0</td>\n", | |
" <td>2873.5</td>\n", | |
" <td>435.5</td>\n", | |
" <td>7337.5</td>\n", | |
" <td>478.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>comptables</th>\n", | |
" <td>993.0</td>\n", | |
" <td>1211.5</td>\n", | |
" <td>628.5</td>\n", | |
" <td>4106.0</td>\n", | |
" <td>681.5</td>\n", | |
" <td>12952.0</td>\n", | |
" <td>372.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>maires</th>\n", | |
" <td>1409.5</td>\n", | |
" <td>1067.0</td>\n", | |
" <td>944.0</td>\n", | |
" <td>6694.0</td>\n", | |
" <td>872.0</td>\n", | |
" <td>25110.0</td>\n", | |
" <td>526.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>pharmaciens</th>\n", | |
" <td>3219.5</td>\n", | |
" <td>4268.5</td>\n", | |
" <td>2545.0</td>\n", | |
" <td>13767.5</td>\n", | |
" <td>2371.0</td>\n", | |
" <td>45899.5</td>\n", | |
" <td>1351.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>senateurs</th>\n", | |
" <td>55.0</td>\n", | |
" <td>80.5</td>\n", | |
" <td>48.0</td>\n", | |
" <td>315.0</td>\n", | |
" <td>66.0</td>\n", | |
" <td>1127.0</td>\n", | |
" <td>18.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>veterinaires</th>\n", | |
" <td>622.0</td>\n", | |
" <td>564.0</td>\n", | |
" <td>453.5</td>\n", | |
" <td>2845.0</td>\n", | |
" <td>419.5</td>\n", | |
" <td>10512.5</td>\n", | |
" <td>290.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>x</th>\n", | |
" <td>960.0</td>\n", | |
" <td>1606.0</td>\n", | |
" <td>1080.0</td>\n", | |
" <td>4256.0</td>\n", | |
" <td>764.0</td>\n", | |
" <td>13819.0</td>\n", | |
" <td>570.0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" African Arabian Asian CentralSouthEuropean Indian \n", | |
"assemblee 290.5 335.0 218.0 1566.0 193.0 \\\n", | |
"avocats 1573.0 2698.0 1035.5 5745.5 1391.5 \n", | |
"bac_pro 4208.5 6689.5 2424.5 16297.5 2862.5 \n", | |
"baccalaureat 20969.0 33511.5 14223.0 90545.5 15485.0 \n", | |
"bep 6548.5 11282.0 3784.0 24626.0 4659.0 \n", | |
"brevet 26964.5 46639.5 17321.0 116343.0 19360.0 \n", | |
"bts 4093.5 6609.0 2820.0 18397.0 3011.5 \n", | |
"cap 4832.5 7510.0 2964.5 21310.0 3338.5 \n", | |
"cnrs 451.5 624.5 456.0 2873.5 435.5 \n", | |
"comptables 993.0 1211.5 628.5 4106.0 681.5 \n", | |
"maires 1409.5 1067.0 944.0 6694.0 872.0 \n", | |
"pharmaciens 3219.5 4268.5 2545.0 13767.5 2371.0 \n", | |
"senateurs 55.0 80.5 48.0 315.0 66.0 \n", | |
"veterinaires 622.0 564.0 453.5 2845.0 419.5 \n", | |
"x 960.0 1606.0 1080.0 4256.0 764.0 \n", | |
"\n", | |
" NorthEuropean Slavic \n", | |
"assemblee 5599.0 124.5 \n", | |
"avocats 18564.5 1009.0 \n", | |
"bac_pro 46416.0 1773.5 \n", | |
"baccalaureat 250686.5 10220.5 \n", | |
"bep 63108.0 2803.5 \n", | |
"brevet 321461.0 14859.0 \n", | |
"bts 50964.5 2020.5 \n", | |
"cap 56124.5 2283.0 \n", | |
"cnrs 7337.5 478.5 \n", | |
"comptables 12952.0 372.5 \n", | |
"maires 25110.0 526.5 \n", | |
"pharmaciens 45899.5 1351.0 \n", | |
"senateurs 1127.0 18.5 \n", | |
"veterinaires 10512.5 290.5 \n", | |
"x 13819.0 570.0 " | |
] | |
}, | |
"execution_count": 25, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"count = pd.read_csv('raw_count_from_crawler.csv', index_col=0)\n", | |
"count" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>African</th>\n", | |
" <th>Arabian</th>\n", | |
" <th>Asian</th>\n", | |
" <th>CentralSouthEuropean</th>\n", | |
" <th>Indian</th>\n", | |
" <th>NorthEuropean</th>\n", | |
" <th>Slavic</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>African</th>\n", | |
" <td>2763</td>\n", | |
" <td>165</td>\n", | |
" <td>381</td>\n", | |
" <td>1081</td>\n", | |
" <td>460</td>\n", | |
" <td>1441</td>\n", | |
" <td>157</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Arabian</th>\n", | |
" <td>159</td>\n", | |
" <td>3292</td>\n", | |
" <td>84</td>\n", | |
" <td>577</td>\n", | |
" <td>598</td>\n", | |
" <td>1549</td>\n", | |
" <td>77</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Asian</th>\n", | |
" <td>319</td>\n", | |
" <td>113</td>\n", | |
" <td>5200</td>\n", | |
" <td>831</td>\n", | |
" <td>716</td>\n", | |
" <td>1147</td>\n", | |
" <td>174</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>CentralSouthEuropean</th>\n", | |
" <td>258</td>\n", | |
" <td>128</td>\n", | |
" <td>274</td>\n", | |
" <td>20364</td>\n", | |
" <td>299</td>\n", | |
" <td>3535</td>\n", | |
" <td>324</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Indian</th>\n", | |
" <td>273</td>\n", | |
" <td>487</td>\n", | |
" <td>420</td>\n", | |
" <td>991</td>\n", | |
" <td>7226</td>\n", | |
" <td>1862</td>\n", | |
" <td>191</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>NorthEuropean</th>\n", | |
" <td>643</td>\n", | |
" <td>351</td>\n", | |
" <td>315</td>\n", | |
" <td>3254</td>\n", | |
" <td>609</td>\n", | |
" <td>20183</td>\n", | |
" <td>670</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Slavic</th>\n", | |
" <td>114</td>\n", | |
" <td>60</td>\n", | |
" <td>80</td>\n", | |
" <td>1570</td>\n", | |
" <td>159</td>\n", | |
" <td>2752</td>\n", | |
" <td>8250</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" African Arabian Asian CentralSouthEuropean Indian \n", | |
"African 2763 165 381 1081 460 \\\n", | |
"Arabian 159 3292 84 577 598 \n", | |
"Asian 319 113 5200 831 716 \n", | |
"CentralSouthEuropean 258 128 274 20364 299 \n", | |
"Indian 273 487 420 991 7226 \n", | |
"NorthEuropean 643 351 315 3254 609 \n", | |
"Slavic 114 60 80 1570 159 \n", | |
"\n", | |
" NorthEuropean Slavic \n", | |
"African 1441 157 \n", | |
"Arabian 1549 77 \n", | |
"Asian 1147 174 \n", | |
"CentralSouthEuropean 3535 324 \n", | |
"Indian 1862 191 \n", | |
"NorthEuropean 20183 670 \n", | |
"Slavic 2752 8250 " | |
] | |
}, | |
"execution_count": 26, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"purerawR = pd.read_csv('conf_mat_recall.csv', index_col=0)\n", | |
"purerawR" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>African</th>\n", | |
" <th>Arabian</th>\n", | |
" <th>Asian</th>\n", | |
" <th>CentralSouthEuropean</th>\n", | |
" <th>Indian</th>\n", | |
" <th>NorthEuropean</th>\n", | |
" <th>Slavic</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>African</th>\n", | |
" <td>132.3</td>\n", | |
" <td>13.7</td>\n", | |
" <td>11.7</td>\n", | |
" <td>223.4</td>\n", | |
" <td>15.8</td>\n", | |
" <td>822.9</td>\n", | |
" <td>4.1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Arabian</th>\n", | |
" <td>7.6</td>\n", | |
" <td>272.7</td>\n", | |
" <td>2.6</td>\n", | |
" <td>119.2</td>\n", | |
" <td>20.6</td>\n", | |
" <td>884.5</td>\n", | |
" <td>2.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Asian</th>\n", | |
" <td>15.3</td>\n", | |
" <td>9.4</td>\n", | |
" <td>160.0</td>\n", | |
" <td>171.7</td>\n", | |
" <td>24.6</td>\n", | |
" <td>655.0</td>\n", | |
" <td>4.6</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>CentralSouthEuropean</th>\n", | |
" <td>12.4</td>\n", | |
" <td>10.6</td>\n", | |
" <td>8.4</td>\n", | |
" <td>4208.6</td>\n", | |
" <td>10.3</td>\n", | |
" <td>2018.6</td>\n", | |
" <td>8.6</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Indian</th>\n", | |
" <td>13.1</td>\n", | |
" <td>40.3</td>\n", | |
" <td>12.9</td>\n", | |
" <td>204.8</td>\n", | |
" <td>248.5</td>\n", | |
" <td>1063.3</td>\n", | |
" <td>5.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>NorthEuropean</th>\n", | |
" <td>30.8</td>\n", | |
" <td>29.1</td>\n", | |
" <td>9.7</td>\n", | |
" <td>672.5</td>\n", | |
" <td>20.9</td>\n", | |
" <td>11525.1</td>\n", | |
" <td>17.7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Slavic</th>\n", | |
" <td>5.5</td>\n", | |
" <td>5.0</td>\n", | |
" <td>2.5</td>\n", | |
" <td>324.5</td>\n", | |
" <td>5.5</td>\n", | |
" <td>1571.5</td>\n", | |
" <td>217.8</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" African Arabian Asian CentralSouthEuropean Indian \n", | |
"African 132.3 13.7 11.7 223.4 15.8 \\\n", | |
"Arabian 7.6 272.7 2.6 119.2 20.6 \n", | |
"Asian 15.3 9.4 160.0 171.7 24.6 \n", | |
"CentralSouthEuropean 12.4 10.6 8.4 4208.6 10.3 \n", | |
"Indian 13.1 40.3 12.9 204.8 248.5 \n", | |
"NorthEuropean 30.8 29.1 9.7 672.5 20.9 \n", | |
"Slavic 5.5 5.0 2.5 324.5 5.5 \n", | |
"\n", | |
" NorthEuropean Slavic \n", | |
"African 822.9 4.1 \n", | |
"Arabian 884.5 2.0 \n", | |
"Asian 655.0 4.6 \n", | |
"CentralSouthEuropean 2018.6 8.6 \n", | |
"Indian 1063.3 5.0 \n", | |
"NorthEuropean 11525.1 17.7 \n", | |
"Slavic 1571.5 217.8 " | |
] | |
}, | |
"execution_count": 27, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Compute the corrected confusion matrix: weight each column of the raw\n", | |
"# confusion matrix by the proportions of the uncorrected classification\n", | |
"# produced by the original classifier on the Brevet dataset. This corrects\n", | |
"# for the PubMed population breakdown, which diverges significantly from the\n", | |
"# Brevet counts (see endnote 4 in the originally submitted manuscript).\n", | |
"norm_brevet = count.T['brevet'] / count.T['brevet'].sum()\n", | |
"rawR = purerawR * norm_brevet\n", | |
"rawR.round(1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>African</th>\n", | |
" <th>Arabian</th>\n", | |
" <th>Asian</th>\n", | |
" <th>CentralSouthEuropean</th>\n", | |
" <th>Indian</th>\n", | |
" <th>NorthEuropean</th>\n", | |
" <th>Slavic</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>African</th>\n", | |
" <td>0.108128</td>\n", | |
" <td>0.011169</td>\n", | |
" <td>0.009578</td>\n", | |
" <td>0.182528</td>\n", | |
" <td>0.012925</td>\n", | |
" <td>0.672288</td>\n", | |
" <td>0.003386</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Arabian</th>\n", | |
" <td>0.005817</td>\n", | |
" <td>0.208306</td>\n", | |
" <td>0.001974</td>\n", | |
" <td>0.091076</td>\n", | |
" <td>0.015707</td>\n", | |
" <td>0.675567</td>\n", | |
" <td>0.001552</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Asian</th>\n", | |
" <td>0.014684</td>\n", | |
" <td>0.008997</td>\n", | |
" <td>0.153758</td>\n", | |
" <td>0.165045</td>\n", | |
" <td>0.023664</td>\n", | |
" <td>0.629439</td>\n", | |
" <td>0.004414</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>CentralSouthEuropean</th>\n", | |
" <td>0.001969</td>\n", | |
" <td>0.001689</td>\n", | |
" <td>0.001343</td>\n", | |
" <td>0.670433</td>\n", | |
" <td>0.001638</td>\n", | |
" <td>0.321566</td>\n", | |
" <td>0.001362</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Indian</th>\n", | |
" <td>0.008235</td>\n", | |
" <td>0.025408</td>\n", | |
" <td>0.008138</td>\n", | |
" <td>0.128975</td>\n", | |
" <td>0.156493</td>\n", | |
" <td>0.669576</td>\n", | |
" <td>0.003175</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>NorthEuropean</th>\n", | |
" <td>0.002503</td>\n", | |
" <td>0.002363</td>\n", | |
" <td>0.000788</td>\n", | |
" <td>0.054649</td>\n", | |
" <td>0.001702</td>\n", | |
" <td>0.936559</td>\n", | |
" <td>0.001437</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Slavic</th>\n", | |
" <td>0.002561</td>\n", | |
" <td>0.002332</td>\n", | |
" <td>0.001155</td>\n", | |
" <td>0.152185</td>\n", | |
" <td>0.002565</td>\n", | |
" <td>0.737068</td>\n", | |
" <td>0.102135</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" African Arabian Asian CentralSouthEuropean \n", | |
"African 0.108128 0.011169 0.009578 0.182528 \\\n", | |
"Arabian 0.005817 0.208306 0.001974 0.091076 \n", | |
"Asian 0.014684 0.008997 0.153758 0.165045 \n", | |
"CentralSouthEuropean 0.001969 0.001689 0.001343 0.670433 \n", | |
"Indian 0.008235 0.025408 0.008138 0.128975 \n", | |
"NorthEuropean 0.002503 0.002363 0.000788 0.054649 \n", | |
"Slavic 0.002561 0.002332 0.001155 0.152185 \n", | |
"\n", | |
" Indian NorthEuropean Slavic \n", | |
"African 0.012925 0.672288 0.003386 \n", | |
"Arabian 0.015707 0.675567 0.001552 \n", | |
"Asian 0.023664 0.629439 0.004414 \n", | |
"CentralSouthEuropean 0.001638 0.321566 0.001362 \n", | |
"Indian 0.156493 0.669576 0.003175 \n", | |
"NorthEuropean 0.001702 0.936559 0.001437 \n", | |
"Slavic 0.002565 0.737068 0.102135 " | |
] | |
}, | |
"execution_count": 28, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Matrix for Multinomial Bayesian inference corrections\n", | |
"corrector = (rawR.T / rawR.T.sum()).T\n", | |
"corrector" | |
] | |
}, | |
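{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def redress(dataset):\n", | |
"    # Redistribute the raw counts of `dataset` across origins: each raw count\n", | |
"    # is spread over the matching row of `corrector`, then summed per origin.\n", | |
"    return (count.T[dataset] * corrector.T).sum(axis=1)\n", | |
"\n", | |
"def redressratio(dataset, refdataset=\"brevet\"):\n", | |
"    # Ratio of the corrected origin shares of `dataset` to those of the\n", | |
"    # reference dataset (1.0 means the same share as in the reference).\n", | |
"    ref_dataset_count = redress(refdataset)\n", | |
"    ref_dataset_norm = ref_dataset_count / ref_dataset_count.sum()\n", | |
"    dataset_count = redress(dataset)\n", | |
"    dataset_norm = dataset_count / dataset_count.sum()\n", | |
"    return dataset_norm / ref_dataset_norm" | |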
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>African</th>\n", | |
" <th>Arabian</th>\n", | |
" <th>Asian</th>\n", | |
" <th>CentralSouthEuropean</th>\n", | |
" <th>Indian</th>\n", | |
" <th>NorthEuropean</th>\n", | |
" <th>Slavic</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>assemblee</th>\n", | |
" <td>0.804085</td>\n", | |
" <td>0.557258</td>\n", | |
" <td>0.849278</td>\n", | |
" <td>0.914253</td>\n", | |
" <td>0.724993</td>\n", | |
" <td>1.043487</td>\n", | |
" <td>0.719838</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>avocats</th>\n", | |
" <td>1.026600</td>\n", | |
" <td>1.026118</td>\n", | |
" <td>1.047989</td>\n", | |
" <td>0.923827</td>\n", | |
" <td>1.157147</td>\n", | |
" <td>1.016126</td>\n", | |
" <td>1.124734</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>bac_pro</th>\n", | |
" <td>1.054336</td>\n", | |
" <td>1.003692</td>\n", | |
" <td>0.989381</td>\n", | |
" <td>0.986271</td>\n", | |
" <td>1.021140</td>\n", | |
" <td>1.003351</td>\n", | |
" <td>0.899460</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>baccalaureat</th>\n", | |
" <td>1.004085</td>\n", | |
" <td>0.943026</td>\n", | |
" <td>1.045487</td>\n", | |
" <td>1.002737</td>\n", | |
" <td>1.014575</td>\n", | |
" <td>1.000612</td>\n", | |
" <td>0.933506</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>bep</th>\n", | |
" <td>1.115252</td>\n", | |
" <td>1.146599</td>\n", | |
" <td>1.059234</td>\n", | |
" <td>1.022322</td>\n", | |
" <td>1.124127</td>\n", | |
" <td>0.986978</td>\n", | |
" <td>0.951642</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>brevet</th>\n", | |
" <td>1.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>bts</th>\n", | |
" <td>0.981552</td>\n", | |
" <td>0.923278</td>\n", | |
" <td>1.027909</td>\n", | |
" <td>1.004736</td>\n", | |
" <td>0.987463</td>\n", | |
" <td>1.001437</td>\n", | |
" <td>0.920695</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>cap</th>\n", | |
" <td>1.011165</td>\n", | |
" <td>0.934855</td>\n", | |
" <td>0.985573</td>\n", | |
" <td>1.028319</td>\n", | |
" <td>0.982563</td>\n", | |
" <td>0.994919</td>\n", | |
" <td>0.925589</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>cnrs</th>\n", | |
" <td>0.837519</td>\n", | |
" <td>0.662508</td>\n", | |
" <td>1.105289</td>\n", | |
" <td>1.057363</td>\n", | |
" <td>0.949007</td>\n", | |
" <td>0.994088</td>\n", | |
" <td>1.262002</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>comptables</th>\n", | |
" <td>0.982021</td>\n", | |
" <td>0.749097</td>\n", | |
" <td>0.972916</td>\n", | |
" <td>0.957006</td>\n", | |
" <td>0.929215</td>\n", | |
" <td>1.020817</td>\n", | |
" <td>0.797968</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>maires</th>\n", | |
" <td>0.845592</td>\n", | |
" <td>0.449110</td>\n", | |
" <td>0.842655</td>\n", | |
" <td>0.897893</td>\n", | |
" <td>0.722153</td>\n", | |
" <td>1.050516</td>\n", | |
" <td>0.707819</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>pharmaciens</th>\n", | |
" <td>0.944041</td>\n", | |
" <td>0.752009</td>\n", | |
" <td>1.078519</td>\n", | |
" <td>0.931259</td>\n", | |
" <td>0.932360</td>\n", | |
" <td>1.026966</td>\n", | |
" <td>0.813930</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>senateurs</th>\n", | |
" <td>0.785748</td>\n", | |
" <td>0.642040</td>\n", | |
" <td>0.909125</td>\n", | |
" <td>0.906563</td>\n", | |
" <td>0.989606</td>\n", | |
" <td>1.040090</td>\n", | |
" <td>0.630272</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>veterinaires</th>\n", | |
" <td>0.869147</td>\n", | |
" <td>0.521220</td>\n", | |
" <td>0.922983</td>\n", | |
" <td>0.899550</td>\n", | |
" <td>0.788853</td>\n", | |
" <td>1.045734</td>\n", | |
" <td>0.807517</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>x</th>\n", | |
" <td>0.939215</td>\n", | |
" <td>0.870620</td>\n", | |
" <td>1.370286</td>\n", | |
" <td>0.933111</td>\n", | |
" <td>0.989820</td>\n", | |
" <td>1.019163</td>\n", | |
" <td>0.968917</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" African Arabian Asian CentralSouthEuropean Indian \n", | |
"assemblee 0.804085 0.557258 0.849278 0.914253 0.724993 \\\n", | |
"avocats 1.026600 1.026118 1.047989 0.923827 1.157147 \n", | |
"bac_pro 1.054336 1.003692 0.989381 0.986271 1.021140 \n", | |
"baccalaureat 1.004085 0.943026 1.045487 1.002737 1.014575 \n", | |
"bep 1.115252 1.146599 1.059234 1.022322 1.124127 \n", | |
"brevet 1.000000 1.000000 1.000000 1.000000 1.000000 \n", | |
"bts 0.981552 0.923278 1.027909 1.004736 0.987463 \n", | |
"cap 1.011165 0.934855 0.985573 1.028319 0.982563 \n", | |
"cnrs 0.837519 0.662508 1.105289 1.057363 0.949007 \n", | |
"comptables 0.982021 0.749097 0.972916 0.957006 0.929215 \n", | |
"maires 0.845592 0.449110 0.842655 0.897893 0.722153 \n", | |
"pharmaciens 0.944041 0.752009 1.078519 0.931259 0.932360 \n", | |
"senateurs 0.785748 0.642040 0.909125 0.906563 0.989606 \n", | |
"veterinaires 0.869147 0.521220 0.922983 0.899550 0.788853 \n", | |
"x 0.939215 0.870620 1.370286 0.933111 0.989820 \n", | |
"\n", | |
" NorthEuropean Slavic \n", | |
"assemblee 1.043487 0.719838 \n", | |
"avocats 1.016126 1.124734 \n", | |
"bac_pro 1.003351 0.899460 \n", | |
"baccalaureat 1.000612 0.933506 \n", | |
"bep 0.986978 0.951642 \n", | |
"brevet 1.000000 1.000000 \n", | |
"bts 1.001437 0.920695 \n", | |
"cap 0.994919 0.925589 \n", | |
"cnrs 0.994088 1.262002 \n", | |
"comptables 1.020817 0.797968 \n", | |
"maires 1.050516 0.707819 \n", | |
"pharmaciens 1.026966 0.813930 \n", | |
"senateurs 1.040090 0.630272 \n", | |
"veterinaires 1.045734 0.807517 \n", | |
"x 1.019163 0.968917 " | |
] | |
}, | |
"execution_count": 30, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"redressratio_df = pd.concat([redressratio(dataset) for dataset in count.index], axis=1, keys=count.index).T\n", | |
"redressratio_df" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"That's it! This is the matrix used to plot Figure 2 in the paper.\n", | |
"\n", | |
"Any questions? Feel free to ask: [email protected]\n", | |
"\n", | |
"EOF" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.9.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
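The correction applied in the last cells of the notebook (weighting the confusion matrix columns by the Brevet proportions, normalising each row, then redistributing raw counts) relies on implicit pandas index alignment. As a rough sketch only, the same arithmetic can be written with plain lists to make explicit which axis is scaled at each step. The two labels, every number, and the names `build_corrector` and `redress_ratio` are made up for illustration and do not come from the notebook or the paper.

```python
from math import isclose

labels = ["A", "B"]
# Toy confusion matrix (made-up numbers), indexed [row][column].
confusion = [[8, 2],
             [1, 9]]
# Toy raw counts per dataset (made-up), one value per label.
counts = {"reference": [30, 70], "sample": [10, 40]}


def build_corrector(confusion, reference_counts):
    """Weight columns by the reference proportions, then normalise rows."""
    total = sum(reference_counts)
    proportions = [c / total for c in reference_counts]
    # Scale column j by the reference proportion of label j ...
    weighted = [[cell * proportions[j] for j, cell in enumerate(row)]
                for row in confusion]
    # ... then normalise each row so that it sums to 1.
    return [[cell / sum(row) for cell in row] for row in weighted]


def redress(raw_counts, corrector):
    """Spread raw count i over row i of the corrector, then sum per column."""
    n = len(raw_counts)
    return [sum(raw_counts[i] * corrector[i][j] for i in range(n))
            for j in range(n)]


def redress_ratio(dataset, reference, corrector):
    """Corrected share of each label in `dataset` relative to `reference`."""
    d = redress(dataset, corrector)
    r = redress(reference, corrector)
    return [(dj / sum(d)) / (rj / sum(r)) for dj, rj in zip(d, r)]


corrector = build_corrector(confusion, counts["reference"])
sample_ratio = redress_ratio(counts["sample"], counts["reference"], corrector)
reference_ratio = redress_ratio(counts["reference"], counts["reference"], corrector)
print(dict(zip(labels, sample_ratio)))
```

Because every row of the corrector sums to 1, redressing a dataset moves counts between labels without changing the dataset total, and the reference dataset compared with itself gives a ratio of 1 for every label, as in the `brevet` row of the notebook's final table.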