Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save manavsehgal/4031dfc7c8da17111bdff5223b87a2c1 to your computer and use it in GitHub Desktop.
Save manavsehgal/4031dfc7c8da17111bdff5223b87a2c1 to your computer and use it in GitHub Desktop.
Wikipedia Acquire Dataset From Table | ManavSehgal.com
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook is explained at [ManavSehgal.com](https://manavsehgal.com) where you can find other useful Notebooks which can speed up your machine learning workflow and projects."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"import pandas as pd\n",
"\n",
"# Use SHIFT+TAB keys to popup inplace code help\n",
"%config IPCompleter.greedy = True\n",
"\n",
"# Output multiple statements from one input cell\n",
"from IPython.core.interactiveshell import InteractiveShell\n",
"InteractiveShell.ast_node_interactivity = \"all\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Customize Notebook\n",
"\n",
"**table_from_top.** If the Wikipedia page has one table then use `table_from_top = 1` value. Otherwise count table number from top and replace value to get specific table.\n",
"\n",
"**wikipedia_page.** Specify the wikipedia page name from where to source dataset. The CSV file will be saved with the same name.\n",
"\n",
"**trace.** Set `trace = True` to trace how feature values are extracted. Does not save extracted dataset. Prefixes applied parsing/extraction rules to extracted values."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"table_from_top = 1\n",
"wikipedia_page = 'List_of_tallest_buildings_and_structures'\n",
"trace = False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load and Parse\n",
"\n",
"This section loads the Wikipedia page and parses the table data we are interested in converting to a dataset."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"wikipedia_url = 'https://en.wikipedia.org/wiki/{}'.format(wikipedia_page)\n",
"page = requests.get(wikipedia_url)\n",
"soup = BeautifulSoup(page.content, 'lxml')\n",
"tables = soup.find_all('table', {'class': 'wikitable'})\n",
"table = tables[table_from_top - 1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quick Preview\n",
"\n",
"This section extracts the table header with feature or column names.\n",
"\n",
"Use this section to quick preview if you have the right table in processing."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Category'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"'Structure'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"'Country'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"'City'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"'Height (metres)'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"'Height (feet)'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"'Year built'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"'Coordinates'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_names = []\n",
"\n",
"header_row = table.find('tr')\n",
"for header in header_row.find_all('th'):\n",
" feature_name = ' '.join(header.find_all(text=True))\n",
" feature_name.replace('\\n', '')\n",
" feature_names.append(feature_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Wrangling\n",
"\n",
"This section applies data wrangling rules based on exceptions found when parsing Wikipedia tables.\n",
"\n",
"- If a feature value contains a link then extract text from the link.\n",
"- Ignore text which starts with `[` square brackets.\n",
"- Ignore image links (...flags) prefix link text.\n",
"- Ignore hidden text used for IDs."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def has_coords(tag):\n",
" if tag.has_attr('class'):\n",
" if tag['class'][0] == 'latitude' or tag['class'][0] == 'longitude':\n",
" return True\n",
" return False\n",
"\n",
"def get_coords(child):\n",
" coords = []\n",
" for coord in child.find_all(has_coords):\n",
" coords.append(coord.string)\n",
" if coords:\n",
" if trace:\n",
" return 'C = {}'.format(' '.join(coords))\n",
" else:\n",
" return ' '.join(coords)\n",
" else:\n",
" return ''\n",
"\n",
"samples = []\n",
"sample_rows = table.find_all('tr')[1:]\n",
"for sample_row in sample_rows:\n",
" features = []\n",
" for feature_col in sample_row.find_all('td'):\n",
" feature_value = ''\n",
" text = feature_col.string\n",
" if text:\n",
" if trace:\n",
" features.append('T = {}'.format(text))\n",
" else:\n",
" features.append(text)\n",
" continue\n",
" \n",
" for child in feature_col.children:\n",
" if child.name == 'span':\n",
" if child.has_attr('class'):\n",
" if child['class'] == 'display:none':\n",
" continue\n",
" if child.find_all(has_coords):\n",
" feature_value = get_coords(child)\n",
" if feature_value:\n",
" break\n",
" else:\n",
" continue\n",
" if child.name == 'sup':\n",
" continue\n",
" if child.name == 'a':\n",
" if child.string[0] == '[':\n",
" continue \n",
" if child.name == 'a':\n",
" if trace:\n",
" feature_value = 'A = {}'.format(child.string)\n",
" else:\n",
" feature_value = child.string\n",
" break\n",
" if child.name == 'font':\n",
" if trace:\n",
" feature_value = 'F = {}'.format(child.string)\n",
" else:\n",
" feature_value = child.string\n",
" break\n",
" try:\n",
" # feature_value = '' for any tags not covered above\n",
" content = child.contents\n",
" except AttributeError:\n",
" # Handle whitespace between child tags, treated as a child string\n",
" if child.isspace():\n",
" continue\n",
" if trace:\n",
" feature_value = 'E = {}'.format(child)\n",
" else:\n",
" feature_value = child\n",
" break\n",
" features.append(feature_value)\n",
" samples.append(dict(zip(feature_names, features)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preview Dataset\n",
"\n",
"This section enables you to preview the parsed dataset."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Category</th>\n",
" <th>City</th>\n",
" <th>Coordinates</th>\n",
" <th>Country</th>\n",
" <th>Height (feet)</th>\n",
" <th>Height (metres)</th>\n",
" <th>Structure</th>\n",
" <th>Year built</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Mixed use</td>\n",
" <td>Dubai</td>\n",
" <td>25°11′50.0″N 55°16′26.6″E</td>\n",
" <td>United Arab Emirates</td>\n",
" <td>2,717</td>\n",
" <td>828.1</td>\n",
" <td>Burj Khalifa</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Self-supporting tower</td>\n",
" <td>Tokyo</td>\n",
" <td>35°42′36.5″N 139°48′39″E</td>\n",
" <td>Japan</td>\n",
" <td>2,080</td>\n",
" <td>634</td>\n",
" <td>Tokyo Skytree</td>\n",
" <td>2011</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Guyed steel lattice mast</td>\n",
" <td>Blanchard, North Dakota</td>\n",
" <td>47°20′32″N 97°17′25″W</td>\n",
" <td>United States</td>\n",
" <td>2,063</td>\n",
" <td>628.8</td>\n",
" <td>KVLY-TV mast</td>\n",
" <td>1963</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Clock building</td>\n",
" <td>Mecca</td>\n",
" <td>21°25′08″N 39°49′35″E</td>\n",
" <td>Saudi Arabia</td>\n",
" <td>1,972</td>\n",
" <td>601</td>\n",
" <td>Abraj Al Bait Towers</td>\n",
" <td>2011</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Office</td>\n",
" <td>New York, NY</td>\n",
" <td>40°42′46.8″N 74°0′48.6″W</td>\n",
" <td>United States</td>\n",
" <td>1,776</td>\n",
" <td>541</td>\n",
" <td>One World Trade Center</td>\n",
" <td>2013</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Category City \\\n",
"0 Mixed use Dubai \n",
"1 Self-supporting tower Tokyo \n",
"2 Guyed steel lattice mast Blanchard, North Dakota \n",
"3 Clock building Mecca \n",
"4 Office New York, NY \n",
"\n",
" Coordinates Country Height (feet) \\\n",
"0 25°11′50.0″N 55°16′26.6″E United Arab Emirates 2,717 \n",
"1 35°42′36.5″N 139°48′39″E Japan 2,080 \n",
"2 47°20′32″N 97°17′25″W United States 2,063 \n",
"3 21°25′08″N 39°49′35″E Saudi Arabia 1,972 \n",
"4 40°42′46.8″N 74°0′48.6″W United States 1,776 \n",
"\n",
" Height (metres) Structure Year built \n",
"0 828.1 Burj Khalifa 2010 \n",
"1 634 Tokyo Skytree 2011 \n",
"2 628.8 KVLY-TV mast 1963 \n",
"3 601 Abraj Al Bait Towers 2011 \n",
"4 541 One World Trade Center 2013 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Category</th>\n",
" <th>City</th>\n",
" <th>Coordinates</th>\n",
" <th>Country</th>\n",
" <th>Height (feet)</th>\n",
" <th>Height (metres)</th>\n",
" <th>Structure</th>\n",
" <th>Year built</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>Aerial tramway</td>\n",
" <td>Kaprun</td>\n",
" <td>47°11′58.62″N 12°41′16.96″E</td>\n",
" <td>Austria</td>\n",
" <td>373</td>\n",
" <td>113.6</td>\n",
" <td>Pillar of</td>\n",
" <td>1966</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47</th>\n",
" <td>Sphere</td>\n",
" <td>Stockholm</td>\n",
" <td>59°17′36.92″N 18°04′58.79″E</td>\n",
" <td>Sweden</td>\n",
" <td>279</td>\n",
" <td>85</td>\n",
" <td>Ericsson Globe</td>\n",
" <td>1989</td>\n",
" </tr>\n",
" <tr>\n",
" <th>48</th>\n",
" <td>Brick</td>\n",
" <td>Genoa</td>\n",
" <td>44°24′16.25″N 8°54′16.67″E</td>\n",
" <td>Italy</td>\n",
" <td>253</td>\n",
" <td>77</td>\n",
" <td>Lighthouse of Genoa</td>\n",
" <td>1543</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49</th>\n",
" <td>Gopuram</td>\n",
" <td>Murudeshwara</td>\n",
" <td>14°05′39″N 74°29′07″E</td>\n",
" <td>India</td>\n",
" <td>249</td>\n",
" <td>76</td>\n",
" <td>Murudeshwara Temple</td>\n",
" <td>2008</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50</th>\n",
" <td>Wooden</td>\n",
" <td>Săpânța</td>\n",
" <td>47°58′59.5″N 23°42′02.5″E</td>\n",
" <td>Romania</td>\n",
" <td>246</td>\n",
" <td>75</td>\n",
" <td>Church of the Săpânța-Peri Monastery</td>\n",
" <td>2003</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Category City Coordinates Country \\\n",
"46 Aerial tramway Kaprun 47°11′58.62″N 12°41′16.96″E Austria \n",
"47 Sphere Stockholm 59°17′36.92″N 18°04′58.79″E Sweden \n",
"48 Brick Genoa 44°24′16.25″N 8°54′16.67″E Italy \n",
"49 Gopuram Murudeshwara 14°05′39″N 74°29′07″E India \n",
"50 Wooden Săpânța 47°58′59.5″N 23°42′02.5″E Romania \n",
"\n",
" Height (feet) Height (metres) Structure \\\n",
"46 373 113.6 Pillar of \n",
"47 279 85 Ericsson Globe \n",
"48 253 77 Lighthouse of Genoa \n",
"49 249 76 Murudeshwara Temple \n",
"50 246 75 Church of the Săpânța-Peri Monastery \n",
"\n",
" Year built \n",
"46 1966 \n",
"47 1989 \n",
"48 1543 \n",
"49 2008 \n",
"50 2003 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(samples)\n",
"df.head()\n",
"df.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save Dataset\n",
"\n",
"We can now save the dataset using the same Wikipedia page name we use earlier to extract the dataset."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"dataset_file_name = '../datasets/wikipedia/{}.csv'.format(wikipedia_page)\n",
"if not trace:\n",
" df.to_csv(dataset_file_name, index=False)"
]
}
],
"metadata": {
"_draft": {
"nbviewer_url": "https://gist.github.com/407853f389ea711c82bd89faf90d0930"
},
"gist": {
"data": {
"description": "wikipedia-acquire-dataset-from-table.ipynb",
"public": true
},
"id": "407853f389ea711c82bd89faf90d0930"
},
"hide_input": false,
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
},
"toc": {
"colors": {
"hover_highlight": "#DAA520",
"running_highlight": "#FF0000",
"selected_highlight": "#FFD700"
},
"moveMenuLeft": true,
"nav_menu": {
"height": "138px",
"width": "252px"
},
"navigate_menu": true,
"number_sections": false,
"sideBar": true,
"threshold": 4,
"toc_cell": true,
"toc_section_display": "block",
"toc_window_display": true,
"widenNotebook": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment