manavsehgal · April 4, 2017 10:14
diff --git a/wikipedia-acquire-dataset-from-table-manavsehgal-com.ipynb b/wikipedia-acquire-dataset-from-table-manavsehgal-com.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is explained at [ManavSehgal.com](https://manavsehgal.com) where you can find other useful Notebooks which can speed up your machine learning workflow and projects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import pandas as pd\n",
    "\n",
    "# Use SHIFT+TAB keys to popup inplace code help\n",
    "%config IPCompleter.greedy = True\n",
    "\n",
    "# Output multiple statements from one input cell\n",
    "from IPython.core.interactiveshell import InteractiveShell\n",
    "InteractiveShell.ast_node_interactivity = \"all\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Customize Notebook\n",
    "\n",
    "**table_from_top.** If the Wikipedia page has one table then use `table_from_top = 1` value. Otherwise count table number from top and replace value to get specific table.\n",
    "\n",
    "**wikipedia_page.** Specify the wikipedia page name from where to source dataset. The CSV file will be saved with the same name.\n",
    "\n",
    "**trace.** Set `trace = True` to trace how feature values are extracted. Does not save extracted dataset. Prefixes applied parsing/extraction rules to extracted values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "table_from_top = 1\n",
    "wikipedia_page = 'List_of_tallest_buildings_and_structures'\n",
    "trace = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load and Parse\n",
    "\n",
    "This section loads the Wikipedia page and parses the table data we are interested in converting to a dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "wikipedia_url = 'https://en.wikipedia.org/wiki/{}'.format(wikipedia_page)\n",
    "page = requests.get(wikipedia_url)\n",
    "soup = BeautifulSoup(page.content, 'lxml')\n",
    "tables = soup.find_all('table', {'class': 'wikitable'})\n",
    "table = tables[table_from_top - 1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quick Preview\n",
    "\n",
    "This section extracts the table header with feature or column names.\n",
    "\n",
    "Use this section to quick preview if you have the right table in processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Category'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'Structure'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'Country'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'City'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'Height (metres)'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'Height (feet)'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'Year built'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'Coordinates'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature_names = []\n",
    "\n",
    "header_row = table.find('tr')\n",
    "for header in header_row.find_all('th'):\n",
    "    feature_name = ' '.join(header.find_all(text=True))\n",
    "    feature_name.replace('\\n', '')\n",
    "    feature_names.append(feature_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Wrangling\n",
    "\n",
    "This section applies data wrangling rules based on exceptions found when parsing Wikipedia tables.\n",
    "\n",
    "- If a feature value contains a link then extract text from the link.\n",
    "- Ignore text which starts with `[` square brackets.\n",
    "- Ignore image links (...flags) prefix link text.\n",
    "- Ignore hidden text used for IDs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def has_coords(tag):\n",
    "    if tag.has_attr('class'):\n",
    "        if tag['class'][0] == 'latitude' or tag['class'][0] == 'longitude':\n",
    "            return True\n",
    "    return False\n",
    "\n",
    "def get_coords(child):\n",
    "    coords = []\n",
    "    for coord in child.find_all(has_coords):\n",
    "        coords.append(coord.string)\n",
    "    if coords:\n",
    "        if trace:\n",
    "            return 'C = {}'.format(' '.join(coords))\n",
    "        else:\n",
    "            return ' '.join(coords)\n",
    "    else:\n",
    "        return ''\n",
    "\n",
    "samples = []\n",
    "sample_rows = table.find_all('tr')[1:]\n",
    "for sample_row in sample_rows:\n",
    "    features = []\n",
    "    for feature_col in sample_row.find_all('td'):\n",
    "        feature_value = ''\n",
    "        text = feature_col.string\n",
    "        if text:\n",
    "            if trace:\n",
    "                features.append('T = {}'.format(text))\n",
    "            else:\n",
    "                features.append(text)\n",
    "            continue\n",
    "        \n",
    "        for child in feature_col.children:\n",
    "            if child.name == 'span':\n",
    "                if child.has_attr('class'):\n",
    "                    if child['class'] == 'display:none':\n",
    "                        continue\n",
    "                if child.find_all(has_coords):\n",
    "                    feature_value = get_coords(child)\n",
    "                    if feature_value:\n",
    "                        break\n",
    "                    else:\n",
    "                        continue\n",
    "            if child.name == 'sup':\n",
    "                continue\n",
    "            if child.name == 'a':\n",
    "                if child.string[0] == '[':\n",
    "                    continue            \n",
    "            if child.name == 'a':\n",
    "                if trace:\n",
    "                    feature_value = 'A = {}'.format(child.string)\n",
    "                else:\n",
    "                    feature_value = child.string\n",
    "                break\n",
    "            if child.name == 'font':\n",
    "                if trace:\n",
    "                    feature_value = 'F = {}'.format(child.string)\n",
    "                else:\n",
    "                    feature_value = child.string\n",
    "                break\n",
    "            try:\n",
    "                # feature_value = '' for any tags not covered above\n",
    "                content = child.contents\n",
    "            except AttributeError:\n",
    "                # Handle whitespace between child tags, treated as a child string\n",
    "                if child.isspace():\n",
    "                    continue\n",
    "                if trace:\n",
    "                    feature_value = 'E = {}'.format(child)\n",
    "                else:\n",
    "                    feature_value = child\n",
    "                break\n",
    "        features.append(feature_value)\n",
    "    samples.append(dict(zip(feature_names, features)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preview Dataset\n",
    "\n",
    "This section enables you to preview the parsed dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Category</th>\n",
       "      <th>City</th>\n",
       "      <th>Coordinates</th>\n",
       "      <th>Country</th>\n",
       "      <th>Height (feet)</th>\n",
       "      <th>Height (metres)</th>\n",
       "      <th>Structure</th>\n",
       "      <th>Year built</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Mixed use</td>\n",
       "      <td>Dubai</td>\n",
       "      <td>25°11′50.0″N 55°16′26.6″E</td>\n",
       "      <td>United Arab Emirates</td>\n",
       "      <td>2,717</td>\n",
       "      <td>828.1</td>\n",
       "      <td>Burj Khalifa</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Self-supporting tower</td>\n",
       "      <td>Tokyo</td>\n",
       "      <td>35°42′36.5″N 139°48′39″E</td>\n",
       "      <td>Japan</td>\n",
       "      <td>2,080</td>\n",
       "      <td>634</td>\n",
       "      <td>Tokyo Skytree</td>\n",
       "      <td>2011</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Guyed steel lattice mast</td>\n",
       "      <td>Blanchard, North Dakota</td>\n",
       "      <td>47°20′32″N 97°17′25″W</td>\n",
       "      <td>United States</td>\n",
       "      <td>2,063</td>\n",
       "      <td>628.8</td>\n",
       "      <td>KVLY-TV mast</td>\n",
       "      <td>1963</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Clock building</td>\n",
       "      <td>Mecca</td>\n",
       "      <td>21°25′08″N 39°49′35″E</td>\n",
       "      <td>Saudi Arabia</td>\n",
       "      <td>1,972</td>\n",
       "      <td>601</td>\n",
       "      <td>Abraj Al Bait Towers</td>\n",
       "      <td>2011</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Office</td>\n",
       "      <td>New York, NY</td>\n",
       "      <td>40°42′46.8″N 74°0′48.6″W</td>\n",
       "      <td>United States</td>\n",
       "      <td>1,776</td>\n",
       "      <td>541</td>\n",
       "      <td>One World Trade Center</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   Category                     City  \\\n",
       "0                 Mixed use                    Dubai   \n",
       "1     Self-supporting tower                    Tokyo   \n",
       "2  Guyed steel lattice mast  Blanchard, North Dakota   \n",
       "3            Clock building                    Mecca   \n",
       "4                    Office             New York, NY   \n",
       "\n",
       "                 Coordinates               Country Height (feet)  \\\n",
       "0  25°11′50.0″N 55°16′26.6″E  United Arab Emirates         2,717   \n",
       "1   35°42′36.5″N 139°48′39″E                 Japan         2,080   \n",
       "2      47°20′32″N 97°17′25″W         United States         2,063   \n",
       "3      21°25′08″N 39°49′35″E          Saudi Arabia         1,972   \n",
       "4   40°42′46.8″N 74°0′48.6″W         United States         1,776   \n",
       "\n",
       "  Height (metres)               Structure Year built  \n",
       "0           828.1            Burj Khalifa       2010  \n",
       "1             634           Tokyo Skytree       2011  \n",
       "2           628.8            KVLY-TV mast       1963  \n",
       "3             601    Abraj Al Bait Towers       2011  \n",
       "4             541  One World Trade Center       2013  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Category</th>\n",
       "      <th>City</th>\n",
       "      <th>Coordinates</th>\n",
       "      <th>Country</th>\n",
       "      <th>Height (feet)</th>\n",
       "      <th>Height (metres)</th>\n",
       "      <th>Structure</th>\n",
       "      <th>Year built</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>46</th>\n",
       "      <td>Aerial tramway</td>\n",
       "      <td>Kaprun</td>\n",
       "      <td>47°11′58.62″N 12°41′16.96″E</td>\n",
       "      <td>Austria</td>\n",
       "      <td>373</td>\n",
       "      <td>113.6</td>\n",
       "      <td>Pillar of</td>\n",
       "      <td>1966</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47</th>\n",
       "      <td>Sphere</td>\n",
       "      <td>Stockholm</td>\n",
       "      <td>59°17′36.92″N 18°04′58.79″E</td>\n",
       "      <td>Sweden</td>\n",
       "      <td>279</td>\n",
       "      <td>85</td>\n",
       "      <td>Ericsson Globe</td>\n",
       "      <td>1989</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48</th>\n",
       "      <td>Brick</td>\n",
       "      <td>Genoa</td>\n",
       "      <td>44°24′16.25″N 8°54′16.67″E</td>\n",
       "      <td>Italy</td>\n",
       "      <td>253</td>\n",
       "      <td>77</td>\n",
       "      <td>Lighthouse of Genoa</td>\n",
       "      <td>1543</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49</th>\n",
       "      <td>Gopuram</td>\n",
       "      <td>Murudeshwara</td>\n",
       "      <td>14°05′39″N 74°29′07″E</td>\n",
       "      <td>India</td>\n",
       "      <td>249</td>\n",
       "      <td>76</td>\n",
       "      <td>Murudeshwara Temple</td>\n",
       "      <td>2008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50</th>\n",
       "      <td>Wooden</td>\n",
       "      <td>Săpânța</td>\n",
       "      <td>47°58′59.5″N 23°42′02.5″E</td>\n",
       "      <td>Romania</td>\n",
       "      <td>246</td>\n",
       "      <td>75</td>\n",
       "      <td>Church of the Săpânța-Peri Monastery</td>\n",
       "      <td>2003</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          Category          City                  Coordinates  Country  \\\n",
       "46  Aerial tramway        Kaprun  47°11′58.62″N 12°41′16.96″E  Austria   \n",
       "47          Sphere     Stockholm  59°17′36.92″N 18°04′58.79″E   Sweden   \n",
       "48          Brick          Genoa   44°24′16.25″N 8°54′16.67″E    Italy   \n",
       "49         Gopuram  Murudeshwara        14°05′39″N 74°29′07″E    India   \n",
       "50          Wooden       Săpânța    47°58′59.5″N 23°42′02.5″E  Romania   \n",
       "\n",
       "   Height (feet) Height (metres)                             Structure  \\\n",
       "46           373           113.6                            Pillar of    \n",
       "47           279              85                        Ericsson Globe   \n",
       "48           253              77                   Lighthouse of Genoa   \n",
       "49           249              76                   Murudeshwara Temple   \n",
       "50           246              75  Church of the Săpânța-Peri Monastery   \n",
       "\n",
       "   Year built  \n",
       "46       1966  \n",
       "47       1989  \n",
       "48       1543  \n",
       "49       2008  \n",
       "50       2003  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(samples)\n",
    "df.head()\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save Dataset\n",
    "\n",
    "We can now save the dataset using the same Wikipedia page name we use earlier to extract the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "dataset_file_name = '../datasets/wikipedia/{}.csv'.format(wikipedia_page)\n",
    "if not trace:\n",
    "    df.to_csv(dataset_file_name, index=False)"
   ]
  }
 ],
 "metadata": {
  "_draft": {
   "nbviewer_url": "https://gist.github.com/407853f389ea711c82bd89faf90d0930"
  },
  "gist": {
   "data": {
    "description": "wikipedia-acquire-dataset-from-table.ipynb",
    "public": true
   },
   "id": "407853f389ea711c82bd89faf90d0930"
  },
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "138px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": false,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": true,
   "toc_section_display": "block",
   "toc_window_display": true,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"This notebook is explained at [ManavSehgal.com](https://manavsehgal.com) where you can find other useful Notebooks which can speed up your machine learning workflow and projects."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"import requests\n",
	"from bs4 import BeautifulSoup\n",
	"import pandas as pd\n",
	"\n",
	"# Use SHIFT+TAB keys to popup inplace code help\n",
	"%config IPCompleter.greedy = True\n",
	"\n",
	"# Output multiple statements from one input cell\n",
	"from IPython.core.interactiveshell import InteractiveShell\n",
	"InteractiveShell.ast_node_interactivity = \"all\""
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Customize Notebook\n",
	"\n",
	"table_from_top. If the Wikipedia page has one table then use `table_from_top = 1` value. Otherwise count table number from top and replace value to get specific table.\n",
	"\n",
	"wikipedia_page. Specify the wikipedia page name from where to source dataset. The CSV file will be saved with the same name.\n",
	"\n",
	"trace. Set `trace = True` to trace how feature values are extracted. Does not save extracted dataset. Prefixes applied parsing/extraction rules to extracted values."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"table_from_top = 1\n",
	"wikipedia_page = 'List_of_tallest_buildings_and_structures'\n",
	"trace = False"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Load and Parse\n",
	"\n",
	"This section loads the Wikipedia page and parses the table data we are interested in converting to a dataset."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"wikipedia_url = 'https://en.wikipedia.org/wiki/{}'.format(wikipedia_page)\n",
	"page = requests.get(wikipedia_url)\n",
	"soup = BeautifulSoup(page.content, 'lxml')\n",
	"tables = soup.find_all('table', {'class': 'wikitable'})\n",
	"table = tables[table_from_top - 1]"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Quick Preview\n",
	"\n",
	"This section extracts the table header with feature or column names.\n",
	"\n",
	"Use this section to quick preview if you have the right table in processing."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {
	"collapsed": false
	},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"'Category'"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	},
	{
	"data": {
	"text/plain": [
	"'Structure'"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	},
	{
	"data": {
	"text/plain": [
	"'Country'"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	},
	{
	"data": {
	"text/plain": [
	"'City'"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	},
	{
	"data": {
	"text/plain": [
	"'Height (metres)'"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	},
	{
	"data": {
	"text/plain": [
	"'Height (feet)'"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	},
	{
	"data": {
	"text/plain": [
	"'Year built'"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	},
	{
	"data": {
	"text/plain": [
	"'Coordinates'"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"feature_names = []\n",
	"\n",
	"header_row = table.find('tr')\n",
	"for header in header_row.find_all('th'):\n",
	" feature_name = ' '.join(header.find_all(text=True))\n",
	" feature_name.replace('\\n', '')\n",
	" feature_names.append(feature_name)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Data Wrangling\n",
	"\n",
	"This section applies data wrangling rules based on exceptions found when parsing Wikipedia tables.\n",
	"\n",
	"- If a feature value contains a link then extract text from the link.\n",
	"- Ignore text which starts with `[` square brackets.\n",
	"- Ignore image links (...flags) prefix link text.\n",
	"- Ignore hidden text used for IDs."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"def has_coords(tag):\n",
	" if tag.has_attr('class'):\n",
	" if tag['class'][0] == 'latitude' or tag['class'][0] == 'longitude':\n",
	" return True\n",
	" return False\n",
	"\n",
	"def get_coords(child):\n",
	" coords = []\n",
	" for coord in child.find_all(has_coords):\n",
	" coords.append(coord.string)\n",
	" if coords:\n",
	" if trace:\n",
	" return 'C = {}'.format(' '.join(coords))\n",
	" else:\n",
	" return ' '.join(coords)\n",
	" else:\n",
	" return ''\n",
	"\n",
	"samples = []\n",
	"sample_rows = table.find_all('tr')[1:]\n",
	"for sample_row in sample_rows:\n",
	" features = []\n",
	" for feature_col in sample_row.find_all('td'):\n",
	" feature_value = ''\n",
	" text = feature_col.string\n",
	" if text:\n",
	" if trace:\n",
	" features.append('T = {}'.format(text))\n",
	" else:\n",
	" features.append(text)\n",
	" continue\n",
	" \n",
	" for child in feature_col.children:\n",
	" if child.name == 'span':\n",
	" if child.has_attr('class'):\n",
	" if child['class'] == 'display:none':\n",
	" continue\n",
	" if child.find_all(has_coords):\n",
	" feature_value = get_coords(child)\n",
	" if feature_value:\n",
	" break\n",
	" else:\n",
	" continue\n",
	" if child.name == 'sup':\n",
	" continue\n",
	" if child.name == 'a':\n",
	" if child.string[0] == '[':\n",
	" continue \n",
	" if child.name == 'a':\n",
	" if trace:\n",
	" feature_value = 'A = {}'.format(child.string)\n",
	" else:\n",
	" feature_value = child.string\n",
	" break\n",
	" if child.name == 'font':\n",
	" if trace:\n",
	" feature_value = 'F = {}'.format(child.string)\n",
	" else:\n",
	" feature_value = child.string\n",
	" break\n",
	" try:\n",
	" # feature_value = '' for any tags not covered above\n",
	" content = child.contents\n",
	" except AttributeError:\n",
	" # Handle whitespace between child tags, treated as a child string\n",
	" if child.isspace():\n",
	" continue\n",
	" if trace:\n",
	" feature_value = 'E = {}'.format(child)\n",
	" else:\n",
	" feature_value = child\n",
	" break\n",
	" features.append(feature_value)\n",
	" samples.append(dict(zip(feature_names, features)))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Preview Dataset\n",
	"\n",
	"This section enables you to preview the parsed dataset."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {
	"collapsed": false
	},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>Category</th>\n",
	" <th>City</th>\n",
	" <th>Coordinates</th>\n",
	" <th>Country</th>\n",
	" <th>Height (feet)</th>\n",
	" <th>Height (metres)</th>\n",
	" <th>Structure</th>\n",
	" <th>Year built</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>Mixed use</td>\n",
	" <td>Dubai</td>\n",
	" <td>25°11′50.0″N 55°16′26.6″E</td>\n",
	" <td>United Arab Emirates</td>\n",
	" <td>2,717</td>\n",
	" <td>828.1</td>\n",
	" <td>Burj Khalifa</td>\n",
	" <td>2010</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>Self-supporting tower</td>\n",
	" <td>Tokyo</td>\n",
	" <td>35°42′36.5″N 139°48′39″E</td>\n",
	" <td>Japan</td>\n",
	" <td>2,080</td>\n",
	" <td>634</td>\n",
	" <td>Tokyo Skytree</td>\n",
	" <td>2011</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>Guyed steel lattice mast</td>\n",
	" <td>Blanchard, North Dakota</td>\n",
	" <td>47°20′32″N 97°17′25″W</td>\n",
	" <td>United States</td>\n",
	" <td>2,063</td>\n",
	" <td>628.8</td>\n",
	" <td>KVLY-TV mast</td>\n",
	" <td>1963</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>Clock building</td>\n",
	" <td>Mecca</td>\n",
	" <td>21°25′08″N 39°49′35″E</td>\n",
	" <td>Saudi Arabia</td>\n",
	" <td>1,972</td>\n",
	" <td>601</td>\n",
	" <td>Abraj Al Bait Towers</td>\n",
	" <td>2011</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>4</th>\n",
	" <td>Office</td>\n",
	" <td>New York, NY</td>\n",
	" <td>40°42′46.8″N 74°0′48.6″W</td>\n",
	" <td>United States</td>\n",
	" <td>1,776</td>\n",
	" <td>541</td>\n",
	" <td>One World Trade Center</td>\n",
	" <td>2013</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" Category City \\\n",
	"0 Mixed use Dubai \n",
	"1 Self-supporting tower Tokyo \n",
	"2 Guyed steel lattice mast Blanchard, North Dakota \n",
	"3 Clock building Mecca \n",
	"4 Office New York, NY \n",
	"\n",
	" Coordinates Country Height (feet) \\\n",
	"0 25°11′50.0″N 55°16′26.6″E United Arab Emirates 2,717 \n",
	"1 35°42′36.5″N 139°48′39″E Japan 2,080 \n",
	"2 47°20′32″N 97°17′25″W United States 2,063 \n",
	"3 21°25′08″N 39°49′35″E Saudi Arabia 1,972 \n",
	"4 40°42′46.8″N 74°0′48.6″W United States 1,776 \n",
	"\n",
	" Height (metres) Structure Year built \n",
	"0 828.1 Burj Khalifa 2010 \n",
	"1 634 Tokyo Skytree 2011 \n",
	"2 628.8 KVLY-TV mast 1963 \n",
	"3 601 Abraj Al Bait Towers 2011 \n",
	"4 541 One World Trade Center 2013 "
	]
	},
	"execution_count": 6,
	"metadata": {},
	"output_type": "execute_result"
	},
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>Category</th>\n",
	" <th>City</th>\n",
	" <th>Coordinates</th>\n",
	" <th>Country</th>\n",
	" <th>Height (feet)</th>\n",
	" <th>Height (metres)</th>\n",
	" <th>Structure</th>\n",
	" <th>Year built</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>46</th>\n",
	" <td>Aerial tramway</td>\n",
	" <td>Kaprun</td>\n",
	" <td>47°11′58.62″N 12°41′16.96″E</td>\n",
	" <td>Austria</td>\n",
	" <td>373</td>\n",
	" <td>113.6</td>\n",
	" <td>Pillar of</td>\n",
	" <td>1966</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>47</th>\n",
	" <td>Sphere</td>\n",
	" <td>Stockholm</td>\n",
	" <td>59°17′36.92″N 18°04′58.79″E</td>\n",
	" <td>Sweden</td>\n",
	" <td>279</td>\n",
	" <td>85</td>\n",
	" <td>Ericsson Globe</td>\n",
	" <td>1989</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>48</th>\n",
	" <td>Brick</td>\n",
	" <td>Genoa</td>\n",
	" <td>44°24′16.25″N 8°54′16.67″E</td>\n",
	" <td>Italy</td>\n",
	" <td>253</td>\n",
	" <td>77</td>\n",
	" <td>Lighthouse of Genoa</td>\n",
	" <td>1543</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>49</th>\n",
	" <td>Gopuram</td>\n",
	" <td>Murudeshwara</td>\n",
	" <td>14°05′39″N 74°29′07″E</td>\n",
	" <td>India</td>\n",
	" <td>249</td>\n",
	" <td>76</td>\n",
	" <td>Murudeshwara Temple</td>\n",
	" <td>2008</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>50</th>\n",
	" <td>Wooden</td>\n",
	" <td>Săpânța</td>\n",
	" <td>47°58′59.5″N 23°42′02.5″E</td>\n",
	" <td>Romania</td>\n",
	" <td>246</td>\n",
	" <td>75</td>\n",
	" <td>Church of the Săpânța-Peri Monastery</td>\n",
	" <td>2003</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" Category City Coordinates Country \\\n",
	"46 Aerial tramway Kaprun 47°11′58.62″N 12°41′16.96″E Austria \n",
	"47 Sphere Stockholm 59°17′36.92″N 18°04′58.79″E Sweden \n",
	"48 Brick Genoa 44°24′16.25″N 8°54′16.67″E Italy \n",
	"49 Gopuram Murudeshwara 14°05′39″N 74°29′07″E India \n",
	"50 Wooden Săpânța 47°58′59.5″N 23°42′02.5″E Romania \n",
	"\n",
	" Height (feet) Height (metres) Structure \\\n",
	"46 373 113.6 Pillar of \n",
	"47 279 85 Ericsson Globe \n",
	"48 253 77 Lighthouse of Genoa \n",
	"49 249 76 Murudeshwara Temple \n",
	"50 246 75 Church of the Săpânța-Peri Monastery \n",
	"\n",
	" Year built \n",
	"46 1966 \n",
	"47 1989 \n",
	"48 1543 \n",
	"49 2008 \n",
	"50 2003 "
	]
	},
	"execution_count": 6,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"df = pd.DataFrame(samples)\n",
	"df.head()\n",
	"df.tail()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Save Dataset\n",
	"\n",
	"We can now save the dataset using the same Wikipedia page name we use earlier to extract the dataset."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"dataset_file_name = '../datasets/wikipedia/{}.csv'.format(wikipedia_page)\n",
	"if not trace:\n",
	" df.to_csv(dataset_file_name, index=False)"
	]
	}
	],
	"metadata": {
	"_draft": {
	"nbviewer_url": "https://gist.github.com/407853f389ea711c82bd89faf90d0930"
	},
	"gist": {
	"data": {
	"description": "wikipedia-acquire-dataset-from-table.ipynb",
	"public": true
	},
	"id": "407853f389ea711c82bd89faf90d0930"
	},
	"hide_input": false,
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.0"
	},
	"toc": {
	"colors": {
	"hover_highlight": "#DAA520",
	"running_highlight": "#FF0000",
	"selected_highlight": "#FFD700"
	},
	"moveMenuLeft": true,
	"nav_menu": {
	"height": "138px",
	"width": "252px"
	},
	"navigate_menu": true,
	"number_sections": false,
	"sideBar": true,
	"threshold": 4,
	"toc_cell": true,
	"toc_section_display": "block",
	"toc_window_display": true,
	"widenNotebook": false
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}