@ericbusboom
Created January 31, 2015 17:21
{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "heading",
"source": "Anscombe's Quartet, Grouping and Multi-Indexing",
"level": 1
},
{
"metadata": {},
"cell_type": "code",
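"input": "# __future__ imports belong before any other statement; this one is a no-op on Python 3\nfrom __future__ import print_function\n\n%matplotlib inline\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\n# Show up to 100 columns when displaying wide dataframes\npd.set_option('display.max_columns', 100)\n\nprint('Pandas version ' + pd.__version__)\nprint('Numpy version ' + np.__version__)\n",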
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Anscombe's Quartet is an interesting dataset that demonstrates the need for plotting in the exploratory data analysis process. It also has some features that make it a good dataset for demonstrating Pandas' grouping and multi-index features. "
},
{
"metadata": {},
"cell_type": "code",
"input": "url='https://gist.githubusercontent.com/ericbusboom/b2ac1d366c005cd2ed8c/raw/177091875a90311ee130cf903dcc7934fcc0e973/anscombes.csv'\naq = pd.read_csv(url, index_col='id')\n",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "First, let's explore some of the descriptive statistics for the datasets. "
},
{
"metadata": {},
"cell_type": "code",
"input": "aq.head()",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "aq.dataset.unique()",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Let's get the mean of `y` for each of the groups. We'll do it the hard way first. "
},
{
"metadata": {},
"cell_type": "code",
"input": "for ds in ('I','II','III','IV'):\n    print(\"{} {}\".format(ds, aq[aq.dataset == ds]['y'].mean()))",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Of course, there is an easier way: group by the `dataset` column and aggregate."
},
{
"metadata": {},
"cell_type": "code",
"input": "aq.groupby('dataset').mean()",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "aq.groupby('dataset').std()",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Very interesting. The mean and the standard deviation are nearly identical across all four datasets. "
},
{
"metadata": {},
"cell_type": "code",
"input": "aq.groupby('dataset').corr()",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This gets us the information we are looking for, but it is hard to read, since most of the information is redundant. Let's see if we can select just the single (x, y) correlation for each dataset group. First, notice that the result of `aq.groupby('dataset').corr()` has a hierarchical index."
},
{
"metadata": {},
"cell_type": "code",
"input": "aqc = aq.groupby('dataset').corr()\naqc.index",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We'll name the second level of the index so that we can refer to it by name when using the cross-section method, `xs()`, to get the inner level. "
},
{
"metadata": {},
"cell_type": "code",
"input": "\naqc.index.names = ['dataset','var1']\naqc.index",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
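"source": "We can select inner levels of the multi-index with the `xs()` method. Here we take the rows where `var1` is `x` and keep the `y` column, which leaves one (x, y) correlation per dataset. \nSee http://pandas.pydata.org/pandas-docs/stable/advanced.html#cross-section"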
},
{
"metadata": {},
"cell_type": "code",
"input": "aqc.xs('x',level='var1')['y']",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "heading",
"source": "Plotting the Quartet",
"level": 1
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Now we've seen that the four datasets of the quartet have the same descriptive statistics, so it's time for the punchline. "
},
{
"metadata": {},
"cell_type": "code",
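"input": "import seaborn as sns # if this fails, pip install seaborn\n# Recent seaborn releases expect x and y as keywords; `size` was renamed to `height` in seaborn 0.9\nsns.lmplot(x=\"x\", y=\"y\", col=\"dataset\", hue=\"dataset\", data=aq, col_wrap=2, ci=None, height=4,\n           scatter_kws={\"s\": 50, \"alpha\": 1})",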
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
}
],
"metadata": {}
}
],
"metadata": {
"name": "",
"signature": "sha256:b8686379668da5f1064aa6eb8acf76993ae7e56a778188cea92f8e4a4a74e846"
},
"nbformat": 3
}