ericbusboom · January 31, 2015 17:21
diff --git a/Anscombes Quartet and Grouping.ipynb b/Anscombes Quartet and Grouping.ipynb
 {
 "worksheets": [
  {
   "cells": [
    {
     "metadata": {},
     "cell_type": "heading",
     "source": "Anscombe's Quartet, Grouping and Multi-Indexing",
     "level": 1
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "from __future__ import print_function\n%matplotlib inline\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\npd.set_option('display.max_columns', 100)\n\nprint('Pandas version ' + pd.__version__)\nprint('Numpy version ' + np.__version__)\n",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Anscombe's Quartet is an interesting dataset that demonstrates the need for plotting in the exploratory data analysis process. It also has some features that make it a good dataset for demonstrating Pandas' grouping and multi-index features."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "url='https://gist.githubusercontent.com/ericbusboom/b2ac1d366c005cd2ed8c/raw/177091875a90311ee130cf903dcc7934fcc0e973/anscombes.csv'\naq = pd.read_csv(url, index_col='id')\n",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "First, let's explore some of the descriptive statistics for the datasets. "
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "aq.head()",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "aq.dataset.unique()",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Let's get the mean of `y` for each of the groups. We'll do it the hard way first."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "for ds in ('I','II','III','IV'):\n    print(\"{} {}\".format(ds, aq[aq.dataset == ds]['y'].mean()))",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Of course, there is an easier way: `groupby()`."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "aq.groupby('dataset').mean()",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "aq.groupby('dataset').std()",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Very interesting. The mean and the standard deviation are the same, or nearly so, for all four datasets."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "aq.groupby('dataset').corr()",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "This gets us the information we are looking for, but it is hard to read, since most of the information is redundant. Let's see if we can select just the single (x, y) correlation for each dataset group. First, notice that the dataframe returned by `corr()` has a hierarchical index."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "aqc = aq.groupby('dataset').corr()\naqc.index",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "We'll name the levels of the index, calling the second one `var1`. Then we can use the cross-section method, `xs()`, to get the inner level."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "\naqc.index.names = ['dataset','var1']\naqc.index",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
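     "source": "We can select inner levels of the multi-index with the `xs()` method. Here we take the `x` rows of the `var1` level and keep only the `y` column, which leaves one (x, y) correlation per dataset.\nSee http://pandas.pydata.org/pandas-docs/stable/advanced.html#cross-section"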
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "aqc.xs('x',level='var1')['y']",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "heading",
     "source": "Plotting the Quartet",
     "level": 1
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Now we've seen that the four datasets of the quartet have the same descriptive statistics, so it's time for the punchline. "
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "import seaborn as sns # if fails, pip install seaborn\nsns.lmplot(\"x\", \"y\", col=\"dataset\", hue=\"dataset\", data=aq, col_wrap=2, ci=None, size=4,\n           scatter_kws={\"s\": 50, \"alpha\": 1})",
     "outputs": [],
     "language": "python",
     "trusted": true,
     "collapsed": false
    }
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "name": "",
  "signature": "sha256:b8686379668da5f1064aa6eb8acf76993ae7e56a778188cea92f8e4a4a74e846"
 },
 "nbformat": 3
 }