@mgilliam
Created March 2, 2018 20:54
Pandas Time Grouper
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, a pandas **[DatetimeIndex]** is generated. The index starts from the current time and comprises 360,000 periods of 1 second each.\n",
"\n",
"The index is then converted to a pandas **[Series]**.\n",
"\n",
"[DatetimeIndex]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.html\n",
"[Series]: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2018-02-21T19:56:33.898000Z",
"start_time": "2018-02-21T19:56:33.409000Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"2018-03-02 12:51:39.932344 2018-03-02 12:51:39.932344\n",
"2018-03-02 12:51:40.932344 2018-03-02 12:51:40.932344\n",
"2018-03-02 12:51:41.932344 2018-03-02 12:51:41.932344\n",
"2018-03-02 12:51:42.932344 2018-03-02 12:51:42.932344\n",
"2018-03-02 12:51:43.932344 2018-03-02 12:51:43.932344\n",
"Freq: S, dtype: datetime64[ns]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"begin = pd.Timestamp.now()\n",
"dates = pd.date_range(begin, freq='S', periods=360000).to_series()\n",
"\n",
"dates.head()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2018-03-06 16:51:34.932344 2018-03-06 16:51:34.932344\n",
"2018-03-06 16:51:35.932344 2018-03-06 16:51:35.932344\n",
"2018-03-06 16:51:36.932344 2018-03-06 16:51:36.932344\n",
"2018-03-06 16:51:37.932344 2018-03-06 16:51:37.932344\n",
"2018-03-06 16:51:38.932344 2018-03-06 16:51:38.932344\n",
"Freq: S, dtype: datetime64[ns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dates.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The datetime values are rounded to the nearest second. From these, 36 values are taken via the **[sample]** method and sorted.\n",
"\n",
"A single random (integer) number of seconds is then added to every `startDate` value to generate the `endDate` values. This **[timedelta]** is at least zero and less than half the minimum of the [first discrete difference] of the `startDate` values (to guard against overlap).\n",
"\n",
"[sample]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sample.html\n",
"[timedelta]: https://docs.python.org/3.6/library/datetime.html#timedelta-objects\n",
"[first discrete difference]: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.diff.html"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2018-02-21T19:56:33.898000Z",
"start_time": "2018-02-21T19:56:33.409000Z"
}
},
"outputs": [],
"source": [
"from datetime import timedelta\n",
"import numpy as np\n",
"\n",
"startDate = dates.dt.round(freq='S').sample(n=36).sort_values()\n",
"dt_max = int(startDate.diff().min().total_seconds() // 2)\n",
"endDate = startDate + timedelta(seconds=np.random.randint(0, dt_max))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, the synthetic data is assembled into a pandas **[DataFrame]** by [concatenating] the `startDate` and `endDate`.\n",
"\n",
"A `value` column is then added, holding a random step count for each \\[`startDate`, `endDate`\\] interval. The resulting **[DataFrame]** is sorted by time, and its (integer) index is reset.\n",
"\n",
"[DataFrame]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html\n",
"[concatenating]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2018-02-21T19:56:33.898000Z",
"start_time": "2018-02-21T19:56:33.409000Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>startDate</th>\n",
" <th>endDate</th>\n",
" <th>value</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2018-03-02 14:08:12</td>\n",
" <td>2018-03-02 14:08:18</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2018-03-02 19:41:24</td>\n",
" <td>2018-03-02 19:41:30</td>\n",
" <td>68</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2018-03-02 20:33:13</td>\n",
" <td>2018-03-02 20:33:19</td>\n",
" <td>33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2018-03-03 15:03:05</td>\n",
" <td>2018-03-03 15:03:11</td>\n",
" <td>47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2018-03-03 18:26:16</td>\n",
" <td>2018-03-03 18:26:22</td>\n",
" <td>57</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" startDate endDate value\n",
"0 2018-03-02 14:08:12 2018-03-02 14:08:18 30\n",
"1 2018-03-02 19:41:24 2018-03-02 19:41:30 68\n",
"2 2018-03-02 20:33:13 2018-03-02 20:33:19 33\n",
"3 2018-03-03 15:03:05 2018-03-03 15:03:11 47\n",
"4 2018-03-03 18:26:16 2018-03-03 18:26:22 57"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cols = ['startDate', 'endDate']\n",
"\n",
"steps = pd.concat([startDate, endDate], axis=1, keys=cols)\n",
"steps['value'] = np.random.randint(0, 100, size=len(steps))\n",
"steps.sort_values(by=['startDate', 'endDate'], inplace=True)\n",
"steps.reset_index(drop=True, inplace=True)\n",
"\n",
"steps.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `steps` are [grouped] into daily bins according to the `startDate` column values. This results in a pandas **DataFrameGroupBy** object.\n",
"\n",
"[grouped]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2018-02-21T19:57:35.436000Z",
"start_time": "2018-02-21T19:57:35.415000Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<pandas.core.groupby.DataFrameGroupBy object at 0x00000254B675BB38>"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"daily = steps.groupby(pd.Grouper(key='startDate', freq='D'))\n",
"\n",
"daily"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The total daily steps can be calculated by applying the built-in `sum` [aggregation] to the **GroupBy** object. While `daily` remains a **GroupBy** object, applying an aggregation function yields a **DataFrame**.\n",
"\n",
"[aggregation]: https://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>value</th>\n",
" </tr>\n",
" <tr>\n",
" <th>startDate</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2018-03-02</th>\n",
" <td>131</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2018-03-03</th>\n",
" <td>185</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2018-03-04</th>\n",
" <td>645</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2018-03-05</th>\n",
" <td>329</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2018-03-06</th>\n",
" <td>338</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" value\n",
"startDate \n",
"2018-03-02 131\n",
"2018-03-03 185\n",
"2018-03-04 645\n",
"2018-03-05 329\n",
"2018-03-06 338"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"daily[['value']].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This **DataFrame** can of course be [plotted].\n",
"\n",
"[plotted]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x254b882f780>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"\n",
"daily[['value']].sum().plot(figsize=(12, 8));"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, because the step values are reported over irregular intervals, many of the other aggregations available on the **GroupBy** object are not very informative. For example, `max` applied to the grouped data returns the largest value reported for any single interval within each daily group, and the same limitation applies to `min`, `mean`, and `std`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2018-02-21T20:00:10.778000Z",
"start_time": "2018-02-21T20:00:10.751000Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe thead tr:last-of-type th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th colspan=\"4\" halign=\"left\">value</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>min</th>\n",
" <th>max</th>\n",
" <th>mean</th>\n",
" <th>std</th>\n",
" </tr>\n",
" <tr>\n",
" <th>startDate</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2018-03-02</th>\n",
" <td>30</td>\n",
" <td>68</td>\n",
" <td>43.666667</td>\n",
" <td>21.126603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2018-03-03</th>\n",
" <td>12</td>\n",
" <td>57</td>\n",
" <td>37.000000</td>\n",
" <td>22.704625</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2018-03-04</th>\n",
" <td>14</td>\n",
" <td>97</td>\n",
" <td>64.500000</td>\n",
" <td>28.336274</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2018-03-05</th>\n",
" <td>7</td>\n",
" <td>65</td>\n",
" <td>29.909091</td>\n",
" <td>19.408527</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2018-03-06</th>\n",
" <td>10</td>\n",
" <td>89</td>\n",
" <td>48.285714</td>\n",
" <td>31.393812</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" value \n",
" min max mean std\n",
"startDate \n",
"2018-03-02 30 68 43.666667 21.126603\n",
"2018-03-03 12 57 37.000000 22.704625\n",
"2018-03-04 14 97 64.500000 28.336274\n",
"2018-03-05 7 65 29.909091 19.408527\n",
"2018-03-06 10 89 48.285714 31.393812"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"daily[['value']].agg(['min', 'max', 'mean', 'std'])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (3.6-64)",
"language": "python",
"name": "py36-64"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"toc_cell": false,
"toc_position": {},
"toc_section_display": "block",
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}
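
The steps in the notebook above can be condensed into a single reproducible script. The following is a minimal sketch under a few assumptions that are not in the original: a fixed `begin` timestamp and fixed random seeds (so that runs are repeatable), a separate random offset for each row rather than one shared offset, and selection of the `value` column before aggregating so that the datetime columns are not summed. The names `rng`, `offsets`, and `daily_total` are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumption: fixed seed and start time, chosen only to make the sketch repeatable.
rng = np.random.default_rng(0)
begin = pd.Timestamp("2018-03-02 12:00:00")

# One timestamp per second for 360,000 seconds (100 hours), as in the notebook.
dates = pd.date_range(begin, periods=360_000, freq=pd.Timedelta(seconds=1)).to_series()

# Draw 36 distinct start times (sampling is without replacement) and sort them.
start = dates.sample(n=36, random_state=0).sort_values()

# Half of the smallest gap between consecutive starts, in whole seconds.
min_gap = start.diff().dt.total_seconds().min()
dt_max = int(min_gap // 2)

# Assumption: one offset per row, each in [0, dt_max); max(..., 1) keeps the
# range non-empty when the smallest gap is a single second.
offsets = rng.integers(0, max(dt_max, 1), size=len(start))

steps = pd.DataFrame({
    "startDate": start.to_numpy(),
    "endDate": start.to_numpy() + offsets.astype("timedelta64[s]"),
    "value": rng.integers(0, 100, size=len(start)),
})

# Bin rows by calendar day of startDate, then total the step values per day.
daily = steps.groupby(pd.Grouper(key="startDate", freq="D"))
daily_total = daily["value"].sum()
```

Because every offset is smaller than the smallest gap between consecutive start times, no interval can run into the next one. Selecting `value` before calling `sum` yields a **Series** indexed by day; `daily[["value"]].sum()` would give the one-column **DataFrame** form instead.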