@NicolasHug
Created April 28, 2022 14:07
{
"cells": [
{
"cell_type": "markdown",
"id": "f5cc3a05-6a5f-4d2f-a2c1-cef095e74824",
"metadata": {},
"source": [
"# Implementing and using iterable datasets: What Could Go Wrong?\n",
"\n",
"\n",
"**Goal**: Understand the different issues *users* have to deal with when they use **iterable** datasets. *Users* = us, as domain library authors; and also our downstream users.\n",
"\n",
"**Why is this relevant to us?** Because datapipes are planned to be first-class citizens of the PyTorch data loading ecosystem. Datapipes **are** iterable datasets so everything we'll discuss is relevant for datapipes as well. For now, these issues aren't addressed.\n",
"\n",
"**Disclaimer**: this talk might be confusing. That's kind of the point.\n"
]
},
{
"cell_type": "markdown",
"id": "e8149cc9-8a81-4707-a731-95fdce554f20",
"metadata": {},
"source": [
"## Let's start with the basics\n",
"\n",
"#### Map-style datasets"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "83a62715-0c6a-4e21-a2c8-201620fcc9a0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
"tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])\n",
"tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])\n",
"tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])\n",
"tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])\n",
"tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])\n",
"tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])\n",
"tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])\n",
"tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])\n",
"tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])\n"
]
}
],
"source": [
"import torch\n",
"import torch.utils.data as data\n",
"\n",
"\n",
"class MyMapStyleDS:\n",
" \n",
" def __init__(self, size=100):\n",
" self.size = size\n",
" \n",
" def __getitem__(self, i): # Returns the i'th sample\n",
" # Here: read from disk [+ decoding]\n",
" s = i\n",
" return s\n",
" \n",
" def __len__(self):\n",
" return self.size\n",
" \n",
" \n",
"mapstyle_ds = MyMapStyleDS()\n",
"mapstyle_dl = data.DataLoader(mapstyle_ds, batch_size=10)\n",
"\n",
"for batch in mapstyle_dl:\n",
" print(batch)\n",
" # Here: forward and backward passes"
]
},
{
"cell_type": "markdown",
"id": "78b3f0b5-a652-4717-86d1-e48d05303026",
"metadata": {},
"source": [
"#### Iterable datasets"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e0046826-0e29-43da-b03e-f1a13e8df510",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
"tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])\n",
"tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])\n",
"tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])\n",
"tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])\n",
"tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])\n",
"tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])\n",
"tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])\n",
"tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])\n",
"tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])\n"
]
}
],
"source": [
"class MyIterableDS(data.IterableDataset):\n",
" \n",
" def __init__(self, size=100):\n",
" self.size = size\n",
" \n",
" def __iter__(self): # iterate over samples\n",
" for s in range(self.size):\n",
" yield s\n",
" \n",
" def __len__(self):\n",
" return self.size\n",
" \n",
" \n",
"iter_ds = MyIterableDS()\n",
"iter_dl = data.DataLoader(iter_ds, batch_size=10)\n",
"\n",
"for batch in iter_dl:\n",
" print(batch)\n",
" # Here: forward and backward passes"
]
},
{
"cell_type": "markdown",
"id": "23a4c2e3-433d-44a7-a788-1715fb59e3cc",
"metadata": {},
"source": [
"### So far so good\n",
"## Let's add some parallelism \\o/\n",
"\n",
"We'll cover:\n",
"\n",
"- DataLoader parallelism\n",
"- DDP parallelism\n",
"\n",
"Fun fact: they're not mutually exclusive\n",
"\n"
]
},
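{
"cell_type": "markdown",
"id": "b7c1d2e3-1111-4111-8111-000000000001",
"metadata": {},
"source": [
"As a quick reminder of what DDP parallelism looks like for **map-style** datasets: `DistributedSampler` shards *indices* across ranks, so each process sees a disjoint slice. A minimal sketch (passing `num_replicas`/`rank` explicitly so no process group is needed; in real DDP code these come from the process group):\n",
"\n",
"```python\n",
"from torch.utils.data.distributed import DistributedSampler\n",
"\n",
"\n",
"class TinyMapDS:\n",
"\n",
"    def __init__(self, size=10):\n",
"        self.size = size\n",
"\n",
"    def __getitem__(self, i):\n",
"        return i\n",
"\n",
"    def __len__(self):\n",
"        return self.size\n",
"\n",
"\n",
"ds = TinyMapDS()\n",
"# Rank 0 of 2 gets indices 0, 2, 4, ...; rank 1 gets 1, 3, 5, ...\n",
"shard0 = [ds[i] for i in DistributedSampler(ds, num_replicas=2, rank=0, shuffle=False)]\n",
"shard1 = [ds[i] for i in DistributedSampler(ds, num_replicas=2, rank=1, shuffle=False)]\n",
"# Together the shards cover every sample exactly once.\n",
"```\n"
]
},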
{
"cell_type": "markdown",
"id": "f3725d1f-7253-4f8a-96de-a3cc75bb01bf",
"metadata": {},
"source": [
"### DataLoader parallelism\n",
"\n",
"#### Map-style - EZPZ lemon squeezy"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2f268c08-5f42-4f93-9612-e17f6c58520a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
"tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])\n",
"tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])\n",
"tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])\n",
"tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])\n",
"tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])\n",
"tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])\n",
"tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])\n",
"tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])\n",
"tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])\n"
]
}
],
"source": [
"mapstyle_dl = data.DataLoader(mapstyle_ds, batch_size=10, num_workers=4)\n",
"\n",
"for batch in mapstyle_dl:\n",
" print(batch)"
]
},
{
"cell_type": "markdown",
"id": "190df695-40d0-4391-90ae-40002b85cb9a",
"metadata": {},
"source": [
"#### Iterable - ~~EZPZ lemon squeezy~~"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0ac29b80-cc17-494f-88c7-1d50c1abab45",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
"tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
"tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
"tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
"tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])\n",
"tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])\n",
"tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])\n",
"tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])\n",
"tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])\n",
"tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])\n",
"tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])\n",
"tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])\n",
"tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])\n",
"tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])\n",
"tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])\n",
"tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])\n",
"tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])\n",
"tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])\n",
"tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])\n",
"tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])\n",
"tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])\n",
"tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])\n",
"tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])\n",
"tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])\n",
"tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])\n",
"tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])\n",
"tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])\n",
"tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])\n",
"tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])\n",
"tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])\n",
"tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])\n",
"tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])\n",
"tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])\n",
"tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])\n",
"tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])\n",
"tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])\n",
"tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])\n",
"tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])\n",
"tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])\n",
"tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])\n"
]
}
],
"source": [
"iter_dl = data.DataLoader(iter_ds, batch_size=10, num_workers=4)\n",
"\n",
"for batch in iter_dl:\n",
" print(batch)"
]
},
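{
"cell_type": "markdown",
"id": "b7c1d2e3-1111-4111-8111-000000000002",
"metadata": {},
"source": [
"Each of the 4 workers gets its own copy of the dataset and replays the full stream, so every batch shows up 4 times. One common fix (a sketch, not the only option): shard inside `__iter__` with `torch.utils.data.get_worker_info()`, so each worker yields a disjoint slice.\n",
"\n",
"```python\n",
"import torch.utils.data as data\n",
"\n",
"\n",
"class ShardedIterableDS(data.IterableDataset):\n",
"\n",
"    def __init__(self, size=100):\n",
"        self.size = size\n",
"\n",
"    def __iter__(self):\n",
"        info = data.get_worker_info()\n",
"        if info is None:  # single-process loading: yield everything\n",
"            start, step = 0, 1\n",
"        else:  # worker i yields samples i, i + num_workers, i + 2 * num_workers, ...\n",
"            start, step = info.id, info.num_workers\n",
"        yield from range(start, self.size, step)\n",
"```\n",
"\n",
"With `batch_size=10, num_workers=4` each sample now appears exactly once, but the batches contain strided samples (0, 4, 8, ...): sharding fixes the duplication while changing the ordering.\n"
]
},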
{
"attachments": {
"1df0c385-533f-449f-8073-11646c3044d1.png": {