# Implementing and using iterable datasets: What Could Go Wrong?

**Goal**: Understand the different issues *users* have to deal with when they use **iterable** datasets. *Users* = us, as domain library authors, and also our downstream users.

**Why is this relevant to us?** Because datapipes are planned to be first-class citizens of the PyTorch data loading ecosystem. Datapipes **are** iterable datasets, so everything we'll discuss is relevant to datapipes as well. For now, these issues aren't addressed.

**Disclaimer**: this talk might be confusing. That's kind of the point.
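Since datapipes subclass `IterableDataset`, that claim is easy to check directly. A minimal sketch, not from the original notebook, assuming `torchdata` is installed:

```python
import torch.utils.data as data
from torchdata.datapipes.iter import IterableWrapper  # assumes torchdata is available

# Wrap a plain Python iterable into a datapipe
dp = IterableWrapper(range(10))

# Datapipes *are* iterable datasets, so everything below applies to them too
print(isinstance(dp, data.IterableDataset))  # True
```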
## Let's start with the basics

#### Map-style datasets
```python
import torch
import torch.utils.data as data


class MyMapStyleDS:

    def __init__(self, size=100):
        self.size = size

    def __getitem__(self, i):  # Returns the i'th sample
        # Here: read from disk [+ decoding]
        s = i
        return s

    def __len__(self):
        return self.size


mapstyle_ds = MyMapStyleDS()
mapstyle_dl = data.DataLoader(mapstyle_ds, batch_size=10)

for batch in mapstyle_dl:
    print(batch)
    # Here: forward and backward passes
```

Output:

```
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])
tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])
tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])
tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])
tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])
tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
```
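Because a map-style dataset exposes `__getitem__` and `__len__`, any sample can be fetched directly by index, which is what lets the `DataLoader` shuffle and shard indices with a sampler. A quick illustration, not part of the original notebook, reusing `mapstyle_ds` from the cell above:

```python
# Random access: fetch any sample directly by its index
print(mapstyle_ds[42])   # 42
print(len(mapstyle_ds))  # 100

# Shuffling only requires permuting the indices drawn by the sampler
shuffled_dl = data.DataLoader(mapstyle_ds, batch_size=10, shuffle=True)
print(next(iter(shuffled_dl)))  # 10 randomly drawn samples
```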
#### Iterable datasets
```python
class MyIterableDS(data.IterableDataset):

    def __init__(self, size=100):
        self.size = size

    def __iter__(self):  # iterate over samples
        for s in range(self.size):
            yield s

    def __len__(self):
        return self.size


iter_ds = MyIterableDS()
iter_dl = data.DataLoader(iter_ds, batch_size=10)

for batch in iter_dl:
    print(batch)
    # Here: forward and backward passes
```

Output:

```
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])
tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])
tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])
tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])
tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])
tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
```
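An iterable dataset only promises that it can be iterated over: there is no random access, and the `DataLoader` rejects sampler-based options such as `shuffle=True`. A small sketch of what that means in practice (the exact exception types and messages may vary across PyTorch versions):

```python
try:
    iter_ds[3]  # no usable __getitem__: random access is not part of the contract
except (TypeError, NotImplementedError) as e:
    print(type(e).__name__)

try:
    data.DataLoader(iter_ds, batch_size=10, shuffle=True)
except ValueError as e:
    print(e)  # DataLoader refuses shuffle/sampler options for an IterableDataset
```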
### So far so good
## Let's add some parallelism \o/

We'll cover:

- DataLoader parallelism
- DDP parallelism

Fun fact: they're not mutually exclusive (see the sketch right after this list).
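To see how the two levels compose, here is a plain-Python sketch, not from the original notebook, of one common sharding convention: with `world_size` DDP processes and `num_workers` DataLoader workers per process, there are `world_size * num_workers` readers in total, and each reader is assigned every n-th sample.

```python
# Hypothetical numbers, purely to illustrate the arithmetic
world_size, num_workers = 2, 4            # 2 DDP processes x 4 DataLoader workers
total_readers = world_size * num_workers  # 8 readers overall

def samples_for(rank, worker_id, dataset_size=100):
    """Samples that a given (DDP rank, DataLoader worker) pair would read
    under a simple round-robin sharding scheme."""
    global_worker_id = rank * num_workers + worker_id
    return [i for i in range(dataset_size) if i % total_readers == global_worker_id]

print(samples_for(rank=0, worker_id=0))  # [0, 8, 16, ...]
print(samples_for(rank=1, worker_id=3))  # [7, 15, 23, ...]
```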
### DataLoader parallelism

#### Map-style - EZPZ lemon squeezy
```python
mapstyle_dl = data.DataLoader(mapstyle_ds, batch_size=10, num_workers=4)

for batch in mapstyle_dl:
    print(batch)
```

Output:

```
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])
tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])
tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])
tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])
tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])
tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
```
#### Iterable - ~~EZPZ lemon squeezy~~
```python
iter_dl = data.DataLoader(iter_ds, batch_size=10, num_workers=4)

for batch in iter_dl:
    print(batch)
```

Output:

```
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
tensor([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])
tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])
tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])
tensor([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])
tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
tensor([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])
tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])
tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])
tensor([50, 51, 52, 53, 54, 55, 56, 57, 58, 59])
tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])
tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])
tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])
tensor([60, 61, 62, 63, 64, 65, 66, 67, 68, 69])
tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])
tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])
tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])
tensor([70, 71, 72, 73, 74, 75, 76, 77, 78, 79])
tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])
tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])
tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])
tensor([80, 81, 82, 83, 84, 85, 86, 87, 88, 89])
tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
tensor([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
```
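Each of the 4 workers gets its own copy of the dataset and iterates over all of it, so every batch is produced 4 times. The fix recommended in the PyTorch docs is to shard the work inside `__iter__` using `torch.utils.data.get_worker_info()`. A minimal sketch of that pattern (the strided split below is just one possible choice):

```python
class MyShardedIterableDS(data.IterableDataset):

    def __init__(self, size=100):
        self.size = size

    def __iter__(self):
        worker_info = data.get_worker_info()
        if worker_info is None:
            # Single-process loading: this is the only reader, yield everything
            start, step = 0, 1
        else:
            # Multi-process loading: each worker yields a disjoint strided slice
            start, step = worker_info.id, worker_info.num_workers
        for s in range(start, self.size, step):
            yield s


sharded_dl = data.DataLoader(MyShardedIterableDS(), batch_size=10, num_workers=4)
for batch in sharded_dl:
    print(batch)  # no more duplicates, but each batch is now a strided slice
```

Note that this removes the duplication but does not reproduce the single-process batches: each worker assembles batches from its own strided slice, so batch contents and ordering change.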
*[The rest of the notebook is truncated here; the next cell embeds an image attachment (1df0c385-533f-449f-8073-11646c3044d1.png).]*