Created
May 6, 2021 22:06
This quick script tackles a sharding problem: for very large datasets it is
often impractical to tokenize everything up front and store the result.
Consider a CLM dataset: the dataset itself is fixed, but each row has a
variable number of tokens. A dummy dataset looks as follows:
 j    n    sequence (w/o EOT = 42)
[0]  [15]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
[1]  [13]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[2]  [11]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[3]  [13]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[4]  [15]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
[5]  [ 8]  [0, 1, 2, 3, 4, 5, 6],
[6]  [14]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
[7]  [ 8]  [0, 1, 2, 3, 4, 5, 6],
[8]  [11]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[9]  [10]  [0, 1, 2, 3, 4, 5, 6, 7, 8]
j: index of the row in ds
n: number of tokens in this sequence + 1 (the extra slot is for the EOT token)
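
As a quick sanity check on the table, the dummy dataset and its `n` column can be rebuilt in a few lines. This is only a sketch: the names `lengths_without_eot`, `n` and `total_tokens` are made up for illustration, and it assumes that every row is followed by exactly one EOT token.

```python
EOT = 42  # end-of-text token id, taken from the table header

# Row j of the dummy dataset is simply range(k) for these lengths.
lengths_without_eot = [14, 12, 10, 12, 14, 7, 13, 7, 10, 9]
ds = [list(range(k)) for k in lengths_without_eot]

# n counts the tokens in a row plus one slot for the EOT that follows it.
n = [len(seq) + 1 for seq in ds]

# Length of the stream obtained by joining all rows with EOT separators.
total_tokens = sum(n)

print(n)             # [15, 13, 11, 13, 15, 8, 14, 8, 11, 10]
print(total_tokens)  # 118
```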
During initialisation we provide:
a) seqlen: size of each output sequence
b) stride: offset between the starts of two consecutive samples, same as the
   stride of a convolution
The model is trained on contiguous spans (size = seqlen). A span may come
from a single sequence or may merge several consecutive sequences, with
an EOT token at every sequence boundary.
- for seqlen = 10 and stride = 10
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 42, 0, 1, 2, 3, 4], ...
  stride = seqlen ensures there is no overlap between samples
- for seqlen = 10 and stride = 5
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 42, 0, 1, 2, 3, 4], ...
  [5, 6, 7, 8, 9, 10, 11, 12, 13, 42], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ...
  notice how the samples overlap: the second row starts at offsets 5, 15, ...
  and sits between the samples of the first row (offsets 0, 10, ...)
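
The two examples can be reproduced with a brute-force reference that materialises the whole token stream first. This is a sketch for checking small cases, not the approach the script is after: `flatten` and `spans` are hypothetical helper names, and it assumes that a trailing window shorter than seqlen is dropped.

```python
EOT = 42  # end-of-text token id from the table above


def flatten(ds, eot=EOT):
    """Join all rows into one stream, appending EOT after every row."""
    flat = []
    for seq in ds:
        flat.extend(seq)
        flat.append(eot)
    return flat


def spans(ds, seqlen, stride):
    """Return every full window of size seqlen, starting every stride tokens."""
    flat = flatten(ds)
    starts = range(0, len(flat) - seqlen + 1, stride)
    return [flat[s:s + seqlen] for s in starts]


# First two rows of the dummy dataset are enough to see the pattern.
ds = [list(range(14)), list(range(12))]

for sample in spans(ds, seqlen=10, stride=5):
    print(sample)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# [5, 6, 7, 8, 9, 10, 11, 12, 13, 42]
# [10, 11, 12, 13, 42, 0, 1, 2, 3, 4]
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Printed in order of start offset, the stride = 5 samples interleave the two rows shown in the example.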
TASK: given a list of lists (see above) called `ds`, plus `seqlen` and `stride`:
a) can you find the total number of samples (l) in the dataset?
b) given any index i < l (0-based), can you return the correct sequence?
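
One possible way to answer both parts without materialising the full stream is to keep only the cumulative row lengths and binary-search them. The sketch below is not necessarily the gist's own solution: the class name `SpanIndex` is hypothetical, indices are assumed to be 0-based, every row is assumed to be followed by one EOT, and incomplete trailing windows are dropped.

```python
from bisect import bisect_right
from itertools import accumulate

EOT = 42  # end-of-text token id from the table above


class SpanIndex:
    """Lazy view over ds that yields fixed-size spans on demand."""

    def __init__(self, ds, seqlen, stride):
        self.ds, self.seqlen, self.stride = ds, seqlen, stride
        # ends[j] = offset just past row j (its EOT included) in the virtual stream
        self.ends = list(accumulate(len(seq) + 1 for seq in ds))
        total = self.ends[-1] if self.ends else 0
        # (a) number of full windows of size seqlen taken every stride tokens
        self.l = max(0, (total - seqlen) // stride + 1)

    def __len__(self):
        return self.l

    def __getitem__(self, i):
        # (b) locate the start offset, then copy tokens row by row
        if not 0 <= i < self.l:
            raise IndexError(i)
        pos = i * self.stride
        j = bisect_right(self.ends, pos)           # row that contains offset pos
        k = pos - (self.ends[j - 1] if j else 0)   # offset inside that row
        out = []
        while len(out) < self.seqlen:
            need = self.seqlen - len(out)
            out.extend(self.ds[j][k:k + need])
            if len(out) < self.seqlen:
                out.append(EOT)                    # row exhausted: emit its EOT
            j, k = j + 1, 0
        return out


ds = [list(range(k)) for k in [14, 12, 10, 12, 14, 7, 13, 7, 10, 9]]

index = SpanIndex(ds, seqlen=10, stride=10)
print(len(index))  # 11
print(index[1])    # [10, 11, 12, 13, 42, 0, 1, 2, 3, 4]
```

Building the index costs one pass over the row lengths, and each lookup is a binary search plus a copy of seqlen tokens, so rows could in principle be tokenized only when a span actually touches them.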