yashbonde · May 6, 2021 22:06
diff --git a/sharded.txt b/sharded.txt
 This quick script tackles a sharding problem: very large datasets often
 cannot be tokenized and stored in full up front. For causal language
 modelling (CLM) we have a fixed dataset in which each row has a variable
 number of tokens. A dummy dataset looks as follows:

 j   n     sequence (w/o EOT = 42)
 [0] [15]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
 [1] [13]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [2] [11]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [3] [13]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [4] [15]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
 [5] [ 8]  [0, 1, 2, 3, 4, 5, 6],
 [6] [14]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
 [7] [ 8]  [0, 1, 2, 3, 4, 5, 6],
 [8] [11]  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [9] [10]  [0, 1, 2, 3, 4, 5, 6, 7, 8]

 j: index of the row in ds
 n: number of tokens in this row + 1 (the extra slot is for the EOT token, 42)
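
 As a minimal sketch, the dummy table above can be rebuilt in a few lines.
 It assumes row j is simply range(n - 1), as shown; the names `EOT` and
 `n_per_row` are made up for illustration and are not part of the script.

```python
# Hypothetical reconstruction of the dummy dataset shown above.
EOT = 42  # end-of-text token id used in the example

# n for each row j, copied from the table
n_per_row = [15, 13, 11, 13, 15, 8, 14, 8, 11, 10]

# Each row holds n - 1 tokens; the EOT token is not stored in the row.
ds = [list(range(n - 1)) for n in n_per_row]

# n counts the tokens of a row plus one slot for the EOT token.
lengths = [len(seq) + 1 for seq in ds]
assert lengths == n_per_row

# Length of the token stream once every row is followed by EOT.
total_tokens = sum(lengths)
print(total_tokens)  # 118
```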

 During initialisation we provide
 a) seqlen: length of each output sample
 b) stride: offset between the starts of two consecutive samples, as with
    the stride of a convolution

 The model is trained on contiguous spans of seqlen tokens. A span may lie
 within a single row or cross row boundaries, in which case the rows are
 joined by the EOT token.

 - for seqlen = 10 and stride = 10
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 42, 0, 1, 2, 3, 4], ...
  stride = seqlen ensures there is no overlap in the sequences

 - for seqlen = 10 and stride = 5
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [5, 6, 7, 8, 9, 10, 11, 12, 13, 42],
  [10, 11, 12, 13, 42, 0, 1, 2, 3, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ...
  notice how consecutive samples overlap by seqlen - stride = 5 tokens
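
 The two examples above can be reproduced with a brute-force sketch that
 materialises the whole token stream. This is exactly what does not scale
 to large datasets, but it is useful as a reference. The helpers `flatten`
 and `windows` are hypothetical names, and the sketch assumes that trailing
 tokens which do not fill a complete window are dropped.

```python
from itertools import chain

EOT = 42  # end-of-text token id used in the example


def flatten(ds, eot=EOT):
    """Concatenate all rows into one stream, appending EOT after each row."""
    return list(chain.from_iterable(seq + [eot] for seq in ds))


def windows(ds, seqlen, stride, eot=EOT):
    """Slide a window of `seqlen` tokens over the stream in steps of `stride`.

    Assumption: a trailing partial window is dropped.
    """
    flat = flatten(ds, eot)
    return [flat[s:s + seqlen] for s in range(0, len(flat) - seqlen + 1, stride)]


# First two rows of the dummy dataset (14 and 12 tokens).
ds = [list(range(14)), list(range(12))]

no_overlap = windows(ds, seqlen=10, stride=10)
overlap = windows(ds, seqlen=10, stride=5)
print(no_overlap[1])  # [10, 11, 12, 13, 42, 0, 1, 2, 3, 4]
print(overlap[1])     # [5, 6, 7, 8, 9, 10, 11, 12, 13, 42]
```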

 TASK: given a list of lists (see above) called `ds`, plus `seqlen` and `stride`,
 a) find the total number of samples (l) in the dataset
 b) given any index i with 0 <= i < l, return the corresponding sample
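
 One possible approach, sketched under assumptions rather than taken from
 the script itself: keep only the cumulative row lengths (each row counted
 with its EOT), derive l arithmetically, and locate the row that contains
 the start of sample i by binary search. The class name `ShardedSpans` is
 hypothetical, samples are assumed to be 0-indexed, and a trailing partial
 window is assumed to be dropped.

```python
from bisect import bisect_right
from itertools import accumulate


class ShardedSpans:
    """Lazy view of fixed-length samples over variable-length rows (a sketch)."""

    def __init__(self, ds, seqlen, stride, eot=42):
        self.ds, self.seqlen, self.stride, self.eot = ds, seqlen, stride, eot
        # ends[j] = position in the flat stream one past the EOT of row j
        self.ends = list(accumulate(len(seq) + 1 for seq in ds))
        total = self.ends[-1] if self.ends else 0
        # Number of full windows; a trailing partial window is dropped.
        self.l = 0 if total < seqlen else (total - seqlen) // stride + 1

    def __len__(self):
        return self.l

    def __getitem__(self, i):
        if not 0 <= i < self.l:
            raise IndexError(i)
        start = i * self.stride
        # Row that contains flat position `start`, and the offset inside it.
        j = bisect_right(self.ends, start)
        offset = start - (self.ends[j - 1] if j else 0)
        out = []
        while len(out) < self.seqlen:
            row = self.ds[j] + [self.eot]  # only the rows touched are expanded
            out.extend(row[offset:offset + self.seqlen - len(out)])
            j, offset = j + 1, 0
        return out


# Dummy dataset from the table above: row j holds range(n - 1).
ds = [list(range(n - 1)) for n in (15, 13, 11, 13, 15, 8, 14, 8, 11, 10)]

spans = ShardedSpans(ds, seqlen=10, stride=5)
print(len(spans))  # 22
print(spans[1])    # [5, 6, 7, 8, 9, 10, 11, 12, 13, 42]
```

 Only the cumulative lengths are held in memory, so in a real pipeline the
 line that builds `row` is where a row could be tokenized on demand.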