⚠️ 🟢 Issue: training error (device-side assert in NLLLoss: target label out of range)
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
...
[1,mpirank:1,algo-2]<stdout>: File "train.py", line 675, in <module>
[1,mpirank:1,algo-2]<stdout>: main(task)
[1,mpirank:1,algo-2]<stdout>: File "train.py", line 572, in main
[1,mpirank:1,algo-2]<stdout>: train(task)
[1,mpirank:1,algo-2]<stdout>: File "train.py", line 277, in train
[1,mpirank:1,algo-2]<stdout>: loss.backward()
[1,mpirank:1,algo-2]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
[1,mpirank:1,algo-2]<stdout>: torch.autograd.backward(
[1,mpirank:1,algo-2]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
[1,mpirank:1,algo-2]<stdout>: Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[1,mpirank:1,algo-2]<stdout>:RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
[1,mpirank:0,algo-1]<stdout>: Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[1,mpirank:0,algo-1]<stdout>:RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
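The `CUBLAS_STATUS_NOT_INITIALIZED` errors on both ranks are most likely a downstream symptom rather than the root cause: once the device-side assert in `nll_loss` fires, every subsequent CUDA call on that context fails. A minimal sketch for getting a stack trace that points at the real failing op, assuming you can rerun the job with an extra environment variable (the placement at the top of `train.py` is hypothetical; the variable itself is standard PyTorch/CUDA debugging):

```python
# Hypothetical snippet for the very top of train.py, before torch/CUDA is
# initialized: with synchronous kernel launches, the traceback points at the
# kernel that actually failed (the nll_loss assert) instead of a later,
# unrelated cuBLAS call.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch only after the variable is set
```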
- ✅ Solution:

def label_transform(x):
    ## Original labels are (1,2,3,4,5)
    ## Convert to (0,1,2,3,4)
    return torch.tensor(int(x.decode()) - 1, dtype=torch.int64)

class WebDatasetDDP(IterableDataset):
    def __init__(self,
        ...
        super().__init__()
        self.dataset = (
            ...
            wds.DataPipeline(
                ...
                wds.map_tuple(
                    key_transform,
                    train_transform,
                    label_transform,
                ),
                ...
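For context, a minimal self-contained sketch of the same idea. The shard list, field names (`"jpg"`, `"cls"`), shuffle buffer, and the key/image transforms are placeholders filled in by assumption; only `label_transform` reflects the actual fix (shifting labels 1..5 down to 0..4).

```python
import torch
import webdataset as wds
from torch.utils.data import IterableDataset


def key_transform(key):
    # Placeholder: keep the sample key unchanged.
    return key


def train_transform(image_bytes):
    # Placeholder: decode/augment the image bytes here.
    return image_bytes


def label_transform(x):
    # The raw ".cls" payload arrives as bytes such as b"3"; shifting 1..5 down
    # to 0..4 keeps every target inside [0, n_classes - 1] for NLLLoss.
    return torch.tensor(int(x.decode()) - 1, dtype=torch.int64)


class WebDatasetDDP(IterableDataset):
    def __init__(self, urls, shuffle_buffer=1000):
        super().__init__()
        self.dataset = wds.DataPipeline(
            wds.SimpleShardList(urls),
            wds.split_by_node,        # one subset of shards per DDP rank
            wds.split_by_worker,      # further split across DataLoader workers
            wds.tarfile_to_samples(),
            wds.shuffle(shuffle_buffer),
            wds.to_tuple("__key__", "jpg", "cls"),
            wds.map_tuple(key_transform, train_transform, label_transform),
        )

    def __iter__(self):
        return iter(self.dataset)
```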
- ChatGPT (might be wrong)
The error message you're encountering, `Assertion 't >= 0 && t < n_classes' failed`, is related to the Negative Log Likelihood Loss (NLLLoss) in PyTorch, which is typically used for classification tasks. This assertion error means that one or more of the target labels in your dataset are either negative or greater than or equal to the number of classes (`n_classes`). In simpler terms, the model is encountering a target value (label) that is not within the valid range for classification. For a classification task with `n_classes` possible categories, valid target labels must range from `0` to `n_classes - 1`.
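A minimal repro of this failure mode, assuming a 5-class setup (the tensors below are made up for illustration): on CPU the same invalid target raises a readable exception, whereas on GPU it surfaces as the cryptic device-side assert from the log.

```python
import torch
import torch.nn.functional as F

n_classes = 5
logits = torch.randn(4, n_classes)
targets = torch.tensor([0, 2, 5, 1])   # 5 is invalid: valid labels are 0..4

# On CPU the invalid target fails immediately with a readable message
# (e.g. "Target 5 is out of bounds."); on CUDA the same call trips the
# `t >= 0 && t < n_classes` device-side assert shown above.
try:
    F.nll_loss(torch.log_softmax(logits, dim=1), targets)
except (IndexError, RuntimeError) as err:
    print(err)
```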
- Check the target labels: Ensure that the target labels in your dataset are integers and fall within the valid range `[0, n_classes - 1]` (see the validation sketch after this list).
    - If you have `n_classes = 5`, then the valid labels are `0, 1, 2, 3, 4`.
    - Any label outside this range, such as `-1` or `5`, will trigger the assertion error.

  How to check: add a check to validate the range of your target labels before passing them to the model:

  assert torch.all((target >= 0) & (target < n_classes)), "Invalid target label encountered!"
- Preprocess target labels: If your dataset contains labels that are outside the valid range, preprocess the target labels to ensure they are within the correct range. Example:

  # Remap the target labels to be in the range [0, n_classes - 1]
  target = target.clamp(0, n_classes - 1)
- Check for invalid class weights: If you're using class weights in your loss function (e.g., passing `weight` to `NLLLoss` or `CrossEntropyLoss`), ensure that the class weights correspond to the number of classes and that they match the indices of the classes correctly (a short sketch follows this list).
- Verify number of classes in your dataset: Double-check that the `n_classes` variable in your code corresponds to the actual number of unique classes in your dataset. If `n_classes` is set incorrectly, targets outside this range will cause the error. You can inspect the number of unique classes using:

  unique_classes = torch.unique(targets)
  print(unique_classes)  # Ensure this matches your n_classes
- Label smoothing: If you use label smoothing or other data augmentation techniques, ensure that these methods don't produce invalid target labels.
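A sketch of the validation suggested in the first and fourth points, assuming a standard `DataLoader` yielding `(input, target)` batches; the loader argument and the `n_classes` value are placeholders.

```python
import torch

n_classes = 5  # placeholder: set to the real number of classes


def check_targets(loader):
    """Scan the loader once and fail loudly on any label outside [0, n_classes - 1]."""
    seen = set()
    for _, targets in loader:  # assumes batches of (input, target)
        bad = targets[(targets < 0) | (targets >= n_classes)]
        if bad.numel() > 0:
            raise ValueError(f"Invalid target labels found: {bad.tolist()}")
        seen.update(torch.unique(targets).tolist())
    print(f"Unique labels seen: {sorted(seen)}")  # should all be in 0..n_classes-1
```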
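And for the class-weight point, a minimal sketch (the weight values are arbitrary placeholders): the `weight` tensor passed to `NLLLoss`/`CrossEntropyLoss` must have exactly `n_classes` entries, indexed in the same order as the remapped labels.

```python
import torch
import torch.nn as nn

n_classes = 5
# One weight per class index 0..n_classes-1; the values here are arbitrary.
class_weights = torch.tensor([1.0, 2.0, 1.0, 1.0, 1.0])
assert class_weights.numel() == n_classes, "weight tensor must have n_classes entries"
criterion = nn.CrossEntropyLoss(weight=class_weights)
```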
If you're using a dataset with 5 classes (indexed from `0` to `4`) but your dataset contains labels like `5` or `-1`, you could preprocess your targets as follows:
# Suppose you have targets that are incorrectly labeled
targets = torch.tensor([0, 1, 2, 5, -1]) # Example target with invalid labels
# Clamp target values between 0 and n_classes - 1
n_classes = 5
targets = targets.clamp(0, n_classes - 1)
print(targets) # Output will be [0, 1, 2, 4, 0]
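Note that clamping silently rewrites out-of-range labels to the nearest valid index (the `-1` above becomes class `0`), which can hide a real labeling bug. Since the labels in this dataset are `1..5`, an explicit shift, as in the `label_transform` of the solution above, is the safer fix; a minimal sketch with made-up values:

```python
import torch

# Shift the 1-based labels down explicitly instead of clamping, so a truly
# unexpected value still fails loudly rather than being silently rewritten.
raw_targets = torch.tensor([1, 2, 3, 4, 5])  # labels as stored in the dataset
targets = raw_targets - 1                    # now 0..4
assert torch.all((targets >= 0) & (targets < 5)), "unexpected label value"
```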
- Log the target values right before passing them to the loss function to check if they are within the correct range.
- Print the unique classes in the dataset to confirm if there are any out-of-bound labels:
print(torch.unique(targets)) # Should print valid class indices only