⚠️ 🟢 Issue: training error (NLLLoss target out of range triggers a CUDA device-side assert)
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
...
[1,mpirank:1,algo-2]<stdout>: File "train.py", line 675, in <module>
[1,mpirank:1,algo-2]<stdout>: main(task)
[1,mpirank:1,algo-2]<stdout>: File "train.py", line 572, in main
[1,mpirank:1,algo-2]<stdout>: train(task)
[1,mpirank:1,algo-2]<stdout>: File "train.py", line 277, in train
[1,mpirank:1,algo-2]<stdout>: loss.backward()
[1,mpirank:1,algo-2]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
[1,mpirank:1,algo-2]<stdout>: torch.autograd.backward(
[1,mpirank:1,algo-2]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
[1,mpirank:1,algo-2]<stdout>: Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[1,mpirank:1,algo-2]<stdout>:RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
[1,mpirank:0,algo-1]<stdout>: Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[1,mpirank:0,algo-1]<stdout>:RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
- ✅ Solution:
import torch
import webdataset as wds
from torch.utils.data import IterableDataset

def label_transform(x):
    ## Original labels are (1,2,3,4,5)
    ## Convert to (0,1,2,3,4)
    return torch.tensor(int(x.decode()) - 1, dtype=torch.int64)
class WebDatasetDDP(IterableDataset):
    def __init__(self,
                 ...
        super().__init__()
        self.dataset = (
            ...
            wds.DataPipeline(
                ...
                wds.map_tuple(
                    key_transform,
                    train_transform,
                    label_transform,
                ),
                ...
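Before launching a long distributed run, the remapping can be sanity-checked on its own. The following is a minimal sketch, not part of the original training script; the byte labels are made up and `label_transform` is the function from the snippet above:

```python
import torch

def label_transform(x):
    # Same mapping as above: byte strings "1".."5" -> class indices 0..4
    return torch.tensor(int(x.decode()) - 1, dtype=torch.int64)

n_classes = 5
raw_labels = [b"1", b"3", b"5"]   # example labels as WebDataset yields them (bytes)
targets = torch.stack([label_transform(x) for x in raw_labels])

# Every target must land in [0, n_classes - 1], otherwise the CUDA assert above fires.
assert torch.all((targets >= 0) & (targets < n_classes)), "label out of range"
print(targets)  # tensor([0, 2, 4])
```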
- ChatGPT (might be wrong)
The error message you're encountering, `Assertion 't >= 0 && t < n_classes' failed`, is related to the Negative Log Likelihood Loss (NLLLoss) in PyTorch, which is typically used for classification tasks. This assertion error means that one or more of the target labels in your dataset are either negative or greater than or equal to the number of classes (`n_classes`). In simpler terms, the model is encountering a target value (label) that is not within the valid range for classification.
The issue typically occurs because the target labels in your dataset have values that are outside the expected range of class indices. For a classification task with `n_classes` possible categories, valid target labels should range from `0` to `n_classes - 1`.
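As an illustration of the failure mode, here is a minimal sketch; the logits and targets are made up, not taken from the training job:

```python
import torch
import torch.nn.functional as F

n_classes = 5
logits = torch.randn(4, n_classes)     # stand-in model outputs for 4 samples
targets = torch.tensor([0, 2, 4, 5])   # 5 is out of range for a 5-class problem

try:
    F.cross_entropy(logits, targets)   # cross_entropy calls nll_loss internally
except (IndexError, RuntimeError) as err:
    # On CPU this fails with "Target 5 is out of bounds."; on CUDA the same
    # condition surfaces as the device-side assert `t >= 0 && t < n_classes`.
    print(err)
```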
- Check the target labels: Ensure that the target labels in your dataset are integers and fall within the valid range `[0, n_classes - 1]`.
  - If you have `n_classes = 5`, then the valid labels are `0, 1, 2, 3, 4`.
  - Any label outside this range, such as `-1` or `5`, will trigger the assertion error.
  How to check: Add a check to validate the range of your target labels before passing them to the model.
  assert torch.all((target >= 0) & (target < n_classes)), "Invalid target label encountered!"
- Preprocess target labels: If your dataset contains labels that are outside the valid range, preprocess the target labels to ensure they are within the correct range. Example:
  # Assuming you need to remap the target labels to be in the range [0, n_classes-1]
  target = target.clamp(0, n_classes - 1)
- Check for invalid class weights: If you're using class weights in your loss function (e.g., passing `weight` to `NLLLoss` or `CrossEntropyLoss`), ensure that the class weights correspond to the number of classes and that they match the indices of the classes correctly (see the sketch after this list).
- Verify number of classes in your dataset: Double-check that the `n_classes` variable in your code corresponds to the actual number of unique classes in your dataset. If `n_classes` is set incorrectly, targets outside this range will cause the error. You can inspect the number of unique classes using:
  unique_classes = torch.unique(targets)
  print(unique_classes)  # Ensure this matches with your n_classes
- Label smoothing: If you use label smoothing or other data augmentation techniques, ensure that these methods don't produce invalid target labels.
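Regarding the class-weights point above, a minimal sketch of the shape requirement (the weight values and batch size are illustrative, not from the training script):

```python
import torch
import torch.nn as nn

n_classes = 5
# One weight per class, indexed 0..n_classes-1 (illustrative values only)
class_weights = torch.tensor([1.0, 2.0, 1.0, 0.5, 1.5])
assert class_weights.numel() == n_classes, "weight tensor must have exactly n_classes entries"

criterion = nn.CrossEntropyLoss(weight=class_weights)
loss = criterion(torch.randn(8, n_classes), torch.randint(0, n_classes, (8,)))
```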
If you're using a dataset with 5 classes (indexed from `0` to `4`), but your dataset contains labels like `5` or `-1`, you could preprocess your targets as follows:
# Suppose you have targets that are incorrectly labeled
targets = torch.tensor([0, 1, 2, 5, -1]) # Example target with invalid labels
# Clamp target values between 0 and n_classes - 1
n_classes = 5
targets = targets.clamp(0, n_classes - 1)
print(targets) # Output will be [0, 1, 2, 4, 0]
- Log the target values right before passing them to the loss function to check if they are within the correct range.
- Print the unique classes in the dataset to confirm if there are any out-of-bound labels:
print(torch.unique(targets)) # Should print valid class indices only
Let me know how these steps work for you or if you need further assistance!
The error you're seeing is a timeout error in a distributed training setup, specifically with SMDDP (SageMaker Distributed Data Parallel). The error indicates that a collective communication operation (in this case, `all_gather`) took longer than the timeout limit of 1800 seconds (30 minutes), causing the job to terminate. This typically happens during synchronization of data across nodes when using distributed data parallelism.
Possible Causes:
- Synchronization Delay: One or more GPUs are lagging behind, causing synchronization delays. This can happen if some GPUs are processing batches more slowly than others.
- Communication Bottleneck: If you're using multiple instances, the network communication between instances (nodes) could be slow or unstable, causing distributed operations (such as `all_gather`) to time out.
- Batch Size Discrepancy: If the batch sizes across nodes or GPUs are not equal, or if the data loading process is inconsistent, it can cause desynchronization during collective operations like `all_gather`.
- Deadlocks or Resource Contention: There might be issues with resource contention, deadlocks, or inefficient scheduling on the distributed backend.
- Memory Pressure: If any of the GPUs runs out of memory, it could cause one GPU to stall while others continue processing, leading to synchronization timeouts.
Steps to Fix:
- Reduce the Batch Size: Try reducing the batch size to ensure that each GPU is able to process data without hitting memory limits or causing delays.
- Check the Data Loading Pipeline: Make sure that the data loading is evenly distributed across all nodes and workers. Imbalances here can cause one node to wait for others to finish.
- Use Gradient Accumulation: If reducing batch size is not enough, try gradient accumulation to simulate larger batch sizes while avoiding memory issues (see the first sketch after this list).
- Tune Timeout Settings: You can try increasing the timeout setting for collective operations to allow for more time, although this may only be a temporary fix (see the second sketch after this list).
- Check Network Speed: If multiple instances are used, check the network speed and ensure that the communication between the instances is not a bottleneck.
- Inspect Resource Utilization: Check GPU and memory utilization on each instance. Ensure that no instance is underutilized or overburdened.
- Ensure Proper Sharding: Verify that the dataset is properly sharded across all nodes to avoid data imbalance. Using `split_by_worker` or `split_by_node` in the data pipeline can help (see the third sketch after this list).
Let me know if you'd like to dive deeper into any of these fixes or need further clarification.
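For the gradient-accumulation suggestion, a minimal self-contained sketch; the tiny linear model, optimizer, and synthetic data are placeholders, and `accum_steps` is chosen arbitrarily:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(16, 5)                      # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randint(0, 5, (64,))), batch_size=4)

accum_steps = 4                               # effective batch = accum_steps * per-GPU batch
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum_steps  # scale so accumulated gradients average out
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```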
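For the timeout suggestion, this is how the limit is raised for a plain `torch.distributed` process group; whether the SMDDP backend accepts the same `timeout` argument should be verified against the SageMaker documentation, so treat this as a sketch:

```python
from datetime import timedelta
import torch.distributed as dist

# Raise the collective-operation timeout (default is 30 minutes) to 2 hours.
# Expected to run under a launcher (torchrun/mpirun) that sets the usual
# RANK / WORLD_SIZE / MASTER_ADDR environment variables.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```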
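For the sharding suggestion, a minimal WebDataset pipeline sketch; the shard URL pattern and tuple keys are placeholders, not taken from the actual training script:

```python
import webdataset as wds

# Placeholder shard pattern; replace with the real S3/file location.
urls = "pipe:aws s3 cp s3://my-bucket/shards/train-{000000..000099}.tar -"

dataset = wds.DataPipeline(
    wds.SimpleShardList(urls),
    wds.split_by_node,        # each node reads a disjoint subset of shards
    wds.split_by_worker,      # each DataLoader worker reads a disjoint subset within the node
    wds.tarfile_to_samples(),
    wds.shuffle(1000),
    wds.to_tuple("__key__", "jpg", "cls"),
)
```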