⚠️ 🟢 Issue: training error: NLLLoss assertion `t >= 0 && t < n_classes` failed, followed by CUDA error `CUBLAS_STATUS_NOT_INITIALIZED`
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
...
[1,mpirank:1,algo-2]<stdout>: File "train.py", line 675, in <module>
[1,mpirank:1,algo-2]<stdout>: main(task)
[1,mpirank:1,algo-2]<stdout>: File "train.py", line 572, in main
[1,mpirank:1,algo-2]<stdout>: train(task)
[1,mpirank:1,algo-2]<stdout>: File "train.py", line 277, in train
[1,mpirank:1,algo-2]<stdout>: loss.backward()
[1,mpirank:1,algo-2]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
[1,mpirank:1,algo-2]<stdout>: torch.autograd.backward(
[1,mpirank:1,algo-2]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
[1,mpirank:1,algo-2]<stdout>: Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[1,mpirank:1,algo-2]<stdout>:RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
[1,mpirank:0,algo-1]<stdout>: File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
[1,mpirank:0,algo-1]<stdout>: Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[1,mpirank:0,algo-1]<stdout>:RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
- ✅ Solution:
import torch
import webdataset as wds
from torch.utils.data import IterableDataset

def label_transform(x):
    ## Original labels are (1,2,3,4,5)
    ## Convert to (0,1,2,3,4)
    return torch.tensor(int(x.decode()) - 1, dtype=torch.int64)

class WebDatasetDDP(IterableDataset):
    def __init__(self,
        ...
        super().__init__()
        self.dataset = (
            ...
            wds.DataPipeline(
                ...
                wds.map_tuple(
                    key_transform,
                    train_transform,
                    label_transform,
                ),
                ...
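Quick sanity check for the transform above (illustrative only; the byte-string inputs mirror the `x.decode()` call, i.e. WebDataset handing over the label field as bytes):

print(label_transform(b"1"))  # tensor(0)
print(label_transform(b"5"))  # tensor(4)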
- ChatGPT (might be wrong)
The error message you're encountering, `Assertion 't >= 0 && t < n_classes' failed`, is related to the Negative Log Likelihood Loss (NLLLoss) in PyTorch, which is typically used for classification tasks. This assertion error means that one or more of the target labels in your dataset are either negative or greater than or equal to the number of classes (`n_classes`). In simpler terms, the model is encountering a target value (label) that is not within the valid range for classification.
The issue typically occurs because the target labels in your dataset have values that are outside the expected range of class indices. For a classification task with `n_classes` possible categories, valid target labels should range from `0` to `n_classes - 1`.
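For illustration, a minimal repro sketch (a made-up 5-class example, not from the training code above): on CPU the same out-of-range target raises a readable `IndexError`, while on CUDA it surfaces as the device-side assert shown in the log.

import torch
import torch.nn.functional as F

n_classes = 5
logits = torch.randn(3, n_classes)      # batch of 3 samples, 5 classes
bad_targets = torch.tensor([0, 4, 5])   # 5 is out of range for 5 classes

try:
    # On CPU this raises IndexError ("Target 5 is out of bounds");
    # on CUDA the same condition triggers `t >= 0 && t < n_classes`.
    F.nll_loss(F.log_softmax(logits, dim=1), bad_targets)
except IndexError as e:
    print(e)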
- Check the target labels: Ensure that the target labels in your dataset are integers and fall within the valid range `[0, n_classes - 1]`.
  - If you have `n_classes = 5`, then the valid labels are `0, 1, 2, 3, 4`.
  - Any label outside this range, such as `-1` or `5`, will trigger the assertion error.
  How to check: Add a check to validate the range of your target labels before passing them to the model.
    assert torch.all((target >= 0) & (target < n_classes)), "Invalid target label encountered!"
- Preprocess target labels: If your dataset contains labels that are outside the valid range, preprocess the target labels to ensure they are within the correct range. Example:
    # Assuming you need to remap the target labels to be in the range [0, n_classes-1]
    target = target.clamp(0, n_classes - 1)
- Check for invalid class weights: If you're using class weights in your loss function (e.g., passing `weight` to `NLLLoss` or `CrossEntropyLoss`), ensure that the class weights correspond to the number of classes and that they match the indices of the classes correctly (see the sketch after this list).
- Verify number of classes in your dataset: Double-check that the `n_classes` variable in your code corresponds to the actual number of unique classes in your dataset. If `n_classes` is set incorrectly, targets outside this range will cause the error. You can inspect the number of unique classes using:
    unique_classes = torch.unique(targets)
    print(unique_classes)  # Ensure this matches with your n_classes
- Label smoothing: If you use label smoothing or other data augmentation techniques, ensure that these methods don't produce invalid target labels.
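A minimal sketch for the class-weights point above (the weight values are made up for illustration): the `weight` tensor must have exactly `n_classes` entries, where index `i` weights class `i`.

import torch
import torch.nn as nn

n_classes = 5
class_weights = torch.tensor([1.0, 2.0, 1.0, 0.5, 1.0])  # one weight per class, index i -> class i
assert class_weights.numel() == n_classes

criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, n_classes)            # dummy model outputs
targets = torch.randint(0, n_classes, (8,))   # valid labels in [0, n_classes - 1]
print(criterion(logits, targets).item())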
If you're using a dataset with 5 classes (indexed from `0` to `4`), but your dataset contains labels like `5` or `-1`, you could preprocess your targets as follows:
# Suppose you have targets that are incorrectly labeled
targets = torch.tensor([0, 1, 2, 5, -1]) # Example target with invalid labels
# Clamp target values between 0 and n_classes - 1
n_classes = 5
targets = targets.clamp(0, n_classes - 1)
print(targets) # Output will be [0, 1, 2, 4, 0]
- Log the target values right before passing them to the loss function to check if they are within the correct range.
- Print the unique classes in the dataset to confirm if there are any out-of-bound labels:
print(torch.unique(targets)) # Should print valid class indices only
⚠️ 🟢 Issue: AllGather and eventually a timeout error?
- ✅ Solution: use `dist.broadcast()` and `dist.barrier()`
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html
Related topics: distributed model parallelism, sharded data parallelism...
Yes, you're correct! In distributed training with multiple nodes, the state (like validation loss or early stopping condition) might not be the same across all nodes, leading to inconsistent stopping behavior if not handled properly.
Using `dist.barrier()` ensures synchronization across all nodes in the distributed environment. This allows you to make sure that once a stopping condition is met (e.g., early stopping), all nodes will reach a consensus and stop the training at the same time.
Here's how you can use `dist.barrier()` for early stopping in your SageMaker Distributed Data Parallel (SDP) setup:
- Rank 0 Node (Master Node) Handling: Compute the early stopping condition (e.g., checking the validation loss) only on the master node (rank 0).
- Broadcast the Early Stopping Signal: Use `dist.broadcast()` to share the early stopping decision from the master node to all worker nodes.
- Sync Nodes with `dist.barrier()`: Ensure all nodes stop training together based on the early stopping decision.
Here's a revised version of the code with these ideas implemented:
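The referenced code block was not captured in these notes; below is a minimal sketch of the idea under the same assumptions (a `torch.distributed` process group is already initialized, and names like `val_loss`, `best_loss`, `patience_counter`, and `device` are illustrative):

import torch
import torch.distributed as dist

def broadcast_early_stop(stop_now: bool, device: torch.device) -> bool:
    # Rank 0 decides whether to stop; every rank receives the same flag.
    # `stop_now` only matters on rank 0 (e.g. derived from the validation loss).
    flag = torch.tensor(
        [1 if (dist.get_rank() == 0 and stop_now) else 0],
        dtype=torch.int64,
        device=device,
    )
    dist.broadcast(flag, src=0)   # share rank 0's decision with all workers
    dist.barrier()                # make sure every rank has it before acting
    return bool(flag.item())

# Inside the training loop, after validation (illustrative variable names):
# if dist.get_rank() == 0:
#     patience_counter = 0 if val_loss < best_loss else patience_counter + 1
#     best_loss = min(best_loss, val_loss)
#     stop_now = patience_counter >= patience
# else:
#     stop_now = False
# if broadcast_early_stop(stop_now, device):
#     break  # all ranks leave the loop together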
Key Points:
- Rank 0 (Master Node): The early stopping logic (like tracking validation loss and deciding when to stop) runs only on the master node (`rank == 0`).
- Broadcast Early Stop Signal: The master node (`rank == 0`) uses `dist.broadcast()` to share the early stopping decision with all other nodes. The other nodes receive this broadcast and set their `early_stop` flag accordingly.
- Synchronization (`dist.barrier()`): All nodes are synchronized at the end of the epoch or right before early stopping to ensure that no node continues training after the early stop condition is met.
With this approach, you ensure that all nodes stop at the same time when the early stopping condition is met, preventing inconsistent behavior between nodes.