@nov05
Last active February 8, 2025 06:40
  • ⚠️🟒 Issue: training error (NLLLoss target out of range, surfacing as a CUDA/cuBLAS failure during backward)
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
...
[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 675, in <module>
[1,mpirank:1,algo-2]<stdout>:    main(task)
[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 572, in main
[1,mpirank:1,algo-2]<stdout>:    train(task)
[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 277, in train
[1,mpirank:1,algo-2]<stdout>:    loss.backward()
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
[1,mpirank:1,algo-2]<stdout>:    torch.autograd.backward(
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
[1,mpirank:1,algo-2]<stdout>:    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[1,mpirank:1,algo-2]<stdout>:RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
[1,mpirank:0,algo-1]<stdout>:    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[1,mpirank:0,algo-1]<stdout>:RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
  • βœ… Solution:
import torch
import webdataset as wds
from torch.utils.data import IterableDataset

def label_transform(x):
    ## Original labels are (1,2,3,4,5)
    ## Convert to (0,1,2,3,4)
    return torch.tensor(int(x.decode()) - 1, dtype=torch.int64)

class WebDatasetDDP(IterableDataset):
    def __init__(self,
        ...
        super().__init__()
        self.dataset = (
            ...
            wds.DataPipeline(
                ...
                wds.map_tuple(
                    key_transform,
                    train_transform,
                    label_transform,
                ),
                ...
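
As a quick sanity check, the remapped labels should all land in [0, n_classes - 1] before they reach the loss. A minimal sketch using the label_transform above (the raw byte-string labels here are made up for illustration):

import torch

n_classes = 5
raw_labels = [b"1", b"3", b"5"]   ## WebDataset yields the class label as bytes
labels = torch.stack([label_transform(x) for x in raw_labels])
print(labels)                      ## tensor([0, 2, 4])
assert torch.all((labels >= 0) & (labels < n_classes)), "label out of range"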

  • ChatGPT (might be wrong)

The error message you're encountering, Assertion 't >= 0 && t < n_classes' failed, is related to the Negative Log Likelihood Loss (NLLLoss) in PyTorch, which is typically used for classification tasks. This assertion error means that one or more of the target labels in your dataset are either negative or greater than or equal to the number of classes (n_classes). In simpler terms, the model is encountering a target value (label) that is not within the valid range for classification.

Cause

The issue typically occurs because the target labels in your dataset have values that are outside the expected range of class indices. For a classification task with n_classes possible categories, valid target labels should range from 0 to n_classes - 1.

Solutions:

  1. Check the target labels: Ensure that the target labels in your dataset are integers and fall within the valid range [0, n_classes - 1].

    • If you have n_classes = 5, then the valid labels are 0, 1, 2, 3, 4.
    • Any label outside this range, such as -1 or 5, will trigger the assertion error.

    How to check: Add a check to validate the range of your target labels before passing them to the model.

    assert torch.all((target >= 0) & (target < n_classes)), "Invalid target label encountered!"
  2. Preprocess target labels: If your dataset contains labels that are outside the valid range, preprocess the target labels to ensure they are within the correct range.

    Example:

    # Assuming you need to remap the target labels to be in the range [0, n_classes-1]
    target = target.clamp(0, n_classes - 1)
  3. Check for invalid class weights: If you're using class weights in your loss function (e.g., passing weight to NLLLoss or CrossEntropyLoss), ensure that the class weights correspond to the number of classes and that they match the indices of the classes correctly.

  4. Verify number of classes in your dataset: Double-check that the n_classes variable in your code corresponds to the actual number of unique classes in your dataset. If n_classes is set incorrectly, targets outside this range will cause the error.

    You can inspect the number of unique classes using:

    unique_classes = torch.unique(targets)
    print(unique_classes)  # Ensure this matches with your n_classes
  5. Label smoothing: If you use label smoothing or other data augmentation techniques, ensure that these methods don't produce invalid target labels.

Example Fix:

If you're using a dataset with 5 classes (indexed from 0 to 4), but your dataset contains labels like 5 or -1, you could preprocess your targets as follows:

# Suppose you have targets that are incorrectly labeled
targets = torch.tensor([0, 1, 2, 5, -1])  # Example target with invalid labels

# Clamp target values between 0 and n_classes - 1
n_classes = 5
targets = targets.clamp(0, n_classes - 1)

print(targets)  # Output will be [0, 1, 2, 4, 0]

Debugging:

  • Log the target values right before passing them to the loss function to check if they are within the correct range.
  • Print the unique classes in the dataset to confirm if there are any out-of-bound labels:
    print(torch.unique(targets))  # Should print valid class indices only

Let me know how these steps work for you or if you need further assistance!
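
Regarding point 3 above: if class weights are passed to the loss, the weight tensor must have exactly n_classes entries, indexed in the same order as the class labels. A minimal sketch (the weight values are arbitrary):

import torch
import torch.nn as nn

n_classes = 5
class_weights = torch.tensor([1.0, 2.0, 1.0, 1.0, 0.5])  ## one weight per class 0..4
assert class_weights.numel() == n_classes
criterion = nn.CrossEntropyLoss(weight=class_weights)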

nov05 commented Feb 8, 2025

⚠️🟒 Issue: another type of broadcast-related error (SMDDPTimeoutError).

βœ… I can now confirm that the issue was related to how the training loop is broken out of under the SageMaker-Torch DDP framework, rather than to WebDataset. I simply replaced dist.broadcast(broadcast_early_stop, src=0) with dist.all_reduce(broadcast_early_stop, op=dist.ReduceOp.SUM), and the error is gone.
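
A minimal sketch of the pattern that worked, assuming the early-stop decision is made on rank 0 inside the epoch loop (names like should_stop are illustrative, not the actual ones in train.py):

import torch
import torch.distributed as dist

def train_loop(num_epochs, rank, device, should_stop):
    for epoch in range(num_epochs):
        ## ... one epoch of training and validation ...
        early_stop = torch.zeros(1, device=device)
        if rank == 0 and should_stop(epoch):  ## e.g. patience exhausted on rank 0
            early_stop += 1
        ## Every rank calls the same collective, so the summed flag is identical
        ## everywhere and all ranks break on the same epoch; no rank is left
        ## blocking on a broadcast that another rank never issues.
        dist.all_reduce(early_stop, op=dist.ReduceOp.SUM)
        if early_stop.item() > 0:
            break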

[1,mpirank:0,algo-1]<stdout>:πŸ‘‰ VAL: Average loss: 439.5972, Accuracy: 252/1536 (16.41%)
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:πŸ‘‰ Train Epoch: 1, Learning Rate: 0.1
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:1,algo-2]<stderr>:terminate called after throwing an instance of 'SMDDPTimeoutError'
[1,mpirank:1,algo-2]<stderr>:  what():
[1,mpirank:1,algo-2]<stderr>:
[1,mpirank:1,algo-2]<stderr>:#011Timeout: A call to 'broadcast' has taken over 1800.000000 seconds. Terminating the distributed job.It might be one of the workers failed during forward and backward propagation and failed to call "broadcast".
[1,mpirank:1,algo-2]<stderr>:
[1,mpirank:1,algo-2]<stderr>:#011Extend timeout using dist.init_process_group(timeout=timedelta(minutes=60)
[1,mpirank:1,algo-2]<stderr>:#011Extend timeout using dist.init(timeout=timedelta(minutes=60)
[1,mpirank:1,algo-2]<stderr>:#011or refer to the debugging guide. Verify that all ranks call
[1,mpirank:1,algo-2]<stderr>:#011collective operations in the same order and within timeout period.
[1,mpirank:1,algo-2]<stderr>:
[1,mpirank:1,algo-2]<stderr>:[algo-2:00105] *** Process received signal ***
[1,mpirank:1,algo-2]<stderr>:[algo-2:00105] Signal: Aborted (6)
[1,mpirank:1,algo-2]<stderr>:[algo-2:00105] Signal code:  (-6)
UnexpectedStatusException: Error for Training job p5-amazon-bin-job-20250207-172703: Failed. Reason: 
AlgorithmError: SMDDPTimeoutError:
ExitCode 134
ErrorMessage "Exception ignored in: :<function Pipe.__del__ at 0x7f268daebc10>
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
 self.close()
 File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
 self.wait_for_child()
 File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
 raise IOError(f"{self.args}: exit {self.status} (read) {info}")
 OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/train/train-shard-000004.tar -',), {'shell': True, 
'bufsize': 8192}): exit 1 (read) {}
 Exception ignored in: <function Pipe.__del__ at 0x7f20e0860c10>
 OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/train/train-shard-000005.tar -',), {'shell': True, 
'bufsize': 8192}): exit 1 (read) {}
 Exception ignored in: Traceback (most recent call last)
 <function Pipe.__del__ at 0x7f20e0860c10>  File "/opt/conda/lib/python3.. Check troubleshooting guide for common 
errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html
  • Did wait_for_child() hang the process? Likely not.
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()
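
For reference, the longer timeout that the SMDDP message itself suggests can be set when initializing the process group. A sketch assuming the smddp backend of the SageMaker distributed data parallel library; this only papers over the symptom, the actual fix was the all_reduce change above:

from datetime import timedelta

import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  ## registers the "smddp" backend

dist.init_process_group(backend="smddp", timeout=timedelta(minutes=60))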
