@nov05
Last active February 8, 2025 06:40
  • ⚠️🟢 Issue: training error (NLLLoss target index out of the valid class range)
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
[1,mpirank:0,algo-1]<stderr>:../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
...
[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 675, in <module>
[1,mpirank:1,algo-2]<stdout>:    main(task)
[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 572, in main
[1,mpirank:1,algo-2]<stdout>:    train(task)
[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 277, in train
[1,mpirank:1,algo-2]<stdout>:    loss.backward()
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
[1,mpirank:1,algo-2]<stdout>:    torch.autograd.backward(
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
[1,mpirank:1,algo-2]<stdout>:    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[1,mpirank:1,algo-2]<stdout>:RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
[1,mpirank:0,algo-1]<stdout>:    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[1,mpirank:0,algo-1]<stdout>:RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
  • ✅ Solution:
def label_transform(x):
    ## Original labels are (1,2,3,4,5)
    ## Convert to (0,1,2,3,4)
    return torch.tensor(int(x.decode())-1, dtype=torch.int64)
    
class WebDatasetDDP(IterableDataset):
    def __init__(self,
    ...
        super().__init__()
        self.dataset = (
        ...
        wds.DataPipeline(
        ...
        wds.map_tuple(
            key_transform,
            train_transform, 
            label_transform,  
        ),
        ...
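
The root cause was labels in (1..5) reaching a 5-class loss that expects (0..4). A quick sanity check that could be run over the pipeline once before training (a hedged sketch; check_labels and the (key, image, label) sample layout are assumptions based on the snippet above):

import torch

N_CLASSES = 5  ## assumption: labels were remapped from (1..5) to (0..4)

def check_labels(dataset, n_classes=N_CLASSES):
    """Iterate the pipeline once and fail fast on any out-of-range label."""
    bad = set()
    for _, _, label in dataset:  ## assumes (key, image, label) samples
        t = int(label)
        if not 0 <= t < n_classes:
            bad.add(t)
    assert not bad, f"Invalid target labels encountered: {sorted(bad)}"

## e.g. check_labels(dataset) right after building the DataPipeline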

  • ChatGPT (might be wrong)

The error message you're encountering, Assertion 't >= 0 && t < n_classes' failed, is related to the Negative Log Likelihood Loss (NLLLoss) in PyTorch, which is typically used for classification tasks. This assertion error means that one or more of the target labels in your dataset are either negative or greater than or equal to the number of classes (n_classes). In simpler terms, the model is encountering a target value (label) that is not within the valid range for classification.

Cause

The issue typically occurs because the target labels in your dataset have values that are outside the expected range of class indices. For a classification task with n_classes possible categories, valid target labels should range from 0 to n_classes - 1.

Solutions:

  1. Check the target labels: Ensure that the target labels in your dataset are integers and fall within the valid range [0, n_classes - 1].

    • If you have n_classes = 5, then the valid labels are 0, 1, 2, 3, 4.
    • Any label outside this range, such as -1 or 5, will trigger the assertion error.

    How to check: Add a check to validate the range of your target labels before passing them to the model.

    assert torch.all((target >= 0) & (target < n_classes)), "Invalid target label encountered!"
  2. Preprocess target labels: If your dataset contains labels that are outside the valid range, preprocess the target labels to ensure they are within the correct range.

    Example:

    # Assuming you need to remap the target labels to be in the range [0, n_classes-1]
    target = target.clamp(0, n_classes - 1)
  3. Check for invalid class weights: If you're using class weights in your loss function (e.g., passing weight to NLLLoss or CrossEntropyLoss), ensure that the class weights correspond to the number of classes and that they match the indices of the classes correctly.

  4. Verify number of classes in your dataset: Double-check that the n_classes variable in your code corresponds to the actual number of unique classes in your dataset. If n_classes is set incorrectly, targets outside this range will cause the error.

    You can inspect the number of unique classes using:

    unique_classes = torch.unique(targets)
    print(unique_classes)  # Ensure this matches with your n_classes
  5. Label smoothing: If you use label smoothing or other data augmentation techniques, ensure that these methods don't produce invalid target labels.

Example Fix:

If you're using a dataset with 5 classes (indexed from 0 to 4), but your dataset contains labels like 5 or -1, you could preprocess your targets as follows:

# Suppose you have targets that are incorrectly labeled
targets = torch.tensor([0, 1, 2, 5, -1])  # Example target with invalid labels

# Clamp target values between 0 and n_classes - 1
n_classes = 5
targets = targets.clamp(0, n_classes - 1)

print(targets)  # Output will be [0, 1, 2, 4, 0]

Debugging:

  • Log the target values right before passing them to the loss function to check if they are within the correct range.
  • Print the unique classes in the dataset to confirm if there are any out-of-bound labels:
    print(torch.unique(targets))  # Should print valid class indices only

Let me know how these steps work for you or if you need further assistance!

@nov05 (Author) commented Feb 6, 2025

  • ⚠️🟢 Issue: superclass doesn't implement __iter__ method.
[1,mpirank:0,algo-1]<stderr>:  File "train.py", line 678, in <module>
[1,mpirank:0,algo-1]<stderr>:    main(task)
[1,mpirank:0,algo-1]<stderr>:  File "train.py", line 575, in main
[1,mpirank:0,algo-1]<stderr>:    train(task)
[1,mpirank:0,algo-1]<stderr>:  File "train.py", line 271, in train
[1,mpirank:0,algo-1]<stderr>:    for batch_idx, (data, target) in enumerate(task.train_loader):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 437, in __iter__
[1,mpirank:0,algo-1]<stderr>:    return self._get_iterator()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 380, in _get_iterator
[1,mpirank:0,algo-1]<stderr>:    return _SingleProcessDataLoaderIter(self)
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 697, in __init__
[1,mpirank:0,algo-1]<stderr>:    self._dataset_fetcher = _DatasetKind.create_fetcher(
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 82, in create_fetcher
[1,mpirank:0,algo-1]<stderr>:    return _utils.fetch._IterableDatasetFetcher(dataset, auto_collation, collate_fn, drop_last)
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 23, in __init__
[1,mpirank:0,algo-1]<stderr>:    self.dataset_iter = iter(dataset)
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataset.py", line 169, in __iter__
[1,mpirank:0,algo-1]<stderr>:    raise NotImplementedError
[1,mpirank:0,algo-1]<stderr>:NotImplementedError
  • ✅ Solution:
class WebDatasetDDP(IterableDataset):
    def __init__(self, ...):
        super().__init__()
        ...
    def __iter__(self):
        for key, image, label in self.dataset:  ## Use dataset keys to distribute data
            if key % self.world_size == self.rank:  ## Ensure each GPU gets different data
                yield (image, label)
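
For reference, a self-contained sketch of the same idea (not the author's full class; the (key, image, label) layout and an integer key are assumptions):

from torch.utils.data import IterableDataset

class WebDatasetDDPSketch(IterableDataset):
    def __init__(self, dataset, rank, world_size):
        super().__init__()
        self.dataset = dataset          ## e.g. the wds.DataPipeline built above
        self.rank = rank                ## global rank of this process
        self.world_size = world_size    ## total number of processes

    def __iter__(self):
        for key, image, label in self.dataset:       ## use dataset keys to distribute data
            if key % self.world_size == self.rank:   ## each rank keeps a disjoint slice
                yield image, label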

@nov05 (Author) commented Feb 6, 2025

⚠️🟢 Issue

[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 678, in <module>
[1,mpirank:1,algo-2]<stdout>:    main(task)
[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 575, in main
[1,mpirank:1,algo-2]<stdout>:    train(task)
[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 276, in train
[1,mpirank:1,algo-2]<stdout>:    loss = task.train_criterion(output, target)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
[1,mpirank:1,algo-2]<stdout>:    return forward_call(*input, **kwargs)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 1174, in forward
[1,mpirank:1,algo-2]<stdout>:    return F.cross_entropy(input, target, weight=self.weight,
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
[1,mpirank:1,algo-2]<stdout>:    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
[1,mpirank:1,algo-2]<stdout>:RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Char'
  • ✅ Changed the tensor type from torch.int8 to torch.int64. It worked.
def label_transform(x):
    ## Original labels are (1,2,3,4,5)
    ## Convert to (0,1,2,3,4)
    return torch.tensor(int(x.decode())-1, dtype=torch.int64)

@nov05 (Author) commented Feb 6, 2025

⚠️🟢 Issue: the SageMaker dataloader is incompatible with WebDataset in multi-node training.

✅ Use WebDataset's DataPipeline instead, and implement __len__() and __iter__() myself (see the sketch after the traceback below).

[1,mpirank:0,algo-1]<stdout>:  File "train.py", line 683, in <module>
[1,mpirank:0,algo-1]<stdout>:    main(task)
[1,mpirank:0,algo-1]<stdout>:  File "train.py", line 582, in main
[1,mpirank:0,algo-1]<stdout>:    eval(task, phase='eval')
[1,mpirank:0,algo-1]<stdout>:  File "train.py", line 315, in eval
[1,mpirank:0,algo-1]<stdout>:    for data, target in data_loader:
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 636, in __next__
[1,mpirank:0,algo-1]<stdout>:    (data, worker_id) = self._next_data()
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 721, in _next_data
[1,mpirank:0,algo-1]<stdout>:    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
[1,mpirank:0,algo-1]<stdout>:    data.append(next(self.dataset_iter))
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/pipeline.py", line 71, in iterator
[1,mpirank:0,algo-1]<stdout>:    for sample in self.iterator1():
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/filters.py", line 423, in _map_tuple
[1,mpirank:0,algo-1]<stdout>:    for sample in data:
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/filters.py", line 400, in _to_tuple
[1,mpirank:0,algo-1]<stdout>:    for sample in data:
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/compat.py", line 101, in check_empty
[1,mpirank:0,algo-1]<stdout>:    for sample in source:
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/tariterators.py", line 227, in group_by_keys
[1,mpirank:0,algo-1]<stdout>:    for filesample in data:
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/tariterators.py", line 173, in tar_file_expander
[1,mpirank:0,algo-1]<stdout>:    for source in data:
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/cache.py", line 155, in __call__
[1,mpirank:0,algo-1]<stdout>:    for url in urls:
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/filters.py", line 217, in _shuffle
[1,mpirank:0,algo-1]<stdout>:    for sample in data:
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/shardlists.py", line 85, in split_by_worker
[1,mpirank:0,algo-1]<stdout>:    yield from src
[1,mpirank:0,algo-1]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/shardlists.py", line 73, in single_node_only
[1,mpirank:0,algo-1]<stdout>:    raise ValueError(
[1,mpirank:0,algo-1]<stdout>:ValueError: you need to add an explicit nodesplitter to your input pipeline for multi-node training
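
For reference, a minimal wds.DataPipeline sketch with explicit node and worker splitting, which addresses the ValueError above (the pipe: URL form, extension names, and transforms are assumptions, not the author's exact code):

import webdataset as wds

urls = "pipe:aws s3 cp s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000001}.tar -"

dataset = wds.DataPipeline(
    wds.SimpleShardList(urls),
    wds.split_by_node,        ## explicit nodesplitter for multi-node training
    wds.split_by_worker,      ## also split shards across DataLoader workers
    wds.tarfile_to_samples(),
    wds.shuffle(1000),
    wds.to_tuple("__key__", "jpg", "cls"),
    wds.map_tuple(key_transform, train_transform, label_transform),  ## as defined earlier
)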

@nov05 (Author) commented Feb 6, 2025

⚠️🟢 Issue: timeout when saving the trained model.

[1,mpirank:0,algo-1]<stdout>: 2025-02-06 10:05:33 Uploading - Uploading generated training model
[1,mpirank:0,algo-1]<stderr>:SMDDPTimeoutError
[1,mpirank:0,algo-1]<stderr>:'
[1,mpirank:0,algo-1]<stderr>:  what():
[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:#011Timeout: A call to 'allGather' has taken over 1800.000000 seconds. Terminating the distributed job.It might be one of the workers failed during forward and backward propagation and failed to call "allGather".
[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:#011Extend timeout using dist.init_process_group(timeout=timedelta(minutes=60)
[1,mpirank:0,algo-1]<stderr>:#011Extend timeout using dist.init(timeout=timedelta(minutes=60)
[1,mpirank:0,algo-1]<stderr>:#011or refer to the debugging guide. Verify that all ranks call
[1,mpirank:0,algo-1]<stderr>:#011collective operations in the same order and within timeout period.
[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:[algo-2:00106] *** Process received signal ***
[1,mpirank:0,algo-1]<stderr>:[algo-2:00106] Signal: Aborted (6)
[1,mpirank:0,algo-1]<stderr>:[algo-2:00106] Signal code:  (-6)
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2282, in all_gather
[1,mpirank:0,algo-1]<stderr>:    work.wait()
[1,mpirank:0,algo-1]<stderr>:RuntimeError: Timeout: A call to a collective SMDDP operation has taken over 1800 seconds. Terminating the distributed job.
  • ✅ Is this caused by saving the distributed (model/data parallel) model? Yes.
    Save the model on one GPU only, using model.state_dict().
torch.save(task.model.module.state_dict(), path)  ## SMDMP
torch.save(task.model.state_dict(), path)  ## SMDDP
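
A minimal sketch of saving only from rank 0 so the other ranks are not left waiting in a collective call while the file is written (assuming a DDP-wrapped model and an initialized torch.distributed process group):

import torch
import torch.distributed as dist

def save_checkpoint(model, path):
    if dist.get_rank() == 0:                         ## only one rank writes the file
        torch.save(model.module.state_dict(), path)  ## unwrap the DDP module
    dist.barrier()                                   ## keep the other ranks in step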

https://sagemaker.readthedocs.io/en/v2.20.0/api/training/smd_model_parallel.html
https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-v2.html

You can use the SMP library to accelerate the training and fine-tuning of large language models (LLMs), large vision models (LVMs), and foundation models (FMs) with hundreds of billions of parameters.

The SageMaker model parallelism library v2 (SMP v2) aligns the library’s APIs and methods with open source PyTorch Fully Sharded Data Parallelism (FSDP), which gives you the benefit of SMP performance optimizations with minimal code changes.

@nov05 (Author) commented Feb 6, 2025

⚠️ Issue: SMDDP collective timeout (all_gather in SyncBatchNorm)

[1,mpirank:1,algo-2]<stdout>:  File "train.py", line 284, in train
[1,mpirank:1,algo-2]<stdout>:    output = task.model(data)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
[1,mpirank:1,algo-2]<stdout>:    return forward_call(*input, **kwargs)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
[1,mpirank:1,algo-2]<stdout>:    output = self._run_ddp_forward(*inputs, **kwargs)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
[1,mpirank:1,algo-2]<stdout>:    return module_to_run(*inputs[0], **kwargs[0])
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
[1,mpirank:1,algo-2]<stdout>:    return forward_call(*input, **kwargs)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torchvision/models/resnet.py", line 285, in forward
[1,mpirank:1,algo-2]<stdout>:    return self._forward_impl(x)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torchvision/models/resnet.py", line 269, in _forward_impl
[1,mpirank:1,algo-2]<stdout>:    x = self.bn1(x)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
[1,mpirank:1,algo-2]<stdout>:    return forward_call(*input, **kwargs)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 752, in forward
[1,mpirank:1,algo-2]<stdout>:    return sync_batch_norm.apply(
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/_functions.py", line 65, in forward
[1,mpirank:1,algo-2]<stdout>:    dist.all_gather(combined_list, combined, process_group, async_op=False)
[1,mpirank:1,algo-2]<stdout>:  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2282, in all_gather
[1,mpirank:1,algo-2]<stdout>:    work.wait()
[1,mpirank:1,algo-2]<stdout>:RuntimeError: Timeout: A call to a collective SMDDP operation has taken over 1800 seconds. Terminating the distributed job.
  • ChatGPT (could be wrong)

The error you're seeing is a timeout error in a distributed training setup, specifically with SMDDP (SageMaker Distributed Data Parallel). The error indicates that a collective communication operation (in this case, all_gather) took longer than the timeout limit of 1800 seconds (30 minutes), causing the job to terminate. This typically happens during synchronization of data across nodes when using distributed data parallelism.

Possible Causes:

  1. Synchronization Delay: One or more GPUs are lagging behind, causing synchronization delays. This can happen if some GPUs are processing batches more slowly than others.

  2. Communication Bottleneck: If you're using multiple instances, the network communication between instances (nodes) could be slow or unstable, causing distributed operations (such as all_gather) to time out.

  3. Batch Size Discrepancy: If the batch sizes across nodes or GPUs are not equal, or if the data loading process is inconsistent, it can cause desynchronization during collective operations like all_gather.

  4. Deadlocks or Resource Contention: There might be issues with resource contention, deadlocks, or inefficient scheduling on the distributed backend.

  5. Memory Pressure: If any of the GPUs runs out of memory, it could cause one GPU to stall while others continue processing, leading to synchronization timeouts.

Steps to Fix:

  1. Reduce the Batch Size: Try reducing the batch size to ensure that each GPU is able to process data without hitting memory limits or causing delays.

  2. Check the Data Loading Pipeline: Make sure that the data loading is evenly distributed across all nodes and workers. Imbalances here can cause one node to wait for others to finish.

  3. Use Gradient Accumulation: If reducing batch size is not enough, try gradient accumulation to simulate larger batch sizes while avoiding memory issues.

  4. Tune Timeout Settings: You can try increasing the timeout setting for collective operations to allow for more time, although this may only be a temporary fix (see the sketch below).

  5. Check Network Speed: If multiple instances are used, check the network speed and ensure that the communication between the instances is not a bottleneck.

  6. Inspect Resource Utilization: Check GPU and memory utilization on each instance. Ensure that no instance is underutilized or overburdened.

  7. Ensure Proper Sharding: Verify that the dataset is properly sharded across all nodes to avoid data imbalance. Using split_by_worker or split_by_node in the data pipeline can help.

Let me know if you'd like to dive deeper into any of these fixes or need further clarification.
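
A minimal sketch of the timeout extension suggested in the SMDDP error message (assuming the script initializes the process group itself with the smddp backend; this only buys time, it doesn't fix a genuinely stuck rank):

from datetime import timedelta
import smdistributed.dataparallel.torch.torch_smddp  ## registers the "smddp" backend
import torch.distributed as dist

dist.init_process_group(backend="smddp", timeout=timedelta(minutes=60))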

@nov05 (Author) commented Feb 7, 2025

⚠️🟢 Issue 1: broken pipe. It seems I can ignore it for now.

  • Issue 2: the reported training progress exceeds 100% (e.g. 157%), because the WebDataset loader doesn't know the epoch length:
[1,mpirank:0,algo-1]<stdout>:👉 Train Epoch: 1, Learning Rate: 0.00016
[1,mpirank:1,algo-2]<stdout>:🔹 Rank 1, Train Epoch: 1 [256.0/2000 (14%)], Loss: 1.595960
[1,mpirank:0,algo-1]<stdout>:🔹 Rank 0, Train Epoch: 1 [256.0/2000 (14%)], Loss: 1.611980
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:1,algo-2]<stdout>:🔹 Rank 1, Train Epoch: 1 [5376.0/2000 (157%)], Loss: 1.581787
[1,mpirank:0,algo-1]<stdout>:🔹 Rank 0, Train Epoch: 1 [5376.0/2000 (157%)], Loss: 1.498845
  • My fix: wds.WebLoader(...).with_epoch(train_data_size // (batch_size * world_size))

  • My fix:

              print(
                  "🔹 Rank {}, Train Epoch: {} [{}/{} ({:.0f}%)], Loss: {:.6f}".format(
                      dist.get_rank(),                                         ## 1. global rank
                      task.current_epoch,                                      ## 2. current epoch
                      num_samples,                                             ## 3. samples that have been used
                      task.config.train_data_size,                             ## 4. total number of samples
                                                                               ## 5. progress within the epoch
                      100.0 * num_samples * dist.get_world_size() / task.config.train_data_size,       
                      loss.item(),                                             ## 6. loss value
                  )
              )
    ## WebDataset dataloader
    task.train_loader = (
        wds.WebLoader(
            train_dataset, 
            batch_size=None, 
            num_workers=task.config.num_cpu,
        ).unbatched()
        .shuffle(1000)
        .batched(task.config.batch_size)
        ## get total number of batches likely from:
        ## total number of samples in the dataset, epochs, batch_size, GPUs
        .with_epoch(task.config.train_data_size * task.config.epochs 
                    // (task.config.batch_size * dist.get_world_size()))   
    )

            print(
                "🔹 Rank {}, Train Epoch: {} [{}/{} ({:.0f}%)], Loss: {:.6f}".format(
                    dist.get_rank(),                                         ## 1. global rank
                    task.current_epoch,                                      ## 2. current epoch
                                                                                ## 3. samples that have been used
                    task.config.batch_size * ((batch_idx+1) * dist.get_world_size()),                 
                    task.config.train_data_size,                             ## 4. total number of samples
                                                                             ## 5. batch progress within the epoch
                    100.0 * (batch_idx+1) / (task.config.train_data_size // (task.config.batch_size*dist.get_world_size())),          
                    loss.item(),                                             ## 6. loss value
                )
            )

ChatGPT (might be wrong):

The error message you're encountering, specifically the [Errno 32] Broken pipe, indicates a failure in the process of downloading a shard (test-shard-000000.tar) from an S3 bucket (s3://p5-amazon-bin-images/webdataset/test/). The broken pipe error typically means that the connection was abruptly terminated or interrupted during the data transfer.

Here's what might be happening and how to fix it:

Possible Causes:

  1. Network Connectivity Issues: The connection between your training instance and S3 could have been disrupted or unstable, causing the download to fail.
  2. Insufficient Permissions: If your IAM role or S3 permissions are not set up correctly, the download might fail, leading to the broken pipe error.
  3. S3 Throttling: S3 can throttle requests if you exceed request limits or perform too many requests in a short period, resulting in connection drops.
  4. File Size or Corruption: If the file being downloaded is large or corrupted, it could cause issues when being streamed over a network.
  5. Timeout or Slow Download: If the download takes too long and a timeout is triggered, it could cause the connection to close prematurely.

Suggested Fixes:

  1. Retry Mechanism: Implement a retry mechanism in your data loader to handle transient network issues. This can be done using libraries like boto3, where you can configure retry policies (see the sketch below).

  2. Verify IAM Permissions: Ensure that the role attached to your instance or the AWS credentials used have sufficient permissions to access the S3 bucket and its objects.

  3. Check Network Connectivity: If you're using distributed training, check the network configuration between the nodes and ensure that there is a stable connection to the S3 bucket.

  4. Increase Timeout Settings: If possible, increase the timeout settings for S3 downloads or data transfers to prevent premature termination of the connection.

  5. Download Data Locally: As a workaround, you can try downloading the data to local storage before training starts, then access it from local storage during training. This reduces the risk of network issues affecting data access.

  6. Throttling Consideration: If you suspect throttling, review your S3 usage patterns and try to reduce the number of simultaneous requests. You can also consider contacting AWS support if you're hitting limits frequently.

If the issue persists after trying these steps, you might want to test the connection separately by downloading the file manually from S3 using aws s3 cp to check if the problem is related to your training code or environment configuration.
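
A hedged sketch of such a retry configuration (boto3 usage is an assumption; the job actually streams shards through curl/the AWS CLI, as the curl warnings in the log below suggest):

import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),  ## retry transient failures
)
s3.download_file(
    "p5-amazon-bin-images",                   ## bucket from the error message
    "webdataset/test/test-shard-000000.tar",  ## key from the error message
    "/tmp/test-shard-000000.tar",
)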


2025-02-07 06:45:56 Starting - Starting the training job...
2025-02-07 06:46:10 Starting - Preparing the instances for training...
2025-02-07 06:46:44 Downloading - Downloading input data...
2025-02-07 06:47:09 Downloading - Downloading the training image...............
2025-02-07 06:50:06 Training - Training image download completed. Training in progress.bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
/opt/conda/lib/python3.9/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "cipher": algorithms.TripleDES,
/opt/conda/lib/python3.9/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "class": algorithms.TripleDES,
2025-02-07 06:50:18,057 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-02-07 06:50:18,079 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-07 06:50:18,092 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-02-07 06:50:18,096 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
2025-02-07 06:50:18,096 sagemaker_pytorch_container.training INFO     Invoking user training script.
2025-02-07 06:50:19,298 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt
Collecting wandb (from -r requirements.txt (line 1))
Downloading wandb-0.19.6-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting webdataset==0.2.100 (from -r requirements.txt (line 2))
Downloading webdataset-0.2.100-py3-none-any.whl.metadata (12 kB)
Collecting braceexpand (from webdataset==0.2.100->-r requirements.txt (line 2))
Downloading braceexpand-0.1.7-py2.py3-none-any.whl.metadata (3.0 kB)
Requirement already satisfied: numpy in /opt/conda/lib/python3.9/site-packages (from webdataset==0.2.100->-r requirements.txt (line 2)) (1.23.5)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.9/site-packages (from webdataset==0.2.100->-r requirements.txt (line 2)) (6.0.1)
Requirement already satisfied: click!=8.0.0,>=7.1 in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (8.1.7)
Collecting docker-pycreds>=0.4.0 (from wandb->-r requirements.txt (line 1))
Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl.metadata (1.8 kB)
Collecting eval-type-backport (from wandb->-r requirements.txt (line 1))
Downloading eval_type_backport-0.2.2-py3-none-any.whl.metadata (2.2 kB)
Collecting gitpython!=3.1.29,>=1.0.0 (from wandb->-r requirements.txt (line 1))
Downloading GitPython-3.1.44-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: platformdirs in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (4.2.2)
Requirement already satisfied: protobuf!=4.21.0,!=5.28.0,<6,>=3.15.0 in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (3.20.3)
Requirement already satisfied: psutil>=5.0.0 in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (5.9.8)
Requirement already satisfied: pydantic<3,>=2.6 in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (2.8.2)
Requirement already satisfied: requests<3,>=2.0.0 in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (2.32.3)
Collecting sentry-sdk>=2.0.0 (from wandb->-r requirements.txt (line 1))
Downloading sentry_sdk-2.20.0-py2.py3-none-any.whl.metadata (10 kB)
Collecting setproctitle (from wandb->-r requirements.txt (line 1))
Downloading setproctitle-1.3.4-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (71.0.4)
Requirement already satisfied: typing-extensions<5,>=4.4 in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (4.12.2)
Requirement already satisfied: six>=1.4.0 in /opt/conda/lib/python3.9/site-packages (from docker-pycreds>=0.4.0->wandb->-r requirements.txt (line 1)) (1.16.0)
Collecting gitdb<5,>=4.0.1 (from gitpython!=3.1.29,>=1.0.0->wandb->-r requirements.txt (line 1))
Downloading gitdb-4.0.12-py3-none-any.whl.metadata (1.2 kB)
Requirement already satisfied: annotated-types>=0.4.0 in /opt/conda/lib/python3.9/site-packages (from pydantic<3,>=2.6->wandb->-r requirements.txt (line 1)) (0.7.0)
Requirement already satisfied: pydantic-core==2.20.1 in /opt/conda/lib/python3.9/site-packages (from pydantic<3,>=2.6->wandb->-r requirements.txt (line 1)) (2.20.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.9/site-packages (from requests<3,>=2.0.0->wandb->-r requirements.txt (line 1)) (2.1.1)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.9/site-packages (from requests<3,>=2.0.0->wandb->-r requirements.txt (line 1)) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.9/site-packages (from requests<3,>=2.0.0->wandb->-r requirements.txt (line 1)) (1.26.19)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.9/site-packages (from requests<3,>=2.0.0->wandb->-r requirements.txt (line 1)) (2024.7.4)
Collecting smmap<6,>=3.0.1 (from gitdb<5,>=4.0.1->gitpython!=3.1.29,>=1.0.0->wandb->-r requirements.txt (line 1))
Downloading smmap-5.0.2-py3-none-any.whl.metadata (4.3 kB)
Downloading webdataset-0.2.100-py3-none-any.whl (74 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74.8/74.8 kB 7.2 MB/s eta 0:00:00
Downloading wandb-0.19.6-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 64.5 MB/s eta 0:00:00
Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Downloading GitPython-3.1.44-py3-none-any.whl (207 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 207.6/207.6 kB 23.9 MB/s eta 0:00:00
Downloading sentry_sdk-2.20.0-py2.py3-none-any.whl (322 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 322.6/322.6 kB 29.1 MB/s eta 0:00:00
Downloading braceexpand-0.1.7-py2.py3-none-any.whl (5.9 kB)
Downloading eval_type_backport-0.2.2-py3-none-any.whl (5.8 kB)
Downloading setproctitle-1.3.4-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30 kB)
Downloading gitdb-4.0.12-py3-none-any.whl (62 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.8/62.8 kB 10.4 MB/s eta 0:00:00
Downloading smmap-5.0.2-py3-none-any.whl (24 kB)
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
/opt/conda/lib/python3.9/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "cipher": algorithms.TripleDES,
/opt/conda/lib/python3.9/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "class": algorithms.TripleDES,
2025-02-07 06:50:19,808 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-02-07 06:50:19,831 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-07 06:50:19,845 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-02-07 06:50:19,848 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
2025-02-07 06:50:19,848 sagemaker_pytorch_container.training INFO     Invoking user training script.
2025-02-07 06:50:21,040 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt
Collecting wandb (from -r requirements.txt (line 1))
Downloading wandb-0.19.6-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Installing collected packages: braceexpand, webdataset, smmap, setproctitle, sentry-sdk, eval-type-backport, docker-pycreds, gitdb, gitpython, wandb
Collecting webdataset==0.2.100 (from -r requirements.txt (line 2))
Downloading webdataset-0.2.100-py3-none-any.whl.metadata (12 kB)
Collecting braceexpand (from webdataset==0.2.100->-r requirements.txt (line 2))
Downloading braceexpand-0.1.7-py2.py3-none-any.whl.metadata (3.0 kB)
Requirement already satisfied: numpy in /opt/conda/lib/python3.9/site-packages (from webdataset==0.2.100->-r requirements.txt (line 2)) (1.23.5)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.9/site-packages (from webdataset==0.2.100->-r requirements.txt (line 2)) (6.0.1)
Requirement already satisfied: click!=8.0.0,>=7.1 in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (8.1.7)
Collecting docker-pycreds>=0.4.0 (from wandb->-r requirements.txt (line 1))
Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl.metadata (1.8 kB)
Collecting eval-type-backport (from wandb->-r requirements.txt (line 1))
Downloading eval_type_backport-0.2.2-py3-none-any.whl.metadata (2.2 kB)
Collecting gitpython!=3.1.29,>=1.0.0 (from wandb->-r requirements.txt (line 1))
Downloading GitPython-3.1.44-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: platformdirs in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (4.2.2)
Requirement already satisfied: protobuf!=4.21.0,!=5.28.0,<6,>=3.15.0 in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (3.20.3)
Requirement already satisfied: psutil>=5.0.0 in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (5.9.8)
Requirement already satisfied: pydantic<3,>=2.6 in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (2.8.2)
Requirement already satisfied: requests<3,>=2.0.0 in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (2.32.3)
Collecting sentry-sdk>=2.0.0 (from wandb->-r requirements.txt (line 1))
Downloading sentry_sdk-2.20.0-py2.py3-none-any.whl.metadata (10 kB)
Collecting setproctitle (from wandb->-r requirements.txt (line 1))
Downloading setproctitle-1.3.4-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (71.0.4)
Requirement already satisfied: typing-extensions<5,>=4.4 in /opt/conda/lib/python3.9/site-packages (from wandb->-r requirements.txt (line 1)) (4.12.2)
Requirement already satisfied: six>=1.4.0 in /opt/conda/lib/python3.9/site-packages (from docker-pycreds>=0.4.0->wandb->-r requirements.txt (line 1)) (1.16.0)
Collecting gitdb<5,>=4.0.1 (from gitpython!=3.1.29,>=1.0.0->wandb->-r requirements.txt (line 1))
Downloading gitdb-4.0.12-py3-none-any.whl.metadata (1.2 kB)
Requirement already satisfied: annotated-types>=0.4.0 in /opt/conda/lib/python3.9/site-packages (from pydantic<3,>=2.6->wandb->-r requirements.txt (line 1)) (0.7.0)
Requirement already satisfied: pydantic-core==2.20.1 in /opt/conda/lib/python3.9/site-packages (from pydantic<3,>=2.6->wandb->-r requirements.txt (line 1)) (2.20.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.9/site-packages (from requests<3,>=2.0.0->wandb->-r requirements.txt (line 1)) (2.1.1)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.9/site-packages (from requests<3,>=2.0.0->wandb->-r requirements.txt (line 1)) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.9/site-packages (from requests<3,>=2.0.0->wandb->-r requirements.txt (line 1)) (1.26.19)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.9/site-packages (from requests<3,>=2.0.0->wandb->-r requirements.txt (line 1)) (2024.7.4)
Collecting smmap<6,>=3.0.1 (from gitdb<5,>=4.0.1->gitpython!=3.1.29,>=1.0.0->wandb->-r requirements.txt (line 1))
Downloading smmap-5.0.2-py3-none-any.whl.metadata (4.3 kB)
Downloading webdataset-0.2.100-py3-none-any.whl (74 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74.8/74.8 kB 10.8 MB/s eta 0:00:00
Downloading wandb-0.19.6-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 87.7 MB/s eta 0:00:00
Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Downloading GitPython-3.1.44-py3-none-any.whl (207 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 207.6/207.6 kB 42.7 MB/s eta 0:00:00
Downloading sentry_sdk-2.20.0-py2.py3-none-any.whl (322 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 322.6/322.6 kB 47.7 MB/s eta 0:00:00
Downloading braceexpand-0.1.7-py2.py3-none-any.whl (5.9 kB)
Downloading eval_type_backport-0.2.2-py3-none-any.whl (5.8 kB)
Downloading setproctitle-1.3.4-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30 kB)
Downloading gitdb-4.0.12-py3-none-any.whl (62 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.8/62.8 kB 16.5 MB/s eta 0:00:00
Downloading smmap-5.0.2-py3-none-any.whl (24 kB)
Installing collected packages: braceexpand, webdataset, smmap, setproctitle, sentry-sdk, eval-type-backport, docker-pycreds, gitdb, gitpython, wandb
Successfully installed braceexpand-0.1.7 docker-pycreds-0.4.0 eval-type-backport-0.2.2 gitdb-4.0.12 gitpython-3.1.44 sentry-sdk-2.20.0 setproctitle-1.3.4 smmap-5.0.2 wandb-0.19.6 webdataset-0.2.100
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[notice] A new release of pip is available: 24.1.2 -> 25.0
[notice] To update, run: pip install --upgrade pip
2025-02-07 06:50:23,486 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2025-02-07 06:50:23,486 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 0 from exiting process.
2025-02-07 06:50:23,534 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-07 06:50:23,578 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-07 06:50:23,596 sagemaker-training-toolkit INFO     Starting MPI run as worker node.
2025-02-07 06:50:23,596 sagemaker-training-toolkit INFO     Waiting for MPI Master to create SSH daemon.
2025-02-07 06:50:23,598 sagemaker-training-toolkit INFO     Cannot connect to host algo-1
2025-02-07 06:50:23,598 sagemaker-training-toolkit INFO     Connection failed with exception: 
 [Errno None] Unable to connect to port 22 on 10.0.250.121.              Can be ignored for worker when master completes and exits.
2025-02-07 06:50:24,599 sagemaker-training-toolkit INFO     Cannot connect to host algo-1
2025-02-07 06:50:24,600 sagemaker-training-toolkit INFO     Connection failed with exception: 
 [Errno None] Unable to connect to port 22 on 10.0.250.121.              Can be ignored for worker when master completes and exits.
Successfully installed braceexpand-0.1.7 docker-pycreds-0.4.0 eval-type-backport-0.2.2 gitdb-4.0.12 gitpython-3.1.44 sentry-sdk-2.20.0 setproctitle-1.3.4 smmap-5.0.2 wandb-0.19.6 webdataset-0.2.100
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[notice] A new release of pip is available: 24.1.2 -> 25.0
[notice] To update, run: pip install --upgrade pip
2025-02-07 06:50:25,168 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2025-02-07 06:50:25,168 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 0 from exiting process.
2025-02-07 06:50:25,218 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-07 06:50:25,268 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-07 06:50:25,288 sagemaker-training-toolkit INFO     Starting MPI run as worker node.
2025-02-07 06:50:25,288 sagemaker-training-toolkit INFO     Creating SSH daemon.
2025-02-07 06:50:25,292 sagemaker-training-toolkit INFO     Waiting for MPI workers to establish their SSH connections
2025-02-07 06:50:25,293 sagemaker-training-toolkit INFO     Cannot connect to host algo-2 at port 22. Retrying...
2025-02-07 06:50:25,293 sagemaker-training-toolkit INFO     Connection closed
2025-02-07 06:50:25,610 paramiko.transport INFO     Connected (version 2.0, client OpenSSH_8.2p1)
2025-02-07 06:50:25,770 paramiko.transport INFO     Authentication (publickey) successful!
2025-02-07 06:50:25,771 sagemaker-training-toolkit INFO     Can connect to host algo-1
2025-02-07 06:50:25,771 sagemaker-training-toolkit INFO     MPI Master online, creating SSH daemon.
2025-02-07 06:50:25,771 sagemaker-training-toolkit INFO     Writing environment variables to /etc/environment for the MPI process.
2025-02-07 06:50:25,778 sagemaker-training-toolkit INFO     Waiting for MPI process to finish.
2025-02-07 06:50:26,304 paramiko.transport INFO     Connected (version 2.0, client OpenSSH_8.2p1)
2025-02-07 06:50:26,457 paramiko.transport INFO     Authentication (publickey) successful!
2025-02-07 06:50:26,457 sagemaker-training-toolkit INFO     Can connect to host algo-2 at port 22
2025-02-07 06:50:26,457 sagemaker-training-toolkit INFO     Connection closed
2025-02-07 06:50:26,457 sagemaker-training-toolkit INFO     Worker algo-2 available for communication
2025-02-07 06:50:26,458 sagemaker-training-toolkit INFO     Network interface name: eth0
2025-02-07 06:50:26,458 sagemaker-training-toolkit INFO     Host: ['algo-1', 'algo-2']
2025-02-07 06:50:26,489 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-07 06:50:26,509 sagemaker-training-toolkit INFO     instance type: ml.g4dn.xlarge
2025-02-07 06:50:26,509 sagemaker-training-toolkit INFO     Env Hosts: ['algo-1', 'algo-2'] Hosts: ['algo-1:1', 'algo-2:1'] process_per_hosts: 1 num_processes: 2
2025-02-07 06:50:26,537 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-02-07 06:50:26,557 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {
        "sagemaker_distributed_dataparallel_custom_mpi_options": "",
        "sagemaker_distributed_dataparallel_enabled": true,
        "sagemaker_instance_type": "ml.g4dn.xlarge"
    },
    "channel_input_dirs": {},
    "current_host": "algo-1",
    "current_instance_group": "homogeneousCluster",
    "current_instance_group_hosts": [
        "algo-1",
        "algo-2"
    ],
    "current_instance_type": "ml.g4dn.xlarge",
    "distribution_hosts": [
        "algo-1",
        "algo-2"
    ],
    "distribution_instance_groups": [
        "homogeneousCluster"
    ],
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1",
        "algo-2"
    ],
    "hyperparameters": {
        "batch-size": 256,
        "class-weights-dict": {
            "1": 1.7004885993485341,
            "2": 0.9083079599826012,
            "3": 0.7832708177044261,
            "4": 0.8799831436999579,
            "5": 1.1137066666666666
        },
        "debug": false,
        "early-stopping-patience": 5,
        "epochs": 2,
        "lr-sched-gamma": 0.5,
        "lr-sched-step-size": 5,
        "model-type": "resnet50",
        "opt-learning-rate": 8e-05,
        "opt-weight-decay": 1e-05,
        "test-data-path": "s3://p5-amazon-bin-images/webdataset/test/test-shard-{000000..000000}.tar",
        "test-data-size": 1000,
        "train-data-path": "s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000001}.tar",
        "train-data-size": 2000,
        "val-data-path": "s3://p5-amazon-bin-images/webdataset/val/val-shard-{000000..000000}.tar",
        "val-data-size": 1000,
        "wandb": true
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {},
    "input_dir": "/opt/ml/input",
    "instance_groups": [
        "homogeneousCluster"
    ],
    "instance_groups_dict": {
        "homogeneousCluster": {
            "instance_group_name": "homogeneousCluster",
            "instance_type": "ml.g4dn.xlarge",
            "hosts": [
                "algo-1",
                "algo-2"
            ]
        }
    },
    "is_hetero": false,
    "is_master": true,
    "is_modelparallel_enabled": null,
    "is_smddpmprun_installed": true,
    "is_smddprun_installed": true,
    "job_name": "p5-amazon-bin-job-20250207-004549",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://p5-amazon-bin-images-train/p5-amazon-bin-job-20250207-004549/source/sourcedir.tar.gz",
    "module_name": "train",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 1,
    "num_neurons": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.g4dn.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1",
            "algo-2"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.g4dn.xlarge",
                "hosts": [
                    "algo-1",
                    "algo-2"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1","algo-2"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"batch-size":256,"class-weights-dict":{"1":1.7004885993485341,"2":0.9083079599826012,"3":0.7832708177044261,"4":0.8799831436999579,"5":1.1137066666666666},"debug":false,"early-stopping-patience":5,"epochs":2,"lr-sched-gamma":0.5,"lr-sched-step-size":5,"model-type":"resnet50","opt-learning-rate":8e-05,"opt-weight-decay":1e-05,"test-data-path":"s3://p5-amazon-bin-images/webdataset/test/test-shard-{000000..000000}.tar","test-data-size":1000,"train-data-path":"s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000001}.tar","train-data-size":2000,"val-data-path":"s3://p5-amazon-bin-images/webdataset/val/val-shard-{000000..000000}.tar","val-data-size":1000,"wandb":true}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={"sagemaker_distributed_dataparallel_custom_mpi_options":"","sagemaker_distributed_dataparallel_enabled":true,"sagemaker_instance_type":"ml.g4dn.xlarge"}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1","algo-2"],"instance_groups":[{"hosts":["algo-1","algo-2"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_CURRENT_INSTANCE_TYPE=ml.g4dn.xlarge
SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1","algo-2"]
SM_INSTANCE_GROUPS=["homogeneousCluster"]
SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1","algo-2"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}}
SM_DISTRIBUTION_INSTANCE_GROUPS=["homogeneousCluster"]
SM_IS_HETERO=false
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=1
SM_NUM_NEURONS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://p5-amazon-bin-images-train/p5-amazon-bin-job-20250207-004549/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{"sagemaker_distributed_dataparallel_custom_mpi_options":"","sagemaker_distributed_dataparallel_enabled":true,"sagemaker_instance_type":"ml.g4dn.xlarge"},"channel_input_dirs":{},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1","algo-2"],"current_instance_type":"ml.g4dn.xlarge","distribution_hosts":["algo-1","algo-2"],"distribution_instance_groups":["homogeneousCluster"],"framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1","algo-2"],"hyperparameters":{"batch-size":256,"class-weights-dict":{"1":1.7004885993485341,"2":0.9083079599826012,"3":0.7832708177044261,"4":0.8799831436999579,"5":1.1137066666666666},"debug":false,"early-stopping-patience":5,"epochs":2,"lr-sched-gamma":0.5,"lr-sched-step-size":5,"model-type":"resnet50","opt-learning-rate":8e-05,"opt-weight-decay":1e-05,"test-data-path":"s3://p5-amazon-bin-images/webdataset/test/test-shard-{000000..000000}.tar","test-data-size":1000,"train-data-path":"s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000001}.tar","train-data-size":2000,"val-data-path":"s3://p5-amazon-bin-images/webdataset/val/val-shard-{000000..000000}.tar","val-data-size":1000,"wandb":true},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1","algo-2"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":true,"is_smddprun_installed":true,"job_name":"p5-amazon-bin-job-20250207-004549","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://p5-amazon-bin-images-train/p5-amazon-bin-job-20250207-004549/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":1,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1","algo-2"],"instance_groups":[{"hosts":["algo-1","algo-2"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--batch-size","256","--class-weights-dict","1=1.7004885993485341,2=0.9083079599826012,3=0.7832708177044261,4=0.8799831436999579,5=1.1137066666666666","--debug","False","--early-stopping-patience","5","--epochs","2","--lr-sched-gamma","0.5","--lr-sched-step-size","5","--model-type","resnet50","--opt-learning-rate","8e-05","--opt-weight-decay","1e-05","--test-data-path","s3://p5-amazon-bin-images/webdataset/test/test-shard-{000000..000000}.tar","--test-data-size","1000","--train-data-path","s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000001}.tar","--train-data-size","2000","--val-data-path","s3://p5-amazon-bin-images/webdataset/val/val-shard-{000000..000000}.tar","--val-data-size","1000","--wandb","True"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_HP_BATCH-SIZE=256
SM_HP_CLASS-WEIGHTS-DICT={"1":1.7004885993485341,"2":0.9083079599826012,"3":0.7832708177044261,"4":0.8799831436999579,"5":1.1137066666666666}
SM_HP_DEBUG=false
SM_HP_EARLY-STOPPING-PATIENCE=5
SM_HP_EPOCHS=2
SM_HP_LR-SCHED-GAMMA=0.5
SM_HP_LR-SCHED-STEP-SIZE=5
SM_HP_MODEL-TYPE=resnet50
SM_HP_OPT-LEARNING-RATE=8e-05
SM_HP_OPT-WEIGHT-DECAY=1e-05
SM_HP_TEST-DATA-PATH=s3://p5-amazon-bin-images/webdataset/test/test-shard-{000000..000000}.tar
SM_HP_TEST-DATA-SIZE=1000
SM_HP_TRAIN-DATA-PATH=s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000001}.tar
SM_HP_TRAIN-DATA-SIZE=2000
SM_HP_VAL-DATA-PATH=s3://p5-amazon-bin-images/webdataset/val/val-shard-{000000..000000}.tar
SM_HP_VAL-DATA-SIZE=1000
SM_HP_WANDB=true
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python39.zip:/opt/conda/lib/python3.9:/opt/conda/lib/python3.9/lib-dynload:/opt/conda/lib/python3.9/site-packages
Invoking script with the following command:
mpirun --host algo-1:1,algo-2:1 -np 2 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 2 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_HOMOGENEOUS=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.9/site-packages/gethostname.cpython-39-x86_64-linux-gnu.so -x SMDATAPARALLEL_SERVER_ADDR=algo-1 -x SMDATAPARALLEL_SERVER_PORT=7592 -x SAGEMAKER_INSTANCE_TYPE=ml.g4dn.xlarge smddprun /opt/conda/bin/python3.9 -m mpi4py train.py --batch-size 256 --class-weights-dict 1=1.7004885993485341,2=0.9083079599826012,3=0.7832708177044261,4=0.8799831436999579,5=1.1137066666666666 --debug False --early-stopping-patience 5 --epochs 2 --lr-sched-gamma 0.5 --lr-sched-step-size 5 --model-type resnet50 --opt-learning-rate 8e-05 --opt-weight-decay 1e-05 --test-data-path s3://p5-amazon-bin-images/webdataset/test/test-shard-{000000..000000}.tar --test-data-size 1000 --train-data-path s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000001}.tar --train-data-size 2000 --val-data-path s3://p5-amazon-bin-images/webdataset/val/val-shard-{000000..000000}.tar --val-data-size 1000 --wandb True
Warning: Permanently added 'algo-2,10.0.207.214' (ECDSA) to the list of known hosts.
[1,mpirank:0,algo-1]<stderr>:curl: /opt/conda/lib/libcurl.so.4: no version information available (required by curl)
[1,mpirank:1,algo-2]<stderr>:curl: /opt/conda/lib/libcurl.so.4: no version information available (required by curl)
[1,mpirank:0,algo-1]<stderr>:curl: /opt/conda/lib/libcurl.so.4: no version information available (required by curl)
[1,mpirank:1,algo-2]<stderr>:curl: /opt/conda/lib/libcurl.so.4: no version information available (required by curl)
[1,mpirank:0,algo-1]<stderr>:curl: /opt/conda/lib/libcurl.so.4: no version information available (required by curl)
[1,mpirank:1,algo-2]<stderr>:curl: /opt/conda/lib/libcurl.so.4: no version information available (required by curl)
2025-02-07 06:50:27,782 sagemaker-training-toolkit INFO     Process[es]: [psutil.Process(pid=86, name='orted', status='sleeping', started='06:50:26')]
2025-02-07 06:50:27,782 sagemaker-training-toolkit INFO     Orted process found [psutil.Process(pid=86, name='orted', status='sleeping', started='06:50:26')]
2025-02-07 06:50:27,782 sagemaker-training-toolkit INFO     Waiting for orted process [psutil.Process(pid=86, name='orted', status='sleeping', started='06:50:26')]
[1,mpirank:0,algo-1]<stdout>:DDP Mode
[1,mpirank:0,algo-1]<stderr>:train.py:41: DeprecationWarning: smdistributed.dataparallel.torch.dist is deprecated in the SageMaker distributed data parallel library v1.4.0+.Please use torch.distributed and specify 'smddp' as a backend when initializing process group as follows:torch.distributed.init_process_group(backend='smddp')For more information, see the library's API documentation at https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html
[1,mpirank:0,algo-1]<stderr>:  import smdistributed.dataparallel.torch.distributed as dist
[1,mpirank:1,algo-2]<stdout>:DDP Mode
[1,mpirank:1,algo-2]<stderr>:train.py:41: DeprecationWarning: smdistributed.dataparallel.torch.dist is deprecated in the SageMaker distributed data parallel library v1.4.0+.Please use torch.distributed and specify 'smddp' as a backend when initializing process group as follows:torch.distributed.init_process_group(backend='smddp')For more information, see the library's API documentation at https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html
[1,mpirank:1,algo-2]<stderr>:  import smdistributed.dataparallel.torch.distributed as dist
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Bootstrap : Using eth0:10.0.250.121<0>
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO cudaDriverVersion 12040
[1,mpirank:0,algo-1]<stdout>:NCCL version 2.14.3+cuda11.7
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO cudaDriverVersion 12040
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] find_ofi_provider:608 NCCL WARN NET/OFI Couldn't find any optimal provider
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] ofi_init:1355 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.250.121<0>
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Using network Socket
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Bootstrap : Using eth0:10.0.207.214<0>
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,mpirank:1,algo-2]<stdout>:
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] find_ofi_provider:608 NCCL WARN NET/OFI Couldn't find any optimal provider
[1,mpirank:1,algo-2]<stdout>:
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] ofi_init:1355 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.207.214<0>
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Using network Socket
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 00/02 :    0   1
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 01/02 :    0   1
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
[1,mpirank:1,algo-2]<stdout>:algo-2:105:125 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 00/0 : 0[1e0] -> 1[1e0] [receive] via NET/Socket/0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:128 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 00/0 : 1[1e0] -> 0[1e0] [receive] via NET/Socket/0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:125 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 01/0 : 0[1e0] -> 1[1e0] [receive] via NET/Socket/0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:128 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 01/0 : 1[1e0] -> 0[1e0] [receive] via NET/Socket/0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 00/0 : 1[1e0] -> 0[1e0] [send] via NET/Socket/0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 00/0 : 0[1e0] -> 1[1e0] [send] via NET/Socket/0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 01/0 : 1[1e0] -> 0[1e0] [send] via NET/Socket/0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 01/0 : 0[1e0] -> 1[1e0] [send] via NET/Socket/0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Connected all rings
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Connected all trees
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Connected all rings
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Connected all trees
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO comm 0x559c43d15700 rank 0 nranks 2 cudaDev 0 busId 1e0 - Init COMPLETE
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Using network Socket
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO comm 0x55f64143c800 rank 1 nranks 2 cudaDev 0 busId 1e0 - Init COMPLETE
[1,mpirank:1,algo-2]<stdout>:NCCL version 2.14.3+cuda11.7
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Using network Socket
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 00/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 01/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 02/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 03/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 04/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 05/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 06/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 07/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 08/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 09/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 10/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 11/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 12/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 13/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 14/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 15/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 16/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 17/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 18/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 19/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 20/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 21/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 22/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 23/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 24/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 25/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 26/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 27/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 28/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 29/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 30/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Channel 31/32 :    0
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Connected all rings
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO Connected all trees
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 00/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 01/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 02/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 03/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 04/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 05/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 06/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 07/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 08/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 09/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 10/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 11/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 12/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 13/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 14/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 15/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 16/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 17/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 18/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 19/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 20/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 21/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 22/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 23/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 24/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 25/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 26/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 27/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 28/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 29/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 30/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Channel 31/32 :    0
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Connected all rings
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO Connected all trees
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
[1,mpirank:0,algo-1]<stdout>:algo-1:106:106 [0] NCCL INFO comm 0x559c43dab2c0 rank 0 nranks 1 cudaDev 0 busId 1e0 - Init COMPLETE
[1,mpirank:0,algo-1]<stdout>:Running smdistributed.dataparallel v1.7.0[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:SMDDP: Multi node ENA mode
[1,mpirank:1,algo-2]<stdout>:algo-2:105:105 [0] NCCL INFO comm 0x55f641618790 rank 0 nranks 1 cudaDev 0 busId 1e0 - Init COMPLETE
[1,mpirank:1,algo-2]<stdout>:🟢 SageMkaer DDP is initialized.
[1,mpirank:1,algo-2]<stdout>:👉 Total GPU count: 2
[1,mpirank:1,algo-2]<stdout>:👉 Rank: 1, Local Rank: 0
[1,mpirank:1,algo-2]<stdout>:👉 Device: cuda, Rank: 1, Local rank: 0
[1,mpirank:0,algo-1]<stdout>:🟢 SageMkaer DDP is initialized.
[1,mpirank:0,algo-1]<stdout>:👉 Total GPU count: 2
[1,mpirank:0,algo-1]<stdout>:👉 Rank: 0, Local Rank: 0
[1,mpirank:0,algo-1]<stdout>:👉 task.config:
[1,mpirank:0,algo-1]<stdout>:{'batch_size': 256,
 'class_weights_dict': {1: 1.7004885993485341,
                        2: 0.9083079599826012,
                        3: 0.7832708177044261,
                        4: 0.8799831436999579,
                        5: 1.1137066666666666},
 'debug': False,
 'early_stopping_patience': 5,
 'epochs': 2,
 'lr_sched_gamma': 0.5,
 'lr_sched_step_size': 5,
 'model_arch': 'resnet34',
 'model_dir': '/opt/ml/model',
 'opt_learning_rate': 8e-05,
 'opt_weight_decay': 1e-05,
 'output_data_dir': '/opt/ml/output/data',
 'test_data_path': 's3://p5-amazon-bin-images/webdataset/test/test-shard-{000000..000000}.tar',
 'test_data_size': 1000,
 'test_shards': 1,
 'train_data_path': 's3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000001}.tar',
 'train_data_size': 2000,
 'train_shards': 2,
 'use_cuda': True,
 'val_data_path': 's3://p5-amazon-bin-images/webdataset/val/val-shard-{000000..000000}.tar',
 'val_data_size': 1000,
 'val_shards': 1,
 'wandb': True}
[1,mpirank:0,algo-1]<stderr>:wandb: Currently logged in as: nov05 to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[1,mpirank:0,algo-1]<stderr>:wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[1,mpirank:0,algo-1]<stderr>:wandb: Tracking run with wandb version 0.19.6
[1,mpirank:0,algo-1]<stderr>:wandb: Run data is saved locally in /opt/ml/code/wandb/run-20250207_065033-p5-amazon-bin-job-20250207-004549-pvvj7m-algo-1
[1,mpirank:0,algo-1]<stderr>:wandb: Run `wandb offline` to turn off syncing.
[1,mpirank:0,algo-1]<stderr>:wandb: Syncing run p5-amazon-bin-job-20250207-004549-pvvj7m-algo-1
[1,mpirank:0,algo-1]<stderr>:wandb: ⭐️ View project at https://wandb.ai/nov05/udacity-awsmle-resnet34-amazon-bin
[1,mpirank:0,algo-1]<stderr>:wandb: 🚀 View run at https://wandb.ai/nov05/udacity-awsmle-resnet34-amazon-bin/runs/p5-amazon-bin-job-20250207-004549-pvvj7m-algo-1
[1,mpirank:0,algo-1]<stdout>:👉 Device: cuda, Rank: 0, Local rank: 0
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:👉 Rank 0: Model resnet34 has been created successfully.
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:1,algo-2]<stdout>:👉 Rank 1: Model resnet34 has been created successfully.
[1,mpirank:0,algo-1]<stdout>:👉 Train Epoch: 0, Learning Rate: 0.00016
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:[2025-02-07 06:50:34.488 algo-1:106 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[1,mpirank:1,algo-2]<stdout>:[2025-02-07 06:50:34.488 algo-2:105 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[1,mpirank:0,algo-1]<stdout>:[2025-02-07 06:50:34.844 algo-1:106 INFO profiler_config_parser.py:111] User has disabled profiler.
[1,mpirank:0,algo-1]<stdout>:[2025-02-07 06:50:34.845 algo-1:106 INFO json_config.py:92] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[1,mpirank:0,algo-1]<stdout>:[2025-02-07 06:50:34.846 algo-1:106 INFO hook.py:206] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[1,mpirank:0,algo-1]<stdout>:[2025-02-07 06:50:34.847 algo-1:106 INFO hook.py:259] Saving to /opt/ml/output/tensors
[1,mpirank:0,algo-1]<stdout>:[2025-02-07 06:50:34.847 algo-1:106 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[1,mpirank:1,algo-2]<stdout>:[2025-02-07 06:50:34.869 algo-2:105 INFO profiler_config_parser.py:111] User has disabled profiler.
[1,mpirank:1,algo-2]<stdout>:[2025-02-07 06:50:34.869 algo-2:105 INFO json_config.py:92] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[1,mpirank:1,algo-2]<stdout>:[2025-02-07 06:50:34.870 algo-2:105 INFO hook.py:206] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[1,mpirank:1,algo-2]<stdout>:[2025-02-07 06:50:34.870 algo-2:105 INFO hook.py:259] Saving to /opt/ml/output/tensors
[1,mpirank:1,algo-2]<stdout>:[2025-02-07 06:50:34.871 algo-2:105 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[1,mpirank:1,algo-2]<stdout>:🔹 Rank 1, Train Epoch: 0 [256.0/2000 (14%)], Loss: 2.673072
[1,mpirank:0,algo-1]<stdout>:🔹 Rank 0, Train Epoch: 0 [256.0/2000 (14%)], Loss: 2.652567
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stderr>:INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
[1,mpirank:1,algo-2]<stderr>:INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
[1,mpirank:0,algo-1]<stdout>:🔹 Rank 0, Train Epoch: 0 [5376.0/2000 (157%)], Loss: 1.539597
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:1,algo-2]<stdout>:🔹 Rank 1, Train Epoch: 0 [5376.0/2000 (157%)], Loss: 1.496449
[1,mpirank:0,algo-1]<stderr>:download failed: s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar to - [Errno 32] Broken pipe
[1,mpirank:0,algo-1]<stderr>:download failed: s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar to - [Errno 32] Broken pipe
[1,mpirank:0,algo-1]<stderr>:Exception ignored in:
[1,mpirank:0,algo-1]<stderr>:<function Pipe.__del__ at 0x7fc7b34f8670>
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
[1,mpirank:0,algo-1]<stderr>:    [1,mpirank:0,algo-1]<stderr>:self.close()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
[1,mpirank:0,algo-1]<stderr>:    raise IOError(f"{self.args}: exit {self.status} (read) {info}")
[1,mpirank:0,algo-1]<stderr>:OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar -',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {}
[1,mpirank:0,algo-1]<stderr>:Exception ignored in:
[1,mpirank:0,algo-1]<stderr>:<function Pipe.__del__ at 0x7fc7b34f8670>
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
[1,mpirank:0,algo-1]<stderr>:    self.close()[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:raise IOError(f"{self.args}: exit {self.status} (read) {info}")
[1,mpirank:0,algo-1]<stderr>:OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar -',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {}
[1,mpirank:0,algo-1]<stderr>:download failed: s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar to - [Errno 32] Broken pipe
[1,mpirank:0,algo-1]<stderr>:Exception ignored in: <function Pipe.__del__ at 0x7fc7b34f8670>
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
[1,mpirank:0,algo-1]<stderr>:    self.close()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
[1,mpirank:0,algo-1]<stderr>:    raise IOError(f"{self.args}: exit {self.status} (read) {info}")
[1,mpirank:0,algo-1]<stderr>:OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar -',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {}
[1,mpirank:0,algo-1]<stderr>:download failed: s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar to - [Errno 32] Broken pipe
[1,mpirank:0,algo-1]<stderr>:Exception ignored in: [1,mpirank:0,algo-1]<stderr>:<function Pipe.__del__ at 0x7fc7b34f8670>
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
[1,mpirank:0,algo-1]<stderr>:    self.close()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
[1,mpirank:0,algo-1]<stderr>:    raise IOError(f"{self.args}: exit {self.status} (read) {info}")[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:OSError: [1,mpirank:0,algo-1]<stderr>:(('aws s3 cp s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar -',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {}
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:👉 VAL: Average loss: 1.6597, Accuracy: 484.0/1792.0 (27.01%)
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:👉 Train Epoch: 1, Learning Rate: 0.00016[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:1,algo-2]<stdout>:🔹 Rank 1, Train Epoch: 1 [256.0/2000 (14%)], Loss: 1.595960
[1,mpirank:0,algo-1]<stdout>:🔹 Rank 0, Train Epoch: 1 [256.0/2000 (14%)], Loss: 1.611980
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:1,algo-2]<stdout>:🔹 Rank 1, Train Epoch: 1 [5376.0/2000 (157%)], Loss: 1.581787
[1,mpirank:0,algo-1]<stdout>:🔹 Rank 0, Train Epoch: 1 [5376.0/2000 (157%)], Loss: 1.498845
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stderr>:download failed: s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar to - [Errno 32] Broken pipe
[1,mpirank:0,algo-1]<stderr>:download failed: s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar to - [Errno 32] Broken pipe
[1,mpirank:0,algo-1]<stderr>:download failed: s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar to - [Errno 32] Broken pipe
[1,mpirank:0,algo-1]<stderr>:Exception ignored in: <function Pipe.__del__ at 0x7fc7b34f8670>
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
[1,mpirank:0,algo-1]<stderr>:    self.close()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
[1,mpirank:0,algo-1]<stderr>:    raise IOError(f"{self.args}: exit {self.status} (read) {info}")
[1,mpirank:0,algo-1]<stderr>:OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar -',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {}
[1,mpirank:0,algo-1]<stderr>:Exception ignored in: <function Pipe.__del__ at 0x7fc7b34f8670>
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
[1,mpirank:0,algo-1]<stderr>:    self.close()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
[1,mpirank:0,algo-1]<stderr>:    raise IOError(f"{self.args}: exit {self.status} (read) {info}")
[1,mpirank:0,algo-1]<stderr>:OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar -',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {}
[1,mpirank:0,algo-1]<stderr>:Exception ignored in: [1,mpirank:0,algo-1]<stderr>:<function Pipe.__del__ at 0x7fc7b34f8670>[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
[1,mpirank:0,algo-1]<stderr>:    [1,mpirank:0,algo-1]<stderr>:self.close()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
[1,mpirank:0,algo-1]<stderr>:    [1,mpirank:0,algo-1]<stderr>:raise IOError(f"{self.args}: exit {self.status} (read) {info}")
[1,mpirank:0,algo-1]<stderr>:OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar -',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {}[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:download failed: s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar to - [Errno 32] Broken pipe
[1,mpirank:0,algo-1]<stderr>:Exception ignored in: [1,mpirank:0,algo-1]<stderr>:<function Pipe.__del__ at 0x7fc7b34f8670>
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
[1,mpirank:0,algo-1]<stderr>:    self.close()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:raise IOError(f"{self.args}: exit {self.status} (read) {info}")
[1,mpirank:0,algo-1]<stderr>:OSError: [1,mpirank:0,algo-1]<stderr>:(('aws s3 cp s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar -',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {}
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:👉 VAL: Average loss: 1.5300, Accuracy: 422.0/1792.0 (23.55%)
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:🟢 Start testing...
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stderr>:download failed: s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar to - [Errno 32] Broken pipe
[1,mpirank:0,algo-1]<stderr>:download failed: s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar to - [Errno 32] Broken pipe
[1,mpirank:0,algo-1]<stderr>:Exception ignored in:
[1,mpirank:0,algo-1]<stderr>:<function Pipe.__del__ at 0x7fc7b34f8670>
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
[1,mpirank:0,algo-1]<stderr>:    self.close()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
[1,mpirank:0,algo-1]<stderr>:    raise IOError(f"{self.args}: exit {self.status} (read) {info}")
[1,mpirank:0,algo-1]<stderr>:OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar -',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {}
[1,mpirank:0,algo-1]<stderr>:Exception ignored in: <function Pipe.__del__ at 0x7fc7b34f8670>
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
[1,mpirank:0,algo-1]<stderr>:    self.close()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
[1,mpirank:0,algo-1]<stderr>:    raise IOError(f"{self.args}: exit {self.status} (read) {info}")
[1,mpirank:0,algo-1]<stderr>:OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar -',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {}
[1,mpirank:0,algo-1]<stderr>:download failed: s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar to - [Errno 32] Broken pipe
[1,mpirank:0,algo-1]<stderr>:download failed: s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar to - [Errno 32] Broken pipe
[1,mpirank:0,algo-1]<stderr>:Exception ignored in: <function Pipe.__del__ at 0x7fc7b34f8670>
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
[1,mpirank:0,algo-1]<stderr>:    self.close()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
[1,mpirank:0,algo-1]<stderr>:    raise IOError(f"{self.args}: exit {self.status} (read) {info}")
[1,mpirank:0,algo-1]<stderr>:OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar -',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {}
[1,mpirank:0,algo-1]<stderr>:Exception ignored in: [1,mpirank:0,algo-1]<stderr>:<function Pipe.__del__ at 0x7fc7b34f8670>[1,mpirank:0,algo-1]<stderr>:
[1,mpirank:0,algo-1]<stderr>:Traceback (most recent call last):
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
[1,mpirank:0,algo-1]<stderr>:    self.close()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    self.wait_for_child()
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
[1,mpirank:0,algo-1]<stderr>:    raise IOError(f"{self.args}: exit {self.status} (read) {info}")
[1,mpirank:0,algo-1]<stderr>:OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/test/test-shard-000000.tar -',), {'shell': True, 'bufsize': 8192}): exit 1 (read) {}
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:👉 TEST: Average loss: 1.5300, Accuracy: 422.0/1792.0 (23.55%)
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:👉 Model saved at '/opt/ml/model/model.pth'
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stderr>:wandb:
[1,mpirank:0,algo-1]<stderr>:wandb: 
[1,mpirank:0,algo-1]<stderr>:wandb: Run history:
[1,mpirank:0,algo-1]<stderr>:wandb: Rank 0, train_loss ▅█▄▂▃▂▁▂▂▂▁▁▁▂▁▂▁▁▁▁▂▁▁▁▁▁▁▁▁▁
[1,mpirank:0,algo-1]<stderr>:wandb: val_accuracy_epoch █▁
[1,mpirank:0,algo-1]<stderr>:wandb:     val_loss_epoch █▁
[1,mpirank:0,algo-1]<stderr>:wandb: 
[1,mpirank:0,algo-1]<stderr>:wandb: Run summary:
[1,mpirank:0,algo-1]<stderr>:wandb: Rank 0, train_loss 1.46435
[1,mpirank:0,algo-1]<stderr>:wandb: val_accuracy_epoch 23.54911
[1,mpirank:0,algo-1]<stderr>:wandb:     val_loss_epoch 1.52998
[1,mpirank:0,algo-1]<stderr>:wandb:
[1,mpirank:0,algo-1]<stderr>:wandb: 🚀 View run p5-amazon-bin-job-20250207-004549-pvvj7m-algo-1 at: https://wandb.ai/nov05/udacity-awsmle-resnet34-amazon-bin/runs/p5-amazon-bin-job-20250207-004549-pvvj7m-algo-1
[1,mpirank:0,algo-1]<stderr>:wandb: ⭐️ View project at: https://wandb.ai/nov05/udacity-awsmle-resnet34-amazon-bin
[1,mpirank:0,algo-1]<stderr>:wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
[1,mpirank:0,algo-1]<stderr>:wandb: Find logs at: ./wandb/run-20250207_065033-p5-amazon-bin-job-20250207-004549-pvvj7m-algo-1/logs
2025-02-07 06:52:52,225 sagemaker-training-toolkit INFO     Invoked on_terminate from psutil.wait_for_procs
2025-02-07 06:52:52,225 sagemaker-training-toolkit INFO     process psutil.Process(pid=86, name='orted', status='terminated', started='06:50:26') terminated with exit code None
2025-02-07 06:52:52,226 sagemaker-training-toolkit INFO     Reporting status for ORTEd process. gone: [psutil.Process(pid=86, name='orted', status='terminated', started='06:50:26')] alive: []
2025-02-07 06:52:52,226 sagemaker-training-toolkit INFO     Orted process exited
2025-02-07 06:52:52,206 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2025-02-07 06:52:52,207 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 0 from exiting process.
2025-02-07 06:52:52,207 sagemaker-training-toolkit INFO     Begin writing status file from leader node to worker nodes
2025-02-07 06:52:52,207 sagemaker-training-toolkit INFO     Start writing mpirun finished status to algo-2
2025-02-07 06:52:52,418 sagemaker-training-toolkit INFO     output from subprocess run CompletedProcess(args=['ssh', 'algo-2', 'touch', '/tmp/done.algo-1'], returncode=0, stdout='', stderr='')
2025-02-07 06:52:52,418 sagemaker-training-toolkit INFO     Finished writing status file
2025-02-07 06:53:22,254 sagemaker-training-toolkit INFO     Begin looking for status file on algo-2
2025-02-07 06:53:22,254 sagemaker-training-toolkit INFO     MPI training job status file found. Exit gracefully
2025-02-07 06:53:22,254 sagemaker-training-toolkit INFO     End looking for status file
2025-02-07 06:53:22,254 sagemaker-training-toolkit INFO     MPI process finished.
2025-02-07 06:53:22,255 sagemaker-training-toolkit INFO     Reporting training SUCCESS
2025-02-07 06:53:22,447 sagemaker-training-toolkit INFO     Finished writing status file from leader node to worker nodes
2025-02-07 06:53:22,447 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2025-02-07 06:53:45 Uploading - Uploading generated training model
2025-02-07 06:53:45 Completed - Training job completed
Training seconds: 844
Billable seconds: 844
CPU times: total: 13.6 s
Wall time: 8min 12s

nov05 commented Feb 7, 2025

⚠️🟢 Issue: out of memory

✅ Solution: reduce the batch size from 256 to 128.

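For reference, a minimal sketch of where the batch size is set, assuming the job is launched with the SageMaker Python SDK's PyTorch estimator (the role, framework_version, and py_version values are placeholders; the instance type and SMDDP distribution setting are taken from the job log above):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role=role,                      ## hypothetical: your SageMaker execution role
    framework_version="1.13",       ## assumption: must match the training container actually used
    py_version="py39",
    instance_type="ml.g4dn.xlarge",
    instance_count=2,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    hyperparameters={
        "batch-size": 128,          ## reduced from 256 to fit in memory
        ## ...other hyperparameters unchanged
    },
)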
[1,mpirank:0,algo-1]<stdout>:🔹 Rank 0, Train Epoch: 0 [2816.0/7308 (77%)], Loss: 1.607739
[1,mpirank:1,algo-2]<stdout>:🔹 Rank 1, Train Epoch: 0 [2816.0/7308 (77%)], Loss: 1.592015
[1,mpirank:0,algo-1]<stdout>:
2025-02-07 09:44:00 Uploading - Uploading generated training model
[1,mpirank:0,algo-1]<stderr>:/opt/conda/bin/runwithenvvars: line 55:   106 Killed                  $@
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
2025-02-07 09:43:54,459 sagemaker-training-toolkit INFO     Invoked on_terminate from psutil.wait_for_procs
2025-02-07 09:43:54,460 sagemaker-training-toolkit INFO     process psutil.Process(pid=86, name='orted', status='terminated', started='09:42:45') terminated with exit code None
2025-02-07 09:43:54,460 sagemaker-training-toolkit INFO     Reporting status for ORTEd process. gone: [psutil.Process(pid=86, name='orted', status='terminated', started='09:42:45')] alive: []
2025-02-07 09:43:54,460 sagemaker-training-toolkit INFO     Orted process exited
2025-02-07 09:44:12 Failed - Training job failed
UnexpectedStatusException: Error for Training job p5-amazon-bin-job-20250207-033810: Failed. Reason: ClientError: 
Please use an instance type with more memory, or reduce the size of job data processed on an instance.. Check 
troubleshooting guide for common errors: 
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html

nov05 commented Feb 7, 2025

⚠️🟢 Issue: early stopping seems to trigger an AllGather error and eventually a timeout.

✅ Solution: use dist.broadcast() and dist.barrier() so that all ranks agree on the early-stopping decision.

https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html

SMDDP AllGather collective operation
AllGather is a collective operation where each worker starts with an input buffer, and then concatenates or gathers the input buffers from all other workers into an output buffer.
AllGather is heavily used in distributed training techniques such as sharded data parallelism where each individual worker holds a fraction of a model, or a sharded layer. The workers call AllGather before forward and backward passes to reconstruct the sharded layers. The forward and backward passes continue onward after the parameters are all gathered. During the backward pass, each worker also calls ReduceScatter to collect (reduce) gradients and break (scatter) them into gradient shards to update the corresponding sharded layer.
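For intuition, a minimal torch.distributed sketch of what an AllGather does (generic PyTorch API, not the SMDDP internals; assumes the process group is already initialized and device is this rank's GPU):

import torch
import torch.distributed as dist

world_size = dist.get_world_size()
## Each rank contributes its local tensor...
local_shard = torch.full((4,), float(dist.get_rank()), device=device)
## ...and receives every rank's tensor into the output list.
gathered = [torch.zeros_like(local_shard) for _ in range(world_size)]
dist.all_gather(gathered, local_shard)
## gathered[i] now holds rank i's shard on every rank.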

Related concepts: distributed model parallelism, sharded data parallelism...

  • error
terminate called after throwing an instance of ':SMDDPTimeoutError:'
 what()
 #011Timeout: A call to 'allGather' has taken over 1800.000000 seconds. Terminating the distributed job.It might be one of the workers failed during forward and backward propagation and failed to call "allGather".
 #011Extend timeout using dist.init_process_group(timeout=timedelta(minutes=60)
 #011Extend timeout using dist.init(timeout=timedelta(minutes=60)
 #011or refer to the debugging guide. Verify that all ranks call
 #011collective operations in the same order and within timeout period.
 [algo-2:00105] *** Process received signal ***
 [algo-2:00105] Signal: Aborted (6)
 [algo-2:00105] Signal code:  (-6)
 [algo-2:00105] :[ 0] :/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f4e7a93c420]
 [algo-2:00105] :[ 1]
 /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f4e7a61d00b]
 [algo-2:00105] :[ 2] :/usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f4e7a5fc859]
 [algo-2:00105]
 [ 3] :/opt/conda/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xc0)[0x7f4e4f792026]
 [ 4]
 /opt/conda/lib/libstdc++.so.6(+0xb0514)[0x7f4e4f790514]
 [ 5] :/opt/conda/lib/libstdc++.so.6(+0xb0566)[0x7f4e4f790566]
 [ 6] :/opt/conda/lib/libstdc++.so.6(__cxa_rethrow+0x0)[0x7f4e4f790758]
 [algo-2:00105] :[ 7] :/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/lib/libsmddpcpp.so(_Z24default_timeout_callbacklSs+0x279)[0x7f4de9ea542f]
 [ 8] :/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/lib/libsmddpcpp.so(_ZNSt17_Function_handlerIFvlSsEPS0_E9_M_invokeERKSt9_Any_dataOlOSs+0x70)[0x7f4de9ea7281]
 [algo-2:00105] :[ 9]
 /opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/lib/libsmddpcpp.so(_ZNKSt8functionIFvlSsEEclElSs+0x65)[0x7f4de9ea6e0f]
 [algo-2:00105] :[10]
 /opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/lib/libsmddpcpp.so(_ZN17TimeoutController9controlfnEv+0x471)[0x7f4de9ea59ef]
 [algo-2:00105] :[11] :/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/lib/libsmddpcpp.so(_ZSt13__invoke_implIvM17TimeoutControllerFvvEPS0_JEET_St21__invoke_memfun_derefOT0_OT1_DpOT2_+0x69)[0x7f4de9ea7eb7]
 [algo-2:00105] :[12] :/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/lib/libsmddpcpp.so(_ZSt8__invokeIM17TimeoutControllerFvvEJPS0_EENSt15__invoke_resultIT_JDpT0_EE4typeEOS5_DpOS6_+0x3b)[0x7f4de9ea7de6]
 [13] :/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/lib/libsmddpcpp.so(_ZNSt6thread8_InvokerISt5tupleIJM17TimeoutControllerFvvEPS2_EEE9_M_invokeIJLm0ELm1EEEEvSt12_Index_tupleIJXspT_EEE+0x47)[0x7f4de9ea7d45]
 [algo-2:00105] :[14]
 /opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/lib/libsmddpcpp.so(_ZNSt6thread8_InvokerISt5tupleIJM17TimeoutControllerFvvEPS2_EEEclEv+0x1c)[0x7f4de9ea7cfc]
 [algo-2:00105] :[15] :/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/lib/libsmddpcpp.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJM17TimeoutControllerFvvEPS3_EEEEE6_M_runEv+0x20)[0x7f4de9ea7cdc]
 [algo-2:00105] :[16]
 /opt/conda/lib/libstdc++.so.6(+0xcda93)[0x7f4e4f7ada93]
 [algo-2:00105] :[17] :/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f4e7a930609]
 [algo-2:00105] :[18] :/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f4e7a6f9353]
 [algo-2:00105] *** End of error message ***
 /opt/conda/bin/runwithenvvars: line 55:   105 Aborted                 (core dumped) $@
 --------------------------------------------------------------------------
 Primary job  terminated normally, but 1 process returned
 a non-zero exit code. Per user-direction, the job has been aborted.
 mpirun.real detected that one or more processes exited with non-zero status, thus causing
 the job to be terminated. The first process to do so was
 Process name: [[41139,1],1]
 Exit code:    134"
Command "mpirun --host algo-1:1,algo-2:1 -np 2 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 2 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_HOMOGENEOUS=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.9/site-packages/gethostname.cpython-39-x86_64-linux-gnu.so -x SMDATAPARALLEL_SERVER_ADDR=algo-1 -x SMDATAPARALLEL_SERVER_PORT=7592 -x SAGEMAKER_INSTANCE_TYPE=ml.g4dn.xlarge smddprun /opt/conda/bin/python3.9 -m mpi4py train_10k.py --batch-size 128 --class-weights-dict 1=1.7004885993485341,2=0.9083079599826012,3=0.7832708177044261,4=0.8799831436999579,5=1.1137066666666666 --debug False --early-stopping-patience 5 --epochs 30 --lr-sched-step-size 10 --model-arch resnet34 --opt-learning-rate 0.1 --opt-momentum 0.9 --opt-type sgd --test-data-path s3://p5-amazon-bin-images/webdataset/test/test-shard-{000000..000001}.tar --test-data-size 1567 --train-data-path s3://p5-amazon-bin-images/webdataset/train/train-shard-{000000..000007}.tar --train-data-size 7308 --val-data-path s3://p5-amazon-bin-images/webdataset/val/val-shard-{000000..000001}.tar --val-data-size 1566 --wandb True"
2025-02-07 19:23:45,918 sagemaker-training-toolkit ERROR    Encountered exit_code 1
2025-02-07 19:23:45,916 sagemaker-training-toolkit INFO     Invoked on_terminate from psutil.wait_for_procs
2025-02-07 19:23:45,917 sagemaker-training-toolkit INFO     process psutil.Process(pid=86, name='orted', status='terminated', started='18:40:47') terminated with exit code None
2025-02-07 19:23:45,917 sagemaker-training-toolkit INFO     Reporting status for ORTEd process. gone: [psutil.Process(pid=86, name='orted', status='terminated', started='18:40:47')] alive: []
2025-02-07 19:23:45,917 sagemaker-training-toolkit INFO     Orted process exited

2025-02-07 19:23:56 Uploading - Uploading generated training model
2025-02-07 19:24:14 Failed - Training job failed

  • ChatGPT (might be wrong)

Yes, you're correct! In distributed training with multiple nodes, the state (like validation loss or early stopping condition) might not be the same across all nodes, leading to inconsistent stopping behavior if not handled properly.

Using dist.barrier() ensures synchronization across all nodes in the distributed environment. This allows you to make sure that once a stopping condition is met (e.g., early stopping), all nodes will reach a consensus and stop the training at the same time.

Here’s how you can use dist.barrier() for early stopping in your SageMaker Distributed Data Parallel (SDP) setup:

  1. Rank 0 Node (Master Node) Handling: Compute the early stopping condition (e.g., checking the validation loss) only on the master node (rank 0).

  2. Broadcast the Early Stopping Signal: Use dist.broadcast() to share the early stopping decision from the master node to all worker nodes.

  3. Sync Nodes with dist.barrier(): Ensure all nodes stop training together based on the early stopping decision.

Here’s a revised version of the code with these ideas implemented:

import torch
import torch.distributed as dist

# Assumes the process group has already been initialized with the SMDDP backend,
# e.g. dist.init_process_group(backend='smddp'), and that MyModel, loss_fn,
# train_loader, val_loader, and num_epochs are defined elsewhere.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MyModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

# Early stopping criteria
early_stopping_patience = 5
best_val_loss = float('inf')
epochs_without_improvement = 0
early_stop = False  # Flag for early stopping, will be broadcast to all ranks

rank = dist.get_rank()  # Rank of the current process

for epoch in range(num_epochs):
    model.train()

    # Training loop
    for data, target in train_loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()

    # Validation step
    model.eval()
    val_loss, num_batches = 0.0, 0
    with torch.no_grad():
        for val_data, val_target in val_loader:
            val_data, val_target = val_data.to(device), val_target.to(device)
            output = model(val_data)
            val_loss += loss_fn(output, val_target).item()
            num_batches += 1
    val_loss /= max(num_batches, 1)  # also works for IterableDataset loaders without __len__

    # Early stopping check (only on rank 0)
    if rank == 0:
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1

        if epochs_without_improvement >= early_stopping_patience:
            early_stop = True  # Early stop flag is set

    # Broadcast the early stop decision from rank 0 to all ranks.
    # The flag tensor must be on the GPU: SMDDP collectives expect CUDA tensors.
    early_stop_tensor = torch.tensor(int(early_stop), dtype=torch.int32, device=device)
    dist.broadcast(early_stop_tensor, src=0)

    # Update the early_stop flag on all ranks
    early_stop = early_stop_tensor.item() == 1

    # If early stopping, break the training loop
    if early_stop:
        print(f"Early stopping at epoch {epoch} on rank {rank}")
        dist.barrier()  # Synchronize all processes before stopping
        break

    # Ensure all ranks sync at the end of the epoch
    dist.barrier()

Key Points:

  • Rank 0 (Master Node): The early stopping logic (like tracking validation loss and deciding when to stop) runs only on the master node (rank == 0).

  • Broadcast Early Stop Signal: The master node (rank == 0) uses dist.broadcast() to share the early stopping decision with all other nodes. The other nodes receive this broadcast and set their early_stop flag accordingly.

  • Synchronization (dist.barrier()): All nodes are synchronized at the end of the epoch or right before early stopping to ensure that no node continues training after the early stop condition is met.

With this approach, you ensure that all nodes stop at the same time when the early stopping condition is met, preventing inconsistent behavior between nodes.

nov05 commented Feb 7, 2025

⚠️🟢 Issue: SMDDP broadcast error:

 File "train_v1.py", line 449, in main
 dist.broadcast(braodcast_early_stop, src=0)  ## src is the process rank
 File "/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/torch/distributed.py", line 156, in wrapper
 return func(*args, **kwargs)
 File "/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/__init__.py", line 58, in wrapper
 File "/opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/torch/distributed.py", line 200, in broadcast
 return torchdst.broadcast(tensor, src=src, group=None, async_op=async_op)
 File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
 work = default_pg.broadcast([tensor], opts)
 ValueError: Expected devices to have indices, got cpu
  • ✅ Solution: move early_stop_tensor to the GPU before broadcasting, i.e. call .to(device) (or create the tensor with device=device), so that SMDDP's broadcast receives a CUDA tensor rather than a CPU tensor.
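A minimal sketch of the fix, meant to sit inside the training loop; local_rank is assumed to identify this process's GPU, and variable names follow the traceback above but are otherwise illustrative:

device = torch.device("cuda", local_rank)   ## local_rank assumed to be this process's GPU index
broadcast_early_stop = torch.tensor(int(early_stop), dtype=torch.int32).to(device)
dist.broadcast(broadcast_early_stop, src=0) ## src is the rank that holds the decision
early_stop = bool(broadcast_early_stop.item())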

nov05 commented Feb 8, 2025

⚠️🟢 Issue: another type of broadcasting error.

✅ I can now confirm that the issue was about how to properly break the training loop under the SageMaker-Torch DDP framework, rather than about WebDataset. I simply replaced dist.broadcast(broadcast_early_stop, src=0) with dist.all_reduce(broadcast_early_stop, op=dist.ReduceOp.SUM), and the error is gone.

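A minimal sketch of the all_reduce-based signal, assuming broadcast_early_stop is a CUDA tensor and the early_stop flag is only ever set on rank 0 (names are illustrative):

## Inside the epoch loop, after the rank-0 early-stopping check:
broadcast_early_stop = torch.tensor(int(early_stop), dtype=torch.int32, device=device)
dist.all_reduce(broadcast_early_stop, op=dist.ReduceOp.SUM)  ## nonzero sum on every rank means "stop"
early_stop = broadcast_early_stop.item() > 0
if early_stop:
    break  ## all ranks leave the training loop together

Unlike broadcast, all_reduce is symmetric: every rank passes its own tensor and receives the same reduced result, so no single rank has to be designated as the source of the decision.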
[1,mpirank:0,algo-1]<stdout>:👉 VAL: Average loss: 439.5972, Accuracy: 252/1536 (16.41%)
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:0,algo-1]<stdout>:👉 Train Epoch: 1, Learning Rate: 0.1
[1,mpirank:0,algo-1]<stdout>:
[1,mpirank:1,algo-2]<stderr>:terminate called after throwing an instance of '[1,mpirank:1,algo-2]<stderr>:SMDDPTimeoutError[1,mpirank:1,algo-2]<stderr>:'
[1,mpirank:1,algo-2]<stderr>:  what():  [1,mpirank:1,algo-2]<stderr>:
[1,mpirank:1,algo-2]<stderr>:
[1,mpirank:1,algo-2]<stderr>:#011Timeout: A call to 'broadcast' has taken over 1800.000000 seconds. Terminating the distributed job.It might be one of the workers failed during forward and backward propagation and failed to call "broadcast".
[1,mpirank:1,algo-2]<stderr>:
[1,mpirank:1,algo-2]<stderr>:#011Extend timeout using dist.init_process_group(timeout=timedelta(minutes=60)
[1,mpirank:1,algo-2]<stderr>:#011Extend timeout using dist.init(timeout=timedelta(minutes=60)
[1,mpirank:1,algo-2]<stderr>:#011or refer to the debugging guide. Verify that all ranks call
[1,mpirank:1,algo-2]<stderr>:#011collective operations in the same order and within timeout period.
[1,mpirank:1,algo-2]<stderr>:
[1,mpirank:1,algo-2]<stderr>:[algo-2:00105] *** Process received signal ***
[1,mpirank:1,algo-2]<stderr>:[algo-2:00105] Signal: Aborted (6)
[1,mpirank:1,algo-2]<stderr>:[algo-2:00105] Signal code:  (-6)
UnexpectedStatusException: Error for Training job p5-amazon-bin-job-20250207-172703: Failed. Reason: 
AlgorithmError: SMDDPTimeoutError:
ExitCode 134
ErrorMessage "Exception ignored in: :<function Pipe.__del__ at 0x7f268daebc10>
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 121, in __del__
 self.close()
 File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
 self.wait_for_child()
 File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 83, in wait_for_child
 raise IOError(f"{self.args}: exit {self.status} (read) {info}")
 OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/train/train-shard-000004.tar -',), {'shell': True, 
'bufsize': 8192}): exit 1 (read) {}
 Exception ignored in: <function Pipe.__del__ at 0x7f20e0860c10>
 OSError: (('aws s3 cp s3://p5-amazon-bin-images/webdataset/train/train-shard-000005.tar -',), {'shell': True, 
'bufsize': 8192}): exit 1 (read) {}
 Exception ignored in: Traceback (most recent call last)
 <function Pipe.__del__ at 0x7f20e0860c10>  File "/opt/conda/lib/python3.. Check troubleshooting guide for common 
errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html
  • Did wait_for_child() hang the process? Likely not.
[1,mpirank:0,algo-1]<stderr>:  File "/opt/conda/lib/python3.9/site-packages/webdataset/gopen.py", line 109, in close
[1,mpirank:0,algo-1]<stderr>:    [1,mpirank:0,algo-1]<stderr>:self.wait_for_child()
