Using Neuron for AWS Trainium

I highly recommend reading through the AWS Neuron documentation at https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html

One experiment that's pending on my plate is to use Amazon Q Developer to migrate code from using CUDA to Neuron. If you've done it, I'd love to hear about it.

1. Choose an Instance Type

Select an Amazon EC2 Trn2 instance, which is powered by AWS Trainium chips. These instances are purpose-built for high-performance deep learning training of generative AI models.

2. Set Up the Environment

Launch a Trn2 instance using the AWS Deep Learning AMI (DLAMI) with Neuron support. Alternatively, you can use AWS Neuron Deep Learning Containers (Neuron DLCs) which are pre-configured for Trainium2.
Connect to your instance via SSH.
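Activate the Neuron PyTorch environment that ships with the DLAMI. Recent Neuron DLAMIs provide Python virtual environments under /opt rather than conda environments, and the exact name depends on the DLAMI release (run ls /opt to check). For a PyTorch 2.5 release it looks like this:

source /opt/aws_neuronx_venv_pytorch_2_5/bin/activate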

2.1. SDK and Framework Support

Use Neuron SDK version 2.21 or later, which introduces support for AWS Trainium2 chips and Amazon EC2 Trn2 instances.
The SDK now supports PyTorch 2.5 across the Neuron ecosystem.

3. Install the Neuron SDK

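The Neuron SDK should already be pre-installed on the DLAMI. If not, install the Trainium packages, torch-neuronx and neuronx-cc, from the Neuron pip repository (the older torch-neuron and neuron-cc packages target Inf1 only):

pip install torch-neuronx neuronx-cc --extra-index-url https://pip.repos.neuron.amazonaws.com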
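
To confirm what is actually installed in the active environment, a small check like the one below can help. It is a minimal sketch that assumes the Trainium package names torch-neuronx, neuronx-cc, and torch-xla; the helper name installed_versions is made up for this example.

```python
from importlib import metadata

# Assumed package names for the Trainium (NeuronX) PyTorch stack.
NEURON_PACKAGES = ("torch-neuronx", "neuronx-cc", "torch-xla")


def installed_versions(packages=NEURON_PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Any package reported as not installed points to a wrong environment or an incomplete install.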

4. Prepare Your Model

Import your existing PyTorch model.
Remove any CUDA-specific code or dependencies.
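
Finding the CUDA-specific spots is usually the tedious part. The sketch below scans source text for a few common patterns. The pattern list is illustrative rather than exhaustive, and find_cuda_usage is a hypothetical helper, not part of any SDK.

```python
import re

# Illustrative patterns only; real code bases will need more.
CUDA_PATTERNS = [
    re.compile(r"\.cuda\("),                 # tensor.cuda() / model.cuda()
    re.compile(r"torch\.cuda\."),            # torch.cuda.is_available(), streams, AMP
    re.compile(r"""["']cuda(:\d+)?["']"""),  # torch.device("cuda"), .to("cuda:0")
]


def find_cuda_usage(source):
    """Return (line_number, line) pairs that match any CUDA pattern."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in CUDA_PATTERNS):
            hits.append((number, line.strip()))
    return hits


sample = '''
import torch
device = torch.device("cuda")
model = Net().cuda()
if torch.cuda.is_available():
    x = x.to(device)
'''

for number, line in find_cuda_usage(sample):
    print(f"line {number}: {line}")
```

Custom CUDA kernels and extensions will not show up in a text scan like this and need to be reviewed by hand.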

5. Modify Your Code

Replace CUDA device assignments with Neuron XLA device:

import torch_xla.core.xla_model as xm

# Replace: device = torch.device('cuda')
device = xm.xla_device()

Move your model and tensors to the Neuron device:

model = model.to(device)
tensor = tensor.to(device)
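
If the same script also has to run on a GPU box or a laptop, device selection can be made conditional. This is a minimal sketch: pick_device_kind and get_device are made-up names, and it assumes that an importable torch_xla means an XLA device is the right choice, which may not hold on every machine.

```python
import importlib.util


def pick_device_kind():
    """Return "xla", "cuda", or "cpu" depending on what is importable here."""
    if importlib.util.find_spec("torch_xla") is not None:
        return "xla"
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    return "cpu"


def get_device():
    """Build the matching torch device; requires torch to be installed."""
    kind = pick_device_kind()
    if kind == "xla":
        import torch_xla.core.xla_model as xm
        return xm.xla_device()
    import torch
    return torch.device(kind)


print(pick_device_kind())
```

With this in place, device = get_device() can replace the hard-coded assignment, and the rest of the code keeps using model.to(device).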

6. Compile Your Model

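For training, no explicit compile call is needed: torch-xla records the model's operations lazily, and the Neuron compiler (neuronx-cc) compiles each graph the first time it runs and caches the result. For inference, compile ahead of time with torch_neuronx.trace (torch_neuron, without the x, targets Inf1 only):

import torch_neuronx

# example_inputs: a tensor (or tuple of tensors) with the shapes used at inference time
model_neuron = torch_neuronx.trace(model, example_inputs)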

7. Train Your Model

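Modify your training loop so that every step ends with xm.mark_step(), which tells XLA to compile and run the operations recorded so far:

import torch
import torch_xla.core.xla_model as xm

optimizer = torch.optim.Adam(model.parameters())

for epoch in range(num_epochs):
    for batch in dataloader:
        inputs, labels = batch[0].to(device), batch[1].to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        # All-reduces gradients across workers (if any), then steps the optimizer.
        xm.optimizer_step(optimizer)

        # Cut the lazy graph once per step, not once per epoch.
        xm.mark_step()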
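
Where xm.mark_step() sits in the loop matters, because XLA records operations lazily and only dispatches them when the graph is cut. The toy model below illustrates this. LazyGraph is a made-up stand-in for that behavior, not the real torch_xla machinery, and it ignores other events that can trigger execution.

```python
class LazyGraph:
    """Toy stand-in for lazy tracing: ops pile up until mark_step() cuts the graph."""

    def __init__(self):
        self.pending = []    # ops recorded since the last cut
        self.executed = []   # size of each graph that was dispatched

    def record(self, op):
        self.pending.append(op)

    def mark_step(self):
        if self.pending:
            self.executed.append(len(self.pending))
            self.pending = []


def run(num_epochs, steps_per_epoch, cut_every_step):
    """Simulate a training loop and return the sizes of the dispatched graphs."""
    graph = LazyGraph()
    for _ in range(num_epochs):
        for _ in range(steps_per_epoch):
            for op in ("forward", "loss", "backward", "optimizer"):
                graph.record(op)
            if cut_every_step:
                graph.mark_step()
        graph.mark_step()  # end-of-epoch cut; a no-op if nothing is pending
    return graph.executed


print(run(2, 3, cut_every_step=True))   # [4, 4, 4, 4, 4, 4]
print(run(2, 3, cut_every_step=False))  # [12, 12]
```

Cutting once per step yields small graphs of identical shape, which a compiler can cache and reuse. Cutting once per epoch yields one graph that grows with the number of batches.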

8. Monitor Performance

Use Neuron tools to monitor your model's performance:

neuron-top

9. Optimize (if needed)

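If performance isn't satisfactory, consider using Neuron's profiling tools. neuron-profile works on a compiled NEFF file rather than on a Python script: capture a profile from the NEFF, then view it (the file names below are placeholders):

neuron-profile capture -n model.neff -s profile.ntff
neuron-profile view -n model.neff -s profile.ntff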
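
Before reaching for the profiler, a coarse seconds-per-step number is often enough to tell whether a change helped. The sketch below skips the first few steps, on the assumption that they include graph compilation. time_steps and fake_step are hypothetical names, and fake_step stands in for one real training step.

```python
import time


def time_steps(step_fn, num_steps, warmup_steps=2):
    """Run step_fn num_steps times; return mean seconds per step after warm-up."""
    if num_steps <= warmup_steps:
        raise ValueError("num_steps must exceed warmup_steps")
    durations = []
    for _ in range(num_steps):
        start = time.perf_counter()
        step_fn()
        durations.append(time.perf_counter() - start)
    steady = durations[warmup_steps:]  # drop the steps that likely compiled
    return sum(steady) / len(steady)


# Hypothetical stand-in for one training step.
def fake_step():
    time.sleep(0.01)


print(f"{time_steps(fake_step, num_steps=5):.4f} s/step")
```

On an XLA device the step function should end with xm.mark_step(). Execution is asynchronous, so averages over many steps are more trustworthy than any single measurement.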

10. Deploy

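Once your model is trained and optimized, you can deploy it using services like Amazon SageMaker, or directly on Inf2 or Trn1/Trn2 instances for inference. Inf1 instances use the older torch-neuron stack and cannot run models compiled with torch-neuronx.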

Remember, the AWS Neuron SDK is designed to work with popular ML frameworks like PyTorch and TensorFlow, so you can often use your existing code and workflows with minimal changes. However, you may need to adapt certain CUDA-specific optimizations or custom CUDA kernels to work with Neuron.

For more detailed information and advanced usage, refer to the AWS Neuron documentation and examples provided in the SDK.

siddharthkrish/neuron.md