NVIDIA Enterprise GPU Setup Guide

This guide covers the setup of NVIDIA enterprise GPUs (A100, H100, H200) on Linux systems, focusing on Ubuntu LTS distributions (20.04, 22.04, and 24.04). It details the installation of current drivers and the CUDA toolkit, along with optional components for containerized or specialized workloads.

Prerequisites
Driver Installation
CUDA Toolkit Installation
Optional Components
Verification
Troubleshooting
Additional Tips
Latest Drivers

Prerequisites

Hardware Requirements

Supported NVIDIA GPU: A100, H100, H200, etc.
PCIe Slot: x16 slot, PCIe Gen4 for A100 cards or Gen5 for H100/H200 PCIe cards (recommended)
Power & Cooling: Adequate power supply and cooling solutions
Motherboard: Server-grade with proper PCIe bifurcation support

System Requirements

Operating System: Ubuntu LTS (20.04, 22.04, or 24.04)
Kernel & Build Tools: Ensure Linux kernel headers and basic build tools are installed

Install basic development tools and headers:

sudo apt-get update
sudo apt-get install -y \
    build-essential \
    linux-headers-$(uname -r) \
    software-properties-common \
    gnupg
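
Before moving on, it can help to confirm that the prerequisites are actually in place. The following is a minimal sketch of such a check; the function names `is_supported_release` and `preflight` are illustrative and not part of any NVIDIA or Ubuntu tool.

```shell
#!/usr/bin/env bash
# Pre-flight sketch: helper names are hypothetical.

# Succeeds only for the Ubuntu LTS releases this guide covers.
is_supported_release() {
    case "$1" in
        20.04|22.04|24.04) return 0 ;;
        *) return 1 ;;
    esac
}

# Reports the state of each prerequisite; returns non-zero if any is missing.
preflight() {
    local rc=0 version_id=""
    if [ -r /etc/os-release ]; then
        version_id=$(. /etc/os-release; echo "${VERSION_ID:-}")
    fi
    if is_supported_release "$version_id"; then
        echo "ok: Ubuntu $version_id"
    else
        echo "warn: release '${version_id:-unknown}' is not covered by this guide"; rc=1
    fi
    # Installed headers expose a "build" link under /lib/modules/<kernel>.
    if [ -d "/lib/modules/$(uname -r)/build" ]; then
        echo "ok: kernel headers for $(uname -r)"
    else
        echo "missing: linux-headers-$(uname -r)"; rc=1
    fi
    if command -v gcc >/dev/null 2>&1; then
        echo "ok: gcc found"
    else
        echo "missing: gcc (build-essential)"; rc=1
    fi
    return $rc
}

preflight || echo "Fix the items above before installing the driver."
```

The check only reads system state, so it is safe to rerun after each fix.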

Driver Installation

Before installing new NVIDIA drivers, remove any old versions and configure the repository with the current keyring-based method.

1. Remove Existing NVIDIA Drivers

sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get autoremove

2. Remove Any Outdated NVIDIA Signing Keys (if present)

sudo apt-key del 7fa2af80

3. Add the NVIDIA Repository and GPG Key

Note: With apt-key deprecated, we now use the NVIDIA CUDA keyring package.

Download and install the CUDA keyring package:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
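
The command substitution in the URL above turns the Ubuntu version into the directory name the repository uses. As a rough illustration of that mapping, the sketch below wraps it in two helper functions; `cuda_repo_name` and `cuda_keyring_url` are hypothetical names introduced here, and the URL layout is simply the one used in this guide.

```shell
#!/usr/bin/env bash
# Maps an Ubuntu VERSION_ID such as "22.04" to the repository directory name.
cuda_repo_name() {
    printf 'ubuntu%s\n' "$(printf '%s' "$1" | sed 's/\.//')"
}

# Builds the keyring URL for a given VERSION_ID (x86_64 only, as in this guide).
cuda_keyring_url() {
    printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/x86_64/cuda-keyring_1.1-1_all.deb\n' \
        "$(cuda_repo_name "$1")"
}

# Print the URL that would be fetched on Ubuntu 22.04.
cuda_keyring_url 22.04
```

Printing the URL before downloading makes it easy to spot an unexpected release name.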

4. Add the Repository Pin File

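This file gives the NVIDIA repository a pin priority of 600, so apt prefers its packages over same-named packages from the Ubuntu archive: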

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')/x86_64/cuda-ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//').pin
sudo mv cuda-ubuntu*.pin /etc/apt/preferences.d/cuda-repository-pin-600

5. Update Package Lists

sudo apt-get update

6. Install the Latest NVIDIA Drivers

For modern enterprise GPUs, Ubuntu’s driver query typically recommends the “open” driver series. For example, for an A100 the recommended driver might be:

sudo apt-get install -y nvidia-driver-570-open

If you wish to let Ubuntu automatically choose the best driver for your hardware, run the following (install supersedes the deprecated autoinstall subcommand):

sudo ubuntu-drivers install

Note: For H200 or other new-generation GPUs, verify the recommended driver version using ubuntu-drivers devices and choose the driver (open or server) that best suits your workload.
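
To pick out the recommended driver in a script, the output of ubuntu-drivers devices can be filtered. The sketch below assumes lines shaped like `driver   : <package> - distro non-free recommended`; the function name `recommended_driver` and the sample listing are illustrative only, so check the format on your own system first.

```shell
#!/usr/bin/env bash
# Prints the package name from each "driver" line flagged as recommended.
# Assumed line shape: "driver   : <package> - distro non-free recommended"
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3 }'
}

# Demo on a made-up listing; on a real host use:
#   ubuntu-drivers devices | recommended_driver
sample='driver   : nvidia-driver-570-open - distro non-free recommended
driver   : nvidia-driver-570-server - distro non-free'
printf '%s\n' "$sample" | recommended_driver
```

Reading from standard input keeps the filter testable without a GPU present.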

7. Reboot the System

sudo reboot

CUDA Toolkit Installation

Once the driver is updated and working, install the CUDA toolkit to support development and GPU-accelerated applications.

1. Install the CUDA Toolkit and Development Tools

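The following command installs the newest CUDA toolkit from the NVIDIA repository. (To pin a release, install a versioned package of the form cuda-toolkit-<major>-<minor>, or download the installer from NVIDIA's website.) Do not also install Ubuntu's nvidia-cuda-toolkit package: it is packaged separately by Ubuntu, is usually older, and can shadow the NVIDIA-provided nvcc.

sudo apt-get install -y cuda-toolkit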

2. Update Environment Variables

Add the CUDA paths to your shell configuration (e.g., add these lines to your ~/.bashrc):

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Apply the changes:

source ~/.bashrc
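
Appending the export lines by hand is fine once, but rerunning a setup script would duplicate them. One possible way to keep the edit idempotent is sketched below; `append_once` is a hypothetical helper, and the demo writes to a scratch file rather than ~/.bashrc.

```shell
#!/usr/bin/env bash
# Appends a line to a file only if that exact line is not already present.
append_once() {
    local line=$1 file=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demo against a scratch file; point rc_file at ~/.bashrc for real use.
rc_file=$(mktemp)
append_once 'export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}' "$rc_file"
# The second call is a no-op because the line already exists.
append_once 'export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}' "$rc_file"
cat "$rc_file"
```

The single quotes matter: they keep ${PATH} unexpanded so the shell evaluates it at login time, not when the line is written.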

Optional Components

NVIDIA Container Toolkit (for Docker)

Set up the container toolkit to run GPU-accelerated containers:

# Create directory for keyring if it doesn't exist
sudo mkdir -p /etc/apt/keyrings

# Download and add the NVIDIA GPG key (modern method)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /etc/apt/keyrings/nvidia-container-toolkit.gpg

# Add the repository (modern method); the stable/deb list serves all supported Ubuntu releases
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/etc/apt/keyrings/nvidia-container-toolkit.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Update package lists
sudo apt-get update

# Install NVIDIA container toolkit
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime in Docker's daemon configuration
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker service
sudo systemctl restart docker
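
A quick way to confirm the container setup is to run nvidia-smi inside a throwaway container. The sketch below wraps that in a function; `gpu_container_smoke_test` is a hypothetical name, the exit code 2 for "Docker missing" is a convention chosen here, and the plain ubuntu image is an assumption (any small image should do, since the toolkit injects the driver utilities).

```shell
#!/usr/bin/env bash
# Returns 0 if a container can see the GPUs, 2 if Docker is unavailable,
# and Docker's own exit status otherwise.
gpu_container_smoke_test() {
    local docker_bin=${1:-docker}
    if ! command -v "$docker_bin" >/dev/null 2>&1; then
        echo "skip: $docker_bin not found" >&2
        return 2
    fi
    # The toolkit mounts nvidia-smi and the driver libraries into the
    # container, so the image itself needs no NVIDIA software.
    sudo "$docker_bin" run --rm --gpus all ubuntu nvidia-smi
}

# Usage on a GPU host:
#   gpu_container_smoke_test && echo "containers can reach the GPU"
```

If the run fails with an unknown-runtime or device-driver error, the Docker daemon has usually not been restarted since the toolkit was installed.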

NVIDIA GPUDirect Storage (GDS)

Install GPUDirect Storage for high-performance data transfer:

sudo apt-get install -y nvidia-gds

NVIDIA Fabric Manager (for NVLink/NVSwitch configurations)

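Fabric Manager is required on NVSwitch-based systems (for example, HGX baseboards); GPUs linked only by NVLink bridges do not need it. Its version must match the installed driver branch, so install the package for the same branch (570 in this guide's example) and enable the service:

sudo apt-get install -y nvidia-fabricmanager-570
sudo systemctl enable --now nvidia-fabricmanager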
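
Because Fabric Manager has to match the driver, it can be convenient to derive the package name from the running driver version. The sketch below assumes the Ubuntu package naming pattern `nvidia-fabricmanager-<branch>`; confirm it with apt-cache search fabricmanager before relying on it. The helper names and the sample version are illustrative.

```shell
#!/usr/bin/env bash
# Extracts the branch ("570") from a full driver version ("570.124.06").
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

# Package naming pattern is an assumption; verify with:
#   apt-cache search fabricmanager
fabricmanager_package_for() {
    printf 'nvidia-fabricmanager-%s\n' "$(driver_branch "$1")"
}

# On a GPU host the version would come from:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
fabricmanager_package_for 570.124.06
```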

Verification

After installation, verify that both the driver and CUDA toolkit are functioning correctly.

1. Check the NVIDIA Driver Installation

nvidia-smi

Expected output should display your GPU(s), driver version, and CUDA version. For example:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 570.xxx.xx          Driver Version: 570.xxx.xx   CUDA Version: 12.x |
|-------------------------------+----------------------+----------------------+
...
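
On multi-GPU servers it is worth checking that every card shows up, not just that nvidia-smi runs. The sketch below counts devices in nvidia-smi -L style output; the function names and the sample lines (including the placeholder UUIDs) are made up for illustration.

```shell
#!/usr/bin/env bash
# Counts GPUs in `nvidia-smi -L` style output read from stdin.
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Compares the number of visible GPUs with the number you expect.
expect_gpus() {
    local expected=$1 found
    found=$(count_gpus)
    if [ "$found" -eq "$expected" ]; then
        echo "ok: $found GPU(s) visible"
    else
        echo "mismatch: expected $expected, found $found" >&2
        return 1
    fi
}

# Demo on made-up output; on a real host use: nvidia-smi -L | expect_gpus 8
printf 'GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-xxxxxxxx)\nGPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-yyyyyyyy)\n' \
    | expect_gpus 2
```

A missing GPU at this stage usually points to a hardware, power, or PCIe seating problem, not a driver issue.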

2. Verify the CUDA Toolkit Installation

nvcc --version

3. Test CUDA Functionality

Create and run a basic CUDA program:

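# Create a simple CUDA test program
cat > cuda_test.cu << 'EOF'
#include <stdio.h>
__global__ void kernel() { }
int main() {
    kernel<<<1,1>>>();
    // Surface launch and execution errors instead of always reporting success
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA test failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA test successful!\n");
    return 0;
}
EOF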

# Compile and run the test program
nvcc cuda_test.cu -o cuda_test
./cuda_test

Troubleshooting

Common Issues

nvidia-smi Not Found or Not Displaying Correctly
- Verify driver installation with:
```
dpkg -l | grep nvidia-driver
```
- Check if the NVIDIA kernel module is loaded:
```
lsmod | grep nvidia
```
- Inspect kernel logs for errors (driver messages are prefixed with NVRM; reading the log needs root on Ubuntu):
```
sudo dmesg | grep -i -E 'nvidia|nvrm'
```
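
The three checks above can be combined into a rough first diagnosis. The sketch below reads an lsmod-style listing and reports which GPU kernel module, if any, is loaded; a loaded nouveau module is one common reason the NVIDIA module is absent. The function names and the sample listing are illustrative.

```shell
#!/usr/bin/env bash
# Succeeds if the named module appears in an lsmod-style listing on stdin.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Reads a module listing from stdin and prints a one-line diagnosis.
diagnose_modules() {
    local listing
    listing=$(cat)
    if printf '%s\n' "$listing" | module_loaded nvidia; then
        echo "nvidia module is loaded"
    elif printf '%s\n' "$listing" | module_loaded nouveau; then
        echo "nouveau is loaded instead of nvidia; it may need to be blacklisted"
    else
        echo "no GPU kernel module loaded; check the kernel log for load errors"
    fi
}

# Demo on a made-up listing; on a real host use: lsmod | diagnose_modules
printf 'Module Size Used by\nnouveau 2306048 1\n' | diagnose_modules
```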

CUDA Toolkit Not Found
- Ensure that your PATH and LD_LIBRARY_PATH environment variables are correctly set.
- Check that the CUDA installation directory exists:
```
ls /usr/local/cuda
```
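- For further validation, build and run NVIDIA's CUDA samples. They are no longer bundled with the toolkit and are published in the NVIDIA/cuda-samples repository on GitHub.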

Performance Issues

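Confirm PCIe link generation and width (current versus maximum) with:
```
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv
```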
Monitor power and thermal data:
```
nvidia-smi -q -d POWER,TEMPERATURE
```
Verify compute mode with:
```
nvidia-smi -q | grep "Compute Mode"
```
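
A reduced PCIe link width is easy to miss when scanning the full report by eye. The sketch below flags it from CSV rows of the form `index, gen_current, gen_max, width_current, width_max`, which is the shape nvidia-smi produces for the corresponding --query-gpu fields with --format=csv,noheader. Only width is compared, since the link generation is often lowered at idle to save power. `check_pcie_width` and the sample rows are illustrative.

```shell
#!/usr/bin/env bash
# Reads rows "index, gen_current, gen_max, width_current, width_max" and
# reports any GPU whose current link width is below its maximum.
check_pcie_width() {
    awk -F', *' '
        $4 + 0 < $5 + 0 {
            printf "GPU %s: width x%s of x%s\n", $1, $4, $5
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Demo on made-up rows; on a real host use:
#   nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
#       --format=csv,noheader | check_pcie_width
printf '0, 4, 4, 16, 16\n1, 1, 4, 8, 16\n' | check_pcie_width \
    || echo "at least one link is running below its maximum width"
```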

Additional Tips

System Compatibility: Always verify that your hardware and OS are compatible with the chosen driver and CUDA versions.
Official Documentation: Refer to the latest NVIDIA documentation for enterprise GPUs for any changes.
Monitoring: For production environments, consider using NVIDIA Data Center GPU Manager (DCGM) to monitor health and performance.

Latest Drivers

If the required NVIDIA driver version is not available from your default package manager, follow these steps to add NVIDIA’s official repository for the latest drivers:

1. Remove Previous NVIDIA Repository Files

sudo rm /etc/apt/sources.list.d/cuda*
sudo rm /etc/apt/sources.list.d/nvidia-ml*

2. Add NVIDIA’s Repository

Determine the repository name for your Ubuntu release. NVIDIA's repository paths use names such as ubuntu2204, not the codename that lsb_release -sc returns:

distribution=ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')

Download the repository pin file, then install the CUDA keyring package, which registers both the signing key and the repository (this replaces the deprecated apt-key method):

wget https://developer.download.nvidia.com/compute/cuda/repos/${distribution}/x86_64/cuda-${distribution}.pin
sudo mv cuda-${distribution}.pin /etc/apt/preferences.d/cuda-repository-pin-600

wget https://developer.download.nvidia.com/compute/cuda/repos/${distribution}/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb

sudo apt-get update

3. Install the Latest NVIDIA Driver

Choose the driver version that best suits your GPU and workload. For example, to install the open driver for many enterprise GPUs:

sudo apt-get install -y nvidia-driver-565-open

If this version is not available, install the cuda-drivers meta-package, which tracks the newest driver in the NVIDIA repository:

sudo apt-get install -y cuda-drivers
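
Whether a given driver package is installable can be checked before attempting the install. The sketch below parses apt-cache policy output for its Candidate line; the helper names are hypothetical and the sample output is illustrative, not captured from a real system.

```shell
#!/usr/bin/env bash
# Prints the candidate version from `apt-cache policy <pkg>` output on stdin.
candidate_version() {
    awk '$1 == "Candidate:" { print $2 }'
}

# Succeeds only if apt knows the package and has a version it could install.
has_candidate() {
    local v
    v=$(candidate_version)
    [ -n "$v" ] && [ "$v" != "(none)" ]
}

# Demo on made-up output; on a real host use something like:
#   pkg=nvidia-driver-565-open
#   apt-cache policy "$pkg" | has_candidate && sudo apt-get install -y "$pkg"
sample='nvidia-driver-565-open:
  Installed: (none)
  Candidate: 565.57.01-0ubuntu1
  Version table:'
printf '%s\n' "$sample" | candidate_version
```

An unknown package produces no output from apt-cache policy, which `has_candidate` treats the same as a candidate of (none).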

4. Install CUDA After Updating the Driver

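After updating your driver, install the CUDA toolkit. Prefer cuda-toolkit over the cuda meta-package here, because cuda also pulls in a driver and can replace the one you just selected:

sudo apt-get install -y cuda-toolkit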

5. Finalize Installation

Reboot your system to load the new driver and CUDA libraries:

sudo reboot

After rebooting, confirm the installation:

nvidia-smi
nvcc --version
