This guide covers the setup process for NVIDIA enterprise GPUs (A100, H100, H200) on Linux systems, focusing on Ubuntu LTS distributions (20.04, 22.04, and 24.04). It details the installation of current drivers and the CUDA toolkit, along with optional components for containerized or specialized workloads.
- Prerequisites
- Driver Installation
- CUDA Toolkit Installation
- Optional Components
- Verification
- Troubleshooting
- Additional Resources
- Supported NVIDIA GPU: A100, H100, H200, etc.
- PCIe Slot: PCIe Gen4 x16 slot or better (recommended; H100 and H200 cards support Gen5)
- Power & Cooling: Adequate power supply and cooling solutions
- Motherboard: Server-grade with proper PCIe bifurcation support
- Operating System: Ubuntu LTS (20.04, 22.04, or 24.04)
- Kernel & Build Tools: Ensure Linux kernel headers and basic build tools are installed
Install basic development tools and headers:
sudo apt-get update
sudo apt-get install -y \
build-essential \
linux-headers-$(uname -r) \
software-properties-common \
gnupg
Before installing the latest NVIDIA drivers, ensure that old versions are removed and that you are using updated repository methods.
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get autoremove
sudo apt-key del 7fa2af80
Note: With apt-key deprecated, we now use the NVIDIA CUDA keyring package.
Download and install the CUDA keyring package:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
Download the repository pin file, which gives packages from NVIDIA's repository priority over Ubuntu's own:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')/x86_64/cuda-ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//').pin
sudo mv cuda-ubuntu*.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-get update
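The nested `$(. /etc/os-release; ...)` substitutions above expand to a repository identifier such as `ubuntu2204`. A minimal sketch of that expansion, useful for previewing the URL before downloading anything. The helper names `cuda_repo_id` and `cuda_repo_url` are illustrative and not part of any NVIDIA tooling.

```shell
#!/usr/bin/env bash
# Sketch: preview the NVIDIA repository URL derived from the Ubuntu release.
# cuda_repo_id and cuda_repo_url are illustrative helper names.

# Turn a VERSION_ID such as "22.04" into "ubuntu2204".
cuda_repo_id() {
    printf 'ubuntu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Build the x86_64 repository base URL for that release.
cuda_repo_url() {
    printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/x86_64\n' \
        "$(cuda_repo_id "$1")"
}

# Use the running system's VERSION_ID when /etc/os-release is readable.
if [ -r /etc/os-release ]; then
    version_id=$(. /etc/os-release && printf '%s' "${VERSION_ID:-}")
    echo "Repository base URL: $(cuda_repo_url "$version_id")"
fi
```

If the printed URL does not correspond to a supported Ubuntu release, the wget commands will fail with a 404, which is usually easier to diagnose from the URL than from the dpkg error that follows.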
For modern enterprise GPUs, Ubuntu’s driver query typically recommends the “open” driver series. For example, for an A100 the recommended driver might be:
sudo apt-get install -y nvidia-driver-570-open
If you wish to let Ubuntu automatically choose the best driver for your hardware, run:
sudo ubuntu-drivers autoinstall
Note: For H200 or other new-generation GPUs, verify the recommended driver version using ubuntu-drivers devices, and choose the driver (open or server) that best suits your workload.
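For scripted installs it can help to extract the recommended package name rather than read it by eye. The sketch below assumes that `ubuntu-drivers devices` prints lines of the form `driver   : <package> - distro non-free recommended`; that format is an assumption about current Ubuntu releases, and `recommended_driver` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the package that `ubuntu-drivers devices` flags as recommended.
# Assumes lines shaped like:
#   driver   : nvidia-driver-570-open - distro non-free recommended
# recommended_driver is an illustrative helper name.

recommended_driver() {
    # Reads `ubuntu-drivers devices` output on stdin and prints the first
    # package whose line contains the word "recommended".
    awk '$1 == "driver" && /recommended/ { print $3; exit }'
}

if command -v ubuntu-drivers >/dev/null 2>&1; then
    pkg=$(ubuntu-drivers devices 2>/dev/null | recommended_driver)
    echo "Recommended driver: ${pkg:-none reported}"
fi
```

The printed name can then be passed to `sudo apt-get install -y`, after checking that it is the open or server variant you intend to run.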
Reboot to load the new driver:
sudo reboot
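Once the system is back up, the `nvidia` kernel module should be loaded. If `nouveau` is loaded instead, the open-source driver has usually claimed the GPU. A minimal sketch of an exact-name check follows; unlike a plain `grep nvidia`, it does not match lines that only mention the module as a dependency. `module_loaded` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: report whether specific kernel modules are loaded.
# module_loaded is an illustrative helper name.

module_loaded() {
    # $1: module name; reads lsmod-style text on stdin and succeeds only
    # when the first column matches the name exactly.
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

if command -v lsmod >/dev/null 2>&1; then
    modules=$(lsmod)
    for mod in nvidia nvidia_uvm nouveau; do
        if printf '%s\n' "$modules" | module_loaded "$mod"; then
            echo "$mod: loaded"
        else
            echo "$mod: not loaded"
        fi
    done
fi
```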
Once the driver is updated and working, install the CUDA toolkit to support development and GPU-accelerated applications.
The following command installs the CUDA toolkit. (This package may be updated over time; if you require a specific version, consider downloading the installer from NVIDIA's website.)
sudo apt-get install -y cuda-toolkit
Add the CUDA paths to your shell configuration (e.g., add these lines to your ~/.bashrc):
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Apply the changes:
source ~/.bashrc
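The `${VAR:+:${VAR}}` expansion in the export lines appends a colon and the old value only when the variable is already non-empty. This matters because an empty entry in `PATH` or `LD_LIBRARY_PATH` is interpreted as the current directory. A small sketch of the same expansion, wrapped in a hypothetical helper named `prepend_path`:

```shell
#!/usr/bin/env bash
# Sketch: show how ${VAR:+:${VAR}} avoids a stray ":" when VAR is empty.
# prepend_path is an illustrative helper name.

prepend_path() {
    # $1: directory to put first; $2: current value of the variable (may be empty)
    local dir="$1" current="$2"
    printf '%s\n' "${dir}${current:+:${current}}"
}

prepend_path /usr/local/cuda/lib64 ""             # prints /usr/local/cuda/lib64
prepend_path /usr/local/cuda/bin "/usr/bin:/bin"  # prints /usr/local/cuda/bin:/usr/bin:/bin
```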
Set up the container toolkit to run GPU-accelerated containers:
# Get distribution info
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
# Create directory for keyring if it doesn't exist
sudo mkdir -p /etc/apt/keyrings
# Download and add the NVIDIA GPG key (modern method)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /etc/apt/keyrings/nvidia-container-toolkit.gpg
# Add the repository (modern method)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/etc/apt/keyrings/nvidia-container-toolkit.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Update package lists
sudo apt-get update
# Install NVIDIA container toolkit
sudo apt-get install -y nvidia-container-toolkit
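# Register the NVIDIA runtime with Docker, then restart the Docker service
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker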
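Restarting Docker only helps if Docker knows about the NVIDIA runtime. The sketch below checks for that, assuming the runtime is registered under the key `nvidia` in `/etc/docker/daemon.json`, which is the usual location but may differ on your system. `has_nvidia_runtime` is a hypothetical helper name, and the check is a plain text match rather than a JSON parse.

```shell
#!/usr/bin/env bash
# Sketch: check whether a Docker daemon config mentions an "nvidia" runtime.
# Assumes the config lives at /etc/docker/daemon.json unless a path is given.
# has_nvidia_runtime is an illustrative helper name.

has_nvidia_runtime() {
    local config="${1:-/etc/docker/daemon.json}"
    [ -r "$config" ] && grep -q '"nvidia"' "$config"
}

if has_nvidia_runtime; then
    echo "Docker config references the nvidia runtime"
else
    echo "No nvidia runtime found in Docker config"
fi
```

If the runtime is missing, running `sudo nvidia-ctk runtime configure --runtime=docker` and restarting Docker normally registers it. An end-to-end check on a machine with a working driver is `sudo docker run --rm --gpus all ubuntu nvidia-smi`.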
Install GPUDirect Storage for high-performance data transfer:
sudo apt-get install -y nvidia-gds
If you are using NVLink or NVSwitch interconnects, install and enable the fabric manager:
sudo apt-get install -y nvidia-fabric-manager
sudo systemctl start nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager
After installation, verify that both the driver and CUDA toolkit are functioning correctly.
nvidia-smi
Expected output should display your GPU(s), driver version, and CUDA version. For example:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 570.xxx.xx Driver Version: 570.xxx.xx CUDA Version: 12.x |
|-------------------------------+----------------------+----------------------+
...
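For automated checks, the driver version can be read in machine-friendly form with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` instead of parsing the table. A minimal sketch follows. The 570 threshold is taken from the example above and is only a placeholder, and `driver_major` and `driver_at_least` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: compare the installed driver's major version against a minimum.
# driver_major and driver_at_least are illustrative helper names.

# "570.86.15" -> "570"
driver_major() {
    printf '%s\n' "${1%%.*}"
}

# Succeeds when version $1 has a major number >= $2.
driver_at_least() {
    [ "$(driver_major "$1")" -ge "$2" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # One line per GPU; the first is enough because all GPUs share one driver.
    version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
    if [ -z "$version" ]; then
        echo "nvidia-smi is installed but reported no driver version"
    elif driver_at_least "$version" 570; then
        echo "Driver $version meets the 570 minimum"
    else
        echo "Driver $version is older than 570"
    fi
fi
```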
Confirm that the CUDA compiler is on your PATH:
nvcc --version
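If the shell reports that nvcc cannot be found, the toolkit may still be installed but missing from `PATH`. The sketch below lists any nvcc binaries under `/usr/local/cuda*`, which is where NVIDIA's packages normally install. `find_nvcc` is a hypothetical helper name, and the search root is an assumption you can override.

```shell
#!/usr/bin/env bash
# Sketch: list nvcc binaries under <root>/cuda*/bin.
# find_nvcc is an illustrative helper name; /usr/local is the assumed root.

find_nvcc() {
    local root="${1:-/usr/local}"
    local candidate
    for candidate in "$root"/cuda*/bin/nvcc; do
        # An unmatched glob stays literal and fails the -x test, so nothing prints.
        [ -x "$candidate" ] && printf '%s\n' "$candidate"
    done
    return 0
}

found=$(find_nvcc)
if [ -n "$found" ]; then
    echo "nvcc found at:"
    echo "$found"
else
    echo "No nvcc under /usr/local/cuda*; the toolkit may not be installed"
fi
```

When a binary is listed but `nvcc --version` still fails, the `PATH` export in your shell configuration is the likely culprit.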
Create and run a basic CUDA program:
# Create a simple CUDA test program
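cat > cuda_test.cu << 'EOF'
#include <stdio.h>
__global__ void kernel() { }
int main() {
    kernel<<<1,1>>>();
    // Check both the launch and the synchronization; without these checks
    // the program reports success even when no usable GPU is present.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA test failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA test successful!\n");
    return 0;
}
EOF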
# Compile and run the test program
nvcc cuda_test.cu -o cuda_test
./cuda_test
- nvidia-smi Not Found or Not Displaying Correctly
  - Verify driver installation with:
    dpkg -l | grep nvidia-driver
  - Check if the NVIDIA kernel module is loaded:
    lsmod | grep nvidia
  - Inspect system logs for errors (reading the kernel log requires root on recent Ubuntu releases):
    sudo dmesg | grep -i nvidia
- CUDA Toolkit Not Found
  - Ensure that your PATH and LD_LIBRARY_PATH environment variables are correctly set.
  - Check that the CUDA installation directory exists:
    ls /usr/local/cuda
  - Build and run the CUDA samples for further validation. Recent toolkits no longer bundle them; they are published in NVIDIA's cuda-samples repository on GitHub.
- Performance Issues
  - Confirm PCIe link status (generation and width, maximum and current) with:
    nvidia-smi -q | grep -A 10 "GPU Link Info"
  - Monitor power and thermal data:
    nvidia-smi -q -d POWER,TEMPERATURE
  - Verify compute mode with:
    nvidia-smi -q | grep "Compute Mode"
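The PCIe link check can also be scripted using the `pcie.link.gen.current`, `pcie.link.gen.max`, `pcie.link.width.current`, and `pcie.link.width.max` query fields of nvidia-smi. The sketch below classifies each GPU's link. `pcie_link_status` and its labels are hypothetical, and the parsing assumes the comma-separated `csv,noheader` output format.

```shell
#!/usr/bin/env bash
# Sketch: classify a PCIe link from nvidia-smi CSV output.
# pcie_link_status and its labels are illustrative, not an NVIDIA interface.

# $1: one CSV line "gen_current, gen_max, width_current, width_max"
pcie_link_status() {
    local gen_cur gen_max width_cur width_max value
    IFS=', ' read -r gen_cur gen_max width_cur width_max <<EOF
$1
EOF
    # Treat missing or non-numeric fields (for example "[N/A]") as unknown.
    for value in "$gen_cur" "$gen_max" "$width_cur" "$width_max"; do
        case "$value" in
            ''|*[!0-9]*) echo "unknown"; return 0 ;;
        esac
    done
    if [ "$width_cur" -lt "$width_max" ]; then
        echo "reduced-width"
    elif [ "$gen_cur" -lt "$gen_max" ]; then
        echo "reduced-gen"
    else
        echo "ok"
    fi
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi \
        --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
        --format=csv,noheader 2>/dev/null |
    while IFS= read -r line; do
        echo "$line -> $(pcie_link_status "$line")"
    done
fi
```

A `reduced-gen` result on an idle GPU is not necessarily a fault, since the link generation commonly drops at idle to save power; check again under load. A `reduced-width` result more often points to slot wiring, riser, or seating problems.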
- System Compatibility: Always verify that your hardware and OS are compatible with the chosen driver and CUDA versions.
- Official Documentation: Refer to the latest NVIDIA documentation for enterprise GPUs for any changes.
- Monitoring: For production environments, consider using NVIDIA Data Center GPU Manager (DCGM) to monitor health and performance.
If the required NVIDIA driver version is not available from your default package manager, follow these steps to add NVIDIA’s official repository for the latest drivers:
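Remove any stale NVIDIA repository entries (the -f flag keeps the command from failing when none exist):
sudo rm -f /etc/apt/sources.list.d/cuda*
sudo rm -f /etc/apt/sources.list.d/nvidia-ml*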
Determine the repository identifier for your Ubuntu release. NVIDIA's repository paths use the form ubuntu2204 rather than the release codename:
distribution=ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')
Download the repository pin file, then install the CUDA keyring package, which adds both the signing key and the repository entry without relying on the deprecated apt-key:
wget https://developer.download.nvidia.com/compute/cuda/repos/${distribution}/x86_64/cuda-${distribution}.pin
sudo mv cuda-${distribution}.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/repos/${distribution}/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
Choose the driver version that best suits your GPU and workload. For example, to install the open driver for many enterprise GPUs:
sudo apt-get install -y nvidia-driver-565-open
If this version is not available, install the cuda-drivers metapackage, which tracks the latest driver in NVIDIA's repository:
sudo apt-get install -y cuda-drivers
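After updating your driver, install the CUDA toolkit. The cuda-toolkit metapackage installs the toolkit only, whereas the cuda metapackage also pulls in a driver and can replace the one you just chose:
sudo apt-get install -y cuda-toolkit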
Reboot your system to load the new driver and CUDA libraries:
sudo reboot
After rebooting, confirm the installation:
nvidia-smi
nvcc --version