Setup LLM environment for Nvidia 5070Ti

The original post, in Simplified Chinese, can be viewed here; this is an auto-translated version.

Environment Setup

For AI research, Linux is undoubtedly the best choice of operating system. However, to avoid interfering with my gaming, I installed a Windows 11 / Ubuntu 24.10 dual-boot system following this guide. On the Ubuntu side, I use the frp tool for remote SSH access within the LAN (it is managed via systemctl so that it starts automatically; see this blog post for details).
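For reference, a minimal systemd unit for the frp client could look like the sketch below. The binary path, the /etc/frp/frpc.toml config location, and the unit name are assumptions; adjust them to your own installation.

sudo tee /etc/systemd/system/frpc.service > /dev/null <<'EOF'
[Unit]
Description=frp client (remote SSH access over the LAN)
After=network-online.target

[Service]
# Paths below are assumptions; point them at your frpc binary and config
ExecStart=/usr/local/bin/frpc -c /etc/frp/frpc.toml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now frpc   # start now and auto-start on boot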

Graphics Driver

Let's get started with the first challenge: the graphics driver. Since this GPU model is too new, it is currently not possible to obtain the driver through Ubuntu's built-in Software & Updates. Therefore, you need to manually add a software update source to install the driver.

sudo apt install build-essential
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update

Next, you need to stop the graphical desktop environment and perform the installation from a text console (this step is crucial; otherwise the GPU driver installation may freeze, and you might have to reinstall the system). Press Ctrl + Alt + F3 to switch to tty3. After logging in, stop the desktop:

sudo systemctl stop gdm3  # gnome

Then install the driver. In the installation interface, temporarily choose the MIT (open-source) driver, as the proprietary driver currently cannot be detected by the nvidia-smi command:

sudo apt install nvidia-driver-570

After the installation is complete, restart the system:

reboot

Then verify using the nvidia-smi command. If the GPU is detected, it means the installation was successful.

Sun Mar 23 22:15:50 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 Ti     Off |   00000000:01:00.0  On |                  N/A |
|  0%   27C    P5             38W /  300W |     130MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

CUDA

According to this important official guide, the 50 series GPUs can only use CUDA 12.8 for now. Follow the official installation guide to download and install. The key installation commands are provided below for reference:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda-repo-ubuntu2404-12-8-local_12.8.1-570.124.06-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2404-12-8-local_12.8.1-570.124.06-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2404-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8

To be safe, add CUDA to your path after installation (either temporarily, or permanently by editing ~/.bashrc):

export PATH="/usr/local/cuda-12.8/bin:$PATH"
export CUDA_HOME="/usr/local/cuda-12.8"
export LIBRARY_PATH="/usr/local/cuda-12.8/lib64:/usr/local/cuda-12.8/lib64/stubs:$LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH"

Open a new terminal or run source ~/.bashrc to apply the changes.
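To confirm that the toolkit is now visible on the path, a quick check (the exact version string depends on your installation):

which nvcc       # should resolve to /usr/local/cuda-12.8/bin/nvcc
nvcc --version   # should report "release 12.8"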

Anaconda

To isolate different Python environments (the system's built-in Python is generally protected and will not let you install packages directly with pip), it is recommended to install Anaconda for environment management. (In fact, conda can also install CUDA; if you need multiple CUDA versions side by side, refer to the CUDA installation documentation.) The following is excerpted from the Anaconda installation guide:

curl -O https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh
bash ~/Anaconda3-2024.10-1-Linux-x86_64.sh
# The following instructions assume the default installation location
source ~/.bashrc

After installation, create a new environment using the following command:

conda create -n llm python=3.12

Python 3.12 is used here because some very new packages (e.g., langchain) require a recent Python version. A newer Python also lets you use more recent, more convenient syntax.

Once the environment is created, you can activate it:

conda activate llm

After that, you can proceed with installations within this environment:

(llm) $ 

PyTorch

torch

According to this important official guide, you can directly install the latest precompiled PyTorch binary with pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128. If you only need to run non-quantized large models, this installation will suffice.
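If you go this route, you can quickly confirm that the wheel ships Blackwell (sm_120) kernels (a minimal check; the exact list will vary with the nightly build):

python -c "import torch; print(torch.__version__, torch.cuda.get_arch_list())"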

Otherwise, since triton and downstream tools such as vllm depend on PyTorch 2.6.0, and following the official triton guide, you will need to compile PyTorch 2.6.0 from source so that this older version gains CUDA 12.8 support.

Before starting the source compilation, install the necessary build tools:

sudo apt-get install cmake git ninja-build

Then pull the code to prepare for the setup:

# Pull the PyTorch 2.6.0-rc9 version, fetching only this version without pulling the entire repository
git clone https://github.com/pytorch/pytorch -b v2.6.0-rc9 --depth 1
cd pytorch
git submodule sync
git submodule update --init --recursive -j 8

# Install the other PyTorch dependencies
pip install -r requirements.txt
pip install mkl-static mkl-include wheel

Although build-essential (which includes the gcc compiler) is already installed, Ubuntu 24.04 ships gcc-14 as the default. According to the unresolved issue pytorch/pytorch#129358, PyTorch currently fails to compile because the bundled fbgemm third-party library is too old (and updating that library does not help when compiling this older PyTorch version). You therefore need to install an older gcc and configure CMake to use it. (Setting only the CC environment variable mentioned in the PyTorch documentation is not enough; you also need to set CXX.) This is excerpted from here:

sudo apt install gcc-13 g++-13
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13

You might also need to update the hardcoded gcc invocation. Modify this section as follows:

# rewrite any compile command issued as gcc-13 so that it uses g++-13 instead
for command in all_commands:
    if command["command"].startswith("gcc-13 "):
        command["command"] = "g++-13 " + command["command"][7:]

Add environment variables and start compiling:

# Start compiling
export CUDA_HOME=/usr/local/cuda-12.8
export CUDA_PATH=$CUDA_HOME
export TORCH_CUDA_ARCH_LIST=Blackwell
python setup.py develop

# export binary distribution (optional)
python setup.py bdist_wheel
ls dist # binary dist is here

After the compilation is complete, you should be able to detect it:

$ pip list | grep torch    # Not sure why it shows version 2.7.0
torch                             2.7.0.dev20250308+cu128
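As an extra sanity check, confirm that the freshly built torch actually sees the GPU (a minimal check; the device name will depend on your card):

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"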

torchvision

Installing torchvision also requires building from source. According to the installation guide, skip the optional dependencies and use the following command directly:

git clone https://github.com/pytorch/vision.git -b v0.21.0-rc8 --depth 1
cd vision
# Considering the previously specified version is GCC 13, we also specify it here
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13
python setup.py develop 

Verify installation completion:

$ pip list | grep torchvision
torchvision                       0.21.0+7af6987

torchaudio

Installing torchaudio follows a similar process. However, during installation, you might need to install dependencies online, so ensure you have a stable internet connection:

git clone https://github.com/pytorch/audio.git -b v2.6.0-rc7 --depth 1
cd audio
# Since GCC 13 was specified earlier, specify it here as well
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13
python setup.py develop

Verify installation completion:

$ pip list | grep torchaudio
torchaudio                        2.6.0a0+d883142

triton

Since autoawq, flash-attention, and deepspeed may all depend on the triton package, we will start by installing triton.

Following the official triton documentation, begin by cloning the repository:

git clone https://github.com/triton-lang/triton.git --depth 1
cd triton

Since CUDA library linking errors might occur, set the following environment variables (consider adding them to the global ~/.bashrc file to prevent similar runtime issues in the future):

export TRITON_LIBCUDA_PATH=/usr/local/cuda-12/lib64/stubs

You can then proceed with the compilation:

pip install pybind11
pip install -e python

After completing the compilation and installation, you can verify whether the installation was successful:

$ pip list | grep triton
pytorch-triton                    3.2.0+git4b3bb1f8
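A quick import check (minimal; the version string will differ depending on the commit you built):

python -c "import triton; print(triton.__version__)"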

Flash Attention

Since many large models can use flash attention to speed up inference, we will install it here. According to the flash-attention installation guide, set a conservative parallel job count based on your CPU cores and memory (excessive parallelism can cause the build to hang if memory runs out). Install it with the following command:

MAX_JOBS=4 pip install flash-attn --no-build-isolation

Since Blackwell does not yet support Flash Attention 3, it is generally recommended to use Flash Attention 2. If you encounter the linker error from Dao-AILab/flash-attention#1312, /usr/bin/ld: cannot find -lcuda, it means the TRITON_LIBCUDA_PATH environment variable was not set correctly earlier.
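To confirm the build succeeded (a minimal check; the version number will vary):

python -c "import flash_attn; print(flash_attn.__version__)"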

AutoAWQ

After Triton is successfully installed, installing this package becomes relatively straightforward:

pip install autoawq

At this point, you can run AWQ quantized large models.

Transformers

At this point, you should be able to run local large models efficiently with these prerequisites! By installing Hugging Face's transformers, you should be able to run DeepSeek-R1-Distill-Qwen-14B (AWQ) successfully on the 5070 Ti.

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch

model_name = "casperhansen/deepseek-r1-distill-qwen-14b-awq"
prompt = "你好"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cuda",
    attn_implementation="flash_attention_2",  # use the flash-attn kernels installed above
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
)

streamer = TextStreamer(tokenizer=tokenizer)  # print tokens to stdout as they are generated
text = prompt + "<think>"  # append the R1-style thinking tag to the prompt
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    streamer=streamer,
    do_sample=False,
)

Finally, it outputs (the model replies to the Chinese greeting in Chinese):

<|begin▁of▁sentence|>你好<think>

</think>

你好!很高兴见到你,有什么我可以帮忙的吗?无论是学习、工作还是生活中的问题,都可以告诉我哦!😊<|end▁of▁sentence|>

Monitor usage with watch nvidia-smi as follows:

Sun Mar 23 23:50:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 Ti     Off |   00000000:01:00.0  On |                  N/A |
| 30%   53C    P1            300W /  300W |   10102MiB /  16303MiB |     99%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           10730      C   python                                 9964MiB |
+-----------------------------------------------------------------------------------------+

vLLM

vLLM is often used for large-model inference deployment. Following the experimental build instructions, proceed as follows:

git clone https://github.com/vllm-project/vllm.git --depth 1
cd vllm
python use_existing_torch.py
pip install -r requirements/build.txt
pip install setuptools_scm

Next, you need to modify some environment variables or add them to CMAKE in vllm/setup.py (otherwise, you may encounter errors where Caffe2 cannot find CUDA):

export CUDA_HOME=/usr/local/cuda
export CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
export CUDA_INCLUDE_DIRS=/usr/local/cuda/include

It seems that CUDA 12.8 does not currently support NVTX3 (skipping this step will result in linking errors). Therefore, you need to manually download the NVTX library to a specific path <path>:

git clone https://github.com/NVIDIA/NVTX.git --depth 1

Then, replace the following commented section in ~/anaconda3/envs/llm/lib/python3.12/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake (a file within the torch package in the Python environment):

# nvToolsExt
# if(USE_SYSTEM_NVTX)
#   find_path(nvtx3_dir NAMES nvtx3 PATHS ${CUDA_INCLUDE_DIRS})
# else()
#   find_path(nvtx3_dir NAMES nvtx3 PATHS "${PROJECT_SOURCE_DIR}/third_party/NVTX/c/include" NO_DEFAULT_PATH)
# endif()
find_path(nvtx3_dir NAMES nvtx3 PATHS "<path>/NVTX/c/include") # use custom nvtx3, replace <path> to your path

Continue following the instructions, and you should be able to compile VLLM correctly:

MAX_JOBS=4 VLLM_FLASH_ATTN_VERSION=2 python setup.py develop

Afterward, some follow-up tasks are required, such as resolving the issue where triton cannot be found (ImportError: cannot import name 'Config' from 'triton' (unknown location)). You need to return to the previously cloned triton repository and reinstall it:

cd triton && pip install -e python

If you then hit ImportError: Numba needs NumPy 2.0 or less. Installed NumPy 2.2., downgrade to NumPy 2.0:

pip install numpy==2.0

At this point, you can successfully deploy VLLM! To avoid GPU memory overflow, you need to adjust the default parameters. For example, you can successfully deploy using the following command:

vllm serve ~/.cache/huggingface/hub/models--casperhansen--deepseek-r1-distill-qwen-14b-awq/snapshots/bc43ec1bbf08de53452630806d5989208b4186db --max_num_seqs 2 --gpu_memory_utilization 0.9 --max_model_len 2048 --port 3003

The max_num_seqs parameter represents the number of sequences that can be processed simultaneously, max_model_len specifies the model's context length, and port defines the exposed port number. After setting these parameters, you can use the OpenAI-compatible standard to interact with the model.
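For example, a minimal query sketch against the OpenAI-compatible endpoints (the served model name defaults to the path passed to vllm serve; list it first via /v1/models and substitute it for the <served-model-name> placeholder below):

curl http://localhost:3003/v1/models

curl http://localhost:3003/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served-model-name>",
        "messages": [{"role": "user", "content": "你好"}],
        "max_tokens": 256
      }'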

Manually Suppress egg Warnings

Save the following list as requirements.txt, then run pip install --force-reinstall -r requirements.txt --no-deps to reinstall these packages.

python_json_logger==3.3.0
distro==1.9.0
lm_format_enforcer==0.10.11
dnspython==2.7.0
lark==1.2.2
httpx==0.28.1
openai==1.68.2
msgspec==0.19.0
cloudpickle==3.1.1
uvicorn==0.34.0
depyf==0.18.0
nest_asyncio==1.6.0
python_dotenv==1.0.1
numpy==1.26.4
anyio==4.9.0
transformers==4.50.0
tiktoken==0.9.0
xgrammar==0.1.16
opencv_python_headless==4.11.0.86
websockets==15.0.1
httpcore==1.0.7
partial_json_parser==0.2.1.1.post5
scipy==1.15.2
h11==0.14.0
pyzmq==26.3.0
shellingham==1.5.4
ray==2.44.0
outlines==0.1.11
python_multipart==0.0.20
sentencepiece==0.2.0
watchfiles==1.0.4
compressed_tensors==0.9.2
sniffio==1.3.1
rich_toolkit==0.13.2
mistral_common==1.5.4
jiter==0.9.0
fastapi==0.115.12
airportsdata==20250224
interegular==0.3.3
uvloop==0.21.0
llvmlite==0.43.0
numba==0.60.0
outlines_core==0.1.26
httptools==0.6.4
fastapi_cli==0.0.7
starlette==0.46.1
pycountry==24.6.1
diskcache==5.6.3
typer==0.15.2
astor==0.8.1
blake3==1.0.4
email_validator==2.2.0
llguidance==0.7.9
cachetools==6.0.0b1
prometheus_fastapi_instrumentator==7.1.0
gguf==0.10.0
prometheus_client==0.21.1
fastrlock==0.8.3
cupy_cuda12x==13.4.1

DeepSpeed

DeepSpeed can be used to offload part of the workload to the CPU. First, clone the repository:

git clone https://github.com/deepspeedai/DeepSpeed.git --depth 1
cd DeepSpeed

Try installing directly:

pip install .

You can use ds_report to check for any errors. Generally, you need to perform the following additional steps:

  1. For [WARNING] gds: please install the libaio-dev package with apt, install the library:
sudo apt install libaio-dev
  2. You might then run into ld linking errors; in that case, specify GCC 13 again:
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13
  3. Clone the cutlass repository to <path> and set the CUTLASS_PATH environment variable:
export CUTLASS_PATH=<path>/cutlass
  4. If dskernels cannot be found, install that package:
pip install deepspeed-kernels

Afterward, you can compile and install all sub-packages. Some may fail to compile, which can be ignored:

DS_BUILD_OPS=1 pip install . --global-option="build_ext" --global-option="-j8"

After installation, use the ds_report command to verify:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_lion ............... [YES] ...... [OKAY]
evoformer_attn ......... [YES] ...... [OKAY]
 [WARNING]  FP Quantizer is using an untested triton version (3.3.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [YES] ...... [OKAY]
fused_lion ............. [YES] ...... [OKAY]
gds .................... [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
inference_core_ops ..... [YES] ...... [OKAY]
cutlass_ops ............ [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
ragged_device_ops ...... [YES] ...... [OKAY]
ragged_ops ............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.7
 [WARNING]  using untested triton version (3.3.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]

Afterward, you might need to reinstall triton.
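As a closing illustration of the CPU offload mentioned at the start of this section, a minimal ZeRO stage 2 configuration with optimizer offload might look like the sketch below. The file name ds_config.json, the batch size, and train.py are all illustrative assumptions, not part of the original setup.

cat > ds_config.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  }
}
EOF

# launch your own training script (train.py is a placeholder) with this config
deepspeed --num_gpus 1 train.py --deepspeed_config ds_config.json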
