The original post in Simplified Chinese could be viewed here. This is the auto-translated version.
For AI research, using the Linux operating system is undoubtedly the best choice. However, to avoid interfering with my gaming, I installed a Windows 11-Ubuntu 24.10 dual-boot system following this guide. On the Ubuntu system, I used the frp tool to enable remote SSH access within the LAN (auto-start is managed via systemctl; for more details, see this blog post).
Let's get started with the first challenge: the graphics driver. Since this GPU model is too new, it is currently not possible to obtain the driver through Ubuntu's built-in Software & Updates. Therefore, you need to manually add a software update source to install the driver.
sudo apt install build-essential
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
Next, you need to stop the graphical desktop environment and switch to a text console (TTY) for the installation (this step is crucial; otherwise, the GPU driver installation may freeze and you might have to reinstall the system). Press Ctrl + Alt + F3 to switch to tty3. After logging in, stop the desktop:
sudo systemctl stop gdm3 # gnome
Then start installing the driver. In the driver selection interface, temporarily use the MIT-licensed driver, as the proprietary driver has been verified to not be detected by the nvidia-smi command for now:
sudo apt install nvidia-driver-570
After the installation is complete, restart the system:
reboot
Then verify with the nvidia-smi command. If the GPU is detected, the installation was successful.
Sun Mar 23 22:15:50 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5070 Ti Off | 00000000:01:00.0 On | N/A |
| 0% 27C P5 38W / 300W | 130MiB / 16303MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
According to this important official guide, the 50-series GPUs can only use CUDA 12.8 for now. Follow the official installation guide to download and install it. The key installation commands are provided below for reference:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda-repo-ubuntu2404-12-8-local_12.8.1-570.124.06-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2404-12-8-local_12.8.1-570.124.06-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2404-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
To be safe, after installation, add CUDA to the PATH (either temporarily, or permanently by editing ~/.bashrc):
export PATH="/usr/local/cuda-12.8/bin:$PATH"
export CUDA_HOME="/usr/local/cuda-12.8"
export LIBRARY_PATH="/usr/local/cuda-12.8/lib64:/usr/local/cuda-12.8/lib64/stubs:$LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH"
Open a new terminal or run source ~/.bashrc to apply the changes.
To isolate different Python environments (the system's built-in Python environment is generally protected and cannot install packages directly via pip), it is recommended to install Anaconda for environment management (in fact, conda can also install CUDA; if you need to install multiple versions of CUDA, refer to the CUDA installation documentation). The following is excerpted from the Anaconda installation guide:
curl -O https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh
bash ~/Anaconda3-2024.10-1-Linux-x86_64.sh
# The following instructions assume the default installation location
source ~/.bashrc
After installation, create a new environment using the following command:
conda create -n llm python=3.12
Python 3.12 is used here because some very new packages (e.g., langchain) require a recent Python version. Using a newer Python also lets you take advantage of some newer, more convenient syntax.
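For instance, Python 3.12 supports the PEP 695 generics syntax, which removes most of the old TypeVar boilerplate (a small illustration of the kind of syntax meant here, not a setup step):

# PEP 695 (Python 3.12+): type aliases and generic functions without TypeVar
type Pair[T] = tuple[T, T]

def first[T](items: list[T]) -> T:
    return items[0]

print(first([1, 2, 3]))        # 1
print(first(["a", "b", "c"]))  # a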
Once the environment is created, you can activate it:
conda activate llm
After that, you can proceed with installations within this environment:
(llm) $
According to this important official guide, you can directly install the latest precompiled binary of PyTorch using the following command: pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128. If you only need to run non-quantized large models, this installation will suffice.
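A quick sanity check (a minimal sketch) confirms that the installed PyTorch sees the Blackwell card and was built against CUDA 12.8:

import torch

# expected: True, the 5070 Ti name, 12.8, and (12, 0) for Blackwell (sm_120)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.version.cuda)
print(torch.cuda.get_device_capability(0))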
If, however, you plan to use triton and downstream tools such as vllm that depend on PyTorch 2.6.0, then, following the official triton guide, you will need to compile PyTorch 2.6.0 from source so that this older version gains CUDA 12.8 support.
Before starting the source compilation, install the necessary build tools:
sudo apt-get install cmake git ninja-build
Then pull the code to prepare for the setup:
# Pull the PyTorch 2.6.0-rc9 version, fetching only this version without pulling the entire repository
git clone https://github.com/pytorch/pytorch -b v2.6.0-rc9 --depth 1
cd pytorch
git submodule sync
git submodule update --init --recursive -j 8
# Installing Other PyTorch Dependencies
pip install -r requirements.txt
pip install mkl-static mkl-include wheel
Although the build-essential package (which includes the gcc compiler) has already been installed, Ubuntu 24.04 ships gcc-14 as the default. According to the unresolved issue pytorch/pytorch#129358, PyTorch currently fails to compile because the fbgemm third-party library it depends on is outdated (and updating that library does not help when compiling this older PyTorch version). You therefore need to install an older gcc and configure CMake to use it. (Setting only the CC environment variable mentioned in the PyTorch documentation is insufficient; you also need to set CXX.) This is excerpted from here:
sudo apt install gcc-13 g++-13
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13
You might also need to patch the hardcoded gcc invocation. Modify that section as follows:
for command in all_commands:
    if command["command"].startswith("gcc-13 "):
        command["command"] = "g++-13 " + command["command"][7:]
Add environment variables and start compiling:
# Start compiling
export CUDA_HOME=/usr/local/cuda-12.8
export CUDA_PATH=$CUDA_HOME
export TORCH_CUDA_ARCH_LIST=Blackwell
python setup.py develop
# export binary distribution (optional)
python setup.py bdist_wheel
ls dist # binary dist is here
After the compilation is complete, the package should show up:
$ pip list | grep torch # Not sure why it shows version 2.7.0
torch 2.7.0.dev20250308+cu128
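Beyond pip list, it is worth launching one small kernel to confirm that the build really includes the Blackwell architecture (a minimal smoke test):

import torch

# a tiny matmul on the GPU; if the Blackwell (sm_120) kernels were not compiled in,
# this typically fails with "no kernel image is available for execution on the device"
x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
y = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
z = x @ y
torch.cuda.synchronize()
print(z.shape, torch.version.cuda)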
Installing torchvision also requires building from source. According to the installation guide, skip the optional dependencies and run the following directly:
git clone https://github.com/pytorch/vision.git -b v0.21.0-rc8 --depth 1
cd vision
# Considering the previously specified version is GCC 13, we also specify it here
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13
python setup.py develop
Verify the installation:
$ pip list | grep torchvision
torchvision 0.21.0+7af6987
Installing torchaudio follows a similar process. However, during installation, you might need to install dependencies online, so ensure you have a stable internet connection:
git clone https://github.com/pytorch/audio.git -b v2.6.0-rc7 --depth 1
cd audio
# Since GCC 13 was specified earlier, specify it here as well
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13
python setup.py develop
Verify the installation:
$ pip list | grep torchaudio
torchaudio 2.6.0a0+d883142
Since autoawq, flash-attention, and deepspeed may all depend on the triton package, we will start by installing triton. Following the official triton documentation, begin by cloning the repository:
git clone https://github.com/triton-lang/triton.git --depth 1
cd triton
Since CUDA library linking errors might occur, set the following environment variable (consider adding it to the global ~/.bashrc file to prevent similar runtime issues in the future):
export TRITON_LIBCUDA_PATH=/usr/local/cuda-12/lib64/stubs
You can then proceed with the compilation:
pip install pybind11
pip install -e python
After completing the compilation and installation, you can verify whether the installation was successful:
$ pip list | grep triton
pytorch-triton 3.2.0+git4b3bb1f8
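To make sure the freshly built triton can actually JIT-compile and launch kernels on this card, the classic vector-add test works well (a minimal sketch along the lines of the official triton tutorial):

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # each program instance handles one BLOCK_SIZE-wide slice of the vectors
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
print("triton kernel OK")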
Considering that many large models can use flash attention for optimization and acceleration, we will install flash attention here. According to the flash-attention installation guide, set a conservative parallel job count based on your CPU cores and memory (excessive parallelism can hang the build if memory runs out). Install it using the following command:
MAX_JOBS=4 pip install flash-attn --no-build-isolation
Since Blackwell does not currently support Flash Attention 3, it is generally recommended to use Flash Attention 2. If you encounter the error from Dao-AILab/flash-attention#1312 stating /usr/bin/ld: cannot find -lcuda, it means the TRITON_LIBCUDA_PATH environment variable was not set correctly earlier.
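To confirm the build works on this GPU, you can call flash_attn_func on small random tensors (a minimal sketch; the tensors are (batch, seqlen, nheads, headdim) and must be fp16 or bf16 on the GPU):

import torch
from flash_attn import flash_attn_func

# tiny causal self-attention call; fails loudly if the compiled kernels do not match the GPU
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([1, 128, 8, 64])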
After Triton is successfully installed, installing this package becomes relatively straightforward:
pip install autoawq
At this point, you can run AWQ-quantized large models. With these prerequisites in place and Hugging Face's transformers installed, you should be able to run DeepSeek-R1-Distill-Qwen-14B-AWQ on the 5070 Ti:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch

model_name = "casperhansen/deepseek-r1-distill-qwen-14b-awq"
prompt = "你好"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cuda",
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
)
streamer = TextStreamer(tokenizer=tokenizer)

text = prompt + "<think>"
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    streamer=streamer,
    do_sample=False,
)
The final output is:
<|begin▁of▁sentence|>你好<think>
</think>
你好!很高兴见到你,有什么我可以帮忙的吗?无论是学习、工作还是生活中的问题,都可以告诉我哦!😊<|end▁of▁sentence|>
Monitor usage with watch nvidia-smi as follows:
Sun Mar 23 23:50:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5070 Ti Off | 00000000:01:00.0 On | N/A |
| 30% 53C P1 300W / 300W | 10102MiB / 16303MiB | 99% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 10730 C python 9964MiB |
+-----------------------------------------------------------------------------------------+
vLLM is often used for large model inference deployment. Following its experimental instructions for building against an existing PyTorch installation, the steps are:
git clone https://github.com/vllm-project/vllm.git --depth 1
cd vllm
python use_existing_torch.py
pip install -r requirements/build.txt
pip install setuptools_scm
Next, you need to set some environment variables (or add them to the CMake configuration in vllm/setup.py); otherwise, you may encounter errors where Caffe2 cannot find CUDA:
export CUDA_HOME=/usr/local/cuda
export CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
export CUDA_INCLUDE_DIRS=/usr/local/cuda/include
It seems that CUDA 12.8 currently does not ship NVTX3 (skipping this step will result in linking errors), so you need to manually download the NVTX library to a path of your choice, <path>:
git clone https://github.com/NVIDIA/NVTX.git --depth 1
Then, in ~/anaconda3/envs/llm/lib/python3.12/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake (a file within the torch package in the Python environment), replace the section shown commented out below:
# nvToolsExt
# if(USE_SYSTEM_NVTX)
# find_path(nvtx3_dir NAMES nvtx3 PATHS ${CUDA_INCLUDE_DIRS})
# else()
# find_path(nvtx3_dir NAMES nvtx3 PATHS "${PROJECT_SOURCE_DIR}/third_party/NVTX/c/include" NO_DEFAULT_PATH)
# endif()
find_path(nvtx3_dir NAMES nvtx3 PATHS "<path>/NVTX/c/include") # use custom nvtx3, replace <path> to your path
Continue following the instructions, and you should be able to compile vLLM correctly:
MAX_JOBS=4 VLLM_FLASH_ATTN_VERSION=2 python setup.py develop
Afterward, some follow-up fixes are required. For the issue where triton cannot be found (ImportError: cannot import name 'Config' from 'triton' (unknown location)), return to the previously cloned triton repository and reinstall it:
cd triton && pip install -e python
For ImportError: Numba needs NumPy 2.0 or less. Installed NumPy 2.2., install NumPy 2.0:
pip install numpy==2.0
At this point, you can deploy vLLM! To avoid running out of GPU memory, you need to adjust the default parameters. For example, the following command deploys successfully:
vllm serve ~/.cache/huggingface/hub/models--casperhansen--deepseek-r1-distill-qwen-14b-awq/snapshots/bc43ec1bbf08de53452630806d5989208b4186db --max_num_seqs 2 --gpu_memory_utilization 0.9 --max_model_len 2048 --port 3003
The max_num_seqs parameter is the number of sequences that can be processed simultaneously, max_model_len specifies the model's context length, and port defines the exposed port number. With these set, you can interact with the model through the OpenAI-compatible API.
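For example, a minimal client sketch using the openai package (the base URL and port match the command above; the served model name is looked up from the endpoint rather than hardcoded):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible server; the api_key just needs to be non-empty
client = OpenAI(base_url="http://localhost:3003/v1", api_key="EMPTY")
model_id = client.models.list().data[0].id  # the model registered by vllm serve

response = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "你好"}],
    max_tokens=512,
)
print(response.choices[0].message.content)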
Manually Suppress egg Warnings
Save the following snippet as requirements.txt, then run pip install --force-reinstall -r requirements.txt --no-deps to reinstall these packages.
python_json_logger==3.3.0
distro==1.9.0
lm_format_enforcer==0.10.11
dnspython==2.7.0
lark==1.2.2
httpx==0.28.1
openai==1.68.2
msgspec==0.19.0
cloudpickle==3.1.1
uvicorn==0.34.0
depyf==0.18.0
nest_asyncio==1.6.0
python_dotenv==1.0.1
numpy==1.26.4
anyio==4.9.0
transformers==4.50.0
tiktoken==0.9.0
xgrammar==0.1.16
opencv_python_headless==4.11.0.86
websockets==15.0.1
httpcore==1.0.7
partial_json_parser==0.2.1.1.post5
scipy==1.15.2
h11==0.14.0
pyzmq==26.3.0
shellingham==1.5.4
ray==2.44.0
outlines==0.1.11
python_multipart==0.0.20
sentencepiece==0.2.0
watchfiles==1.0.4
compressed_tensors==0.9.2
sniffio==1.3.1
rich_toolkit==0.13.2
mistral_common==1.5.4
jiter==0.9.0
fastapi==0.115.12
airportsdata==20250224
interegular==0.3.3
uvloop==0.21.0
llvmlite==0.43.0
numba==0.60.0
outlines_core==0.1.26
httptools==0.6.4
fastapi_cli==0.0.7
starlette==0.46.1
pycountry==24.6.1
diskcache==5.6.3
typer==0.15.2
astor==0.8.1
blake3==1.0.4
email_validator==2.2.0
llguidance==0.7.9
cachetools==6.0.0b1
prometheus_fastapi_instrumentator==7.1.0
gguf==0.10.0
prometheus_client==0.21.1
fastrlock==0.8.3
cupy_cuda12x==13.4.1
DeepSpeed can be used to balance workloads with the CPU. First, clone the repository:
git clone https://github.com/deepspeedai/DeepSpeed.git --depth 1
cd DeepSpeed
Try installing directly:
pip install .
You can use ds_report to check for any errors. Generally, the following additional steps are needed:
- For [WARNING] gds: please install the libaio-dev package with apt, install the library:
  sudo apt install libaio-dev
- Then, you might encounter ld linking errors; in that case, specify GCC 13 again:
  export CC=/usr/bin/gcc-13
  export CXX=/usr/bin/g++-13
- Clone the cutlass repository to <path> and set the CUTLASS_PATH environment variable:
  export CUTLASS_PATH=<path>/cutlass
- If dskernels cannot be found, install this package:
  pip install deepspeed-kernels
Afterward, you can compile and install all sub-packages. Some may fail to compile, which can be ignored:
DS_BUILD_OPS=1 pip install . --global-option="build_ext" --global-option="-j8"
After installation, verify with the ds_report command:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_lion ............... [YES] ...... [OKAY]
evoformer_attn ......... [YES] ...... [OKAY]
[WARNING] FP Quantizer is using an untested triton version (3.3.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [YES] ...... [OKAY]
fused_lion ............. [YES] ...... [OKAY]
gds .................... [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
inference_core_ops ..... [YES] ...... [OKAY]
cutlass_ops ............ [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
ragged_device_ops ...... [YES] ...... [OKAY]
ragged_ops ............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.7
[WARNING] using untested triton version (3.3.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
Afterward, you might need to reinstall triton.
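As a rough illustration of the "balance workloads with the CPU" idea, here is a minimal ZeRO-3 CPU-offload sketch (the model name and hyperparameters are placeholders rather than part of this setup; adjust them to your own workload):

import deepspeed
from transformers import AutoModelForCausalLM

# hypothetical ZeRO-3 config that offloads parameters and optimizer state to CPU RAM
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # placeholder model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)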