Created February 7, 2025 06:22
# Calico
#### Pull images
```shell
nerdctl image pull --namespace=k8s.io quay.io/tigera/operator:v1.30.3
nerdctl image pull --namespace=k8s.io m.daocloud.io/docker.io/calico/cni:v3.26.0
nerdctl image pull --namespace=k8s.io m.daocloud.io/docker.io/calico/node:v3.26.0
nerdctl image pull --namespace=k8s.io m.daocloud.io/docker.io/calico/kube-controllers:v3.26.0
nerdctl image pull --namespace=k8s.io m.daocloud.io/docker.io/calico/csi:v3.26.0
nerdctl image pull --namespace=k8s.io m.daocloud.io/docker.io/calico/apiserver:v3.26.0
nerdctl image pull --namespace=k8s.io m.daocloud.io/docker.io/calico/pod2daemon-flexvol:v3.26.0
nerdctl image pull --namespace=k8s.io m.daocloud.io/docker.io/calico/typha:v3.26.0
nerdctl image pull --namespace=k8s.io m.daocloud.io/docker.io/calico/node-driver-registrar:v3.26.0
nerdctl image tag --namespace=k8s.io m.daocloud.io/docker.io/calico/cni:v3.26.0 docker.io/calico/cni:v3.26.0
nerdctl image tag --namespace=k8s.io m.daocloud.io/docker.io/calico/node:v3.26.0 docker.io/calico/node:v3.26.0
nerdctl image tag --namespace=k8s.io m.daocloud.io/docker.io/calico/kube-controllers:v3.26.0 docker.io/calico/kube-controllers:v3.26.0
nerdctl image tag --namespace=k8s.io m.daocloud.io/docker.io/calico/csi:v3.26.0 docker.io/calico/csi:v3.26.0
nerdctl image tag --namespace=k8s.io m.daocloud.io/docker.io/calico/apiserver:v3.26.0 docker.io/calico/apiserver:v3.26.0
nerdctl image tag --namespace=k8s.io m.daocloud.io/docker.io/calico/pod2daemon-flexvol:v3.26.0 docker.io/calico/pod2daemon-flexvol:v3.26.0
nerdctl image tag --namespace=k8s.io m.daocloud.io/docker.io/calico/typha:v3.26.0 docker.io/calico/typha:v3.26.0
nerdctl image tag --namespace=k8s.io m.daocloud.io/docker.io/calico/node-driver-registrar:v3.26.0 docker.io/calico/node-driver-registrar:v3.26.0

mkdir -p images
nerdctl save --namespace=k8s.io quay.io/tigera/operator:v1.30.3 >images/tigera-operator-v1.30.3.tar
nerdctl save --namespace=k8s.io docker.io/calico/cni:v3.26.0 >images/calico-cni-v3.26.0.tar
nerdctl save --namespace=k8s.io docker.io/calico/node:v3.26.0 >images/calico-node-v3.26.0.tar
nerdctl save --namespace=k8s.io docker.io/calico/kube-controllers:v3.26.0 >images/calico-kube-controllers-v3.26.0.tar
nerdctl save --namespace=k8s.io docker.io/calico/csi:v3.26.0 >images/calico-csi-v3.26.0.tar
nerdctl save --namespace=k8s.io docker.io/calico/apiserver:v3.26.0 >images/calico-apiserver-v3.26.0.tar
nerdctl save --namespace=k8s.io docker.io/calico/pod2daemon-flexvol:v3.26.0 >images/calico-pod2daemon-flexvol-v3.26.0.tar
nerdctl save --namespace=k8s.io docker.io/calico/typha:v3.26.0 >images/calico-typha-v3.26.0.tar
nerdctl save --namespace=k8s.io docker.io/calico/node-driver-registrar:v3.26.0 >images/calico-node-driver-registrar-v3.26.0.tar
nerdctl load --namespace=k8s.io <images/tigera-operator-v1.30.3.tar
nerdctl load --namespace=k8s.io <images/calico-cni-v3.26.0.tar
nerdctl load --namespace=k8s.io <images/calico-node-v3.26.0.tar
nerdctl load --namespace=k8s.io <images/calico-kube-controllers-v3.26.0.tar
nerdctl load --namespace=k8s.io <images/calico-csi-v3.26.0.tar
nerdctl load --namespace=k8s.io <images/calico-apiserver-v3.26.0.tar
nerdctl load --namespace=k8s.io <images/calico-pod2daemon-flexvol-v3.26.0.tar
nerdctl load --namespace=k8s.io <images/calico-typha-v3.26.0.tar
nerdctl load --namespace=k8s.io <images/calico-node-driver-registrar-v3.26.0.tar
```
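The repeated per-image pull/tag lines can also be generated with a loop. This sketch only prints the commands (pipe it to `sh` to actually run them); the image list and the `m.daocloud.io` mirror prefix are the same ones used above:

```shell
# Print (not run) one pull and one tag command per Calico image.
for img in cni node kube-controllers csi apiserver pod2daemon-flexvol typha node-driver-registrar; do
  echo "nerdctl image pull --namespace=k8s.io m.daocloud.io/docker.io/calico/$img:v3.26.0"
  echo "nerdctl image tag --namespace=k8s.io m.daocloud.io/docker.io/calico/$img:v3.26.0 docker.io/calico/$img:v3.26.0"
done
```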
#### Install calicoctl
```shell
# node-0
curl -LO https://github.com/projectcalico/calico/releases/download/v3.26.0/calicoctl-linux-amd64
chmod +x ./calicoctl-linux-amd64
cp ./calicoctl-linux-amd64 /usr/local/bin/calicoctl
```
#### Install Calico
```shell
kubectl create ns calico-system
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/tigera-operator.yaml
kubectl -n tigera-operator set image deployments/tigera-operator tigera-operator=quay.io/tigera/operator:v1.30.3
```
#### Use VXLAN mode
```shell
curl https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/custom-resources.yaml -O
kubectl apply -f custom-resources.yaml
```
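Before applying, check that the `Installation` resource in custom-resources.yaml actually requests VXLAN encapsulation; depending on the manifest version the default may be `VXLANCrossSubnet` or `IPIP`. A minimal sketch of the edit, demonstrated on a stand-in fragment (the `/tmp` file is illustrative; run the same `sed` against the downloaded file):

```shell
# Stand-in for the calicoNetwork.ipPools section of custom-resources.yaml.
cat > /tmp/custom-resources-demo.yaml <<'EOF'
      encapsulation: VXLANCrossSubnet
EOF
# Force full VXLAN encapsulation.
sed -i 's/encapsulation: .*/encapsulation: VXLAN/' /tmp/custom-resources-demo.yaml
grep encapsulation /tmp/custom-resources-demo.yaml
```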
#### Tune the Calico configuration
```shell
# wait until the Calico pods are ready
watch kubectl get pods -A -o wide

# Disable IPIP mode
# VXLAN or BGP without encapsulation is supported if using Calico CNI. IPIP (Calico's default encapsulation mode) is not supported. Use the following command to turn off IPIP on the default IP pool.
# https://docs.tigera.io/calico/3.26/getting-started/kubernetes/windows-calico/kubernetes/requirements
kubectl patch felixconfiguration default --type=merge --patch='{"spec":{"ipipEnabled":false}}'

# Set strict affinity to true in the IPAM configuration
# For Linux control nodes using Calico networking, strict affinity must be set to true. This is required to prevent Linux nodes from borrowing IP addresses from Windows nodes:
# https://docs.tigera.io/calico/3.26/getting-started/kubernetes/windows-calico/kubernetes/standard
kubectl patch ipamconfigurations default --type=merge --patch='{"spec": {"strictAffinity": true}}'

# Disable BGP
# Ensure that BGP is disabled since you're using VXLAN. If you installed Calico using the operator, you can do this by:
# https://docs.tigera.io/calico/3.26/getting-started/kubernetes/windows-calico/quickstart
kubectl patch installation default --type=merge --patch='{"spec": {"calicoNetwork": {"bgp": "Disabled"}}}'
```
https://docs.tigera.io/calico/latest/getting-started/kubernetes/self-managed-onprem/onpremises
# Linux
#### Log in to the VMs
```shell
vagrant ssh node-0
vagrant ssh node-1
```
#### Set a password and switch to root
```shell
sudo passwd root
su root
cd /vagrant
```
#### Disable swap and the firewall
- Disable swap: the kubelet requires swap to be off in order to work properly.
- Turn off the firewall.
```shell
sudo swapoff -a
sudo sed -ri 's/.*swap.*/#&/' /etc/fstab
sudo ufw disable
```
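The `sed -ri 's/.*swap.*/#&/'` above comments out every fstab line that mentions swap, which keeps swap disabled across reboots. Here the edit is demonstrated on a throwaway copy (the `/tmp` file and its two entries are made up for illustration):

```shell
# Fake fstab with one root mount and one swap entry.
cat > /tmp/fstab-demo <<'EOF'
UUID=1234-abcd / ext4 defaults 0 1
/swap.img none swap sw 0 0
EOF
# Same edit as applied to /etc/fstab above.
sed -ri 's/.*swap.*/#&/' /tmp/fstab-demo
cat /tmp/fstab-demo   # the swap line is now prefixed with '#'
```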
#### Configure networking
```shell
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter

# set the required sysctl parameters; they persist across reboots
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF

# apply the sysctl parameters without rebooting
sudo sysctl --system
```
#### Install the container runtime
Use the recommended containerd as the container runtime; for details see:
https://github.com/containerd/containerd/blob/main/docs/getting-started.md
```shell
if [ ! -f containerd-1.7.2-linux-amd64.tar.gz ]; then
  curl -LO https://github.com/containerd/containerd/releases/download/v1.7.2/containerd-1.7.2-linux-amd64.tar.gz
fi
tar Cxzvf /usr/local/ containerd-1.7.2-linux-amd64.tar.gz
mkdir -p /usr/local/lib/systemd/system/
curl https://raw.githubusercontent.com/containerd/containerd/main/containerd.service > /usr/local/lib/systemd/system/containerd.service
systemctl daemon-reload
systemctl enable --now containerd

if [ ! -f runc.amd64 ]; then
  curl -LO https://github.com/opencontainers/runc/releases/download/v1.1.7/runc.amd64
fi
install -m 755 runc.amd64 /usr/local/sbin/runc

if [ ! -f cni-plugins-linux-amd64-v1.3.0.tgz ]; then
  curl -LO https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-amd64-v1.3.0.tgz
fi
mkdir -p /opt/cni/bin
tar Cxzvf /opt/cni/bin/ cni-plugins-linux-amd64-v1.3.0.tgz

# https://kubernetes.io/zh-cn/docs/setup/production-environment/container-runtimes/#containerd
mkdir -p /etc/containerd/
containerd config default > /etc/containerd/config.toml
# change the value of SystemdCgroup to true
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd
```
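The `SystemdCgroup` edit can be sanity-checked on a stand-in fragment before touching the real `/etc/containerd/config.toml` (the `/tmp` file below is illustrative; the real file is the full `containerd config default` output):

```shell
# Minimal fragment of the runc runtime options section.
cat > /tmp/containerd-demo.toml <<'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = false
EOF
# Same edit as applied to the real config above.
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /tmp/containerd-demo.toml
grep SystemdCgroup /tmp/containerd-demo.toml
```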
#### Install kubeadm, kubelet, and kubectl
```shell
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-archive-keyring.gpg
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl

# if the network is unreachable, use the Aliyun mirror instead
# sudo curl -s https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | sudo apt-key add -
# echo "deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
```
#### Install nerdctl and pre-pull the images
```shell
curl -LO https://github.com/containerd/nerdctl/releases/download/v1.4.0/nerdctl-1.4.0-linux-amd64.tar.gz
tar Cxzvf /usr/local/bin nerdctl-1.4.0-linux-amd64.tar.gz

nerdctl image pull --namespace=k8s.io registry.k8s.io/kube-apiserver:v1.28.2
nerdctl image pull --namespace=k8s.io registry.k8s.io/kube-controller-manager:v1.28.2
nerdctl image pull --namespace=k8s.io registry.k8s.io/kube-scheduler:v1.28.2
nerdctl image pull --namespace=k8s.io registry.k8s.io/kube-proxy:v1.28.2
nerdctl image pull --namespace=k8s.io registry.k8s.io/pause:3.9
nerdctl image pull --namespace=k8s.io registry.k8s.io/etcd:3.5.7-0
nerdctl image pull --namespace=k8s.io registry.k8s.io/coredns/coredns:v1.10.1

mkdir -p images
nerdctl save --namespace=k8s.io registry.k8s.io/kube-apiserver:v1.28.2 >images/kube-apiserver-v1.28.2.tar
nerdctl save --namespace=k8s.io registry.k8s.io/kube-controller-manager:v1.28.2 >images/kube-controller-manager-v1.28.2.tar
nerdctl save --namespace=k8s.io registry.k8s.io/kube-scheduler:v1.28.2 >images/kube-scheduler-v1.28.2.tar
nerdctl save --namespace=k8s.io registry.k8s.io/kube-proxy:v1.28.2 >images/kube-proxy-v1.28.2.tar
nerdctl save --namespace=k8s.io registry.k8s.io/pause:3.9 >images/pause-3.9.tar
nerdctl save --namespace=k8s.io registry.k8s.io/etcd:3.5.7-0 >images/etcd-3.5.7-0.tar
nerdctl save --namespace=k8s.io registry.k8s.io/coredns/coredns:v1.10.1 >images/coredns-v1.10.1.tar

nerdctl load --namespace=k8s.io <images/kube-apiserver-v1.28.2.tar
nerdctl load --namespace=k8s.io <images/kube-controller-manager-v1.28.2.tar
nerdctl load --namespace=k8s.io <images/kube-proxy-v1.28.2.tar
nerdctl load --namespace=k8s.io <images/kube-scheduler-v1.28.2.tar
nerdctl load --namespace=k8s.io <images/pause-3.9.tar
nerdctl load --namespace=k8s.io <images/coredns-v1.10.1.tar
nerdctl load --namespace=k8s.io <images/etcd-3.5.7-0.tar

kubeadm config images pull --kubernetes-version=1.28.2 # --image-repository=registry.aliyuncs.com/google_containers
```
#### Initialize the control-plane node
Run on the control-plane node only:
```shell
sudo kubeadm init \
  --apiserver-advertise-address=192.168.205.10 \
  --apiserver-cert-extra-sans=192.168.205.10 \
  --pod-network-cidr=192.168.0.0/16 \
  --service-cidr=10.96.0.0/12 \
  --kubernetes-version=1.28.2
# --image-repository=registry.aliyuncs.com/google_containers

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl get nodes
kubectl get pods -A
```
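To script a readiness check on top of `kubectl get nodes`, the `--no-headers` output can be scanned for any non-Ready STATUS column. The helper function and the canned sample below are illustrative (the node names match this Vagrant setup):

```shell
# all_ready reads `kubectl get nodes --no-headers` style lines on stdin
# and succeeds only if every node's STATUS column (field 2) is "Ready".
all_ready() { awk '$2 != "Ready" {bad=1} END {exit bad}'; }

# Canned sample standing in for: kubectl get nodes --no-headers
printf 'node-0 Ready control-plane 5m v1.28.2\nnode-1 Ready <none> 3m v1.28.2\n' \
  | all_ready && echo "all nodes Ready"
```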
#### Join the worker nodes
Run on the worker nodes only:
```shell
# on the control-plane node, print a fresh join command:
# kubeadm token create --print-join-command
# then run the printed command on each worker, e.g.:
# kubeadm join 192.168.205.10:6443 --token col3a3.t1tj94nt38f7ixyp --discovery-token-ca-cert-hash sha256:056b35bc8838de9d6899800d7178eea4ce2813dbbbea9fb53f8ebde13d5b7741
```
#### Modify the kubelet configuration on each node
```shell
# node-0
cat <<EOF | sudo tee /etc/default/kubelet
KUBELET_EXTRA_ARGS="--node-ip=192.168.205.10"
EOF
sudo systemctl restart kubelet

# node-1
cat <<EOF | sudo tee /etc/default/kubelet
KUBELET_EXTRA_ARGS="--node-ip=192.168.205.11"
EOF
sudo systemctl restart kubelet
```
The steps above apply only to this local Vagrant environment; offline/production environments do not need this change.
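Rather than hard-coding each node's IP, the address of the host-only interface can be extracted and reused. This sketch parses canned `ip -4 -o addr show` output; the `eth1` interface name and the sample line are assumptions from this Vagrant setup:

```shell
# Stand-in for: ip -4 -o addr show eth1
sample='3: eth1    inet 192.168.205.10/24 brd 192.168.205.255 scope global eth1'
# Field 4 is the CIDR address; strip the prefix length.
NODE_IP=$(echo "$sample" | awk '{print $4}' | cut -d/ -f1)
echo "KUBELET_EXTRA_ARGS=\"--node-ip=$NODE_IP\""
# → KUBELET_EXTRA_ARGS="--node-ip=192.168.205.10"
```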
# Building a heterogeneous GPU cluster
#### GPU resource management
#### Manual management
Minimal installation:
#### Operator-based automatic management
On a cluster that is already up, the NVIDIA GPU Operator can also be installed via Helm to manage GPU resources automatically:
```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
```
When `driver.enabled=true` is set, the host crashes.
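Given the crash noted above, one workaround is to install the operator with the bundled driver disabled and rely on a host-installed driver instead. A sketch (the `gpu-operator` release name and namespace are common choices, not mandated; `nvidia/gpu-operator` is the chart from the repo added above):

```shell
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false
```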
#### References
# Deploy Meta-Llama-3.1-405B-Instruct (testing)
# Deploy Meta-Llama-3.1-405B-Instruct (production)
#### Deployment type
Because the Meta-Llama-3.1-405B-Instruct model is too large to run on a single GPU, it has to run across multiple GPUs:
```shell
VERSION=v0.5.1
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
```
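LWS (LeaderWorkerSet) runs one leader pod plus `size - 1` worker pods per replica, which is the pattern used to shard a model across several GPU nodes. A minimal illustrative manifest; the name, image, and size below are placeholders, not values from the original deployment:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-demo          # placeholder name
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2               # 1 leader + 1 worker; set to the number of nodes needed
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: nginx    # placeholder; a real deployment uses an inference-server image
```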
#### Deployment tooling
#### References
# Deploy DeepSeek-V3 (production)
As the LLM engine that DeepSeek itself recommends first, SGLang includes a number of optimizations specifically for DeepSeek models to speed up inference.
Below are examples of deploying DeepSeek-V3 with SGLang as a Docker container:
8 x NVIDIA H200 GPUs:
2 x 8 x NVIDIA H20 GPUs:
#### References
https://github.com/sgl-project/sglang/blob/main/benchmark/deepseek_v3/README.md
https://docs.sglang.ai/references/deepseek.html