How To: Thunderbolt Networking on Talos

I have a cluster of 3 Minisforum MS-01 Mini PCs running Talos Linux for my k8s cluster. Previously, I had set these machines up with Debian & K3s and configured Thunderbolt networking by following this video by Jim's Garage, which is based on this gist series from scyto.

Cabling

This is how I have mine connected: each one of my 3 nodes has a cable going to the other two nodes (see the MS-01_TB_Cabling diagram).

Pre-requisites

  1. A Bare-metal cluster of 3 machines

  2. Each of your machines needs 2x Thunderbolt ports

  3. Visit https://factory.talos.dev/ and start the process of setting up an image. Choose the following options:

    • Hardware Type: Bare-metal Machine
    • Talos Linux Version: 1.7.5 (at the time of writing)
    • Machine Architecture: amd64
    • Secure Boot: Off (unless you hate life)
    • System Extensions:
      • siderolabs/i915-ucode
      • siderolabs/intel-ice-firmware
      • siderolabs/intel-ucode
      • siderolabs/thunderbolt
      • siderolabs/util-linux-tools
  4. Add the following extra kernel args to your global patches:

- intel_iommu=on
- iommu=pt
- mitigations=off
- net.ifnames=0
  1. Note: pcie_ports=native might be needed for you. It caused my nodes to crash, but it was required for Buroa's Mac Minis to work. I would exclude it for now and only add it if you run into issues
  2. Now you can prepare to install
  3. You should see install options on the factory page (the talos-install-options screen). Note down the initial install and upgrade links
  4. Build a bootable USB stick with a tool like Rufus, using the "First Boot" ISO shown above
  5. Plug the USB stick into each machine and boot them one at a time.
  6. Once a machine has booted, you can remove the USB stick and move on to the next one.
  7. Ensure you have talosctl installed on your local machine/dev environment etc.
  8. Follow the incredible Cluster Template from Onedr0p
  9. NOTE: Use the upgrade schematic ID in your config.yaml when you set it up.
  10. After you have your cluster up and running, you can check that your extraArgs are in place using:
talosctl -n <node name> edit mc

You should see a section that looks like this:

install:
  diskSelector:
    serial: S73VNU0X303413H
  extraKernelArgs:
    - intel_iommu=on
    - iommu=pt
    - mitigations=off
    - net.ifnames=0
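
To confirm the args actually took effect on the running kernel (they only apply after the install/upgrade that carries them), here is a quick check, assuming talosctl is already pointed at your nodes:

talosctl -n <node name> read /proc/cmdline | tr ' ' '\n' | grep -E 'iommu|mitigations|ifnames'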

First Hurdle

How the heck do you identify your PCI devices on Talos? It turns out the key is a privileged pod on the node (kubectl-node-shell style) combined with talosctl.

Confirming your nodes can see that you have Thunderbolt

  1. In your home-ops repo you should have a .taskfiles directory; if not, create one
  2. Create a folder called Kubernetes and, inside it, a subdirectory called resources
  3. Under the Kubernetes folder, create a file called Taskfile.yaml
  4. In the resources subdirectory create a file called privileged-pod.tmpl.yaml
  5. You should have a structure like this: .taskfiles/Kubernetes/Taskfile.yaml and .taskfiles/Kubernetes/resources/privileged-pod.tmpl.yaml
  6. Add the following to the privileged-pod.tmpl.yaml file and save it
---
apiVersion: v1
spec:
  containers:
    - name: debug
      image: docker.io/library/alpine:latest
      # image: docker.io/library/ubuntu:latest
      command: ["/bin/sh"]
      stdin: true
      stdinOnce: true
      tty: true
      securityContext:
        allowPrivilegeEscalation: true
        privileged: true
      volumeMounts:
        - mountPath: /rootfs
          name: rootfs
        - mountPath: /sys/firmware/efi/efivars
          name: efivars
        - mountPath: /run/containerd
          name: containerd
  dnsPolicy: ClusterFirstWithHostNet
  hostIPC: true
  hostNetwork: true
  hostPID: true
  nodeName: ${node}
  restartPolicy: Never
  volumes:
    - name: rootfs
      hostPath:
        path: /
    - name: efivars
      hostPath:
        path: /sys/firmware/efi/efivars
    - name: containerd
      hostPath:
        path: /run/containerd
  1. In your Taskfile.yaml add the following
---
# yaml-language-server: $schema=https://taskfile.dev/schema.json
version: "1"

vars:
  KUBERNETES_RESOURCES_DIR: "{{.ROOT_DIR}}/.taskfiles/Kubernetes/resources"

tasks:
  privileged:
    desc: Run a privileged pod
    cmd: |
      kubectl run privileged-{{.node}} -i --rm --image=null \
        --overrides="$(yq {{.KUBERNETES_RESOURCES_DIR}}/privileged-pod.tmpl.yaml -o=json | envsubst)"
    env:
      node: "{{.node}}"
    preconditions:
      - test -f {{.KUBERNETES_RESOURCES_DIR}}/privileged-pod.tmpl.yaml
  1. Start up the container using the following command
task kubernetes:privileged node={your node name}
  1. Since the task file created a container using Alpine Linux, we will need to use apk to install a couple of utilities:
apk update && apk add pciutils
  1. These will allow us to get the information we need.
  2. Next run:
lspci | grep -i thunderbolt
  1. You should get a response that looks something like this:
00:07.0 PCI bridge: Intel Corporation Alder Lake-P Thunderbolt 4 PCI Express Root Port #0 (rev 02)
00:07.2 PCI bridge: Intel Corporation Alder Lake-P Thunderbolt 4 PCI Express Root Port #2 (rev 02)
00:0d.0 USB controller: Intel Corporation Alder Lake-P Thunderbolt 4 USB Controller (rev 02)
00:0d.2 USB controller: Intel Corporation Alder Lake-P Thunderbolt 4 NHI #0 (rev 02)
00:0d.3 USB controller: Intel Corporation Alder Lake-P Thunderbolt 4 NHI #1 (rev 02)
  1. If you do, proceed.
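
As an optional extra check (a sketch, not required), the Thunderbolt bus should also be visible in sysfs from the same privileged pod, since the pod sees the host's device tree:

ls /sys/bus/thunderbolt/devices

With cables attached you should see domain and device entries (for example domain0, domain1, 0-0, 1-0); exact names will vary by machine.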

Setting up Thunderbolt Networking

  1. First, let's take a look at our links:
talosctl -n stanton-01 get links
  1. You should be able to see something like this:
NODE        NAMESPACE   TYPE        ID              VERSION     TYPE    KIND    HW ADDR             OPER STATE      LINK STATE
stanton-01  network     LinkStatus  thunderbolt0    3           ether           02:9f:19:a1:73:94   up              false
stanton-01  network     LinkStatus  thunderbolt1    3           ether           02:35:20:8c:5f:42   up              false
  1. Next, let's get a more specific look at the Thunderbolt links:
talosctl get links -oyaml | more
  1. Once this shows up, press / then type thunderbolt and press Enter
  2. You should see something like this:
---
node: 10.90.3.101
metadata:
    namespace: network
    type: LinkStatuses.net.talos.dev
    id: cilium_host
/thunderbolt
...skipping
    id: thunderbolt0
    version: 3
    owner: network.LinkStatusController
    phase: running
    created: 2024-07-17T02:11:30Z
    updated: 2024-07-17T02:11:35Z
spec:
    index: 44
    type: ether
    linkIndex: 0
    flags: UP,BROADCAST,RUNNING,MULTICAST,LOWER_UP
    hardwareAddr: 02:9f:19:a1:73:94
    permanentAddr: 02:9f:19:a1:73:94
    broadcastAddr: ff:ff:ff:ff:ff:ff
    mtu: 65520
    queueDisc: pfifo_fast
    operationalState: up
    kind: ""
    slaveKind: ""
    busPath: 1-1.0
    driver: thunderbolt-net
    driverVersion: 6.6.33-talos
    linkState: false
    port: Other
    duplex: Unknown
---
node: 10.90.3.101
metadata:
    namespace: network
    type: LinkStatuses.net.talos.dev
    id: thunderbolt1
    version: 3
    owner: network.LinkStatusController
    phase: running
    created: 2024-07-17T02:01:13Z
    updated: 2024-07-17T02:01:14Z
spec:
    index: 41
    type: ether
    linkIndex: 0
    flags: UP,BROADCAST,RUNNING,MULTICAST,LOWER_UP
    hardwareAddr: 02:35:20:8c:5f:42
    permanentAddr: 02:35:20:8c:5f:42
    broadcastAddr: ff:ff:ff:ff:ff:ff
    mtu: 65520
    queueDisc: pfifo_fast
    operationalState: up
    kind: ""
    slaveKind: ""
    busPath: 0-1.0
    driver: thunderbolt-net
    driverVersion: 6.6.33-talos
    linkState: false
    port: Other
    duplex: Unknown
---
  1. Note down the two bus paths:
id: thunderbolt0
busPath: 1-1.0

id: thunderbolt1
busPath: 0-1.0
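
If you want to pull just these values for every node without paging through the full output, a small loop works too (a sketch; the node names are my examples, swap in yours):

for node in stanton-01 stanton-02 stanton-03; do
  echo "--- ${node}"
  for link in thunderbolt0 thunderbolt1; do
    talosctl -n "${node}" get links "${link}" -o yaml | grep -E 'id: thunderbolt|busPath'
  done
done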
  1. Next we need to know HOW things are connected.
  2. Go to your nodes and unplug and replug each cable one at a time:
Node01 - Cable One
Node01 - Cable Two
Node02 - Cable One
Node02 - Cable Two
Node03 - Cable One
Node03 - Cable Two
  1. Now let's see how the machines are connected. Run the following command:
talosctl -n stanton-01 dmesg | grep thunderbolt
  1. You are looking for lines like this:
stanton-01: kern:    info: [2024-07-17T00:24:12.071257675Z]: thunderbolt 0-1: Intel Corp. stanton-02
stanton-01: kern:    info: [2024-07-17T00:25:41.370465675Z]: thunderbolt 1-1: Intel Corp. stanton-03
  1. This tells you which bus path (and thus which Thunderbolt network interface) is connected to which machine
  2. Repeat this process for each node
stanton-02: kern:    info: [2024-07-17T00:25:44.855454779Z]: thunderbolt 0-1: Intel Corp. stanton-01
stanton-02: kern:    info: [2024-07-17T00:26:11.798178779Z]: thunderbolt 1-1: Intel Corp. stanton-03

stanton-03: kern:    info: [2024-07-17T00:25:41.495885192Z]: thunderbolt 0-1: Intel Corp. stanton-01
stanton-03: kern:    info: [2024-07-17T00:25:57.912867192Z]: thunderbolt 1-1: Intel Corp. stanton-02
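
If you prefer to collect this in one pass, the same dmesg check can be looped over all nodes (again, swap in your own node names):

for node in stanton-01 stanton-02 stanton-03; do
  echo "--- ${node}"
  talosctl -n "${node}" dmesg | grep "thunderbolt [01]-1:"
done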
  1. Now that we know that, we are ready to configure Talos
  2. Open up your talconfig.yaml in your editor of choice
  3. Look for the nodes: section; you should have an entry under it for each of your nodes
  4. Look for a section under each node called networkInterfaces:
  5. For the first node, add the following deviceSelector entries under any others that may already be there:
- deviceSelector:
    busPath: 0-1.0 # stanton-02
  dhcp: false
  mtu: 65520
  addresses:
    - 169.254.255.101/32
  routes:
    - network: 169.254.255.102/32
      metric: 2048
- deviceSelector:
    busPath: 1-1.0 # stanton-03
  dhcp: false
  mtu: 65520
  addresses:
    - 169.254.255.101/32
  routes:
    - network: 169.254.255.103/32
      metric: 2048
  1. Scroll down and add the following to the second node
- deviceSelector:
    busPath: 0-1.0 # stanton-01
  dhcp: false
  mtu: 65520
  addresses:
    - 169.254.255.102/32
  routes:
    - network: 169.254.255.101/32
      metric: 2048
- deviceSelector:
    busPath: 1-1.0 # stanton-03
  dhcp: false
  mtu: 65520
  addresses:
    - 169.254.255.102/32
  routes:
    - network: 169.254.255.103/32
      metric: 2048
  1. Then the third node:
- deviceSelector:
    busPath: 0-1.0 # stanton-01
  dhcp: false
  mtu: 65520
  addresses:
    - 169.254.255.103/32
  routes:
    - network: 169.254.255.101/32
      metric: 2048
- deviceSelector:
    busPath: 1-1.0 # stanton-02
  dhcp: false
  mtu: 65520
  addresses:
    - 169.254.255.103/32
  routes:
    - network: 169.254.255.102/32
      metric: 2048
  1. Note: Ensure you are adjusting the following:

    • busPath
    • Comments, to match your node names based on the queries from earlier
    • IP addresses used. You can use whatever you like here; I set the last octet to match that of the node's primary network IP for easier identification.
  2. Once that is done, let's set up a global patch for your extraArgs

  3. Navigate to your talos/patches/global folder and create a new file called kernel.yaml

  4. Add the following and save:

machine:
  install:
    extraKernelArgs:
      - intel_iommu=on
      - iommu=pt
      - mitigations=off
      - net.ifnames=0
      # - pcie_ports=native
  1. Note that pcie_ports=native is commented out here; your nodes may need it.
  2. Apply your config and get your nodes rebooted
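
Once the nodes are back up, you can confirm the Thunderbolt addresses and routes landed without opening a shell, using Talos' network resources (the addresses shown are this guide's examples):

talosctl -n stanton-01 get addresses | grep 169.254.255
talosctl -n stanton-01 get routes | grep 169.254.255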

Forcing Thunderbolt to use Intel's Performance Cores

  1. If you haven't already, close down your privileged container and swap the image used in your privileged-pod.tmpl.yaml to the Ubuntu one (comment out the alpine line and uncomment the ubuntu one)


  1. Run the task file again to spin up the Ubuntu container:
task kubernetes:privileged node={your node name}
  1. Check which of your cores are Performance and which are Efficiency using:
lscpu --all --extended
  1. You should see an output like this.
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 4900.0000 400.0000  737.4020
  1    0      0    0 0:0:0:0          yes 4900.0000 400.0000  678.6480
  2    0      0    1 4:4:1:0          yes 4900.0000 400.0000  645.2680
  3    0      0    1 4:4:1:0          yes 4900.0000 400.0000  754.7570
  4    0      0    2 8:8:2:0          yes 5000.0000 400.0000 1336.6210
  5    0      0    2 8:8:2:0          yes 5000.0000 400.0000  923.7330
  6    0      0    3 12:12:3:0        yes 5000.0000 400.0000  407.3450
  7    0      0    3 12:12:3:0        yes 5000.0000 400.0000  413.3750
  8    0      0    4 16:16:4:0        yes 4900.0000 400.0000  426.1340
  9    0      0    4 16:16:4:0        yes 4900.0000 400.0000  676.8810
 10    0      0    5 20:20:5:0        yes 4900.0000 400.0000  615.2590
 11    0      0    5 20:20:5:0        yes 4900.0000 400.0000  400.0000
 12    0      0    6 24:24:6:0        yes 3800.0000 400.0000  883.2140
 13    0      0    7 25:25:6:0        yes 3800.0000 400.0000  671.7470
 14    0      0    8 26:26:6:0        yes 3800.0000 400.0000 1210.9910
 15    0      0    9 27:27:6:0        yes 3800.0000 400.0000 1240.5000
 16    0      0   10 28:28:7:0        yes 3800.0000 400.0000  995.9070
 17    0      0   11 29:29:7:0        yes 3800.0000 400.0000  474.6940
 18    0      0   12 30:30:7:0        yes 3800.0000 400.0000  400.1950
 19    0      0   13 31:31:7:0        yes 3800.0000 400.0000 1114.0560
  1. Take note of which CPU IDs are Performance cores (your Efficiency cores are the ones with the lower MAXMHZ).
  2. In this example CPU 0-11 are Performance cores, whilst 12-19 are Efficiency Cores
  3. Run the following command to pin the Thunderbolt interrupts to the Performance cores (update the echoed value to match your CPU IDs):
grep thunderbolt /proc/interrupts | cut -d ":" -f1 | xargs -I {} sh -c 'echo 0-11 | tee "/proc/irq/{}/smp_affinity_list"'
  1. You should get a response like:
0-11
0-11
0-11
0-11
0-11
0-11
0-11
0-11
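
Before exiting, you can double-check things from the same Ubuntu pod. Recent kernels expose the hybrid core split directly in sysfs, and you can read the affinity back to confirm the change stuck (a sketch; the sysfs paths assume an Alder Lake-style hybrid CPU):

# P-core and E-core lists as reported by the hybrid PMU devices
cat /sys/devices/cpu_core/cpus
cat /sys/devices/cpu_atom/cpus

# Read the affinity back for every thunderbolt IRQ; each line should now show your P-core range
grep thunderbolt /proc/interrupts | cut -d ":" -f1 | xargs -I {} cat "/proc/irq/{}/smp_affinity_list"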
  1. Exit out of your shell container

Time to test

  1. Open up 3x Terminal windows or tabs
  2. In each, stand up an Ubuntu privileged container using the same command as before:
task kubernetes:privileged node={your node name}
  1. Once in the shell, run the following commands on each:
apt update
apt install iperf3 pciutils
  1. Once done, run the following command on your first node (this will start the iperf3 server):
iperf3 -s -B 169.254.255.101
  1. This binds the server to the IP address you set for Thunderbolt
  2. Now go over to the shell for your second node and run:
iperf3 -c 169.254.255.101 -B 169.254.255.102 -R
  1. Once complete, run the equivalent command from your third node's shell:
iperf3 -c 169.254.255.101 -B 169.254.255.103 -R
  1. Pop back to the first node's shell and see the combined tests. You should see something like this:
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 169.254.255.102, port 58349
[  5] local 169.254.255.101 port 5201 connected to 169.254.255.102 port 39333
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.03 GBytes  26.0 Gbits/sec   29   3.06 MBytes
[  5]   1.00-2.00   sec  3.05 GBytes  26.2 Gbits/sec    1   3.06 MBytes
[  5]   2.00-3.00   sec  3.07 GBytes  26.4 Gbits/sec   33   2.68 MBytes
[  5]   3.00-4.00   sec  2.89 GBytes  24.8 Gbits/sec  112   3.31 MBytes
[  5]   4.00-5.00   sec  3.08 GBytes  26.4 Gbits/sec   29   2.68 MBytes
[  5]   5.00-6.00   sec  3.08 GBytes  26.4 Gbits/sec    0   3.00 MBytes
[  5]   6.00-7.00   sec  3.09 GBytes  26.5 Gbits/sec    0   3.00 MBytes
[  5]   7.00-8.00   sec  3.10 GBytes  26.6 Gbits/sec    0   3.00 MBytes
[  5]   8.00-9.00   sec  3.05 GBytes  26.2 Gbits/sec   30   2.75 MBytes
[  5]   9.00-10.00  sec  3.07 GBytes  26.4 Gbits/sec    2   2.75 MBytes
[  5]  10.00-10.00  sec   128 KBytes  1.61 Gbits/sec    0   2.75 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  30.5 GBytes  26.2 Gbits/sec  236             sender
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
Accepted connection from 169.254.255.103, port 49163
[  5] local 169.254.255.101 port 5201 connected to 169.254.255.103 port 44701
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  3.01 GBytes  25.8 Gbits/sec
[  5]   1.00-2.00   sec  3.01 GBytes  25.9 Gbits/sec
[  5]   2.00-3.00   sec  3.03 GBytes  26.0 Gbits/sec
[  5]   3.00-4.00   sec  3.05 GBytes  26.2 Gbits/sec
[  5]   4.00-5.00   sec  3.07 GBytes  26.4 Gbits/sec
[  5]   5.00-6.00   sec  3.05 GBytes  26.1 Gbits/sec
[  5]   6.00-7.00   sec  2.40 GBytes  20.6 Gbits/sec
[  5]   7.00-8.00   sec  3.03 GBytes  26.0 Gbits/sec
[  5]   8.00-9.00   sec  2.93 GBytes  25.2 Gbits/sec
[  5]   9.00-10.00  sec  3.02 GBytes  26.0 Gbits/sec
[  5]  10.00-10.00  sec  1.88 MBytes  18.2 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  29.6 GBytes  25.4 Gbits/sec                  receiver
  1. You can ignore the last line of each test, as that's just the remnant of the final partial interval. You now have nice, fast Thunderbolt networking running on your nodes!
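
Both tests above terminate on the first node, so the direct link between the second and third nodes never gets exercised. If you want to check it as well, the same pattern works between that pair (using this guide's example addresses):

# On the second node's shell
iperf3 -s -B 169.254.255.102

# On the third node's shell
iperf3 -c 169.254.255.102 -B 169.254.255.103 -R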

Persisting your CPU changes

  1. In order to persist your Performance-core changes, you will need something that reapplies them every time your nodes boot or reboot.
  2. This can be achieved using irqbalance.
  3. Here is an example container that does just that. NOTE: You will need to update the ban list to match your CPU's E-cores (a sketch for deriving the mask format follows this list).
  4. Special thanks to buroa for all his help getting this set up (and for the aforementioned container).
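
For reference, irqbalance accepts its ban list either as a CPU list (newer releases, via the IRQBALANCE_BANNED_CPULIST environment variable) or as a hex CPU mask (IRQBALANCE_BANNED_CPUS). If your build only takes the mask form, you can derive it from your E-core range; a sketch for the 12-19 example above:

# Build a hex mask with bits 12-19 set (the E-cores from the lscpu output earlier)
printf 'IRQBALANCE_BANNED_CPUS=%x\n' $(( ((1 << 20) - 1) & ~((1 << 12) - 1) ))
# -> IRQBALANCE_BANNED_CPUS=ff000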