"A tale of a bad decision and a stream of misfortunes..."
It all started with poor Internet experience, not particularly localized to a single site or operation. I'm a Comcast (Xfinity) user, so I'm very familiar with unreliable Internet performance. The interesting thing this time is that the failure was selective. For example, speedtest showed full bandwidth with no delays. Visiting google.com would be OK, but duckduckgo.com would cause the page to take forever to load. I eventually found one file inside the sutterhealth.org website that took one second to load when it should take about 100ms at most. This became my benchmark to see if things were working as they should.
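For anyone who wants a similar benchmark, here is a minimal Python sketch using only the standard library. The function names (`time_call`, `fetch_url`, `summarize`) and the 100 ms default threshold are my own illustrative choices, and the URL you pass in is up to you, since the exact file used above is not given.

```python
import statistics
import time
import urllib.request


def time_call(fetch, repeats=5):
    """Call fetch() `repeats` times and return each duration in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def fetch_url(url, timeout=10):
    """Download the whole body so the timing covers the full transfer."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def summarize(samples_ms, threshold_ms=100.0):
    """Return the median duration and whether it exceeds the threshold."""
    median = statistics.median(samples_ms)
    return {"median_ms": median, "slow": median > threshold_ms}
```

Something like `summarize(time_call(lambda: fetch_url(url)))` can then be run from a LAN host and from the router to compare the two. The median is used so that a single slow outlier does not dominate the result.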
Comcast forces me to use one of their "Gateways" (an XB7) to get unlimited quota. I don't like to have a device outside my control inside my network, so I use a small router on my "edge": a NanoPI R5S SBC running Debian Linux. This is a small ARM box with two 2.5Gbps Ethernet interfaces (lan1, lan2) and a 1Gbps "WAN" port (wan0). This device is connected directly to the Comcast gateway, which is configured in "bridge" mode. I use lan1 for my internal network and lan2 remains unused.
After many tcpdumps, pings and httpings from hosts inside my network, I discovered that the problem did not show when I ran those operations from the router itself (router == my NanoPI edge router.) The signal levels on the gateway seemed OK, but I restarted it anyway. No deal.
Zooming in on the NanoPI, I found many instances of "Link down/link up" on the wan0 interface. This could be the old Realtek Linux driver issue where the interface becomes flaky if RX/TX offloading is turned on (which I use). I had seen this back in the old NanoPI R2S days, but not on the R5S yet. The only way to clear the problem is to reboot.
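Counting these link flaps by hand gets old quickly. Below is a small Python sketch that tallies link up/down events per interface from kernel log lines. The exact message format depends on the driver, so the regular expression is an assumption on my part: it expects something like "wan0: Link is Down" and is deliberately loose.

```python
import re
from collections import Counter

# Assumed message shape: "<iface>: Link is Up ..." or "<iface>: link down".
# Real drivers vary, so adjust the pattern to match your own logs.
LINK_EVENT = re.compile(r"\b(\w+):\s+link\s+(?:is\s+)?(up|down)\b", re.IGNORECASE)


def count_link_events(lines):
    """Count link up/down events per interface from kernel log lines."""
    counts = Counter()
    for line in lines:
        match = LINK_EVENT.search(line)
        if match:
            iface = match.group(1)
            state = match.group(2).lower()
            counts[(iface, state)] += 1
    return counts
```

The input can be any iterable of strings, for example the saved output of dmesg split into lines. A high "down" count on a single interface over a short period is the pattern described above.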
Once the router was rebooted, everything was back online and performance was good again.
"Do not ever ever ever attempt any critical upgrades at night..."
I broke this cardinal rule.
Why leave for tomorrow what you can do today? Let's upgrade this kernel, in hopes that they "improved" the Realtek driver. I went from kernel 6.1.0-17 to 6.1.0-30. No big deal. All done, just a reboot to go and...
Nothing... No connectivity, nothing... :(
Oh well... Time to connect a USB keyboard and HDMI to the router and fix this mess...
I quickly retrieved my HDMI and USB extension cables and connected them to the router. I could see the text mode login prompt but the keyboard was completely unresponsive...
I tried a number of things: keyboard via USB switch, directly connected, but nothing seemed to work. In fact, no device connected to these USB ports in the router would do anything. Probably something is amiss in the Debian installed there (but what? udev is up and running...)
I even grabbed a USB power meter and could see the proper voltage and devices using power when plugged, but the OS was completely oblivious to it.
I could see the boot messages, but had no way to stop them. I filmed the boot process with my phone and reviewed it. This revealed that lan1 was not being detected by the OS, which naturally broke everything.
Time to think of alternatives... Maybe boot with a USB drive?
I located the Debian image I needed to boot (the NanoPI has three image versions: SD card, MMC, and USB). I downloaded it using a corporate laptop connected via tethering to my phone, but the laptop has some corporate security enabled and won't allow anything to be written to USB HD devices... (sigh)
Let's configure the laptop networking so that I can use a fixed IP from my internal network (DHCP was on the router) and do the same on my main workstation, where I can finally write the USB drive. I then discovered after many frustrating minutes that my laptop also has some corporate boogaboo that won't allow me to do that. I had to kill NetworkManager manually and configure the network directly using ifconfig commands.
Once the image was transferred to the desktop, I managed to write and verify the USB drive. I plugged it back into the router and...
The router didn't boot from the USB drive! In fact nothing happened when I turned on the router with the USB drive in it! :( Time to try writing an SD card instead (the router has a micro SD card slot). I went through the same corporate laptop annoyances as before, and copied the file to the desktop so I could write it to an SD card using an adapter.
An almost unused, brand name adapter just refused to work. It would show a new device with size == 0. Tried different USB ports, nothing. Time to scavenge old stuff for some CCP approved low cost USB adapter (which I found). Very slow piece of hardware and it took forever to write 8G of data (the image size) to it, but at least it worked...
Back to the router, I plugged in the SD card and rebooted. Nothing. Nada. Same as before. It just loads the problematic image from the NVME directly. I spent some time reading the information on the NanoPI R5S wiki and discovered that in fact it cannot boot directly from the NVME. It actually boots from internal MMC and that bootloader loads the rest of the OS from the NVME. There's a "Mask" button that, if pressed for four seconds during power on, will erase the MMC, but the risk of doing this was too much and I'd have to reinstall this loader later anyway (OS is installed on the NVME...) Time for another approach, but since this was already 01:30 am, I decided to call it a night.
The Post-it I put on my home office door failed overnight and fell. The cats quickly disappeared with it. It read "Don't open the office door, there are small screws and electronics everywhere and cats CANNOT enter the office". Luckily nobody opened it, but yeah, when it rains...
New day, the fight continues! Time to take the router off the wall, open it and plug the NVME into an NVME/USB adapter. The idea is to make an image of the NVME (before something worse happens) and edit a few files to make it bootable again. My strategy was to disable the firewall, enable lan2 (the working lan port) with a fixed IP, and SSH to the machine. Opening the router and removing the NVME was trivial, but discovering that it does not fit in my USB enclosure was REALLY annoying...
Why? NVME drives can be M-type (one notch) and M/B type (two notches). Usually, adapters have only one notch which will fit both types of NVME devices. Mine had TWO, which means only M/B type devices would fit. The device on the router was a plain M-type (one notch) so it wouldn't fit... :(
With time running out (meetings coming) I had only one option at this point: Use my desktop NVME slots to edit the contents of the NVME and get into the machine. Naturally, the NVME slots on my desktop were under fans and other PC paraphernalia, so I had to move things around which took even more time. I eventually managed to edit the files in the NVME and disable the firewall commands. I also changed the configuration to enable lan2 to use a fixed IP. Yay! I removed the whole thing again, closed the desktop, reinstalled the NVME on the router again and rebooted.
Joy! I can ping the lan2 IP from my workstation! Let's ssh... Connection Refused... :(
I repeated the whole penance again and discovered that for security reasons my SSH daemon was not listening on the new lan2 IP. Quick fix and I finally managed to log in to this machine (about noon, next day).
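A ping that works combined with "Connection Refused" usually means the host is up but nothing is accepting connections on that address and port (a firewall that rejects with a reset looks the same). A rough Python sketch to tell that case apart from a dead host follows. `probe_tcp` is a name I made up, and the three result labels are just illustrative.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', or 'unreachable'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: it is up, but the port is closed
        # on this address (or a firewall actively rejected the attempt).
        return "refused"
    except OSError:
        # Timeouts, routing failures and name resolution errors land here.
        return "unreachable"
```

Calling it with the fixed lan2 address and port 22 would have returned "refused" in the situation above, which points at the daemon's listen configuration rather than at the cabling or the interface.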
With the router up and running and an SSH session open, I could confirm that the problem was that lan1 had disappeared completely. I could still see wan0 and lan2, but not lan1, which didn't even show in a simple lspci command. Running ip l didn't show anything either. Why? It's using the same Realtek module as lan2, which was still working OK. To make matters even weirder, the switch had a yellow (100Mbps) light on, not the usual green (1G) light. Maybe some auto-negotiation issue with the new version of the driver, but why, if lan2 was still working perfectly? Also, I'm fairly sure I had auto-negotiation already turned off. Something to investigate later...
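A check for vanished interfaces is easy to automate. Here is a minimal Python sketch, assuming a Linux system where registered network interfaces appear as entries under /sys/class/net. The function names are hypothetical, and the expected interface list is whatever your own box should have.

```python
import os


def kernel_interfaces(sysfs_net="/sys/class/net"):
    """List the network interface names the kernel currently exposes."""
    return sorted(os.listdir(sysfs_net))


def missing_interfaces(expected, present):
    """Return the expected interface names that are not in `present`."""
    present = set(present)
    return [name for name in expected if name not in present]
```

For this router, `missing_interfaces(["wan0", "lan1", "lan2"], kernel_interfaces())` would have returned `["lan1"]` after the upgrade. Run at boot, a check like this could log a loud warning before anything else depends on the missing port.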
For the moment, the easiest way forward would be to remove the new kernel with apt-get remove --purge. I then discovered that in Debian, wireguard depends on linux-image-arm64, which depends on the latest versioned linux-image-X.XXX-arm64 package. Removing the package for the newer image would also remove Wireguard :(.
I eventually looked into /boot and discovered a config file under /boot/extlinux with what appeared to be a boot menu (even though I've never seen a boot menu during the boot process). I changed the default kernel label to the previous version and the router finally rebooted into the old kernel.
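The edit itself is a one-line change, but it is easy to point the default at a label that does not exist. The Python sketch below is one cautious way to do it, assuming the usual extlinux-style layout with a single `default <name>` line and one `label <name>` line per entry. The function name and the example labels (l0, l1) are hypothetical; check your own file for the real ones.

```python
import re


def set_default_label(conf_text, label):
    """Return conf_text with its `default` line pointing at `label`.

    Raises ValueError if no `label <name>` entry with that name exists
    or if there is no `default` line to rewrite.
    """
    flags = re.IGNORECASE | re.MULTILINE
    # Only lines that start with "label" count; "menu label ..." is skipped.
    labels = re.findall(r"^[ \t]*label[ \t]+(\S+)[ \t]*$", conf_text, flags=flags)
    if label not in labels:
        raise ValueError(f"unknown label {label!r}; available: {labels}")
    new_text, replaced = re.subn(
        r"^([ \t]*default[ \t]+)\S+[ \t]*$",
        lambda m: m.group(1) + label,
        conf_text,
        count=1,
        flags=flags,
    )
    if replaced == 0:
        raise ValueError("no 'default' line found")
    return new_text
```

The function only transforms text, so the result can be reviewed before anything is written back. If the file is regenerated by a tool on kernel updates, a manual edit like this may be overwritten later, so it is worth confirming after the next upgrade.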
Once I got the router up and running again, I was finally able to investigate why USB wasn't working in the first place. As it turns out, I see a lot of "read error, -71" messages in the log when the USB devices are plugged. A quick Google search returns many interesting pointers with suggested solutions, but none of them worked. What really worked was trying another keyboard. Once I plugged a cheap $5 keyboard (instead of my regular Das Keyboard mechanical), everything worked flawlessly. I still don't know why this SBC doesn't want to work with the Das Keyboard, but all this could have been avoided if I had tried another keyboard in the first place (granted, all my spare keyboards were destroyed over time by my kids and I had to scavenge a Logitech wireless somewhere to test it).
- Never ever ever do any kind of even remotely critical maintenance late at night, even if it seems harmless (but I knew that already...)
- Always make sure your headless servers can accept an external monitor and/or USB keyboard. Failing that, make sure you can login using a serial console or USB port as a serial device.
- Don't trust kernel upgrades, particularly on ARM devices.
- Realtek sucks.
- Hardware is finicky. Always test another piece of hardware if you see something weird happening.
- Despite all the setbacks, I managed to put the router back without loss of data.
- A rift in space-time didn't open and suck me to another dimension.
Ouch, this sounds horrific! This is the same sentiment as “do not submit code to production on a Friday afternoon”.
The Realtek woes remind me of a similar period of pain I had with an old onboard Intel NIC in one of my one litre Proxmox hosts.