Remediation script for e1000e network driver hang on Proxmox VE

Checks recent system journal entries for "Detected Hardware Unit Hang" messages and, if any are found, resets the network interface.

This is a workaround until the driver or kernel is fixed.

TL;DR: Quick start

  1. Copy the attached script somewhere, e.g. /root/cron/hangcheck2.sh
  2. If your interface name is not eno1 then update the ifup and ifdown commands
  3. Review code for your peace of mind (after all, this script is going to be running as root!)
  4. Give it a test run:
    tim@tim-nuc10:~$ /root/cron/hangcheck2.sh -vv
    Verbosity is: 2
    2025-11-20 10:52:48: No network hang detected, exiting
    
  5. Add entry to root crontab, e.g. run every 10 minutes, starting at 2 minutes past the hour:
    2,12,22,32,42,52 * * * * /root/cron/hangcheck2.sh >> /var/log/hangcheck2.log || echo "CRON JOB FAILED"
    

Reference

https://forum.proxmox.com/threads/e1000-driver-hang.58284/

Background

System tested on

Intel NUC 10 (NUC10i5FNH) with I219-V integrated onboard LAN using e1000e driver:

root@tim-nuc10:~# lspci | grep Ethernet  
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (10) I219-V

root@tim-nuc10:~# ethtool -i eno1  
driver: e1000e  
version: 6.8.12-14-pve  
firmware-version: 0.6-4  
expansion-rom-version:  
bus-info: 0000:00:1f.6  
supports-statistics: yes  
supports-test: yes  
supports-eeprom-access: yes  
supports-register-dump: yes  
supports-priv-flags: yes

Example error from system journal

When the network has hung, these errors occur about every 2 seconds. Verbose output is shown here:

root@tim-nuc10:~# journalctl --since "2025-11-19 05:00:03" --until "2025-11-19 05:00:04" -o verbose
Wed 2025-11-19 05:00:03.609642 AEDT [s=357c571eab6d4965951303c26dfdd3c1;i=1cadd5;b=d3f13914711347d9b09a19db19ba1dd2;m=4>
    _BOOT_ID=d3f13914711347d9b09a19db19ba1dd2
    _MACHINE_ID=4dd372840ba9458ea51b7ea0e1828448
    _HOSTNAME=tim-nuc10
    _RUNTIME_SCOPE=system
    _TRANSPORT=kernel
    SYSLOG_FACILITY=0
    SYSLOG_IDENTIFIER=kernel
    PRIORITY=3
    _KERNEL_SUBSYSTEM=pci
    _KERNEL_DEVICE=+pci:0000:00:1f.6
    _UDEV_SYSNAME=0000:00:1f.6
    _SOURCE_MONOTONIC_TIMESTAMP=4956951916911
    MESSAGE=e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
              TDH                  <e9>
              TDT                  <9>
              next_to_use          <9>
              next_to_clean        <e9>
            buffer_info[next_to_clean]:
              time_stamp           <2273268d5>
              next_to_watch        <ea>
              jiffies              <227716f00>
              next_to_watch.status <0>
            MAC Status             <40080083>
            PHY Status             <796d>
            PHY 1000BASE-T Status  <3800>
            PHY Extended Status    <3000>
            PCI Status             <10>

(Note that the network interface name, "eno1", is in the first line of the MESSAGE field above. The script could parse that line to get the name instead of hard-coding it.)
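The note above about parsing the interface name could be sketched as follows. This is not part of the original script: `parse_hang_iface` is a hypothetical helper, and it assumes the message keeps the shape `e1000e <pci-address> <interface>: Detected Hardware Unit Hang:` shown in the journal entry.

```shell
# Hypothetical helper (not in the original script): read journal lines on
# stdin and print the interface name from the first e1000e hang message.
# Assumed message shape: "e1000e <pci-address> <iface>: Detected Hardware Unit Hang:"
parse_hang_iface() {
    local line
    local re='e1000e [^ ]+ ([^ :]+): Detected Hardware Unit Hang:'
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            # First capture group is the interface name
            printf '%s\n' "${BASH_REMATCH[1]}"
            return 0
        fi
    done
    # No hang message seen on stdin
    return 1
}

# Example with the message line from the journal entry above
sample='e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:'
iface=$(printf '%s\n' "$sample" | parse_hang_iface)
echo "Interface: $iface"
```

In the script, the journalctl output would be piped into the helper in place of the sample line, and the result would be passed to ifdown and ifup. The pattern is not anchored to the start of the line, so a timestamp and hostname prefix in front of the message does not stop it matching.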

Based on that, the hang check looks at the journal for the last two minutes and filters on these fields:

_TRANSPORT=kernel
_KERNEL_SUBSYSTEM=pci
PRIORITY=3

Output from that is then filtered through this, to count the number of hangs:

grep -c "Detected Hardware Unit Hang:"

Complete command looks like this:

if ! hangcount=$(journalctl \
                    --since "2 minutes ago" _TRANSPORT=kernel \
                    _KERNEL_SUBSYSTEM=pci --priority=3 | \
                 grep -c "Detected Hardware Unit Hang:")
then
    log_msg 1 "No network hang detected, exiting"
    exit 0
fi
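
The `if !` test above relies on the exit status of `grep -c`: it always prints a count, but exits with status 1 when that count is zero. A minimal sketch of that behaviour, using canned input instead of journalctl (`count_hangs` is a hypothetical stand-in for the pipeline):

```shell
# Hypothetical stand-in for the journalctl pipeline: count hang lines on stdin.
count_hangs() {
    grep -c "Detected Hardware Unit Hang:"
}

# No matching lines: grep -c still prints "0" but exits with status 1,
# so the "if !" branch is taken.
if ! n=$(printf 'link is up\n' | count_hangs); then
    echo "no hang (count=$n)"
fi

# Two matching lines: grep -c prints "2" and exits with status 0.
if n=$(printf 'eno1: Detected Hardware Unit Hang:\neno1: Detected Hardware Unit Hang:\n' | count_hangs); then
    echo "hang detected (count=$n)"
fi
```

Because the script sets `pipefail`, a failing journalctl also makes the pipeline non-zero, so that case takes the same "No network hang detected" branch.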

#!/bin/bash
# Reset network interface if system journal shows it has hung
# Refer https://forum.proxmox.com/threads/e1000-driver-hang.58284
# 19/11/2025 v1.0 Created simplified version - Tim.

set -euo pipefail

usage() {
    cat <<EOF
Reset network if hang detected in PVE networking, must be run as root.
Options:
-h this help
-v verbose output, specify multiple times to increase verbosity
EOF
}

# Print a timestamped message if its level is within the current verbosity
log_msg() {
    local loglevel="$1"
    local logmsg="$2"
    local timestamp
    timestamp=$(date +"%Y-%m-%d %H:%M:%S")
    if [[ $loglevel -le $VERBOSITY ]]; then
        echo "$timestamp: $logmsg"
    fi
}

#defaults
VERBOSITY=0

while getopts "hv" OPTION
do
    case $OPTION in
        h)
            usage
            exit 0
            ;;
        v)
            VERBOSITY=$((VERBOSITY + 1))
            ;;
        \?)
            usage
            exit 3
            ;;
    esac
done
shift $((OPTIND - 1))

[[ $VERBOSITY -ge 1 ]] && echo "Verbosity is: $VERBOSITY"

# check system journal for recent hang
# (grep -c exits non-zero when the count is 0, which selects the "no hang" branch)
if ! hangcount=$(journalctl \
                    --since "2 minutes ago" _TRANSPORT=kernel \
                    _KERNEL_SUBSYSTEM=pci --priority=3 | \
                 grep -c "Detected Hardware Unit Hang:")
then
    log_msg 1 "No network hang detected, exiting"
    exit 0
fi

log_msg 0 "Hang detected, count is: $hangcount, restarting network"

# need full path, root cron PATH does not include /usr/sbin
/usr/sbin/ifdown eno1; sleep 10; /usr/sbin/ifup eno1

log_msg 2 "Sleeping 10 seconds for good luck"
sleep 10

# problem has been detected, so exit non-zero to get notification from cron
exit 1

@minhoryang commented

Would it make sense to store the previous $hangcount? Otherwise, once a hang is detected, the interface may cycle down/up every 10 minutes.

@thebream (Author) commented Apr 4, 2026

As configured above, the script runs every 10 minutes and checks the system journal for hangs logged in the last 2 minutes. So as long as the journal look-back window is shorter than the cron interval, hang messages that triggered one reset won't be seen again on the next run.

Once the hang occurs, errors are logged to the journal (at least on my system they are) about every 2 seconds, so the script really only needs to look back 1 minute or less.

For the record, the script has recovered my NUC a few times now over the last few months:

tim@tim-nuc10:/var/log$ cat hangcheck2.log
2026-01-08 17:02:01: Hang detected, count is: 60, restarting network
2026-03-14 03:02:01: Hang detected, count is: 60, restarting network
2026-04-01 10:02:01: Hang detected, count is: 60, restarting network
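
If the look-back window were ever made as long as the cron interval, one alternative to storing the previous $hangcount would be a cooldown guard. The sketch below is not part of the script: `reset_allowed`, `record_reset`, `STATE_FILE` and `COOLDOWN_SECS` are hypothetical names, and the 900-second default is an arbitrary assumption.

```shell
# Hypothetical cooldown guard (not in the original script): allow a reset
# only if the previous one happened at least COOLDOWN_SECS ago.
# STATE_FILE and COOLDOWN_SECS are assumed names with assumed defaults.
STATE_FILE="${STATE_FILE:-/run/hangcheck2.last_reset}"
COOLDOWN_SECS="${COOLDOWN_SECS:-900}"

reset_allowed() {
    local now last
    now=$(date +%s)
    # A missing, empty or malformed state file counts as "never reset"
    last=$(cat "$STATE_FILE" 2>/dev/null || true)
    [[ $last =~ ^[0-9]+$ ]] || last=0
    # Exit status 0 (allowed) only when the cooldown has elapsed
    (( now - last >= COOLDOWN_SECS ))
}

record_reset() {
    # Store the current time as seconds since the epoch
    date +%s > "$STATE_FILE"
}

# Intended use, just before the ifdown/ifup line:
#   if reset_allowed; then
#       record_reset
#       ... ifdown / ifup ...
#   else
#       log_msg 0 "Hang detected but last reset was recent, skipping"
#   fi
```

Recording a timestamp rather than the hang count keeps the check independent of how many messages happen to fall inside the look-back window.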
