We have seen the following "tx_timeout" / "hung_queue" error three times across two servers. Each time, packets stopped flowing for some number of seconds and then recovered:
Apr 10 02:04:14 node39 kernel: WARNING: at net/sched/sch_generic.c:297 dev_watchdog+0x276/0x280()
Apr 10 02:04:14 node39 kernel: NETDEV WATCHDOG: p2p1 (i40e): transmit queue 8 timed out
...
Apr 10 02:04:14 node39 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE ------------ 3.10.0-514.6.1.el7.x86_64 #1
Apr 10 02:04:14 node39 kernel: Hardware name: Dell Inc. PowerEdge R620/01W23F, BIOS 2.1.3 11/20/2013
...
Apr 10 02:04:14 node39 kernel: i40e 0000:42:00.0 p2p1: tx_timeout: VSI_seid: 390, Q 8, NTC: 0x113, HWB: 0x116, NTU: 0x116, TAIL: 0x116, INT: 0x1
Apr 10 02:04:14 node39 kernel: i40e 0000:42:00.0 p2p1: tx_timeout recovery level 1, hung_queue 8
Apr 10 02:04:14 node39 kernel: i40e 0000:42:00.0 p2p1: adding 3c:fd:fe:9f:b7:48 vid=0
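To keep track of how often this recurs, something like the following could tally the recovery messages per interface and queue. It is only a sketch: `count_tx_timeouts` is a name I made up, and it assumes the kernel log lines have exactly the shape shown above.

```shell
# Hypothetical helper: read kernel log text on stdin and print one line per
# interface/queue pair with the number of "tx_timeout recovery" messages seen.
count_tx_timeouts() {
    awk '
        /tx_timeout recovery level/ {
            for (i = 2; i < NF; i++) {
                if ($i == "tx_timeout" && $(i + 1) == "recovery") {
                    iface = $(i - 1)        # e.g. "p2p1:"
                    sub(/:$/, "", iface)    # strip the trailing colon
                    queue = $NF             # last field is the hung queue number
                    count[iface " queue " queue]++
                }
            }
        }
        END { for (k in count) print k ": " count[k] }
    ' | sort
}

# Example use (either source should work on CentOS 7):
#   count_tx_timeouts < /var/log/messages
#   journalctl -k | count_tx_timeouts
```

Running this on each node after a test period would show whether the hangs cluster on a particular port or queue.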
This happened within the first three weeks of using Intel X710 dual-port adapters running firmware 4.53 (with supported Intel SFP+ modules), recently installed in a cluster of two-year-old Dell R620s running CentOS 7.3:
node39:/# lspci -vv | grep -A 1 10GbE
pcilib: sysfs_read_vpd: read failed: Input/output error
05:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X710-2
--
05:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
--
42:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X710-2
--
42:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X710

node39:/usr/local/bin# ethtool -i p2p1
driver: i40e
version: 1.5.10-k
firmware-version: 4.53 0x8000206e 0.0.0
expansion-rom-version:
bus-info: 0000:42:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
We have used X710s without issue in a few other servers, but those are HP OEM cards running firmware 4.60:
node93:/# lspci -vv |grep -A 1 10GbE
04:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Hewlett-Packard Company HP Ethernet 10Gb 2-port 562FLR-SFP+ Adapter
--
04:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Hewlett-Packard Company Ethernet 10Gb 562SFP+ Adapter
--
05:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Hewlett-Packard Company HP Ethernet 10Gb 2-port 562SFP+ Adapter
--
05:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
        Subsystem: Hewlett-Packard Company Ethernet 10Gb 562SFP+ Adapter

node93:/# ethtool -i ens2f0
driver: i40e
version: 1.5.10-k
firmware-version: 4.60 0x80001f47 1.3072.0
expansion-rom-version:
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
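For comparing nodes at a glance, the driver and firmware fields can be pulled out of `ethtool -i` with a small filter. This is a sketch only: `nic_versions` is a hypothetical name, and it assumes `ethtool -i` prints one `key: value` pair per line, as it does on these machines.

```shell
# Hypothetical helper: condense `ethtool -i` output (on stdin) to one line
# of the form "<driver> <driver version> fw=<firmware version>".
nic_versions() {
    awk -F': ' '
        $1 == "driver"           { drv = $2 }
        $1 == "version"          { ver = $2 }
        $1 == "firmware-version" {
            split($2, fw, " ")   # keep only the leading version number
            print drv, ver, "fw=" fw[1]
        }
    '
}

# Example use:
#   ethtool -i p2p1 | nic_versions
```

On the two nodes above this would make the only visible difference the firmware: 4.53 on the Dell cards and 4.60 on the HP ones, with the same 1.5.10-k driver on both.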
I have downloaded nvmupdate64e and updated a spare Dell to firmware 5.05, so if this is the correct solution I have confirmed the procedure. However, threads such as "Intel X710 vs VMWare ESX: crash and reboot" give me pause: a crash and reboot would certainly be worse than a 10-20 second transmit hang.
My questions are:
- Has anyone else experienced these tx_timeout / hung_queue issues?
- Is it a known issue? If so, is it an issue with the firmware, with the i40e driver, or with something else such as TSO/GSO (which are currently on, but which I could turn off)?
- If it is a firmware issue, was it corrected between versions 4.53 and 4.60, and is it recommended to flash production machines to 5.05 or to some other version? I could not find a detailed change list.
- Is there a way (such as generating high data rates with iperf) to make the sporadic issue occur reproducibly, so that I can demonstrate whether an attempted fix has worked?
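For the TSO/GSO and iperf questions, this is roughly the trial I have in mind. It is a sketch under assumptions, not a known reproducer: `run` and `stress_trial` are names I made up, the peer address is a placeholder, `iperf3 -s` is assumed to be running on the peer already, and the 8 streams and 600 seconds are arbitrary. I have no evidence yet that sustained load triggers the hang at all.

```shell
# Hypothetical wrapper: with DRY_RUN=1, print the command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical trial: set TSO and GSO to "on" or "off" on one interface,
# then push traffic at a peer for ten minutes with eight parallel streams.
# Usage: stress_trial <interface> <peer address> <on|off>
stress_trial() {
    iface=$1
    peer=$2
    offload=$3
    run ethtool -K "$iface" tso "$offload" gso "$offload"
    run iperf3 -c "$peer" -P 8 -t 600
}

# Example (needs root for ethtool -K; peer address is a placeholder):
#   DRY_RUN=1 stress_trial p2p1 192.0.2.10 off   # show what would be run
#   stress_trial p2p1 192.0.2.10 off             # actually run it
```

After each trial I would check `dmesg` for new "tx_timeout recovery" lines and compare the offload-on and offload-off runs.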
Thanks in advance!