Hi!
I have a dual E5-2690v3 box based on a Supermicro SYS-2028GR-TR/X10DRG-H, BIOS 1.0c, running Ubuntu 16.04.1 with all current updates.
It has an XL710-QDA2 card (fw 5.0.40043 api 1.5 nvm 5.04 0x80002537, driver 1.5.25; the stock Ubuntu i40e driver 1.4.25 resulted in a crash) that is planned to be used as an iSCSI initiator endpoint. But there seems to be a problem: the log fills up with "RX driver issue detected" messages, and occasionally the iSCSI link resets when a ping times out. This is a critical error, as the mounted device becomes unusable!
So, Question 1: Is there something that can be done to fix the iSCSI behaviour of the XL710 card? When testing the card with iperf (two concurrent sessions; the other end had a 10G NIC), there were no problems. The problems started only once the iSCSI connection was established.
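One thing worth trying (a hedged sketch, not a confirmed fix for this driver/firmware combination): the i40e Flow Director/ATR feature and receive offloads have been implicated in RX stalls under sustained single-flow traffic such as iSCSI, so disabling them and relaxing interrupt moderation may change the behaviour. The interface name p5p1 is taken from the logs; adjust for your setup.

```shell
# Disable ntuple filters (Flow Director side-band) on the iSCSI interface:
ethtool -K p5p1 ntuple off
# Rule out receive-offload interaction:
ethtool -K p5p1 lro off gro off
# Turn off adaptive RX moderation and use a fixed, moderate interval:
ethtool -C p5p1 adaptive-rx off rx-usecs 50
# Verify the resulting offload state:
ethtool -k p5p1
```

If the PF resets stop with offloads disabled, that narrows the problem down and can be re-enabled one feature at a time.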
Question 2: Is there a way to force the card to work in PCI Express 2.0 mode? After several previous failures, the server downgraded the card's link once, and it then became surprisingly stable. I cannot find a way to make that persist, though.
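For what it's worth, the link speed can usually be capped from software by writing the Target Link Speed field of the Link Control 2 register on the upstream (root/bridge) port and retraining the link. A sketch, assuming the bridge at 80:01.0 from the sysfs path in the logs below (verify with `lspci -t` on your system):

```shell
# Read current Link Control 2 (PCIe capability offset 0x30);
# bits 3:0 are the Target Link Speed:
setpci -s 80:01.0 CAP_EXP+30.w
# Set Target Link Speed to 2 = 5.0 GT/s (Gen2), leaving other bits alone:
setpci -s 80:01.0 CAP_EXP+30.w=2:f
# Trigger link retraining: bit 5 of Link Control (offset 0x10):
setpci -s 80:01.0 CAP_EXP+10.w=20:20
# Confirm the negotiated speed on the NIC:
lspci -vv -s 81:00.0 | grep LnkSta
```

This does not survive a reboot by itself; a boot-time script (or a BIOS per-slot link-speed option, if the X10DRG-H exposes one) would be needed to make it persist. That said, running at Gen2 x8 halves the available bandwidth, so it is a workaround rather than a fix.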
Some excerpts from the log files (there are also occasional TX driver issues, but far less frequent than the RX problems):
[ 263.116057] EXT4-fs (sdk): mounted filesystem with ordered data mode. Opts: (null)
[ 321.030246] i40e 0000:81:00.0: RX driver issue detected, PF reset issued
[ 332.512601] i40e 0000:81:00.0: RX driver issue detected, PF reset issued
..lots of the above messages...
[ 481.001787] i40e 0000:81:00.0: RX driver issue detected, PF reset issued
[ 487.183237] NOHZ: local_softirq_pending 08
[ 491.151322] i40e 0000:81:00.0: RX driver issue detected, PF reset issued
..lots of the above messages...
[ 1181.099046] i40e 0000:81:00.0: RX driver issue detected, PF reset issued
[ 1199.852665] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295189627, last ping 4295190878, now 4295192132
[ 1199.852694] connection1:0: detected conn error (1022)
[ 1320.412312] session1: session recovery timed out after 120 secs
[ 1320.412325] sd 10:0:0:0: rejecting I/O to offline device
[ 1320.412331] sd 10:0:0:0: [sdk] killing request
[ 1320.412347] sd 10:0:0:0: [sdk] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 1320.412352] sd 10:0:0:0: [sdk] CDB: Write Same(10) 41 00 6b 40 69 00 00 08 00 00
[ 1320.412356] blk_update_request: I/O error, dev sdk, sector 1799383296
[ 1320.412411] sd 10:0:0:0: rejecting I/O to offline device
..lots of the above messages...
[ 1320.412566] Aborting journal on device sdk-8.
[ 1320.412571] sd 10:0:0:0: rejecting I/O to offline device
[ 1320.412576] JBD2: Error -5 detected when updating journal superblock for sdk-8.
[ 1332.831851] sd 10:0:0:0: rejecting I/O to offline device
[ 1332.831864] EXT4-fs error (device sdk): ext4_journal_check_start:56: Detected aborted journal
[ 1332.831869] EXT4-fs (sdk): Remounting filesystem read-only
[ 1332.831873] EXT4-fs (sdk): previous I/O error to superblock detected
Unloading the kernel module and modprobe-ing it again:
[ 1380.970732] i40e: Intel(R) 40-10 Gigabit Ethernet Connection Network Driver - version 1.5.25
[ 1380.970737] i40e: Copyright(c) 2013 - 2016 Intel Corporation.
[ 1380.987563] i40e 0000:81:00.0: fw 5.0.40043 api 1.5 nvm 5.04 0x80002537 0.0.0
[ 1381.127289] i40e 0000:81:00.0: MAC address: 3c:xx:xx:xx:xx:xx
[ 1381.246815] i40e 0000:81:00.0 p5p1: renamed from eth0
[ 1381.358723] i40e 0000:81:00.0 p5p1: NIC Link is Up 40 Gbps Full Duplex, Flow Control: None
[ 1381.416135] i40e 0000:81:00.0: PCI-Express: Speed 8.0GT/s Width x8
[ 1381.454729] i40e 0000:81:00.0: Features: PF-id[0] VFs: 64 VSIs: 66 QP: 48 RSS FD_ATR FD_SB NTUPLE CloudF DCB VxLAN Geneve NVGRE PTP VEPA
[ 1381.471584] i40e 0000:81:00.1: fw 5.0.40043 api 1.5 nvm 5.04 0x80002537 0.0.0
[ 1381.605866] i40e 0000:81:00.1: MAC address: 3c:xx:xx:xx:xx:xy
[ 1381.712287] i40e 0000:81:00.1 p5p2: renamed from eth0
[ 1381.751417] IPv6: ADDRCONF(NETDEV_UP): p5p2: link is not ready
[ 1381.810607] IPv6: ADDRCONF(NETDEV_UP): p5p2: link is not ready
[ 1381.820095] i40e 0000:81:00.1: PCI-Express: Speed 8.0GT/s Width x8
[ 1381.826141] i40e 0000:81:00.1: Features: PF-id[1] VFs: 64 VSIs: 66 QP: 48 RSS FD_ATR FD_SB NTUPLE CloudF DCB VxLAN Geneve NVGRE PTP VEPA
[ 1647.123056] EXT4-fs (sdk): recovery complete
[ 1647.123414] EXT4-fs (sdk): mounted filesystem with ordered data mode. Opts: (null)
[ 1668.179234] NOHZ: local_softirq_pending 08
[ 1673.994586] i40e 0000:81:00.0: RX driver issue detected, PF reset issued
[ 1676.871805] i40e 0000:81:00.0: RX driver issue detected, PF reset issued
[ 1692.833097] i40e 0000:81:00.0: RX driver issue detected, PF reset issued
[ 1735.179086] NOHZ: local_softirq_pending 08
[ 1767.357902] i40e 0000:81:00.0: RX driver issue detected, PF reset issued
[ 1803.828762] i40e 0000:81:00.0: RX driver issue detected, PF reset issued
After several failures, the card came up in PCI Express 2.0 mode, and it then became stable:
Jan 1 18:44:35 systemd[1]: Started ifup for p5p1.
Jan 1 18:44:35 systemd[1]: Found device Ethernet Controller XL710 for 40GbE QSFP+ (Ethernet Converged Network Adapter XL710-Q2).
Jan 1 18:44:35 NetworkManager[1911]: <info> [1483289075.5028] devices added (path: /sys/devices/pci0000:80/0000:80:01.0/0000:81:00.0/net/p5p1, iface: p5p1)
Jan 1 18:44:35 NetworkManager[1911]: <info> [1483289075.5029] locking wired connection setting
Jan 1 18:44:35 NetworkManager[1911]: <info> [1483289075.5029] get unmanaged devices count: 3
Jan 1 18:44:35 avahi-daemon[1741]: Joining mDNS multicast group on interface p5p1.IPv4 with address xx.xx.xx.xx.
Jan 1 18:44:35 avahi-daemon[1741]: New relevant interface p5p1.IPv4 for mDNS.
Jan 1 18:44:35 NetworkManager[1911]: <info> [1483289075.5577] device (p5p1): link connected
Jan 1 18:44:35 avahi-daemon[1741]: Registering new address record for xx.xx.xx.xx on p5p1.IPv4.
Jan 1 18:44:35 kernel: [11572.541797] i40e 0000:81:00.0 p5p1: NIC Link is Up 40 Gbps Full Duplex, Flow Control: None
Jan 1 18:44:35 kernel: [11572.579303] i40e 0000:81:00.0: PCI-Express: Speed 5.0GT/s Width x8
Jan 1 18:44:35 kernel: [11572.579309] i40e 0000:81:00.0: PCI-Express bandwidth available for this device may be insufficient for optimal performance.
Jan 1 18:44:35 kernel: [11572.579312] i40e 0000:81:00.0: Please move the device to a different PCI-e link with more lanes and/or higher transfer rate.
Jan 1 18:44:35 kernel: [11572.617328] i40e 0000:81:00.0: Features: PF-id[0] VFs: 64 VSIs: 66 QP: 48 RX: 1BUF RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
Jan 1 18:44:35 kernel: [11572.635294] i40e 0000:81:00.1: fw 5.0.40043 api 1.5 nvm 5.04 0x80002537 0.0.0
Jan 1 18:44:35 kernel: [11572.917343] i40e 0000:81:00.1: MAC address: 3c:xx:xx:xx:xx:xx
Jan 1 18:44:35 systemd[1]: Reloading OpenBSD Secure Shell server.
Jan 1 18:44:35 systemd[1]: Reloaded OpenBSD Secure Shell server.
Jan 1 18:44:35 kernel: [11572.921344] i40e 0000:81:00.1: SAN MAC: 3c:xx:xx:xx:xx:xx
Jan 1 18:44:35 NetworkManager[1911]: <warn> [1483289075.9656] device (eth0): failed to find device 14 'eth0' with udev
Jan 1 18:44:35 NetworkManager[1911]: <info> [1483289075.9671] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/13)
Jan 1 18:44:35 kernel: [11572.976596] i40e 0000:81:00.1 p5p2: renamed from eth0
Kind regards,
jpe