Hello,
We have installed a PC running Ubuntu 14.04.3, with all updates applied, as a border router:
Linux hellnat 3.19.0-47-generic #53~14.04.1-Ubuntu SMP Mon Jan 18 16:09:14 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
CPU: 2*E5-2690v3 with hyperthreading enabled (so total 48 logical "cores" in OS)
Intel XL710 quad port; every "channel" of every p1p* interface is bound to its own core
It is used as a border router, so it runs BGP. We use p1p1 and p1p3 to connect to internal routers, and p1p2 and p1p3 to connect to the uplinks.
Traffic suddenly stopped, and it was NOT during rush hour.
After a reboot (via IPMI) I found the following lines in the syslog file:
Jan 31 02:33:33 hellnat kernel: [220504.793680] ------------[ cut here ]------------
Jan 31 02:33:33 hellnat kernel: [220504.793701] WARNING: CPU: 45 PID: 0 at /build/linux-lts-vivid-Yt59dr/linux-lts-vivid-3.19.0/net/sched/sch_generic.c:303 dev_watchdog+0x24f/0x260()
Jan 31 02:33:33 hellnat kernel: [220504.793705] NETDEV WATCHDOG: p1p1 (i40e): transmit queue 8 timed out
Jan 31 02:33:33 hellnat kernel: [220504.793707] Modules linked in: nf_conntrack_netlink nfnetlink xt_tcpudp xt_multiport iptable_filter xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw ast ttm joydev intel_rapl iosf_mbi drm_kms_helper x86_pkg_temp_thermal intel_powerclamp drm syscopyarea sysfillrect sysimgblt coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel ipmi_ssif aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich mei_me sb_edac edac_core mei ipmi_si 8250_fintek ipmi_msghandler lp wmi acpi_pad parport ioatdma mac_hid shpchp nf_conntrack_ftp acpi_power_meter nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc tcp_htcp hid_generic i40e(OE) igb vxlan ip6_udp_tunnel i2c_algo_bit udp_tunnel usbhid dca uas configfs ahci ptp usb_storage hid megaraid_sas libahci pps_core
Jan 31 02:33:33 hellnat kernel: [220504.793817] CPU: 45 PID: 0 Comm: swapper/45 Tainted: G OE 3.19.0-47-generic #53~14.04.1-Ubuntu
Jan 31 02:33:33 hellnat kernel: [220504.793820] Hardware name: Supermicro SYS-6018R-WTR/X10DRW-i, BIOS 1.1 08/13/2015
Jan 31 02:33:33 hellnat kernel: [220504.793822] ffffffff81b3fcc0 ffff88105f4a3d58 ffffffff817afcd5 0000000000000000
Jan 31 02:33:33 hellnat kernel: [220504.793827] ffff88105f4a3da8 ffff88105f4a3d98 ffffffff81074dea 0000000000000286
Jan 31 02:33:33 hellnat kernel: [220504.793830] 0000000000000008 ffff88105b65a000 0000000000000040 ffff88105748cf40
Jan 31 02:33:33 hellnat kernel: [220504.793835] Call Trace:
Jan 31 02:33:33 hellnat kernel: [220504.793837] <IRQ> [<ffffffff817afcd5>] dump_stack+0x45/0x57
Jan 31 02:33:33 hellnat kernel: [220504.793857] [<ffffffff81074dea>] warn_slowpath_common+0x8a/0xc0
Jan 31 02:33:33 hellnat kernel: [220504.793860] [<ffffffff81074e66>] warn_slowpath_fmt+0x46/0x50
Jan 31 02:33:33 hellnat kernel: [220504.793869] [<ffffffff816cd69f>] dev_watchdog+0x24f/0x260
Jan 31 02:33:33 hellnat kernel: [220504.793874] [<ffffffff816cd450>] ? dev_graft_qdisc+0x80/0x80
Jan 31 02:33:33 hellnat kernel: [220504.793879] [<ffffffff810dac79>] call_timer_fn+0x39/0x110
Jan 31 02:33:33 hellnat kernel: [220504.793883] [<ffffffff816cd450>] ? dev_graft_qdisc+0x80/0x80
Jan 31 02:33:33 hellnat kernel: [220504.793888] [<ffffffff810dc440>] run_timer_softirq+0x220/0x320
Jan 31 02:33:33 hellnat kernel: [220504.793898] [<ffffffff8104a403>] ? lapic_next_deadline+0x33/0x40
Jan 31 02:33:33 hellnat kernel: [220504.793905] [<ffffffff81078f44>] __do_softirq+0xe4/0x270
Jan 31 02:33:33 hellnat kernel: [220504.793909] [<ffffffff8107930d>] irq_exit+0x9d/0xb0
Jan 31 02:33:33 hellnat kernel: [220504.793916] [<ffffffff817ba78a>] smp_apic_timer_interrupt+0x4a/0x60
Jan 31 02:33:33 hellnat kernel: [220504.793924] [<ffffffff817b87bd>] apic_timer_interrupt+0x6d/0x80
Jan 31 02:33:33 hellnat kernel: [220504.793926] <EOI> [<ffffffff81650510>] ? cpuidle_enter_state+0x70/0x170
Jan 31 02:33:33 hellnat kernel: [220504.793938] [<ffffffff816504fd>] ? cpuidle_enter_state+0x5d/0x170
Jan 31 02:33:33 hellnat kernel: [220504.793943] [<ffffffff816506c7>] cpuidle_enter+0x17/0x20
Jan 31 02:33:33 hellnat kernel: [220504.793949] [<ffffffff810b54d4>] cpu_startup_entry+0x334/0x3d0
Jan 31 02:33:33 hellnat kernel: [220504.793955] [<ffffffff810e9e53>] ? clockevents_register_device+0xe3/0x140
Jan 31 02:33:33 hellnat kernel: [220504.793960] [<ffffffff81048bb7>] start_secondary+0x197/0x1c0
Jan 31 02:33:33 hellnat kernel: [220504.793963] ---[ end trace 43e1a051ade0289e ]---
Jan 31 02:33:33 hellnat kernel: [220504.793973] i40e 0000:81:00.0 p1p1: tx_timeout: VSI_seid: 399, Q 8, NTC: 0xd36, HWB: 0xa1, NTU: 0xa1, TAIL: 0xa1, INT: 0x0
Jan 31 02:33:33 hellnat kernel: [220504.793976] i40e 0000:81:00.0 p1p1: tx_timeout recovery level 1, hung_queue 8
Jan 31 02:33:43 hellnat watchquagga[2972]: zebra state -> unresponsive : no response yet to ping sent 10 seconds ago
Jan 31 02:33:49 hellnat watchquagga[2972]: bgpd state -> unresponsive : no response yet to ping sent 10 seconds ago
Jan 31 02:33:50 hellnat kernel: [220521.908228] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [kworker/13:1:536]
Jan 31 02:33:50 hellnat kernel: [220521.908306] Modules linked in: nf_conntrack_netlink nfnetlink xt_tcpudp xt_multiport iptable_filter xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw ast ttm joydev intel_rapl iosf_mbi drm_kms_helper x86_pkg_temp_thermal intel_powerclamp drm syscopyarea sysfillrect sysimgblt coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel ipmi_ssif aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich mei_me sb_edac edac_core mei ipmi_si 8250_fintek ipmi_msghandler lp wmi acpi_pad parport ioatdma mac_hid shpchp nf_conntrack_ftp acpi_power_meter nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc tcp_htcp hid_generic i40e(OE) igb vxlan ip6_udp_tunnel i2c_algo_bit udp_tunnel usbhid dca uas configfs ahci ptp usb_storage hid megaraid_sas libahci pps_core
Jan 31 02:33:50 hellnat kernel: [220521.908396] CPU: 13 PID: 536 Comm: kworker/13:1 Tainted: G W OE 3.19.0-47-generic #53~14.04.1-Ubuntu
Jan 31 02:33:50 hellnat kernel: [220521.908399] Hardware name: Supermicro SYS-6018R-WTR/X10DRW-i, BIOS 1.1 08/13/2015
Jan 31 02:33:50 hellnat kernel: [220521.908408] Workqueue: events inet_frag_worker
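In case it helps anyone searching a longer syslog for the same event, here is a minimal Python sketch that picks out the watchdog timeouts. The sample input is copied from the excerpt above; the function name `find_tx_timeouts` and the regular expression are my own, written to match the message text as it appears in this log.

```python
import re

# Three lines copied from the syslog excerpt above, used as sample input.
SAMPLE = """\
Jan 31 02:33:33 hellnat kernel: [220504.793705] NETDEV WATCHDOG: p1p1 (i40e): transmit queue 8 timed out
Jan 31 02:33:33 hellnat kernel: [220504.793976] i40e 0000:81:00.0 p1p1: tx_timeout recovery level 1, hung_queue 8
Jan 31 02:33:43 hellnat watchquagga[2972]: zebra state -> unresponsive : no response yet to ping sent 10 seconds ago
"""

# Matches the watchdog message exactly as it is worded in this log.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(text):
    """Return (interface, driver, queue) for each watchdog timeout in text."""
    hits = []
    for line in text.splitlines():
        m = WATCHDOG_RE.search(line)
        if m:
            hits.append((m.group("iface"), m.group("driver"), int(m.group("queue"))))
    return hits


print(find_tx_timeouts(SAMPLE))
```

On a real system the text would come from the syslog file instead of `SAMPLE`.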
The main lines, I think, are:
Jan 31 02:33:33 hellnat kernel: [220504.793705] NETDEV WATCHDOG: p1p1 (i40e): transmit queue 8 timed out
Jan 31 02:33:33 hellnat kernel: [220504.793973] i40e 0000:81:00.0 p1p1: tx_timeout: VSI_seid: 399, Q 8, NTC: 0xd36, HWB: 0xa1, NTU: 0xa1, TAIL: 0xa1, INT: 0x0
Jan 31 02:33:33 hellnat kernel: [220504.793976] i40e 0000:81:00.0 p1p1: tx_timeout recovery level 1, hung_queue 8
We can see that TX queue 8 hung. Why can this happen? I think it is a problem with the network adapter or the driver. Can you explain it and tell me how to fix it? It is a big problem when it happens, because all traffic goes through this machine.
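To make the ring state in the tx_timeout line easier to read, here is a small Python sketch that decodes the hex fields. As far as I understand the i40e source, NTC is the driver's next_to_clean index, HWB the head write-back value, NTU next_to_use, and TAIL the tail register; please correct me if that is wrong. The helper names `parse_tx_timeout` and `summarize` are my own.

```python
import re

# The tx_timeout line from the syslog excerpt above.
LINE = ("i40e 0000:81:00.0 p1p1: tx_timeout: VSI_seid: 399, Q 8, "
        "NTC: 0xd36, HWB: 0xa1, NTU: 0xa1, TAIL: 0xa1, INT: 0x0")

FIELD_RE = re.compile(r"\b(NTC|HWB|NTU|TAIL|INT): (0x[0-9a-fA-F]+)")


def parse_tx_timeout(line):
    """Return the hex fields of an i40e tx_timeout line as integers."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(line)}


def summarize(fields):
    """Compare the ring indices; the interpretation is an assumption."""
    # Head write-back equal to next_to_use and tail: nothing left for the NIC.
    hw_caught_up = fields["HWB"] == fields["NTU"] == fields["TAIL"]
    # next_to_clean differing from head: completed descriptors not yet cleaned.
    cleanup_behind = fields["NTC"] != fields["HWB"]
    return hw_caught_up, cleanup_behind


fields = parse_tx_timeout(LINE)
hw_caught_up, cleanup_behind = summarize(fields)
print(fields)
print("hardware caught up with posted descriptors:", hw_caught_up)
print("driver cleanup behind hardware head:", cleanup_behind)
```

For this line both checks come out true: HWB, NTU and TAIL are all 0xa1 while NTC is 0xd36. My reading is that the hardware had finished everything that was posted but the driver's TX cleanup had not advanced, which would fit INT: 0x0, but this is a guess and not a confirmed diagnosis.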
Some information from ethtool:
# ethtool -i p1p1
driver: i40e
version: 1.3.49
firmware-version: 4.53 0x80001da6 0.0.0
bus-info: 0000:81:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
# ethtool -c p1p1
Coalesce parameters for p1p1:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 800
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 256
tx-usecs: 600
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 256
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
# ethtool -k p1p1
Features for p1p1:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: on
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off
tx-tcp6-segmentation: off
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: off [fixed]
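If it is useful for comparing this feature list with another host, here is a minimal Python sketch that parses `ethtool -k` style output and lists the features that are off but not marked [fixed]. The sample is a subset of the output above; `parse_features` is my own helper, not part of ethtool.

```python
# A few lines copied from the `ethtool -k p1p1` output above.
SAMPLE = """\
Features for p1p1:
rx-checksumming: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: off
generic-receive-offload: off
tx-udp_tnl-segmentation: on
"""


def parse_features(text):
    """Map feature name -> (enabled, fixed) from `ethtool -k` style output."""
    features = {}
    for line in text.splitlines():
        name, sep, state = line.partition(":")
        state = state.strip()
        if not sep or not state:
            continue  # skips the "Features for ..." header and blank lines
        words = state.split()
        features[name.strip()] = (words[0] == "on", "[fixed]" in words)
    return features


features = parse_features(SAMPLE)
# Features that are currently off and could be switched on.
changeable_off = sorted(
    name for name, (on, fixed) in features.items() if not on and not fixed
)
print(changeable_off)
```

On this machine the segmentation and receive offloads (TSO, GSO, GRO) show up as off and changeable; they are disabled on purpose here.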
If you need more information, feel free to ask.
Thank you in advance.
Regards,
Evgeny