Channel: Intel Communities : Discussion List - Wired Ethernet

i40e XL710 hang up - tx_timeout hung_queue - ubuntu


Hello,

We have set up a PC running Ubuntu 14.04.3, with all updates installed, as a border router:

Linux hellnat 3.19.0-47-generic #53~14.04.1-Ubuntu SMP Mon Jan 18 16:09:14 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

CPU: 2*E5-2690v3 with hyperthreading enabled (48 logical "cores" in total in the OS)

NIC: Intel XL710 quad port; every "channel" of every p1p* interface is bound to its own core
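As a rough sketch of what this kind of pinning involves: each queue interrupt gets a one-CPU bitmask written to /proc/irq/<N>/smp_affinity. The helper below only computes the mask string. The function name is hypothetical, and the mapping of queue 8 to CPU 8 in the usage note is an assumption for illustration, not something taken from this machine.

```python
def smp_affinity_mask(cpu: int, total_cpus: int = 48) -> str:
    """Return a one-CPU hex bitmask as comma-separated 32-bit groups,
    most significant group first (the smp_affinity file format)."""
    if not 0 <= cpu < total_cpus:
        raise ValueError(f"cpu {cpu} is outside 0..{total_cpus - 1}")
    mask = 1 << cpu
    groups = (total_cpus + 31) // 32  # number of 32-bit words needed
    words = [(mask >> (32 * i)) & 0xFFFFFFFF for i in range(groups)]
    return ",".join(f"{w:08x}" for w in reversed(words))


if __name__ == "__main__":
    # Assumed mapping for illustration only: queue 8 pinned to CPU 8.
    print(smp_affinity_mask(8))
    # CPU 45 is the core that reported the watchdog warning.
    print(smp_affinity_mask(45))
```

With 48 logical cores the mask needs two 32-bit groups, so CPU 8 gives 00000000,00000100 and CPU 45 gives 00002000,00000000.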

It is used as a border router, so it runs BGP. We use p1p1 and p1p3 to connect to internal routers, and p1p2 and p1p3 to connect to uplinks.

Suddenly, traffic stopped, and it was NOT even rush hour.

[Attached images: zabb1.png, zabb2.png]

After a reboot (via IPMI), I found the following lines in the syslog file:

Jan 31 02:33:33 hellnat kernel: [220504.793680] ------------[ cut here ]------------

Jan 31 02:33:33 hellnat kernel: [220504.793701] WARNING: CPU: 45 PID: 0 at /build/linux-lts-vivid-Yt59dr/linux-lts-vivid-3.19.0/net/sched/sch_generic.c:303 dev_watchdog+0x24f/0x260()

Jan 31 02:33:33 hellnat kernel: [220504.793705] NETDEV WATCHDOG: p1p1 (i40e): transmit queue 8 timed out

Jan 31 02:33:33 hellnat kernel: [220504.793707] Modules linked in: nf_conntrack_netlink nfnetlink xt_tcpudp xt_multiport iptable_filter xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw ast ttm joydev intel_rapl iosf_mbi drm_kms_helper x86_pkg_temp_thermal intel_powerclamp drm syscopyarea sysfillrect sysimgblt coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel ipmi_ssif aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich mei_me sb_edac edac_core mei ipmi_si 8250_fintek ipmi_msghandler lp wmi acpi_pad parport ioatdma mac_hid shpchp nf_conntrack_ftp acpi_power_meter nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc tcp_htcp hid_generic i40e(OE) igb vxlan ip6_udp_tunnel i2c_algo_bit udp_tunnel usbhid dca uas configfs ahci ptp usb_storage hid megaraid_sas libahci pps_core

Jan 31 02:33:33 hellnat kernel: [220504.793817] CPU: 45 PID: 0 Comm: swapper/45 Tainted: G           OE  3.19.0-47-generic #53~14.04.1-Ubuntu

Jan 31 02:33:33 hellnat kernel: [220504.793820] Hardware name: Supermicro SYS-6018R-WTR/X10DRW-i, BIOS 1.1 08/13/2015

Jan 31 02:33:33 hellnat kernel: [220504.793822]  ffffffff81b3fcc0 ffff88105f4a3d58 ffffffff817afcd5 0000000000000000

Jan 31 02:33:33 hellnat kernel: [220504.793827]  ffff88105f4a3da8 ffff88105f4a3d98 ffffffff81074dea 0000000000000286

Jan 31 02:33:33 hellnat kernel: [220504.793830]  0000000000000008 ffff88105b65a000 0000000000000040 ffff88105748cf40

Jan 31 02:33:33 hellnat kernel: [220504.793835] Call Trace:

Jan 31 02:33:33 hellnat kernel: [220504.793837]  <IRQ>  [<ffffffff817afcd5>] dump_stack+0x45/0x57

Jan 31 02:33:33 hellnat kernel: [220504.793857]  [<ffffffff81074dea>] warn_slowpath_common+0x8a/0xc0

Jan 31 02:33:33 hellnat kernel: [220504.793860]  [<ffffffff81074e66>] warn_slowpath_fmt+0x46/0x50

Jan 31 02:33:33 hellnat kernel: [220504.793869]  [<ffffffff816cd69f>] dev_watchdog+0x24f/0x260

Jan 31 02:33:33 hellnat kernel: [220504.793874]  [<ffffffff816cd450>] ? dev_graft_qdisc+0x80/0x80

Jan 31 02:33:33 hellnat kernel: [220504.793879]  [<ffffffff810dac79>] call_timer_fn+0x39/0x110

Jan 31 02:33:33 hellnat kernel: [220504.793883]  [<ffffffff816cd450>] ? dev_graft_qdisc+0x80/0x80

Jan 31 02:33:33 hellnat kernel: [220504.793888]  [<ffffffff810dc440>] run_timer_softirq+0x220/0x320

Jan 31 02:33:33 hellnat kernel: [220504.793898]  [<ffffffff8104a403>] ? lapic_next_deadline+0x33/0x40

Jan 31 02:33:33 hellnat kernel: [220504.793905]  [<ffffffff81078f44>] __do_softirq+0xe4/0x270

Jan 31 02:33:33 hellnat kernel: [220504.793909]  [<ffffffff8107930d>] irq_exit+0x9d/0xb0

Jan 31 02:33:33 hellnat kernel: [220504.793916]  [<ffffffff817ba78a>] smp_apic_timer_interrupt+0x4a/0x60

Jan 31 02:33:33 hellnat kernel: [220504.793924]  [<ffffffff817b87bd>] apic_timer_interrupt+0x6d/0x80

Jan 31 02:33:33 hellnat kernel: [220504.793926]  <EOI>  [<ffffffff81650510>] ? cpuidle_enter_state+0x70/0x170

Jan 31 02:33:33 hellnat kernel: [220504.793938]  [<ffffffff816504fd>] ? cpuidle_enter_state+0x5d/0x170

Jan 31 02:33:33 hellnat kernel: [220504.793943]  [<ffffffff816506c7>] cpuidle_enter+0x17/0x20

Jan 31 02:33:33 hellnat kernel: [220504.793949]  [<ffffffff810b54d4>] cpu_startup_entry+0x334/0x3d0

Jan 31 02:33:33 hellnat kernel: [220504.793955]  [<ffffffff810e9e53>] ? clockevents_register_device+0xe3/0x140

Jan 31 02:33:33 hellnat kernel: [220504.793960]  [<ffffffff81048bb7>] start_secondary+0x197/0x1c0

Jan 31 02:33:33 hellnat kernel: [220504.793963] ---[ end trace 43e1a051ade0289e ]---

Jan 31 02:33:33 hellnat kernel: [220504.793973] i40e 0000:81:00.0 p1p1: tx_timeout: VSI_seid: 399, Q 8, NTC: 0xd36, HWB: 0xa1, NTU: 0xa1, TAIL: 0xa1, INT: 0x0

Jan 31 02:33:33 hellnat kernel: [220504.793976] i40e 0000:81:00.0 p1p1: tx_timeout recovery level 1, hung_queue 8

Jan 31 02:33:43 hellnat watchquagga[2972]: zebra state -> unresponsive : no response yet to ping sent 10 seconds ago

Jan 31 02:33:49 hellnat watchquagga[2972]: bgpd state -> unresponsive : no response yet to ping sent 10 seconds ago

Jan 31 02:33:50 hellnat kernel: [220521.908228] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [kworker/13:1:536]

Jan 31 02:33:50 hellnat kernel: [220521.908306] Modules linked in: nf_conntrack_netlink nfnetlink xt_tcpudp xt_multiport iptable_filter xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw ast ttm joydev intel_rapl iosf_mbi drm_kms_helper x86_pkg_temp_thermal intel_powerclamp drm syscopyarea sysfillrect sysimgblt coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel ipmi_ssif aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich mei_me sb_edac edac_core mei ipmi_si 8250_fintek ipmi_msghandler lp wmi acpi_pad parport ioatdma mac_hid shpchp nf_conntrack_ftp acpi_power_meter nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc tcp_htcp hid_generic i40e(OE) igb vxlan ip6_udp_tunnel i2c_algo_bit udp_tunnel usbhid dca uas configfs ahci ptp usb_storage hid megaraid_sas libahci pps_core

Jan 31 02:33:50 hellnat kernel: [220521.908396] CPU: 13 PID: 536 Comm: kworker/13:1 Tainted: G        W  OE  3.19.0-47-generic #53~14.04.1-Ubuntu

Jan 31 02:33:50 hellnat kernel: [220521.908399] Hardware name: Supermicro SYS-6018R-WTR/X10DRW-i, BIOS 1.1 08/13/2015

Jan 31 02:33:50 hellnat kernel: [220521.908408] Workqueue: events inet_frag_worker

 

 

The main lines, I think, are:

Jan 31 02:33:33 hellnat kernel: [220504.793705] NETDEV WATCHDOG: p1p1 (i40e): transmit queue 8 timed out

Jan 31 02:33:33 hellnat kernel: [220504.793973] i40e 0000:81:00.0 p1p1: tx_timeout: VSI_seid: 399, Q 8, NTC: 0xd36, HWB: 0xa1, NTU: 0xa1, TAIL: 0xa1, INT: 0x0

Jan 31 02:33:33 hellnat kernel: [220504.793976] i40e 0000:81:00.0 p1p1: tx_timeout recovery level 1, hung_queue 8
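To make the tx_timeout line easier to read, here is a small sketch that splits it into fields and compares the ring indexes. The interpretation is my assumption from reading the driver, not something confirmed by Intel: NTC and NTU are the driver's next-to-clean and next-to-use indexes, HWB is the head index written back by the hardware, and TAIL is the tail register. The function names are hypothetical.

```python
import re

# The tx_timeout line from the syslog above, without the timestamp prefix.
LINE = ("i40e 0000:81:00.0 p1p1: tx_timeout: VSI_seid: 399, Q 8, "
        "NTC: 0xd36, HWB: 0xa1, NTU: 0xa1, TAIL: 0xa1, INT: 0x0")

PATTERN = re.compile(
    r"tx_timeout: VSI_seid: (?P<seid>\d+), Q (?P<queue>\d+), "
    r"NTC: (?P<ntc>0x[0-9a-f]+), HWB: (?P<hwb>0x[0-9a-f]+), "
    r"NTU: (?P<ntu>0x[0-9a-f]+), TAIL: (?P<tail>0x[0-9a-f]+), "
    r"INT: (?P<intr>0x[0-9a-f]+)"
)


def parse_tx_timeout(line):
    """Return the numeric fields of a tx_timeout line, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    fields = {"seid": int(m["seid"]), "queue": int(m["queue"])}
    for key in ("ntc", "hwb", "ntu", "tail", "intr"):
        fields[key] = int(m[key], 16)  # base 16 accepts the 0x prefix
    return fields


def summarize(fields):
    """Compare ring indexes (interpretation assumed, see the text above)."""
    return {
        # Hardware head equals everything the driver has posted.
        "hw_caught_up": fields["hwb"] == fields["ntu"] == fields["tail"],
        # Driver cleanup index differs from the hardware head.
        "cleanup_lagging": fields["ntc"] != fields["hwb"],
    }


if __name__ == "__main__":
    parsed = parse_tx_timeout(LINE)
    print(parsed)
    print(summarize(parsed))
```

For the line above, HWB, NTU and TAIL are all 0xa1 while NTC is 0xd36. If my reading of the fields is right, the hardware had processed everything that was posted, but the driver's cleanup of queue 8 had not advanced to match.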

 

We can see that TX queue 8 hung. Why can this happen? I think it is a problem with the network adapter or the driver. Can you explain it to me and tell me how to fix it? It is a big problem when this happens, because all traffic goes through this machine.

Some information from ethtool:

# ethtool -i p1p1

driver: i40e

version: 1.3.49

firmware-version: 4.53 0x80001da6 0.0.0

bus-info: 0000:81:00.0

supports-statistics: yes

supports-test: yes

supports-eeprom-access: yes

supports-register-dump: yes

supports-priv-flags: yes

 

# ethtool -c p1p1

Coalesce parameters for p1p1:

Adaptive RX: off  TX: off

stats-block-usecs: 0

sample-interval: 0

pkt-rate-low: 0

pkt-rate-high: 0

rx-usecs: 800

rx-frames: 0

rx-usecs-irq: 0

rx-frames-irq: 256

tx-usecs: 600

tx-frames: 0

tx-usecs-irq: 0

tx-frames-irq: 256

rx-usecs-low: 0

rx-frame-low: 0

tx-usecs-low: 0

tx-frame-low: 0

rx-usecs-high: 0

rx-frame-high: 0

tx-usecs-high: 0

tx-frame-high: 0
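For context on these coalescing values, a quick arithmetic sketch. It assumes that, with adaptive moderation off, rx-usecs and tx-usecs act as the minimum gap between interrupts on a queue, so the rate per queue is capped at roughly 1,000,000 / usecs. That reading is my assumption, and the function name is hypothetical.

```python
def max_interrupts_per_second(usecs):
    """Upper bound on interrupts per second for a given minimum gap in
    microseconds. Returns None when usecs is 0 (no time-based limit)."""
    if usecs <= 0:
        return None
    return 1_000_000 / usecs


if __name__ == "__main__":
    print("rx:", max_interrupts_per_second(800))  # rx-usecs: 800
    print("tx:", max_interrupts_per_second(600))  # tx-usecs: 600
```

Under that assumption, rx-usecs 800 caps each queue at about 1250 interrupts per second, and tx-usecs 600 at about 1667.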

 

# ethtool -k p1p1

Features for p1p1:

rx-checksumming: on

tx-checksumming: on

  tx-checksum-ipv4: on

  tx-checksum-ip-generic: off [fixed]

  tx-checksum-ipv6: on

  tx-checksum-fcoe-crc: off [fixed]

  tx-checksum-sctp: on

scatter-gather: on

  tx-scatter-gather: on

  tx-scatter-gather-fraglist: off [fixed]

tcp-segmentation-offload: off

  tx-tcp-segmentation: off

  tx-tcp-ecn-segmentation: off

  tx-tcp6-segmentation: off

udp-fragmentation-offload: off [fixed]

generic-segmentation-offload: off

generic-receive-offload: off

large-receive-offload: off [fixed]

rx-vlan-offload: on

tx-vlan-offload: on

ntuple-filters: on

receive-hashing: on

highdma: on

rx-vlan-filter: on

vlan-challenged: off [fixed]

tx-lockless: off [fixed]

netns-local: off [fixed]

tx-gso-robust: off [fixed]

tx-fcoe-segmentation: off [fixed]

tx-gre-segmentation: off [fixed]

tx-ipip-segmentation: off [fixed]

tx-sit-segmentation: off [fixed]

tx-udp_tnl-segmentation: on

fcoe-mtu: off [fixed]

tx-nocache-copy: off

loopback: off [fixed]

rx-fcs: off [fixed]

rx-all: off [fixed]

tx-vlan-stag-hw-insert: off [fixed]

rx-vlan-stag-hw-parse: off [fixed]

rx-vlan-stag-filter: off [fixed]

l2-fwd-offload: off [fixed]

busy-poll: off [fixed]

 

 

If you need more information, feel free to ask.

Thank you in advance.

 

Regards,

Evgeny

