TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED (Nexus9000)

Ozy
Level 1

Hello everyone!

I'm not a network engineer, but as a subsystem (kernel and filesystem) engineer I understand the networking concepts.

A few months ago I designed a DMZ and configured a two-node vPC pair with six rack switches.

Everything was working smoothly, but a month ago I started to see problems and my life turned into hell:

 

2024 Mar 28 22:48:09 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded!
2024 Mar 28 22:50:10 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
2024 Mar 28 22:52:10 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded! (message repeated 1 time)


I was running "nxos.9.3.9". I found a bug report whose fix was to upgrade, so I upgraded to "nxos.9.3.13", but that did not solve my problem.

I don't know what the issue is, and I can't dig deeper because I don't know how to diagnose it.

All I know is that when the buffer error appears, all packets are dropped.

 

VPC-SW-2#     show interface counters errors non-zero

--------------------------------------------------------------------------------
Port          Align-Err    FCS-Err   Xmit-Err    Rcv-Err  UnderSize OutDiscards
--------------------------------------------------------------------------------
Eth1/1                0          0          0          0          0      369651
Eth1/8                0          0          0          0          0     1968446
Eth1/9                0          0          0          0          0      124332
Eth1/17               0          0          0          0          0      101073
Eth1/18               0          0          0          0          0      102809
Eth1/19               0          0          0          0          0      100208
Eth1/20               0          0          0          0          0      102725
Eth1/21               0          0          0          0          0      102590
Eth1/25               0          0          0          0          0       48752
Eth1/26               0          0          0          0          0      102281
Eth1/27               0          0          0          0          0       70208
Eth1/28               0          0          0          0          0      102652
Eth1/34               0          0          0          0          0      102646
Eth1/35               0          0          0          0          0      102849
Eth1/36               0          0          0          0          0      102430
Eth1/42               0          0          0          0          0     1968435
Eth1/43               0          0          0          0          0     1966384
Eth1/44               0          0          0          0          0     1968448
Eth1/46               0          0          0          0          0       32722
Eth1/47               0          0          0          0          0       45342
Eth1/48               0          0          0          0          0       24724
Eth1/49               0          0          0          0          0      102454
Eth1/50               0          0          0          0          0       99501
Eth1/51               0          0          0          0          0      100564
Eth1/52               0          0          0          0          0      102824
Eth1/53               0          0          0          0          0      102935
Eth1/54               0          0          0          0          0      103074
Po8                   0          0          0          0          0     1968446
Po9                   0          0          0          0          0      124332
Po17                  0          0          0          0          0      101073
Po18                  0          0          0          0          0      102809
Po19                  0          0          0          0          0      100208
Po20                  0          0          0          0          0      102725
Po21                  0          0          0          0          0      102590
Po25                  0          0          0          0          0       48752
Po26                  0          0          0          0          0      102281
Po27                  0          0          0          0          0       70208
Po28                  0          0          0          0          0      102652
Po34                  0          0          0          0          0      102646
Po35                  0          0          0          0          0      102849
Po36                  0          0          0          0          0      102430
Po42                  0          0          0          0          0     1968435
Po43                  0          0          0          0          0     1966384
Po44                  0          0          0          0          0     1968448
Po49                  0          0          0          0          0      102454
Po50                  0          0          0          0          0       99501
Po51                  0          0          0          0          0      100564
Po52                  0          0          0          0          0      102824
Po53                  0          0          0          0          0      102935
Po54                  0          0          0          0          0      103074
Po100                 0          0          0          0          0      102788
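
With this many rows it can help to sort the ports by OutDiscards so the worst ones stand out. The following is a minimal Python sketch, not an official tool: it assumes the table has been pasted as text with OutDiscards as the last of six numeric columns, and the names `parse_out_discards` and `rank_physical_ports` are made up for illustration. Port-channel rows are skipped because they repeat the counts of their member ports.

```python
import re

# A few rows copied from the output above; paste the full table in practice.
SAMPLE = """\
Port          Align-Err    FCS-Err   Xmit-Err    Rcv-Err  UnderSize OutDiscards
Eth1/1                0          0          0          0          0      369651
Eth1/8                0          0          0          0          0     1968446
Eth1/46               0          0          0          0          0       32722
Po8                   0          0          0          0          0     1968446
"""

# One port name followed by exactly six numeric columns (assumed layout).
ROW = re.compile(r"^(Eth\S+|Po\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")


def parse_out_discards(text):
    """Return {port: OutDiscards} for every data row in the pasted output."""
    result = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            result[match.group(1)] = int(match.group(7))
    return result


def rank_physical_ports(discards):
    """Sort physical (Eth) ports by OutDiscards, highest first."""
    eth = [(port, count) for port, count in discards.items() if port.startswith("Eth")]
    return sorted(eth, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for port, count in rank_physical_ports(parse_out_discards(SAMPLE)):
        print(f"{port:<10}{count:>10}")
```

The header and separator lines do not match the row pattern, so they are ignored without special handling.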

What changed? My best guess is that the cabling was changed incorrectly at some point.

I have some IPMI switches; I have shut their ports for now and am hunting for the root cause.

My switches are:

VPC-SW-1    : C93180YC-FX3 [ BIOS: version 01.09 | NXOS: version 9.3(13) ]
VPC-SW-2    : C93180YC-FX3 [ BIOS: version 01.09 | NXOS: version 9.3(13) ]
datasw-aa-03: C93180YC-FX  [ BIOS: version 05.51 | NXOS: version 9.3(13) ]
datasw-aa-04: C93180YC-FX  [ BIOS: version 05.51 | NXOS: version 9.3(13) ]
datasw-aa-06: C93180YC-FX  [ BIOS: version 05.51 | NXOS: version 9.3(13) ]
datasw-aa-08: C93180YC-FX3 [ BIOS: version 01.09 | NXOS: version 9.3(13) ]
datasw-aa-10: C92160YC-X   [ BIOS: version 07.41 | NXOS: version 7.0(3)I3(1) ]
datasw-aa-11: C92160YC-X   [ BIOS: version 07.41 | NXOS: version 7.0(3)I3(1) ]

 

 

VPC-SW-1# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 28 22:48:09 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded!
2024 Mar 28 22:50:10 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
2024 Mar 28 22:52:10 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
---------------------------------------------------------------------------------------
VPC-SW-2# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 28 22:48:18 NILE2 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded!
2024 Mar 28 22:49:03 NILE2 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded! (message repeated 2 times)
2024 Mar 28 22:51:58 NILE2 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded! (message repeated 3 times)
---------------------------------------------------------------------------------------
datasw-aa-03# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 21 18:30:34 datasw-aa-03 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded!
2024 Mar 27 02:09:17 datasw-aa-03 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded!
2024 Mar 27 02:11:18 datasw-aa-03 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded!
---------------------------------------------------------------------------------------
datasw-aa-04# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 27 00:56:36 datasw-aa-04 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
2024 Mar 28 22:49:58 datasw-aa-04 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded!
2024 Mar 28 22:51:58 datasw-aa-04 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
---------------------------------------------------------------------------------------
datasw-aa-06# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 21 20:53:36 datasw-aa-06 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded!
2024 Mar 21 22:13:36 datasw-aa-06 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded!
2024 Mar 21 22:15:36 datasw-aa-06 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded!
---------------------------------------------------------------------------------------
datasw-aa-08# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 27 16:44:26 datasw-aa-08 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
2024 Mar 27 16:46:26 datasw-aa-08 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
2024 Mar 27 16:55:56 datasw-aa-08 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
---------------------------------------------------------------------------------------
datasw-aa-10# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
datasw-aa-10# 
---------------------------------------------------------------------------------------
datasw-aa-11# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
datasw-aa-11# 

 

Interestingly, the only switches where I do not see this issue are datasw-aa-10 and datasw-aa-11, both "C92160YC-X [ BIOS: version 07.41 | NXOS: version 7.0(3)I3(1) ]".
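
To see which switches logged the message at about the same time, the per-switch log excerpts can be merged into one timeline. Below is a rough Python sketch under a few assumptions: the log lines are collected into one text blob, the hostname is the field right after the timestamp (as in the excerpts above), and a 5-minute window is an arbitrary choice. The helper names `parse_events` and `hosts_per_window` are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Four timestamps taken from the excerpts above (message text shortened).
LOG = """\
2024 Mar 28 22:48:09 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0
2024 Mar 28 22:48:18 NILE2 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0
2024 Mar 28 22:49:58 datasw-aa-04 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0
2024 Mar 27 16:44:26 datasw-aa-08 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0
"""

# Timestamp, hostname, then the message tag; only the start of the line is needed,
# so wrapped continuation lines are simply skipped.
EVENT = re.compile(
    r"^(\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) "
    r"%TAHUSD-SLOT\d+-4-BUFFER_THRESHOLD_EXCEEDED"
)


def parse_events(text):
    """Return a time-sorted list of (datetime, hostname) tuples."""
    events = []
    for line in text.splitlines():
        match = EVENT.match(line)
        if match:
            stamp_text = " ".join(match.group(1).split())  # collapse padded day
            stamp = datetime.strptime(stamp_text, "%Y %b %d %H:%M:%S")
            events.append((stamp, match.group(2)))
    return sorted(events)


def hosts_per_window(events, minutes=5):
    """Group hostnames by the start of the N-minute window they logged in."""
    windows = defaultdict(set)
    for stamp, host in events:
        start = stamp.replace(minute=stamp.minute - stamp.minute % minutes, second=0)
        windows[start].add(host)
    return dict(windows)


if __name__ == "__main__":
    for start, hosts in sorted(hosts_per_window(parse_events(LOG)).items()):
        print(start, sorted(hosts))
```

Windows that contain several hostnames point at an event that hit the fabric as a whole; windows with a single hostname are more likely local to that switch. This only works if the switch clocks are reasonably in sync.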

 

Dear experienced network engineers...

Before I found the command "show interface counters errors non-zero", I was struggling along with "sh int | include discard".

As you can see, I don't know how to check logs, monitor ports, and so on.

Please help me find the root cause. What should I do?

 

 

 


Ozy
Level 1

My VPC and Uplink cables are:

VPC-SW-1#
Eth1/45       VPC Keep-Alive     connected routed    full    10G     SFP-H10GB-CU3M
Eth1/46       VPC Peer-Link      connected trunk     full    25G     SFP-H25GB-CU1M
Eth1/47       VPC Peer-Link      connected trunk     full    25G     SFP-H25GB-CU1M
Eth1/48       VPC Peer-Link      connected trunk     full    25G     SFP-H25GB-CU1M
Eth1/49       datasw-aa-03       connected trunk     full    100G    QSFP-100G-PCC
Eth1/50       datasw-aa-04       connected trunk     full    100G    QSFP-100G-PCC
Eth1/51       datasw-aa-06       connected trunk     full    100G    QSFP-100G-CR4
Eth1/52       datasw-aa-08       connected trunk     full    100G    QSFP-100G-CR4
Eth1/53       datasw-aa-10       connected trunk     full    100G    QSFP-100G-PCC
Eth1/54       datasw-aa-11       connected trunk     full    100G    QSFP-100G-PCC

VPC-SW-2#
Eth1/45       VPC Keep-Alive     connected routed    full    10G     SFP-H10GB-CU3M
Eth1/46       VPC Peer-Link      connected trunk     full    25G     SFP-H25GB-CU1M
Eth1/47       VPC Peer-Link      connected trunk     full    25G     SFP-H25GB-CU1M
Eth1/48       VPC Peer-Link      connected trunk     full    25G     SFP-H25GB-CU1M
Eth1/49       datasw-aa-03       connected trunk     full    100G    QSFP-100G-PCC
Eth1/50       datasw-aa-04       connected trunk     full    100G    QSFP-100G-PCC
Eth1/51       datasw-aa-06       connected trunk     full    100G    QSFP-100G-PCC
Eth1/52       datasw-aa-08       connected trunk     full    100G    QSFP-100G-CR4
Eth1/53       datasw-aa-10       connected trunk     full    100G    QSFP-100G-PCC
Eth1/54       datasw-aa-11       connected trunk     full    100G    QSFP-100G-PCC

 

After I shut the IPMI switch uplinks, the problem has not recurred.

I'm monitoring it now; let's see...

Hello @Ozy ,

The following link describes methods and tools for investigating this kind of error further:

https://www.cisco.com/c/en/us/support/docs/switches/nexus-9000-series-switches/217340-understand-the-tahusd-buffer-threshold-e.html

Keep it as a reference in case the issue appears again.

Hope this helps

Giuseppe

 

Hello @Giuseppe Larosa, thank you for the reply.

When I searched for this issue, there were only three good results, and this link was one of them. I learned every diagnostic command from it; it is very useful.

However, that document mostly explains the concept; it gives no easy way to interpret the outputs or to narrow the problem down to a specific switch or port.

In my case, when the problem occurs, all the counters rise because of buffer cleanup, and it is very hard to tell what triggered it or where the problem is. At least it is hard for me (not a network engineer).
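
One way to cut through counters that have all risen is to compare two snapshots and look only at what changed in between. This is a small Python sketch of that idea, not a Cisco tool: the "before" values come from the table earlier in the thread, the "after" values are hypothetical, and the names `discard_deltas` and `top_movers` are invented for illustration.

```python
def discard_deltas(before, after):
    """Return {port: increase} for ports whose counter grew between snapshots.

    Ports whose counter stayed flat or went down (for example after the
    counters were cleared) are left out.
    """
    deltas = {}
    for port, now in after.items():
        grew = now - before.get(port, 0)
        if grew > 0:
            deltas[port] = grew
    return deltas


def top_movers(deltas, limit=5):
    """Return the ports with the largest increase, biggest first."""
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:limit]


# "before" is taken from the OutDiscards column earlier in the thread;
# "after" is made-up data standing in for a later snapshot.
before = {"Eth1/1": 369651, "Eth1/8": 1968446, "Eth1/46": 32722}
after = {"Eth1/1": 369651, "Eth1/8": 1968946, "Eth1/46": 32822}

if __name__ == "__main__":
    for port, grew in top_movers(discard_deltas(before, after)):
        print(f"{port:<10}+{grew}")
```

Taking snapshots at a short, fixed interval while the problem is active should show which ports start dropping first, which is harder to see from the cumulative totals.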

I set up the topology and it worked great. This is my first issue in five months, and the main cause is wrong cabling; I have no idea when it was done or by whom.

This problem taught me that I need to spend some overtime on monitoring and exporting syslogs. My active features are below. Can you recommend a guide, or give any quick advice, on getting meaningful alerts or values for problems I might have in the future?

cfs eth distribute
feature interface-vlan
feature hsrp
feature lacp
feature vpc
feature lldp
feature bfd

 
