Early last week we upgraded a pair of Cisco Nexus 5K switches that had been running a 7.0 release for a while. The pair had been running beautifully, and by that I mean no errors or discards on any interface.
However, after we upgraded to 7.1 (the Cisco-recommended release), boom: our SolarWinds monitoring showed huge numbers of input packet discards on certain busy interfaces. I thought this was a software bug, so I logged a ticket with Cisco TAC and started troubleshooting straight away.
Cisco TAC explained that the 7.1 release changed how the switch handles queueing, specifically that the VOQs (virtual output queues) had changed.
Cisco Nexus egress congestion
As an example, suppose a flow of packets enters (ingress) on ports 1-5 and is destined to leave (egress) on port 10. If port 10’s egress queue becomes congested, you will most probably see input discards on the ingress ports for subsequent packets. The diagram below illustrates this.
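The same scenario can be sketched as a toy model in Python. This is only an illustration of why a congested egress port shows up as input discards. It is not how the Nexus ASIC actually schedules traffic, and the rates, buffer size and function name are all made-up assumptions.

```python
def simulate_ingress_discards(ingress_rates, egress_rate, buffer_limit, ticks):
    """Toy model: several ingress ports queue packets for one egress port.

    Each tick, every ingress port offers its packets. Whatever does not fit
    in the buffer in front of the egress port is counted as an input discard
    on that ingress port. The egress port then drains up to egress_rate packets.
    """
    queued = 0
    discards = {port: 0 for port in ingress_rates}
    for _ in range(ticks):
        for port, rate in ingress_rates.items():
            free = buffer_limit - queued
            accepted = min(rate, free)
            queued += accepted
            discards[port] += rate - accepted
        queued -= min(queued, egress_rate)
    return discards


# Hypothetical numbers: ports 1-5 each offer 3 packets per tick (15 in total),
# while the egress port (port 10 in the article) can only send 10 per tick.
ingress = {port: 3 for port in range(1, 6)}
print(simulate_ingress_discards(ingress, egress_rate=10, buffer_limit=20, ticks=10))
```

Once the buffer fills, the discards are counted against the ingress ports even though the bottleneck is the egress port. Which ingress ports take the loss here is just an artifact of the loop order in this sketch.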
Cisco Nexus – How to identify egress congestion
The easy part of this is that we can identify the input discards by typing in the following command:
SWITCH1# show interfaces | i discard|Description
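If you save the filtered output to a file, a few lines of Python can pick out the interfaces that actually have discards. This is a rough sketch: the sample text below is made up for illustration, and it assumes each interface has a "Description:" line followed by a line containing "N input discard". Check that against what your own switch prints before relying on it.

```python
import re

# Matches "<count> input discard" but not "<count> output discard".
DISCARD_RE = re.compile(r"(\d+)\s+input discard")


def interfaces_with_input_discards(filtered_output):
    """Return (description, discard_count) pairs where the count is non-zero."""
    results = []
    description = None
    for line in filtered_output.splitlines():
        line = line.strip()
        if line.startswith("Description:"):
            description = line.split(":", 1)[1].strip()
            continue
        match = DISCARD_RE.search(line)
        if match:
            count = int(match.group(1))
            if count > 0:
                results.append((description, count))
            # Reset so an interface without a description is reported as None.
            description = None
    return results


# Hypothetical sample, not captured from a real switch.
SAMPLE = """\
  Description: uplink-to-core
    0 input with dribble  48213 input discard
    0 lost carrier  0 no carrier  0 babble  0 output discard
  Description: server-a
    0 input with dribble  0 input discard
    0 lost carrier  0 no carrier  0 babble  0 output discard
"""

print(interfaces_with_input_discards(SAMPLE))
```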
However, the tricky part on the Cisco Nexus 5500 series switches is identifying which outbound port is congested.
First, determine how many ASICs your switch has by typing the command below and looking at the car column.
SWITCH1# sh hardware internal carmel all-ports
We can see from the output that we have ASICs 0, 1, 2 and 3. We don’t count the sup modules.
Now that we have established how many ASICs we have, we need to look at outbound congestion on each ASIC. The best time to identify congested outbound ports is during a busy network period.
The command to check for outbound congestion on ASIC 0 is:
SWITCH1# show hardware internal carmel asic 0 registers match .*STA.*frh.* | i eg
A value of 0 means no congestion. Any other value means there is congestion on this interface.
The table above does not keep a record of outbound congestion for very long, so unfortunately you need to keep repeating the above command and note how many instances of congestion you see on each interface. In the above example, if we continue to see a number other than 0 on the addr_1 interface, we can conclude that this interface is suffering from egress congestion.
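Keeping count by hand gets tedious, so here is a minimal sketch of the tallying step in Python. It assumes you have already turned each run of the command into a dictionary of register name to value. How you collect and parse the output is up to you, and the sample values below are invented.

```python
from collections import Counter


def tally_congestion(samples):
    """Count, per register, how many samples showed a non-zero value."""
    hits = Counter()
    for sample in samples:
        for register, value in sample.items():
            if value != 0:
                hits[register] += 1
    return hits


# Hypothetical readings from four runs of the command on ASIC 0.
samples = [
    {"car_bm_STA_frh_eg_addr_0": 0, "car_bm_STA_frh_eg_addr_1": 12},
    {"car_bm_STA_frh_eg_addr_0": 0, "car_bm_STA_frh_eg_addr_1": 0},
    {"car_bm_STA_frh_eg_addr_0": 0, "car_bm_STA_frh_eg_addr_1": 7},
    {"car_bm_STA_frh_eg_addr_0": 0, "car_bm_STA_frh_eg_addr_1": 3},
]

for register, count in tally_congestion(samples).most_common():
    print(f"{register}: non-zero in {count} of {len(samples)} samples")
```

A register that is non-zero in most samples is the one worth chasing. A single non-zero reading on its own may just be a short burst.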
To link car_bm_STA_frh_eg_addr_1 to a physical interface we need to return to the previous table and look for the asic number followed by the addr_ number.
In this example, the interface with outbound congestion is eth1/1.
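The lookup can also be written down as a small table in Python. This is a sketch only: the single entry in PORT_MAP is the one mapping this article gives (ASIC 0, addr_1, eth1/1), and you would need to fill in the rest from your own all-ports output. The function and variable names are my own.

```python
import re

# (ASIC number, addr number) -> interface.
# Only this one entry comes from the article; populate the rest from the
# all-ports output on your own switch.
PORT_MAP = {(0, 1): "eth1/1"}

# Pulls the trailing number out of names like car_bm_STA_frh_eg_addr_1.
ADDR_RE = re.compile(r"_addr_(\d+)$")


def register_to_interface(asic, register, port_map=PORT_MAP):
    """Return the interface for a register on a given ASIC, or None if unknown."""
    match = ADDR_RE.search(register)
    if match is None:
        raise ValueError(f"no addr number in register name: {register!r}")
    return port_map.get((asic, int(match.group(1))))


print(register_to_interface(0, "car_bm_STA_frh_eg_addr_1"))
```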
You will need to repeat the command show hardware internal carmel asic 0 registers match .*STA.*frh.* | i eg for ASICs 1, 2 and 3 as well, changing the ASIC number each time.
To resolve the outbound congestion, follow the outbound interface to the connected device and determine whether that device has performance issues. Alternatively, you can try adding more links to the end device, which gives the traffic more outbound ports and queues to use.
This process applies only to the Cisco Nexus 5500 series switches. On the 5600 and 9000 series it is much easier, as you can look at the VOQs directly.
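To avoid typos when repeating the command, you can generate one line per ASIC. This small Python sketch just fills the ASIC number into the command used above. The ASIC count of 4 matches the switch in this article, so adjust it for yours.

```python
COMMAND_TEMPLATE = (
    "show hardware internal carmel asic {asic} registers match .*STA.*frh.* | i eg"
)


def congestion_commands(asic_count):
    """Build the outbound congestion command for ASICs 0 .. asic_count - 1."""
    return [COMMAND_TEMPLATE.format(asic=number) for number in range(asic_count)]


# Four ASICs (0 to 3), as on the switch in this article.
for command in congestion_commands(4):
    print(command)
```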