2959 Views, 10 Helpful, 5 Replies

RX and TX packet drops

gman66
Level 1

Hello Cisco Community - can anyone help me understand how to monitor packet drops in Cisco UCS? In UCSM there are error and loss counters in the port stats, but none of them is a basic packet-drop counter. More specifically, how can one check whether packets were being dropped at a specific time?

 

Do I need the CLI for this?   

Do I need a 3rd party mgmt tool that can collect stats and store them for historical review?   

 

HELP!  :)  

 

I've been focused on VMware technology for most of my career, and there this is a metric you can easily pull up and look at, with well-documented fixes if a problem occurs. I can't seem to figure out how to get these packet-drop stats in UCS, and I can't find any documentation on it.

 

5 Replies

balaji.bandi
Hall of Fame

The question here is: packets dropping from where to where?

 

How is your UCS environment connected?

 

In most cases it is: UCS / VMware (vSwitch) -- Fabric Interconnect -- Nexus -- Core -- (users with access switches).

 

Depending on where you see the packet drops, we need to look at the relevant connected interface.

 

In a VMware environment, if you have vSphere you can monitor at the VM level, or you can monitor at the switch level if you have an NMS.
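If you do not have a full NMS yet, even a small script can watch those switch-level counters for you. Here is a rough sketch (untested, assuming SNMP v2c is enabled on the upstream switch with a read-only community, and using the Python pysnmp library); the hostname and community string are placeholders. It just walks the IF-MIB discard counters and prints any interface that is discarding:

from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, nextCmd)

SWITCH = "n5k-1.example.com"   # placeholder: upstream Nexus (or FI) mgmt address
COMMUNITY = "public"           # placeholder: read-only SNMP community

# Walk IF-MIB so each row yields (ifDescr, ifInDiscards, ifOutDiscards).
for err_ind, err_stat, err_idx, row in nextCmd(
        SnmpEngine(),
        CommunityData(COMMUNITY, mpModel=1),        # SNMP v2c
        UdpTransportTarget((SWITCH, 161)),
        ContextData(),
        ObjectType(ObjectIdentity("IF-MIB", "ifDescr")),
        ObjectType(ObjectIdentity("IF-MIB", "ifInDiscards")),
        ObjectType(ObjectIdentity("IF-MIB", "ifOutDiscards")),
        lexicographicMode=False):
    if err_ind or err_stat:
        print("SNMP error:", err_ind or err_stat.prettyPrint())
        break
    name, in_disc, out_disc = (vb[1] for vb in row)
    if int(in_disc) or int(out_disc):
        print(f"{name}: in discards={in_disc}, out discards={out_disc}")

Run it from cron or in a loop and you have a crude history of when discards start incrementing.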

 

BB

***** Rate All Helpful Responses *****

How to Ask The Cisco Community for Help

The architecture is exactly as you listed it: VMware running on UCS blades, with Nexus 5Ks upstream and then 7Ks upstream.

 

I have looked at the VM stats; there were no packet drops at the time of the issue. Our F5 shows that the two servers became unavailable and were not responding to the health check.

 

  • No packet drops on the VMs in question
  • No host memory swapping
  • No CPU contention on the host
  • A small deviation in storage latency, but the spike doesn't even touch 1 ms (thanks HDS, there is no better array)
  • A small deviation in network throughput, but the spike is only to 400 KBps, and we have 10 Gbps networking

 

My review of the VMware stack shows it's not the culprit.

 

UCSM port stats show 0 for all loss counters and 0 for all error counters.

 

The event logs don't go back far enough; they must be getting overwritten. I can't recall how that is configured, but I can only see the past week or so of events and that's it. I navigated to the specific blades the VMs were running on to check for faults and events, and there was no data showing in either. I'm not sure why the events section would be empty; I figured there should be some events logged, but there was nothing.

 

I simply want to understand where I can see whether packet drops took place within the UCS stack. I know where to look from the VMware host perspective, but not for the UCS hardware per se, or whether this is something that is being monitored and logged in the stack at all.

 

I figured that if an interface in the UCS stack went down I would see it in the event logs, but as I said, the logs don't go back far enough for me to see if something happened. I wasn't pulled in to try to diagnose this issue until over a week after it occurred. With VMware I can get a granular look at the stats using the vRealize Operations Manager tool, but I have nothing like that for Cisco UCS, and I could certainly take a recommendation on something that might capture events better for historical analysis.

Kirk J
Cisco Employee

You would need to check:

  • VMware VMNIC-level counters for drops in the ESXi stack.
  • UCS adapter-level counters (connect adapter chassis/blade/adapter, e.g. connect adapter 1/1/1)
    • then connect, attach-mcp
    • the actual command to see drops can vary depending on which VIC card is present.
  • UCS IOM HIF/NIF counters (connect iom x)
    • commands vary depending on IOM generation
  • UCS FIs: show hardware internal carmel counters interrupt
    • show queuing interface eth x/y
    • commands will vary depending on FI generation (one way to collect these over SSH is sketched below)
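If you want to collect the FI-side output on a schedule with timestamps (so you can line drops up with an incident window later), something along these lines could work. This is an untested sketch using the paramiko SSH library; the FI address, credentials, and interface are placeholders, and as noted above the exact NX-OS commands vary by FI generation:

import time
from datetime import datetime
import paramiko

FI_HOST = "ucs-fi-a.example.com"            # placeholder: FI / UCSM mgmt address
USERNAME, PASSWORD = "admin", "password"    # placeholders

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(FI_HOST, username=USERNAME, password=PASSWORD, look_for_keys=False)
shell = client.invoke_shell()

def run(cmd, wait=3):
    # Send a command to the interactive UCS shell and return the raw output.
    shell.send((cmd + "\n").encode())
    time.sleep(wait)
    return shell.recv(65535).decode(errors="replace")

run("connect nxos a")       # drop into the NX-OS shell on fabric A (use "b" for the other FI)
run("terminal length 0")    # disable paging so the output isn't truncated
output = run("show queuing interface ethernet 1/1", wait=5)   # placeholder interface

stamp = datetime.now().isoformat(timespec="seconds")
with open("fi_a_queuing.log", "a") as log:
    log.write(f"\n===== {stamp} =====\n{output}\n")

client.close()

Run it on a schedule and the timestamps in the log will at least let you bracket when a counter started moving.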

 

Also, I have seen plenty of cases where the actual issue was with storage, but the first symptoms show up as 'network' issues when hosts or guest VMs start to thrash around with storage (check esxtop output for DAVG, KAVG, GAVG).

 

You can use the VMware pktcap-uw command to capture at the DVS/VMK/VMNIC level, and you likely want to do a SPAN on the N5K links going to the FIs to see which direction the drops are in (i.e., is the guest VM sending/resending requests that aren't getting responses?).

I would suggest getting pcaps to determine who stops getting responses from whom. Are only some of the hosts/guest VMs impacted? Try disabling some port-channel members, or half of the vPC; does the problem go away?

Set up some generic ping tests to isolate:

  • ping from within the same VLAN/subnet, guest VM to guest VM whose MACs are pinned to the same FI (this confirms whether local FI switching is or is not seeing the issue)
  • ping from within the same VLAN/subnet, guest VM to guest VM whose MACs are pinned to different FIs (this involves the upstream N5K switches)
  • ping between guest VMs in different subnets (this tests the L3 gateways)
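For the "when did it happen" part, those pings can be left running from a test box with a timestamp on every loss, so a failure window lines up with wall-clock time. A minimal sketch (plain Python, untested; the target IPs are placeholders for the three cases above, and the ping flags assume Linux):

import subprocess
import time
from datetime import datetime

# Placeholder targets for the three isolation tests above.
TARGETS = {
    "same-vlan-same-FI":  "10.1.10.21",
    "same-vlan-other-FI": "10.1.10.22",
    "different-subnet":   "10.1.20.31",
}

while True:
    stamp = datetime.now().isoformat(timespec="seconds")
    for label, ip in TARGETS.items():
        # Single ping with a 1-second timeout; a non-zero return code means the reply was lost.
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            print(f"{stamp}  LOSS  {label} ({ip})", flush=True)
    time.sleep(5)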

 

Kirk...

 

Thanks Kirk - when I run the commands in the UCSM shell, what history is provided? For example, if I see some packet drops recorded, how do I know when they might have happened? The idea is to match the drops with the time frame of the incident.


Kirk J
Cisco Employee

You're not going to get individual drop timestamps.

You can enable CRC-increment threshold alerts, which would give you alert timestamps.

I would keep the interface counters cleared, and keep checking.

It would be handy if you had an NMS pulling and logging historical data.
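If standing up a full NMS is a longer-term project, a small poller against the UCSM XML API can fill the gap in the meantime. A rough sketch (untested, using the Cisco UCSM Python SDK, ucsmsdk; the address and credentials are placeholders, and EtherErrStats / EtherLossStats are, as I recall, the objects behind the port error and loss counters you see in the GUI):

import time
from datetime import datetime
from ucsmsdk.ucshandle import UcsHandle

handle = UcsHandle("ucsm.example.com", "admin", "password")   # placeholders
handle.login()

try:
    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        with open("ucs_port_stats.log", "a") as log:
            # Dump every port's error and loss counter object with a timestamp;
            # diff successive samples to see when a counter starts incrementing.
            for class_id in ("EtherErrStats", "EtherLossStats"):
                for mo in handle.query_classid(class_id):
                    log.write(f"{stamp} {mo.dn}\n{mo}\n")
        time.sleep(300)   # poll every 5 minutes
finally:
    handle.logout()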

 

What does the output on each FI look like for show interface counters errors (from the nxos shell)?

What model FIs and IOMs?

 

Kirk...
