Solved: trying to clear ECC errors

andrewcalhoun · ‎07-28-2015

I have a server whose Health LED is amber, and UCSM is showing an F1237:
Description: Health LED of server 7/1 shows error. Reason: DDR4_P1_A1_ECC:Sensor Threshold Crossed; DDR4_P1_A2_ECC:Sensor Threshold Crossed

Last Transition for this fault is 2015-04-01

There were apparently some ECC errors during POST in April, but besides that, there was only 1 correctable ECC error in April, and 1 correctable ECC error in June (zero uncorrectable errors ever on this server).

So I:

1. Gathered UCSM and Chassis (7) UCS logs.

2. Acknowledged the F1237.

3. Reset the memory error counters on both the P1 A1 and P1 A2 DIMMs from the correct window (Equipment --> Inventory --> Memory --> double-click DIMM A1 to open a smaller window --> click "Reset Memory Errors" --> click "Yes"), and then did the same thing for DIMM A2.

4. Cleared the SEL Log on this blade

5. Reset the CIMC for this blade.

After the CIMC reset, the LED is still amber, and the F1237 is still on the list of faults all the way up to the main chassis node on the Equipment tab.

What really seems weird is that the SEL for this blade now shows the 4 entries below (the same entries it was showing from back in April until I cleared the SEL and reset the CIMC), but those 4 entries now have TODAY's date on them, even though the Last Transition for the F1237 still shows as 2015-04-01, and this blade was not rebooted today. Furthermore, this blade was rebooted over the past weekend, and that does show up in the SEL, but there were no ECC errors during POST that time.

Memory DDR4_P1_A1_ECC #0x81 | DURING POST: Upper critical - going high | Asserted | Reading 60250 >= Threshold 16000 error

Memory DDR4_P1_A1_ECC #0x81 | DURING POST: Upper Non-recoverable - going high | Asserted | Reading 60250 >= Threshold 60250 error

Memory DDR4_P1_A2_ECC #0x82 | DURING POST: Upper critical - going high | Asserted | Reading 60250 >= Threshold 16000 error

Memory DDR4_P1_A2_ECC #0x82 | DURING POST: Upper Non-recoverable - going high | Asserted | Reading 60250 >= Threshold 60250 error

How is the SEL showing POST errors with today's date, when the server hasn't even been rebooted today (and the last transition for the F1237 still shows a date from April)?
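For anyone comparing SEL dumps before and after a clear, here is a minimal Python sketch that pulls the sensor name, reading and threshold out of lines like the four above. The regular expression is only inferred from the layout of those four entries, not taken from any official SEL format, and `parse_sel_line` is a hypothetical helper name.

```python
import re

# Pattern inferred from the four SEL lines quoted in this post (an assumption,
# not an official SEL text format).
SEL_PATTERN = re.compile(
    r"(?P<sensor>\S+)\s+#(?P<sensor_id>0x[0-9a-fA-F]+)\s*\|\s*"
    r"(?P<event>[^|]+?)\s*\|\s*(?P<state>\w+)\s*\|\s*"
    r"Reading\s+(?P<reading>\d+)\s*>=\s*Threshold\s+(?P<threshold>\d+)"
)


def parse_sel_line(line):
    """Return a dict describing one SEL threshold entry, or None if it does not match."""
    match = SEL_PATTERN.search(line)
    if match is None:
        return None
    event = match.group("event")
    return {
        "sensor": match.group("sensor"),
        "sensor_id": match.group("sensor_id"),
        "event": event,
        "state": match.group("state"),
        "reading": int(match.group("reading")),
        "threshold": int(match.group("threshold")),
        # True when the entry is labelled as having been raised during POST.
        "during_post": event.upper().startswith("DURING POST"),
    }


if __name__ == "__main__":
    sample = (
        "Memory DDR4_P1_A1_ECC #0x81 | DURING POST: Upper critical - going high "
        "| Asserted | Reading 60250 >= Threshold 16000 error"
    )
    print(parse_sel_line(sample))
```

Running the same parser over the SEL text saved before and after a clear makes it easy to see that only the timestamps changed while the sensor, reading and threshold stayed identical.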

Saurabh Kothari · ‎07-28-2015

Hey,

Did you open a TAC case yet?

The ECC count seems to be high.


Keny Perez · ‎07-29-2015

60250 = This DIMM was flagged 'INOPERABLE' during a previous boot.

Replace it. Since UCSM 2.1, we no longer get false-degraded alerts. If the DIMM is marked as degraded (in this case by the "60250" reading, even though UCSM shows it as operable), it will stay in that state until it is replaced.
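As a rough illustration of that rule, here is a minimal Python sketch. The 60250 value and the 16000 upper-critical threshold are taken from the sensor output in this thread; the function name and the wording of the returned strings are made up for the example and are not anything UCSM or CIMC provides.

```python
# Reading that, per this reply, marks a DIMM flagged inoperable at a previous boot.
# Value taken from this thread; treat it as an assumption for other platforms.
INOPERABLE_SENTINEL = 60250


def interpret_ecc_reading(reading, upper_critical=16000):
    """Map an ECC sensor reading to a short, human-readable interpretation."""
    if reading >= INOPERABLE_SENTINEL:
        return "flagged inoperable at a previous boot: replace the DIMM"
    if reading >= upper_critical:
        return "at or above the upper-critical threshold"
    return "below the upper-critical threshold"


if __name__ == "__main__":
    for value in (1, 16000, 60250):
        print(value, "->", interpret_ecc_reading(value))
```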

HTH,

-Kenny


andrewcalhoun · ‎07-28-2015

I thought I should mention that both DIMM A1 and DIMM A2 show as "Operable" on the window used to reset memory errors for each DIMM, and they both also show "Operable" in the sam_techsupportinfo file from the UCSM tech support log:

Server 7/1:
...

Array 1:
DIMM  Location  Presence  Overall Status  Type   Capacity (MB)  Clock
----  --------  --------  --------------  -----  -------------  -----
1     A1        Equipped  Operable        Other  32768          2133
2     A2        Equipped  Operable        Other  32768          2133

andrewcalhoun · ‎07-29-2015

Not yet. I am waiting for authorization from the end customer to use the support contract associated with this hardware.

My immediate puzzlement is why I cannot get rid of these SEL entries, in an effort to see if the errors return. They clearly say "during POST," and this blade hasn't been rebooted since Saturday.

Yesterday, I went through the process of clearing DIMM errors on DIMMs A1 and A2, clearing the SEL, and resetting the CIMC. Twice. Even though the SEL was cleared each time, these 4 entries (saying "during POST") were re-added to the newly cleared SEL each time, with a brand new timestamp, as though they had just happened, even though the blade hasn't been restarted in 3 days. Not to mention that the reboot on Saturday did not seem to produce any ECC errors during POST; these entries seem to be hanging around from a boot back in April.

andrewcalhoun · ‎07-29-2015

Oh, and the DIMM error counters shown in the chassis (CIMC) show tech won't clear, even though I reset the memory errors on each DIMM individually:

Querying All IPMI Sensors:
Sensor Name    | Reading   | Unit  | Status | LNR | LC | LNC | UNC | UC        | UNR       |
===============|===========|=======|========|=====|====|=====|=====|===========|===========|
DDR4_P1_A1_ECC | 60250.000 | error | UNR    | na  | na | na  | na  | 16000.000 | 60250.000 |
DDR4_P1_A2_ECC | 60250.000 | error | UNR    | na  | na | na  | na  | 16000.000 | 60250.000 |

Even after going through the clearing/resetting procedure twice yesterday, the only fault mentioning ECC errors in the 'show fault details' section of the UCSM show tech shows a Creation Time and a Last Transition Time of 2015-04-01.
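To check a sensor dump like the one above without eyeballing it, a small Python sketch along these lines can list which thresholds each reading has reached. It assumes the pipe-delimited layout shown in this post (a header row, a rule of `=` characters, then one row per sensor); `parse_sensor_table` and `crossed_thresholds` are hypothetical helper names, not part of any Cisco tool.

```python
SAMPLE = """\
Querying All IPMI Sensors:
Sensor Name | Reading | Unit | Status | LNR | LC | LNC | UNC | UC | UNR |
=================|=========|==============|========|=========|=========|=========|=========|=========|=========|
DDR4_P1_A1_ECC | 60250.000 | error | UNR | na | na | na | na | 16000.000 | 60250.000 |
DDR4_P1_A2_ECC | 60250.000 | error | UNR | na | na | na | na | 16000.000 | 60250.000 |
"""


def parse_sensor_table(text):
    """Return one dict per data row, keyed by the header cells."""
    header = None
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip titles, blank lines and the ====|==== rule under the header.
        if "|" not in stripped or set(stripped) <= set("=|"):
            continue
        cells = [cell.strip() for cell in stripped.strip("|").split("|")]
        if header is None:
            header = cells
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def to_float(cell):
    """Convert a table cell to float; 'na' and other non-numbers become None."""
    try:
        return float(cell)
    except ValueError:
        return None


def crossed_thresholds(row):
    """List the upper thresholds (UNC, UC, UNR) that the reading has reached."""
    reading = to_float(row["Reading"])
    crossed = []
    for name in ("UNC", "UC", "UNR"):
        limit = to_float(row.get(name, "na"))
        if reading is not None and limit is not None and reading >= limit:
            crossed.append(name)
    return crossed


if __name__ == "__main__":
    for row in parse_sensor_table(SAMPLE):
        print(row["Sensor Name"], row["Status"], crossed_thresholds(row))
```

Comparing the result before and after "Reset Memory Errors" shows quickly whether the counters actually moved.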

Saurabh Kothari · ‎07-29-2015

So if I understand correctly, these errors are stale?

After clearing the SEL yesterday, did you reset the CIMC and then collect the show-tech logs?

andrewcalhoun · ‎08-03-2015

Yes, the day after starting this thread, I cleared the SEL 2 additional times, and each time, the 4 entries were written to the (otherwise empty) SEL, with time stamps as if they had occurred immediately after the SEL clearing, even though they say "during POST" and this blade had not been booted that day.

I guess Keny Perez must be right; CIMC must be taking whatever occurred during POST so seriously that it keeps putting the entries back into the SEL any time it is cleared, and will not clear the memory ECC errors for that DIMM when I specifically try to do that in inventory.

If I can ever get access to this end client's account, I will open a case to have it troubleshot further or the DIMMs replaced.

Saurabh Kothari · ‎08-03-2015

Glad to have helped you :)

thank you.
