Never heard of the "ADDDC Sparing" feature of Intel CPUs before...
The fun thing about correctable ECC errors in Cisco UCS hardware is that they don't create an alert, but are still logged - and then at some point the hardware will report a "SEL Log full" event, which is an alert. 🤷‍♂️
Ok, so Supermicro has a slightly more useful document on this at https://www.supermicro.com/manuals/other/Memory_RAS_Configuration_User_Guide.pdf
I still don't understand how this ADDDC stuff is supposed to work... Will each module reserve some spare area at boot time to use for mapping? Or is memory hot-swapped out in case of a failure that triggers the feature?
Anyway, "Adaptive Double Device Data Correction (ADDDC) [..] will not issue a performance penalty before a device fails" is useful. A small performance hit is usually better than a crash.
@galaxis In the world I'm used to, crashes are better than performance penalties or undetected errors. I can understand why you'd want DDDC for SSDs, but why for RAM? RAM is expensive, and there are plenty of other things that can fail on a server anyway, so why not just shut it down and swap out RAM sticks when you get too many correctable ECC errors?
@freakazoid That's what we have been doing, but Cisco currently advises enabling ADDDC on hardware that supports the feature. The failure doesn't go undetected, but it also doesn't fill the SEL log with repeating ECC errors. It should also lower the chance of running into an uncorrectable ECC error on an affected module.
Our workload is lots of classic enterprise application VMs, and those usually really don't like crashes, whereas a few percent of extra memory latency will go largely unnoticed.