Zabbix Template for ECC Errors
If your system exposes ECC error counters under /sys/devices/system/edac/mc/mc*/, you can use this template to monitor them. I found a template elsewhere on the internet, but it was hardcoded to monitor only mc0, the first memory controller. Many CPUs that support ECC have multiple memory controllers, and multi-socket systems have more still.
zabbix_export:
  version: '7.0'
  template_groups:
    - uuid: 449d04f6170e4bdea0501d60281f2da0
      name: 'Linux servers'
  templates:
    - uuid: 215fb4ca3cc545a4bff18cc1874096fe
      template: 'Template EDAC Memory Controller'
      name: 'Template EDAC Memory Controller'
      groups:
        - name: 'Linux servers'
      discovery_rules:
        - uuid: 2dbba7c4792a40d5bd48d274438f0d42
          name: 'Discover Memory Controllers'
          key: 'vfs.dir.get["/sys/devices/system/edac/mc/","^mc[0-9]+$",,,,0]'
          lifetime: 30d
          item_prototypes:
            - uuid: ea0397643c1445669913f7be6a353791
              name: 'ECC CE Count for {#MC}'
              key: 'vfs.file.contents[/sys/devices/system/edac/mc/{#MC}/ce_count]'
              preprocessing:
                - type: STR_REPLACE
                  parameters:
                    - \n
                    - ''
              trigger_prototypes:
                - uuid: d7b7d4c4a72e4e69a1d4b67b832b4d7c
                  expression: 'last(/Template EDAC Memory Controller/vfs.file.contents[/sys/devices/system/edac/mc/{#MC}/ce_count])>0'
                  name: 'ECC CE Count Increased on {HOST.NAME}'
                  priority: WARNING
            - uuid: 3afe9596479b47428422a49c4b4cab57
              name: 'ECC UE Count for {#MC}'
              key: 'vfs.file.contents[/sys/devices/system/edac/mc/{#MC}/ue_count]'
              preprocessing:
                - type: STR_REPLACE
                  parameters:
                    - \n
                    - ''
              trigger_prototypes:
                - uuid: c6e25e2a3b0b4bbf9d1a8c3e5e7a6f4b
                  expression: 'last(/Template EDAC Memory Controller/vfs.file.contents[/sys/devices/system/edac/mc/{#MC}/ue_count])>0'
                  name: 'ECC UE Count Increased on {HOST.NAME}'
                  priority: DISASTER
          lld_macro_paths:
            - lld_macro: '{#MC}'
              path: $.basename
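This sets up two items per discovered memory controller: one for correctable errors (ce_count) and one for uncorrectable errors (ue_count). Each item also gets a trigger that fires when its count is above zero, at Warning severity for correctable errors and Disaster severity for uncorrectable ones.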
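Before importing the template, it can help to confirm that a host actually exposes the files the items read. The following is a minimal sketch, not part of the template or of Zabbix: it walks the same directory the discovery rule uses, applies the same ^mc[0-9]+$ pattern, and reads ce_count and ue_count for each controller. The helper name read_ecc_counts and its base parameter are my own, chosen for illustration.

```python
import re
from pathlib import Path

# Directory the template's discovery rule scans.
EDAC_MC_DIR = Path("/sys/devices/system/edac/mc")
# Same name pattern as the discovery rule's include regex.
MC_NAME = re.compile(r"^mc[0-9]+$")


def read_ecc_counts(base=EDAC_MC_DIR):
    """Return {controller: {"ce_count": int, "ue_count": int}}.

    Hypothetical helper for a quick manual check; it returns an empty
    dict when the EDAC directory does not exist on this host.
    """
    base = Path(base)
    counts = {}
    if not base.is_dir():
        return counts
    for entry in sorted(base.iterdir()):
        # Skip anything that is not an mcN directory (e.g. "power").
        if not (entry.is_dir() and MC_NAME.match(entry.name)):
            continue
        values = {}
        for counter in ("ce_count", "ue_count"):
            path = entry / counter
            if path.is_file():
                # strip() drops the trailing newline, which is what the
                # template's STR_REPLACE preprocessing step does.
                values[counter] = int(path.read_text().strip())
        counts[entry.name] = values
    return counts


if __name__ == "__main__":
    for mc, values in read_ecc_counts().items():
        print(mc, values)
```

If this prints nothing, the EDAC driver is probably not loaded or the platform does not report through it, and the discovery rule will find no controllers either.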