Zabbix Template for ECC Errors

If your system exposes ECC error counters under /sys/devices/system/edac/mc/mc*/, you can use this template to monitor them. I found a template elsewhere on the internet, but it was hardcoded to monitor only mc0, the first memory controller. Many CPUs that support ECC have multiple memory controllers, and multi-socket systems have more still.
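Before importing the template, it may be worth confirming that the host actually exposes these counters. The following is a minimal Python sketch, assuming the usual EDAC sysfs layout (one mcN directory per controller, each holding ce_count and ue_count files). The function name read_edac_counts and its root parameter are my own, not part of Zabbix or the kernel.

```python
import re
from pathlib import Path

# Same pattern the template's discovery rule uses to pick out controllers.
MC_PATTERN = re.compile(r"^mc[0-9]+$")


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int, "ue_count": int}} for each mcN directory.

    Hypothetical helper: returns an empty dict if the directory does not exist,
    which usually means EDAC is not available or its driver is not loaded.
    """
    base = Path(root)
    counts = {}
    if not base.is_dir():
        return counts
    for entry in sorted(base.iterdir()):
        if not (entry.is_dir() and MC_PATTERN.match(entry.name)):
            continue  # skip anything that is not a memory controller directory
        per_mc = {}
        for counter in ("ce_count", "ue_count"):
            path = entry / counter
            if path.is_file():
                # The file holds a decimal number followed by a newline.
                per_mc[counter] = int(path.read_text().strip())
        counts[entry.name] = per_mc
    return counts


if __name__ == "__main__":
    for mc, values in read_edac_counts().items():
        print(mc, values)
```

If this prints nothing, the template's discovery rule will most likely find nothing either.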

zabbix_export:
  version: '7.0'
  template_groups:
    - uuid: 449d04f6170e4bdea0501d60281f2da0
      name: 'Linux servers'
  templates:
    - uuid: 215fb4ca3cc545a4bff18cc1874096fe
      template: 'Template EDAC Memory Controller'
      name: 'Template EDAC Memory Controller'
      groups:
        - name: 'Linux servers'
      discovery_rules:
        - uuid: 2dbba7c4792a40d5bd48d274438f0d42
          name: 'Discover Memory Controllers'
          key: 'vfs.dir.get["/sys/devices/system/edac/mc/","^mc[0-9]+$",,,,0]'
          lifetime: 30d
          item_prototypes:
            - uuid: ea0397643c1445669913f7be6a353791
              name: 'ECC CE Count for {#MC}'
              key: 'vfs.file.contents[/sys/devices/system/edac/mc/{#MC}/ce_count]'
              preprocessing:
                - type: STR_REPLACE
                  parameters:
                    - \n
                    - ''
              trigger_prototypes:
                - uuid: d7b7d4c4a72e4e69a1d4b67b832b4d7c
                  expression: 'last(/Template EDAC Memory Controller/vfs.file.contents[/sys/devices/system/edac/mc/{#MC}/ce_count])>0'
                  name: 'ECC CE Count Increased on {HOST.NAME}'
                  priority: WARNING
            - uuid: 3afe9596479b47428422a49c4b4cab57
              name: 'ECC UE Count for {#MC}'
              key: 'vfs.file.contents[/sys/devices/system/edac/mc/{#MC}/ue_count]'
              preprocessing:
                - type: STR_REPLACE
                  parameters:
                    - \n
                    - ''
              trigger_prototypes:
                - uuid: c6e25e2a3b0b4bbf9d1a8c3e5e7a6f4b
                  expression: 'last(/Template EDAC Memory Controller/vfs.file.contents[/sys/devices/system/edac/mc/{#MC}/ue_count])>0'
                  name: 'ECC UE Count Increased on {HOST.NAME}'
                  priority: DISASTER
          lld_macro_paths:
            - lld_macro: '{#MC}'
              path: $.basename

This sets up two items per discovered memory controller: one for correctable errors (CE) and one for uncorrectable errors (UE). Each item also gets a trigger that fires when its count is above zero.
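To make the discovery step concrete, here is a rough Python sketch of how each discovered controller turns into two item keys, and how the last(...)>0 trigger condition applies to them. It only imitates the template's logic and does not call Zabbix. The sample rows are hypothetical and trimmed to the single basename field that the template's lld_macro_paths entry reads. The function names are mine.

```python
# Item key layout copied from the template's item prototypes.
KEY_FORMAT = "vfs.file.contents[/sys/devices/system/edac/mc/{mc}/{counter}]"


def expand_item_keys(discovery_rows):
    """Build the two item keys (CE and UE) for every discovered controller."""
    keys = []
    for row in discovery_rows:
        mc = row["basename"]  # mirrors the mapping {#MC} <- $.basename
        for counter in ("ce_count", "ue_count"):
            keys.append(KEY_FORMAT.format(mc=mc, counter=counter))
    return keys


def firing_triggers(latest_values):
    """Return the keys whose latest value is above zero, like last(...)>0."""
    return [key for key, value in latest_values.items() if value > 0]


if __name__ == "__main__":
    # Hypothetical discovery result for a host with two memory controllers.
    rows = [{"basename": "mc0"}, {"basename": "mc1"}]
    keys = expand_item_keys(rows)
    for key in keys:
        print(key)

    # Made-up readings: only mc1 has logged correctable errors.
    readings = {key: 0 for key in keys}
    readings[KEY_FORMAT.format(mc="mc1", counter="ce_count")] = 4
    print(firing_triggers(readings))
```

Note that because the condition is "above zero" rather than "changed since the last check", a trigger stays in the problem state until the counter is reset, for example by a reboot.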
