The HANA cluster node fails

A HANA cluster node on a 3+1, 512 GiB scale-out cluster failed due to swapping (paging to paging space).

At the time of the issue there were no unload events, which suggested that we were running out of memory before HANA reached its Global Allocation Limit (GAL). The GAL was reduced, but the issue happened again.

We looked for non-HANA processes consuming memory, adding them to detailed process monitoring via the appmon.xml in Monitiq, but we found nothing interesting to report: we could not find any significant consumer of memory outside of HANA.

The memory metrics did not add up

We did a rudimentary calculation to sum the total memory used by all processes (ps aux | awk '{tot += $6} END {print tot}') and compared this with the total memory consumed by processes as reported by Monitiq (GiB in use by processes and OS). We were surprised to find that there was 36 GiB unaccounted for! This is quite unusual: normally you would expect the sum across processes to add up to more than the memory actually used, because the "ps aux" values do not take into account the shared components of memory usage. So where was the missing memory?
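For reference, the rough comparison can be sketched as below. Field 6 of "ps aux" is RSS in KiB, and "used_gib" is just an illustrative helper name, not part of Monitiq:

```shell
# Sum resident set sizes (ps column 6, RSS in KiB) across all processes,
# skipping the header line, and print the total in GiB
ps aux | awk 'NR > 1 {tot += $6} END {printf "%.1f GiB in processes\n", tot/1024/1024}'

# The kernel's view of used memory: MemTotal minus free, buffers and cache,
# all reported in KiB in /proc/meminfo
used_gib() {
  awk '/^MemTotal:/ {t=$2} /^MemFree:/ {f=$2} /^Buffers:/ {b=$2} /^Cached:/ {c=$2}
       END {printf "%.1f GiB used by processes and OS\n", (t-f-b-c)/1024/1024}' "$1"
}
[ -r /proc/meminfo ] && used_gib /proc/meminfo || true
```

If the second figure is markedly larger than the first, something outside the process table is holding memory.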

All becomes clear, SLAB is revealed!

Closer inspection of /proc/meminfo revealed the metric "SUnreclaim" at 36 GiB. SUnreclaim is the unreclaimable component of the SLAB, an area of memory used by kernel-level drivers that is not directly associated with processes, and hence invisible to process-level metrics.
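A quick way to pull that value out and convert it (SUnreclaim is reported in KiB; "sunreclaim_gib" is just an illustrative helper name):

```shell
# Print the unreclaimable slab size from a meminfo file, converted to GiB
sunreclaim_gib() {
  awk '/^SUnreclaim:/ {printf "%.1f\n", $2/1024/1024}' "$1"
}
[ -r /proc/meminfo ] && sunreclaim_gib /proc/meminfo || true
```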

You can run the slabtop command to see detailed metrics on slab usage. When we did this it became evident that something was rapidly allocating new slab objects but not releasing them. We found a patch and fixed the issue.
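slabtop gives a live view; to spot a cache that only ever grows, one approach (a sketch, not how Monitiq does it — "slab_growth" is a hypothetical helper) is to diff two snapshots of /proc/slabinfo, which slabtop itself reads:

```shell
# Hypothetical helper: report slab caches whose active-object count grew
# between two snapshots of /proc/slabinfo
# (column 1 = cache name, column 2 = active objects; header lines skipped)
slab_growth() {
  awk '$1 == "slabinfo" || $1 == "#" {next}
       NR == FNR {a[$1] = $2; next}
       ($1 in a) && $2 + 0 > a[$1] + 0 {print $1, $2 - a[$1], "new objects"}' "$1" "$2"
}

# Usage (reading /proc/slabinfo needs root on most distributions):
#   cat /proc/slabinfo > /tmp/slab.1; sleep 60; cat /proc/slabinfo > /tmp/slab.2
#   slab_growth /tmp/slab.1 /tmp/slab.2
```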

We also fed this new OS knowledge back into Monitiq agent development: the agent now collects all of the /proc/meminfo metrics, so that we can choose which ones we are interested in as the OS evolves, without having to upgrade the agent.
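The collection side of that idea can be sketched in a few lines — "meminfo_collect" is an illustrative name, not the agent's actual implementation. Every field is emitted as "name value", with parentheses normalised so names are safe as metric keys:

```shell
# Hypothetical collector sketch: emit every /proc/meminfo field as
# "name value_kib", leaving the back-end to decide which to store and alert on
meminfo_collect() {
  awk -F'[: ]+' '{gsub(/[()]/, "_", $1); print $1, $2}' "$1"
}
[ -r /proc/meminfo ] && meminfo_collect /proc/meminfo || true
```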

From agent version 1.4.17 we store and alert on the SUnreclaim value.
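The agent's actual alerting logic is internal to Monitiq, but an equivalent standalone check might look like this ("sunreclaim_alert" and the threshold handling are hypothetical):

```shell
# Hypothetical check: warn when SUnreclaim exceeds a limit given in GiB
# usage: sunreclaim_alert <meminfo-file> <limit-gib>
sunreclaim_alert() {
  awk -v lim="$2" '/^SUnreclaim:/ {
    gib = $2/1024/1024
    if (gib > lim) { printf "ALERT: SUnreclaim %.1f GiB > %s GiB\n", gib, lim; exit 1 }
  }' "$1"
}
```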


It seems that this is an area that is still evolving, and depending on your Linux flavour and version you might have a different set of metrics.
See the kernel documentation on the SLAB/SLOB/SLUB allocators for details.

But because we now collect all of them on the agent, when we identify something useful we can switch on the collection, storage and alerting on the back-end without making a change on the customers' systems.