Monitoring HANA System Replication with Monitiq

HANA DB can replicate data to a second server or cluster for DR or clustering. This blog covers the metrics we use in Monitiq to manage the day-to-day operation of replication.

(For more information on HANA replication see http://scn.sap.com/docs/DOC-47702)

The Monitiq HANA agent displays the following metrics for each of the HANA services.

Path – Path of the volume being replicated
Volume ID – Integer reference for the volume being replicated
Site Name – The name of the primary live site
Secondary Host – The name of the target/destination host
Secondary Site Name – The name of the secondary/target/destination site
Replication Status – The current status of the replication for this service/volume. Normally “ACTIVE”. This is the metric used for alert rules
Replication Mode – SYNC, ASYNC, SYNCMEM, etc. (see http://scn.sap.com/docs/DOC-47702)
Last Log Position Time – The last position written to the log on the primary site
Shipped Log Position Time – The last position sent to the secondary site
Seconds Behind – Relevant to ASYNC connections: the number of seconds the secondary site is lagging behind the primary
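
As a rough illustration of how the last few metrics relate, the sketch below derives a lag figure from the two log position times and checks the status against "ACTIVE". It is an assumption that Seconds Behind is simply the difference between the two timestamps; the function names are hypothetical and are not part of the Monitiq agent.

```python
from datetime import datetime


def seconds_behind(last_log_position_time, shipped_log_position_time):
    """Whole seconds between the last log write on the primary and the
    last position shipped to the secondary (assumed definition)."""
    lag = (last_log_position_time - shipped_log_position_time).total_seconds()
    # Never report a negative lag if the timestamps are slightly out of order.
    return max(0, int(lag))


def needs_alert(replication_status, expected="ACTIVE"):
    """True when the replication status is anything other than the expected one."""
    return replication_status.strip().upper() != expected


last_written = datetime(2015, 1, 1, 12, 0, 30)
last_shipped = datetime(2015, 1, 1, 12, 0, 0)
print(seconds_behind(last_written, last_shipped))  # 30
print(needs_alert("ACTIVE"))                       # False
```

In practice the alert rule is built on Replication Status, with the lag figure useful as supporting context for ASYNC links.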

When running sacrificial DR, don’t forget to monitor row storage vs GAL

‘Sacrificial DR (Disaster Recovery)’ is the term for making good use of idle resources on a Disaster Recovery node by running QA/Test/Development environments there, on the understanding that these additional non-prod environments will be switched off (sacrificed) when DR is required. To prevent over-allocation of memory on the DR node, you need to reduce the amount of memory the DR instance is allowed to use. The recommendation is to set the Global Allocation Limit (GAL) to 64GB or row storage + 20GB, whichever is greater. If the row data grows over time, the GAL will need to be raised to match. If row storage + 20GB exceeds the GAL on the DR instance, the replication will fail.

Monitiq has an additional monitor that tracks the level of row data against the GAL of the target instance. It also indicates the overall level of memory commitment, so you can check that the GALs do not add up to more than the available memory on the cluster node. The metrics sent by the agent are:

Cluster Node – The primary server
Secondary Host – The destination server
Row memory in use GB – The size of row memory data in the production instance
Secondary Host Allocation Limit GB – The global allocation limit on the DR instance
Recommended Minimum Secondary Host Allocation Limit GB – 64GB or 20GB + row storage, whichever is greater
Spare Capacity GB – Calculated as [Secondary Host Allocation Limit GB] minus [Recommended Minimum Secondary Host Allocation Limit GB]. This is the space on the replication target for the DR instance to grow into. If this is significantly high, the GAL could be reduced to free memory for other non-prod instances
Used Capacity % – Measures the required size of the DR instance against the Global Allocation Limit (GAL) set on the DR instance. If this is greater than 100% then the replication would fail, as there would not be enough memory to contain the row store
Allocation Commitment Of Secondary Host (% of Physical Mem) – The sum of all GALs set on this physical server (DR+QA+DEV etc.). If this is over 97% of physical memory then the system is over-provisioned and is likely to start swapping
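
To make the arithmetic behind these metrics concrete, here is a minimal sketch of how they could be derived from the raw figures. It is not the agent's actual code: the function and key names are hypothetical, and reading Used Capacity % as the recommended minimum divided by the GAL is our interpretation of the description above.

```python
def recommended_min_gal_gb(row_memory_gb):
    """64GB or row storage + 20GB, whichever is greater."""
    return max(64.0, row_memory_gb + 20.0)


def gal_metrics(row_memory_gb, secondary_gal_gb, all_gals_gb, physical_mem_gb):
    """Derive the GAL metrics for one DR instance.

    all_gals_gb lists the GAL of every instance on the secondary host
    (DR + QA + DEV etc.), including the DR instance itself.
    """
    recommended = recommended_min_gal_gb(row_memory_gb)
    return {
        "recommended_min_gal_gb": recommended,
        # Room left for the DR instance to grow into.
        "spare_capacity_gb": secondary_gal_gb - recommended,
        # Over 100 means the GAL is too small for the row store (assumed formula).
        "used_capacity_pct": 100.0 * recommended / secondary_gal_gb,
        # Over 97 suggests the host is over-provisioned.
        "allocation_commitment_pct": 100.0 * sum(all_gals_gb) / physical_mem_gb,
    }


# Example figures only: 50GB of row data, DR GAL of 80GB,
# two non-prod instances at 100GB and 60GB, 256GB of physical memory.
metrics = gal_metrics(50.0, 80.0, [80.0, 100.0, 60.0], 256.0)
print(metrics["recommended_min_gal_gb"])     # 70.0
print(metrics["spare_capacity_gb"])          # 10.0
print(metrics["used_capacity_pct"])          # 87.5
print(metrics["allocation_commitment_pct"])  # 93.75
```

With these example figures the DR instance has 10GB of headroom and the host is committed to just under 94% of physical memory, so neither threshold is breached.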

The GAL checks have been available since v1.3.7 and the replication checks have been available since v1.3.5. For more information on agent releases see: https://www.monitiq.com/monitiq-agent-release-notes/