An Overview of Meta-Monitoring
Unless you use cloud-based services for all your monitoring requirements, you need to monitor your monitoring infrastructure itself. Let's call this meta-monitoring, because it is monitoring applied to the monitoring system. Its main purpose is to give you peace of mind while managing an application stack in production, so that you do not have to wonder whether the components of your monitoring infrastructure are up and running.
Even when you use cloud-based services for monitoring, it is important to ensure that the related agents are up and running in your application stack and that they are reporting metrics back to the cloud or SaaS backend. Usually, if those agents are down or cannot communicate with the backend, the SaaS provider will alert you about it. Let's focus on how to monitor the monitoring infra that you maintain yourself.
Meta-Monitoring Requirements
Infrastructure
This is the basic monitoring requirement to check whether the servers hosting the monitoring applications are available and accessible. The requirement is similar to the infrastructure monitoring requirements for servers that host applications.
Application Services
On the monitoring servers, the related services and processes must be running. If the monitoring application is agent-based, the monitoring agents must also be running on the application servers.
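A minimal sketch of such a process check, assuming a Unix-like host where `ps -eo comm=` lists command names. The function names are illustrative, and any process names you pass in are your own; nothing here is tied to a specific monitoring product.

```python
import subprocess


def running_process_names() -> set:
    """Return the command names of all running processes, as reported by ps.

    Assumes a Unix-like system; on some platforms ps prints full paths
    instead of short command names, so adjust the comparison accordingly.
    """
    result = subprocess.run(
        ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
    )
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def missing_processes(required, running) -> list:
    """Return the required process names that are absent from `running`."""
    return sorted(set(required) - set(running))
```

A scheduled job could call `missing_processes(["monitoring-server"], running_process_names())` (the process name is hypothetical) and raise an alert whenever the returned list is not empty.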
Network Access
Usually, network access between a monitoring application server and its agents running in production environments is restricted. Changes to the network configuration can break the connectivity between the server and an agent. Simple network checks using Telnet or Netcat, which mimic the kind of network access the monitoring system requires, can be implemented to detect this.
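The same kind of check that Telnet or Netcat performs can be scripted. The sketch below assumes the agent listens on a TCP port; the function name is illustrative, and the host and port you pass in depend on your own monitoring system.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Running this from the monitoring server against each agent, and from an agent host against the server, exercises the same network paths the monitoring traffic uses.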
Monitoring Activity
By tracking the log files generated by the monitoring application, we can get some idea of the availability and health of the monitoring infrastructure. Log files from the monitoring servers and agent logs from the application servers can be indexed by a log aggregation and indexing system like ELK. This method is more reliable with a cloud-based log aggregation and indexing service such as Sumo Logic or Loggly, which runs outside your own infrastructure.
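Even before logs reach an aggregation system, a simple freshness check can reveal a stalled monitoring process. The sketch below assumes that a healthy monitoring application writes to its log file regularly; the function name and the age threshold are illustrative.

```python
import os
import time


def log_is_fresh(path: str, max_age_seconds: float, now: float = None) -> bool:
    """Return True if the file at `path` was modified within `max_age_seconds`.

    A missing or unreadable file is treated as stale.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False
    return (now - mtime) <= max_age_seconds
```

The right threshold depends on how often the application logs when idle, so it should be tuned per log file rather than set globally.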
Main Challenge
The main challenge with implementing meta-monitoring is the availability of reliable infrastructure, though this is not a major issue if your monitoring infrastructure is hosted in a public cloud. If it is hosted in a data center (DC) or colocation facility (colo), building meta-monitoring is more challenging. If a DC or colo goes down or becomes inaccessible, the servers that host meta-monitoring also go down, and there will not be any reliable way to know about such an incident.
Implementing a highly reliable method to check on the availability of monitoring systems is the key to the success of a meta-monitoring system. During the DC-only era, the options for implementing reliable meta-monitoring were limited. With the availability of cloud-based infra and SaaS monitoring applications, that task has become much simpler, as we will see in the subsequent sections.
Meta-Monitoring Methods
Different configurations of meta-monitoring are possible. Your specific setup depends largely on where your infrastructure is hosted and which applications you have in your monitoring toolchain.
Single Data Center
When all your applications are in one DC or colo and you are not using a SaaS-based monitoring service, your options are limited. These are the strategies to follow in that situation.
Set up a secondary monitoring server to monitor the primary monitoring server (and nothing else).
From the secondary monitoring server, send periodic heartbeat notifications to your on-call distribution list about the availability of the primary monitoring server. The on-call team should be instructed to watch for these notifications.
The secondary monitoring server should be monitored by the primary like it monitors the application servers in production.
The on-call team should investigate these scenarios:
Alerts from the secondary monitoring server about issues with the primary monitoring infra.
The absence of the periodic heartbeats from the secondary monitoring server, in which case there could be availability issues with the DC or colo itself.
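The two scenarios above can be expressed as a small triage rule. This is a sketch only: the function name, the status labels, and the 15-minute interval with a 5-minute grace period are assumptions for illustration, not values from any particular tool.

```python
from datetime import datetime, timedelta


def triage(
    last_heartbeat: datetime,
    primary_alert_active: bool,
    now: datetime,
    interval: timedelta = timedelta(minutes=15),  # assumed heartbeat period
    grace: timedelta = timedelta(minutes=5),      # assumed tolerance for delays
) -> str:
    """Classify the state of the monitoring setup from the on-call viewpoint."""
    # A missing heartbeat takes priority: the secondary itself, or the whole
    # DC or colo, may be unavailable, so its alerts cannot be trusted either.
    if now - last_heartbeat > interval + grace:
        return "HEARTBEAT_MISSING"
    if primary_alert_active:
        return "PRIMARY_ISSUE"
    return "OK"
```

Checking the heartbeat first reflects the reasoning in the list: silence from the secondary server is itself a signal, and it should be investigated even when no alert has arrived.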
Multi-Data Center Environment
This is a variation of the single DC or colo strategy. In this configuration, the primary and secondary monitoring servers are hosted in different DCs or colos for reliability.
In a different DC or colo from the primary, set up a secondary monitoring server that monitors only the primary monitoring server.
The secondary monitoring server should be monitored by the primary like it monitors the application servers in production.
There is no need to send out heartbeats from the secondary server in this case, because there is a fairly good chance that either the primary or the secondary monitoring infra will be available, and one of them will alert on availability issues with the other.
With Access to Public Cloud
If you have a footprint in the public cloud, the task of implementing meta-monitoring becomes simpler. The role of the secondary monitoring server, as we have seen in the DC and colo-only strategies, can be taken over by native cloud monitoring features like CloudWatch in AWS. If the monitoring infrastructure goes down, the cloud-native monitoring will alert you.
Note that public cloud monitoring services like CloudWatch will not cover your overall monitoring requirements, but they are very useful for building meta-monitoring. Using CloudWatch-like features for meta-monitoring makes the monitoring infrastructure highly reliable.
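As one possible illustration, the sketch below builds a CloudWatch alarm on the EC2 `StatusCheckFailed` metric. It assumes the monitoring server runs on an EC2 instance, that the third-party `boto3` SDK is installed with credentials configured, and that an SNS topic for notifications already exists. The function names, alarm name, period, and evaluation count are illustrative choices.

```python
def status_check_alarm_params(instance_id: str, sns_topic_arn: str) -> dict:
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"meta-monitoring-{instance_id}-status-check",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,               # seconds per evaluation window (assumed)
        "EvaluationPeriods": 2,     # consecutive failing windows (assumed)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Treat missing data as a failure, since a dead host reports nothing.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_status_check_alarm(instance_id: str, sns_topic_arn: str) -> None:
    """Create or update the alarm; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the parameter builder has no dependency

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**status_check_alarm_params(instance_id, sns_topic_arn))
```

This only covers host-level availability. Process-level or application-level checks on the monitoring server would need custom metrics or another mechanism.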
With a SaaS Monitoring Service
If there are cloud-based monitoring services in your monitoring toolchain, all you have to do is include the monitoring infra that you manage in the scope of a cloud-based monitoring service. The cloud-based monitoring could be general-purpose services like Datadog and VictorOps, or log aggregation and indexing services like Loggly and Sumo Logic.
It also means that if you are using cloud-based monitoring services for both general monitoring and log aggregation and indexing, there is no separate meta-monitoring requirement. Normally, that is implicitly covered by the cloud-based monitoring services. You only need to make sure that the SaaS provider can actually notify you of any failures on their backend or with their monitoring agents hosted in your environment.
If a general-purpose SaaS monitoring service is used for meta-monitoring, your local monitoring infra will be monitored like any other application stack, with checks on the availability of servers and processes.
If adequate logging info is available from the local monitoring applications, those log files can be fed into a SaaS-based log aggregation platform like Sumo Logic or Loggly. If the monitoring applications' own log files are not adequate, dedicated log files carrying meta-monitoring metrics can be generated instead.
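One way to generate such a log file is to append one JSON record per metric, since line-oriented JSON is simple for log collectors to ship and parse. The field names (`ts`, `host`, `metric`, `value`) and function names below are assumptions for illustration, not a format required by any provider.

```python
import json
import time


def meta_metric_line(host: str, metric: str, value, ts: float = None) -> str:
    """Format one meta-monitoring metric as a single JSON log line."""
    record = {
        "ts": time.time() if ts is None else ts,  # Unix timestamp
        "host": host,
        "metric": metric,
        "value": value,
    }
    # sort_keys keeps the field order stable across runs.
    return json.dumps(record, sort_keys=True)


def append_metric(path: str, host: str, metric: str, value) -> None:
    """Append a metric record to the log file at `path`."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(meta_metric_line(host, metric, value) + "\n")
```

A periodic job on each monitoring server could call `append_metric` with values such as process counts or check results, and the log collector then forwards the file like any other application log.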
With a Last-Mile Monitoring Service
A cloud-based monitoring service has to be used to cover last-mile monitoring. If that service is available in your monitoring toolchain, it can also be used to implement meta-monitoring, provided that meta-monitoring endpoints are accessible from the internet as REST APIs or health-check URLs.
Typically, monitoring infra is not exposed to the internet, but monitoring health-checks that are reachable from the internet can be used in creative ways, including checking them from mobile apps.
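A client-side probe of a health-check URL might look like the sketch below. It assumes the endpoint answers a plain HTTP GET with a 2xx status when healthy; the function name is illustrative, and the URL you pass in is whatever your monitoring infra exposes.

```python
import http.client
import urllib.error
import urllib.request


def health_check(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to `url` returns a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # HTTP error statuses, connection failures, timeouts, malformed
        # responses, and malformed URLs all count as unhealthy.
        return False
```

Because the probe treats every failure mode as unhealthy, a caller only has to decide how many consecutive failures justify an alert.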
Conclusion
A combination of access to public cloud-based infrastructure and SaaS-based monitoring eliminates the need to build meta-monitoring explicitly. However, even in that ideal environment, you need to make sure that your monitoring is monitored.
Without much cloud support available for monitoring, it is important that your monitoring infrastructure and processes cover all the requirements of meta-monitoring to make the overall monitoring system highly reliable.