7 Tips for Using Instrumentation and Metrics To Align Site Reliability With Business Goals

Before the acceleration of modern DevOps practices, software engineers primarily wrote code. Now the job covers much more: getting apps production-ready, iterating quickly to scale new services, architecting for system compatibility, and ensuring compliance and reliability. That broader scope has elevated the need for exceptional instrumentation. But what does great instrumentation involve, and where should you begin?

I answer this question in a new book on observability that I co-authored with Chronosphere’s co-founder and CEO, Martin Mao, and cloud-native expert Kenichi Shibata. The book is O’Reilly’s Cloud Native Monitoring: Practical Challenges and Solutions for Modern Architecture.

Instrumentation is the process of building a great observability function, which first and foremost includes standardized metrics and dashboards tied to business context. With exceptional instrumentation that aligns site reliability with business goals, software engineers and site reliability engineers (SREs) can give their organization a competitive advantage. In chapter 5, we share tips and tricks for building a great instrumentation and metrics function, which I have excerpted and paraphrased in this article.

7 Ways To Build Great Instrumentation and Metrics Functions     

The trick to great instrumentation is building a great metrics function that helps your organization find the right balance between too much and not enough information. Our book explains seven ways to achieve that goal.                                    

1. Start With Out-of-the-Box Standard Instrumentation and Dashboarding

SRE teams and software engineers using open source tools can enable standardized metrics and dashboards right out of the box, which lets them get started right away.

They can, for example:

2. Enlist Internal Software Engineers and SRE/Observability Teams To Deliver Standard Dashboards

Internal software engineers or SRE/observability teams are better placed than any vendor to build standardized dashboards because they know your business context best. That makes them best positioned to achieve the desired business outcomes.

In practice, they’ll know how to:

3. Add Business Context To Standardized Metrics

With standardized metrics and dashboards in place, you can begin to add your business context. These three examples illustrate how:

4. Create SLOs From Standardized Instrumentation

Derived from key metrics, service levels help you align site reliability with your business goals. These three concepts, defined by Stavros Foteinopoulos of Mattermost,[1] are key to understanding service levels: 

Here’s an example of those three concepts in practice:

Suppose the fictitious Feline Company builds an API that provides cat memes for app developers to use. Its SLI is the percentage of time that the API is available to all downstream external customers. If its SLO is 99% availability, it is promising customers no more than 3.65 days of downtime per year (1% of 365 days). This is measured by an error rate formula. If downtime exceeds the limit specified in the SLO, the SLA states that the company will donate $1 to an animal shelter for each additional minute of downtime.
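The arithmetic behind the Feline Company example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the period is a 365-day year, and the $1-per-minute penalty is taken from the example above.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_days(slo: float, period_days: float = 365) -> float:
    """Days of downtime an availability SLO permits over the period."""
    return (1 - slo) * period_days


def sla_penalty_dollars(
    downtime_minutes: float,
    slo: float,
    period_days: float = 365,
    dollars_per_minute: float = 1.0,  # assumption: the $1-per-minute donation
) -> float:
    """Penalty owed for downtime beyond what the SLO allows."""
    budget_minutes = allowed_downtime_days(slo, period_days) * MINUTES_PER_DAY
    excess_minutes = max(0.0, downtime_minutes - budget_minutes)
    return excess_minutes * dollars_per_minute


if __name__ == "__main__":
    # A 99% SLO leaves about 3.65 days (about 5,256 minutes) of downtime a year.
    print(round(allowed_downtime_days(0.99), 2))
    # 5,300 minutes of downtime overshoots that budget by about 44 minutes.
    print(round(sla_penalty_dollars(5300, 0.99), 2))
```

Writing the calculation down this way also makes it easy to see how quickly the budget shrinks as the target tightens: each additional "nine" cuts the allowed downtime by a factor of ten.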

Steven Thurgood of Google writes in The Site Reliability Workbook that error budgets “are the tool SRE uses to balance service reliability with the pace of innovation.”[2] 

If your company’s SLI is availability, you have likely already instrumented a set of standardized metrics, like the Prometheus RED metrics. In that case, you can use those standardized metrics to build a dashboard, which makes it fairly straightforward to create realistic SLOs based on observed performance. But you must also standardize what each SLI means across the organization. For example, what does 99% availability mean? The rule of thumb here is consistency across the organization.
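One way to keep the meaning of an SLI consistent is to write its definition down as shared code. The sketch below is an assumption rather than the book's method: it defines availability as the share of successful requests (the kind of count the "rate" and "errors" parts of RED metrics provide) and reports how much of the error budget is left. The function names are hypothetical.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Request-based availability: the share of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as available
    return 1 - failed_requests / total_requests


def error_budget_remaining(
    total_requests: int, failed_requests: int, slo: float
) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 2,500 failures in 1,000,000 requests against a 99% SLO.
    print(round(availability_sli(1_000_000, 2_500), 4))
    print(round(error_budget_remaining(1_000_000, 2_500, 0.99), 2))
```

A request-based definition and a time-based definition (minutes of downtime) can give different answers for the same incident, which is why picking one and applying it everywhere matters.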

5. Be Sure to Monitor the Monitor                                                    

It’s important to ask two questions about monitoring: Who monitors the monitoring system? And what happens when it goes down?

These three rules can help ensure reliability:                                    

6. Establish Write and Read Limits                                                      

Metric cardinality is fundamentally multiplicative. One engineer can write a single query that reads metrics with a cardinality on the order of tens to hundreds of millions of time series. Because you cannot guarantee that the system will not be overloaded when answering such a query, it’s best practice to build a detection system that tells you whether one of your queries or writes will cause an outage.
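To see why cardinality multiplies, consider a rough upper-bound estimate in Python. This is a minimal sketch, assuming every combination of label values can occur; the label names, counts, and the 10-million-series limit are hypothetical, and a real detection system would measure actual series rather than this worst case.

```python
from math import prod
from typing import Mapping


def series_cardinality(label_values: Mapping[str, int]) -> int:
    """Worst-case series count: the product of distinct values per label."""
    return prod(label_values.values())


def exceeds_limit(label_values: Mapping[str, int], max_series: int) -> bool:
    """True if a read or write could touch more series than the limit allows."""
    return series_cardinality(label_values) > max_series


if __name__ == "__main__":
    # Hypothetical labels on a single request metric.
    labels = {"service": 200, "endpoint": 50, "status_code": 10, "pod": 1000}
    print(series_cardinality(labels))         # 200 * 50 * 10 * 1000
    print(exceeds_limit(labels, 10_000_000))  # checked against a hypothetical limit
```

Four modest-looking labels already reach 100 million potential series, which is the scale the paragraph above warns about.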

Read and write limits matter for both developers and users:

7. Establish a Safe Way to Experiment and Iterate To Drive Innovation

If you are highly dependent on today’s monitoring system and reluctant to make major changes to it, you create a different set of challenges. You have no way to experiment and iterate, and you cannot learn new technologies and tools without risking your current observability stack.

You can avoid this scenario by making it safe to create a new set (or subset) of monitoring data in another observability system. Try to get 10% of all production observability data onto the new system without your developers doing anything more. Different observability systems let your SREs safely experiment — without a developer needing to change or redeploy code — with things such as:

An observability system like Prometheus can scrape data from the /metrics endpoint every 10 seconds instead of every second.
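As an illustration of both ideas, the Python sketch below shows one possible way to pick a stable 10% of series to mirror into a second system, and how a longer scrape interval reduces sample volume. It is an assumption, not a feature of any particular product: the function names and the hash-based selection are hypothetical.

```python
import hashlib


def in_sample(series_id: str, percent: int = 10) -> bool:
    """Deterministically select about `percent`% of series by hashing the ID."""
    digest = hashlib.sha256(series_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket, 0 to 99
    return bucket < percent


def samples_per_hour(interval_seconds: int) -> int:
    """Samples collected per series per hour at a given scrape interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return 3600 // interval_seconds


if __name__ == "__main__":
    series = [f"series-{i}" for i in range(10_000)]  # hypothetical series IDs
    mirrored = [s for s in series if in_sample(s)]
    print(len(mirrored))  # roughly a tenth of the series
    print(samples_per_hour(1), samples_per_hour(10))
```

Because the selection depends only on the series ID, the same series land in the experimental system on every run, which keeps comparisons between the two systems meaningful.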

Cloud-Native Observability Boosts Instrumentation Success                                    

Modern observability platforms empower you and your team to learn more about your systems and applications at a level of granularity that has historically been difficult to achieve at scale. Join the organizations across industries choosing observability platforms to build a great observability function that puts you in control of your telemetry and increases business confidence and effectiveness.


[1] Stavros Foteinopoulos, “How We Use Sloth to Do SLO Monitoring and Alerting with Prometheus,” Mattermost, October 26, 2021, https://oreil.ly/e35u8.   

[2] Steven Thurgood, “Example Error Budget Policy,” in The Site Reliability Workbook (O’Reilly Media, 2018), https://oreil.ly/yEg2b.

 

 

 

 
