Highly Available Prometheus Metrics for Distributed SQL With Thanos on GKE

In the last few years, Prometheus has gained huge popularity as a tool for monitoring distributed systems. It has a simple yet powerful data model and query language; however, it can pose a challenge when it comes to high availability and long-term storage of historical metric data. Adding more Prometheus replicas improves availability, but on its own Prometheus does not offer continuous availability.

For example, if one of the Prometheus replicas crashes, there will be a gap in the metric data during the time it takes to failover to another Prometheus instance. Similarly, Prometheus’s local storage is limited in scalability and durability given its single-node architecture. You will have to rely on a remote storage system to solve the long-term data retention problem. This is where the CNCF sandbox project Thanos comes in handy.

Thanos is a set of components that can be composed into a highly available metrics system with virtually unlimited storage capacity on GCS, S3, or other supported object stores, and runs seamlessly on top of existing Prometheus deployments. Thanos lets you query multiple Prometheus instances at once, merging data for the same metric across instances on the fly to produce a continuous stream of metrics. Even though Thanos is an early-stage project, it is already used in production by companies like Adobe and eBay.

Because YugabyteDB is a cloud native, distributed SQL database, it can easily interoperate with Thanos and many other CNCF projects like Longhorn, OpenEBS, Rook, and Falco.

What’s YugabyteDB? It is an open source, high-performance distributed SQL database built on a scalable and fault-tolerant design inspired by Google Spanner. YugabyteDB is PostgreSQL wire compatible.

In this blog post, we’ll show you how to get up and running with Thanos so that it can be used to monitor a YugabyteDB cluster, all running on Google Kubernetes Engine (GKE).

Thanos Architecture

At a high level, Thanos is made up of several key components, and it is worth understanding how they work together.

An illustration of the components is shown below. You can learn more about the Thanos architecture by checking out the documentation.

Thanos architecture

Why Thanos and YugabyteDB

Because YugabyteDB already integrates with Prometheus, Thanos can be used as a resilient monitoring platform for YugabyteDB clusters that can also store the metric data long term. It ensures the continuous availability of YugabyteDB metric data by aggregating the data from multiple Prometheus instances into a single view.

Prerequisites

Here is the environment required for the setup: a Google Cloud Platform account, a Kubernetes cluster running on GKE, and Helm 3.

Setting Up a Kubernetes Cluster on Google Cloud Platform

Assuming you have a Google Cloud Platform account, the first step is to set up a Kubernetes cluster using GKE.

The usual defaults should be sufficient. For the purposes of this demo, I chose Machine type: n1-standard-4 (4 vCPU, 15 GB memory). If you prefer the command line, a roughly equivalent cluster can be created with gcloud, as sketched below.

Selecting nodes
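This is just a sketch, not a definitive recipe: it assumes the gcloud CLI is installed and configured with a default project and zone, and the cluster name yb-demo-cluster is illustrative.

Shell

$ gcloud container clusters create yb-demo-cluster \
    --machine-type=n1-standard-4 \
    --num-nodes=3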

Install YugabyteDB on GKE With Helm

Once your Kubernetes cluster is up and running, log into the shell, and work through the following commands to get a YugabyteDB cluster deployed using Helm 3.

Create a Namespace

Shell

$ kubectl create namespace yb-demo

Add the Charts Repository

Shell

$ helm repo add yugabytedb https://charts.yugabyte.com

Fetch Updates From the Repository

Shell

$ helm repo update

Install YugabyteDB

We are now ready to install YugabyteDB. In the command below, we specify values suitable for a resource-constrained environment.

Shell

$ helm install yb-demo yugabytedb/yugabyte \
  --set resource.master.requests.cpu=0.5,resource.master.requests.memory=0.5Gi,\
resource.tserver.requests.cpu=0.5,resource.tserver.requests.memory=0.5Gi \
  --namespace yb-demo

To check the status of the YugabyteDB cluster, execute the command below:

Shell

$ helm status yb-demo -n yb-demo

Yugabyte cluster deployed

From the screenshot above we can see that the external IP is 35.239.35.3 and that the YSQL port is 5433. You can use this information to connect to YugabyteDB with your favorite database admin tool, such as DBeaver, pgAdmin, or TablePlus. For more information, check out the third-party tools documentation.
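Because YugabyteDB is PostgreSQL wire compatible, you can also connect with the standard psql client. This is a minimal sketch assuming the external IP and YSQL port shown above, plus the default yugabyte user and database:

Shell

$ psql -h 35.239.35.3 -p 5433 -U yugabyte -d yugabyte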

Congrats! At this point, you have a three-node YugabyteDB cluster running on GKE.
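To see the individual pods that make up the cluster, list them in the yb-demo namespace; with the chart defaults you should see three yb-master and three yb-tserver pods:

Shell

$ kubectl get pods -n yb-demo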

Setting Up the Prometheus Operator

For the purposes of this blog, we will be using the Prometheus Operator deployed via Helm 3 to get Prometheus up and running.

Create a values.yaml File

By default, Helm charts install multiple components that are not required to run Thanos with Prometheus. Also, since our cluster has limited resources, we need to override the default configuration by creating a new values.yaml file and passing this file when we install the Prometheus Operator using Helm.

Shell

$ touch values.yaml
$ vim values.yaml

The file’s contents should look something like the following.
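A minimal configuration along these lines works; the keys below are a sketch that assumes the stable/prometheus-operator chart’s standard value names, and they disable the bundled components we don’t need, since we define our own Prometheus resource below.

YAML

defaultRules:
  create: false
alertmanager:
  enabled: false
grafana:
  enabled: false
kubeStateMetrics:
  enabled: false
nodeExporter:
  enabled: false
prometheus:
  enabled: false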

Install Prometheus

Install the Prometheus Operator via Helm 3 as shown below.

Shell

$ kubectl create namespace prometheus
$ helm repo add stable https://kubernetes-charts.storage.googleapis.com
$ helm repo update
$ helm install prometheus-operator stable/prometheus-operator \
  --namespace prometheus \
  --values values.yaml

You can verify that the Prometheus Operator is installed using the following command:

Shell

$ kubectl get pods -n prometheus

To avoid metrics being unavailable, either permanently or for a short duration, we can run a second instance of Prometheus. Each instance runs independently of the other, yet each has the same configuration as set by the Prometheus Operator. You can see this implementation detail in the configuration below, where we specify replicas: 2.

Create a file called prometheus.yaml:

Shell

$ touch prometheus.yaml
$ vim prometheus.yaml

Add the following configuration:

YAML

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: prometheus
spec:
  baseImage: quay.io/prometheus/prometheus
  logLevel: info
  podMetadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    labels:
      app: prometheus
  replicas: 2
  resources:
    limits:
      cpu: 100m
      memory: 2Gi
    requests:
      cpu: 100m
      memory: 2Gi
  retention: 12h
  serviceAccountName: prometheus-service-account
  serviceMonitorSelector:
    matchLabels:
      serviceMonitorSelector: prometheus
  storage:
    volumeClaimTemplate:
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: prometheus-pvc
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
  version: v2.10.0
  securityContext:
    fsGroup: 0
    runAsNonRoot: false
    runAsUser: 0
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: "prometheus-service-account"
  namespace: "prometheus"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: "prometheus-cluster-role"
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - "/metrics"
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: "prometheus-cluster-role-binding"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: "prometheus-cluster-role"
subjects:
- kind: ServiceAccount
  name: "prometheus-service-account"
  namespace: prometheus

Next, apply the prometheus.yaml file to the Kubernetes cluster using the following command:

Shell

$ kubectl apply -f prometheus.yaml

You can verify that the Prometheus instances are running using the following command:

Shell

$ kubectl get pods -n prometheus

You should see output like that shown below with two Prometheus pods now running:

Two Prometheus pods running

Configuring Prometheus PVC

The Prometheus persistent volume claim (PVC) is used to retain the state of Prometheus and the metrics it captures in the event that it is upgraded or restarted. To verify that the PVC has been created and bound to a persistent volume, run the following command:

Shell

$ kubectl get persistentvolumeclaim --namespace prometheus

You should see output like that shown below:

Prometheus PVCs bound to persistent volumes

To access the Prometheus UI, we first need to run the following command:

Shell

$ kubectl port-forward service/prometheus-operated 9090:9090 --namespace prometheus

Now, go to Web preview in Google Cloud Shell and select Change port > 9090. You should now see the Prometheus web UI, similar to the one shown below:

prometheus web ui

Configuring Prometheus to Monitor YugabyteDB

The next step is to configure Prometheus to scrape YugabyteDB metrics. Create a file named servicemonitor.yaml with the following content:

YAML

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    serviceMonitorSelector: prometheus
  name: prometheus
  namespace: prometheus
spec:
  endpoints:
  - interval: 30s
    targetPort: 7000
    path: /prometheus-metrics
  namespaceSelector:
    matchNames:
    - yb-demo
  selector:
    matchLabels:
      app: "yb-master"
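The ServiceMonitor above scrapes only the yb-master servers. If you also want to collect tserver metrics, you can append a second ServiceMonitor to the same file. This is a sketch that assumes the Helm chart labels the tserver service app: yb-tserver and that the tserver web UI listens on port 9000:

YAML

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    serviceMonitorSelector: prometheus
  name: prometheus-tserver   # illustrative name
  namespace: prometheus
spec:
  endpoints:
  - interval: 30s
    targetPort: 9000         # assumes the default tserver web UI port
    path: /prometheus-metrics
  namespaceSelector:
    matchNames:
    - yb-demo
  selector:
    matchLabels:
      app: "yb-tserver"      # assumes the chart's tserver service label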

We can now apply the servicemonitor.yaml configuration by running the following command:

Shell

$ kubectl apply -f servicemonitor.yaml

Verify that the configuration has been applied by running the following command:

Shell

$ kubectl get servicemonitor --namespace prometheus

You should see output similar to the one shown below.

ServiceMonitor created

Now, return to the Prometheus UI to verify that the YugabyteDB metric endpoints are available to Prometheus by going to Status > Targets.

Targets in Prometheus
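While the port-forward from earlier is still running, you can also confirm the scrape targets are healthy by querying the built-in up metric through the Prometheus HTTP API; a quick sketch:

Shell

$ curl 'http://localhost:9090/api/v1/query?query=up'

Each YugabyteDB endpoint should be reported with a value of 1.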

Setting Up Thanos

Add the following Thanos-specific configuration to the prometheus.yaml file, updating the existing spec section of the Prometheus resource:

YAML

spec:
  baseImage: quay.io/prometheus/prometheus
  logLevel: info
  podMetadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    labels:
      app: prometheus
      thanos-store-api: "true"
  replicas: 2
  thanos:
    version: v0.4.0
    resources:
      limits:
        cpu: 10m
        memory: 50Mi
      requests:
        cpu: 10m
        memory: 50Mi
  resources:
    limits:
      cpu: 100m
      memory: 2Gi
    requests:
      cpu: 100m
      memory: 2Gi
  retention: 12h
  serviceAccountName: prometheus-service-account
  serviceMonitorSelector:
    matchLabels:
      serviceMonitorSelector: prometheus
  externalLabels:
    cluster_environment: workshop
  storage:
    volumeClaimTemplate:
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: prometheus-pvc
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi

Finally, add the following Thanos deployment configuration to the end of the prometheus.yaml file. It deploys the Thanos Query component along with a headless Service that Thanos Query uses to discover the Thanos sidecars over DNS; the --query.replica-label=prometheus_replica flag tells Thanos Query to deduplicate series that differ only by the Prometheus replica that scraped them:

YAML

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: prometheus
  labels:
    app: thanos-query
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
      - name: thanos-query
        image: improbable/thanos:v0.5.0
        resources:
          limits:
            cpu: 50m
            memory: 100Mi
          requests:
            cpu: 50m
            memory: 100Mi
        args:
        - "query"
        - "--log.level=debug"
        - "--query.replica-label=prometheus_replica"
        - "--store.sd-dns-resolver=miekgdns"
        - "--store=dnssrv+_grpc._tcp.thanos-store-api.prometheus.svc.cluster.local"
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        - name: cluster
          containerPort: 10900
---
apiVersion: v1
kind: Service
metadata:
  name: "thanos-store-api"
  namespace: prometheus
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: grpc
    port: 10901
    targetPort: grpc
  selector:
    thanos-store-api: "true"

We are now ready to apply the configuration by running the following command:

Shell

$ kubectl apply -f prometheus.yaml

Verify that the pods are up by executing the following command:

Shell

$ kubectl get pods --namespace prometheus

Notice that we now have the thanos-query pod running alongside the Prometheus replicas.
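Each Prometheus pod should now also carry a Thanos sidecar container. One way to check is to list the containers in one of the pods; this sketch assumes the operator's default prometheus-prometheus-0 pod name:

Shell

$ kubectl get pod prometheus-prometheus-0 --namespace prometheus \
    -o jsonpath='{.spec.containers[*].name}'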

Connect to the Thanos UI

Connect to Thanos Query by using port forwarding. You can do this by running the following command, replacing the thanos-query pod name with your own:

Shell

$ kubectl port-forward pod/thanos-query-7f77667897-lfmlb 10902:10902 --namespace prometheus
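Alternatively, you can port-forward the Deployment itself, which avoids having to look up the generated pod name:

Shell

$ kubectl port-forward deployment/thanos-query 10902:10902 --namespace prometheus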



We can now access the Thanos web UI using Web preview with port 10902.

Thanos Web UI

Verify that Thanos is able to access both Prometheus replicas by clicking on Stores.

Verifying Thanos is able to access Prometheus

The YugabyteDB metric data is now available to Thanos through both Prometheus instances. A few examples are below:

Yugabyte metric data

Thanos metric data

Conclusion

That’s it! You now have a YugabyteDB cluster running on GKE, monitored by two Prometheus instances that Thanos makes highly available and queryable as a single view. For more information, check out the documentation on YugabyteDB metrics and integration with Prometheus.