Monitor Your Infrastructure With InfluxDB and Grafana on Kubernetes

Grafana in action — Learn how to set it up in your AWS cloud

Monitoring your infrastructure and applications is a must-have if you take your game seriously. Keeping an eye on your entire landscape (running servers, cloud spend, VMs, containers, and the applications inside them) is extremely valuable for avoiding outages and fixing things more quickly. We at Starschema rely on open-source tools like InfluxDB, Telegraf, Grafana, and Slack to collect, analyze, and react to events. In this blog series, I will show you how we built our monitoring infrastructure to watch our cloud infrastructure, applications like Tableau Server and Deltek Maconomy, and data pipelines in Airflow, among others.

In this part, we will set up basic infrastructure monitoring with InfluxDB, Telegraf, and Grafana on Amazon’s managed Kubernetes service: AWS EKS.

Create a New EKS Kubernetes Cluster

If you have an EKS cluster already, just skip this part.

I assume you have a properly configured aws cli on your computer — if not, set it up now; it will be a life-changer. First, install eksctl, which helps you manage your AWS Elastic Kubernetes Service clusters and saves tons of time by not requiring you to rely on the AWS Management Console. You will also need kubectl.
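
If you need them, here is one way to install both on a Linux box, based on the official instructions at the time of writing (adjust for your own platform and package manager):

Shell
# Download the latest eksctl release and put the binary on the PATH
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Download the latest stable kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl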

Now, create a new Kubernetes cluster in AWS using eksctl, without a nodegroup:

Shell
eksctl create cluster --name "StarKube" --version 1.18 --region=eu-central-1 --without-nodegroup


I used the eu-central-1 region, but you can pick another one closer to you. After the command completes, add a new nodegroup to the freshly created cluster that uses only one availability zone (AZ):

Shell
eksctl create nodegroup --cluster=StarKube --name=StarKube-default-ng --nodes-min 1 --nodes-max 4 --node-volume-size=20 --ssh-access --node-zones eu-central-1b --asg-access --tags "Maintainer=tfoldi" --node-labels "ngrole=default" --managed


The reason I created a single-AZ nodegroup is to be able to use EBS-backed persistent volumes together with EC2 autoscaling groups. On multi-AZ nodegroups with autoscaling, newly created nodes can come up in a different zone, without access to the existing persistent volumes (which are AZ-specific). More info about this here.

TL;DR: Use single-zone nodegroups if you have EBS PersistentVolumeClaims.

If things are fine, you should see a node in your cluster:

Shell
$ kubectl get nodes
NAME                                              STATUS   ROLES    AGE   VERSION
ip-192-168-36-245.eu-central-1.compute.internal   Ready    <none>   16s   v1.18.9-eks-d1db3c
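
To double-check that the node really landed in the single AZ we asked for, you can also print the zone label (the label name below assumes Kubernetes 1.17 or newer; older clusters expose failure-domain.beta.kubernetes.io/zone instead):

Shell
# Show each node together with its availability zone
kubectl get nodes -L topology.kubernetes.io/zone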


Create a Namespace for Monitoring Apps

Kubernetes namespaces are isolated units inside the cluster. To create our own monitoring namespace, we simply execute:

Shell
kubectl create namespace monitoring


For our convenience, let’s use the monitoring namespace as the default one:

Shell
kubectl config set-context --current --namespace=monitoring


Install InfluxDB on Kubernetes

InfluxDB is a time-series database with easy-to-use APIs and good performance. If you are not familiar with time-series databases, it is time to learn: they offer query languages designed for time-series data and neat features like downsampling and retention policies.
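
To give you a taste, here is a hedged sketch of both, using the influx 1.x CLI against the database we are about to create later in this post; the policy, query, and target measurement names are made up for illustration:

Shell
# Keep raw data for 30 days instead of forever (hypothetical policy name "raw_30d")
influx -host <your-influxdb-host> -username root -password '<password>' -execute \
  'CREATE RETENTION POLICY "raw_30d" ON "monitoring" DURATION 30d REPLICATION 1 DEFAULT'

# Continuously downsample CPU usage into 5-minute averages (hypothetical target measurement "cpu_5m")
influx -host <your-influxdb-host> -username root -password '<password>' -execute \
  'CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "monitoring" BEGIN SELECT mean("usage_idle") AS "usage_idle" INTO "cpu_5m" FROM "cpu" GROUP BY time(5m), * END'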

To install an application on our Kubernetes cluster, we usually:

  1. (Optional) Create the necessary secrets as an Opaque Secret (to store sensitive configuration).
  2. (Optional) Create a ConfigMap to store non-sensitive configuration.
  3. (Optional) Create a PersistentVolumeClaim to store any persistent data (think of volumes for your containers).
  4. Create a Deployment or DaemonSet file to specify the container-related stuff, like what we are going to run.
  5. (Optional) Create a Service file explaining how we are going to access the Deployment.

As stated, the first thing we need to do is to define our Secrets: the usernames and passwords we want to use for our database.
Shell
kubectl create secret generic influxdb-creds \
  --from-literal=INFLUXDB_DB=monitoring \
  --from-literal=INFLUXDB_USER=user \
  --from-literal=INFLUXDB_USER_PASSWORD=<password> \
  --from-literal=INFLUXDB_READ_USER=readonly \
  --from-literal=INFLUXDB_READ_USER_PASSWORD=<password> \
  --from-literal=INFLUXDB_ADMIN_USER=root \
  --from-literal=INFLUXDB_ADMIN_USER_PASSWORD=<password> \
  --from-literal=INFLUXDB_HOST=influxdb \
  --from-literal=INFLUXDB_HTTP_AUTH_ENABLED=true
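
As an optional sanity check, you can list the keys that ended up in the secret and decode one of them:

Shell
# List the keys stored in the secret (values are not shown)
kubectl describe secret influxdb-creds

# Decode a single value to make sure it is what you expect
kubectl get secret influxdb-creds -o jsonpath='{.data.INFLUXDB_DB}' | base64 --decode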


Next, create some persistent storage to store the database itself:

YAML
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: monitoring
  labels:
    app: influxdb
  name: influxdb-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi


If you are new to Kubernetes, the way to apply these files is to run kubectl apply -f <filename>, in our case kubectl apply -f influxdb-pvc.yml.

Now, let’s create the Deployment, which defines what containers we need and how to run them:

YAML
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: monitoring
  labels:
    app: influxdb
  name: influxdb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: influxdb
  template:
    metadata:
      labels:
        app: influxdb
    spec:
      containers:
      - envFrom:
        - secretRef:
            name: influxdb-creds
        image: docker.io/influxdb:1.8
        name: influxdb
        volumeMounts:
        - mountPath: /var/lib/influxdb
          name: var-lib-influxdb
      volumes:
      - name: var-lib-influxdb
        persistentVolumeClaim:
          claimName: influxdb-pvc


This will create a single pod (since replicas is 1), passing our influxdb-creds as environment variables and using the influxdb-pvc PersistentVolumeClaim to obtain 5GB of storage for the database files. If all goes well, we should see something like:

Shell
[tfoldi@kompi]% kubectl get pods -l app=influxdb
NAME                        READY   STATUS    RESTARTS   AGE
influxdb-7f694df996-rtdcz   1/1     Running   0          16m
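
Optionally, you can also verify from inside the pod that the image’s entrypoint created the database and users defined in our secret (use the pod name from the output above):

Shell
# Run the influx client inside the InfluxDB pod and list the databases
kubectl exec -it influxdb-7f694df996-rtdcz -- \
  influx -username root -password '<password>' -execute 'SHOW DATABASES'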


Now that we have defined what we want to run, it is time to define how to access it. This is where the Service definition comes into the picture. Let’s start with a basic LoadBalancer service:

YAML
apiVersion: v1
kind: Service
metadata:
  labels:
    app: influxdb
  name: influxdb
  namespace: monitoring
spec:
  ports:
  - port: 8086
    protocol: TCP
    targetPort: 8086
  selector:
    app: influxdb
  type: LoadBalancer


This exposes our pod’s 8086 port through an Elastic Load Balancer (ELB). With kubectl get service, we should see the external-facing host and port (assuming we want to monitor apps outside of our AWS internal network).

Shell
$ kubectl get service/influxdb
NAME       TYPE           CLUSTER-IP     EXTERNAL-IP                                                                   PORT(S)          AGE
influxdb   LoadBalancer   10.100.15.18   ade3d20c142394935a9dd33c336b3a0f-2034222208.eu-central-1.elb.amazonaws.com   8086:30651/TCP   18h

$ curl http://ade3d20c142394935a9dd33c336b3a0f-2034222208.eu-central-1.elb.amazonaws.com:8086/ping


This is great, but instead of HTTP, we might want to use HTTPS. To do that, we need an SSL certificate in ACM with the desired hostname. We can either generate a new certificate (which requires a Route 53 hosted zone) or upload an external SSL certificate.

Amazon Issued SSL Certs are great but require Route 53 hosted zones. Alternatively, you can import existing SSL certificates.
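
For reference, both routes can also be scripted with the AWS CLI; the domain and file names below are placeholders:

Shell
# Request a new, DNS-validated certificate (requires a Route 53 hosted zone for the domain)
aws acm request-certificate --domain-name influxdb.example.com --validation-method DNS --region eu-central-1

# ...or import an existing certificate
aws acm import-certificate --certificate fileb://cert.pem --private-key fileb://key.pem \
  --certificate-chain fileb://chain.pem --region eu-central-1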

If we have our certificate in ACM, we should add it to the Service file:

YAML
apiVersion: v1
kind: Service
metadata:
  annotations:
    # Note that the backend talks over HTTP.
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
    # TODO: Fill in with the ARN of your certificate.
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:{region}:{user id}:certificate/{id}
    # Only run SSL on the port named "https" below.
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
  labels:
    app: influxdb
  name: influxdb
  namespace: monitoring
spec:
  ports:
  - port: 8086
    targetPort: 8086
    name: http
  - port: 443
    name: https
    targetPort: 8086
  selector:
    app: influxdb
  type: LoadBalancer


After executing this file, we can see that our ELB listens on two ports:

Shell
[tfoldi@kompi]% kubectl get services/influxdb
NAME       TYPE           CLUSTER-IP     EXTERNAL-IP                                                                   PORT(S)                        AGE
influxdb   LoadBalancer   10.100.15.18   ade3d20c142394935a9dd33c336b3a0f-2034222208.eu-central-1.elb.amazonaws.com   8086:30651/TCP,443:31445/TCP   18h


SSL is properly configured; the only thing missing is an A or CNAME record pointing to the EXTERNAL-IP value.
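
If the domain is hosted in Route 53, this is a single CLI call; the hosted zone ID and hostname below are placeholders:

Shell
aws route53 change-resource-record-sets --hosted-zone-id <hosted-zone-id> --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "influxdb.example.com",
      "Type": "CNAME",
      "TTL": 300,
      "ResourceRecords": [{"Value": "ade3d20c142394935a9dd33c336b3a0f-2034222208.eu-central-1.elb.amazonaws.com"}]
    }
  }]
}'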

We are all set: our database is running and available over both HTTP and HTTPS.

Installing Telegraf on Kubernetes

We need some data to validate our installation, and conveniently, we already have a system to monitor: our very own Kubernetes cluster and its containers. To do this, we will install Telegraf on all nodes and ingest CPU, I/O, and Docker metrics into our InfluxDB. Telegraf has tons of plugins to collect data from almost anything: infrastructure elements, log files, web apps, and so on.

The configuration will be stored as a ConfigMap, which is what we are going to pass to our containers:

YAML
apiVersion: v1
kind: ConfigMap
metadata:
  name: telegraf
  namespace: monitoring
  labels:
    k8s-app: telegraf
data:
  telegraf.conf: |+
    [global_tags]
      env = "EKS eu-central"
    [agent]
      hostname = "$HOSTNAME"
    [[outputs.influxdb]]
      urls = ["http://$INFLUXDB_HOST:8086/"] # required
      database = "$INFLUXDB_DB" # required
      timeout = "5s"
      username = "$INFLUXDB_USER"
      password = "$INFLUXDB_USER_PASSWORD"
    [[inputs.cpu]]
      percpu = true
      totalcpu = true
      collect_cpu_time = false
      report_active = false
    [[inputs.disk]]
      ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
    [[inputs.diskio]]
    [[inputs.kernel]]
    [[inputs.mem]]
    [[inputs.processes]]
    [[inputs.swap]]
    [[inputs.system]]
    [[inputs.docker]]
      endpoint = "unix:///var/run/docker.sock"


To run our Telegraf data collector on all nodes of our Kubernetes cluster, we should use a DaemonSet instead of a Deployment.

YAML
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: telegraf
  namespace: monitoring
  labels:
    k8s-app: telegraf
spec:
  selector:
    matchLabels:
      name: telegraf
  template:
    metadata:
      labels:
        name: telegraf
    spec:
      containers:
      - name: telegraf
        image: docker.io/telegraf:1.5.2
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: "HOST_PROC"
          value: "/rootfs/proc"
        - name: "HOST_SYS"
          value: "/rootfs/sys"
        - name: INFLUXDB_USER
          valueFrom:
            secretKeyRef:
              name: influxdb-creds
              key: INFLUXDB_USER
        - name: INFLUXDB_USER_PASSWORD
          valueFrom:
            secretKeyRef:
              name: influxdb-creds
              key: INFLUXDB_USER_PASSWORD
        - name: INFLUXDB_HOST
          valueFrom:
            secretKeyRef:
              name: influxdb-creds
              key: INFLUXDB_HOST
        - name: INFLUXDB_DB
          valueFrom:
            secretKeyRef:
              name: influxdb-creds
              key: INFLUXDB_DB
        volumeMounts:
        - name: sys
          mountPath: /rootfs/sys
          readOnly: true
        - name: proc
          mountPath: /rootfs/proc
          readOnly: true
        - name: docker-socket
          mountPath: /var/run/docker.sock
        - name: utmp
          mountPath: /var/run/utmp
          readOnly: true
        - name: config
          mountPath: /etc/telegraf
      terminationGracePeriodSeconds: 30
      volumes:
      - name: sys
        hostPath:
          path: /sys
      - name: docker-socket
        hostPath:
          path: /var/run/docker.sock
      - name: proc
        hostPath:
          path: /proc
      - name: utmp
        hostPath:
          path: /var/run/utmp
      - name: config
        configMap:
          name: telegraf


Please note that this uses the same influxdb-creds secret to connect to our database. If all goes well, we should see our Telegraf agent running:

Shell
$ kubectl get pods -l name=telegraf
NAME             READY   STATUS    RESTARTS   AGE
telegraf-mrgrg   1/1     Running   0          18h


To check the log messages from the telegraf pod, simply execute kubectl logs <podname>. You should not see any error messages.
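
You can also confirm that metrics are actually arriving by querying InfluxDB with the readonly user, for example through the load balancer we set up earlier:

Shell
# List the measurements Telegraf has created so far (cpu, disk, docker, mem, ...)
influx -host <your-influxdb-hostname> -port 8086 \
  -username readonly -password '<password>' \
  -database monitoring -execute 'SHOW MEASUREMENTS'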

Set Up Grafana in Kubernetes

This will be the fun part. Finally, we should be able to see some of the data we collected (and remember, we will add a lot more). Grafana is a cool, full-featured data visualization tool for time-series datasets.

Let’s start with the usual username and password combo as a secret.

Shell
kubectl create secret generic grafana-creds \
  --from-literal=GF_SECURITY_ADMIN_USER=admin \
  --from-literal=GF_SECURITY_ADMIN_PASSWORD=admin123


Add 1GB of storage to store the dashboards:

YAML
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: graf-data-dir-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi


Define the Deployment. As the Grafana Docker image runs with uid:gid 472, we have to mount the persistent volume with fsGroup: 472.

YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: monitoring
  labels:
    app: grafana
  name: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - envFrom:
        - secretRef:
            name: grafana-creds
        image: docker.io/grafana/grafana:7.3.3
        name: grafana
        volumeMounts:
          - name: data-dir
            mountPath: /var/lib/grafana/
      securityContext:
        fsGroup: 472
      volumes:
      - name: data-dir
        persistentVolumeClaim:
          claimName: graf-data-dir-pvc


Finally, let’s expose it in the same way we did with InfluxDB:

YAML
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:eu-central-1:<account>:certificate/<certid>
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
  labels:
    app: grafana
  name: grafana
  namespace: monitoring
spec:
  ports:
  - port: 443
    name: https
    targetPort: 3000
  selector:
    app: grafana
  type: LoadBalancer


Voila, we should have Grafana up and running. Let’s check the ELB address with kubectl get services, point a nice hostname at it, and we are good to go. If everything is set, we should see something like this:

I am glad that you made it here; now let’s log in!


Use the username/password combination you defined earlier, and see the magic.

Home screen for our empty Grafana

Define Database Connection to InfluxDB

While this can be done programmatically, to keep this post short (it is already way too long), let’s do it from the UI; a sketch of the API route is included at the end of this section. Click the gear icon, then Data Sources, then Add data source:

You know where you should click

Select InfluxDB: 


Add http://influxdb:8086/ as the URL, and set it up with your user or the readonly InfluxDB user.

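For completeness, here is roughly what the programmatic route looks like via Grafana’s HTTP API; this is only a sketch, using the admin credentials and data source attributes we set up earlier and a placeholder hostname:

Shell
curl -s -X POST "https://<your-grafana-hostname>/api/datasources" \
  -H "Content-Type: application/json" \
  -u admin:admin123 \
  -d '{
    "name": "InfluxDB",
    "type": "influxdb",
    "access": "proxy",
    "url": "http://influxdb:8086",
    "database": "monitoring",
    "user": "readonly",
    "secureJsonData": { "password": "<password>" }
  }'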

Adding our First Grafana Dashboard

Our Telegraf agent is already loading data, so there is no reason not to look at it. We can import existing, community-built dashboards, such as this one: https://grafana.com/grafana/dashboards/928.

Click the + sign on the sidebar, then Import. On the import screen, enter the ID of this dashboard (928).
 





After importing, we should immediately see our previously collected data, live:

This is really cool

Feel free to start building your own dashboards; it is way easier than you think.


Next Steps

In the next blog post, I will show how to monitor our (and our customers') Tableau Server and how to set up data-driven email/Slack alerts in no time.

 

 

 

 
