Galera Warning “Last Inactive Check”
Problem
I’ve been working with Percona XtraDB Cluster quite a bit recently and have been investigating various warnings. I came across this one today:
[Warning] WSREP: last inactive check more than PT1.5S ago (PT1.51811S), skipping check
This warning is related to the evs.inactive_check_period option. This option controls how often the node polls group communication response times. If a peer is found to be delayed, it is added to a delay list, which can eventually lead to the cluster evicting that node.
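If you want to confirm the value your node is actually running with, it is part of the wsrep_provider_options string, which is one long semicolon-separated list. A minimal sketch (the client invocation is an assumption; adjust credentials and host for your environment):

# Pull the evs.* inactivity settings out of the provider options string
$ mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'\G" | tr ';' '\n' | grep 'evs.inactive'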
Possible Cause
While some troubleshooting tips associate the warning with VMware snapshots, that isn't the case here, as we see the warning on a physical machine.
I checked for backups or desynced nodes, and this also wasn’t the case. The warning was not accompanied by any errors or other information, so there was nothing critical happening.
In the troubleshooting link above, Galera developers said:
This can be seen on bare metal as well—with poorly configured mysqld, O/S, or simply being overloaded. All it means is that this thread could not get CPU time for 7.1 seconds. You can imagine that access to resources in virtual machines is even harder (especially I/O) than on bare metal, so you will see this in virtual machines more often.
This is not a Galera specific issue (it just reports being stuck, other mysqld threads are equally stuck) so there is no configuration options for that. You simply must make sure that your system and mysqld are properly configured, that there is enough RAM (buffer pool not over provisioned), that there is swap, that there are proper I/O drivers installed on guest and so on.
Basically, Galera runs in virtual machines as well as the virtual machines approximates bare metal.
It could also be an indication of unstable network or just higher average network latency than expected by the default configuration. In addition to checking network, do check I/O, swap and memory when you do see this warning.
Our graphs and counters otherwise looked healthy. If that is the case for you as well, this is most likely nothing to worry about.
It is also a good idea to ensure your nodes are desynced before backups, and to look for spikes in your workload. Another thing worth checking is that swappiness is set to 1 on modern kernels.
If all of this looks good, ensure the servers are all talking to the same NTP server, are in the same time zone, and have their dates and times in sync. While this warning could be a sign of an overloaded system, if everything else looks good it isn't something to worry about.
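If you want quick commands for the checks above, here is a rough sketch. It assumes Percona XtraDB Cluster's wsrep_desync variable, a sysctl-managed kernel, and timedatectl/ntpq being installed; swap in whatever tooling you actually use:

# Desync the node before a heavy backup, then re-sync when it is done
$ mysql -e "SET GLOBAL wsrep_desync=ON"
# ... run the backup here ...
$ mysql -e "SET GLOBAL wsrep_desync=OFF"

# Check swappiness (set it to 1 on modern kernels if it isn't already)
$ sysctl vm.swappiness
$ sudo sysctl -w vm.swappiness=1

# Confirm the nodes agree on time and talk to the same NTP servers
$ timedatectl status
$ ntpq -p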
Source
The warning comes from evs_proto.cpp in the Galera code:
if (last_inactive_check_ + inactive_check_period_*3 < now)
{
    log_warn << "last inactive check more than " << inactive_check_period_*3
             << " ago (" << (now - last_inactive_check_)
             << "), skipping check";
    last_inactive_check_ = now;
    return;
}
Since the default for inactive_check_period is one second according to the Galera documentation, if it is now more than three seconds since the last check, the function skips the rest of the routine (the part that adds nodes to the delay list and does some other logic). It does this because it doesn't want to rely on stale counters when making decisions. The warning is really just letting you know that.
In Percona XtraDB Cluster, this setting defaults to 0.5s, which makes the warning threshold 3 × 0.5s = 1.5s, the PT1.5S you see in the message above. The warning could simply mean that your inactive_check_period is too low for your environment: the delay is long enough to trigger the message, but not long enough to add the node to the delay list. So you could consider increasing evs.inactive_check_period to resolve the warnings. (Apparently in Galera the default may also now be 0.5s, but the documentation is stale.)
Possible Solution
To find a sane value, my colleague David Bennett came up with this command line, which gives you an idea of the delays your check warnings are reporting:
$ cat mysqld.log | grep 'last inactive check more than' | perl -ne 'm/\(PT(.*)S\)/; print $1."\n"' | sort -n | uniq -c
1 1.55228
1 1.5523
1 1.55257
1 1.55345
1 1.55363
1 1.5543
1 1.55436
1 1.55483
1 1.5552
1 1.55582
Therefore, in this case, it may be a good idea to set inactive_check_period to 1 or 1.5 seconds to make the warnings go away.
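For example, here is a hedged my.cnf sketch that raises the period to 1.5 seconds. Keep in mind that wsrep_provider_options is a single semicolon-separated string, so merge this with any provider options you already set rather than overwriting them, and restart mysqld on each node for it to take effect:

# my.cnf, [mysqld] section (example value; merge with your existing provider options)
wsrep_provider_options="evs.inactive_check_period=PT1.5S"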
Conclusion
Each node in the cluster keeps its own local copy of how it sees the topology of the entire cluster. check_inactive is a node event triggered every inactive_check_period seconds to help the node update its view of the whole cluster and ensure it is accurate. Service messages can be broadcast to the cluster informing nodes of changes to the topology. For example, if a cluster node is going down, it broadcasts a service message telling each node in the cluster to remove it. The action is queued, but the actual view of the cluster is only updated by check_inactive. This is why that routine adds nodes to the local copy of inactive, suspect and delayed nodes.
If a node thinks it might be looking at stale data, it doesn't make these decisions and waits for a fresh queue the next time around. Unfortunately, if inactive_check_period is too low, it will keep giving you warnings.