On-Call Cloud VM Instance Reboot

Occasionally VM instance reboot in an unplanned way and triggers alert, usually this is due to hardware or software issue on physical machine hosting the VM that causes the VM to crash.

From last reboot will see reboot records, or who -b displaying last time reboot info.

For CentOS/RHEL systems, you’ll find the logs at /var/log/messages while for Ubuntu/Debian systems, its logged at /var/log/syslog.

1
2
3

# exclude irrelevant info
# then looking around the possible key word log
sudo grep -iv ': starting\|kernel: .*: Power Button\|watching system buttons\|Stopped Cleaning Up\|Started Crash recovery kernel' /var/log/messages | grep -iw 'recover[a-z]*\|power[a-z]*\|shut[a-z ]*down\|rsyslogd\|ups'

From the command output above, skim /var/log/messages file around the timestamp, output example:

May  3 23:31:52 xxxxx systemd: Started Update UTMP about System Boot/Shutdown.
May  3 23:31:55 xxxxx rsyslogd: [origin software="rsyslogd" swVersion="8.24.0-52.el7" x-pid="949" x-info="http://www.rsyslog.com"] start
May  3 23:31:55 xxxxx systemd: Started Google Compute Engine Shutdown Scripts.
May  3 23:37:43 xxxxx audispd: node=xxxxx type=EXECVE msg=audit(1620085063.649:1984): argc=3 a0="last" a1="-x" a2="shutdown"

I see the reboot was caused by Shutdown Scripts, further check the VM instance log or StackDriver log (if on public cloud platform, check platform log system is more convenient than ssh to checking VM log), get error:

1	compute.instances.hostError

Then it was clear, What happened with host error?