This document describes the usual actions and tools for drilling down into failing VM instances and finding a root cause. For troubleshooting specific issues, see also the related tips.
Troubleshooting a failed deployment
These are the usual steps for drilling down to the root cause of a VM instance failure.
Identify any failing VM instance with `bosh -d <deployment-name> instances`, possibly focusing on failing instances with `--failing` or detailing failing processes with `--ps`. Then `bosh ssh` to a VM that has an issue.
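As a sketch, the commands above can be wrapped in small helpers so the deployment name is supplied once; `my-deployment` and `worker/0` are placeholder names, and nothing here actually contacts a BOSH Director:

```shell
# Hypothetical helpers that assemble the bosh commands from the step above.
# They only print the command strings; run the output by hand (or via eval).
bosh_failing_cmd() {
  printf 'bosh -d %s instances --failing\n' "$1"
}
bosh_ssh_cmd() {
  printf 'bosh -d %s ssh %s\n' "$1" "$2"
}

bosh_failing_cmd my-deployment
bosh_ssh_cmd my-deployment worker/0
```

Keeping the deployment name in one place avoids retyping it across the `instances` and `ssh` invocations.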
Become superuser with `sudo -i` for full root privileges.
Check failing Monit jobs with `monit summary`. When the failure happened at the `pre-start` stage, this list is empty because the Monit configuration has not yet been assembled.
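As a non-interactive sketch, the `monit summary` output can be filtered for jobs not in the `running` state. The exact line format varies between Monit versions, so the parsing below is an assumption (lines like `Process 'worker'  not monitored`):

```shell
# Sketch: print the names of monit-managed processes whose last status
# field is not "running". Quotes around the name are stripped with tr.
failing_monit() {
  awk '/^Process/ && $NF != "running" { print $2 }' | tr -d "'"
}

# On a BOSH VM:
# monit summary | failing_monit
```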
Check for any full disk device with `df -h`.
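The disk check above can be automated as a small filter; the 90% threshold is an arbitrary pick for illustration:

```shell
# Sketch: report filesystems at or above a usage threshold.
# `df -P` produces the portable one-line-per-filesystem format,
# whose 5th column is the usage percentage and 6th the mount point.
check_df() {
  awk -v limit="${1:-90}" 'NR > 1 && $5+0 >= limit { print $6, $5 }'
}

df -P | check_df 90
```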
Check for excessive memory consumption or anything suspicious in the process tree (such as duplicate or zombie processes) with `top`: press `V` for tree display, `c` for command-line arguments, `E` twice for GiB memory units in the summary, `e` for MiB per-process memory units, `W` to persist the current display, `L` to locate a process, `&` for the next search result, and `k` to send a signal to the process displayed on the first line.
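To complement the interactive view, zombie processes can also be listed non-interactively; this is a sketch using standard `ps` output fields:

```shell
# Sketch: list zombie processes (state starting with "Z").
# `ps -eo stat=,pid=,comm=` prints state, PID and command with no header.
zombies() {
  awk '$1 ~ /^Z/ { print $2, $3 }'
}

ps -eo stat=,pid=,comm= | zombies
```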
Check the logs of failing processes in `/var/vcap/sys/log/<job-name>/*.log` and browse them with `less`: press `>` to go to the end of the file, `F` to follow the latest logs in live mode, and `^C` to stop following.
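To decide which log to open first, recently modified files can be listed; the 30-minute window is an assumed, adjustable default:

```shell
# Sketch: list job logs modified in the last 30 minutes under the
# assumed layout /var/vcap/sys/log/<job-name>/*.log.
recent_logs() {
  find "${1:-/var/vcap/sys/log}" -name '*.log' -mmin -30 2>/dev/null
}

# On a BOSH VM:
# recent_logs
```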
Troubleshooting the BOSH Agent
Troubleshooting the BOSH Agent itself is rarely necessary, but here is how to display some JSON metadata present on the VM instance, using tools available by default on stemcells.
Check the latest BOSH Agent logs on the VM.
Check the BOSH Agent initial configuration with `python3 -m json.tool /var/vcap/bosh/agent.json`.
Check the BOSH Agent dynamic settings with `python3 -m json.tool /var/vcap/bosh/settings.json | less`.
Check the VM instance role (as defined in the BOSH deployment manifest: jobs, packages, networks, etc.) with `python3 -m json.tool /var/vcap/bosh/spec.json`.
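The three checks above share one pattern, sketched below as a helper. Since `python3 -m json.tool` exits non-zero on malformed input, this also doubles as a JSON validity check; taking the path as an argument additionally works on a copy fetched off the VM:

```shell
# Sketch: validate and pretty-print one of the agent JSON files.
show_json() {
  python3 -m json.tool "$1"
}

# On a BOSH VM:
# show_json /var/vcap/bosh/spec.json
```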