Chris Trotter
2014-06-18 12:19:32 UTC
Hope this helps someone else.
I came in today and discovered that one of our check_mk Windows agents was
giving 'tcp connection refused'. The check_mk_agent service in CMK itself
was showing CRIT, and all other related checks on that Windows host were
stale. No notifications had been sent out - still have to dig into why.
Here is my rough troubleshooting flow:
- Run manual check
- Restart Windows agent
- Telnet host 6556 from my workstation (should normally work) - it works
- Check the port from the OMD server - a few ways to test - port is
responsive
- We are running 1.11 (updated earlier this week for cmk BI features),
so perhaps the agent needs updating
- Updated agent from 1.2.4p2 to 1.2.4p3, no change
- Noticed the service is not stopping correctly, have to kill the process
- Double-checked the configuration, cleaned out extraneous stuff
- Lots of googling later...
- Ran netstat -anb | find /i "6556" on the troubled Windows box
- I see 'CLOSE_WAIT' a number of times
- Restart the service (kill, start), see LISTENING
- Run manual check, still timing out
- CLOSE_WAIT showing up again
- Rebooted the Windows server (cuz, ya know)
- No change
- Started the check_mk_agent service, then from cmd: check_mk_agent.exe
test
- It was hanging on a particular check
- Commented out the related checks
- Everything returned to normal
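For anyone who wants to script the two checks that narrowed this down, here is a rough Python sketch. It is not part of check_mk. The helper names `count_tcp_states` and `probe_agent` are made up for illustration, and the netstat parsing assumes the usual `TCP  local  remote  STATE` column layout. The idea: an open port only proves something is listening. A hung agent accepts the connection but never finishes sending, so the probe reads until the agent closes the connection or a timeout hits.

```python
import socket
from collections import Counter


def count_tcp_states(netstat_output: str, port: int) -> Counter:
    """Count TCP states for netstat lines whose local address ends in :port."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: TCP  local_addr  remote_addr  STATE
        if (
            len(fields) >= 4
            and fields[0].upper() == "TCP"
            and fields[1].endswith(f":{port}")
        ):
            states[fields[3]] += 1
    return states


def probe_agent(host: str, port: int = 6556, timeout: float = 10.0) -> str:
    """Classify the agent as 'ok', 'empty', 'refused', 'timeout' or 'error'."""
    chunks = []
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            while True:
                data = sock.recv(4096)
                if not data:
                    # Remote side closed the connection: output is complete.
                    break
                chunks.append(data)
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        # Port accepted the connection but the output never finished.
        return "timeout"
    except OSError:
        return "error"
    return "ok" if b"".join(chunks).strip() else "empty"
```

Feeding the output of `netstat -an` into `count_tcp_states(output, 6556)` gives a quick count of CLOSE_WAIT entries, and a `probe_agent` result of `timeout` matches the symptom above: the port answers but the check never completes.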
The check was a 'cscript script.vbs' that normally outputs appropriate
Nagios-readable service data. The host runs about 30 of these checks, all
of them work fine except for a select few. The select few were getting
'server does not exist' errors due to a VPN tunnel crashing and not coming
back up.
Still not certain if this is a cmk agent bug, or if we just need to put
better error handling into our vbs code. (it's vbs because legacy and time)
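On the error handling side, one option is to wrap each script so it cannot hang the agent. This is only a sketch under assumptions: `run_check` is a hypothetical helper, the one-line "service STATUS - text" output is a stand-in for whatever our scripts really print, and exit code 3 is the usual Nagios UNKNOWN. It runs the command with a hard timeout and returns an UNKNOWN line instead of blocking.

```python
import subprocess


def run_check(command, service, timeout=15.0):
    """Run one check command and return (exit_code, one_line_output)."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return 3, f"{service} UNKNOWN - check timed out after {timeout:g}s"
    except OSError as exc:
        # For example, the interpreter or script could not be started.
        return 3, f"{service} UNKNOWN - could not start check: {exc}"
    lines = result.stdout.strip().splitlines()
    if not lines:
        return 3, f"{service} UNKNOWN - check produced no output"
    # Pass through the script's own exit code and its first output line.
    return result.returncode, lines[0]
```

Called as `run_check(["cscript", "//NoLogo", "script.vbs"], "legacy_vbs_check")`, a script stuck on an unreachable server would come back as UNKNOWN after 15 seconds. cscript also has its own `//T:nn` switch to cap script run time, which may be enough without a wrapper.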
Chris