Who Watches the Watchdog?

You depend on your phones to keep your business going, and on Asterisk to keep your phones running. But who watches over Asterisk? To keep your Asterisk deployment running, you need a watchdog to monitor it, and an overseer that can switch services from one server to another when trouble strikes.

The Q-Suite Overseer was designed to:

  • monitor processes
  • check a number of failure conditions
  • coordinate with services on other servers

Simple watchdogs usually check for the presence of the Asterisk process, and do something when it is missing. Usually they just restart Asterisk, or start Asterisk if another Asterisk server is no longer pingable. This approach works in many cases, but it is insufficient. When Asterisk experiences a locking bug or other non-crash failure, the process may continue to run for hours, so a simple presence check sees nothing wrong. Sometimes it’s up and running, but because of a problem, certain types of call cannot be originated. It can take time to notice that you’re not receiving new calls.

The Overseer Watchdog does more. For instance, it sends periodic SIP requests and monitors the responses. If too many SIP requests go unanswered, it notes the failure and reports it to the Overseer. Other conditions that may cause failures include a disk that is full or unwriteable, which can lead to missing recordings and logs, and too much memory being used by Asterisk, which can indicate a problem with the process. The Watchdog also looks for other conditions that can make the server unreliable even if Asterisk is running correctly. Finally, the Overseer itself can note that it has lost contact with the Asterisk server.
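To make the checks above concrete, here is a minimal sketch in Python of how such conditions might be evaluated. It is not the Q-Suite implementation: the names SipProbeTracker, disk_problem and memory_problem are hypothetical, the thresholds are made up, and counting consecutive unanswered probes is an assumption about what "too many" means.

```python
import shutil
import tempfile
from dataclasses import dataclass


@dataclass
class SipProbeTracker:
    """Counts consecutive unanswered SIP probes (hypothetical helper)."""
    max_unanswered: int = 3  # assumed threshold, not a Q-Suite default
    unanswered: int = 0

    def record(self, answered: bool) -> bool:
        """Record one probe result; return True once the threshold is reached."""
        self.unanswered = 0 if answered else self.unanswered + 1
        return self.unanswered >= self.max_unanswered


def disk_problem(path: str, min_free_bytes: int) -> bool:
    """True if the disk holding `path` is nearly full or cannot be written to."""
    if shutil.disk_usage(path).free < min_free_bytes:
        return True
    try:
        # Creating (and immediately discarding) a temp file proves writability.
        with tempfile.TemporaryFile(dir=path):
            pass
    except OSError:
        return True
    return False


def memory_problem(rss_bytes: int, max_rss_bytes: int) -> bool:
    """True if the process uses more resident memory than the allowed limit."""
    return rss_bytes > max_rss_bytes
```

A watchdog loop would call record() after each SIP probe and report a failure to the Overseer as soon as any of the three checks returns True. How the probe is sent and how the memory figure is obtained are left out on purpose.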

If a failure is noted, the Overseer can:

  • switch a floating IP between servers. As long as extensions are registered to the floating IP, they can continue to be used.
  • pull configuration and dialplan details from the database and make them live.
  • only switch to a server that is actually up and ready to take over.
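The decision behind these actions can be sketched as follows. This is an illustration under stated assumptions, not the Overseer's actual logic: ServerStatus and choose_failover_target are hypothetical names, and moving the floating IP and loading the configuration are left to the caller.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ServerStatus:
    """What an overseer might know about one server (hypothetical structure)."""
    name: str
    reachable: bool                 # False once contact with the server is lost
    failures: Tuple[str, ...] = ()  # e.g. ("sip_unanswered", "disk_full")

    @property
    def healthy(self) -> bool:
        return self.reachable and not self.failures


def choose_failover_target(active: ServerStatus,
                           standbys: Sequence[ServerStatus]) -> Optional[str]:
    """Return the name of the server to switch to, or None to stay put."""
    if active.healthy:
        return None  # nothing is wrong, so do not move the floating IP
    for candidate in standbys:
        if candidate.healthy:
            return candidate.name  # first standby that is up and ready
    return None  # no standby is in better shape, so switching is pointless
```

For example, if the active server reports unanswered SIP requests and the only standby has a full disk, the function returns None and the service stays where it is.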

Many of these conditions can be tweaked to account for differences between installations. An Asterisk build compiled with symbols and full debugging uses quite a bit more memory than the stripped and optimized version that is usually installed, so it may need a higher memory limit. Peak call volumes may result in some requests being queued and not handled for short periods, so a longer SIP request period may be desired. And of course you’ll only want it to fail over to a server that is itself in good shape; if one server is lagging and the other has no disk space, there’s no point in switching the service over.
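One way to picture this kind of tuning is a set of defaults with per-installation overrides. The sketch below is purely illustrative: the Thresholds fields and every number in it are assumptions, not actual Q-Suite settings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    """Tunable limits for one installation (all values are made-up examples)."""
    max_rss_mb: int = 512            # memory limit for the Asterisk process
    sip_request_period_s: int = 10   # how long to allow for SIP responses
    max_unanswered_requests: int = 3
    min_free_disk_mb: int = 1024


DEFAULTS = Thresholds()

# A debug build with symbols uses more memory, so raise only that limit.
debug_build = replace(DEFAULTS, max_rss_mb=2048)

# A site with heavy call peaks may queue requests briefly, so allow more time.
busy_site = replace(DEFAULTS, sip_request_period_s=30)
```

Keeping the defaults immutable and deriving each installation's settings from them makes it easy to see exactly which limits were changed and why.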

By monitoring so many things simultaneously and reporting back to a central Overseer, the Watchdog can switch over much more quickly and reliably than a simple watchdog process. This reduces downtime and caller confusion.