Adding Reliability With The Overseer and High Availability

Failure is not an option. Failures are disruptive, and a disrupted call center floor is an expensive headache to manage. Asterisk promised a lower total cost of ownership by replacing monolithic telephony hardware installations with multiple commodity servers running call center software. That promise has been fulfilled, but enterprise contact center installations still have to plan for hardware or software disruptions. This is why we introduced the HAASIPP, which allows call survival, and the Overseer, which manages processes and allocates them between servers in a high availability solution.

The Overseer and its watchdog are tasked with monitoring processes, and moving delivery of services from an active to a standby process if a disruption is detected. Problems with service can take many forms:

  • A networking issue makes one or more servers unreachable
  • Asterisk becomes overloaded or unresponsive
  • Web server thread usage exceeds a set threshold
  • A monitored process dies
  • A monitored process exceeds a set threshold for memory usage
  • A disk becomes full
  • Database replication falls behind or breaks
  • and so on…

Which services are expected to run, and the parameters for each, are configurable to suit the local situation. A balance must be struck between allowing a temporary situation to resolve itself (such as a large number of threads being temporarily created, or delays in processing new calls) and failing a service over to another server. However, the Overseer must be able to act decisively when a real failure occurs, especially when call survival is configured. Settings can be configured for individual services as well as for the system as a whole; a server that is struggling under load is probably a bad candidate for acting as the primary for a service.
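One common way to strike that balance is to fail a check only after several consecutive bad samples, so that a short spike clears on its own. The sketch below illustrates the idea. It is not the Overseer's actual code or configuration format, and the names `ThresholdCheck`, `limit` and `max_breaches` are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ThresholdCheck:
    """Hypothetical check that fails only after sustained breaches."""
    name: str
    limit: float        # a sample above this value counts as a breach
    max_breaches: int   # consecutive breaches needed before failing
    breaches: int = 0   # running count of consecutive breaches

    def observe(self, value: float) -> str:
        """Record one sample and return 'ok', 'warning' or 'failed'."""
        if value <= self.limit:
            self.breaches = 0          # the spike resolved itself
            return "ok"
        self.breaches += 1
        if self.breaches >= self.max_breaches:
            return "failed"            # sustained breach: fail the service
        return "warning"               # over the limit, but not for long enough


# A brief spike in thread usage recovers; a sustained one fails the check.
check = ThresholdCheck(name="web_threads", limit=200, max_breaches=3)
results = [check.observe(v) for v in (150, 250, 260, 180, 250, 260, 270)]
print(results)
# ['ok', 'warning', 'warning', 'ok', 'warning', 'warning', 'failed']
```

A lower `max_breaches` makes the check more decisive, which may suit services where call survival matters; a higher value tolerates longer bursts.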

The Overseer process itself has to be monitored in some manner. A hierarchy of servers is configured on install. If an individual server’s Overseer process cannot contact the master Overseer, it attempts to communicate with each server down the list in turn, until it either reaches one that responds or concludes that it is itself the master. This gives a predefined order that all servers can agree on, so if more than one server is down and that includes the master Overseer, a new master can be established automatically. As long as a set of servers that comprise the full set of required services can communicate with each other, the Overseer can bring up the system and make it active.

In cases where the Overseer is managing a system configured for high availability of services, thought must be put into the infrastructure. A fully redundant install of the Q-Suite is of little value if all the servers are plugged into the same switch and the switch experiences a fault. However, with a little preparation, the Q-Suite with redundant hardware and the Overseer can be quite resistant to faults.
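The ordered hand-off of the master role described above can be sketched in a few lines. This is an illustration of the general technique only, not the Overseer's implementation: the function `elect_master`, the server names and the `is_reachable` callback are all hypothetical, and the sketch ignores network partitions, where two groups of servers could each pick a different master.

```python
from typing import Callable, Sequence


def elect_master(hierarchy: Sequence[str], me: str,
                 is_reachable: Callable[[str], bool]) -> str:
    """Walk the configured hierarchy in order and return the master.

    Any reachable server listed before `me` outranks it. If none of
    them responds, `me` concludes that it is the master.
    """
    for server in hierarchy:
        if server == me:
            return me              # nobody above us answered
        if is_reachable(server):
            return server          # defer to the highest reachable server
    raise ValueError(f"{me!r} is not in the configured hierarchy")


# Hypothetical four-server install in which the first two servers are down.
hierarchy = ["qs1", "qs2", "qs3", "qs4"]
down = {"qs1", "qs2"}


def reachable(server: str) -> bool:
    return server not in down


print(elect_master(hierarchy, "qs3", reachable))  # qs3
print(elect_master(hierarchy, "qs4", reachable))  # qs3
```

Because every server walks the same list in the same order, the surviving servers arrive at the same answer without having to negotiate.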

[Image: Overseer status board]

Of course, monitoring is essential. A status board reports the current status of services from the Overseer’s perspective. Clicking any service brings up a more detailed list of the components that are monitored for that service, and a testing history can be used to determine when any problems arose. Colour-coding makes it instantly obvious if there is an issue. Green indicates things are satisfactory. Orange signals a warning, usually because some parameter exceeds its usual bounds. Red means something is wrong.
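One plausible way to roll component results up into a single colour per service is to show the worst status among its components. The sketch below assumes that rule; the source does not say how the status board actually aggregates, and the names `Status` and `service_status` and the component names are invented for the example.

```python
from enum import IntEnum
from typing import Mapping


class Status(IntEnum):
    """Status colours, ordered from best to worst."""
    GREEN = 0    # satisfactory
    ORANGE = 1   # warning: a parameter is outside its usual bounds
    RED = 2      # something is wrong


def service_status(components: Mapping[str, Status]) -> Status:
    """Return the worst status among a service's monitored components."""
    # Assumption: a service with no monitored components is shown as green.
    return max(components.values(), default=Status.GREEN)


# Hypothetical component results for one service.
web = {"process_alive": Status.GREEN,
       "thread_usage": Status.ORANGE,
       "memory_usage": Status.GREEN}
print(service_status(web).name)  # ORANGE
```

Ordering the colours as integers keeps the roll-up to a single `max` call, and a red component always dominates.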

A lot of work has gone into the Overseer to make sure it’s testing the things that matter, and only failing over if something is seriously wrong. It’s a cornerstone of our high availability strategy.