The client was worried because one of the Asterisk servers had gone down without any notice at all. The overseer process on the other Asterisk server had noticed, and had taken over as the active server. The disruption was minimal. Agents were at work. But the client wanted to know what had happened, and what we could do to prevent the issue in the first place. After 30 minutes of poring over logs, digging around, and contacting the colo, we discovered that a tech had decided to swap out the power bar connected to that server. There was no notice, and not even a courtesy three-finger salute.
Whether it’s an incompetent tech, a bad power supply, or a network port going out, you will experience downtime. If you see the “magic smoke” leaving your server, it’s expired. Kaput. When your call center technology needs to be up, you need to make sure you’re using a high availability solution. At its simplest, high availability means a watchdog process that monitors the health of your server, some means of keeping a spare synchronized, and a recovery process that kicks off on the spare when the watchdog detects a failure. Better implementations monitor both the servers and the services running on them. You’ll also want multiple servers and services, with health monitoring for each. High availability in the Q-Suite, Indosoft’s call center software, involves all of these.
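As a rough illustration of the watchdog idea, here is a minimal Python sketch. It is not the Q-Suite overseer; the `Watchdog` class, its `check` and `takeover` callbacks, and the threshold of three consecutive failures are all assumptions made for the example. It shows only the core loop: count failed health checks, and promote the spare once the count crosses the threshold.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Watchdog:
    """Counts consecutive failed health checks and fires a takeover once."""
    check: Callable[[], bool]     # hypothetical probe: True when the peer looks healthy
    takeover: Callable[[], None]  # hypothetical action: promote the spare to active
    max_failures: int = 3         # assumed threshold, tune for your environment
    failures: int = 0
    active: bool = False          # True once this node has taken over

    def tick(self) -> bool:
        """Run one health check. Returns True if a takeover happened on this tick."""
        if self.active:
            return False          # already promoted, nothing left to do
        if self.check():
            self.failures = 0     # a healthy reply resets the counter
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.takeover()
            self.active = True
            return True
        return False
```

Requiring several consecutive failures before acting is a design choice that trades a few seconds of detection time for fewer false takeovers when a single probe is lost.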
Monitoring servers is important. Having a web server running successfully on dying hardware means a future dead web server. System load, free memory, disk usage, and other factors are all important. Many processes will not work successfully if the disk is full or has been remounted read-only. Think about what that does to your recordings. You needed those, right? If it looks like your system is running into trouble, it’s better to do the handoff before something fails hard.
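The host-level checks described above could be sketched along these lines. This is an assumption-laden example, not how the Q-Suite does it: `HostSnapshot`, `take_snapshot`, `problems`, and the limits of 2.0 load per CPU and 90% disk usage are all hypothetical. It relies on `os.getloadavg` and `os.statvfs`, so it is Unix-only, and it leaves out free memory because the standard library has no portable call for that.

```python
import os
import shutil
from dataclasses import dataclass
from typing import List


@dataclass
class HostSnapshot:
    load_per_cpu: float        # 1-minute load average divided by CPU count
    disk_used_fraction: float  # 0.0 (empty) to 1.0 (full)
    read_only: bool            # True if the filesystem is mounted read-only


def take_snapshot(path: str = "/") -> HostSnapshot:
    """Collect a few host metrics for the filesystem containing `path` (Unix only)."""
    load1, _, _ = os.getloadavg()
    usage = shutil.disk_usage(path)
    flags = os.statvfs(path).f_flag
    return HostSnapshot(
        load_per_cpu=load1 / (os.cpu_count() or 1),
        disk_used_fraction=usage.used / usage.total,
        read_only=bool(flags & os.ST_RDONLY),
    )


def problems(snap: HostSnapshot, max_load: float = 2.0, max_disk: float = 0.90) -> List[str]:
    """Return a list of human-readable problems; an empty list means healthy."""
    found = []
    if snap.load_per_cpu > max_load:
        found.append("system load is too high")
    if snap.disk_used_fraction > max_disk:
        found.append("disk is nearly full")
    if snap.read_only:
        found.append("filesystem is mounted read-only")
    return found
```

Calling `problems(take_snapshot("/var"))` on a schedule (the path is only an example) would give the watchdog an early reason to hand off before recordings start failing to write.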
Monitoring the services requires that we define which numbers count as healthy for each one. A database process using 32 GB of memory may be correctly running and configured. An Apache process using 32 GB of memory is a recipe for disaster. An Asterisk process using 32 GB means that something has already gone horribly wrong. On the other hand, an Asterisk process using a big chunk of available CPU is normal, while the same usage in the web service is a sign of impending doom. Other factors such as thread count and responsiveness can tell you whether your processes are performing well or are on the verge of failure.
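One way to express "different numbers for different services" is a per-service limits table. The sketch below is illustrative only: the `Limits` class, the `is_healthy` function, and every threshold in `LIMITS` are made-up values chosen to mirror the examples above, not recommended settings.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class Limits:
    max_rss_bytes: int       # resident memory ceiling
    max_cpu_fraction: float  # share of available CPU, 0.0 to 1.0


# Hypothetical limits: the same reading can be fine for one service
# and alarming for another.
LIMITS = {
    "database": Limits(max_rss_bytes=48 * GIB, max_cpu_fraction=0.80),
    "apache": Limits(max_rss_bytes=2 * GIB, max_cpu_fraction=0.50),
    "asterisk": Limits(max_rss_bytes=4 * GIB, max_cpu_fraction=0.90),
}


def is_healthy(service: str, rss_bytes: int, cpu_fraction: float) -> bool:
    """Compare one service's current usage against its own limits."""
    limits = LIMITS[service]
    return (rss_bytes <= limits.max_rss_bytes
            and cpu_fraction <= limits.max_cpu_fraction)
```

With these assumed limits, `is_healthy("database", 32 * GIB, 0.2)` passes while `is_healthy("apache", 32 * GIB, 0.2)` does not, which is the distinction the paragraph above draws.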
One way to move a service is to transfer a floating IP address representing the service master to a spare machine. In the Q-Suite implementation of overseer/watchdog, a number of IPs can be set up. For instance, you might have 4 active Asterisk servers and 3 active web servers. Each would use its own floating IP, which could be handed off. Sometimes you have sets of IPs that move together, as when a server has multiple IPs for incoming trunks. In other cases, you want the IPs to remain separate. You can’t make a server act as 3 web servers and perform as well as 3 separate machines, for instance.
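The bookkeeping behind floating IPs can be sketched as follows. This is a simplified model under stated assumptions, not the Q-Suite's logic: `IpGroup` and `fail_over` are hypothetical names, the addresses come from the reserved documentation range, and the step of actually binding an address on the new host is left out. It shows two ideas from above: addresses in a group always move together, and separate groups go to separate spares.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class IpGroup:
    name: str
    addresses: Tuple[str, ...]  # addresses that must always move together
    host: str                   # host currently holding the whole group


def fail_over(groups: List[IpGroup], failed_host: str, spares: List[str]) -> Dict[str, str]:
    """Move each group held by failed_host to its own spare.

    Returns {group name: new host}. Groups that cannot get a spare stay put,
    rather than being stacked onto a machine that is already serving a group.
    """
    moves: Dict[str, str] = {}
    free = list(spares)  # copy so the caller's list is not modified
    for group in groups:
        if group.host != failed_host:
            continue
        if not free:
            break
        group.host = free.pop(0)
        moves[group.name] = group.host
    return moves


groups = [
    IpGroup("asterisk-1", ("192.0.2.10", "192.0.2.11"), host="ast1"),
    IpGroup("web-1", ("192.0.2.20",), host="web1"),
]
moves = fail_over(groups, failed_host="ast1", spares=["spare1"])
```

After this call, both trunk addresses in `asterisk-1` are assigned to `spare1`, while `web-1` stays on `web1`.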
With all the moving parts, it’s up to the software provider to configure and test failover, to ensure it’s working properly. Environmental factors, such as network hardware, can complicate things, so it’s not the sort of thing you should be expected to configure yourself. Once it’s set up and you can monitor it, though, it’s reassuring to know that you can have the odd failure without bringing down your floor.