Failure is not an option. Failures are disruptive, and a disrupted call center floor is an expensive headache to manage. The promise of Asterisk was a lower total cost of ownership, achieved by replacing monolithic telephony hardware installations with multiple commodity servers running call center software. That promise has been fulfilled, but enterprise contact center installations still have to plan for hardware or software disruptions. This is why we introduced HAASIPP, to allow call survival, and the Overseer, to manage processes and allocate them between servers in a high availability solution. Continue reading “Adding Reliability With The Overseer and High Availability”
The Real Meaning of Call Survival
With Q-Suite’s HAASIPP, our product now features Call Survival. This is often confused with Call Recovery, and the two terms are then used interchangeably.
To understand what Call Survival really means, let’s look at an example: a caller comes in over a SIP trunk and is ultimately connected to an agent. The diagram below shows a simplified setup with the communication path as follows:
Which Telephony Interface for your Solution: VoIP Gateway VS PCI card?
What is your PBX or CTI solution without an interface to the outside world? One can use VoIP, but many systems still go the traditional route and connect to a telco via either analog (POTS) or a T1/PRI/E1. With Asterisk, the two main ways to do this are an internal PCI card or an external gateway device. Both options will make and receive calls from the telco, but which one is better?
I’ll cut to the chase and say PCI cards should not be recommended for High Availability solutions. They can still have their place in a system without HA and where costs are a major factor, but the decision to use them should be made with their limitations in mind.
Using a Gateway device, such as Patton or Audiocodes, provides the following benefits over an internal card:
- Multiple telephony servers can connect to a single gateway, which is important for the next two items.
- With multiple servers connected to a single gateway in an HA solution, calls will be routed to the active server(s).
- Load balancing can be done at the gateway level in high volume centers to distribute calls across servers.
- Independence from a single server. If a specific server needs to be rebooted or taken offline for maintenance, the gateway will keep working.
- The location of the telco demarc can be independent of the telephony system. It can be in a different room, floor, building or even country. Just be careful of lag causing issues. Given the proper connections, this can allow moving the IP PBX system into the cloud while still supporting traditional telco trunks.
- In a mixed trunk environment of VoIP and traditional telco connections, the gateway can abstract the difference so the IP PBX’s configuration is similar for all trunks.
- Scaling up only requires adding a new gateway and a configuration change to the telephony system, which minimizes downtime and risk, mainly because there is no need to open the system to install new cards.
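To make the routing and load balancing points concrete, here is a minimal Python sketch of how a gateway might spread calls over whichever servers are currently active. It is only an illustration under assumptions: the `Server` and `GatewayRouter` names, the `active` flag, and the round-robin policy are hypothetical and are not taken from any Patton or Audiocodes configuration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    # Hypothetical health flag; a real gateway would derive this from
    # its own keep-alive or probing mechanism.
    active: bool = True


class GatewayRouter:
    """Toy round-robin selection over the servers currently marked active."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0  # index where the next search starts

    def pick(self):
        """Return the next active server, skipping any that are down."""
        total = len(self.servers)
        for offset in range(total):
            index = (self._next + offset) % total
            candidate = self.servers[index]
            if candidate.active:
                self._next = (index + 1) % total
                return candidate
        raise RuntimeError("no active telephony server available")
```

With three servers the router cycles through all of them. If one is marked inactive, for a reboot or maintenance, new calls simply skip it, which is the independence from a single server described in the list.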
Considering the above it is hard to see the case for a PCI card, especially in an HA solution. They may still have their place elsewhere but I’ll be recommending a VoIP Gateway going forward.
The Challenge of VoIP System Failures Not Addressed by Most High Availability Designs
Hardware or software can fail at any time and induce a system failure. It is not possible to reduce such failures to nil. When VoIP-based systems experience such failures, on-going calls are lost. High availability (HA) or redundant systems cannot address this unless they are capable of restoring an on-going call without either end-point re-initiating it. Most high availability systems for Session Initiation Protocol (SIP) based VoIP calls, and their redundancy setups, immediately replace the failed component or sub-system so the system can continue to be used. That is good enough for many situations, but it may not be adequate for mission critical applications when the HA setup cannot restore on-going calls.
Imagine a scenario where an outside caller initiates a call that reaches the demarcation point of the contact center installation. This could be a premise-based contact center or a Cloud setup offering virtual contact center services. After the call setup reaches the intended peer and the conversation starts, your system, whether Cloud-based or on-premise, could experience a failure. Once the system detects the failure, its high availability and redundant setup will kick in and the system will be ready for future calls, but what happens to the on-going calls? They just die. This is the normal operating mode of traditional high availability systems, including most high availability solutions offered for Asterisk. The issue becomes more critical for large contact centers using automatic call distribution (ACD), which carry significant traffic at any given time.
With contact center ACD, going beyond traditional high availability is extremely important. The capability to keep calls alive through call survival is critical, because it allows the user to continue the phone conversation without re-initiating the call. This is a level of sophistication in redundancy that goes beyond bringing replacement software and hardware components into action. It adds the intelligence required to preserve all on-going calls, which is essential for mission critical systems.
SIP Registration Timeout Settings for High Availability
When setting up a highly available telephony system, most people worry about the back end and ensure it functions as they expect and require. However, one highly visible user issue I have seen is a misconfiguration of the registration timeouts on the connected SIP phones. When these are very high, a SIP phone may not notice that a service has moved (via IP/DNS/etc. changes) due to an HA switchover, and it can miss incoming calls until it does. Typically an outgoing call attempt will work, or at the very least cause a registration attempt to the new server the service has moved to.
For example, take a look at Aastra: the defaults in a few models I’ve seen are half an hour for a failed registration.
If the failed registration timeout is half an hour and the phone attempts to re-register and fails, your phone will show an error or unregistered for the next half hour. This can happen when the registration comes in as a box is failing, or when a failover has happened and the configuration is being written or updated as part of the switchover process. A more reasonable set of values is shown in the following.
In this one I’ve lowered the registration failed retry timer and also the timeout retry timer. These make the SIP phone resolve the registration issue more quickly by retrying more often than the defaults. They could be lower depending on the situation.
One precaution before everyone sets these very low: the values should be chosen carefully when the SIP phone is off-site and there are protections in place, for example Fail2Ban, to block brute force attacks. In cases where the SIP phone is an app on a mobile device, the failed registration timeout should be set high enough that it does not trigger a lockout of a valid device. If the devices are in-house or their IPs can be whitelisted, then the values can be lower without worry.
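The trade-off between a short retry timer and a brute force lockout can be checked with some simple arithmetic. The Python sketch below is an assumption-laden illustration: it models a protection in the style of Fail2Ban's `maxretry` and `findtime` settings as "ban once `maxretry` failures fall inside a `findtime` window", counts conservatively, and assumes the phone retries at a fixed interval. The function names and the sample numbers are hypothetical.

```python
def failures_within_window(retry_interval_s: int, findtime_s: int) -> int:
    """Most failed registrations a phone retrying every retry_interval_s
    seconds can produce inside one findtime_s window (conservative count)."""
    if retry_interval_s <= 0 or findtime_s < 0:
        raise ValueError("retry interval must be positive, findtime non-negative")
    return findtime_s // retry_interval_s + 1


def would_trigger_ban(retry_interval_s: int, findtime_s: int, maxretry: int) -> bool:
    """True if the retry interval could reach maxretry failures within findtime."""
    return failures_within_window(retry_interval_s, findtime_s) >= maxretry


def min_safe_retry_interval(findtime_s: int, maxretry: int) -> int:
    """Smallest whole-second retry interval that stays below maxretry."""
    if maxretry < 2:
        raise ValueError("with maxretry below 2 any single failure bans the phone")
    return findtime_s // (maxretry - 1) + 1
```

For instance, with an assumed window of 600 seconds and a limit of 5 failures, `would_trigger_ban(30, 600, 5)` is true, while `min_safe_retry_interval(600, 5)` gives 151 seconds. The retry interval is also roughly the longest time the phone stays unregistered after a failed attempt, so the safe minimum is the figure to weigh against missed incoming calls.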
The Differences in Call Survival and Call Recovery
While investigating High Availability (HA) in CTI and PBX systems you will often find mention of Call Recovery. Another term you run into is Call Survival, which is often, and incorrectly, used interchangeably with Call Recovery. The two are different approaches to solving the same problem: a failure that would interrupt the calls on a system.
With Call Survival when a failure happens the caller and callee do not have to take action to continue their call as it survives the failure. At a high level this is done by reacting to the failure quickly and re-routing the audio path around the failure.
With Call Recovery, what happens after a failure depends on the system. Sometimes the caller will need to redial the callee, or the redial could be an automated process, but the callee still has to answer this new call.
From a user perspective the better option is Call Survival, as users may only experience a momentary interruption in their audio while the path is rerouted around the failure, instead of having to re-initiate a call to recover it.
The Q-Suite platform supports Call Survival with the help of the Overseer Watchdog, which provides HA for other services in addition to being one part of the Call Survival solution.
Audio alerts triggered by real-time contact center ACD activities
Automatic Call Distributors (ACD) control and manage the work-flow of a contact center. A multi-channel contact center ACD offers skills based routing and queue prioritization for phone calls, emails, and web channels. The real-time queue metrics are a good indicator of the contact center activities. Even with work-force management (WFM) software predictions, it is not always possible to staff adequately for handling sudden spurts in call volumes. Organizations should have procedures in place to handle such events. One such option is having supervisory staff and supplementary employees participate in call handling if required.
Key queue metrics like the total number of calls waiting in a given queue, the wait time, the abandon rate, and the overall service level provide a measure of real-time call center activities. Good contact center software will allow call center managers to set conditions based on the queue metrics to trigger audio alerts. Different audible alerts can be set, each specific to a particular queue metric.
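As an illustration of what condition-based alerts could look like, the Python sketch below maps conditions on queue metrics to alert sounds. It is a hypothetical model, not the Q-Suite configuration format: the `QueueMetrics` and `AlertRule` types, the field names, the sound file names, and the thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class QueueMetrics:
    calls_waiting: int
    longest_wait_s: int
    abandon_rate: float   # fraction of calls abandoned, 0.0 to 1.0
    service_level: float  # fraction answered within target, 0.0 to 1.0


@dataclass(frozen=True)
class AlertRule:
    sound: str  # hypothetical name of the audio file to play
    condition: Callable[[QueueMetrics], bool]


def triggered_sounds(metrics: QueueMetrics, rules: List[AlertRule]) -> List[str]:
    """Return the sound of every rule whose condition holds for these metrics."""
    return [rule.sound for rule in rules if rule.condition(metrics)]


# Example rules with made-up thresholds, one sound per metric.
RULES = [
    AlertRule("calls-waiting.wav", lambda m: m.calls_waiting > 10),
    AlertRule("long-wait.wav", lambda m: m.longest_wait_s > 120),
    AlertRule("abandons.wav", lambda m: m.abandon_rate > 0.05),
    AlertRule("service-level.wav", lambda m: m.service_level < 0.80),
]
```

Evaluating `triggered_sounds` against each fresh snapshot of the real-time metrics yields the alerts to play. Keeping one rule per metric mirrors the idea that each audible alert is specific to a particular queue metric.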
Q-Suite for Asterisk is a powerful contact center ACD that offers this feature as part of its call center software. It is multi-tenant software for setting up Cloud-based, fault tolerant High Availability (HA) contact center solutions. Its audible notifications can be triggered by setting conditions on queue parameters that are monitored as part of the real-time contact center reporting. These notifications allow contact center floor operations to initiate the procedures that are in place for handling sudden spurts in call volume.