When I recently assisted a customer in choosing a new cloud service provider, the shortlisted providers offered 95%, 99%, and 99.9% availability, each labeling its service “High Availability”. On a scale from 0% to 100%, all of these numbers sound rather good, and we would naturally assume that these services almost never fail. However, let us take a closer look at what high availability truly means for IT environments, how it affects UCS, and why you should also consider the time to recovery and planned downtimes.
High Availability: Percentages and Time Consideration
The percentages mentioned above are the most common way to express availability. They denote the share of a year during which the system is available; the remainder is the time it may be offline. Sometimes you also find terms such as “double nine” or “triple nine” in advertisements, which refer to 99% or 99.9% availability.
For a yearly contract, the most common figures translate into the following maximum downtimes for a 365-day year:
|Availability||Downtime per year|
|90%||36 days and 12 hours|
|95%||18 days and 6 hours|
|99%||3 days, 15 hours and 36 minutes|
|99.9%||8 hours, 45 minutes and 36 seconds|
|99.99%||52 minutes and 33 seconds|
|99.999%||5 minutes and 15 seconds|
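These downtimes follow directly from the percentages: the permitted downtime is the unavailable fraction multiplied by the length of the contract period. A minimal Python sketch that reproduces the figures for a 365-day year (the function name `yearly_downtime` is only illustrative):

```python
from datetime import timedelta
from decimal import Decimal


def yearly_downtime(availability_percent: str, days_per_year: int = 365) -> timedelta:
    """Return the downtime per year that a given availability percentage permits."""
    # Decimal keeps values such as 99.999 free of floating-point rounding noise.
    unavailable = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    seconds = unavailable * days_per_year * 24 * 3600
    return timedelta(seconds=float(seconds))


for level in ("90", "95", "99", "99.9", "99.99", "99.999"):
    print(f"{level}% -> {yearly_downtime(level)}")
```

Passing a different `days_per_year` (for example 30 for a monthly contract) shows how much the permitted downtime shrinks when the reference period is shorter.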
Admittedly, these times look at availability from a business perspective. They can help you gauge the general reliability of a service: the better the numbers, the more effort has to go into making the service robust. However, as they represent a cumulative number, they are not the most accurate measure of reliability and might even be misleading.
Just to give you an example: In a hospital, a life-support device fails for a quarter of a second every 30 minutes. That adds up to 73 minutes, or a bit over an hour, per year. However, everyone would prefer this device, which on paper only meets a 99.9% guarantee, to one that offers 99.999% availability but fails once a year for 5 minutes. Thus the machine with the higher availability on paper produces the worse result, while the one that looks worse on paper allows you to survive.
While UCS is not built for medical devices, the example nicely showcases why more factors need to be considered.
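The arithmetic behind the life-support example can be checked with a few lines of Python. This is only a sketch; the helper `summarize` and its field names are made up for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def summarize(outage_seconds: float, outages_per_year: int) -> dict:
    """Cumulative downtime, availability and longest single outage
    for a failure pattern made up of equally long outages."""
    total = outage_seconds * outages_per_year
    return {
        "total_minutes": total / 60,
        "availability_percent": 100 * (1 - total / SECONDS_PER_YEAR),
        "longest_outage_seconds": outage_seconds,
    }


# Device A: fails for 0.25 s every 30 minutes, i.e. 48 times per day.
device_a = summarize(0.25, 48 * 365)
# Device B: fails once per year for 5 minutes.
device_b = summarize(5 * 60, 1)

for name, result in (("A", device_a), ("B", device_b)):
    print(name, result)
```

Device B wins on the cumulative percentage, yet its longest single outage is 1,200 times longer than that of device A, which is the number that matters to the patient.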
High Availability: Time to Recovery
The above example makes the less reliable system the better choice because it returns to operation faster. Technically, this is known as “time to recovery.” In many cases involving UCS, this measure is the better basis for designing your network. In most cases, if a service recovers before any user notices an error, the end user will perceive the system as reliable.
When looking at UCS’ built-in features, one of the most obvious pain points is the authentication service. When users cannot log in, they will certainly notice the issue very soon. UCS’ built-in domain concept, when used with at least a master and a backup, provides excellent redundancy that mitigates this problem.
Planned Downtime and High Availability
Another factor to take into account is that availability figures are not always understood to cover planned downtimes. Whether planned downtimes count, however, can be crucial to planning the system.
Here are two examples. First, think of a smelting plant with 50 employees. It is most probably closed over the weekend, which allows planned downtime to take place then. At the same time, if the computer fails during working hours, thousands of dollars in smelted metal might need reworking.
At the other end of the spectrum are airlines. No matter whether planned or unplanned, their systems cannot go down.
Planning the Needs
These three areas cover the bulk of the questions when planning High Availability: How often can my system fail? How long can it be down if it fails? And can I take it down for maintenance work?
These questions need to be answered for every service in order to map the needs of the system. Particularly at the more stringent end, applying a single set of parameters to all services is impractical: it would be unnecessary for many of them and prohibitively expensive.
Just to give you an example: Imagine you are running a call center with 10,000 agents and four people in HR. If your HR software goes down, it might annoy four people. If no one can log in, including the 10,000 agents, you face a threat that is existential to your business. At the same time, the HR software is probably considerably more complicated to make highly available (HA) because it was not designed from the ground up for HA. Thus the two services will be assigned different priorities.
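One way to keep track of the answers per service is a small requirements record. The following Python sketch uses the call center from the example; the class, the field names and all thresholds are hypothetical and only the user counts come from the example above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityNeeds:
    service: str
    max_outages_per_year: int    # How often may the service fail?
    max_minutes_per_outage: int  # How long may it stay down?
    maintenance_window: bool     # May it be taken down for planned work?
    affected_users: int


# Illustrative values, not recommendations.
needs = [
    AvailabilityNeeds("hr-software", 12, 240, True, 4),
    AvailabilityNeeds("login", 1, 5, False, 10_004),
]

# Rank services by the number of people an outage would affect.
by_priority = sorted(needs, key=lambda n: n.affected_users, reverse=True)
for entry in by_priority:
    print(entry)
```

Sorting by affected users is just one possible ranking; cost of an outage per hour would be an equally valid key.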
Software Principles to create High Availability Systems
There are several techniques for creating High Availability systems. Here are two of the most popular ones used in software.
The first one, already mentioned above, is redundancy: multiple systems are able to handle the work. The UCS authentication services are a prime example of this. The clients can work with multiple servers that provide login functionality. If the software, here Samba 4, is designed to handle these scenarios, the whole network is quick to set up and operate.
When deciding how many servers you need for an individual service, the rule of thumb is: the number of servers needed to handle the load, plus one that can be taken down for maintenance, plus one that is allowed to fail. However, the questions about the actual needs of the users have to be asked before deciding which services to invest in.
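The rule of thumb can be written down as a small calculation. In this Python sketch, the function name and the load figures in the usage example are assumptions chosen for illustration:

```python
import math


def servers_needed(peak_load: float, capacity_per_server: float) -> int:
    """Servers for the load, plus one for maintenance, plus one for failure."""
    if peak_load < 0 or capacity_per_server <= 0:
        raise ValueError("peak_load must be >= 0 and capacity_per_server > 0")
    # At least one server is needed even if the load is negligible.
    for_load = max(1, math.ceil(peak_load / capacity_per_server))
    return for_load + 1 + 1


# Hypothetical example: 2,500 logins per minute at peak,
# each server handles 1,000 logins per minute.
print(servers_needed(2500, 1000))
```

For a lightly loaded service the result is three servers, which is why the rule is worth applying only to services whose failure really hurts.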
Another technique is to cache data so that it remains available if the primary server fails. If the data or service can be temporarily stored on another machine, this might be enough for the user not to notice that the server failed. Roaming Profiles and Folder Redirection are two prime examples in client-server environments. Placing one or more SMTP proxy servers in front of your mail server combines this approach with redundancy. Caching, however, provides only a short-term bridge between working states.
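The caching idea can be illustrated with a tiny "serve the last good copy on error" wrapper. This is a generic Python sketch, not how Roaming Profiles or Folder Redirection are implemented; the class `StaleOnErrorCache` and the simulated primary server are invented for the example:

```python
from typing import Callable, Dict


class StaleOnErrorCache:
    """Ask the primary first; fall back to the last good copy if it is down."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._last_good: Dict[str, str] = {}

    def get(self, key: str) -> str:
        try:
            value = self._fetch(key)
        except ConnectionError:
            if key in self._last_good:
                return self._last_good[key]  # possibly stale, but available
            raise  # nothing cached: the outage becomes visible
        self._last_good[key] = value
        return value


# Simulated primary server that can be switched off.
primary = {"up": True}


def fetch_from_primary(key: str) -> str:
    if not primary["up"]:
        raise ConnectionError("primary unreachable")
    return f"profile data for {key}"


cache = StaleOnErrorCache(fetch_from_primary)
first = cache.get("alice")   # fetched from the primary and remembered
primary["up"] = False
second = cache.get("alice")  # primary is down, served from the local copy
print(first == second)
```

Data that was never fetched before the outage still fails, and cached copies grow stale, which is why caching only bridges short interruptions.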
UCS and High Availability
After looking at the criteria for HA and the two most common implementations in software, let us consider which methods UCS uses by default and which you can easily implement.
|Directory||LDAP||The LDAP server is available for reading on all Master, Backup and Slave systems. Operating at least a Master and a Backup will give you the needed peace of mind. Please do not forget to point your clients to two or more servers.|
|Directory||Samba AD||Samba AD has a built-in multi-master setup. As such, group policies and authentication, as well as a certain amount of administration, will work as long as one Domain Controller is online. Please be aware of the dependency on a working DNS setup.|
|Directory||Univention Management Console||Manual failover: The domain management is only available on the Master. If the Master fails permanently, a Backup can be promoted to Master; this decision is made by an administrator. However, a failure of the UMC would only affect administration and have little to no effect on the end user.|
|Authentication||Kerberos||Like the LDAP Server, Kerberos is available for authentication on all Master, Backup and Slave systems.|
|Authentication||SAML||UCS comes with a preconfigured SAML cluster, which means that any UCS DC Master and DC Backup instance offers SAML authentication with a shared backend. To expose these to your users, you can use DNS round robin or a load balancer.|
UCS does not, by default, replicate its file services across multiple servers. The main reason is the amount of storage space this would require in most installations. Protocol complexity and the variety of endpoints are two more reasons.
The easiest way to mitigate file share issues is to cache the content on the client. For Windows clients, folder redirection is the way to go.
|File Services||NFS||Like Samba, NFS is not replicated by default, and as with Samba, caching is the easiest way to go. CacheFS is a good option for client-side caching.|
|Network Services||DNS||Each UCS Domain Controller provides a DNS service. You only need to make sure that your clients use multiple servers.|
|Network Services||DHCP||HA for DHCP depends on the intended usage. If every client has an IP in the Management Console, installing DHCP on multiple servers is sufficient. If free address ranges are used, the DHCP servers have to be configured as failover peers for the pool.|
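Several rows above depend on clients knowing more than one server. Real clients (DNS resolvers, LDAP libraries, domain-joined machines) handle this failover themselves once several servers are configured; the following Python sketch only illustrates the principle. The function names and host names are hypothetical, and port 389 is the standard LDAP port:

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_probe(host: str, port: int = 389, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(
    servers: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> Optional[str]:
    """Return the first server in the list that answers, or None."""
    for host in servers:
        if probe(host):
            return host
    return None


# Hypothetical host names; replace with your own Master and Backup.
configured = ["master.example.org", "backup.example.org"]
# chosen = first_reachable(configured)
```

With only one entry in `configured`, a single server failure leaves the client without a service, no matter how many servers the domain runs.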
Most other services, such as virtualization and most partner products, require a more detailed analysis of what the goals are. Our team will be happy to provide you with an individualized project plan.
Conclusion: High Availability for IT Environments
High Availability is a frequently discussed topic. When done correctly, HA can provide a great user experience. However, when not planned properly, it can consume considerable resources without solving any problems.