
Availability

What is availability?

Availability can be thought of in a couple of ways. One way to consider it is how resistant a system is to failures. For instance, what happens if a server in your system fails? What happens if your database fails? Will your system go down completely, or will it still be operational? This is often described as a system’s fault tolerance.

Another way to think about availability is the percentage of time in a given period, like a month or a year, during which a system is operational and capable of satisfying its primary functions.

Availability is crucial to consider when evaluating a system. In today’s world, most systems have an implied guarantee of availability.

Imagine a system supporting airplane software, which allows an airplane to function properly. If that system were to go down while an airplane is flying, it would be absolutely unacceptable. Similarly, consider stock or crypto exchange systems, where downtime could lead to customers losing trust and money.

Even with less critical examples like YouTube or Twitter, downtime would be detrimental, as hundreds of millions of people use these platforms daily.

Cloud providers such as AWS, Azure, and GCP also need to maintain high availability. If parts of their systems go down, it affects all the businesses and customers relying on their services. For example, in summer 2019, Google Cloud Platform experienced a significant outage that lasted for a few hours, affecting many businesses, including Vimeo.

In summary, availability is of great importance in system design and operations.


How to measure availability?

Availability is usually measured as the percentage of a system’s uptime in a given year.

For instance, if a system is up and operational for half of an entire year, then we can say that the system has 50% availability, which is quite poor.

In the industry, most services or systems aim for high availability, so we often measure availability in terms of “nines” rather than exact percentages.

“Nines” are shorthand for availability percentages made up of the digit nine. For example, a system with 99% availability has two nines of availability, because the digit nine appears twice in the percentage. Similarly, 99.9% is three nines of availability, and so on.

This terminology is a standard way that people discuss availability in the industry.

Below is a table from Wikipedia showing a range of common availability percentages, which helps illustrate the differences between various levels of system availability.

| Availability % | Downtime per year | Downtime per month | Downtime per week | Downtime per day |
| --- | --- | --- | --- | --- |
| 55.5555555% ("nine fives") | 162.33 days | 13.53 days | 74.92 hours | 10.67 hours |
| 90% ("one nine") | 36.53 days | 73.05 hours | 16.80 hours | 2.40 hours |
| 95% ("one and a half nines") | 18.26 days | 36.53 hours | 8.40 hours | 1.20 hours |
| 97% | 10.96 days | 21.92 hours | 5.04 hours | 43.20 minutes |
| 98% | 7.31 days | 14.61 hours | 3.36 hours | 28.80 minutes |
| 99% ("two nines") | 3.65 days | 7.31 hours | 1.68 hours | 14.40 minutes |
| 99.5% ("two and a half nines") | 1.83 days | 3.65 hours | 50.40 minutes | 7.20 minutes |
| 99.8% | 17.53 hours | 87.66 minutes | 20.16 minutes | 2.88 minutes |
| 99.9% ("three nines") | 8.77 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
| 99.95% ("three and a half nines") | 4.38 hours | 21.92 minutes | 5.04 minutes | 43.20 seconds |
| 99.99% ("four nines") | 52.60 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
| 99.995% ("four and a half nines") | 26.30 minutes | 2.19 minutes | 30.24 seconds | 4.32 seconds |
| 99.999% ("five nines") | 5.26 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
| 99.9999% ("six nines") | 31.56 seconds | 2.63 seconds | 604.80 milliseconds | 86.40 milliseconds |
| 99.99999% ("seven nines") | 3.16 seconds | 262.98 milliseconds | 60.48 milliseconds | 8.64 milliseconds |
| 99.999999% ("eight nines") | 315.58 milliseconds | 26.30 milliseconds | 6.05 milliseconds | 864.00 microseconds |
| 99.9999999% ("nine nines") | 31.56 milliseconds | 2.63 milliseconds | 604.80 microseconds | 86.40 microseconds |
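The values in the table above come from a simple calculation: downtime is just (1 − availability) times the length of the period. A minimal sketch in Python, using a 365.25-day average year as the table does:

```python
def downtime_seconds(availability_pct: float, period_seconds: float) -> float:
    """Downtime within a period for a given availability percentage."""
    return (1 - availability_pct / 100) * period_seconds

YEAR = 365.25 * 24 * 3600  # average year in seconds, matching the table

# Three nines: roughly 8.77 hours of downtime per year
print(downtime_seconds(99.9, YEAR) / 3600)

# Five nines: roughly 5.26 minutes of downtime per year
print(downtime_seconds(99.999, YEAR) / 60)
```

Plugging in any row's percentage reproduces that row's yearly downtime figure.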

As you can see, even though 99% availability seems impressive, being down for three and a half days or more per year is still quite problematic and might be considered unacceptable. For systems that involve life-and-death situations, such downtime is undoubtedly unacceptable. Even for services like Facebook or YouTube, which serve billions of users, that amount of downtime is too high.

Five nines of availability (99.999%) is often considered the gold standard for availability. If your system achieves this level of availability, it can be regarded as a highly available system.


SLAs and SLOs

A Service Level Agreement (SLA) and a Service Level Objective (SLO) are related concepts used to define and measure the quality of service provided by a service provider.

SLA (Service Level Agreement)

An SLA is a formal, legally binding contract between a service provider and a client that outlines the expected level of service, performance metrics, and responsibilities of both parties. It specifies measurable targets such as availability, response time, and throughput, as well as consequences for not meeting these targets, such as refunds or service credits.

SLO (Service Level Objective)

An SLO is a specific, measurable goal or target within an SLA that defines a particular aspect of the service quality. SLOs serve as benchmarks to evaluate the service provider’s performance and ensure that the agreed-upon service levels are met. Examples of SLOs include system availability (e.g., 99.9% uptime), maximum response time for support requests, or error rates.
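In practice, teams often work with SLOs through an error budget: the amount of unreliability the SLO permits over a period. A small sketch of that calculation (the function name is made up, and a 30-day month is assumed for simplicity):

```python
def error_budget_minutes(slo_pct: float, period_days: float = 30) -> float:
    """Allowed downtime, in minutes, for a given SLO over a period."""
    period_minutes = period_days * 24 * 60
    return (1 - slo_pct / 100) * period_minutes

# A 99.9% monthly SLO leaves roughly 43 minutes of allowed downtime
print(error_budget_minutes(99.9))
```

If the service has already been down longer than its budget allows, the budget is exhausted and further risky changes are typically deferred.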


Do you always need high availability?

While availability is a critical consideration in system design, it is not always of utmost importance. Achieving five nines of availability (99.999%) isn't always necessary, because high availability comes with trade-offs: ensuring that level of availability can be challenging and resource-intensive.

When designing a system, it’s essential to carefully evaluate whether your system requires high availability or if only specific components need it. This assessment helps allocate resources effectively and prioritize the most critical components for high availability while allowing other parts to function with lower levels of redundancy. Ultimately, balancing the need for high availability with the system’s overall requirements and constraints is crucial for efficient and sustainable system design.

For example, consider a payment processing system like Stripe or Visa. In such a system, the transaction processing component is critical and requires high availability to ensure uninterrupted service for customers making payments. Downtime or failures in this component could result in lost revenue and negatively impact the user experience.

On the other hand, there may be secondary components like an analytics dashboard or reporting tools that, while important, do not need the same level of high availability. These components can tolerate occasional downtime without significantly impacting the overall system performance or user experience.


How to achieve a highly available system?

Conceptually, achieving a high-availability (HA) system is straightforward. First and foremost, you need to ensure that your system has no single points of failure (SPOFs): components whose failure brings down the entire system. Keep in mind that even team members with specialized knowledge can be SPOFs.

To eliminate SPOFs, you can introduce redundancy. Redundancy involves duplicating, triplicating, or even further multiplying certain parts of the system.

For instance, if you have a simple system where clients interact with a server and the server communicates with a database, the server is a single point of failure. If it gets overloaded or goes down for any reason, the entire system fails.

To enhance availability and eliminate the SPOF, you can add more servers and a load balancer between clients and servers to distribute the load across multiple servers.
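The core of what a load balancer does here can be sketched in a few lines: cycle through the servers and skip any that are known to be unhealthy. This is a toy illustration (the class and server names are invented for the example; real load balancers also run health checks, track connections, and more):

```python
import itertools

class RoundRobinBalancer:
    """Toy load balancer: cycles through servers, skipping unhealthy ones."""

    def __init__(self, servers):
        self.servers = servers
        self.healthy = set(servers)
        self._cycle = itertools.cycle(servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        self.healthy.add(server)

    def next_server(self):
        # Try each server at most once per request
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")

lb = RoundRobinBalancer(["server-a", "server-b", "server-c"])
lb.mark_down("server-b")  # simulate a failed server
print([lb.next_server() for _ in range(4)])
# ['server-a', 'server-c', 'server-a', 'server-c']
```

Because failed servers are simply skipped, clients keep getting served as long as at least one backend remains healthy.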

However, the load balancer itself could become a single point of failure, so it should also be replicated and run on multiple servers.
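The payoff of this kind of replication can be quantified. If replicas fail independently, the system is unavailable only when all of them are down at once, so combined availability is 1 − (1 − A)^n. A back-of-the-envelope sketch (note that the independence assumption rarely holds perfectly in practice, since replicas often share power, networks, or software bugs):

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent copies, any one of which suffices."""
    return 1 - (1 - single) ** replicas

# Two 99% servers behind a load balancer: the system fails only
# when both fail simultaneously, giving roughly four nines.
print(combined_availability(0.99, 2))
```

Adding a second 99% replica jumps the system from two nines to roughly four nines, which is why redundancy is the primary tool for availability.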


Passive Redundancy

In passive redundancy, a primary system/component is backed up by a secondary (standby) system/component that takes over when the primary fails. There are several standby strategies, each with different recovery characteristics:

  • Cold standby: The backup system is offline and powered down. When the primary fails, the standby must be started and configured before it can serve traffic. Recovery time is the longest of the three, often measured in minutes or longer. This approach is the cheapest to maintain.
  • Warm standby: The backup system is running and periodically synchronized with the primary, but it is not actively serving traffic. On failure, it needs a brief promotion step (e.g., updating DNS or promoting a database replica). Recovery is faster than cold standby, typically measured in seconds to a few minutes.
  • Hot standby: The backup system is running, fully synchronized, and ready to take over instantly. Failover can happen automatically with near-zero downtime. This is the most expensive option but provides the fastest recovery.

Examples include redundant power supplies, database replication with a standby database, or a backup server that takes over when the primary server fails.
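The failover decision behind passive redundancy can be sketched in a few lines: route to the primary while it passes health checks, and promote the standby when it doesn't. This is a simplified illustration with invented names; real failover systems must also guard against split-brain, replicate state, and coordinate the promotion step:

```python
class Failover:
    """Toy hot-standby failover: use the primary until it fails a health check."""

    def __init__(self, primary, standby, is_healthy):
        self.primary = primary
        self.standby = standby
        self.is_healthy = is_healthy  # callable: node -> bool

    def active_node(self):
        if self.is_healthy(self.primary):
            return self.primary
        # Promote the standby; in a real system this step might update DNS,
        # reconfigure a proxy, or promote a database replica.
        return self.standby

down = set()
cluster = Failover("db-primary", "db-standby", lambda node: node not in down)
print(cluster.active_node())  # db-primary
down.add("db-primary")        # simulate a primary failure
print(cluster.active_node())  # db-standby
```

With a hot standby this switch is nearly instantaneous; with warm or cold standby, the promotion step takes correspondingly longer.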

Active Redundancy

In active redundancy, multiple systems/components work in parallel and share the workload, ensuring continued operation even if one or more components fail. Active redundancy may also provide load balancing and improved performance. Examples include redundant network connections, RAID configurations for data storage, or multiple load-balanced servers providing the same service.

High Availability Technologies

In practice, several widely adopted technologies help achieve high availability:

  • Load balancers (such as Nginx, HAProxy, or cloud-native solutions like AWS ALB) distribute traffic across multiple backend servers and automatically route around unhealthy instances.
  • Database replication (primary-replica setups in PostgreSQL, MySQL, or managed services like Amazon RDS Multi-AZ) ensures that a copy of the data is always available for failover.
  • Container orchestration platforms like Kubernetes maintain desired pod replica counts, automatically restarting failed containers and rescheduling them onto healthy nodes.

Monitoring and Alerting

Redundancy alone does not guarantee high availability. Without proper monitoring and alerting, failures can go undetected, and recovery processes may never be triggered. At a minimum, an HA system should include health checks for all critical components, dashboards that surface real-time system status, and automated alerts (via tools like Prometheus, Grafana, Datadog, or PagerDuty) that notify on-call engineers when thresholds are breached. The faster a team detects a failure, the faster it can respond – whether automatically or through manual intervention.
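As a toy illustration of the alerting idea, here is a minimal threshold check (the function name and the 1% threshold are made up for the example; real systems like Prometheus evaluate such rules continuously over time windows):

```python
def check_error_rate(errors: int, requests: int, threshold: float = 0.01):
    """Return an alert message when the error rate exceeds the threshold."""
    if requests == 0:
        return None
    rate = errors / requests
    if rate > threshold:
        return f"ALERT: error rate {rate:.2%} exceeds {threshold:.2%}"
    return None

print(check_error_rate(3, 1000))   # under the threshold: no alert
print(check_error_rate(50, 1000))  # 5% error rate: fires an alert
```

A rule like this, wired to a paging service, is what turns redundancy from a passive safety net into an actively maintained guarantee.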

It’s also important to have a rigorous process in place for handling system failures, since recovery may require human intervention. For instance, if servers in your system crash, someone has to bring them back online, and well-defined recovery procedures ensure that this happens quickly. Establishing and practicing these processes is essential for maintaining high availability.


Trade-offs

Like many aspects of programming, achieving availability also involves trade-offs, such as:

  • Cost: Higher availability typically requires redundant resources, such as additional servers, storage, and networking infrastructure, leading to increased operational costs.

  • Complexity: Implementing fault tolerance, failover, and load balancing mechanisms to achieve high availability can introduce complexity into the system architecture, making it harder to understand, maintain, and troubleshoot.

  • Performance: Highly available systems may require distributing data across multiple nodes or geographic locations, which can increase latency and reduce overall performance.

  • Consistency: In distributed systems, maintaining high availability may involve sacrificing strong consistency for eventual consistency or using weaker consistency models, which can impact application logic and data integrity.

In summary, while high availability is crucial for many systems, it’s essential to carefully consider the trade-offs involved and balance them against the specific requirements and constraints of your application.

This post is licensed under CC BY 4.0 by the author.