Understanding Fault Tolerance: Building Robust Systems

2 min readJul 20, 2024

Fault tolerance is a crucial concept in tech infrastructure, ensuring systems can handle errors and outages without losing functionality. This capability is essential for maintaining user confidence and avoiding costly downtime. Here’s a simplified guide to fault tolerance and how to achieve it in different system layers.

What is Fault Tolerance?

Fault tolerance refers to a system’s ability to continue operating properly in the event of a failure. For example, if a database server crashes, a fault-tolerant system would switch to a backup database to maintain functionality.

Fault Tolerance vs. High Availability

While related, fault tolerance and high availability are not identical. High availability focuses on maximizing system uptime, often requiring fault-tolerant systems. However, achieving high availability also involves minimizing downtime during upgrades and maintenance.

Goals of Fault Tolerance

Normal Functioning vs. Graceful Degradation

Normal Functioning: The system continues to operate fully without any noticeable impact on the user experience, even during a failure. This approach is ideal but more costly.
Graceful Degradation: The system allows some functionality to degrade but remains operational. This approach balances cost and user experience.

Setting Survival Goals

Achieving 100% fault tolerance is impractical, so it’s important to set realistic goals based on your application’s needs:

Survive Node Failure: Use multiple nodes within the same data center.
Survive Availability Zone (AZ) Failure: Deploy across multiple data centers within a region.
Survive Region Failure: Use multiple cloud regions.
Survive Cloud Provider Failure: Combine on-premises systems with multiple cloud providers.

Cost Considerations

Building fault-tolerant systems involves costs, such as additional hardware and software instances. However, the costs of downtime — lost revenue, reputation damage, and engineering hours — can be significantly higher. Therefore, investing in fault tolerance can be more economical in the long run.

Examples of Fault-Tolerant Architectures :

Application Layer

Modern applications often use containerization platforms like Kubernetes. This allows multiple instances of services to run simultaneously, so if one instance fails, others can handle the load.

Database Layer

Using distributed databases like CockroachDB can enhance fault tolerance. CockroachDB’s node-based architecture ensures data is distributed, providing strong consistency and high availability.

Achieving Fault Tolerance

To build fault-tolerant systems, consider the following:

Multiple Hardware Systems: Spread databases across different servers.
Multiple Software Instances: Use containerization to run multiple instances of services.
Backup Power Sources: Use generators to protect on-premises systems from power outages.

By adopting these strategies, you can create robust systems that minimize the impact of failures, ensuring a smooth user experience and protecting your business from costly downtime.

That’s it!. Feel free to follow me and share your thoughts on what else I can improve.

Do checkout the my complete ongoing series on System Design —

jaspreetsodhi02.medium.com/list/system-design-series-f5054fed26fd

See you in the next part! 😊