Cloud providers such as Amazon Web Services, Azure and Google publish a Service Level Agreement, or SLA, for each individual service they offer. Architects, platform engineers and developers are then responsible for combining these services into an architecture that hosts an application.
Taken in isolation, these services usually offer somewhere between three and four nines of availability:
- Azure Traffic Manager: 99.99% or 'four nines'.
- SQL Azure: 99.99% or 'four nines'.
- Azure App Service: 99.95% or 'three nine five'.
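To make these percentages more concrete, here is a minimal sketch that converts an availability figure into a downtime budget. The function name `allowed_downtime_minutes` and the choice of a 365-day year are my own assumptions for illustration; providers often measure SLAs over a shorter period such as a month, so treat the numbers as indicative only.

```python
# Assumption: a non-leap year is used as the measurement period.
MINUTES_PER_YEAR = 365 * 24 * 60

def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget, in minutes, implied by an availability percentage."""
    return period_minutes * (1 - availability_percent / 100)

# SLA figures as quoted in the list above.
slas = {
    "Azure Traffic Manager": 99.99,
    "SQL Azure": 99.99,
    "Azure App Service": 99.95,
}

for name, pct in slas.items():
    print(f"{name}: {allowed_downtime_minutes(pct):.1f} minutes/year")
```

Under these assumptions, four nines works out to roughly 53 minutes of permitted downtime per year and 99.95% to roughly 263 minutes.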
However, when services are combined in an architecture, an outage in any one component can take the system down, so the overall availability is not the same as that of the individual component services.
Serial Compound Availability
In this example there are three possible failure modes:
- SQL Azure is down
- App Service is down
- Both are down
Therefore the overall availability of this "system" must be lower than 99.95%. My rationale for thinking this is as follows. Suppose the SLA for both services was:
The service will be available 23 hours out of 24
Then:
- The App Service could be out between 0100 and 0200
- The Database could be out between 0500 and 0600
Both component parts are within their SLA but the total system was unavailable for 2 hours out of 24.
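The reasoning above can be sketched numerically. This is only an illustration under two different assumptions that I am introducing: either the components fail independently (availabilities multiply), or outages never overlap (unavailabilities add, which is the worst case described in the 23-out-of-24 example). The function names are hypothetical.

```python
from math import prod

def serial_availability(*availabilities):
    """Availability when every component must be up, assuming independent failures."""
    return prod(availabilities)

def worst_case_serial_availability(*availabilities):
    """Lower bound when outages never overlap: the unavailabilities simply add up."""
    return max(0.0, 1 - sum(1 - a for a in availabilities))

app_service = 0.9995  # 99.95%, as quoted above
sql_azure = 0.9999    # 99.99%, as quoted above

print(f"independent failures: {serial_availability(app_service, sql_azure):.6%}")
print(f"non-overlapping outages: {worst_case_serial_availability(app_service, sql_azure):.6%}")

# The 23-hours-out-of-24 example: two services, each allowed one hour of downtime.
hourly_sla = 23 / 24
worst_hours = worst_case_serial_availability(hourly_sla, hourly_sla) * 24
print(f"worst-case hours available per day: {worst_hours:.1f}")
```

In both cases the result is below the weakest component's 99.95%, which matches the intuition that a serial chain can only be less available than its least available link.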
Serial and Parallel Availability
In this architecture there are a large number of failure modes, but principally:
- SQL Server in RegionA is down
- SQL Server in RegionB is down
- App Service in RegionA is down
- App Service in RegionB is down
- Traffic Manager is down
- Combinations of the above
Because Traffic Manager acts as a circuit breaker, it can detect an outage in either region and route traffic to the working region. However, Traffic Manager itself is still a single point of failure, so the total availability of the "system" cannot be higher than 99.99%.
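One way to put a number on this architecture is to treat each region as a serial chain, the two regions as a parallel pair, and Traffic Manager as a serial component in front of both. The sketch below is a rough model, not a definitive answer: it assumes failures are independent, that either region can serve all traffic alone, and that the SQL Server instances carry the same 99.99% figure quoted earlier. The helper names `serial` and `parallel` are hypothetical.

```python
def serial(*availabilities):
    """All components must be up: multiply availabilities (independence assumed)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities):
    """Any one path suffices: 1 minus the probability that every path is down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1 - a)
    return 1 - all_down

TRAFFIC_MANAGER = 0.9999  # 99.99%
APP_SERVICE = 0.9995      # 99.95%
SQL = 0.9999              # assumption: same figure as SQL Azure above

region = serial(APP_SERVICE, SQL)               # one region: app and database in series
both_regions = parallel(region, region)         # RegionA and RegionB as redundant paths
system = serial(TRAFFIC_MANAGER, both_regions)  # Traffic Manager in front of both

print(f"single region: {region:.6%}")
print(f"two regions:   {both_regions:.6%}")
print(f"whole system:  {system:.6%}")
```

Under these assumptions the redundant regions push the combined figure very close to 100%, so the whole system ends up just under Traffic Manager's own 99.99%, consistent with it being the single point of failure.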
How can the compound availability of the two systems above be calculated and documented for the business? This could require rearchitecting if the business desires a higher service level than the architecture is capable of providing.
If you want to annotate the diagrams, I have built them in Lucidchart and created a multi-use link. Bear in mind that anyone can edit this, so you might want to create a copy of the pages to annotate.