Channel: How do you calculate the compound Service Level Agreement (SLA) for cloud services? - DevOps Stack Exchange

Answer by Tensibai for How do you calculate the compound Service Level Agreement (SLA) for cloud services?


I'd take that as a math problem with the SLA being the probability of being OK.

In this case we can rely on the rules of probability to get an overall figure.

For your first case, the probability that App Service (A) and SQL Service (B) are down at the same time is the product of their individual failure probabilities (each is 1 - 0.9995 = 0.0005):

P(A)*P(B) = 0.0005 * 0.0005 = 0.00000025

As a first approximation, the probability that one of them is down is the sum of their failure probabilities:

P(A)+P(B) = 0.001

That sum counts the case where both are down twice. When the two events are independent, the formula for the probability that at least one of them is down is:

P(A or B) = P(A) + P(B) - P(A)*P(B) = 0.001 - 0.00000025 = 0.00099975

So the overall SLA would be 1 - 0.00099975 = 0.99900025, which in percent is 99.900025 %

A shortcut is to multiply the two availabilities directly: 0.9995 * 0.9995 = 0.99900025.
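As a quick numerical check that both routes agree, here is a minimal Python sketch. The function names `union_failure` and `serial_availability` are hypothetical helpers made up for this illustration, and the whole thing assumes the two services fail independently, as above.

```python
import math

def union_failure(p_a: float, p_b: float) -> float:
    """Probability that at least one of two independent components is down."""
    return p_a + p_b - p_a * p_b

def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up
    (assumes independent failures)."""
    return math.prod(availabilities)

sla_app = 0.9995  # App Service availability used in this answer
sla_sql = 0.9995  # SQL availability used in this answer

# Route 1: go through the failure probabilities.
via_failure = 1 - union_failure(1 - sla_app, 1 - sla_sql)
# Route 2: multiply the availabilities directly.
via_product = serial_availability(sla_app, sla_sql)

print(f"{via_failure:.8f}")
print(f"{via_product:.8f}")
```

Both prints should show the same value up to floating-point rounding, since 1 - (a + b - a*b) is algebraically equal to (1 - a) * (1 - b).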

Applied to your 1h/24h outage (4.1667 % of a day), this gives (decimals are abbreviated):

0.0416 + 0.0416 - (0.0416 * 0.0416) = 0.081597222

So the probability of being OK is 1 - 0.0816 = 0.9184, in percent: 91.84 %

The expected downtime per day is then 24 * 0.0816 = 1.96 h

This is less than the worst case of 2 hours because there's a chance both are down at the same time.

Keeping that in mind, you may notice that the availability of each service is 95.83 %, and 0.958333333 * 0.958333333 = 0.918402778, which is our 91.84 % from above (sorry for the full decimals here, but they are needed for the demonstration).
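The same 1h/24h example can be reproduced with a few lines of Python. This is only a sketch: `compound_availability` and `HOURS_PER_DAY` are names invented here, and it assumes both services are independent and each is down one hour per day on average.

```python
HOURS_PER_DAY = 24

def compound_availability(outage_hours_each: float, components: int = 2) -> float:
    """Availability of `components` independent services in series,
    each down `outage_hours_each` hours per day on average."""
    p_down = outage_hours_each / HOURS_PER_DAY
    return (1 - p_down) ** components

availability = compound_availability(1.0)
expected_downtime_hours = HOURS_PER_DAY * (1 - availability)

print(f"availability: {availability:.4%}")
print(f"expected downtime: {expected_downtime_hours:.2f} h/day")
```

The expected downtime stays below the 2-hour worst case because the two outages can overlap.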

Now for your second case, we'll start again from the compound probability for each region (sorry, I ignored the change for SQL to keep it reasonable), assuming that the region itself has no independent failure probability and that each region is isolated, so a DB failure takes down only its own region.

We have the Traffic Manager with an OK probability P(T) = 0.9999 and each app+DB pair with an OK probability P(G) = 0.99900025, as computed in the first case.

The number of regions plays a role, because here we multiply the failure probabilities only, to get the probability that both regions are down at the same time:
0.00099975 * 0.00099975 = 0.0000009995000625, which means that at least one region is available 99.99990005 % of the time.

Now that we have the overall availability of the regions, its product with the Traffic Manager's availability gives us the overall availability of the system:

0.9999 * 0.9999990004999375 = 0.99989900059988750625

The overall availability is 99.989900 %
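The second case combines redundant regions in parallel with the Traffic Manager in series. A minimal Python sketch, assuming independent failures throughout and using a hypothetical helper `parallel_availability`:

```python
def parallel_availability(region_availability: float, regions: int) -> float:
    """Probability that at least one of `regions` identical, independent
    regions is up: one minus the probability that all of them are down."""
    return 1 - (1 - region_availability) ** regions

sla_traffic_manager = 0.9999      # P(T) from this answer
region = 0.9995 * 0.9995          # P(G): app + DB in series, as in the first case

regions_up = parallel_availability(region, regions=2)
overall = sla_traffic_manager * regions_up  # Traffic Manager in series with the regions

print(f"at least one region up: {regions_up:.10f}")
print(f"overall availability:   {overall:.8f}")
```

Changing `regions` shows how quickly the parallel part approaches 1, which leaves the Traffic Manager as the limiting term.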

Another explanation is available in Azure's docs (link courtesy of Raj Rao).

