Microservice SLAs

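The issue with SLAs is that folks often want to aim for extremely high targets like "five nines" (99.999% uptime). Companies also want fully independent teams running and managing their own microservices. These two goals pull against each other: when a request has to pass through several services, the availability of the whole chain is the product of the individual availabilities, assuming the services fail independently. Let's take a look at how this plays out, starting with a single service at 99.9% and adding one more 99.9% service to the chain at each step.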

> (0.999) * 100
 => 99.9
> (0.999 * 0.999) * 100
 => 99.8001
> (0.999 * 0.999 * 0.999) * 100
 => 99.7002999
> (0.999 * 0.999 * 0.999 * 0.999) * 100
 => 99.6005996001
> (0.999 * 0.999 * 0.999 * 0.999 * 0.999) * 100
 => 99.5009990004999
> (0.999 * 0.999 * 0.999 * 0.999 * 0.999 * 0.999) * 100
 => 99.4014980014994
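
The same arithmetic can be wrapped in a few helpers. This is only a sketch in Python: the function names are made up for illustration, and it assumes every service sits on the critical path and that failures are independent, which real systems rarely satisfy exactly.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def serial_availability(per_service: float, count: int) -> float:
    """Availability of a request that needs `count` services up at the same time.

    Assumption: failures are independent and every service is required.
    """
    return per_service ** count


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of unavailability per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def required_per_service(target: float, count: int) -> float:
    """Per-service availability needed so `count` chained services reach `target`."""
    return target ** (1.0 / count)


if __name__ == "__main__":
    # Reproduce the chain above: one to six services at 99.9% each.
    for n in range(1, 7):
        combined = serial_availability(0.999, n)
        print(f"{n} service(s): {combined * 100:.4f}% "
              f"(~{downtime_minutes_per_year(combined):.0f} min/year down)")

    # What each of six services would need in order to deliver five nines overall.
    needed = required_per_service(0.99999, 6)
    print(f"per-service target for five nines across 6 services: {needed * 100:.5f}%")
```

For scale, five nines leaves a budget of roughly 5.26 minutes of downtime per year, while six chained services at 99.9% each add up to around 52 hours.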
        

Impacts of the internet

Now let's consider that even if you have achieved all of this, are your customers actually receiving "five nines" of service? Consider their ISP (assuming a home internet connection), their local wifi setup, or, even worse, a mobile connection. You can already drop expectations to 99.9% or lower. Really, you are trying to make your service appear as stable as their internet connection, failing no more often than the network failures the customer already has to endure with any other service.

"If failures are being measured from the end-user perspective and it is possible to drive the error rate for the service below the background error rate, those errors will fall within the noise for a given user’s Internet connection. While there are significant differences between ISPs and protocols (e.g., TCP versus UDP, IPv4 versus IPv6), we’ve measured the typical background error rate for ISPs as falling between 0.01% and 1%." -- Embracing Risk, Site Reliability Engineering
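
To put numbers on the quoted range, here is a rough Python sketch of what a user would see from a five-nines service over a connection with a 0.01% or 1% background error rate. The function name is hypothetical, and it assumes service failures and connection failures are independent, which is a simplification.

```python
def observed_availability(service: float, connection_error_rate: float) -> float:
    """Success rate seen by the end user.

    Assumption: the service and the user's connection fail independently,
    so the two success probabilities multiply.
    """
    return service * (1.0 - connection_error_rate)


if __name__ == "__main__":
    five_nines = 0.99999
    # 0.01% and 1%: the two ends of the ISP background error range quoted above.
    for isp_error_rate in (0.0001, 0.01):
        seen = observed_availability(five_nines, isp_error_rate)
        print(f"ISP error rate {isp_error_rate * 100:.2f}% "
              f"-> user sees {seen * 100:.4f}%")
```

Even at the best end of that range the user sees roughly 99.989%, so the connection, not the service, dominates what they experience.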