Experiments Visualizing Micro Services SLAs

Micro Service SLAs

The issue with SLAs is that folks often want to aim for extremely high targets, like "five nines" (99.999% uptime). Companies also want highly independent teams running and managing microservices. Let's take a look at how this plays out. Every time you add another service dependency, you reduce the theoretical maximum SLA you can provide: if a request needs every service in the chain to be up, and the services fail independently, their availabilities multiply.

> (0.999) * 100
 => 99.9
> (0.999 * 0.999) * 100
 => 99.8001
> (0.999 * 0.999 * 0.999) * 100
 => 99.7002999
> (0.999 * 0.999 * 0.999 * 0.999) * 100
 => 99.6005996001
> (0.999 * 0.999 * 0.999 * 0.999 * 0.999) * 100
 => 99.5009990004999
> (0.999 * 0.999 * 0.999 * 0.999 * 0.999 * 0.999) * 100
 => 99.4014980014994
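
The same series can be generated for any number of dependencies. This is a minimal sketch, assuming every service in the chain offers the same availability and that failures are independent; `composite_availability` is a hypothetical helper name, not part of any library.

```ruby
# Hypothetical helper: best-case availability (as a percentage) of a request
# that must pass through `count` services in series, assuming each service
# has the same availability and failures are independent.
def composite_availability(per_service, count)
  (per_service**count) * 100
end

# Print the ceiling for chains of one to six 99.9% services.
(1..6).each do |count|
  puts format("%d service(s): %.4f%%", count, composite_availability(0.999, count))
end
```

With six 99.9% services in the chain, the ceiling is already down to roughly 99.40%.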

Let's look at this from the perspective of a typical AWS application. Assuming a single app stack (no microservices) and a pretty standard setup, app code at 99.9% still leaves a maximum total SLA of about 99.6%.

AWS Service SLAs

CloudFront   ALB       ECS      YOUR APP    ElastiCache   RDS (Postgres)
(0.999   *  0.9999  * 0.9999  *  0.999   *    0.999        *  0.9995)   * 100
=> 99.63052065660449
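
To make the gap more concrete, the same product can be turned into a yearly downtime budget. This is a rough sketch using the figures from the calculation above; the constant and helper names (`STACK`, `stack_availability`, `downtime_hours_per_year`) are hypothetical, and it assumes the components are serial dependencies that fail independently.

```ruby
# Availability figures copied from the calculation above. The 99.9% for
# "YOUR APP" is the article's assumption, not a measured value.
STACK = {
  "CloudFront"     => 0.999,
  "ALB"            => 0.9999,
  "ECS"            => 0.9999,
  "YOUR APP"       => 0.999,
  "ElastiCache"    => 0.999,
  "RDS (Postgres)" => 0.9995
}.freeze

HOURS_PER_YEAR = 365 * 24

# Multiply the availabilities together (serial dependencies, independent failures).
def stack_availability(components)
  components.values.reduce(1.0) { |product, availability| product * availability }
end

# Hypothetical helper: convert an availability fraction into allowed downtime.
def downtime_hours_per_year(availability)
  (1 - availability) * HOURS_PER_YEAR
end

total = stack_availability(STACK)
puts format("max SLA: %.4f%%", total * 100)
puts format("downtime budget: %.1f hours/year", downtime_hours_per_year(total))
puts format("five nines budget: %.1f minutes/year", downtime_hours_per_year(0.99999) * 60)
```

Under these assumptions the stack allows roughly 32 hours of downtime a year, compared with about 5 minutes a year for five nines.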

Impacts of the internet

Now let's consider that even if you have achieved all of this, are your customers actually receiving "five nines" of service? Once you factor in their ISP (assuming a home internet connection), their local wifi setup, or even worse a mobile connection, you can already drop expectations to about 99.9%. Really, you are trying to make your service appear as stable as their internet connection, with no more network failures than the customer already has to endure with any other service.

If failures are being measured from the end-user perspective and it is possible to drive the error rate for the service below the background error rate, those errors will fall within the noise for a given user’s Internet connection. While there are significant differences between ISPs and protocols (e.g., TCP versus UDP, IPv4 versus IPv6), we’ve measured the typical background error rate for ISPs as falling between 0.01% and 1%. -- Embracing Risk, Site Reliability Engineering
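
To put rough numbers on that, the sketch below combines a service's own availability with the background error rates from the quoted range (0.01% to 1%). It assumes service failures and network failures are independent, which is a simplification, and `user_perceived_availability` is a hypothetical name.

```ruby
# Hypothetical helper: availability as experienced by the end user, assuming
# the service and the user's network fail independently of each other.
def user_perceived_availability(service_availability, background_error_rate)
  service_availability * (1 - background_error_rate)
end

FIVE_NINES = 0.99999

# Background error rates from the quoted range: 0.01%, 0.1%, and 1%.
[0.0001, 0.001, 0.01].each do |error_rate|
  perceived = user_perceived_availability(FIVE_NINES, error_rate)
  puts format("ISP error rate %.2f%% -> user sees %.3f%%", error_rate * 100, perceived * 100)
end
```

Under these assumptions, even a five-nines service reaches the user at somewhere between roughly 99.0% and 99.99% once the connection is factored in.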

What about cost? Google covers that as well in the same chapter.
