Micro-Service Request Depth Availability
In systems built on micro-services, the services and the interactions between them often grow organically over time. That lets teams move quickly and integrate whatever they need, but it also leads to some known bad patterns in micro-service interactions that have serious impacts on availability. This post explains two micro-service integration anti-patterns, “The Mesh” (I prefer Distributed Monolith) and “Services in Depth”. In both, the issue is that a single request into your system can fan out to many individual services, in breadth and in depth. Most micro-service systems I have seen have a mix of both: fan-out in breadth and, in some cases, deep chains of service calls.
Understand the Implications of Deep Service Call Depth
Let’s consider service depth first, as it is the simpler version of the problem to reason about. When a team invests in micro-services, some breadth and depth in service calls is to be expected, but it is important to understand what that means and to consider the impact when designing the system. Below we will assume that each application has an aggregate request availability of 99.9%.
What does the request success rate look like for a request with 6 micro-service call depth?
> (0.999) * 100 => 99.9
> (0.999 * 0.999) * 100 => 99.8001
> (0.999 * 0.999 * 0.999) * 100 => 99.7002999
> (0.999 * 0.999 * 0.999 * 0.999) * 100 => 99.6005996001
> (0.999 * 0.999 * 0.999 * 0.999 * 0.999) * 100 => 99.5009990004999
> (0.999 * 0.999 * 0.999 * 0.999 * 0.999 * 0.999) * 100 => 99.4014980014994
Assuming each service has an aggregate availability of 99.9%, a service call depth of 6 has a request availability of about 99.4%.
This is likely a lot lower than teams expect. Depending on your infrastructure, it is also a lot easier to stack up six network calls than you may think. While I am not covering latency in this post, a deep request call stack will also have a large, negative impact on latency.
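The arithmetic above can be reproduced with a few lines of Ruby. This is only a minimal sketch: `chained_availability` is a hypothetical helper name, and it assumes each service fails independently of the others.

```ruby
# Hypothetical helper (not from the original post): multiplies the
# availability of every hop a request must pass through.
def chained_availability(availabilities)
  availabilities.reduce(1.0) { |product, a| product * a }
end

# Expected request success (as a percentage) for call depths 1 through 6,
# assuming every service is independently available 99.9% of the time.
(1..6).each do |depth|
  percent = chained_availability([0.999] * depth) * 100
  puts format("depth %d => %.4f%%", depth, percent)
end
```

Passing a list rather than a single number makes it easy to mix services with different availability targets in the same chain.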
Visualizing The Request Failure Rate
A nice way to think about the combined success rate is by thinking of each network hop as having a small opportunity for failure. These failure threads peel off as requests navigate the micro-service call stack. As the complexity of the network communications increases and the call depth deepens, the likelihood of failure increases as well. This would include things like your load balancer, DBs, App servers, and application caches.
Each network hop is an opportunity for failure; the diagram above shows 7 failure opportunities.
Request Depth Failure Trends
Another way to visualize this is just a simple bar chart showing a decline of expected availability as service depth grows.
Micro-Service Request Availability Calculator
The calculator below will let you quickly estimate your theoretical availability based on the estimated SLA across multiple service calls. Consider each part of your infrastructure (CDNs, load balancers, DBs, caches) as well as the total number of services involved in a successful response to a request.
Micro-Service Call Depth Availability Calculator
So far we have been talking about micro-services and their availability, but we should also consider the services we use to host our application code in the cloud. AWS is a very popular cloud hosting provider, and it publishes SLAs for all of its services. Let’s look at this from the perspective of a typical AWS application: assuming a single app stack (no micro-services), a pretty standard setup, and app code with a runtime SLA of 99.9%, the combined math leaves a theoretical maximum request success expectation of 99.6%. If you are building something that needs extremely high reliability, are you able to keep your availability promises?
You can see as you stack AWS services, regardless of your application stability, request success percentage decreases…
| Service | SLA (%) | % Success Math | Request % Success |
| --- | --- | --- | --- |
| ALB | 99.99 | 0.999 * 0.9999 | 99.89% |
| ECS | 99.99 | 0.999 * 0.9999 * 0.9999 | 99.88% |
| Custom App | 99.9 | 0.999 * 0.9999 * 0.9999 * 0.999 | 99.78% |
| Elasticache | 99.9 | 0.999 * 0.9999 * 0.9999 * 0.999 * 0.999 | 99.68% |
| RDS (Postgres) | 99.95 | 0.999 * 0.9999 * 0.9999 * 0.999 * 0.999 * 0.9995 | 99.63% |
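The running product in the table can be sketched in Ruby as follows. The SLA values are taken from the table; the `STACK` constant and the `running_success` helper are hypothetical names. Note that every row of the table's math starts with a 0.999 factor that the table does not name, so the sketch labels it as an unnamed first hop rather than guessing what it is.

```ruby
# SLAs (as fractions) in the order the table stacks them.
# The first 0.999 factor appears in every row of the table's math but is not
# named there, so it is labeled "unnamed first hop" here (an assumption).
STACK = [
  ["unnamed first hop", 0.999],
  ["ALB",               0.9999],
  ["ECS",               0.9999],
  ["Custom App",        0.999],
  ["Elasticache",       0.999],
  ["RDS (Postgres)",    0.9995]
].freeze

# Returns [name, cumulative success %] pairs, rounded to two decimals.
def running_success(stack)
  product = 1.0
  stack.map do |name, sla|
    product *= sla
    [name, (product * 100).round(2)]
  end
end

running_success(STACK).each { |name, percent| puts "#{name}: #{percent}%" }
```

Adding or removing a row in `STACK` is a quick way to see how much each extra dependency costs in theoretical request success.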
Mitigations / Considerations
Once you realize how many total microservice dependencies a typical request into your system may have as the system scales and grows, it is worth considering some mitigation strategies. As opposed to pushing towards five nines (see Five Nines: Chasing The Dream?), embrace failure and resilience, and find an acceptable and achievable level of availability for your service. Then invest in mitigation techniques and strategies to deliver a reliable client experience on an unreliable internet. A few examples of mitigations are listed below.
- Client Side Retries: Prefer retrying on the client over implementing retries at every level of the infrastructure (some special cases, such as avoiding a full round trip, may still justify a retry elsewhere). See Google’s SRE book, sections Client-Side Throttling and Deciding to Retry from the Handling Overload chapter.
- Be Wary of Circular Graphs: Detect circular call graphs. Even if they can technically be supported in your infrastructure, it may be best to avoid them as a way to force folks to think through more robust and scalable solutions.
- Avoid Duplicate Service Calls: This happens when a service provides a very common piece of data; before you know it, all your micro-services call this high-demand service. You might have an initial request fan out to 3 micro-services that all call this common data service under the hood. This often happens for something like user data.
- Consider common data and look at data forwarding: early in the request processing, add the data as metadata that is sent with all upstream requests, so that the upstream requests do not each make an individual network request for the data.
- Agree on Constraints: Consider alerting on requests that exceed an agreed maximum service call depth or that form circular call graphs.
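As a rough illustration of the client-side retry mitigation, here is a minimal Ruby sketch. Both helpers, their names, and their defaults are assumptions made for illustration, not code from the original post. The probability helper also assumes that attempts fail independently, which is optimistic during a real outage.

```ruby
# Hypothetical client-side retry helper (names and defaults are assumptions).
# Retries the block on StandardError with exponential backoff, up to
# max_attempts total tries, then re-raises the last error.
def with_retries(max_attempts: 3, base_delay: 0.1)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    raise if attempt >= max_attempts
    sleep(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Chance that at least one of `attempts` independent tries succeeds.
def success_with_retries(single_try_success, attempts)
  1.0 - (1.0 - single_try_success)**attempts
end

# A 6-deep chain at roughly 99.4% per try, with one client-side retry.
puts format("%.4f%%", success_with_retries(0.994, 2) * 100)
```

Keeping the retry at the outermost client matters because retries at several layers multiply: three attempts at each of three layers can turn one user request into as many as 27 calls to the deepest service.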
There are a lot of benefits to microservices, but I feel like the expectations around reliability and latency are often overlooked when folks move from a larger shared codebase to microservices. Companies are looking for faster deployments and teams that move independently, and do not fully grasp that they have slowly turned every method call or DB join into a remote network request, with all the failure and performance characteristics that come with it....
I have been a ‘full-stack’ developer for a long time. These days, depending on where you work and how your org works, being full-stack isn’t really viable anymore. Given the growing complexity of both front-end and back-end systems, it is more and more necessary to specialize. That being said, I feel like there are good reasons to understand and think across the front-end boundaries. For example, if you care about user performance, how you design backend APIs and deploy front ends can have a massive impact, from fully supporting and leveraging CDNs to pre-fetching, cached API queries, and more. Anyway, as my front-end skills fell further behind and some exciting changes were occurring in the front-end world, it was a good time to spend a little time refreshing my knowledge and sharpening my tools.
Where I started
I decided to look at a couple of different projects and ways to explore the space. This hasn’t brought me fully up to speed with the amazing front-end folks I work with, but I have learned a lot and enjoyed working in a bit more visual space.
- I worked on a side project, modernizing its look and design by updating its Semantic-UI/Fomantic-UI designs.
- I picked up Modern Front-End Development for Rails and worked through some exercises.
- I dug into CDNs, caching, and compression options (Brotli)… modernizing our setup at work
- I helped automate and set up Lighthouse tracking on projects at work, and fixed some low-hanging fruit
- I played around with Hotwire
- I picked up Tailwind CSS for a few toy projects
- I converted my blog to Tailwind CSS from an old customized Twitter bootstrap theme
- I started working on some visualizations in D3.js, diagramming network traffic, failure rates, and org/system structures.
- I built a presentation as a Tailwind / D3 microsite vs a slide show
Some of the next things coming up?
- I will likely convert over the Semantic-UI/Fomantic-UI site to Tailwind
- I am helping support our frontend team on efforts to decouple our Front End deployment from Rails
- I will be digging into our custom React design system a bit more at work and porting over a few pages to it
- Adding some more visualizations directly to this blog
After trying out a few frameworks, including a bit of a deep dive on Semantic-UI/Fomantic-UI, I wasn’t satisfied, and the growing buzz around Tailwind pulled me in. I still have a lot to learn and a ways to go, but it matches my needs and desires for front-end support better than anything has in a long time. As I played with Tailwind, I needed a few projects to drive a bit more real-world usage.
Converting the Blog to Tailwind
The blog you are reading moved from Twitter Bootstrap to Tailwind. It now supports Purge CSS for the various pages and has Tailwind layouts, templates, and dev support. Nothing too complicated, but I feel it looks much better than it did: a simpler header, a more readable font and white space, and most of the sidebar dropped. I had to reformat some of my Markdown and post tags, so I wrote a conversion script to reformat my post history.
Tailwind Learning Sources
There are a lot of great resources out there and I wanted to share a few. I have also been using Tailwind on some test Rails projects, so some of the links are more Rails -> Tailwind specific.
- Learning & Exploring
- Tailwind Play: this real-time, configurable, exploratory playground for Tailwind is a great way to quickly fool around with ideas and see results. I used this a lot before working on templates and files. You can also share ideas.
- Center elements with Tailwind CSS
- Tailwind breakout grid
- Tailwind preflight, I recommend understanding what preflight does and what it ‘resets’
- Rails & Tailwind
- Tailwind Blogs / Jekyll (My templates are heavily based on some of the below, but customized a bit)
Other Learning Projects
Other than the blog, here are a few other examples from my recent front-end exploration.
Coverband Web built in Semantic-UI
D3 Network request flow chart
Presentation breaking down org charts, team / system relationships, and network request flows (D3 and Tailwind)
D3 Request Max SLA Calculator (interactive visualization)
It is good to revisit and resharpen skills in an area even if you aren’t planning to be an expert. While I don’t really do full-stack work in my daily workflow anymore, I am often heavily influencing our system design and architecture as it relates to microservices, mobile, and the future of our front-ends. I want to ensure I am still looking closely enough to know what questions to ask and understand when folks are sharing ideas and concerns… I want and need to know the landscape, so to speak, including some of the toolchains, pain points, and benefits over older styles of development. A quick bit of focused exploratory work can help one stay fresh while also not slowing down or getting in the way of the experts doing the real front-end work where I work. I am able to be a better and more capable partner in discussions and designs.
As I continue learning more about Tailwind and Visualization tools, please share any good links with me....
Principal / Staff Engineer Resources
A friend who was recently promoted to Staff Engineer and wanted to learn a bit more about the role asked me if I had any good resources. I put together a small set of resources that could be helpful when aspiring to or transitioning into Principal or Staff level engineering roles. I figured it would be good to share with others who might be interested. I broke up the resources into a few different categories, so folks can dig in where relevant. It is worth noting that higher-level technical roles vary a lot between companies. Folks will find that they can often shape the role or find a company that defines the role in a way that will be interesting to them. Don’t feel like you need to fit into a tightly defined box; as long as you are providing high-level value, there is a place for you to grow and apply your technical experience. You will hear of staff engineers who never code, while others with the same title still consider software development a core part of their job responsibilities. Whichever way you want to grow and increase the value of your work, there is always more to learn as you expand your technical software career and look to have a broader impact on software development.
Staff and Leadership Resources
Only a few things actually focus on this narrow niche of technology roles. Overall, these are all excellent resources that I highly recommend. This list includes sources that continually create and add new content.
Staff Eng - easily the best community and set of materials out there for learning more at the top of the technical track.
- a newsletter
- a book, Staff Engineer: Leadership beyond the management track
- a podcast, Staff Eng Podcast
- a list of staff engineer learning resources
- Software Lead Weekly - email newsletter with great links of posts being discussed in the community.
LeadDev StaffPlus - LeadDev runs some of the best advanced eng conferences. StaffPlus is the best conf that touches on all the cross-functional aspects that you will be involved with as you advance in a technical career.
- A lot of past material is turned into articles, videos, and other online content.
- See my Lead Dev London Summary to get a better idea for the feel.
- CTO Connection - Community, confs, and a newsletter for leaders. Frequent interviews and videos shared via email updates. This is more management focused than the others. It has also been a bit more hit and miss, and seems to have been less relevant for me lately.
Staff and Leadership Articles
There are many good articles as well, but obviously these are more one-offs.
- Thriving on the Technical Leadership Path
- Becoming a Sr Technical Leader
- An incomplete list of skills senior engineers need, beyond coding
- Defining a Distinguished Engineer
- A Thread on Sr. Software Skills
Keeping Up with Tech in General
As you are shaping technology decisions that can have impacts for years, plan to keep up with some of what is happening in the industry in general. Find sources that relate to your field, attend conferences, read newsletters, or listen to podcasts. Figure out the lowest-effort way for you to keep your ear to the ground. Here are a few resources I have enjoyed.
- The Changelog podcast, excellent discussions with engineers
- Leading From Incidents, It Will Never Work in Theory - Short summaries of recent results in empirical software engineering research
- Greater Than Code Podcast - A podcast that is more about the who, why, and community… vs the technical how.
Distributed Team Resources
My friend and I both work on remote-first teams. Here are a few resources about remote work and how to lead tech outside of centralized teams.
- First, if you are working remote long term, it is worth getting a good remote audio/video setup
- I don’t think top-down leadership scales particularly well for software, and I think it is even less effective when distributed. Scaling Architecture Conversationally sets out a number of approaches to encourage a more team-sourced architecture with guidance.
- Hashicorp: Distributed, Async, and Document Driven - learn from how Hashicorp builds distributed software
- Remote - while this company seemed to go off the Rails (pun intended) and lose a lot of the goodwill they had in the remote community, much of the advice can still be useful.
Although there are lots of great posts, talks, and threads about technology leadership, sometimes nothing can give the big picture and the depth of a book. The books in the shortlist below stand out in my mind as worth the time.
- Accelerate - If you want to shape how a software org functions, ensure you are basing it on research-backed successes. This book summarizes what is actually working best based on research.
- The First 90 days - While I highly disagree with a few things in this book, there is enough that is actionable and will help you have an impact as well as understand what is motivating other new leaders, that it is worth reading.
- The Manager’s Path - Even if you are looking to stay on the Staff / technical track many parts of this book will help you grow with parts of your job.
- Working Backwards - Another book where I don’t agree with or like all the recommendations, but it has enough practical and actionable advice on how Amazon scaled and managed to stay agile through massive growth that it is worth reading.
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems - Even if you aren’t handling big data (yet), this book will let you understand when and why to reach for distributed systems and tools.
- Domain Driven Design - One of the most important things while scaling, and one of the hardest to fix later, is bad domain modeling; learn deeply about building good data models.
I mostly operate in the Ruby world, and my friend is also at a Ruby-based company, so here are some Ruby-specific resources. As you grow as a tech leader, in some sense you will care less about a specific technology, but there are reasons to keep an ear to the ground for the technologies your business has invested so much in. As I have grown my career, I have found it valuable to stay up to date with things going on in the community. If your company is invested in Ruby, it means your hiring, training, operations, and infrastructure costs are all related to Ruby, so ensure you are using it at its best.
- Ruby Weekly - easiest way to see the most discussed articles in and around the Ruby community
- RubyFlow - a more techy stream of Ruby links and project updates
- Speedshop - a few great ways to soak in Ruby operational and performance tips: Rails performance Blog, Ruby Performance Newsletter
- Ruby Central Youtube channel - various conference talks
- Ruby Confreaks Recordings
- Rails Security Email List - ensure you see critical security issue announcements
- Sustainable Web Development with Ruby on Rails - If you are currently struggling with scaling Rails as the dev team grows, this book should help guide ways to move forward that are less likely to come back to haunt you.
Enjoy the Role
Good luck with the new role and growth. As always there is lots to learn, but there is a growing community out there to find friends and mentors and talk about what you want out of your career. The highly defined ladder is changing and roles are more malleable as we move to more hybrid and distributed ways of working. Feel free to explore and help shape the ways folks can lead in technology; it doesn’t have to be a path to being a manager, director, or VP in all cases anymore. It is an exciting time to be a technical learner and leader.
If you have any good articles, sites, or books please share them with me, as I am always looking to learn more about what others in this area are doing....
Learning with Game Days
Many different companies and posts talk about why and how to run Game Days. I won’t rehash all of that in this post; instead, I will give a basic intro, link to some sources, and then dive into some more specific Game Days I have recently been involved with.
A game day simulates a failure or event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would perform as if an exceptional event happened.
Below are a few recent Game Day examples, with some details of what we did, what we learned, how we broke something in a “safe” manner, and a runbook to run your own. In all cases, one important factor of running the Game Day is having a way to stop if the simulated incident starts to cause a real incident. When planning a simulated emergency make sure you have a planned way to escape out of the test if something unexpected is occurring.
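The idea of a planned escape hatch can be sketched in Ruby as follows. This is only an illustration under stated assumptions: `run_game_day`, its parameters, and the shape of the steps are all hypothetical, and are not taken from the runbooks this post refers to.

```ruby
# Hypothetical Game Day runner (all names here are illustrative assumptions).
# Runs each step in order, checks an abort condition before every step, and
# always runs the rollback so the injected failure is removed.
def run_game_day(steps:, abort_if:, rollback:)
  completed = []
  steps.each do |name, action|
    if abort_if.call
      puts "Aborting before '#{name}': simulated incident is becoming real"
      break
    end
    action.call
    completed << name
  end
  completed
ensure
  rollback.call
end
```

Here `steps` is a list of name and callable pairs, `abort_if` is a callable that returns true once a real incident appears to be starting (for example, an error rate above an agreed threshold), and `rollback` undoes whatever the steps broke, even if a step raises.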
Safe and Confident Deployments
For various reasons some of our systems were relatively slow to deploy. This means...
Performance of JSON Parsers at Scale
In a recent post, benchmarking JSON Parsers (OJ, SimdJson, FastJsonParser), I compared the parsers based on local microbenchmarks. In the end, I recommended going with OJ for almost all general use cases, while noting that FastJsonParser might be worth it for specific use cases. I want to do a quick follow-up sharing what happens when microbenchmarks meet real-world...
benchmarking JSON Parsers (OJ, SimdJson, FastJsonParser)
photo credit Tumisu @pixabay
UPDATE: Added FastJsonParser
After some feedback on reddit (thx @f9ae8221b) pointing out a JSON gem I wasn’t aware of, I updated the benchmarks to also support FastJSONparser and to cover symbolize_keys, which is important for my company’s use cases (as a co-worker pointed out) and can cause significant performance issues if you have to do it independently of JSON parsing.
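To make the symbolize-keys point concrete, here is a minimal sketch using only Ruby's standard library `json`, whose parser takes a `symbolize_names` option. The `deep_symbolize` helper and the sample payload are made up for illustration. It shows the difference between symbolizing during the parse and walking the parsed result a second time; it does not measure anything.

```ruby
require "json"

# Converts every hash key to a symbol after parsing, walking nested hashes
# and arrays. This is the separate pass the text warns about.
def deep_symbolize(value)
  case value
  when Hash  then value.each_with_object({}) { |(k, v), out| out[k.to_sym] = deep_symbolize(v) }
  when Array then value.map { |item| deep_symbolize(item) }
  else value
  end
end

payload = '{"user": {"id": 1, "tags": ["a", "b"]}}'

# Two passes: parse with string keys, then walk the result again.
two_pass = deep_symbolize(JSON.parse(payload))

# One pass: the stdlib parser creates symbol keys while parsing.
one_pass = JSON.parse(payload, symbolize_names: true)

puts two_pass == one_pass
```

Both approaches produce the same structure, but the two-pass version allocates a second set of hashes and arrays, which is the extra work a benchmark needs to account for when comparing parsers.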
Performance Benchmarks between OJ, SimdJSON, FastJsonparser, and StdLib
I was recently looking at the performance of some endpoints that process large amounts of JSON, and I wondered if we could do even better in terms of performance for that processing. Across our company we have recently switched most of our apps from the Ruby StdLib JSON to OJ, but I had read about SimdJSON and was curious if we should look further into it as well. In this article I will tell you a bit about each of the Ruby JSON options and why you might want to consider them.
OJ is a Ruby library for both parsing and generating JSON with a ton of options. I would basically say if you don’t want to think too much but care about JSON performance just set up the OJ gem, and it should be...