Learning with Game Days
08 January 2021

performance


Learning with Game Days

Many different companies and posts talk about why and how to run Game Days. I won’t rehash all of that here; instead, I will give a basic intro and link to some sources, then dive into some more specific Game Days I have recently been involved with.

A Game Day simulates a failure or exceptional event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would take if a real exceptional event happened.

Below are a few recent Game Day examples, with some details of what we did, what we learned, how we broke something in a “safe” manner, and a runbook to run your own. In all cases, one important factor of running a Game Day is having a way to stop if the simulated incident starts to cause a real one. When planning a simulated emergency, make sure you have a planned way to bail out of the test if something unexpected starts occurring.

Safe and Confident Deployments

For various reasons, some of our systems were relatively slow to deploy. This meant that if something bad shipped, it could take a while to revert, pass CI, deploy, etc. We finally got access to a rollback tool which was much faster than the normal deploy process at jumping back to a recently deployed version of the code. Although the tool had been available for a while, many folks had their first experience using it during a real incident: trying to find docs and understand the commands while stressed about some service being broken. Not the ideal way to learn… So we got a group together at a preplanned, scheduled time, broke a ‘hidden’ endpoint on production, used the rollback tool, fixed the code, and showed how to “roll forward” back into the standard development flow. Since our tool did a bunch of useful things, like freezing deploys when you roll back and alerting various folks that a rollback was in progress, we got to see and feel the full experience. We had folks running the commands who had never had to do a rollback in production before, which made the exercise really valuable.
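To make that concrete, here is a hypothetical example of the kind of harmless ‘hidden’ change you could deploy and break; ours was different, but anything end users never hit will do.

# Hypothetical throwaway endpoint used only for the Game Day; nothing in the
# product depends on it, so "breaking" it on production is safe.
class GameDayController < ApplicationController
  def status
    # Intentionally broken so the rollback exercise has something real to revert.
    raise "Game Day drill: this endpoint is intentionally broken"
  end
end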

Runbook

  • schedule with a group of interested folks
  • announce to everyone in relevant channels so they don’t believe a real incident is occurring
  • deploy an unimportant change
  • roll back and verify all the alerts, notifications, and commands
  • watch dashboards to confirm how quickly the rollback worked and to show how to monitor deploy progress
  • fix the unimportant change
  • unfreeze deploys (if your system supports something like this)
  • deploy the main branch and restore the standard flow.

Upstream Partner Timeouts

We have a feature that I will cover more in a future post, called chaos traces; similar to Chaos Monkey, it lets you inject some chaos into your system to see how it behaves. In this Game Day, after the release of a new partner integration, we used our chaos trace tool to validate how the integration handles timeouts and errors. The code author and the PR reviewer paired up following the deploy and used chaos traces to verify that when the integration hit an API error, or was running slow, the app handled it the way everyone expected. By injecting, say, 3 seconds of latency into the response time, we verified that the app would handle the timeout and give the customer a reasonable UX.
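Our chaos trace tooling is internal, but the core idea fits in a few lines. Here is a hypothetical sketch of wrapping a partner call with toggleable latency and error injection; the ENV variable names and URL are made up for illustration.

require 'net/http'
require 'uri'

# Hypothetical partner call with chaos toggles. In a real system the toggles
# would come from your feature flag or chaos tooling rather than ENV.
def fetch_partner_quote(id)
  # Inject latency, e.g. PARTNER_CHAOS_LATENCY_SECONDS=3 to force timeouts.
  sleep(ENV['PARTNER_CHAOS_LATENCY_SECONDS'].to_f) if ENV['PARTNER_CHAOS_LATENCY_SECONDS']
  # Inject a hard failure to exercise the error handling path.
  raise 'injected partner API error' if ENV['PARTNER_CHAOS_RAISE'] == 'true'

  uri = URI("https://partner.example.com/quotes/#{id}")
  Net::HTTP.start(uri.host, uri.port, use_ssl: true, read_timeout: 2) do |http|
    http.get(uri.request_uri).body
  end
end

With a toggle like that in place, the author and reviewer can flip latency or errors on, watch the dashboards and the UX, and flip them back off.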

Runbook

  • schedule with a group of interested folks
  • announce to everyone in relevant channels so they don’t believe a real incident is occurring
  • cause partner error scenarios (if you don’t have chaos functionality, a feature flipper or ENV toggle may work)
  • watch dashboards confirming expected logs / metrics
  • screenshot / share the UX of the error states with the team
  • have confidence that edge case integrations are not leaving users stuck in a confusing state

Simulating Traffic Spikes

I didn’t participate directly in this Game Day, but I helped write some of the code and reviewed PRs. A new service was being integrated into a high traffic flow, and before it went live the team wanted to load test it. In the past we have often used simulated load, but it is hard to realistically generate traffic that looks like production… so the team tried something new and built a Game Day around production load. The areas of the site that would soon use the new experience fired fire-and-forget GET requests to the new service without displaying any results. They even had a scale factor, so on a normal traffic day we could generate 3X production traffic to the new backing service. Since all the requests were fire and forget, it didn’t slow down the user experience, and it let us work on performance tuning the new service under real, heavy load.
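I can’t share the team’s actual code, but the shape of it was roughly this: mirror the real request to the new service in the background, multiplied by a configurable scale factor, and never block on or render the result. A hypothetical sketch (the URL and ENV name are made up):

require 'net/http'
require 'uri'

# Hypothetical fire-and-forget mirroring of production GET requests to the new
# backing service. 0 disables it; 3 generates 3X production traffic.
MIRROR_SCALE_FACTOR = ENV.fetch('MIRROR_SCALE_FACTOR', '0').to_i

def mirror_to_new_service(path)
  MIRROR_SCALE_FACTOR.times do
    Thread.new do
      Net::HTTP.get_response(URI("https://new-service.internal.example.com#{path}"))
    rescue StandardError
      nil # shadow traffic must never affect the real request
    end
  end
end

A real version would likely bound the thread usage or push the mirrored calls onto background jobs, but the key property is the same: the user’s response never waits on the new service.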

Runbook

  • schedule with a group of interested folks
  • announce to everyone in relevant channels so they don’t believe a real incident is occurring
  • have a feature toggle to enable or disable the fire and forget load
  • enable and start to scale up the load factor
  • watch dashboards
    • ensuring the Game Day isn’t impacting existing services or users
    • learn how the new service responds to heavy and increasing load and fix any performance issues
  • toggle back off the simulated load, and move forward with confidence the new service can handle real traffic

Cache Failures

Caching is hard; eventually you might have to break a bad cache… If that cache has data for a few million users, sometimes very bad things happen to performance while the system tries to rebuild it. This is a future Game Day that we have not yet run, but we are currently planning it after running into a cache issue. We were able to quickly bust the cache and fix a service that was in an invalid state, but it got us thinking… As the service and data set keep growing, how long can we rely on that? At what point will we break the cache and start to see a cascade of timeouts throughout all of our client services as we buckle under the load? Seems like a perfect thing to Game Day: we plan to put a feature toggle in place to let us ramp up cache skips for a percentage of requests, turn up the dial, and ensure we can hit 100% without too large of an impact on our performance and stability.
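The toggle itself can be tiny. Here is a hypothetical sketch, inside a Rails app, of a fetch wrapper that skips the cache for a configurable percentage of calls; the CACHE_SKIP_PERCENT name is made up.

# Hypothetical cache-skip dial: for CACHE_SKIP_PERCENT of calls, bypass the
# cached value and recompute it, simulating a growing slice of a cold cache.
def fetch_with_cache_skip(key, &block)
  skip_percent = ENV.fetch('CACHE_SKIP_PERCENT', '0').to_i
  if rand(100) < skip_percent
    block.call.tap { |fresh_value| Rails.cache.write(key, fresh_value) }
  else
    Rails.cache.fetch(key, &block)
  end
end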

Runbook

  • schedule with a group of interested folks
  • announce to everyone in relevant channels so they don’t believe a real incident is occurring
  • have a feature toggle to turn up cache skips
  • start to turn it up while watching dashboards, and dial back down if there are any unexpected performance issues
  • identify any scaling issues, address them, and repeat until comfortable
  • enjoy the confidence that next time someone needs to bust the entire cache it shouldn’t be a problem

Event Bus Failures

After some recent issues with our Event Bus, we put a number of protections in place that we thought would ensure our service could handle a significant unplanned outage without causing any interruption to clients. We made several changes to how we were handling events (a rough sketch of the fallback path follows the list):

  • moving from synchronous to asynchronous sending of events where possible
  • for any synchronous send that fails, queueing a retry in a background job for later
  • increasing our background job storage to be able to handle millions of stalled and failed events waiting for recovery
  • autoscaling background workers to quickly process a big backlog of events without going so fast that they overload the system
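The exact protections are specific to our stack, but here is a rough, hypothetical sketch of the fallback path for sends that stay synchronous, inside a Rails app; EventBusClient and PublishEventJob are illustrative names, not our actual code.

# Hypothetical send path: on any synchronous publish failure the event is
# buffered in background job storage and retried with backoff until the
# event bus recovers.
class PublishEventJob < ActiveJob::Base
  queue_as :events
  retry_on StandardError, wait: :exponentially_longer, attempts: 10

  def perform(payload)
    EventBusClient.publish(payload)
  end
end

def send_event(payload)
  EventBusClient.publish(payload)
rescue StandardError
  PublishEventJob.perform_later(payload)
end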

OK, great: next time there is an incident, our service will keep humming along without any issue… Well, would it really hold up? How could we be sure? Of course, by now y’all realize that instead of waiting for the next time our event queue has an incident, we can run a Game Day! That is what we did: we made a toggle to disable our access to the event queue server, then flipped it off and got to see if our protections worked, verifying each was doing what was expected. If we were wrong, we could very quickly restore the system. During the incident the week prior, our internal background job storage had overflowed in about 15 minutes as we helplessly watched in horror. As we watched for 15 minutes during this Game Day, our new monitor and its connected alert showed the estimated time remaining before we could no longer buffer events. It showed that we could now handle hours of downtime, giving the teams plenty of time to get paged and work on a resolution if needed.
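That monitor boils down to simple arithmetic. A sketch of the back-of-the-envelope version, with made-up numbers; the real inputs would come from your job backend and metrics system.

# How many minutes can we keep buffering events before job storage fills up?
def minutes_of_buffer_remaining(max_entries:, current_entries:, enqueue_rate_per_minute:)
  (max_entries - current_entries) / enqueue_rate_per_minute.to_f
end

# Example: alert well before the buffer fills, e.g. when under an hour remains.
minutes_of_buffer_remaining(
  max_entries: 5_000_000,
  current_entries: 250_000,
  enqueue_rate_per_minute: 40_000
)
# => 118.75 minutes of headroom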

Runbook

  • schedule with a group of interested folks
  • announce to everyone in relevant channels so they don’t believe a real incident is occurring
  • toggle off access to the event queue service
  • watch dashboards and close any observability gaps so you can be confident you understand your system
  • scale up any resources that are more impacted than expected
  • use the stalled background job load to calculate and set the alerts and storage needed for appropriate response times
  • enjoy the confidence that next time the event service has unexpected maintenance your service won’t drop any data and the service’s clients won’t see any errors.

Conclusion

Game Days can be used to work through what could be a difficult or exceptional situation in a controlled manner, at a planned time, and with the support of a team. When done well, they can teach folks how to be better prepared for incidents, better understand their systems, and have more empathy for how other folks process and understand systems under the stress of a real incident. They are a great way to ensure everyone on the team is ready when something unexpected next happens.


Performance of JSON Parsers at Scale
02 December 2020

performance


Performance of JSON Parsers at Scale

In a recent post, benchmarking JSON Parsers (OJ, SimdJson, FastJsonParser), I compared the parsers based on local microbenchmarks. In the end, I recommended going with OJ for almost all general use cases, noting that FastJsonParser might be worth it for specific ones. I want to do a quick follow-up sharing what happens when microbenchmarks meet real-world data, scale, and systems.

TL;DR: you probably just want to use OJ as originally recommended. Even on data where FastJsonParser wins in a microbenchmark, the real-world improvement was undetectable, while moving from StdLib to OJ was a 40% latency improvement that held up across multiple services.

Microbenchmarks

As is often the case, microbenchmarks come with a lot of issues. In this case, my microbenchmarks showed, with a single example of real-world data, that FastJsonParser was the fastest and had the lowest memory usage… OJ was about 1.55x slower in both the symbolize_keys and normal string key benchmarks. I benchmarked against two JSON pieces: a very small fake JSON payload, and a large real-world payload pulled from one of our production systems. For the specific examples I used, and with no other concerns, yes, FastJsonParser is faster, but that doesn’t mean it will translate into a real-world performance win.

Given that we had previously seen 40% latency improvements when moving to OJ, it seemed like another 50% speed lift would be worth it, so I set out to test FastJsonParser on some of our production systems.

40 percent improvement graph

Last week / Yesterday -> OJ released with 40% latency improvement

What Does Real World & At Scale Mean?

In my case, I started with a single app, having all API calls use FastJsonParser to parse responses, as well as when pulling JSON out of caches. The app had smaller JSON payloads than I benchmarked with, but very high throughput. After deployment, there was no detectable change in latency… Why not?

  • At that point, the app was already fairly well optimized
  • According to Datadog trace spans, JSON parsing was taking up less than 1ms of response time
  • Um… what is 50% faster on 1ms of a response, where JSON parsing wasn’t even in the top 10 time-consuming spans of building the response? Nothing, really.

OK, I figured I had picked a bad test case… I had originally benchmarked with a large JSON collection blob that passed through multiple systems, so I decided to target 5 applications that work together to use and serve the original data I benchmarked with. This broke down like so:

  • 1 front end app
  • 4 microservices sending different JSON payloads

In total, that large JSON collection data passed through 3 of the 5 apps, with other JSON data coming from the other services. I figured this would produce a bunch of small wins that would add up to reduced latency for the front end application, since all the small gains would eventually roll up into its final response time.

After sending out 5 PRs, getting approvals, and deploying, I played a waiting game, watching graphs and collecting data… NOTHING. I could see nothing: no errors, no problems, no performance impact.

Why Wasn’t It Faster?

I think, similar to the single app example, OJ had already captured the majority of the wins. JSON.parse was no longer in the top 10 spans of any of the 5 apps I updated. It previously was part of the critical performance path… it no longer was. I am guessing there might have been tiny improvements, but nothing I could see with the naked eye. For most of these services, a 1ms improvement per service wouldn’t have been visible with all the random network noise.

I think network latency simply outweighed any further improvement in JSON parsing… none of the payloads were large or complex enough to drive a significant cost. This goes back to the original point: you really need a good reason to spend the extra time with FastJsonparser to drive further improvements over OJ’s JSON.parse drop-in replacement, which also ensures all of the Rails toolchain and middleware picks up the improvements. Since FastJsonparser requires the developer to explicitly call FastJsonparser.parse, I only did that where we handled API calls; it took more work and it wasn’t an improvement. If you have spans where JSON.parse is showing significant time in your application traces, it could be different for you.
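For context, the change itself was small. A hypothetical sketch of the kind of localized swap this rollout involved; unlike OJ, which can hook JSON.parse globally, FastJsonparser has to be called explicitly at each site you want to speed up.

require 'fast_jsonparser'

# Only explicit call sites like this get the faster parser; everything else in
# the app and framework keeps going through JSON.parse (backed by OJ).
def parse_api_response(raw_body)
  FastJsonparser.parse(raw_body, symbolize_keys: true)
end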


Microbenchmark -> Welcome to the real world

Conclusion

Unless you are maintaining a gem and are avoiding dependencies, I highly recommend using OJ for your applications. It requires very low effort and holds up in microbenchmarks and across many different services and real-world data.


benchmarking JSON Parsers (OJ, SimdJson, FastJsonParser)
15 November 2020

Compare

photo credit Tumisu: @pixabay

UPDATE: Added FastJsonParser

After some feedback on reddit (thx @f9ae8221b) pointing out a JSON gem I wasn’t aware of, I updated the benchmarks to also cover FastJsonparser and symbolize_keys, which is important for my company’s use cases (as a co-worker pointed out) and can cause significant performance issues if you have to do it independently of JSON parsing.

Performance Benchmarks between OJ, SimdJSON, FastJsonparser, and StdLib

I was recently looking at the performance of some endpoints that process large amounts of JSON, and I wondered if we could do even better on that processing. Across our company we have recently switched most of our apps from the Ruby StdLib JSON to OJ, but I had read about SimdJSON and was curious if we should look further into it as well. In this article I will tell you a bit about each of the Ruby JSON options and why you might want to consider them.

OJ

OJ is a Ruby library for both parsing and generating JSON, with a ton of options. If you care about JSON performance but don’t want to think too much about it, just set up the OJ gem; it should be the best option for most folks (a minimal initializer sketch follows the list below). It is well known, tested, and trusted in the community with a ton of support.

  • A drop-in replacement: set it and forget it, it is faster and better
  • Has built-in Rails support for some Rails JSON quirks
  • Supports both generation and parsing
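For reference, the setup is tiny; here is a minimal initializer sketch (check the Oj docs for the options that match your Rails and Oj versions).

# config/initializers/oj.rb
require 'oj'

Oj.optimize_rails # have Rails encode/decode JSON through Oj
Oj.mimic_JSON     # route JSON.parse / JSON.generate through Oj as well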

SimdJSON

The SimdJSON Ruby library doesn’t have a lot of talks, documentation, or attention… but it is a binding to the fastest JSON parser out there. It offers parsing speeds that nothing else can touch, and if you are trying to parse extremely large and dynamic JSON, it might just be the best option for you.

  • Only supports parsing
  • Doesn’t support symbolize_keys
  • The fastest option out there (uses simdjson C libs)
  • Not a lot of community activity

FastJsonparser

While simdJSON is a fast gem, it doesn’t have much support, and the way it handles rescuing errors could leak memory. While I didn’t see such issues in my limited production rollout, it is worth noting. The user @f9ae8221b pointed out the memory issue and that the gem FastJsonparser also wraps simdjson and has wider community support. I had never heard of the gem, and was already trying to patch SimdJSON to support symbolize_keys; luckily FastJsonparser already supports that option. It is still faster than OJ and requires a bit more work to integrate, but it looks like a better option than SimdJSON when you are looking for improved parsing speed. The user did mention it could have some production issues, so I will have to report back as I roll it out to various systems.

  • Only supports parsing
  • Does support symbolize_keys
  • The fastest option out there (uses simdjson C libs)
  • Larger community support

StdLib Ruby JSON

It is built in. Seriously, if you do much with JSON in a production system, just use OJ, unless you want to dig in deeper or have some specific reason it won’t work for you. The Ruby StdLib library is fine and will work for any quick check, but if you have any reason to care about performance, OJ is an easy-to-use drop-in replacement… A note on when you shouldn’t force OJ: if you are authoring a gem, reduce your hard dependencies as much as possible. If you call JSON.parse and a hosting app is using OJ, your gem will use OJ and be faster… You shouldn’t force users of your gem to require OJ.

Benchmarking the methods

Let’s see the difference with my favorite Ruby benchmarking gem, benchmark-ips, which gives a bit more readable reports than the standard benchmark lib. These are just quick micro-benchmarks, with all the issues that come with them, but the performance impact has been further validated by deploying to production systems with measurable impacts on response time. The production use case included far larger JSON payloads with much higher variability in the data, making me think the results would apply to most web-service-like systems.

Benchmarking JSON Parsing (without symbolize_keys, added FastJsonparser)

We will load up the various libraries, and some weird fake HASH/JSON data. Then benchmark parsing it for a number of seconds…

require 'benchmark/ips'
require 'json'
require 'oj'
require 'simdjson'
require 'fast_jsonparser'
require 'memory_profiler'
require 'rails'

json = {
  "one":1,
  "two":2,
  "three": "3",
  "nested": {
    "I": "go",
    "deep": "when",
    "i": "need",
    a: 2
  },
  "array":[
    true,
    false,
    "mixed",
    "types",
    2,
    4,
    6
  ]
}.as_json.to_json.freeze

puts "ensure these match"
puts  Oj.load(json, symbol_keys: false) == Simdjson.parse(json) &&
        Simdjson.parse(json) == JSON.parse(json, symbolize_names: false) &&
        FastJsonparser.parse(json, symbolize_keys: false) == Simdjson.parse(json)

Benchmark.ips do |x|
  x.config(:time => 15, :warmup => 3)

  x.report("oj parse") { Oj.load(json, symbol_keys: false) }
  x.report("simdjson parse") { Simdjson.parse(json) }
  x.report("FastJsonparser parse") { FastJsonparser.parse(json, symbolize_keys: false) }
  x.report("stdlib JSON parse") { JSON.parse(json, symbolize_names: false) }

  x.compare!
end

# Let's check memory as well...
report = MemoryProfiler.report do
  100.times { Simdjson.parse(json.dup) }
end
puts "simpdjson memory"
report.pretty_print

report = MemoryProfiler.report do
  100.times { Oj.load(json.dup) }
end

puts "OJ memory"
report.pretty_print

Benchmark Results (without symbolize_keys)

This shows, as claimed, that SimdJSON and FastJsonparser outperform OJ even on pretty small and contrived JSON examples. The performance gap holds up, or sometimes looks more significant, when looking at more realistic production payloads seen in some of the production systems I work with. Note: if you need symbolize_keys or want a bit more community support, I would go with FastJsonparser.

ensure these match
true
Warming up --------------------------------------
            oj parse    12.697k i/100ms
      simdjson parse    17.276k i/100ms
FastJsonparser parse    17.834k i/100ms
   stdlib JSON parse     8.662k i/100ms
Calculating -------------------------------------
            oj parse    121.709k (± 3.5%) i/s -      1.828M in  15.040973s
      simdjson parse    171.253k (± 4.3%) i/s -      2.574M in  15.060276s
FastJsonparser parse    190.436k (± 3.2%) i/s -      2.853M in  15.000218s
   stdlib JSON parse     93.032k (± 3.4%) i/s -      1.403M in  15.102830s

Comparison:
FastJsonparser parse:   190436.3 i/s
      simdjson parse:   171252.9 i/s - 1.11x  (± 0.00) slower
            oj parse:   121709.5 i/s - 1.56x  (± 0.00) slower
   stdlib JSON parse:    93032.1 i/s - 2.05x  (± 0.00) slower

Benchmarking JSON Parsing (with symbolize_keys)

require 'benchmark/ips'
require 'json'
require 'oj'
require 'simdjson'
require 'fast_jsonparser'
require 'memory_profiler'
require 'rails'

json = {
  "one":1,
  "two":2,
  "three": "3",
  "nested": {
    "I": "go",
    "deep": "when",
    "i": "need",
    a: 2
  },
  "array":[
    true,
    false,
    "mixed",
    "types",
    2,
    4,
    6
  ]
}.as_json.to_json.freeze

puts "ensure these match"
puts  Oj.load(json, symbol_keys: true) == Simdjson.parse(json).deep_symbolize_keys! &&
        Simdjson.parse(json).deep_symbolize_keys! == JSON.parse(json, symbolize_names: true) &&
        FastJsonparser.parse(json) == Simdjson.parse(json).deep_symbolize_keys!


Benchmark.ips do |x|
  x.config(:time => 15, :warmup => 3)

  x.report("oj parse") { Oj.load(json, symbol_keys: true) }
  x.report("simdjson parse") { Simdjson.parse(json).deep_symbolize_keys! }
  x.report("FastJsonparser parse") { FastJsonparser.parse(json) }
  x.report("stdlib JSON parse") { JSON.parse(json, symbolize_names: true) }

  x.compare!
end

Benchmark Results (with symbolize_keys)

This is the other main reason to use FastJsonparser: depending on the integrations in your apps, you might rely on symbolize_keys… We had added that at a very low level in our shared ApiClient, and the performance implications of having to symbolize keys as a second pass make a big difference. This shows how the simdjson performance win doesn’t hold up when you need symbolized keys.

ensure these match
true
Warming up --------------------------------------
            oj parse    13.455k i/100ms
      simdjson parse     7.752k i/100ms
FastJsonparser parse    19.458k i/100ms
   stdlib JSON parse     8.546k i/100ms
Calculating -------------------------------------
            oj parse    134.285k (± 4.5%) i/s -      2.018M in  15.060313s
      simdjson parse     75.825k (± 7.2%) i/s -      1.132M in  15.022033s
FastJsonparser parse    208.199k (± 3.1%) i/s -      3.133M in  15.061737s
   stdlib JSON parse     86.504k (± 3.5%) i/s -      1.299M in  15.035736s

Comparison:
FastJsonparser parse:   208199.1 i/s
            oj parse:   134285.4 i/s - 1.55x  (± 0.00) slower
   stdlib JSON parse:    86503.7 i/s - 2.41x  (± 0.00) slower
      simdjson parse:    75825.4 i/s - 2.75x  (± 0.00) slower

Using Large JSON Data

The results are very similar for a much larger 120K production JSON payload, pulled from a live system (NOTE: these benchmarks were run on a different machine)… In this case we are showing nearly a 2X performance boost.

without symbolize_keys:

Warming up --------------------------------------
            oj parse    62.000  i/100ms
      simdjson parse    79.000  i/100ms
   stdlib JSON parse    42.000  i/100ms
Calculating -------------------------------------
            oj parse    622.377  (± 3.9%) i/s -      9.362k in  15.066907s
      simdjson parse    815.699  (± 4.5%) i/s -     12.245k in  15.045902s
   stdlib JSON parse    426.656  (± 3.5%) i/s -      6.426k in  15.083428s

Comparison:
      simdjson parse:      815.7 i/s
            oj parse:      622.4 i/s - 1.31x  (± 0.00) slower
   stdlib JSON parse:      426.7 i/s - 1.91x  (± 0.00) slower

with symbolize_keys:

ensure these match
true
Warming up --------------------------------------
            oj parse    71.000  i/100ms
      simdjson parse    29.000  i/100ms
FastJsonparser parse    82.000  i/100ms
   stdlib JSON parse    41.000  i/100ms
Calculating -------------------------------------
            oj parse    726.191  (± 1.5%) i/s -     10.934k in  15.059977s
      simdjson parse    294.947  (± 2.4%) i/s -      4.437k in  15.052250s
FastJsonparser parse    909.828  (±10.2%) i/s -     13.530k in  15.026051s
   stdlib JSON parse    497.749  (± 3.6%) i/s -      7.462k in  15.011659s

Comparison:
FastJsonparser parse:      909.8 i/s
            oj parse:      726.2 i/s - 1.25x  (± 0.00) slower
   stdlib JSON parse:      497.7 i/s - 1.83x  (± 0.00) slower
      simdjson parse:      294.9 i/s - 3.08x  (± 0.00) slower

Neither MemoryProfiler nor production deployments and server metrics showed any substantial difference on either small or large JSON objects, so I wouldn’t be too concerned with memory when picking between these libraries.

Conclusion

If you have a Ruby service that is parsing large quantities of JSON, it might be worth taking a look at the newer and less known FastJsonparser, although the gem is less documented and takes a bit more work to integrate into your app than OJ. If you are looking for a drop-in replacement, OJ is still the way to go, but for some use cases SimdJSON or FastJsonparser will be worth the extra effort. If you are running Rails in production, I can’t really see any reason not to use OJ for the significant performance benefits that come with it. The OJ library made it as easy as possible to use as a drop-in replacement, and if you rely on nearly any particular JSON quirk of the past, it has options to help you stay fully compatible. I know as we look towards Ruby 3 we are also hoping to move away from some of the native extension C libraries, but when it comes to very low level, repetitive application tasks (vs application logic), sometimes they are hard to beat and worth the integration and dependency cost.


Ruby: Understanding create_or_find_by vs find_or_create_by
22 October 2020

Bugs

photo credit geralt: @pixabay

Performance Benchmarks & Considerations between create_or_find_by & find_or_create_by

I was recently optimizing an endpoint and got to think through some interesting differences between two Active Record methods that help you either find an existing record or create a new one. At first glance, it seems either is fine with some notable differences around their race conditions.

find_or_create_by

The find_or_create_by method has been around longer and is more familiar to many Rubyists. The race condition is called out in the linked docs, excerpt below.

Please note this method is not atomic, it runs first a SELECT, and if there are no results an INSERT is attempted. If there are other threads or processes there is a race condition between both calls and it could be the case that you end up with two similar records.

This led to Rails 6 adding the newer methods…

create_or_find_by

The new create_or_find_by methods have a rarer race condition (on deleted ids), but can prevent a more common insert race condition on duplicates… It is well described in this post, Rails 6 adds create_or_find_by, along with some downsides. For example, without a unique DB constraint (ex: add_index :posts, :title, unique: true) it will create duplicates. These issues are also called out in the docs linked above, excerpt below.

  • The underlying table must have the relevant columns defined with unique constraints.

  • While we avoid the race condition between SELECT -> INSERT from #find_or_create_by, we actually have another race condition between INSERT -> SELECT, which can be triggered if a DELETE between those two statements is run by another client. But for most applications, that’s a significantly less likely condition to hit.

  • It relies on exception handling to handle control flow, which may be marginally slower.

  • The primary key may auto-increment on each create, even if it fails. This can accelerate the problem of running out of integers, if the underlying table is still stuck on a primary key of type int (note: All Rails apps since 5.1+ have defaulted to bigint, which is not liable to this problem).

Benchmarking the methods

While the docs are good at calling out the race conditions, they are not as clear about the performance implications… In fact, they could lead one to believe that create_or_find_by is always slower from this line, “may be marginally slower”… The reality is you need to know the usage characteristics of where you will be calling these methods to pick the one with the best performance characteristics.

Both of the methods will either find or create a record, but they attempt those operations in different orders… Whether you expect to most often find the record or most often create a new one has a big impact on performance. Let’s see how much by breaking out my favorite Ruby benchmarking gem, benchmark-ips, which gives a bit more readable reports than the standard benchmark lib. These are just quick micro-benchmarks, with all the issues that come with them, but the performance has also been validated by deploying to production systems at scale.

Benchmarking All Finds

In this case, we are going to benchmark a case that simulates a code path that is all finds, and no creates… If you have an endpoint that creates once in a user’s lifecycle and then forever is hitting the find, you likely will have a much higher find vs create ratio close to this benchmark.

require 'benchmark/ips'
ActiveRecord::Base.logger = nil
Post.destroy_all

Benchmark.ips do |x|
  x.config(:time => 10, :warmup => 3)
  x.report 'create_or_find_by' do
    Post.create_or_find_by!(title: "create_or_find_by")
  end
  x.report 'find_or_create_by' do
    Post.find_or_create_by!(title: "find_or_create_by")
  end
  x.compare!
end

results:

As expected, when you find an existing record all the time, find_or_create_by is much faster, over 4X faster!

Warming up --------------------------------------
  create_or_find_by     49.000  i/100ms
  find_or_create_by    204.000  i/100ms
Calculating -------------------------------------
   create_or_find_by     450.791  (± 7.8%) i/s -      4.508k in  10.063664s
   find_or_create_by     2.078k (± 6.9%) i/s -     20.808k in  10.061016s

Comparison:
  find_or_create_by:     2078.1 i/s
  create_or_find_by:     450.8 i/s - 4.61x  (± 0.00) slower

Benchmarking All Creates

In this case, we will benchmark where nearly all the calls are creating new records… This would simulate an endpoint that is generally creating brand new records and very rarely should find an existing record.

require 'benchmark/ips'
ActiveRecord::Base.logger = nil
Post.destroy_all

Benchmark.ips do |x|
  x.config(:time => 10, :warmup => 3)
  x.report 'create_or_find_by' do
    Post.create_or_find_by!(title: "create_or_find_by #{rand}")
  end
  x.report 'find_or_create_by' do
    Post.find_or_create_by!(title: "find_or_create_by #{rand}")
  end
  x.compare!
end

results:

In a case where you are always creating, create_or_find_by is faster, but the overall difference is less dramatic.

Warming up --------------------------------------
  create_or_find_by     73.000  i/100ms
  find_or_create_by     44.000  i/100ms
Calculating -------------------------------------
  create_or_find_by     722.939  (± 8.3%) i/s -      7.227k in  10.069582s
  find_or_create_by     522.615  (± 9.6%) i/s -      5.192k in  10.028946s

Comparison:
  create_or_find_by:      722.9 i/s
  find_or_create_by:      522.6 i/s - 1.38x  (± 0.00) slower

Conclusion

When you are working with create_or_find_by or find_or_create_by ensure you are considering how and which race conditions might affect your code. If it is easier to just have your app handle DB constraint errors and retry directly, most of the time using find_or_create_by is going to be simpler and more performant… If you reach for create_or_find_by ensure you understand the additional complexity and performance impacts depending on the expected hit and miss ratio for your use case.
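For completeness, here is a minimal sketch of the “handle the constraint error and retry” approach mentioned above, assuming the unique index on the column is in place; it is an illustration, not a drop-in for every use case.

# Mostly-finds path: lean on find_or_create_by! and let the unique index catch
# the rare duplicate-insert race, then retry the lookup once.
def find_or_create_post!(title)
  attempts ||= 0
  Post.find_or_create_by!(title: title)
rescue ActiveRecord::RecordNotUnique
  retry if (attempts += 1) <= 1
  raise
end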


Ruby: Patching StdLib in Gems
15 July 2020

Bugs

photo credit patches: AnnaER @pixabay

Why Patch Ruby StdLib Code in Gems

Well, the Ruby community does this a lot; it can unlock powerful enhancements, features, observability, and more…

Here are some examples of patching Ruby’s StdLib (standard library). Let’s just look at a few that patch a single piece of Ruby, Net::HTTP. Many libraries want to tap into what is happening around the network.

As opposed to patching upstream Ruby code, one can instead build adapters/wrappers around it; while related, it is a much different approach, and you can see how Faraday handles adapting Net::HTTP as an example. That approach is safer, but it requires apps to change their code to use the library’s APIs as opposed to silently modifying existing behavior.

Gems Patch Ruby StdLib, So What?

The problem comes up when multiple gems try to patch the same method. From the examples above, there are multiple ways to attempt to modify the original code, and they don’t always play nicely together.

  • alias, alias_method, and the like
  • prepend, class/module extension ways of extending a method and using super
  • replacing constants (I don’t know the common term for what WebMock does to patch Net::HTTP)

If you have multiple gems patching the same upstream Ruby StdLib (or Rails) class or function, you can run into issues. This is a known Ruby ‘Bug’ along with a known solution to detect and patch in the same way.
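To make that concrete, here is a minimal sketch (not Coverband’s or MiniProfiler’s actual code) of the prepend style of patching Net::HTTP#request; the trouble starts when another gem patches the same method with alias_method instead.

require 'net/http'

# Prepend-style patch: the module sits ahead of Net::HTTP in the ancestor
# chain, wraps #request, and hands control back with `super`.
module HttpInstrumentationPatch
  def request(req, body = nil, &block)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    super
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    warn "Net::HTTP#request took #{(elapsed * 1000).round(1)}ms"
  end
end

Net::HTTP.prepend(HttpInstrumentationPatch)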

Example Error: stack level too deep

The reason I am writing this up is that I had a bug in Coverband for months (thx bug reporters (@), I appreciate it) that made no sense to me… I couldn’t reproduce it, I didn’t have any great stack traces, and I had no idea what area of the code the issue was even in… I couldn’t even investigate the issue. At the time, all I really knew about the bug? Exception: stack level too deep.

After months of once in a while taking a look but not understanding the problem… I got a new bug report from @hanslauwers, which added some details, specifically that the gems Airbrake and Coverband were both patching Resque… but in different ways…

A few days prior to the above report, while working on another project, I had seen this excellent description of a problem that had been solved in the MiniProfiler project; the readme documents how to resolve Net::HTTP stack level too deep errors… So the new bug report made my spidey sense tingle, and I was finally able to fix it.

How to handle application differences

I ended up following the same pattern as MiniProfiler, which described the problem and the fix excellently in its readme.

If you start seeing SystemStackError: stack level too deep errors from Net::HTTP after installing Mini Profiler, this means there is another patch for Net::HTTP#request that conflicts with Mini Profiler’s patch in your application. To fix this, change rack-mini-profiler gem line in your Gemfile to the following:

… examples …

This conflict happens when a ruby method is patched twice, once using module prepend, and once using method aliasing. See this ruby issue for details. The fix is to apply all patches the same way. Mini Profiler by default will apply its patch using method aliasing, but you can change that to module prepend by adding require: [‘prepend_net_http_patch’] to the gem line as shown above.

The readme explains the issue, has code examples for how apps integrating the gem can resolve it, and links to the original Ruby “Bug”, which explains the issue in detail and discusses approaches to solve the problem.

Coverband’s Patching Solution

This is the PR that was merged after understanding the problem, and the approach I took to resolve it. Again, it is heavily patterned off the MiniProfiler solution.

In the end, it is a pretty simple fix, but it took time and various folks participating in the bug report to understand. If you see an open github issue that still seems relevant, add some comments and details. You never know if you will be the trigger that helps folks understand and resolve the issue.

I know patching always gets a bad rap in Ruby, and it can be hard to fully understand and debug, but it is also extremely powerful. It is good to understand the gotchas that can occur, and how to work around those issues, especially if you are shipping shared code that can patch other shared code like Ruby’s StdLib.


Rails Flaky Spec Solutions
14 January 2020

Bugs

photo credit flaky wall: pixabay

Introducing Rails Flaky Spec Examples

I have written about how our teams are dealing with flaky Ruby tests in legacy projects. In this post, I want to show how we can teach about common testing pitfalls and how to avoid them.

In this post, I want to introduce a new project, Rails Flaky Spec Examples. I created this example flaky Rails spec suite for an internal code clinic at Stitch Fix. We ran it as a 1-hour workshop to teach about common flaky test issues and how to resolve them. I am hoping that over time I can continue to grow the example base and cover some of the more complex flaky tests and how we could perhaps more systematically avoid them. As I work with this project over time, I hope to make it into a good learning tool for the community.

Flaky vs Stable Suite

Running the flaky and then the stable suite

Why A Rails Suite? One Problem with Flaky Test Posts

While there are a number of great blog posts on flaky specs, the majority of them don’t have runnable examples. While they might have some code snippets showing examples in the post, they don’t have a runnable project. You can try to paste some of the examples into a file, but they reference dependencies, like an Active Record model without the related migration or dependencies. Often the snippets get too simplified to show how the errors look in a real, large legacy project.

Since the examples aren’t runnable, it is a bit harder to use them as a teaching tool or to show more complex tests or CI interactions. This project aims to be in the sweet spot: still small enough to easily understand the issues, but part of a real, runnable app that can be extended to highlight more complex Rails and JS testing issues, adding things like common libraries (Faker, Capybara, etc.) and different levels of tests, including browser-based JavaScript tests and the related race conditions.

While this project isn’t a real-world example, which are the best sources of flaky specs, real-world examples are sometimes hard to easily understand. Many of the examples in this project were extracted from real-world code. If you really want to dive into a fully developed, complex code base that has flaky specs, the best source with tagged flaky specs comes from @samsaffron / discourse.org in their tagged collection of flaky heisentests, which is described more in the excellent post, tests that sometimes fail.

This project allows devs to run spec examples, see the failures, and try to fix the flaky specs either themselves or with a small group. If they get stuck, example solutions are readily available. It should also be relatively easy to extend the project to add examples extracted from real-world projects. I would love to get some flaky test submissions for difficult flaky spec issues.

Project Structure

The project is designed to have two versions of every spec: the flaky version and the stable (solved) version.

Project Structure

You can see this in the folder structure: each spec folder has a sub-directory, solved.

This lets me use RSpec tags to run either the solved or flaky specs.

config.define_derived_metadata(file_path: Regexp.new('/solved/')) do |metadata|
  metadata[:solved] = true
  ENV['SOLVED_SPECS'] = 'true'
end

With these dynamic tags and a default .rspec with --tag ~solved:true we can now run either the flaky or stable suite.

  • flaky: bundle exec rspec
  • stable: bundle exec rspec --tag solved

Example: Solving A Flaky Spec

Let me show the expected workflow when learning with the project…

  1. Run the suite and pick a failure that looks interesting…
  2. Read the Related Code
  3. Modify the Spec, try to fix it
  4. Compare your answer to the provided solution (remember there is more than one way to solve many of these issues)

1. Pick a Failure

Let’s run the suite and pick a failure. In this case spec/models/post_example_e_spec.rb, Post ordered expect many order posts to be in alphabetical order, looks interesting.

bundle exec rspec
Run options: exclude {:solved=>true}

Randomized with seed 5788
...Capybara starting Puma...
* Version 4.3.1 , codename: Mysterious Traveller
* Min threads: 0, max threads: 4
* Listening on tcp://127.0.0.1:64214
...FF.FF.FFF.F.FFF

Failures:
...

3) Post post ordered expect many order posts to be in alphabetical order
   Failure/Error: alphabet.each { |el| Post.create!(title: Faker::String.random(2), body: el) }

   ActiveRecord::RecordInvalid:
     Validation failed: Title has already been taken
   # ./spec/models/post_example_e_spec.rb:12:in `block (4 levels) in <top (required)>'
   # ./spec/models/post_example_e_spec.rb:12:in `each'
   # ./spec/models/post_example_e_spec.rb:12:in `block (3 levels) in <top (required)>'
...

Finished in 10.2 seconds (files took 3.33 seconds to load)
21 examples, 11 failures

Failed examples: ...
rspec ./spec/models/post_example_e_spec.rb:10 # Post post ordered expect many order posts to be in alphabetical order

2. Read the Related Code

Let’s take a closer look at the code involved. In this case, from the comments, we see this spec is flaky on its own and doesn’t require the full suite to be flaky.

require 'rails_helper'

class Post < ApplicationRecord
  validates :title, uniqueness: true

  scope :ordered, -> { order(body: :asc, id: :asc) }
end

# Classification: Randomness
# Success Rate: 80%
# Suite Required: false
RSpec.describe Post, type: :model do
  describe "post ordered" do
    it "expect many order posts to be in alphabetical order" do
      alphabet = ('a'..'z').to_a
      alphabet.each { |el| Post.create!(title: Faker::String.random(2), body: el) }
      expect(Post.ordered.map(&:body)).to eq(alphabet)
    end
  end
end

3. Modify The Spec

For the above example, since we require the title to be unique but we are using a small random value, we can see collisions occur. There are a number of ways to solve this; perhaps we don’t need randomness at all in this case!

require 'rails_helper'

class Post < ApplicationRecord
  validates :title, uniqueness: true

  scope :ordered, -> { order(body: :asc, id: :asc) }
end

# Classification: Randomness
# Success Rate: 80%
# Suite Required: false
RSpec.describe Post, type: :model do
  describe "post ordered" do
    it "expect many order posts to be in alphabetical order" do
      alphabet = ('a'..'z').to_a
      alphabet.each { |el| Post.create!(title: el, body: el) }
      expect(Post.ordered.map(&:body)).to eq(alphabet)
    end
  end
end

Run the spec a few times to ensure it now always passes.

4. Compare Answers

Now you can look at the file in the solutions folder. It will contain a solved spec with additional details explaining why there was an issue and how it was solved, occasionally offering more than one solution. Then you can repeat the steps until you have no more errors in the spec suite.

Using the Project for A Workshop

This is how we turned the project into a code clinic or workshop. We ran it entirely remotely, scheduling multiple 1-hour sessions folks could sign up for. There was a brief set of slides explaining the project and getting folks bootstrapped and installed. Then we used Zoom to break out into small groups of 4 devs to solve the specs as a little team, regrouping at the end to discuss and share our solutions. The workshop format broke down like below:

  • introduce the project and the workflow (10m)
  • solve a single flaky test with the whole group mob programming style (10m)
  • break up into groups and have the groups solve flaky tests (30m)
  • regroup, share solutions, and discussion (10m)

If you try running a workshop, let me know; I would be curious how it went.

Related Links

Here are some other helpful links related to flaky Rails tests.


What I learned about Software Development from building a Climbing Wall
20 November 2019

Theo Copies Erin on The Wall

Theo liked to imitate as he was learning to walk.

TL;DR: This post is just for fun!

I didn’t really learn about programming; this is just a catchy title, and I wanted to share a big project I have continued to work on that has nothing to do with programming. I thought I could make a kind of funny post by stringing together a bunch of programming ‘wisdom’ that could really be associated with nearly anything.

The Wall's Current State

climbing wall’s current state

Prototype, Iterate, Test, and Iterate More

Climbing Wall: If you aren’t sure how much you will use the wall, or how much effort you want to put into it you can start small. Build and expand over time.

Software: In software, most of the time these days folks take an agile / lean approach and try to deliver working MVPs along the life of the project to deliver continuous value and learning.

Prototype

I started really small, initially with a training board. I installed this above the stairs entering the basement.

climbing board

A training board lets you strengthen your fingers, and do weird pull-ups

Iterate

I then added a few holds directly into the wall near the original board so I could string together a few moves.

iterate on wall

A first box of bulk climbing holds, a test mounting on a board

Iterate Again & Again

Then I added a single set of boards on the wall… and started to expand out from there.

expanding the wall

empty wall, mounts, one board, and more

Project Planning

Negotiate with the Team

Climbing Wall: I did have to discuss everything with my wife before I started drilling a bunch of holes in our wall. This was a bit of a process, starting with the training board and then the wall holes (and an agreement that I would patch any holes… hmm, I still have to do some of those). The discussion became easy as she found she enjoyed adding frequent climbing into her exercise routine as much as I did.

Software: In software you are always working with stakeholders, PMs, designers, other developers, and hopefully directly with some customers. Nearly everything built requires negotiation and compromises on time, features, UX, etc.

negotiation with my partner

negotiation with my partner

It Starts with A Plan

Initially, I started with just a small idea, but it expanded, especially after my wife decided she really enjoyed climbing on the wall as well.

planning and materials

planning and materials

Ensure Your Project has a Reasonable Learning Curve

Climbing Wall: On a climbing wall you want it to be fun for beginners, kids, and more experienced climbers. Just like in software you want a project to be accessible and learnable by new hires and developers of different skill levels.

Our wall was a big hit with so many kids that I have added a lot of easier holds and built a lower route so small children can jump on the wall and have a great time.

Software: You can’t build a team, recruit, and mentor folks if you have a software project that is all expert level. Ensure it is easy to set up the app’s development environment and easy to add features, test, and deploy safely so new folks can learn with confidence.

Theo Wall Climb

Theo Climbing, click for video

Set Milestones

On personal projects as in software, you want to set short, medium, & long term goals.

  • Short: something at home to help with training for climbing
  • Medium: I want to connect all the reasonable basement walls for a long route
  • Long: I want to connect the original training board to the route along with ceiling holds
  • Longer: I want to add a removable incline that can add a steep grade when good climbers are visiting

Ceiling Holds

Adding Ceiling Holds was a longer-term goal

Celebrate Your Wins

Climbing Wall: In general, whenever I expanded the climbing wall, I would quickly add some holds and celebrate by climbing my new longer route.

Software: Your team should be proud and get to celebrate after shipping something big. Also, ensure developers are sharing the things they learned along the way of the release with the team. Having space to make investments and to pay down debt requires that everything can’t always be moving at maximum speed all the time. Celebrate the progress folks are making.

climbing in progress

Climbing the Routes as the wall is in progress

Erin in progress

Erin testing our new “door crossing” problem

Plan For Growth

Climbing Wall: On the wall, I would build and leave space to add more when I had the time.

Software: In software, you want code that is flexible and easy to adapt. This doesn’t mean to over optimize, but know when to be specific and when to offer flexibility (the rule of threes can help with this).

gaps when I was low on supplies

Gaps when low on supplies

Make it work, make it right, make it fast

Climbing Wall: The wall was built to keep up and extend my skill level… When I didn’t have the parts or the time, I would sometimes make something fast and leave gaps to extend the routes.

Software: In software, make sure you can get it working first; this ensures you solve the hard problems. Then make it right, sorting out the edge cases and the gotchas. Then make it fast and scalable.

color coded climbing routes

color coded climbing routes

Learning & Growth

Climbing Wall: As I worked on the wall project I became better with tools, building, designing routes, & more. I got comfortable and started to think up some more complex projects.

Software: Software takes practice; you will get better the more you build things, learning which practices to follow and which don’t scale well.

You Can Learn Anything

Climbing Wall: I seriously know very little about building things, tools, construction, or really even climbing. All the information you need is available online to learn so much about any topic that interests you.

Software: You are always learning in software. A new framework, language, domain, etc… The field changes so fast that you have to keep learning to stay up to date. To know when something is a fad or is really worth investing time in deep learning.

You Get Better with Practice

Climbing Wall: Originally, I barely knew what size drill bits, bolts, nuts, and holds to use… Now I can put all this together and set up a new board in almost no time at all.

Software: It is good to keep practicing… Often this is how you learn to navigate all the grey areas of programming. The best solution to a problem isn’t always black and white, often the best practices have edge cases… Learning what to bend and what to break comes with practice and experience. Feeling the pain of maintaining systems over time, knowing what will stick around and what code often just gets removed.

Wall Teamwork

over time, building became faster

It Takes A Team

Climbing Wall: You can’t hold up 4x4 plywood and drill it in yourself… Building a climbing wall requires teamwork and collaboration to successfully complete the project.

Software: In software, most projects can’t be done by a single developer anymore. It takes collaboration, coordination, and teamwork to build something that lasts.

Wall Teamwork

A few of the friends who have helped out

Learn From Prior Art

Climbing Wall: I read a number of things to learn how to build a climbing wall; this free build-a-climbing-wall e-book from Atomik is great. No reason to try to learn from scratch.

Software: Not often are you building a program from scratch with no prior art. Learn from existing frameworks, applications, books, and open source. Build on the shoulders of giants as they say.

Google Image’s Climbing Wall Teamwork

need inspiration? google climbing walls

Start Cheap and Upgrade Later

Climbing Wall: I started with some cheap holds, but I upgraded to nicer holds over time as I spent more time on the wall and as I expanded it. In the end, I really love Atomik climbing holds, and I buy most of my equipment there.

Software: In a software startup or a feature, you want to find the fastest and cheapest way to verify the value of something. When you know there is value and have been able to build something sustainable (or, I guess, in the startup world, with hockey stick growth), you might want to move on from “it works” to best in class… particularly for things that aren’t part of the core company business value.

Wall Teamwork

Cheap bulk grey holds, later upgraded to various specialty holds

Dan Mayer Profile Pic

Dan Mayer's Dev Blog

I primarily write about Ruby development, distributed teams, & dev/PM process. The archives go back to my first CS classes during college when I was learning to write software. I contribute to a few OSS projects and often work on my own projects. You can find my code on github.

Twitter @danmayer

Github @danmayer