The following is a combination horror story and true crime documentary of developers tricking themselves into seeing things that aren’t there, not seeing things right in front of their faces, and breaking various rules of debugging. The faint of heart should close the browser tab now.
There’s one part of this story that I just will not address. We have to use 64-bit integers (C# long) for the ids of one of our resources due to downstream / legacy dependencies. They end up being incredibly large, non-sequential values. That’s not the point here, just some context.
A Problem Is Reported
INT. A developer’s desk, early morning, before coffee, bright blue sunny day.
An email is received reporting an issue where the ids of certain resources are intermittently being returned as two less than the true value. Or one less. Or one or two above. Always “close”, but often incorrect.
There is a Fiddler .saz to prove it, and screenshots of that same Fiddler .saz.
A fearless developer figures this must be something pretty silly, opens up the issue in the issue tracking system, and decides to take it on. There will be much weeping.
The Plot Thickens
INT. The same developer’s desk. Plus another developer to pair.
The developer has traced through the entire command side of the CQRS system and confirmed that the ID is not strangely mutated before publishing an event to the bus.
Also traced through the read model updaters to ensure no strange mutations occur there.
Queried the read model store directly. Data is correct at rest. It’s got to be in the query side or API itself.
We have some message handlers that fire late in the pipeline, just before the response stream is written. The original dev has set breakpoints there and confirmed the ID is correct before it goes out the door.
Yet, Fiddler keeps showing that odd ids always become even. Even ids always stay even, but sometimes go up or down by two.
And then, on a whim, add
Accept: text/xml to the Fiddler composer instead of taking the default
It’s correct in the XML.
INT. Kitchenette. Coffee. Sanity beginning to fray.
That Annoying BAMB BAMB Sound From Law and Order
INT. The same developer’s desk. Plus another. Total of three developers.
The third developer runs through a lot of what we’ve already done, just to make sure we didn’t miss something silly.
It’s only happening to json. Could it be in the serializer? The message formatter? What is going on?
File > New Project. Eliminate all of our code from this and spin up a barebones project. F5 and see the correct values displayed in the browser. Do we have a different version of the json serializer package? A different version of our web api framework package? What else is in the pipeline for json and not for xml?
EXT. Beautiful, bright blue day. Sunny. Warm. The opposite of the cold grey oppression of this bug.
Walk to lunch, eat lunch, decide to work from a coffee shop and wrap our heads around this thing.
Frayed Ends of Sanity
EXT. Outdoor tables at local cafe.
We create more reductions. The json serializer and message formatters seem fine, but we still aren’t sure what else is in the pipeline. We start mutating the data store to different ids to attempt to establish a pattern.
We find that id = 7 stays 7, even though it’s odd. Just the big ones go off.
We write a quick script to create 1000 resources with sequential ids starting with one of the large problem ids. Then we request each of them through our API and dump the id we expect, and the id the API gave us. They all match. None are wrong. This was in C#.
But Fiddler is showing the wrong id, even when C# is showing the rig—–
We saw even/odd, but we didn’t see that they were all, for example, multiples of four.
When we started a new Web API project from scratch, we viewed those responses in a browser instead of in Fiddler. We changed all the things at once, including what we used to view the output.
A few years ago, I read this book and recommended it to everyone I worked with. As I look at what we went through in this story, I see that we held fast to some of the principles for good debugging, but got caught out on a couple others.
What Went Well
- We had good evidence of the problem and quickly established a scenario to reproduce the problem
- We eliminated the command side, event bus, and read models very quickly. This let us isolate the problem to the response pipeline.
- We further isolated it to only the json output, further supporting the case that our commands and queries were not the cause of mutation.
Lessons Learned Things We Knew Better But WTH Were We Doing
- Fiddler is a great tool, but it isn’t showing you the raw response unless you’re actually looking at the raw tab
- We narrowed the issue down to “the response pipeline” and excluded tools used to view the output after the response was delivered.
- The prime directive to turn only one knob at a time includes what you’re using to view output. If this had been a CSS thing, we certainly would have counted different browsers as knobs we were turning. Here, we tricked ourselves into thinking the raw text response in the browser was the same as the json tab in Fiddler.