David Ruttka

Engineering | Gaming | Improv | Storytelling | Music

Rewriting History to Remove Unwanted Binaries

We’re in the middle of a TFVC (TFS) to Git migration that I’ll probably blog about more completely later. Right now, I want to cover one thing that we’re cleaning up in the process.

These Aren’t The Packages You’re Looking For

We got into a bad situation where we were checking in our packages folder instead of letting NuGet restore handle it.

  • Some of this was because we were using older build templates in VSO and NuGet restore wasn’t working how we expected
  • Some of this was because we have private packages that aren’t on any feed that VSO can currently access

Rewriting History

If we’re going to have a new project in VSO, and we’re going to be creating a new repo based on the TFVC history, we might as well move up to the new templates and pretend those packages were never there.

The naive approach would be to just do the git-tf migration, then do a follow-up commit that deletes all the packages. They'd still be in the history, the repo would still be oversized, and the index would still take the hit of having all those binaries hanging around.

Here’s a command that would do the trick for the Newtonsoft.Json.6.0.1 package alone.

DANGER! WARNING! WE'RE DOING THIS TO A NEW REPO, DURING MIGRATION, BEFORE IT BECOMES SHARED HISTORY. READ THE MANUAL AND ITS CAVEATS BEFORE YOU DO THIS TO EXISTING, SHARED REPOS.

git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch packages/Newtonsoft.Json.6.0.1" HEAD

Breakdown

tl;dr this is going to roll through all of our commits and, for each one, remove the Newtonsoft.Json.6.0.1 directory and everything inside it from that commit's index, so the rewritten commit never contains it. It NEVER EXISTED.

  • packages/Newtonsoft.Json.6.0.1 is the most self-explanatory part. This is the sub-directory we’re going to pretend was never added.
  • -rf should be similarly self-explanatory for anyone with a bit of *nix background. Recursive. Force.
  • --ignore-unmatch sparked a bit of discussion between Michael and me. What it boils down to is that this instructs git rm to exit with a success code even if no files match the pattern. Otherwise it would exit as a failure if no files were matched and removed.
  • git rm says to remove the files.
  • --cached removes it from the index but leaves the working tree alone.
  • This is all wrapped in quotes because the whole git rm command is a single parameter to git filter-branch.
  • git filter-branch on the left is going to rewrite history for every commit
  • --index-filter is better explained in the docs*
  • HEAD on the right side says to rewrite every commit reachable from HEAD.
  • -f forces. I will cover why we added this, but not just yet.
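If you want to see the whole effect end to end before trusting it on a real migration, here's a throwaway sketch: it builds a scratch repo, commits a fake checked-in package next to some source, rewrites the package away, and confirms that no commit references it anymore. The paths, file contents, and commit messages are all invented for the demo.

```shell
# Scratch demo of the index-filter rewrite. Everything here is throwaway.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

# Simulate a checked-in package plus some real source across two commits.
mkdir -p packages/Newtonsoft.Json.6.0.1/lib
echo fake-binary > packages/Newtonsoft.Json.6.0.1/lib/Newtonsoft.Json.dll
echo code > Program.cs
git add -A && git commit -qm "commit with packages checked in"
echo more >> Program.cs
git add -A && git commit -qm "second commit"

# Newer git versions print a scary (and fair) warning; squelch it for the demo.
export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch -f --index-filter \
  "git rm -rf --cached --ignore-unmatch packages/Newtonsoft.Json.6.0.1" HEAD

# No rewritten commit touches that path anymore, so this prints nothing.
git log --oneline -- packages/Newtonsoft.Json.6.0.1
```

Note that at this point refs/original still holds the pre-rewrite history; the cleanup step at the end of the post deals with that.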

But That’s Just One Of Them

We want to do this for all the public packages, and we want to do it for none of our private ones. We’ll need to get a list.

From the packages directory

git log --diff-filter=A --summary . `
    |? { $_ -ne $null -and $_ -match 'create mode \d+ (.*)?(/lib/.*)' } `
    |% { $($matches[1]) } `
    | select -unique | sort `
    |% { "git filter-branch -f --index-filter ""git rm -rf --cached --ignore-unmatch $_"" HEAD" }

Breakdown

  1. The first line dumps all of the adds that happened in the current directory (packages)
  2. The second line filters out blank lines, and matches the regex of created files, capturing the path as a group.
  3. The third line pulls out the matched path
  4. dlls, pdbs, the nupkgs themselves, and who knows what else might have been added, but our filter-branch + rm above is going -rf on all of it anyway. The fourth line dedupes the directories and sorts them just for convenience.
  5. The last line writes the command we want to execute to the output stream. It doesn't execute anything; it just prints the commands so you can review them first.

The output looks something like this

git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch packages/bar" HEAD
git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch packages/foo" HEAD

You could go further: skip wrapping the command in quotes and execute it directly instead of dumping it to the output stream. But then you wouldn't have accounted for the private packages, so you'd want some kind of $safePackages = ("x","y","z") and a |? clause to strip them out. Your choice, but until we've run this through its paces, I kind of like the safety of having the chance to review what's about to be done.
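For the Bourne-shell inclined, here's a rough, self-contained sketch of the same idea: fabricate a scratch repo containing one public and one private package, list every package directory that was ever added, and strip the private ones via a simple list file. All of the package names here are invented for the demo.

```shell
# Scratch demo: list ever-added package dirs, excluding private ones.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
for p in Newtonsoft.Json.6.0.1 Our.Private.Pkg.1.0.0; do
  mkdir -p "packages/$p/lib"
  echo fake > "packages/$p/lib/a.dll"
done
git add -A && git commit -qm "check in packages"

# Private packages we must NOT rewrite away, one path per line.
printf 'packages/Our.Private.Pkg.1.0.0\n' > private-packages.txt

# Every directory under packages/ that was ever added, minus the private list.
pkgs=$(git log --diff-filter=A --summary -- packages/ \
  | sed -n 's|^ *create mode [0-9]* \(packages/[^/]*\)/.*|\1|p' \
  | sort -u \
  | grep -vxF -f private-packages.txt)
echo "$pkgs"
```

From there you'd emit the same filter-branch commands the PowerShell version prints.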

The Final Stroke, And Why We -f It

Each filter-branch run works in a temporary .git-rewrite directory and leaves a backup of the original refs under refs/original. While that backup exists, the next filter-branch will refuse to run and ask you to clean it up. We're going to do this a whole bunch of times and clean everything up after all of them are done, so we just -f through it.

Then, at the end, you do want to clean it up and run a gc. How?

rm -rf .git/refs/original/ && git reflog expire --expire=now --all && git gc --aggressive --prune=now
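To see the cleanup actually reclaim space, here's a disposable end-to-end sketch: commit a stand-in binary, rewrite it out of history, then run the cleanup and confirm the blob is really gone from the object store. The file names are invented for the demo.

```shell
# Scratch demo: prove the cleanup actually drops the rewritten-away blob.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

head -c 100000 /dev/zero > big.dll   # stand-in for a checked-in binary
echo src > Program.cs
git add -A && git commit -qm "commit with binary"
blob=$(git hash-object big.dll)      # remember the blob id for later

export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch -f --index-filter \
  "git rm -rf --cached --ignore-unmatch big.dll" HEAD
git reset --hard -q   # sync the working tree and index to the rewritten HEAD

# Backup refs and reflog entries still pin the old blob; drop them, then gc.
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --quiet --aggressive --prune=now

git cat-file -e "$blob" && echo "blob still present" || echo "blob gone"
```

The git reset --hard matters in this demo: the index would otherwise still reference the blob, and gc treats index-referenced objects as reachable.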

Congrats! Your fancy, newly-migrated-to-Git repo will be a lot lighter when you push it, without all those unnecessary binaries. You could certainly consider applying this procedure to all kinds of other directories where you've been putting garbage into source control.

* Hat tip to an old post by David Underhill that got us started.

** More thanks to Josh for responding quite quickly to my call for his favorite way to find subdirectories that used to exist but got deleted. I ended up going a different way, and just looking for everything that was ever added, whether it exists or not. And using PowerShell. He says he’s considering blogging his solution soon!!

Little Wonders: Alt+Space, E, P

This is why we pair! – Me, many times

Another one. These things are so simple, but they’re only useful if you know they exist. I’ve been planning a post that covers my thoughts on pairing, but (spoiler alert!) this kind of little stuff is one of the great benefits.

I had to reimage a machine, so my console hotkeys are not all the same as they used to be. I usually remap Ctrl+V to paste, since that is consistent with most apps in the Windows ecosystem. Until I do so, it’s Right Click > Edit > Paste.

I was telling Patrick how I love to stay at the keyboard and eschew the mouse as much as possible, and he pointed out that I could just do Alt+Space*, E, P. This isn't quite as short as the Ctrl+V remapping, but it works out of the box. I like it!

* Thanks to Michael for noticing that I had this correct in the title, but typed the wrong content in the body.

Little Wonders: Uri.GetLeftPart

File another one under learning something new every day. Patrick was asking me what the best way is to get the scheme + host + port of a Uri in .NET. We discussed a few ways that we’ve done it in the past, but agreed that there must be a better way.

Behold, Uri.GetLeftPart! This will give you everything up to and including the part you want.

Given a new Uri("http://somewhere.com:8080/this/is/path?andQuery#fragment"), calling GetLeftPart(UriPartial.Authority) will give you back http://somewhere.com:8080. You can also pass UriPartial.Scheme, UriPartial.Path, or UriPartial.Query. There’s a table showing what comes back for each part in the MSDN documentation.

It sure beats the string manipulations I’ve done in the past!

Communication Protip: Be Specific

The other day, I says to Michael, I says

We suck so much at communication. Not you and me. I mean, I am including myself, but I mean ‘we’ in the universal human sense.

Today, I sucked again, and I am able to see exactly where I went wrong.

Be Specific

If you have multiple environments being used for development purposes, don’t talk about making updates to “dev.” This is going to confuse people about whether you meant hacky-dev-builds.fancycloud.net or slightly-more-stable-dev.yourdomain.com.

Refer to the environment by something more unique like the URI, and don’t leave room for doubt about exactly what changes you made.

Don’t do this:

I just deployed the latest bits to dev.

Do this instead:

I just deployed to endpoint.host.tld. The changes included in this deployment since the last deployment are X, Y, and Z. [Actions people would need to take because of these changes, if any.]

Principles of Debugging: A Postmortem

The following is a combination horror story and true crime documentary of developers tricking themselves into seeing things that aren’t there, not seeing things right in front of their faces, and breaking various rules of debugging. The faint of heart should close the browser tab now.

No, it’s not a Heisenbug. Heisenbugs don’t repro the same when you’re looking into them. This is…something else. – Actual quote

Wont-Fix

There’s one part of this story that I will not address. We have to use 64-bit integers (C# long) for the ids of one of our resources due to downstream / legacy dependencies. They end up being incredibly large, non-sequential values. That’s not the point here, just some context.

A Problem Is Reported

INT. A developer’s desk, early morning, before coffee, bright blue sunny day.

An email is received reporting an issue where the ids of certain resources are intermittently being returned as two less than the true value. Or one less. Or one or two above. Always “close”, but often incorrect.

There is a Fiddler .saz to prove it, and screenshots of that same Fiddler .saz.

A fearless developer figures this must be something pretty silly, opens up the issue in the issue tracking system, and decides to take it on. There will be much weeping.

The Plot Thickens

INT. The same developer’s desk. Plus another developer to pair.

The developer has traced through the entire command side of the CQRS system and confirmed that the ID is not strangely mutated before publishing an event to the bus.

Also traced through the read model updaters to ensure no strange mutations occur there.

Queried the read model store directly. Data is correct at rest. It’s got to be in the query side or API itself.

We have some message handlers that fire late in the pipeline, just before the response stream is written. The original dev has set breakpoints there and confirmed the ID is correct before it goes out the door.

Yet, Fiddler keeps showing that odd ids always become even. Even ids always stay even, but sometimes go up or down by two.

And then, on a whim, add Accept: text/xml to the Fiddler composer instead of taking the default json.

It’s correct in the XML.

Wat

INT. Kitchenette. Coffee. Sanity beginning to fray.

That Annoying BAMB BAMB Sound From Law and Order

INT. The same developer’s desk. Plus another. Total of three developers.

The third developer runs through a lot of what we’ve already done, just to make sure we didn’t miss something silly.

It’s only happening to json. Could it be in the serializer? The message formatter? What is going on?

File > New Project. Eliminate all of our code from this and spin up a barebones project. F5 and see the correct values displayed in the browser. Do we have a different version of the json serializer package? A different version of our web api framework package? What else is in the pipeline for json and not for xml?

Lunch

EXT. Beautiful, bright blue day. Sunny. Warm. The opposite of the cold grey oppression of this bug.

Walk to lunch, eat lunch, decide to work from a coffee shop and wrap our heads around this thing.

Frayed Ends of Sanity

EXT. Outdoor tables at local cafe.

We create more reductions. The json serializer and message formatters seem fine, but we still aren’t sure what else is in the pipeline. We start mutating the data store to different ids to attempt to establish a pattern.

We find that id = 7 stays 7, even though it’s odd. Just the big ones go off.

We write a quick script to create 1000 resources with sequential ids starting with one of the large problem ids. Then we request each of them through our API and dump the id we expect, and the id the API gave us. They all match. None are wrong. This was in C#.

But Fiddler is showing the wrong id, even when C# is showing the rig—–

Derp

The raw tab of Fiddler always had the correct id. It was the json tab that looked incorrect. The json tab that, you know, handles numbers the way JavaScript handles numbers.

We saw even/odd, but we didn’t see that they were all, for example, multiples of four.

When we started a new Web API project from scratch, we viewed those responses in a browser instead of in Fiddler. We changed all the things at once, including what we used to view the output.

Debug It!

A few years ago, I read this book and recommended it to everyone I worked with. As I look at what we went through in this story, I see that we held fast to some of the principles for good debugging, but got caught out on a couple others.

What Went Well

  • We had good evidence of the problem and quickly established a scenario to reproduce the problem
  • We eliminated the command side, event bus, and read models very quickly. This let us isolate the problem to the response pipeline.
  • We further isolated it to only the json output, further supporting the case that our commands and queries were not the cause of mutation.

Lessons Learned (a.k.a. Things We Knew Better But WTH Were We Doing)

  • Fiddler is a great tool, but it isn’t showing you the raw response unless you’re actually looking at the raw tab
  • We narrowed the issue down to “the response pipeline” but never suspected the tools we used to view the output after the response was delivered.
  • The prime directive to turn only one knob at a time includes what you’re using to view output. If this had been a CSS thing, we certainly would have counted different browsers as knobs we were turning. Here, we tricked ourselves into thinking the raw text response in the browser was the same as the json tab in Fiddler.
  • JavaScript can’t go full 64-bit on integers. All numbers are 64-bit floating point numbers, so the largest integer it can represent exactly is 2^53.
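The precision cliff is easy to demonstrate without JavaScript at all; awk uses the same IEEE 754 doubles (awk is chosen here just because it's everywhere, and any double-based language shows the same collision).

```shell
# Integers are exact in a double only up to 2^53; past that, neighboring
# ids collapse together -- the "off by one or two" symptom from the story.
awk 'BEGIN {
  printf "%.0f\n", 2^53        # 9007199254740992: still exactly representable
  printf "%.0f\n", 2^53 + 1    # rounds back down to 9007199254740992
  print (2^53 == 2^53 + 1)     # 1: the two distinct ids compare equal
}'
```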

Fixing the External-url (Linklog) Feature in Octopress

The Linklog Feature

Another quick edit that I made to the theme was to handle the external-url for linklog posts, as described in the Octopress docs.

I found that this feature wasn’t working for me. I couldn’t find any place in the theme that attempted to use external-url, so I just decided on a place that works for me and put it there.

Show Me The Code

You can check out the pull request or the commit it contains. I cleaned up the unintentional whitespace changes in the follow-up commit.

Basically, I just added a capture for entryhref, using the external_url if present. Later, after the line that handles adding the Comments link as appropriate, I added a line that adds the Permalink as appropriate.

I’m still finding my way around Octopress and am definitely open to thoughts if I’m going about this stuff the wrong way. Let me know.