#2 - Link Roundup, Feb 7 2020
The Subtle Signs Your Team Is in Trouble - Rachel Muenz, Laboratory Manager
An older article that Laboratory Manager sent around on Twitter this week. It's written in the context of laboratories, but carries over very clearly to research computing teams. The article mentions two opposite signs to look out for:
Lack of conflict/disagreement - I think this is a great one, and easy to overlook. If you have a bunch of technical people in a room and there isn’t some steady level of (healthy, respectful) disagreement about how to proceed, then either people are checked out or they don’t feel like they can voice their opinions. Myself, I try to run tight meetings, which is fine, but I find that unless I’m really careful I can easily make people feel like they should keep disagreements or other “speaking out of turn” to themselves.
Escalation to open conflict - whether as a learned behaviour or because disagreement has been tamped down so firmly for so long, when disagreement does surface, does it spin out of control? This sign is more obvious, but it still has to be dealt with, and quickly.
In both cases the solutions are the same: listen constantly (through one-on-ones, but also just generally) and give lots of feedback, both positive and negative - positive for any input, even input you disagree with, and for disagreeing constructively; negative for disagreeing destructively.
Tracking Toil using SRE principles - Eric Harvieux, Google Cloud Blog
Writing Runbook Documentation when you’re an SRE - Taylor Barnett, Transposit
“Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
These two articles came out in the same week, and they work very nicely together. One of the under-appreciated advantages commercial cloud organizations (or other large operations) have at scale isn't about hardware - it's that they have better awareness of how much of the work that gets done is repetitive across teams, and they put the resources into ensuring those things get done smoothly.
Technical teams tend to hate documenting, but identifying these repetitive “toil” tasks and documenting what’s involved in getting them done in a “runbook” has three really important benefits:
It makes sure the tasks are done reliably in reproducible ways
It makes it much easier to onboard new people to the team or transfer responsibility for these tasks to someone else
It is the first step in automating the tasks, making them easier to do or, ideally, completely automatic.
Are there things that have to routinely get done in your team that aren’t documented clearly in (kept-up-to-date) runbooks? Would things be easier if there were fewer such tasks?
And related to this, here's a list of "flight rules" (i.e., very short runbooks) for common nontrivial tasks in git.
I talked last week about Andy Grove’s book High Output Management, which is aimed at management and in particular senior management in technical organizations; this article talks about lessons from the book for those still in the trenches, as technical leads (but not managers).
A Chaos Test Too Far - Kathryn Downes & Arjun Gadhia, Financial Times
I’m a big fan of “chaos testing” (or, arguably more appropriate for a lot of research computing work, Slack’s “disasterpiece theatre”) for testing the hypothesis that “our system is robust to [bad thing]”; the software equivalent is probably fuzzing. We know stuff is going to break, usually at the least opportune time; systems crash, users or operators enter invalid information… best to test it in a controlled manner.
So in fairness I should probably post examples of it going - or nearly going - horribly wrong, like this one. I’d still argue though that the series of events they tested here would almost certainly have happened eventually; far better that it happen with all hands on deck when they were primed to respond rather than towards the end of the day on a lazy Friday afternoon when people were already heading home.
Are you tired of your desktop’s file system working too well to be able to do good robustness testing of software? Want to mess with a process’s view of system time? Want stderr output always colorized so you can tell it apart from stdout? Check out this list of LD_PRELOAD hacks.
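To give a flavour of the technique, here's a minimal sketch of an LD_PRELOAD interposer that shifts a process's view of the current time. This is my own illustration, not one of the hacks from the linked list; the file name and the one-hour offset are made up:

```c
/* fake_time.c - a minimal sketch of an LD_PRELOAD hack that shifts a
 * process's view of system time forward by one hour (an illustrative
 * example, not taken from the linked list).
 *
 * Build:  gcc -shared -fPIC -o fake_time.so fake_time.c -ldl
 * Run:    LD_PRELOAD=./fake_time.so date
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <time.h>

time_t time(time_t *tloc)
{
    /* Look up the real time() in libc, the next object in the
     * dynamic linker's search order after this one. */
    time_t (*real_time)(time_t *) = dlsym(RTLD_NEXT, "time");
    time_t now = real_time(NULL) + 3600;  /* pretend it is an hour later */
    if (tloc)
        *tloc = now;
    return now;
}
```

Because the dynamic linker resolves symbols from preloaded libraries first, any dynamically linked program run with `LD_PRELOAD=./fake_time.so` will see `time()` report an hour into the future, with no change to the program itself.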