[T]here’s often this question of whether teams should move quickly to capture opportunities and respond to customer needs, or whether they should move more cautiously, ensuring that everything works properly, scales well, and is bug-free. However, at really good software companies, this is a false dichotomy.
We're trying to work fast while delivering high-quality products. This does not happen on its own, so we're dedicating time and effort to reach that goal.
Where we are today
Since I joined over two years ago, the engineering team has roughly tripled in size, and so has the overall company. Increasing the number of engineers and stakeholders naturally led to growing pains: the number of communication channels has increased, and the codebase has grown past the stage where one engineer can reliably know how everything works.
With a little foresight and dedicated effort we've managed to maintain the pace at which we're delivering value. But we're starting to see the cracks: releases are getting bigger and more painful, and we need to coordinate more and more people to get features in front of clients. On top of that, stakeholders are pushing for faster delivery:
They want to be able to test out assumptions as early as possible.
They are eager to conduct A/B testing style campaigns.
They don't understand why a small feature is holding up the rest of the release.
Don’t get me wrong: they're absolutely right to ask that of the engineering team :)
For all these reasons we've decided to allocate dedicated engineering time to improve our technical platform.
Note: different teams work differently, depending on team size, the type and size of changes, release cadence, and so on. What comes below mostly describes how our backend engineering team works today.
What to work on
Our engineering team is full of senior people who have tons of good ideas on how to achieve the following goals:
Increase code quality to make it easier for people to contribute.
Improve the QA process / facilitate each feature's QA.
Smooth out the release process.
We're not lacking in ideas, but we know we cannot execute on each of them. How do we know which issue we should try and tackle first?
Because we're engineers, we're going to tackle this as we would a technical performance issue: we don't go in gung-ho, optimising things randomly. We first put in measures, to figure out where the bottlenecks are.
Quick interlude: how we track progress in the engineering team
To track our work, the engineering team uses GitLab issues. We have a Kanban-like workflow, moving items left to right:
Once an engineer starts working on something, the corresponding issue gets moved to "Doing". Once it's in the review stage, it's moved to "Review" – shocking, right? Then comes the QA process ("Ready for Final QA"), followed by "Waiting for PROD". Once a release happens, the issue is closed and marked as "Released".
This is pretty standard stuff, but it'll give some context for later :)
Coming back to the team's goal, we want to deliver as much value as we can. Although the process of delivering something may take time, the value is only generated when something is in front of customers.
So this is what we're going to be focusing on: measuring the time it takes for a piece of work to get in front of customers, once an engineer starts working on it.
To measure this, I decided to write a bit of Python. It pulls data from the GitLab API, formats it a bit, then pushes it to a Google Sheet. The code is available on Hokodo’s public GitHub.
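To give an idea of what such a measurement can look like, here is a minimal sketch. It is not the actual script, and it rests on assumptions: that each status is tracked as an issue label, and that the input is a list of label events shaped like the ones GitLab's resource label events endpoint returns ('created_at', 'action', 'label'). The function names and the sample events are made up for illustration, and nothing here calls the API.

```python
from datetime import datetime


def parse_ts(value):
    # GitLab-style ISO 8601 timestamps end in "Z"; normalise for fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def status_times(events, status):
    """Sorted timestamps at which the `status` label was added to an issue."""
    return sorted(
        parse_ts(event["created_at"])
        for event in events
        if event["action"] == "add" and event["label"]["name"] == status
    )


def lead_time_days(events, start="Doing", end="Released"):
    """Days between first entering `start` and last entering `end`.

    Returns None when the issue never reached one of the two statuses.
    """
    started = status_times(events, start)
    finished = status_times(events, end)
    if not started or not finished or finished[-1] < started[0]:
        return None
    return (finished[-1] - started[0]).total_seconds() / 86400


# Hypothetical events for a single issue (assumed shape, invented values).
sample_events = [
    {"created_at": "2021-03-01T09:00:00Z", "action": "add", "label": {"name": "Doing"}},
    {"created_at": "2021-03-03T09:00:00Z", "action": "add", "label": {"name": "Review"}},
    {"created_at": "2021-03-08T21:00:00Z", "action": "add", "label": {"name": "Released"}},
]

print(lead_time_days(sample_events))  # prints 7.5
```

Taking the first "Doing" and the last "Released" is one possible choice. It means an issue that gets moved backward still counts from the moment work first started.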
Looking at 2021 data, this is how many days it took for issues to go from “Doing” to “Released”:
This is the overall picture. We can also look at each status transition, to see where there might be a bottleneck. For example, this is the same type of graph, looking at how many days it took for a piece of work that is ready to go out, to actually get in front of customers:
We can see that a significant number of issues took over ten days to go through that transition. That’s especially frustrating for everyone involved, and probably something we want to work on.
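The per-transition view can be sketched in a few lines as well. This is an illustration under assumptions, not our real code: it takes, for each issue, a dictionary mapping a status name to the time the issue entered it, then computes the days spent on each step and the share of issues above a threshold. The issue data below is invented purely to show the mechanics.

```python
from datetime import datetime

# Workflow statuses, in the left-to-right order described earlier.
STATUSES = ["Doing", "Review", "Ready for Final QA", "Waiting for PROD", "Released"]


def transition_days(entered):
    """Map 'A -> B' to the days between entering A and entering B.

    `entered` maps a status name to the datetime the issue entered it.
    Transitions with a missing status are skipped.
    """
    result = {}
    for current, following in zip(STATUSES, STATUSES[1:]):
        if current in entered and following in entered:
            delta = entered[following] - entered[current]
            result[f"{current} -> {following}"] = delta.total_seconds() / 86400
    return result


def share_over(durations, threshold_days):
    """Fraction of durations strictly above `threshold_days`."""
    if not durations:
        return 0.0
    return sum(d > threshold_days for d in durations) / len(durations)


# Invented example issues, not real measurements.
issues = [
    {"Waiting for PROD": datetime(2021, 5, 3), "Released": datetime(2021, 5, 5)},
    {"Waiting for PROD": datetime(2021, 5, 3), "Released": datetime(2021, 5, 17)},
    {"Waiting for PROD": datetime(2021, 6, 1), "Released": datetime(2021, 6, 4)},
    {"Waiting for PROD": datetime(2021, 6, 1), "Released": datetime(2021, 6, 13)},
]

waits = [transition_days(issue)["Waiting for PROD -> Released"] for issue in issues]
print(waits)                  # prints [2.0, 14.0, 3.0, 12.0]
print(share_over(waits, 10))  # prints 0.5
```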
Some notes on these measures...
The data we’re using here is not perfect. Sometimes we developers are not as diligent as we should be when moving issues from one stage to the next. Other times we move an issue backward when we should be creating a follow-up issue instead. This may create outliers; our hope is that looking at the aggregate saves us from these problems. We'll look more into this in a future blog post.
Anecdotally, when looking at some of these outliers, they were indeed issues that took a lot of time. They should have been broken down into smaller issues, and/or it took several releases to fully get the work done. In other words, these outliers pointed to real problems rather than just noise in the data.
When a measure becomes a target, it ceases to be a good measure
Today we think these measures are useful for us, but they may not be in the future. It’s important not to treat these measures as goals in themselves. If they stop being relevant to the actual business, we will need to re-evaluate.
It would also be easy to improve these measures by taking shortcuts that would inevitably decrease product quality. The measures presented here are only useful under the assumption that we do not want to decrease overall product quality.
Still, these measures let us know where to put in work, and will later help us to evaluate our impact. They are also very useful to “sell” technical work to executives and garner support from non-technical higher-ups. You know how you’ve always wanted to refactor the 'BeanFactory' module because the cyclomatic complexity is getting out of control? That doesn’t make for a great sales pitch 🙆
Where to go from here
Now that we have a way of measuring the impact of our efforts, it’s time to actually put in the work!
The measures above can be automated, and will be re-computed every month or quarter to gauge the impact of our work and ascertain if we need to improve our technical platform even more.
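A periodic re-computation could be as simple as grouping lead times by release month. The sketch below assumes the input is a list of (release date, lead time in days) pairs. The function name and the numbers are hypothetical. It uses the median rather than the mean, which is one way to keep a single very slow issue from dominating a month's figure.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def monthly_median(records):
    """Median lead time per release month.

    `records` is an iterable of (release_date, lead_time_days) pairs.
    Returns a dict keyed by 'YYYY-MM', in chronological order.
    """
    by_month = defaultdict(list)
    for released_on, days in records:
        by_month[released_on.strftime("%Y-%m")].append(days)
    return {month: median(values) for month, values in sorted(by_month.items())}


# Invented records, only there to show the grouping.
records = [
    (date(2021, 1, 12), 4.0),
    (date(2021, 1, 26), 9.0),
    (date(2021, 1, 26), 30.0),
    (date(2021, 2, 9), 5.0),
    (date(2021, 2, 23), 7.0),
]

print(monthly_median(records))  # prints {'2021-01': 9.0, '2021-02': 6.0}
```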
Hopefully we’ll come back to you soon to show off our progress :)