25.9.2023

Gathering speed: Improving Hokodo's engineering processes

Pierre Mourlanne

Staff Software Engineer

Hi, I’m Pierre, and I’m a staff engineer at Hokodo, working on our internal technical platform.

Eighteen months ago we looked at identifying the bottlenecks in the engineering team. Today, I’m back to talk about the changes we put in place to improve our release process, and the positive effects we saw from this work.

Key takeaways from our first analysis

In early 2022, we conducted our first analysis. Our two key takeaways were as follows:

Issues routinely took over ten days to get from the “Doing” status (i.e. a developer picks it up) to getting released in production (i.e. available to end users)
Issues often spent more than five days in the “Waiting for Prod” status, right before being released

The first measure can be considered to be our cycle time: how much time it takes for an issue to be delivered once a developer picks it up. As a technical leader, I care about this metric because it shows how easy it is to develop a piece of work and get it into the hands of customers. It also matters to stakeholders, because it gives them visibility regarding our ability to deliver.
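To make the two measures concrete, here is a minimal sketch of how they could be computed from an issue's status history. The issue keys, dates, and the "Released" status name are assumptions made up for illustration; only "Doing" and "Waiting for Prod" come from our board.

```python
from datetime import datetime
from statistics import median

# Hypothetical status histories: (status, time the issue entered that status).
# "Released" is an assumed name for the final status.
issues = {
    "ISSUE-1": [
        ("Doing", datetime(2022, 3, 1)),
        ("Waiting for Prod", datetime(2022, 3, 8)),
        ("Released", datetime(2022, 3, 12)),
    ],
    "ISSUE-2": [
        ("Doing", datetime(2022, 3, 2)),
        ("Waiting for Prod", datetime(2022, 3, 4)),
        ("Released", datetime(2022, 3, 5)),
    ],
}


def days_between(history, start_status, end_status):
    """Whole days from first entering start_status to first entering end_status.

    Returns None if the issue never reached one of the two statuses.
    """
    first_seen = {}
    for status, when in history:
        # Keep only the first time each status was entered.
        first_seen.setdefault(status, when)
    if start_status not in first_seen or end_status not in first_seen:
        return None
    return (first_seen[end_status] - first_seen[start_status]).days


# Cycle time: from "Doing" to released in production.
cycle_times = [days_between(h, "Doing", "Released") for h in issues.values()]
# Time spent waiting once the work is ready for production.
waiting_times = [
    days_between(h, "Waiting for Prod", "Released") for h in issues.values()
]

print("median cycle time (days):", median(cycle_times))
print("median waiting time (days):", median(waiting_times))
```

Using the first time a status was entered keeps the sketch simple; an issue that bounces back and forth between statuses would need a more careful rule.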

The second measure is a pet peeve of ours: once a piece of work has been worked on, passed technical review, and has undergone thorough QA, it’s ready to go into production. The natural question from stakeholders is often: “If it’s ready to be delivered, what are we waiting for?”

As a staff engineer on the platform team, I very much care about this issue, because this is about our release process. Improving it directly affects our ability to deliver value, and to make stakeholders happy (which I love doing).

Since early 2022, we have taken big steps to improve both our processes and our internal technical platform. I’m going to go over the actions we took, and measure the impact these changes have had.

Changes implemented in 2022

As a rapidly growing fintech, we implemented a lot of changes to our ways of working during 2022.

Quality Assurance

Until then, we had relied on ad hoc integration tests when it came to conducting quality assurance (QA). In particular, we didn’t have one person dedicated to improving our processes on the matter. Backend engineers had started working on a repository to conduct automated integration tests against live environments, but the project was in its infancy. For more complex testing, ops people had built a massive spreadsheet gathering the non-regression tests that needed to pass when we developed a new feature.

The QA process was brittle and time-consuming, and there was no process for documenting or sharing knowledge among the team.

We hired a quality engineer, whose role was to improve these processes, hopefully by iterating on what we had already built. And that is what they did!

Even though we hired someone for the QA role, it didn’t mean that this person would do all the QA. We decided that every engineer was expected to QA features. There were two motivations behind this decision:

Avoiding silos: we don’t want engineers to lob their work to another team and not be involved anymore
Having engineers do QA as part of their workload means our QA capacity scales nicely with the number of pieces of work delivered

Committing to (at least) one release per week

Up until early 2022, we were aiming for one release per week. In reality, there were times when we didn’t think there was enough in the pipeline to merit a release, so we skipped one, or things simply slipped.

In 2022 we committed to doing one release per week because:

Having a consistent release schedule means better management of stakeholders’ expectations
Releasing often is a good motivation for improving the pain points in our process

Our release process is currently simple: we are using a single release train. This means everything currently merged into the main branch that hasn’t been released during the previous deployment will be released in the next one.

Our major pain point was that we weren’t able to QA features before they got merged. Since we don’t want to ship a version of our code that hasn’t been QA’d, we often ended up in situations where we had to stop merging into the main branch, to allow already merged features to be QA’d. To solve this problem, we put a merge freeze in place.

A typical week would look like this:

On Tuesday afternoon, the person in charge of the release enables a merge freeze: no one is allowed to merge new features into the main branch (fixes to the already merged features are encouraged though!)
By Wednesday afternoon, hopefully everything on the main branch has been QA’d, so the releaser can create the version of the code to deploy, creating a release tag
On Thursday, the new version of the code goes into production, and the releaser disables the merge freeze: engineers are free to merge features (which will get released next week)
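As a rough illustration of the weekly rhythm above, here is a small sketch of a check that decides whether a merge is allowed. In practice the releaser toggled the freeze by hand; the fixed cut-off hours and the function names below are assumptions for the sake of the example.

```python
from datetime import datetime

# Assumed cut-offs, expressed as (weekday, hour) with Monday == 0.
FREEZE_START = (1, 12)  # Tuesday at noon: "Tuesday afternoon"
FREEZE_END = (3, 12)    # Thursday at noon: assumed time the release is out


def merge_freeze_active(now):
    """True between the Tuesday cut-off and the Thursday release."""
    position = (now.weekday(), now.hour)
    return FREEZE_START <= position < FREEZE_END


def can_merge(now, is_fix_for_merged_feature):
    """Fixes to already merged features are allowed during the freeze."""
    return is_fix_for_merged_feature or not merge_freeze_active(now)


# 1 March 2022 was a Tuesday.
print(can_merge(datetime(2022, 3, 1, 9), False))   # Tuesday morning
print(can_merge(datetime(2022, 3, 2, 10), False))  # Wednesday, during the freeze
print(can_merge(datetime(2022, 3, 2, 10), True))   # a fix during the freeze
```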

Comparing 2022 metrics to the 2021 values

Cycle time

Overall, things have improved: there used to be a long tail of issues that would take over ten days to be released, and there are significantly fewer of those. The median cycle time went from fourteen days in 2021 to eleven days in 2022, which is a significant decrease.

Time it takes for an issue to get released once it’s ready for production

For this measure, we can see there are way more issues taking between two and seven days to be released. Looking at the 90th percentile:

In 2021, 90% of issues took fewer than 19 days to get released
In 2022, 90% of issues took fewer than 9 days to get released
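For readers who want to reproduce this kind of figure, here is a minimal sketch of a 90th percentile using the nearest-rank method. The sample durations are made up, and nearest-rank is only one of several common percentile definitions, so it is not necessarily the exact method behind the numbers above.

```python
def percentile_nearest_rank(values, percent):
    """Smallest value such that at least `percent`% of values are <= it.

    `percent` is expected to be an integer between 1 and 100.
    """
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, avoiding floating point rounding.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical days spent between "ready for production" and release.
days_to_release = [1, 2, 2, 3, 3, 4, 5, 6, 9, 30]

p90 = percentile_nearest_rank(days_to_release, 90)
print(f"90% of issues were released within {p90} days")
```

A single 30-day outlier does not move the 90th percentile here, which is why it is a useful way to describe the tail without being dominated by a handful of extreme issues.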

This is very good news: the tail end of issues that used to wait for a long time before getting released has dramatically diminished.

A note on data quality

With some issues taking more than 30 days to get released, it may be surmised that developers have simply forgotten to update their status in some cases.

However, when looking at several of these outliers, I found they were indeed issues that took a lot of time. In hindsight, they probably should have been broken down into smaller issues, and/or it took several releases to fully get the work done. These outliers are indeed indicative of something we care about.

Changes we implemented in 2023

Preview environments

As I explained earlier, our release process is a simple, single release train. The pain point that had been building up for a while was that we were only able to QA features after they had been merged. We worked around this problem in 2022 by introducing a weekly merge freeze.

As the engineering team grew, the merge freeze became a blocker: engineers would only have a very short window of time when they could merge, for fear of having to push back the release if we couldn’t QA features in time.

The solution I implemented was to give developers the ability to create preview environments. These are independent environments, running the code they want to merge, where people can QA the feature before it gets merged in the main branch.

This has been a boon for engineers, as they are not constrained by our release schedule to get their work QA’d. It has also been well received by stakeholders, as we are never in a situation where one untested feature holds up a whole release anymore.

Actual teams

For a while at Hokodo, everyone in the backend engineering team was responsible for everything. Obviously people became more entrenched in particular areas, but we didn’t want people to not work on things because it was outside of their comfort zone. “Never not my responsibility” was one of our mottos.

This worked well for a while. However, as both the team and the product grew, pain points appeared:

Cognitive load got too high: it became impossible for any one person to understand everything happening in the product
Because of this, people specialised in certain areas of the business by default
Without clear teams, or with short-lived teams dedicated to building a feature, the responsibility of knowing, and fixing, a particular piece of the system was thrust on individuals, and not teams

It became clear that our strategy wasn’t working, as engineers were getting stretched thin, and for multiple business domains there was only one person knowledgeable enough to fix issues.

Following the advice outlined in Team Topologies, and with the help of our recently hired VP of Engineering, we put in place long-lived teams, each responsible for distinct business domains. Implementing a drastic change like this in an already established team, while continuing to deliver value, was a difficult task – but we rose to the challenge.

The goal was to improve developers’ lives, by having more specialised teams and clearer ownership. Hopefully this improvement would reflect in our ability to deliver value.

Comparing (almost) three years’ worth of measures

Instead of looking at the same graphs and comparing them year over year, let’s look at the percentage of unreleased issues after a certain number of days, for each period of time (2021, 2022 and 2023).
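As a minimal sketch of how such a curve can be computed: for each day, count the share of issues whose duration exceeds that day. The durations and the period labels below are made up for illustration and are not our real data.

```python
def percent_unreleased(durations, day):
    """Percentage of issues still unreleased after `day` days."""
    still_unreleased = sum(1 for d in durations if d > day)
    return 100 * still_unreleased / len(durations)


# Hypothetical durations in days, grouped by period.
durations_by_period = {
    "period A": [2, 5, 8, 14, 30],
    "period B": [1, 3, 4, 9, 12],
}

# One curve per period: the value at index `day` is the share of issues
# that were still unreleased after that many days.
curves = {
    period: [percent_unreleased(durations, day) for day in range(0, 31)]
    for period, durations in durations_by_period.items()
}

for period, curve in curves.items():
    print(period, "unreleased after 7 days:", curve[7], "%")
```

A curve that sits lower at every day means issues were released faster across the board, which is easier to read than comparing histograms side by side.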

First, for the cycle time:

Here we can see that we did get better year over year. We already compared the 2022 data to the 2021 data earlier; here we’re seeing the same improvement in a different form. We can also see that this trend continued in 2023, when we improved our cycle time for the second year in a row.

Now let’s look at the number of days it takes for an issue to get released once it’s ready for production:

Comparing 2022 data to 2021 data, there isn’t a clear improvement: the tail of issues is significantly shorter, but looking at five days and fewer, we used to be better in 2021.

For 2023 the data is much clearer, and we improved across the board.

Conclusion

Imperfect measures, and gaming the system

Something I touched on in my previous post is that our measures are worthwhile only as long as they don’t become targets, and they are not being gamed.

Maybe the improvements we see are only caused by smaller issues: we wrote tickets with a smaller scope, so they take less time to be delivered. We didn’t make any particular effort towards this, but if it did happen, I don’t think it would explain our improvements across the board, and it would still mean we improved visibility. Stakeholders have a better idea of when to expect an issue to be delivered, and that is the main thing they care about.

An easy way to deliver value quicker is to compromise on quality or security. We held firm on both: we are dealing with customers’ money, and we have always been committed to getting things right.

The right direction

Even if the measures may have been unconsciously gamed, and are imperfect, we are seeing clear improvements across the board. Considering the fact that the engineering team grew from seven people in the beginning of 2021 to twenty people today, and that the product is far more complex than it used to be, we can be sure that we’re doing something right, and we’re going in the right direction. 🚀
