In our previous article, How to start a data-driven insurtech… without any data, we talked about how we ingested and aggregated data from different sources into one central database in order to build our credit risk models.
After successfully developing our first credit risk model, which outperformed traditional credit bureaus, our main goal has been to accelerate model development. To achieve this, our data science team reviewed every step required to generate new credit models, identifying where there was room for improvement or automation.
We believe that the challenges we faced in the context of credit risk modelling are very similar to the challenges that most data science teams face when moving from a one-off success in an R&D setting to a repeatable and scalable process, regardless of what they seek to optimise. In our case, we continuously strive to become better at predicting credit risk, but the operational constraints would remain the same if you were modelling customer behaviour, detecting fraud, or tackling any similar problem.
We first looked at the different Data Science IDEs (Integrated Development Environments) available on the market such as Dataiku, Alteryx, KNIME, RapidMiner, or Jupyter notebooks.
These IDEs are great tools for a broad range of use cases. However, they are often too generic and require users to bend to the system to meet their more advanced needs. They also tend to target audiences with heterogeneous technical backgrounds: Dataiku, for instance, proves really useful for less technical users who need to carry out some analytics.
However, the needs of our tech and data teams were pretty specific:
When building a large and constantly evolving codebase, it is essential that segments of the code can be quickly read and understood. Most IDEs come with a set of built-in tools, but when custom enhancements are required, it can be messy to communicate changes to the team and difficult to cascade them across projects;
Properly tested code minimises the small deleterious impacts that accumulate in a codebase and allows developers to quickly identify bugs. When making changes to the data flow through these IDEs, there is no built-in regression test to guarantee that the changes have no impact elsewhere in the codebase;
There is one project, so there should be one codebase. Team members must be able to work in parallel on the same project without disturbing the work of others. Most IDEs keep logs of all the changes made to a project, but do not implement git’s wonderful set of tools for tracking and merging changes;
These IDEs also force the design thinking to be sequential: you progress one step at a time until you reach the end of the modelling process. This prevents us from approaching the problem more holistically, to ensure that there won’t be consequences later in the process and that code can be reused easily and flexibly during change management.
As an insurtech company, one of our basic requirements is to own the core components of our tech stack, to be able to improve them over time, and to fit them to the market's needs. As discussed in our previous article, we do not carry the legacy issues that hamper traditional insurers. Most of the time, these insurers have to rely on third parties because they lack the agility and capacity to execute, or they carry grandfathered issues caused by decisions taken years ago or by the acquisition of multiple businesses, and therefore systems. By developing our own framework, we are able to shape it to our custom needs instead of compromising our goals to fit a market solution. Enter HOODS (Hokodo Object-Oriented Data Science).
What is HOODS?
HOODS is a modular framework for data science that we have designed and implemented here at Hokodo. It leverages Python’s strengths as an object-oriented language by assigning distinct and decoupled behaviours to custom objects. These objects are then assembled into an end-to-end data science pipeline, which includes data extraction, feature engineering, analysis, and modelling. Critically, these objects can be used and reused in different contexts. By granting ourselves ownership of the tools and processes, we have in effect created our own IDE, giving us full control over what is happening under the hood.
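To make the idea concrete, here is a minimal sketch of what "distinct and decoupled behaviours assigned to custom objects" might look like in Python. The class and method names (`Tool`, `Pipeline`, `transform`, `run`) are purely illustrative assumptions, not the actual HOODS API:

```python
# Illustrative sketch only -- not the real HOODS code.
from abc import ABC, abstractmethod

import pandas as pd


class Tool(ABC):
    """A decoupled, reusable transformation applied to a DataFrame."""

    @abstractmethod
    def transform(self, df: pd.DataFrame) -> pd.DataFrame: ...


class DropDuplicates(Tool):
    """One concrete behaviour, independent of any particular model."""

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.drop_duplicates()


class Pipeline:
    """Assembles independent tools into an end-to-end flow."""

    def __init__(self, tools: list[Tool]):
        self.tools = tools

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        for tool in self.tools:
            df = tool.transform(df)
        return df


df = pd.DataFrame({"a": [1, 1, 2]})
result = Pipeline([DropDuplicates()]).run(df)
```

Because each tool only knows about its own transformation, new tools can be added or swapped without touching the pipeline itself.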
How does it work?
Building a credit model works a bit like assembling IKEA furniture, which is why HOODS can be explained using very similar concepts. The diagram below illustrates the different components of HOODS and how they interact with each other:
1. Tools
Tools are at the core of HOODS; they are a collection of specific transformations that we apply to our data sets. An example of such a transformation could be the removal of outliers. Tools are independent of the models, which means they can be reused when expanding to new scenarios.
In the context of the furniture assembly, HOODS Tools would be equivalent to your toolbox: a collection of screwdrivers, hammers, etc.
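As an illustration of such a tool, here is a sketch of an outlier-removal transformation using the common 1.5 × IQR rule. The class name and interface are hypothetical and do not reflect the actual HOODS implementation:

```python
# Illustrative sketch of an outlier-removal tool -- not the real HOODS code.
import pandas as pd


class RemoveOutliers:
    """Drops rows whose value in `column` falls outside factor * IQR."""

    def __init__(self, column: str, factor: float = 1.5):
        self.column = column
        self.factor = factor

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        q1 = df[self.column].quantile(0.25)
        q3 = df[self.column].quantile(0.75)
        iqr = q3 - q1
        lo = q1 - self.factor * iqr
        hi = q3 + self.factor * iqr
        return df[df[self.column].between(lo, hi)]


df = pd.DataFrame({"amount": [10, 12, 11, 13, 500]})
clean = RemoveOutliers("amount").transform(df)  # the 500 row is dropped
```

Because the tool is parameterised by a column name rather than tied to any one model, the same object can be reused across data sets.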
2. Inputs, transformations & outputs
Each tool defines the inputs, transformations and outputs that will be used for a given step of the data science pipeline, as well as the libraries required by the tool functions. These libraries include cutting-edge and well-maintained packages such as pandas, scikit-learn, Matplotlib, SciPy or NumPy.
For example, one tool could be a screwdriver, which has its own characteristics such as the shape and size, with specific transformations like screw or unscrew.
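A tool declaring its inputs, transformation and outputs up front might look like the sketch below. The attribute names (`inputs`, `outputs`, `requires`) are our own illustration, not the real HOODS conventions:

```python
# Illustrative sketch -- attribute names are invented, not the HOODS API.
import pandas as pd


class FillMissing:
    inputs = ["raw_frame"]       # dataframe(s) the tool expects
    outputs = ["filled_frame"]   # dataframe(s) the tool produces
    requires = ["pandas"]        # libraries the tool's functions depend on

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Replace missing numeric values with the column mean."""
        return df.fillna(df.mean(numeric_only=True))


df = pd.DataFrame({"x": [1.0, None, 3.0]})
out = FillMissing().transform(df)  # the gap is filled with the mean, 2.0
```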
3. Data pipeline
One of the backbones of HOODS is what we call a data pipeline.
It can be seen as the end-to-end structure of the modelling process. Like any process, it is composed of different steps that are executed sequentially, each step being a chain of instructions to be executed. In the context of HOODS, an instruction is the use of a specific tool from the HOODS toolbox.
Similar to when you go through assembly instructions, you follow each of the steps, one by one using the right set of tools and components, forming the final furniture.
4. Instructions
As explained above, a set of instructions has to be executed at each step of the process. Each instruction is defined by an input (most of the time a dataframe), a set of transformations to be applied using dedicated tools, and finally an output (the transformed dataframe).
Inputs could be compared to the specific component of the furniture; transformations to the assembly of those components using a set of tools like a screwdriver or a hammer; and output to a part of the furniture assembled together.
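Conceptually, a pipeline of sequential steps, each made of a chain of instructions, can be sketched as follows. The step names and transformation functions are invented for illustration:

```python
# Illustrative sketch of steps as chains of instructions -- not HOODS itself.
import pandas as pd


def dedupe(df: pd.DataFrame) -> pd.DataFrame:
    """Instruction: remove duplicate rows."""
    return df.drop_duplicates()


def add_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """Instruction: derive a paid/billed ratio feature."""
    return df.assign(ratio=df["paid"] / df["billed"])


# Each step is a chain of instructions; steps run in order.
steps = {
    "clean": [dedupe],
    "features": [add_ratio],
}

df = pd.DataFrame({"billed": [100, 100, 200], "paid": [50, 50, 100]})
for step_name, instructions in steps.items():
    for instruction in instructions:
        df = instruction(df)  # output of one instruction feeds the next
```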
5. Parameter files
To avoid modifying the code every time we want to change the steps in a pipeline (such as adding a new transformation), we have created parameter files. These files are used as a reference by the pipeline: the pipeline reads them in order to execute the instructions in the right way and order.
As with the furniture example, this represents the characteristics of the screwdrivers, the size of each component, which components go together, etc…
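A parameter-file-driven pipeline might look like the sketch below, using JSON purely for illustration; the actual HOODS parameter format is not described here:

```python
# Illustrative sketch: the pipeline is driven by a parameter file,
# so changing steps means editing data, not code. Format is assumed.
import json

import pandas as pd

params_text = """
{
  "steps": [
    {"tool": "dropna", "kwargs": {}},
    {"tool": "rename", "kwargs": {"columns": {"amt": "amount"}}}
  ]
}
"""

# A toolbox mapping tool names to transformations (hypothetical).
TOOLBOX = {
    "dropna": lambda df, **kw: df.dropna(**kw),
    "rename": lambda df, **kw: df.rename(**kw),
}

params = json.loads(params_text)
df = pd.DataFrame({"amt": [1.0, None, 3.0]})
for step in params["steps"]:
    df = TOOLBOX[step["tool"]](df, **step["kwargs"])
```

Adding a transformation is now a matter of appending an entry to the `steps` list in the file, with no change to the pipeline code.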
6. Analysis & visualisation
One of the key steps in any data science project is data analysis and visualisation, and that is why HOODS has a tool fully dedicated to it.
Once all the pre-processing has been applied in the pipeline, HOODS automatically generates a set of plots and visualisations that the data scientist can interact with through Jupyter notebooks.
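As a sketch of what such auto-generated visualisations could look like, assume a simple "one histogram per numeric column" convention; the real HOODS plots are not described in this article:

```python
# Illustrative sketch of auto-generated plots -- not the real HOODS tool.
import matplotlib

matplotlib.use("Agg")  # headless backend so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def auto_plots(df: pd.DataFrame) -> dict:
    """Build one histogram per numeric column, keyed by column name."""
    figures = {}
    for col in df.select_dtypes("number").columns:
        fig, ax = plt.subplots()
        ax.hist(df[col].dropna())
        ax.set_title(col)
        figures[col] = fig
    return figures


df = pd.DataFrame({"amount": [1, 2, 2, 3], "label": ["a", "b", "b", "c"]})
figs = auto_plots(df)  # one figure, for the numeric "amount" column
```

In a notebook, each returned figure renders inline, giving the data scientist an immediate view of every numeric feature after pre-processing.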
7. Apply to pipeline
Once the data scientists are satisfied with their analysis, the parameters used for that analysis are applied to the pipeline. This allows us to do discovery work in the notebook before automatically propagating the results to the pipeline.
What are the main advantages?
Using such a framework brings a variety of advantages; the key ones are listed below:
At Hokodo, we make no distinction between the different teams, and HOODS is the fruit of this philosophy. By drawing on the different skill sets in the company (data engineering, software engineering, machine learning, data analysis, credit risk …) we have managed to create a simple and robust framework that enables us to build, in a scalable way, any machine learning model we will need in the future.
We believe that by investing the time to design and implement our own data science framework, we will reduce the time needed to develop models, increase collaboration within the team, react quickly to the real world, and stay on top of innovation.