How to start a data-driven insurtech…without any data

Data is the new gold, and forward-thinking businesses are in a race to become more data-centric. Even insurance companies are getting on board, motivated by the promise of offering better products, more accurate pricing and more tailored service. But, as they’re all quickly discovering, building a data-centric business isn’t as straightforward as it sounds.

Take incumbent insurers. They aren’t short of data, but where they’re hitting the buffers is in organising and making use of it. They’re grappling with inconsistent data sources, legacy systems, complex data architecture and the headache of executing across vast organisations. Not to mention dealing with regulatory issues on top.

Then at the other end of the scale, a young startup like Hokodo has exactly the opposite problem. We have all the freedom and flexibility in the world to design our data architecture from scratch, seamlessly consolidating all our data sources. But we have one big stumbling block — we don’t have vast treasure troves of data to work with. And the only way to get our own data set is to start trading… but to do that, we need data. So, it’s a catch-22 situation.

And we’re not alone — this same data dilemma is plaguing many startups across all sorts of industries. Thankfully, we’ve found a few sneaky shortcuts…

Can we change the question?

When working out what data you require, you first need to think about what question you want to answer. In our case, as providers of invoice protection, the key question was: “What’s the probability of a buyer not paying the insured invoice to the seller, thus giving rise to a claim?”

In an ideal world, we would, of course, have a mountain of claims and invoice history to draw on. But as we don’t, we can change the question and ask instead: “What’s the probability that a business will become insolvent?”

Since a company going through insolvency wouldn’t be in a position to pay its creditors, we can use this as our dependent variable for model development. This makes it infinitely easier to get hold of the data that we need.
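In practice, changing the question means changing the label we attach to each company. A minimal sketch of that labelling step is below; the records, field names and 12-month horizon are all illustrative, not Hokodo’s actual schema:

```python
from datetime import date

# Hypothetical records: a company registry extract plus insolvency
# notices (field names are made up for the example).
companies = [
    {"company_number": "01234567", "name": "Acme Widgets Ltd"},
    {"company_number": "07654321", "name": "Beta Traders Ltd"},
]
insolvencies = {
    # company_number -> date the insolvency notice was published
    "07654321": date(2018, 11, 2),
}

def label_insolvency(companies, insolvencies, as_of, horizon_days=365):
    """Attach the binary dependent variable: did the company enter
    insolvency within `horizon_days` of the observation date?"""
    labelled = []
    for c in companies:
        notice = insolvencies.get(c["company_number"])
        insolvent = (
            notice is not None
            and 0 <= (notice - as_of).days <= horizon_days
        )
        labelled.append({**c, "insolvent_within_horizon": int(insolvent)})
    return labelled

rows = label_insolvency(companies, insolvencies, as_of=date(2018, 1, 1))
```

With labels like this in place, any standard classifier can be trained on whatever independent variables we manage to gather.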

Maximise open data sources

The next step is to identify the available data sources that contain this information; in our case, a record of liquidated companies in the UK. And as luck would have it, The Gazette has made its official records of all UK business insolvencies publicly available, giving us our main data source for our key dependent variable. A similar database will also be available for all EU member states by mid-2019 through the European e-Justice Portal.

There are also tonnes of open data sources available for independent variables. In the UK, for example, Companies House provides a variety of information for free through an API, which in our case gives us critical information about insolvent companies, including industry, location and financial statements.
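Working with a profile payload of this kind mostly comes down to picking out the fields you need. The sketch below parses a hypothetical response shaped like a Companies House company profile; the exact field names should be checked against the live API documentation:

```python
import json

# Hypothetical excerpt of a company-profile response; field names
# mimic the Companies House API but are assumptions for this example.
sample_response = json.dumps({
    "company_name": "ACME WIDGETS LTD",
    "company_status": "liquidation",
    "date_of_creation": "2010-04-12",
    "sic_codes": ["46900"],
    "registered_office_address": {"locality": "London", "postal_code": "EC1A 1AA"},
})

def extract_features(raw):
    """Pull out the independent variables we care about:
    industry, location and company status."""
    profile = json.loads(raw)
    return {
        "status": profile.get("company_status"),
        "industry_sic": profile.get("sic_codes", []),
        "locality": profile.get("registered_office_address", {}).get("locality"),
        "incorporated": profile.get("date_of_creation"),
    }

features = extract_features(sample_response)
```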

Fill in the gaps with external data providers

Open data sources are a great first port of call, but they can be quite limited in terms of breadth and quality, which is why it’s worth enriching them with external data. There’s a whole smorgasbord of data providers out there offering both B2B and B2C data through APIs, and they can be a great addition to a data universe.

When selecting your data provider, just be clear on exactly what your priorities are (costs, integration, data items, coverage, reliability) so you can assess the different options effectively. At Hokodo, we’re discovering new data sources and data providers every day, and we’ve launched an R&D programme to explore how they can enhance the predictive power of our models.
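One simple way to make that assessment concrete is a weighted scorecard over the priorities listed above. The weights, provider names and scores below are entirely made up for illustration:

```python
# Each criterion is scored 1-5 (higher is better; cost is scored as
# value-for-money). Weights reflect hypothetical priorities.
weights = {"cost": 0.2, "integration": 0.2, "data_items": 0.25,
           "coverage": 0.2, "reliability": 0.15}

providers = {
    "ProviderA": {"cost": 3, "integration": 5, "data_items": 4,
                  "coverage": 4, "reliability": 5},
    "ProviderB": {"cost": 5, "integration": 3, "data_items": 3,
                  "coverage": 5, "reliability": 4},
}

def weighted_score(scores, weights):
    """Combine per-criterion scores into a single comparable number."""
    return sum(weights[k] * scores[k] for k in weights)

ranked = sorted(providers,
                key=lambda p: weighted_score(providers[p], weights),
                reverse=True)
```

The point isn’t the arithmetic; it’s that writing the weights down forces you to agree on what actually matters before signing a contract.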

Build an integrated analytics database

Once you have your core data sources ready, you can create a data pipeline that ingests information from all of them and standardises it into one common data model. You can do this cheaply and easily with technologies such as Python, Django and AWS.
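The heart of that standardisation step is a per-source mapping from raw field names onto the common data model. A minimal sketch, with invented source names and fields:

```python
# Each source declares how its raw field names map onto the common
# data model. Source and field names are illustrative assumptions.
FIELD_MAPS = {
    "companies_house": {"company_number": "company_id",
                        "company_name": "name",
                        "sic_codes": "industry"},
    "provider_x": {"reg_no": "company_id",
                   "legal_name": "name",
                   "sector": "industry"},
}

def standardise(source, record):
    """Rename one raw record's fields into the common data model,
    dropping anything the mapping doesn't know about."""
    mapping = FIELD_MAPS[source]
    return {common: record[raw]
            for raw, common in mapping.items() if raw in record}

row = standardise("provider_x", {"reg_no": "07654321",
                                 "legal_name": "Beta Traders Ltd",
                                 "sector": "wholesale"})
```

However the pipeline is hosted, keeping these mappings declarative makes adding a new source a configuration change rather than a code change.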

At Hokodo, integration is at the heart of everything we do and building an integrated analytics database brings a number of advantages:

  • One standardised version of the truth: this has proven very useful during the feature engineering phase, ensuring a full audit trail and data consistency;
  • Tracking historical changes: having one central data source means we can retrieve company data from any point in time, allowing us to perform back-testing extremely efficiently;
  • Being able to scale: third normal form modelling (a database design approach that reduces duplication of data and ensures referential integrity) lets us expand the data model or include new data sources seamlessly;
  • Data integrity: a relational database provides greater reliability than, for example, flat files or data lakes;
  • Integration with the product: our product has its own data model and produces its own data, so we can easily sync the two up for reporting, analytics and machine learning.
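The “tracking historical changes” point deserves a concrete illustration: if each record carries a validity interval, any past state of a company can be reconstructed, which is what makes back-testing honest. A sketch, with invented records:

```python
from datetime import date

# Each row carries a validity interval; valid_to=None marks the
# current row. Records are made up for the example.
history = [
    {"company_id": "X1", "name": "Oldco Ltd",
     "valid_from": date(2015, 1, 1), "valid_to": date(2017, 6, 30)},
    {"company_id": "X1", "name": "Newco Ltd",
     "valid_from": date(2017, 7, 1), "valid_to": None},
]

def as_of(history, company_id, when):
    """Return the record that was valid for `company_id` on `when` —
    the point-in-time lookup that back-testing relies on."""
    for row in history:
        if row["company_id"] != company_id:
            continue
        if row["valid_from"] <= when and (
                row["valid_to"] is None or when <= row["valid_to"]):
            return row
    return None

snapshot = as_of(history, "X1", date(2016, 3, 1))
```

In a real database this would be an indexed temporal table rather than a Python list, but the query logic is the same.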

Transition to an internal model

Now that we have our first generation of predictive models in place, we can slowly pivot to using our own data once enough volume is coming in. And as we do so, our models will inevitably become more accurate and more tailored to the product, since they’ll be built on data from real customers rather than obtained from an external source.
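One hedged way to picture such a gradual pivot is blending the external-data model with the internal one, shifting weight as internal volume grows. The ramp shape and threshold below are assumptions for illustration only, not Hokodo’s actual method:

```python
# Illustrative blend of an external-data model and an internal model;
# the linear ramp and the 10,000-observation threshold are made up.
def blended_probability(p_external, p_internal, n_internal_obs,
                        full_trust_at=10_000):
    """Weight the internal model by how much of our own data backs it."""
    w = min(1.0, n_internal_obs / full_trust_at)
    return (1 - w) * p_external + w * p_internal

early = blended_probability(0.04, 0.10, n_internal_obs=1_000)   # mostly external
later = blended_probability(0.04, 0.10, n_internal_obs=10_000)  # fully internal
```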

Data is integral to building the next generation of insurance companies, and tools and techniques like these are vital for startups like Hokodo to get their products and services off the ground. It’s a similar story in other industries too, where the barriers to entry are being torn down and the old rules have ceased to apply. And this is only the beginning, as it’s still early days for so many data pioneers. The data race is well and truly on.