22.3.2019

How to start a data-driven paytech… without any data

Arnaud Alepee

Head of Data Science

Data is the new gold, and forward-thinking businesses are in a race to become more data-centric. A company that is able to harness data will be able to improve its product offering, provide more accurate pricing and better-tailored services. But, as many discover, building a data-centric business isn’t as straightforward as it sounds.

Take incumbent financers. They aren’t short of data, but where they’re hitting the buffers is in organising and making use of it. They’re grappling with inconsistent data sources, legacy systems, complex data architecture and the headache of executing across vast organisations. Not to mention dealing with regulatory issues on top.

Then at the other end of the scale, a startup like Hokodo has exactly the opposite problem. We have all the freedom and flexibility in the world to design our data architecture from scratch, seamlessly consolidating all our data sources. But we have one big stumbling block — we don’t have vast treasure troves of data to work with. Our credit models, for example, utilise data on hundreds of thousands of companies to ensure their performance and robustness; it is not possible to garner this volume and variety of data from our existing customers.

And we’re not alone — this same data dilemma is plaguing many startups across all sorts of industries. Thankfully, we’ve found a few sneaky shortcuts…

Can we change the question?

When working out what data you require, you first need to think about what question you want to answer. In our case, as providers of BNPL, the key question was: “What’s the probability of a buyer not repaying their loan?”

In an ideal world, we would, of course, have a mountain of loan history to draw on, but as we don’t, we can change the question, asking instead: “What’s the probability that a business will become insolvent?”

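Since a company going through insolvency wouldn’t be in a position to pay its creditors, we can use insolvency as a proxy for non-repayment and make it the dependent variable for model development. This makes it far easier to get hold of the data that we need, because insolvency records are public whereas loan repayment histories are not.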
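To make the switch of question concrete, here is a minimal sketch of how an insolvency record could be turned into a binary target. The 12-month horizon, the function name and the sample companies are all assumptions for illustration; the article does not say which horizon or labelling rule Hokodo uses.

```python
from datetime import date

def label_insolvency(snapshot_date, insolvency_date, horizon_days=365):
    """Return 1 if the company became insolvent within `horizon_days`
    after the snapshot date, otherwise 0.

    `horizon_days=365` is an assumed horizon, not one from the article.
    """
    if insolvency_date is None:
        return 0  # no insolvency on record
    delta = (insolvency_date - snapshot_date).days
    return 1 if 0 < delta <= horizon_days else 0

# Hypothetical companies mapped to their insolvency notice date (None = none).
companies = {
    "A": date(2018, 6, 1),   # insolvent within the horizon
    "B": None,               # never insolvent
    "C": date(2020, 3, 1),   # insolvent, but after the horizon
}

snapshot = date(2018, 1, 1)
labels = {name: label_insolvency(snapshot, d) for name, d in companies.items()}
```

Fixing a snapshot date and looking forward from it keeps the target free of information that would not have been known at scoring time.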

Maximise open data sources

The next step is to identify the available data sources that contain this information; in our case, a record of liquidated companies in the UK. And as luck would have it, The Gazette has made its official records of all UK business insolvencies publicly available, giving us our main data source for our key dependent variable. A similar database is also available for all EU member states through the European e-Justice Portal.

There are also tonnes of open data sources available for independent variables. In the UK, for example, Companies House provides a variety of information for free through an API. In our case, that gives us critical information about insolvent companies, including industry, location and financial statements.
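As a rough illustration of what a call to such an API might look like, the sketch below only builds the HTTP request for a company profile and does not send it. The base URL, the `/company/{number}` path and the use of HTTP Basic authentication with the API key as the username are assumptions based on the public Companies House API documentation and should be checked against the current docs; the company number and key are placeholders.

```python
import base64
import urllib.request

# Assumed base URL; verify against the current Companies House documentation.
BASE_URL = "https://api.company-information.service.gov.uk"

def build_company_request(company_number, api_key):
    """Build (but do not send) a request for a company profile.

    Assumes HTTP Basic auth with the API key as username and an empty password.
    """
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        f"{BASE_URL}/company/{company_number}",
        headers={"Authorization": f"Basic {token}"},
    )

# Placeholder company number and key, for illustration only.
request = build_company_request("00000000", "my-api-key")

# Sending it would look like this (requires a real key and network access):
# import json
# with urllib.request.urlopen(request) as response:
#     profile = json.load(response)
```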

Fill in the gaps with external data providers

Open data sources are a great first port of call. However, they can be quite limited in terms of breadth and quality, which is why it’s good to enrich them with external data. There’s a whole smorgasbord of different data providers out there offering both B2B and B2C data through APIs, and they can be a great addition to a data universe.

When selecting your data provider, just be clear on exactly what your priorities are (for example, costs, integration, data items, coverage and reliability) so you can assess the different options effectively. At Hokodo, we’re discovering new data sources and data providers every day.

Integrate

Once you have your core data sources ready, you can create a data pipeline that allows you to ingest that information from all sources and standardise it into one common data model. We achieve this using Python, Django and various packages; the production instance is then hosted by Amazon Web Services (AWS).
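The standardisation step can be pictured as a set of small adapters that map each source's records onto one shared shape. The sketch below uses plain Python dataclasses rather than Django models, and every field name, source name and sample key is hypothetical; it is not Hokodo's actual data model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Company:
    """Hypothetical common data model shared by all sources."""
    company_number: str
    name: str
    sic_code: Optional[str]
    postcode: Optional[str]
    source: str

def normalise_number(value) -> str:
    # Pad to 8 characters so "1234567" and "01234567" refer to the same company.
    return str(value).strip().upper().zfill(8)

def from_registry(raw: dict) -> Company:
    """Adapter for an assumed open-registry record layout."""
    return Company(
        company_number=normalise_number(raw["company_number"]),
        name=raw["company_name"].strip(),
        sic_code=(raw.get("sic_codes") or [None])[0],
        postcode=raw.get("postcode"),
        source="registry",
    )

def from_provider(raw: dict) -> Company:
    """Adapter for an assumed external-provider record layout."""
    return Company(
        company_number=normalise_number(raw["RegNo"]),
        name=raw["LegalName"].strip(),
        sic_code=raw.get("Industry"),
        postcode=raw.get("PostCode"),
        source="provider",
    )
```

Keeping one adapter per source means a new provider can be added without touching the common model or the other adapters.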

At Hokodo, we launched an R&D programme to explore how new data sources and data providers can enhance the predictive power of our models.

At Hokodo, integration is at the heart of everything we do and building an integrated analytics database brings a number of advantages:

One standardised version of the truth: This has proven to be very useful during the feature engineering phase by ensuring a full audit trail and data consistency;
Tracking historical changes: Having one central data source means we can retrieve company data from any point in time, allowing us to perform back-testing extremely efficiently;
Being able to scale: Using 3rd normal form modelling (database design approach to reduce the duplication of data and ensure referential integrity) allows us to expand the data model or include new data sources in a seamless fashion;
Data integrity: It provides greater reliability than using files or data lakes, for example;
Integration with the product: Our product has its own data model and produces its own data, so we can easily sync the two up for reporting, analytics and machine learning.
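To illustrate the point about tracking historical changes, here is a minimal sketch of a point-in-time lookup over a versioned table, using an in-memory SQLite database. The table name, columns, validity-interval convention and sample rows are assumptions for illustration, not a description of Hokodo's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE company_version (
        company_number TEXT NOT NULL,
        status         TEXT NOT NULL,
        valid_from     TEXT NOT NULL,  -- ISO date, inclusive
        valid_to       TEXT            -- ISO date, exclusive; NULL = current
    )
    """
)
# Hypothetical history: the company was active, then entered liquidation.
conn.executemany(
    "INSERT INTO company_version VALUES (?, ?, ?, ?)",
    [
        ("01234567", "active", "2016-01-01", "2018-09-01"),
        ("01234567", "liquidation", "2018-09-01", None),
    ],
)

def status_as_of(conn, company_number, as_of):
    """Return the status that was valid on the ISO date `as_of`, or None."""
    row = conn.execute(
        """
        SELECT status FROM company_version
        WHERE company_number = ?
          AND valid_from <= ?
          AND (valid_to IS NULL OR valid_to > ?)
        """,
        (company_number, as_of, as_of),
    ).fetchone()
    return row[0] if row else None
```

For back-testing, a model is scored only on what `status_as_of` (and equivalent lookups for other attributes) would have returned on the scoring date, which avoids leaking later information into the test.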

Transition to an internal model

Now that we have our first generation of predictive models in place, we are able to instantly credit score potential customers, offer BNPL and collect data. Once we have enough volume coming in, we can begin retraining the models with our own data. As we do so, we expect our models to become more accurate and more tailored to the product, since they will be trained on data from real customers rather than on data obtained from an external source.

Data is integral to building the next generation of fintech companies, and tools and techniques like these are vital for startups like Hokodo to get their products and services off the ground. It’s a similar story in other industries too, where the barriers to entry are being torn down and the old rules have ceased to apply. And this is only the beginning, as it’s still early days for so many data pioneers. The data race is well and truly on.
