
The promise of big data is delivered in actionable analytics. By applying advances in machine learning and artificial intelligence, we stand at the cusp of true innovation. Of finding ways to truly improve how we serve our customers. And of discovering new answers to some of the world’s biggest questions. We are able to better respond to epic challenges that affect us all, from disease to climate crisis to money laundering. And we are discovering ways to become more inclusive, like offering financial services to the previously unbanked.

By collaborating over data and insights across functions, borders and industries, we get one step closer to these innovative, world-changing solutions.

There’s only one problem.

Regulations and respect for data privacy mean you simply cannot share raw customer data outside your organization—and often you can’t share internally, either.

But what if you could have all the innovation based on real data analytics … without any real data?

Synthetic data is artificially generated data that replicates the statistical attributes of real-world data, but doesn’t contain any personally identifiable information. Since synthetic data retains the statistical characteristics of the original data while remaining utterly artificial, you are able to share it with chosen partners and providers to both leverage their tools and allow them to innovate with the value held in your data. This reduces the risk of exposing raw data as well as the time and money that lengthy provisioning processes usually take.

Logic20/20 has already written about the opportunity of using synthetic data for data portability. Synthetic data makes it safe to share the value of your data across organizational and geographic silos, and it accelerates business transformation by offering a way to securely move data to the cloud.

Today we will talk about how Hazy has taken the potential to innovate on top of synthetic data to the next level with synthetic transactional data, how this is being used in enterprises right now, and the potential for synthetic data to solve world-scale problems.

 its dependency on a thousand previous points—in sequential order. Small errors can propagate throughout the entire sequence and introduce major deviations. Until recently, figuring out how to generate sequential data that matches the behavior of the raw data has proven a real challenge.
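As a rough illustration of how small errors compound in sequential data, the Python sketch below builds a toy sequence in which every point depends on the one before it. The `generate` function and the 0.1% per-step error are hypothetical choices made for this example; they are not Hazy's generator or a measured error rate.

```python
def generate(steps, step_error=0.0):
    """Toy sequence: each point is the previous point plus 1.0.

    step_error is a hypothetical per-step inaccuracy added on top.
    """
    value, series = 0.0, []
    for _ in range(steps):
        value += 1.0 + step_error
        series.append(value)
    return series


exact = generate(1000)
slightly_off = generate(1000, step_error=0.001)

# The error at the first point is tiny, but it accumulates along the sequence.
first_gap = slightly_off[0] - exact[0]
final_gap = slightly_off[-1] - exact[-1]
print(f"gap after 1 point: {first_gap:.4f}")
print(f"gap after 1000 points: {final_gap:.4f}")
```

In this toy case, an error of one thousandth of a step at each point grows to roughly one full step after a thousand points, because every later point inherits the errors of all the points before it.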

The Hazy data science team, building on the DoppelGANger generator from Carnegie Mellon University, has been able to generate sequential synthetic data with the highest privacy-utility trade-off available. The quality of that synthetic data is measured in three ways:

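• Similarity: how closely the histogram of the synthetic data matches the histogram of the real data

• Autocorrelation: how closely the correlation between each point and the points before it in the synthetic sequence matches that of the real sequence

• Utility: the ratio of forecasting error between a model trained on synthetic data and a model trained on real data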
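To make the three measures above concrete, here is a minimal Python sketch of one way each could be computed. It is an illustration under stated assumptions, not Hazy's implementation: histogram intersection stands in for similarity, a mean gap between autocorrelation curves stands in for the autocorrelation comparison, and a simple AR(1) forecaster stands in for the forecasting model. All function names and the toy data are hypothetical.

```python
import random
from statistics import fmean


def histogram_similarity(real, synth, bins=20):
    """Histogram intersection on shared bins; 1.0 means identical histograms."""
    lo = min(min(real), min(synth))
    hi = max(max(real), max(synth))
    width = (hi - lo) / bins or 1.0  # guard against constant data

    def normalized_hist(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    return sum(min(a, b) for a, b in
               zip(normalized_hist(real), normalized_hist(synth)))


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    mean = fmean(series)
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(len(series) - lag))
    return cov / var if var else 0.0


def autocorrelation_gap(real, synth, max_lag=10):
    """Mean absolute difference between the two autocorrelation curves."""
    return fmean([abs(autocorrelation(real, k) - autocorrelation(synth, k))
                  for k in range(1, max_lag + 1)])


def fit_ar1(series):
    """Least-squares fit of x[t] = a * x[t-1] + b (assumed toy forecaster)."""
    xs, ys = series[:-1], series[1:]
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx


def forecast_mae(model, series):
    """Mean absolute one-step-ahead forecasting error on a series."""
    a, b = model
    return fmean([abs(a * prev + b - nxt)
                  for prev, nxt in zip(series, series[1:])])


def utility_ratio(real_train, synth_train, real_test):
    """Error of the synthetic-trained model over that of the real-trained one."""
    err_real = forecast_mae(fit_ar1(real_train), real_test)
    err_synth = forecast_mae(fit_ar1(synth_train), real_test)
    return err_synth / err_real


def toy_series(n, seed):
    """Hypothetical stand-in data: an AR(1) process with Gaussian noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = 0.8 * x + rng.gauss(0, 1)
        out.append(x)
    return out


real = toy_series(2000, seed=1)
synth = toy_series(2000, seed=2)   # stands in for a generator's output
holdout = toy_series(500, seed=3)  # real data held back for evaluation

print("similarity:", round(histogram_similarity(real, synth), 3))
print("autocorrelation gap:", round(autocorrelation_gap(real, synth), 3))
print("utility ratio:", round(utility_ratio(real, synth, holdout), 3))
```

Under these definitions, a similarity near 1, an autocorrelation gap near 0, and a utility ratio near 1 would all indicate synthetic data that behaves like the real data.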

Hazy synthetic data satisfies all these metrics and more, while still having the mathematical compliance proof of differential privacy—proof that it can’t be reverse engineered or re-identified through linkage attacks.

This means Hazy is the first synthetic data company to allow organizations to generate and share sequential data that’s high-quality enough to train machine learning models on, while maintaining a measurable, auditable level of privacy.

OK, now that this quality of synthetic data exists, what can you do with it?

Sequential synthetic data can unlock world-scale finance problems

Any organization with extensive customer transaction histories can take advantage of synthetic data, but banks and financial services firms make particularly interesting early adopters of this disruptive force. Some financial institutions have over a century’s worth of customer histories. Open banking then gives them access to additional time-bound datasets like stock measurements, interest rates, and exchange rates. These transactional datasets can be combined to better model the relationships among them, explore scenarios, and weigh trade-offs.

Machine learning models can help answer questions like how a product might behave when you have a combination like high interest rates and low unemployment. Or how Product X should be priced if there’s a third wave of the coronavirus pandemic in Country Y in 2021.

And they can be used to identify spending habits, and anomalies in those habits, to detect customers who are newly at risk of fraud.

In order to innovate and rapidly respond to customer demands, every financial institution needs to work with third-party partners and services. However, sharing real customer data with third parties always carries risk. This is why it can take months to vet partnership candidates and software integrations: that’s how long it can take to get data through the regulatory rigmarole. Even more time is then dedicated to preparing the data to be shared. In that time, these innovative partners may have found new financial backers, or the competition may have built and released a better solution. And these partners may not have been up to snuff anyway.

By using high-quality synthetic data, organizations are able to cut the time it takes to provision highly representative (but completely artificial) data to selected providers from months to days, all while mitigating the risk of using sensitive customer data.

Sequential synthetic data is already helping build better finance

Interestingly, some innovation departments at Fortune 500 companies are using Hazy synthetic data to evaluate third-party innovation partners to answer the same question: How can we help customers manage their finances better?

To answer this, these enterprises seek a collaboration between their innovation teams and pilot clients to identify vulnerable customers—by irregular customer spending habits, for example—and to intervene to promote better financial decision making.

At Nationwide Building Society, Technical Lead Alex Mikhalev says that the demand for synthesizing sequential data came from behavior analytics teams, because user behavior is inherently cyclical. Every month (hopefully), you are paid and your bills are paid. Alex says customers can worry about the final balance, but the bank is thinking broader than that, asking questions like “What if there’s a change in this time-bound transaction data?” Accurately synthesizing transactional data is essential to training algorithms that can identify these patterns and pinpoint when these patterns diverge.
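As a simplified illustration of pinpointing when a cyclical spending pattern diverges, the Python sketch below compares an account's latest monthly total with its own history using a z-score. This is an assumed stand-in technique, not Nationwide's or Hazy's method; the account names, amounts, and threshold are hypothetical.

```python
from statistics import fmean, pstdev


def divergence_score(monthly_totals):
    """Z-score of the latest month's spend against the preceding months."""
    history, latest = monthly_totals[:-1], monthly_totals[-1]
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all counts as a divergence.
        return 0.0 if latest == history[0] else float("inf")
    return (latest - fmean(history)) / spread


def flag_divergent(accounts, threshold=3.0):
    """Names of accounts whose latest month departs from their usual cycle."""
    return [name for name, totals in accounts.items()
            if abs(divergence_score(totals)) > threshold]


# Hypothetical monthly spending totals, oldest month first.
accounts = {
    "steady": [1200, 1180, 1210, 1195, 1205, 1190],
    "diverging": [1200, 1180, 1210, 1195, 1205, 2400],
}

print(flag_divergent(accounts))
```

A production system would more likely learn these patterns with a trained model, which is where representative transactional data, real or synthetic, becomes essential.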

The Nationwide rapid innovation team wanted transactional data to give to third parties tasked with analyzing behaviors of users in order to pinpoint those more financially vulnerable customers, so that Nationwide could better serve them.

In order to evaluate these third-party strategic partners, it is necessary to share banking transaction data with them. However, this is a particularly sensitive kind of data that no financial institution is willing to share without significant governance processes. This causes delays in not only evaluating these partners, but in developing and releasing this essential solution together.

The Nationwide innovation team took their original transactional data and used it to train the Hazy synthetic data generators on premises, which then generated new, highly representative synthetic data. They then shared the new data with third parties, first to evaluate them as potential partners and then to work with accepted strategic partners to build a new platform for better finance.

Nationwide’s Chief Product Owner of Member Data Rob Lee wrote on LinkedIn about the potential of synthetic data:

“Using Synthetic data is the key to really accelerating so much in technology. I believe that we are on the cusp of a completely different way of thinking about data, using the synthetic equivalent allows us to be much much more open with the data in sharing with partners and amongst different teams within Nationwide. After all there is nothing more private for our members than data which has never existed in real life. As we tune the generation of the synthetic data we can create the data at scale that we have never experienced, and yet it reflects reality. Truly amazing.”

Hazy synthetic data lets Nationwide validate how the functionality of potential partner products or services works within the context of Nationwide’s data—without ever allowing that raw data off site.

Most importantly, Hazy cut down this process of provisioning data to third parties from an average of six months to three days. Because sharing data takes less time, more projects can be evaluated and approved, and fewer people and less time are needed to prepare data.

Alex Mikhalev says that “Synthetic data is the enabler of Nationwide doing things more quickly while mitigating risk.”

He goes on to explain that Nationwide chose Hazy synthetic data because it was proven to be “secure, compliant, performant, and resilient.”

Synthetic data effectively reduced time, cost, and risk for Nationwide Building Society by enabling them to generate highly representative synthetic transaction data with which to test out and onboard third-party fintech innovation partners. This allows those partners to quickly show how their innovative machine learning solutions address real-world problems.


Adam Cornille

Adam Cornille is Director of Advanced Analytics at Logic20/20. He is a data science manager and practitioner with over a decade of field experience, and has trained in development, statistics, and management practices. Adam currently heads the development of data science solutions and strategies for improving business maturity in the application of data.
