5 reasons to use machine learning for improved data quality

By Vinod Chandrasekharan, Vice President, Product, Capital One Data Insights

January 11, 2023 | 5 min read

Data is the air that we breathe in the financial services industry. This data offers all organizations and businesses the chance to glean tremendous insights for improving customer experience, operational efficiency, and revenue – if used to its fullest.

But collecting, protecting, aggregating, and accessing all of an enterprise’s data is incredibly complex. The information often sits in so many departmental silos that it’s difficult to monitor and manage. As a result, a recent Great Expectations survey found that 77% of data professionals say they experience data quality issues, and 91% of those believe these problems are hurting company performance.

Machine learning helps data specialists

It’s not that data specialists aren’t good at their jobs. Most are exceptional. The real issue has been that human beings can only do so much. Data teams are increasingly overwhelmed by a constant flood of inaccurate, incomplete, outdated, redundant, and irrelevant digital information.

Turning those challenges into opportunities is a job uniquely suited to machine learning (ML). ML and AI applications are designed to automate and supercharge an organization’s data science and analytics capabilities, helping it make better-informed decisions faster and with far more data.

According to a recent Forrester Consulting study commissioned by Capital One, 53% of data professionals plan to improve business efficiency with ML applications. Yet, the study found that adoption is being held back by struggles with selling the financial merits of ML investments to upper management and by disconnects between the cross-functional teams that would be needed to deploy the technology.

Driving consensus around the importance of ML for solving data quality issues is admittedly no small feat. But Capital One has overcome such roadblocks by educating its teams on the business opportunity behind the technology – and the perils of not adopting it. Our internal education campaign – which can be mirrored by any organization choosing to go down this path – emphasized five key benefits of ML tools for data quality management:

ML automation

Identifying and resolving data quality issues is time-consuming and costly for organizations, and requires subject matter expertise that takes even more time to build.

Automating data quality processes with ML allows IT to offload these tasks while ensuring greater data scrutiny. According to a Monte Carlo and Wakefield Research survey, 75% of data professionals say it takes four or more hours to detect data quality issues and about nine hours to resolve them. Not surprisingly, Gartner says this poor data quality costs organizations an average of $12.9 million annually.

ML automation liberates IT to focus on more strategic work. At Capital One, for example, this freedom has enabled our IT teams to spend more time championing innovation. We’ve tested, understood, and launched new products more rapidly. We can provide more consistent digital experiences for employees, partners, and customers.

Data quality at scale

A major challenge with data quality is how teams correct issues once they are identified.

The default method would be to write one or several business rules and push them out to address the specific data sets triggering errors. But with data and organizations constantly growing, many teams are burning themselves out while only addressing a portion of the many pressing problems.

ML technology can help organizations overcome this by enabling them to create the rules or logic for addressing data quality issues once and then efficiently roll those fixes out at scale. What’s more, it spares organizations the time and expense of hiring large numbers of IT staffers just to keep data clean and useful.
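As a rough illustration of "create the rule once, apply it everywhere," the sketch below derives a simple range rule from trusted historical values and then applies it to several feeds. It is a statistical stand-in rather than a trained ML model, and every name in it (learn_range_rule, apply_rule, the feed names, the sample numbers) is hypothetical.

```python
from statistics import mean, stdev

def learn_range_rule(reference_values, k=3.0):
    """Derive an accepted (low, high) range from trusted historical values."""
    mu = mean(reference_values)
    sigma = stdev(reference_values)
    return (mu - k * sigma, mu + k * sigma)

def apply_rule(rule, values):
    """Return the indices of values that are missing or outside the range."""
    low, high = rule
    return [i for i, v in enumerate(values)
            if v is None or not (low <= v <= high)]

# Hypothetical data: one trusted history, many feeds sharing the same field.
history = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]
rule = learn_range_rule(history)

feeds = {
    "feed_a": [100.0, 101.0, 99.0],
    "feed_b": [100.0, None, 250.0],
}

# The same learned rule is applied to every feed, however many there are.
violations = {name: apply_rule(rule, values) for name, values in feeds.items()}
print(violations)
```

The point of the design is that the rule is learned from data rather than hand-written per data set, so adding another feed costs one more dictionary entry instead of another business rule.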

Intelligent monitoring and alerts

Over-alerting and false positives can be a resource drain for IT teams.

If you’re monitoring 200,000 data feeds, for example, and 99% of those seem to contain clean information, that still leaves 1%, or 2,000 feeds, that are suspect. With ML, systems can be instrumented to address most issues independently without involving a human counterpart.

That involvement can still happen for more serious issues that require a staffer to intervene or make a judgment call (something people still do better than machines). But using ML in this way can all but eliminate over-alerting.
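One way to picture this triage is sketched below. It assumes each alert already carries a model-produced confidence score and a flag saying whether a safe automatic fix exists. Both fields, the 0.8 threshold, and the feed IDs are hypothetical and are not drawn from any Capital One system.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    feed_id: str
    confidence: float   # hypothetical model score in [0, 1] that the issue is real
    auto_fixable: bool  # hypothetical flag: a known, safe remediation exists

def triage(alerts, min_confidence=0.8):
    """Split alerts into auto-resolved, escalated, and suppressed groups."""
    auto_resolve, escalate, suppress = [], [], []
    for alert in alerts:
        if alert.confidence < min_confidence:
            suppress.append(alert)      # likely false positive: log, don't page
        elif alert.auto_fixable:
            auto_resolve.append(alert)  # handled without a human
        else:
            escalate.append(alert)      # needs a human judgment call
    return auto_resolve, escalate, suppress

# The arithmetic from the example above: 1% of 200,000 feeds.
total_feeds = 200_000
suspect_feeds = total_feeds * 1 // 100

alerts = [
    Alert("feed-001", confidence=0.95, auto_fixable=True),
    Alert("feed-002", confidence=0.40, auto_fixable=False),
    Alert("feed-003", confidence=0.90, auto_fixable=False),
]
auto_resolve, escalate, suppress = triage(alerts)
print(suspect_feeds, [a.feed_id for a in escalate])
```

Only the escalated group reaches a person, which is how over-alerting shrinks without removing human judgment from the serious cases.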

Strong anomaly and change detection

Anomaly and changepoint detection is critical for locating current or emerging data quality issues.

By definition, anomaly and changepoint detection involves identifying data points, items, observations, or events that disagree with expected patterns or programming.

For example, if retail customers are known to spend more during a big sale but your data indicates they spent far less, that would present a potential anomaly. Anomalies happen all the time and aren’t always caught.

Depending on their severity, those misses can have serious reputational, financial, or cybersecurity repercussions for organizations. Most IT teams deploy some level of data anomaly and change detection, but the automation capabilities of ML make these processes faster and far more efficient.
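To make the retail example concrete, here is a minimal anomaly detection sketch using a robust z-score built on the median and the median absolute deviation (MAD). It is one common statistical technique, not necessarily what any particular ML product uses. The revenue figures, the 3.5 threshold, and the function names are assumptions for illustration, and changepoint detection is not covered.

```python
from statistics import median

def robust_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread in the data, so this method cannot rank deviations.
        return [0.0 for _ in values]
    # 0.6745 makes the score comparable to a standard z-score for normal data.
    return [0.6745 * (v - med) / mad for v in values]

def find_anomalies(values, threshold=3.5):
    """Return the indices of values whose robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]

# Hypothetical revenue (in thousands) on comparable sale days; the last day
# came in far below what the earlier days would lead you to expect.
sale_day_revenue = [120, 125, 118, 130, 122, 127, 40]
anomalies = find_anomalies(sale_day_revenue)
print(anomalies)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value would otherwise inflate the spread and hide itself.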

Precise root cause analysis

Data managers need to know the root of data quality issues to address problems quickly, before they grow and affect more customers.

If data isn’t processing correctly on a website, for instance, it might give customers incorrect information that leads them to make an ill-advised decision or dampens their overall experience.

By automating root cause analysis processes through ML, organizations can address data quality issues more precisely, and even proactively, before or as they arise.
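A very simple form of automated root cause analysis is to ask which attribute value is most over-represented among failing records. The sketch below ranks attribute values by how far their failure rate exceeds the overall rate. It is a heuristic stand-in for an ML approach, it assumes a non-empty list of records, and the record fields (source, region, failed) are hypothetical.

```python
from collections import defaultdict

def rank_suspects(records, dimensions, failed_key="failed"):
    """Rank (dimension, value) pairs by failure rate in excess of the overall rate."""
    overall = sum(r[failed_key] for r in records) / len(records)
    suspects = []
    for dim in dimensions:
        totals = defaultdict(int)
        failures = defaultdict(int)
        for r in records:
            totals[r[dim]] += 1
            failures[r[dim]] += int(r[failed_key])
        for value, count in totals.items():
            rate = failures[value] / count
            suspects.append((rate - overall, dim, value, rate))
    # Largest excess failure rate first.
    suspects.sort(key=lambda s: s[0], reverse=True)
    return suspects

# Hypothetical records that have already been checked for quality failures.
records = [
    {"source": "vendor_x", "region": "east", "failed": True},
    {"source": "vendor_x", "region": "west", "failed": True},
    {"source": "internal", "region": "east", "failed": False},
    {"source": "internal", "region": "west", "failed": False},
    {"source": "internal", "region": "east", "failed": False},
    {"source": "internal", "region": "west", "failed": False},
]
top = rank_suspects(records, ["source", "region"])[0]
print(top[1], top[2])
```

Here every failure traces back to one source while the regions fail at the same rate, so the source is the place to look first. A production system would also need to account for small sample sizes and interacting attributes.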

Capital One pioneers the way into fintech intelligent monitoring

At Capital One, we’ve done a lot of work to implement ML as part of the tools that help us get the most out of data to better serve our customers and run more efficiently.

For instance, last year we introduced our open-source Data Profiler, which detects sensitive data and provides detailed statistics on large batches of data (distribution, pattern, type, and so forth). It enables you to perform many of the tasks we’ve been talking about here (staff optimization, pushing out rules at scale, intelligent monitoring and alerts, anomaly detection, and root cause analysis).

But it’s just one of the numerous efforts we have underway to innovate around ML and share our learnings with others. Data quality challenges will continue to grow as new sources and formats emerge. Our job is to improve our powers as data professionals right along with that growth.

The continuing evolution of ML technology can serve as a force multiplier for organizations willing to incorporate it. To learn more about ML at Capital One, read our article on how we use machine learning research in finance.