In the world of AI, labeled data matters. It’s key to building state-of-the-art machine learning products, especially “supervised” models, which require labeled data as input to learn. When a self-driving car needs to decide whether or not to stop at an intersection, it leverages what it’s learned from thousands of examples of stoplights. Those images are labeled, usually by humans, along with thousands of images of other traffic signals that are not stoplights, so that the self-driving car can learn to differentiate traffic signals in the moment.
The journey to optimizing labeled data can be messy, frustrating, and just plain hard. Do your machine learning teams a favor and create products defined by the data. To do this, you’ll need labeled datasets -- and lots of them. When setting out to create your first dataset, here are five items to keep in mind.
1. You may need to collect your own labeled dataset
Public domain datasets can contain millions of records. However, most models and use cases require specific human labels that you cannot find online. While a public dataset may include millions of images from self-driving cars, one model may be trying to predict red stoplights while another is trying to detect (and avoid hitting) humans. To serve both models, the same data must be labeled in different ways, with different objectives in mind.
This is even more true when your model requires data you can’t find online. More and more datasets are becoming publicly available, but the more domain-specific your research, the more work you’ll need to put in up front to find and label the data.
2. It will be hard to define your labeling schema perfectly up front
Quality labeled data requires a well-defined labeling schema. It’s impossible to correctly label data when the correct answer keeps changing! BUT -- your model and labeling task will evolve along with the product. In a first iteration, perhaps you’re only distinguishing red lights from green, but in a future iteration, you’ll have to distinguish between red, yellow, green, flashing, arrows, etc. Seeing these changes coming down the road is important and can help guide the right level of granularity for different classifications.
So, what’s the right level? Keep in mind -- too few classifications can create a bad product, and too many can create bad data. Spend time finding the middle ground.
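As a rough illustration of the granularity trade-off, the sketch below assumes one possible approach: labelers record fine-grained classes, and a mapping collapses them into the coarser classes an early model needs. The label names and the `FINE_TO_COARSE` mapping are hypothetical, not a recommended schema.

```python
# Hypothetical fine-grained traffic-light labels mapped to the coarse
# classes a first-iteration model might use. All names are illustrative.
FINE_TO_COARSE = {
    "red": "stop",
    "flashing_red": "stop",
    "red_arrow": "stop",
    "yellow": "caution",
    "flashing_yellow": "caution",
    "green": "go",
    "green_arrow": "go",
}


def coarsen(labels):
    """Collapse fine-grained labels into coarse classes.

    Raises KeyError for a label that is not in the schema, so schema
    drift is noticed instead of silently ignored.
    """
    return [FINE_TO_COARSE[label] for label in labels]


fine_labels = ["red", "green_arrow", "flashing_yellow", "red_arrow"]
coarse_labels = coarsen(fine_labels)
print(coarse_labels)
```

With a mapping like this, a later iteration can train on the fine-grained labels directly without relabeling, at the cost of a harder labeling task today.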
3. The perfect labeling schema does not exist -- some data will always be too ambiguous to label
Have you ever failed a CAPTCHA? We have. Because no matter how basic a task -- like finding all of the images of storefronts -- 100% certainty is hard to achieve. And though we encourage spending time on the schema to make it as robust as possible, it’s important to acknowledge that the data will not fit into your well-defined categories 100% of the time. That’s okay! If you, as a human, can’t decide which class is correct, how do you expect your model to decide?
4. Different humans don’t label the same input the same way
When a labeling schema has a lot of options and the dataset contains ambiguous data, getting two humans to label the same way 100% of the time is not going to happen. Finding consistency between human labelers requires them to understand concepts, solve problems, and perform in similar ways. Having a labeling system that allows multiple labelers to give different answers lets you pick the highest-quality data for training without sacrificing the benefits of multiple sets of eyes on your data.
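One simple way to use multiple labelers is a majority vote with an agreement threshold. The sketch below is a minimal illustration under assumed names: the `aggregate` helper, the 0.6 threshold, and the item IDs are all hypothetical, and real systems often use more sophisticated aggregation.

```python
from collections import Counter


def aggregate(votes, min_agreement=0.6):
    """Majority-vote one item's labels.

    Returns (label, agreement), where agreement is the share of labelers
    who chose the winning label. If agreement falls below the assumed
    threshold, the label is None so the item can be reviewed or dropped.
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    agreement = top / len(votes)
    if agreement < min_agreement:
        return None, agreement
    return label, agreement


# Hypothetical votes from three labelers per image.
item_votes = {
    "img_001": ["red", "red", "red"],
    "img_002": ["red", "yellow", "red"],
    "img_003": ["red", "yellow", "green"],
}

results = {item: aggregate(votes) for item, votes in item_votes.items()}

# Keep only the items where labelers agreed strongly enough.
training_set = {
    item: label for item, (label, _) in results.items() if label is not None
}
print(training_set)
```

Items that fail the threshold are not necessarily bad data. They may simply be the ambiguous cases described in the previous section, and they are worth a second look before being discarded.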
Similarly, people change over time. Even within a single labeler, as the labeling schema changes and as time passes, the labeler’s mental model of the data can shift. To combat these inconsistencies within a single person, it’s important to provide examples and training opportunities that prevent humans from settling into patterns and habits that can manifest as systematic errors.
5. There is no “done” in labeled data
As your self-driving car travels to new frontiers, it needs to be prepared (trained) to perform at a high level in new environments. But how do you, as an ML product leader, know when you have enough labeled data? In some cases, this constant stream of new data is a gold mine, and allows you to target the long tail of your schema -- street signs that are less common, for example. But it can also create a feeling of needing to label everything even when the model is performing as expected.
Deciding when to stop can be difficult, especially because some ongoing labeling is important to monitor how well the model is performing.
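A minimal sketch of that kind of monitoring, assuming you periodically send a small sample of recent inputs to human labelers and compare their labels with the model's predictions. The function names and the target accuracy are hypothetical choices for illustration.

```python
def accuracy(predictions, labels):
    """Share of model predictions that match the human labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def needs_more_labels(predictions, labels, target=0.95):
    """Flag the sample when accuracy drops below the assumed target."""
    return accuracy(predictions, labels) < target


# Hypothetical audit sample: model output vs. fresh human labels.
model_predictions = ["stop", "go", "stop", "go", "stop"]
human_labels = ["stop", "go", "go", "go", "stop"]

sample_accuracy = accuracy(model_predictions, human_labels)
flagged = needs_more_labels(model_predictions, human_labels, target=0.9)
print(sample_accuracy, flagged)
```

A small, regular audit like this keeps labeling going at a low volume while the model performs as expected, and signals when a larger labeling effort is worth the cost.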
Labeling data is hard. Labeling quality data is harder. And labeling quality data at scale is the hardest. But solving the problems listed above unlocks performance in your models that can be used to continuously iterate on your product and create better experiences. We hope that in laying out these problems, we can help start conversations on how to solve them for machine learning products across domains and industries.
The answers to these problems may never be straightforward, and as AI evolves, so too will the way data gets labeled. But the need to label data in an efficient, accurate, and cognitively easy way will persist.