3 key points for why saving data is important yet difficult

Why there’s a mountain of difference between all of the data and most of the data

Molly Sions

October 22, 2018

The importance of data is emphasized so often that it has nearly become a cliché. Variations of The Economist’s eye-catching headline “The world’s most important resource is no longer oil, but data” seem to pop up everywhere, eliciting equal amounts of excitement and fright among technologists and consumers. And yet, while new data use cases are multiplying like plant cells, it can be surprisingly difficult for a system development team to prioritize producing data over other capability builds.

I have certainly experienced this dynamic. Earlier this summer, I made a career transition from data engineer to backend platform product manager. Having seen the power of data firsthand, I assumed I would have no trouble helping my team become first-class data producers. Once I got up to speed with the business requirements and timelines, though, I came to the conclusion that producing our data had to go on the back burner.

In the weeks since reprioritizing when data production would begin, I have tried to better understand the factors that led to my decision, and build a more resilient internal mythology around data producing. What I’ve settled on are three key avenues to understand how saving data can be so important and, at the same time, so hard to make happen.

#1 The Hidden Influence of Data Models

The first thing to keep in mind is how much an unnoticed gap in data can distort analysis. Imagine you run a convenience store, and your register captures purchase data. Whenever an item is scanned, your register stores the price of the item in a one-column table. Curious to see how sales are going, you run a query against that table and make a simple, guiding discovery: You’re making more money off of items that cost less than $5 than you are off of items that cost more than $5. As a thorough, data-driven decision-maker, you come to the conclusion that you need to stock your store with more items at the price point your customers are looking for.

What if you’d been tracking a different field, though? Imagine that instead of the register, it’s the barcode scanner that captures the data, and the field in your one-column table shows the types of items that are selling. Curious to see how sales are going, you run a query against that table and make a simple, guiding discovery: Sales of ice cream pints are through the roof! As a thorough, data-driven decision-maker, you come to the conclusion that you need to take advantage of this spike in demand and raise the price of ice cream from $4.50 to $5.50.

Two different fields, two conflicting paths of action. In both cases, the nature of the data has predetermined the leverage point. The first table captures and accumulates price data — by doing so, it has limited your courses of action to ones that respond to the relationship between price and quantity. In situations where limited data is available, the data model becomes the screenwriter behind your analytics; you don’t see it unless you’re looking for it, and it’s often doing its best to stay in the shadows.
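The two one-column tables can be sketched in a few lines of Python. This is only an illustration: the transactions, the `revenue_by_price_band` and `top_item` helpers, and the choice to count an item priced at exactly $5 in the upper band are all assumptions made for the example, not data or code from a real store.

```python
from collections import Counter

# Hypothetical transactions; every item and price here is invented.
transactions = [
    {"item": "ice cream pint", "price": 4.50},
    {"item": "ice cream pint", "price": 4.50},
    {"item": "ice cream pint", "price": 4.50},
    {"item": "ice cream pint", "price": 4.50},
    {"item": "gum", "price": 1.50},
    {"item": "gum", "price": 1.50},
    {"item": "umbrella", "price": 12.00},
]

# Register view: a one-column table that holds only prices.
price_table = [t["price"] for t in transactions]

# Scanner view: a one-column table that holds only item types.
item_table = [t["item"] for t in transactions]


def revenue_by_price_band(prices, threshold=5.00):
    """Total revenue from items priced under the threshold versus the rest."""
    under = sum(p for p in prices if p < threshold)
    over = sum(p for p in prices if p >= threshold)
    return {"under": under, "over": over}


def top_item(items):
    """Most frequently sold item type."""
    return Counter(items).most_common(1)[0][0]


revenue = revenue_by_price_band(price_table)
best_seller = top_item(item_table)

# The price table alone suggests stocking more cheap items.
print(revenue)
# The item table alone suggests raising the price of the best seller.
print(best_seller)
```

Both queries run against the same underlying sales, yet each table can only answer the question its single column allows.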

#2 The Road to True Sourcing

The example above is the difference between some systems saving their data and every system saving its data. Systems are built in silos, and it’s architecturally healthy for them to stay within reasonably specialized lanes. When only some of those systems emit data, analytics unknowingly gravitate toward the silos in which those systems reside, potentially missing the bigger picture.

Resolving this issue is critical to getting to the second, hugely important step: True sourcing. Systems communicate with one another, and every API call consists of some measure of data interchange. If two systems are hugely interdependent and producing everything they know, then there will likely be a level of field redundancy. If one system gets a field from another and then produces that field in its logs, it can seem like splitting hairs to harp on the importance of getting data from the first system, not the second.

The more fields get passed around, the more they mutate. One system’s timestamp is another system’s date, and leading zeroes have a habit of disappearing. The terms “raw data” and “true source” data are often used interchangeably, but the difference is tremendously meaningful. Data can be raw without being true source, and if a system is not producing its own data, other systems are going to speak on its behalf.

If you have ever played a game of telephone, then you know just how enormous the inconsistencies can become when data is passed around over time.
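A minimal sketch of how a field can mutate on its way through a second system, assuming a hypothetical record with a zero-padded `store_id` and a `sold_at` timestamp. The field names, values, and the `relay_through_downstream` function are invented for illustration.

```python
from datetime import datetime, timezone

# Hypothetical record as the true-source system holds it.
source_record = {
    "store_id": "00742",
    "sold_at": datetime(2018, 10, 22, 23, 59, 30, tzinfo=timezone.utc),
}


def relay_through_downstream(record):
    """Hypothetical downstream system that re-emits the record using its own types."""
    return {
        # Stored as an integer downstream, so the leading zeroes disappear.
        "store_id": int(record["store_id"]),
        # Stored as a date downstream, so the time of day and time zone are dropped.
        "sold_at": record["sold_at"].date(),
    }


relayed = relay_through_downstream(source_record)

print(source_record["store_id"], "->", relayed["store_id"])
print(source_record["sold_at"].isoformat(), "->", relayed["sold_at"].isoformat())
```

The relayed record still looks like unprocessed data, but it can no longer be joined back to the source on `store_id` without guessing the padding, and the original time of sale cannot be recovered from it.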

#3 The Unpredictability of New Data Use Cases

Here I am, saying all this, and the fact still remains: Given the choice between adding functionality and producing data, I chose the former. In thinking about why, my mind keeps ambling back towards my time as a data engineer. My team was running a hugely valuable stream processing platform, and nearly every time I approached a new team about getting access to their data, the response was the same: “What do you want to use it for?”

I find that experience incredibly instructive. Not once did I enter the room to anything resembling “We’ve been waiting for a use case like this.” The producers were always surprised by the proposal we had come up with.

The reason I say this is not to express that I work with a bunch of jerks. On the contrary, I am pretty sure I work with the best people in the world. Their skepticism is illustrative of the fact that strong data use cases tend to come from unexpected angles, angles that the producing system itself does not know it covers. If your only experience with baking soda comes from making cakes, then seeing it mix with vinegar will probably surprise you. Furthermore, a use case for data cannot be discovered without the data to inspire it, so the cycle time between when the data first becomes available and when the use case matures is long. One of the streams my team onboarded had been around for a year before someone found a use case for it, but that use case proved to be incredibly valuable.

Internalizing a romantic view of the “unknown use case” is key to becoming a great data producer. Writing data quality checks, documenting metadata and building pipelines are as difficult as they are time-consuming. You have to be innately confident in the value of your information, and patient as you wait for a return on your investment. Those are hard qualities to achieve if they aren’t being reinforced by the engineering culture. Producing data has to be seen as a requirement instead of a bonus.

Conclusion

Maintaining a data environment that stretches into every part of the business, operates on multiple cadences, and makes itself broadly available while maintaining security standards is extraordinarily tough. On an individual team level, it can feel like a Sisyphean task that takes away from the core function of your platform, and on an enterprise level it can feel vulnerable. You become heavily dependent on access controls, metadata, and data producer monitoring.

It feels easy to scale back the number of petabytes, but that thought process ultimately removes the power of data to drive action. In the ceaseless hunt for leverage, abundant and multifaceted data is your most valuable asset.

Molly Sions, Director, Product Management

Molly Sions is a Director, Product Management at Capital One.
