Data Profiler - An Open Source Solution to Explain Your Data

Gain insight into large data with Data Profiler, an open source project from Capital One

In the field of machine learning, data is a valuable resource, and understanding data is one of our most valuable capabilities. As data scientists and engineers, we need to be able to answer these questions with every data-intensive project we build: Is our data secure? What does the data say? How should we use the data? At Capital One, we are proud to help answer these questions with the Data Profiler - an open source project from Capital One that performs sensitive data analysis on a wide range of data types.

Introducing the Data Profiler - an open source project from Capital One

Data streams often become so large that it is hard, or even impossible, to monitor all the data being sent through. If sensitive data slips through, it can be detrimental and difficult to stop or even notice. The goal of the Data Profiler open source project is to detect sensitive data and provide detailed statistics on large batches of data (distribution, pattern, type, etc.). These “sensitive entities” are defined as any crucial private information, such as bank account numbers, credit card numbers, and social security numbers.

The Data Profiler has several key features, including detecting sensitive entities, generating statistics, and providing the infrastructure to build your own data labeler. For the Data Profiler we’ve designed a pipeline that accepts a wide range of data formats, including CSV, Avro, Parquet, JSON, plain text, and pandas DataFrames. Whether the data is structured or unstructured, it can be profiled.
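
Getting a profile from any of these formats takes only a few lines. The sketch below follows the loading-and-reporting pattern from the project’s README; the file name is a stand-in for your own data:

import json
import dataprofiler as dp

# Auto-detect the format and load the file (CSV, Avro, Parquet, JSON, or text)
data = dp.Data("your_file.csv")

# Calculate statistics and run entity recognition on every column
profile = dp.Profiler(data)

# Produce a human-readable report of the results
readable_report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(readable_report, indent=4))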

An example of how Data Profiler works

Maybe you’re a jeweler buying and selling diamonds with a large database of all your customers and transaction details. Imagine you have a simple structured dataset like this, except with many hundreds more rows and columns:

| Customer | Carat | Cut | Date | Email | Credit Card Number | Price ($) |
| --- | --- | --- | --- | --- | --- | --- |
| Rick Deckard | 0.23 | Ideal | Jan23 | rick@fakeemail.com | XXXX | 100.41 |
| Paul Atreides | 0.25 | Premium | 11/06/2020 | muaddib@fakeemail.com | XXXX | 120.13 |
| Amelia Brand | 0.26 | Ideal | 1/2/2021 | interstellar@fakeemail.com | XXXX | 133.60 |
| Miles Prower | 0.33 | Good | February 13th, 2021 | babydriver@fakeemail.com | XXXX | 124.89 |
| Elio Perlman | 0.36 | Good | 12/06/2020 | oliver@fakeemail.com | XXXX | 102.63 |

The Data Profiler can help you learn from your data. Each column in your dataset is profiled individually to generate per-column statistics. You’ll learn the exact distribution of diamond prices, that cut is a categorical column with several unique values, that carat is organized in ascending order, and, most importantly, the sensitive-data classification of each column.

Our machine learning model then automatically classifies columns as credit card information, email addresses, and so on. This helps you discover sensitive data sitting in columns where it shouldn’t exist.
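
For instance, once a profile has been generated, each column’s predicted label can be read out of the report. The following is a minimal sketch: the file name is hypothetical, and the report keys (data_stats, column_name, data_label) follow the structured profiler’s report layout described in the project documentation:

import dataprofiler as dp

# Profile a hypothetical jeweler dataset
data = dp.Data("jewel_transactions.csv")
profile = dp.Profiler(data)
report = profile.report(report_options={"output_format": "compact"})

# Print the predicted sensitive-data label for each column,
# e.g. EMAIL_ADDRESS for the email column or CREDIT_CARD for card numbers
for column in report["data_stats"]:
    print(column["column_name"], "->", column["data_label"])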

Switching out Data Profiler’s machine learning model

The default Data Profiler machine learning model recognizes the following labels (a short sketch after the list shows how to view them from code):

  • BACKGROUND (Text)
  • ADDRESS
  • BAN (bank account number, 10-18 digits)
  • CREDIT_CARD
  • EMAIL_ADDRESS
  • UUID
  • HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
  • IPV4
  • IPV6
  • MAC_ADDRESS
  • PERSON
  • PHONE_NUMBER
  • SSN
  • URL
  • DATETIME
  • INTEGER
  • FLOAT
  • QUANTITY
  • ORDINAL
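
If you want to confirm which labels your labeler instance supports, you can list them programmatically. This is a small sketch assuming the labels and label_mapping attributes documented for the data labeler:

import dataprofiler as dp

# Load the default structured data labeler
data_labeler = dp.DataLabeler(labeler_type='structured')

# The label names the model predicts, and how they map to model output indices
print(data_labeler.labels)
print(data_labeler.label_mapping)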

Returning to our jeweler example, perhaps you regularly sell gems and need a model that identifies specific gem types. Switching models is simple because the data labeler is a pipeline within the Data Profiler that can be altered to fit your needs.

Three components make up the data labeler: a preprocessor, a model, and a postprocessor.

[Figure: a flowchart of the labeler pipeline, with arrows from Preprocessor to Model and from Model to Postprocessor.]

The preprocessor takes raw data and turns it into a format the model can accept. The model takes in that data, runs a prediction (or fit), and passes its output to the postprocessor. The postprocessor turns the model’s output into something usable by the user. One of the most important features of a data labeler is versatility: being able to switch and modify models as needed. Running multiple models on the same dataset is easy, since choosing a preexisting data labeler to train and predict with takes just a few lines of code:

# Using the default labeler
import dataprofiler as dp

# load data and a saved data labeler (trainable=True so it can be fit)
training_data = dp.Data("your_training_jewel_data.csv")
testing_data = dp.Data("your_test_jewel_data.csv")
data_labeler = dp.DataLabeler(labeler_type='structured',
                              dirpath="my/data/labeler",
                              trainable=True)

# train your model
model_results = data_labeler.fit(x=training_data['samples'],
                                 y=training_data['labels'],
                                 validation_split=0.2)

# make predictions and get labels per cell
predictions = data_labeler.predict(testing_data)

# save the data labeler for reuse
data_labeler.save_to_disk("my/jewel/labeler")
  

Maybe you only want a specific part of the labeler altered. Want to process the data in a unique way? Change one or both of the processors. Want new labels to classify? Change the model.

The process is streamlined like so:

# Using a custom labeler
import dataprofiler as dp
from dataprofiler.labelers.character_level_cnn_model import \
    CharacterLevelCnnModel
from dataprofiler.labelers.data_processing import \
    StructCharPreprocessor, StructCharPostprocessor

# Create your own component based on the default architecture
model = CharacterLevelCnnModel(...)

# or load your own components from your own directories.
preprocessor = StructCharPreprocessor.load_from_disk(dirpath="./my/preprocessor")
postprocessor = StructCharPostprocessor.load_from_disk(dirpath="./my/postprocessor")

data_labeler = dp.DataLabeler(labeler_type='structured')
data_labeler.set_preprocessor(preprocessor)
data_labeler.set_model(model)
data_labeler.set_postprocessor(postprocessor)

# check for basic compatibility between the processors and the model
data_labeler.check_pipeline()
  

This makes it easy to switch out any type of model or processor. Perhaps you need a CNN, an RNN, or a regex model to label with; all are possible. A model or processor can be created from the default architecture or loaded from an existing one.

Creating your own data labeler pipeline from scratch

Perhaps you need more than the default models, whether that’s the CharacterLevelCnnModel or the RegexModel. How can you incorporate your own model?

The data labeler has been designed so that users can easily build their own from scratch with the framework provided. By inheriting the abstract base classes for the model, preprocessor, and postprocessor, you can create the individual components of the pipeline and have them work seamlessly with the existing architecture. To learn specifically how to implement your own data labeler pipeline, see the project documentation. There are a few critical methods to implement, but beyond that, there is no limit to the type of model or processor you can create. This can be extremely useful when you need to run many different custom pipelines.
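
As a rough illustration, a custom postprocessor might look something like the sketch below. The class name and its behavior are hypothetical, and the methods shown (process and a parameter-validation hook) are assumptions about the BaseDataProcessor interface; check the base class before relying on them:

import dataprofiler as dp
from dataprofiler.labelers.data_processing import BaseDataProcessor

# Hypothetical postprocessor that renames generic model labels to gem types.
class GemTypePostprocessor(BaseDataProcessor):

    processor_type = 'postprocessor'  # assumed marker used by the pipeline

    def _validate_parameters(self, parameters):
        # accept any parameters in this sketch
        pass

    def process(self, data, results, label_mapping):
        # purely illustrative: real logic would rewrite entries in `results`
        return results

# Drop the custom component into an existing labeler and verify compatibility
data_labeler = dp.DataLabeler(labeler_type='structured')
data_labeler.set_postprocessor(GemTypePostprocessor())
data_labeler.check_pipeline()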

Conclusion

With the Data Profiler, you’ll be able to learn what’s in your data. The ability to create and switch out your own data labeler allows the Data Profiler to fit your needs directly. And since Data Profiler is an open source solution for gaining insights into large data, if there are features you want to see added, open a GitHub issue or create a pull request on the repository. Contributors are more than welcome on this open source project.


Grant Eden, Software Developer

MS in Computer Science and Engineering from Penn State. Software Developer at Capital One performing research in Machine Learning. Skilled in Object Detection and Natural Language Processing. Really good at baking apple pies.
