Using DataComPy for comparing pandas and spark dataframes

DataComPy is an open-source python software developed by Capital One

DataComPy is an open source project by Capital One developed to compare Pandas and Spark dataframes.  It can be used as a  replacement for SAS' PROC COMPARE or as an alternative to Pandas.DataFrame.equals(Pandas.DataFrame, providing the additional functionality of printing out stats and letting users adjust for match accuracy.

Building the DataComPy open source project

DataComPy was one of several software programs developed for internal use within Capital One. The purpose of building these features is to produce customer management risk models, where data validation and comparison are critical to ensuring integrity and confidence.

Specifically, as we were using this in models with numerous data sources, we needed to understand any data lineage nuances such as data transformations that had been performed along the way. Hence, building a similar capability to PROC COMPARE was essential to our use of Python.

The tool was completed for internal use in 2017 and after one year of use we started the process of open sourcing DataComPy. Since DataComPy's official release to the open-source community, it has almost 180 stars, 60 forks, and 14 distinct contributors. it has 268 stars, 91 forks, and 18 distinct contributors.

How DataComPy helps data scientists, analysts, and engineers with model builds

DataComPy’s goal is to provide a human-readable output describing differences between two dataframes in Pandas and Spark. It provides descriptive reporting at the column and row level outlining where columns and rows are identical, and where there may be differences. It tries to remain flexible by allowing users to provide tolerances of their matching criteria, custom joining, and several helper functions to enable users to easily compare intersections and unique items from a row and column perspective.

Data scientists, analysts, and engineers will find DataComPy a helpful tool in examining data for model builds. The tool is also helpful for ETL processing and general data comparison between various systems where transformations are being made.

Comparing data between sources can be finicky and cumbersome. We hope DataComPy will help ease the burden on users so they can focus on solving critical business problems.

DataComPy and Pandas overview

Comparing dataframes

With DataComPy and Pandas you can easily compare two dataframes using either  joining columns or on an existing index.

    compare = datacompy.Compare(
    join_columns='acct_id',  #You can also specify a list of columns

# OR

compare = datacompy.Compare(df1, df2, join_columns=['acct_id', 'name'])

# OR

compare = datacompy.Compare(df1, df2, on_index=True)

Reporting summary

The report() method can then be used to provide a human readable summary of the differences.

    DataComPy Comparison

DataFrame Summary

  DataFrame  Columns  Rows
0  original        5     6
1       new        4     5

Column Summary

Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0

Row Summary

Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0

Column Comparison

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 7

Columns with Unequal Values or Types

       Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0  dollar_amt        float64   float64          1    0.0500            0
1   float_fld        float64   float64          4    0.0005            3
2        name         object    object          2    0.0000            0

Sample Rows with Unequal Values

       acct_id  dollar_amt (original)  dollar_amt (new)
0  10000001234                 123.45             123.4

       acct_id  float_fld (original)  float_fld (new)
0  10000001234            14530.1555        14530.155
5  10000001238                   NaN          111.000
2  10000001236                   NaN            1.000
1  10000001235                1.0000              NaN

       acct_id name (original)            name (new)
0  10000001234  George Maharis  George Michael Bluth
3  10000001237      Bob Loblaw         Robert Loblaw

Sample Rows Only in original (First 10 Columns)

       acct_id  dollar_amt           name  float_fld    date_fld
4  10000001238        1.05  Lucille Bluth        NaN  2017-01-01

Quick overview of DataComPy and Spark

Comparing Spark dataframes using DataComPy is very similar to that in Pandas. Compare() is replaced with SparkCompare() and some additional useful features.

    comparison = datacompy.SparkCompare(spark, base_df, compare_df,
                    join_columns =   [('acct_id', 'ACCOUNT_IDENTIFIER'), ('acct_sfx_num', 'SUFFIX_NUMBER')],
                    column_mapping = [('clsd_reas_cd', 'AM00_STATC_CLOSED'),
                                      ('open_dt', 'AM00_DATE_ACCOUNT_OPEN'),
                                      ('tbal_cd', 'AM0B_FC_TBAL')],
                    known_differences= [
                        {'name': 'Left-padded, four-digit numeric code',
                         'types': ['tinyint', 'smallint', 'int', 'bigint', 'float', 'double', 'decimal'],
                         'transformation': "lpad(cast({input} AS bigint), 4, '0')"},
                        {'name': 'Null to *2',
                         'types': ['string'],
                         'transformation': "case when {input} is null then '*2' else {input} end"},
                        {'name': 'Julian date -> date',
                         'types': ['bigint'],
                         'transformation': "to_date(cast(unix_timestamp(cast({input} AS string), 'yyyyDDD') AS timestamp))"}

Column mapping

You can provide column mappings if the names differ between the datasets and also call out known differences and apply transformations on the compare side to help cluster these expected differences in the report output. 

The corresponding output would look like:

    ****** Column Summary ******
Number of columns in common with matching schemas: 3
Number of columns in common with schema differences: 2
Number of columns in base but not compare: 0
Number of columns in compare but not base: 0

****** Schema Differences ******
Base Column Name  Compare Column Name     Base Dtype     Compare Dtype
----------------  ----------------------  -------------  -------------
open_dt           AM00_DATE_ACCOUNT_OPEN  date           bigint
tbal_cd           AM0B_FC_TBAL            string         double

****** Row Summary ******
Number of rows in common: 5
Number of rows in base but not compare: 0
Number of rows in compare but not base: 0
Number of duplicate rows found in base: 0
Number of duplicate rows found in compare: 0

****** Row Comparison ******
Number of rows with some columns unequal: 5
Number of rows with all columns equal: 0

****** Column Comparison ******
Number of columns compared with unexpected differences in some values: 1
Number of columns compared with all values equal but known differences found: 2
Number of columns compared with all values completely equal: 0

****** Columns with Unequal Values ******
Base Column Name  Compare Column Name     Base Dtype     Compare Dtype  # Matches  # Known Diffs  # Mismatches
----------------  -------------------     -------------  -------------  ---------  -------------  ------------
clsd_reas_cd      AM00_STATC_CLOSED       string         string                 2              2             1
open_dt           AM00_DATE_ACCOUNT_OPEN  date           bigint                 0              5             0
tbal_cd           AM0B_FC_TBAL            string         double                 0              5             0

The DataComPy roadmap

Our roadmap currently includes some refactoring of the Pandas and Spark codebase to help unify the experience, as well as sourcing more user feedback to help enhance the current capabilities. The full DataComPy roadmap can be seen here.

We highly encourage new contributors and feedback is always welcome by opening a new issue! Learn more about Capital One’s “open source first” philosophy and how we actively use, create, and contribute to open source software in our work.

Open Source First Since 2014

We're using the power of community to build the bank of the future.

Learn More

Faisal Dosani, Director, Data Science

Faisal Dosani joined Capital One in 2014 and is currently a Director for the Canada Card Data Science team focusing on our machine learning tooling and infrastructure. Faisal spends most of his time collaborating with colleagues across the enterprise to ensure our tooling and data is well managed, robust, and well thought out. Empowering users to solve business problems through effective software and design is his north star. As a side he is a huge Python fan, and enjoys 3D printing.

Yes, We’re Open Source!

Learn more about how we make open source work in our highly regulated industry.

Learn More

Related Content