Customer transaction data: Imputing merchant industry types

Industry codes are required in many financial models. How can machine learning help impute missing or incorrect codes?

Behrouz Saghafi

June 23, 2021

Introduction

Financial institutions often use customer transaction data and merchant records to build models. A particularly important piece of data for applications involving small businesses is the Standard Industrial Classification (SIC) code, which specifies the merchant’s industry type, such as restaurant, legal, or real estate [1]. Despite its importance, the industry code field is often missing or incorrect in the merchant record data. To build accurate models, it is therefore essential to impute it using other available information and to provide a confidence estimate on the imputed value.

This blog post describes a sequence classification approach, first presented as a paper at the KDD 2020 Machine Learning in Finance Workshop, that imputes the industry type from the merchant’s name. Often, the industry code is represented as a one-hot encoded vector in a downstream model, but this approach instead yields a representation of the code as a probability distribution over all available categories. The resulting representation can then be used by downstream models as well as by applications such as entity resolution (a task that involves matching transaction data from different sources). I will give an overview of the approach and the models used, discuss the data, and present some experimental results.

Overview of the sequence-classification-based approach

Our sequence classification approach uses the name of the merchant for predicting the industry type. This approach is motivated by the simple observation that often (but not always) the name of a merchant is indicative of the industry type. Some examples of this are:

Pete’s Pizza → restaurant
Freehold Auto Garage → automotive repair shop
Marlboro Auto Parts → automotive supplies and parts
Law Office of Krish Patel → law office
Joe Yen, MD → doctor’s office

Of course, this is not always the case. For example, a Google search for "Top restaurants in Richmond, Virginia" yields the following three names: The Boathouse at Rocketts Landing, Lamaire, and Croaker’s Spot | Richmond. It is not at all obvious from the names that these are restaurants.

A sequence classification model such as an LSTM [2] or a bag-of-words model [3] can be trained to learn the industry type from merchant names. As is clear from the above examples, such a model is not going to be correct 100% of the time. It does, however, output a probability distribution over the different classes, and we can use that distribution as the representation of the industry type instead of the usual one-hot encoding. Specifically, if there are N different industry types, then the usual representation of a particular industry type would be a one-hot encoded vector of size N. We use this representation when the industry type is not missing. When the industry type is missing, we can use the sequence classification model to generate a vector of probabilities (of size N) and use that as the representation. A downstream model (for fraud detection, for example) is then trained on this representation as input (along with other inputs from the given transaction).
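As a minimal sketch of this representation, the snippet below returns a one-hot vector when the industry type is known and falls back to the classifier's probability vector when it is missing. The label list, function names, and probabilities are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import List, Optional, Sequence

# Hypothetical label set for illustration; the post works with N industry types.
INDUSTRY_TYPES = ["legal", "automotive", "health", "eating_drinking"]


def one_hot(label: str, classes: Sequence[str] = INDUSTRY_TYPES) -> List[float]:
    """Return a vector of size N with 1.0 at the position of the known label."""
    if label not in classes:
        raise ValueError(f"unknown industry type: {label!r}")
    return [1.0 if c == label else 0.0 for c in classes]


def industry_representation(
    known_label: Optional[str],
    predicted_probs: Sequence[float],
    classes: Sequence[str] = INDUSTRY_TYPES,
) -> List[float]:
    """Use the one-hot vector when the label exists, else the model's probabilities."""
    if known_label is not None:
        return one_hot(known_label, classes)
    if len(predicted_probs) != len(classes):
        raise ValueError("probability vector must have one entry per class")
    return list(predicted_probs)


# Made-up classifier output for a merchant whose industry code is missing.
model_output = [0.1, 0.2, 0.3, 0.4]
known = industry_representation("health", model_output)
imputed = industry_representation(None, model_output)
```

Both vectors have the same length N, so a downstream model can consume either one without knowing whether the code was recorded or imputed.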

Bag-of-words model

In a bag-of-words (BoW) model the ordering of the words is ignored, and only the word frequency in the vocabulary is kept. If the vocabulary contains M words, then the representation is a vector of size M where each entry is the number of times the corresponding word appears in the given text. This representation is then fed to a classifier such as a random forest or a feedforward network. In our experiments, we chose a feedforward network as the classifier.
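The count vector described above can be sketched in a few lines of plain Python. The helper names and the tiny vocabulary are assumptions made for illustration; the classifier that consumes the vector is not shown.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_vocabulary(texts: Iterable[str]) -> Dict[str, int]:
    """Map each distinct word to an index, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def bow_vector(text: str, vocab: Dict[str, int]) -> List[int]:
    """Return a vector of size M holding the count of each vocabulary word."""
    counts = Counter(text.split())
    vector = [0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:  # words outside the vocabulary are ignored
            vector[vocab[word]] = count
    return vector


# Illustrative, already-preprocessed merchant names.
names = ["marlboro diner", "marlboro auto parts", "person pizza"]
vocab = build_vocabulary(names)

forward = bow_vector("marlboro diner", vocab)
reversed_order = bow_vector("diner marlboro", vocab)
```

Because only counts are kept, `forward` and `reversed_order` are identical: the representation cannot distinguish the two word orders.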

LSTM model

Unlike the BoW model, an LSTM model relies on the word ordering to make decisions. Because of this, the LSTM captures contextual information and can perform better for many applications. However, since word ordering in a merchant name carries little additional information (e.g., "Marlboro Diner" vs "Diner Marlboro"), we would not expect the LSTM to vastly outperform the BoW model here, which was confirmed by our experiments.

Experiments with data and preprocessing

For this experiment, we are only interested in two fields: the merchant name and the two-digit SIC code. A SIC code usually consists of four digits, and using only its leading digits gives a broader industry classification. In our case, we use the first two digits, which indicate the major group. The highest level of classification is the division; divisions correspond to the following 12 ranges of SIC codes [1]:

Range of SIC Codes	Industry Division
0100-0999	Agriculture, Forestry and Fishing
1000-1499	Mining
1500-1799	Construction
1800-1999	Not Used
2000-3999	Manufacturing
4000-4999	Transportation, Communications, Electric, Gas and Sanitary Services
5000-5199	Wholesale Trade
5200-5999	Retail Trade
6000-6799	Finance, Insurance and Real Estate
7000-8999	Services
9100-9729	Public Administration
9900-9999	Nonclassifiable

The division for Services, for example, consists of the following major groups:

70: Hotels, Rooming Houses, Camps, And Other Lodging Places
72: Personal Services
73: Business Services
75: Automotive Repair, Services, And Parking
76: Miscellaneous Repair Services
78: Motion Pictures
79: Amusement And Recreation Services
80: Health Services
81: Legal Services
82: Educational Services
83: Social Services
84: Museums, Art Galleries, And Botanical And Zoological Gardens
86: Membership Organizations
87: Engineering, Accounting, Research, Management, And Related Services
88: Private Households
89: Miscellaneous Services
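To make the relationship between a four-digit code, its two-digit major group, and its division concrete, here is a small sketch. The function names are hypothetical, the ranges are copied from the division table above, and the code assumes SIC codes are stored as four-character strings so that leading zeros are preserved.

```python
from typing import Optional

# Division ranges copied from the table above (inclusive bounds).
DIVISION_RANGES = [
    (100, 999, "Agriculture, Forestry and Fishing"),
    (1000, 1499, "Mining"),
    (1500, 1799, "Construction"),
    (1800, 1999, "Not Used"),
    (2000, 3999, "Manufacturing"),
    (4000, 4999, "Transportation, Communications, Electric, Gas and Sanitary Services"),
    (5000, 5199, "Wholesale Trade"),
    (5200, 5999, "Retail Trade"),
    (6000, 6799, "Finance, Insurance and Real Estate"),
    (7000, 8999, "Services"),
    (9100, 9729, "Public Administration"),
    (9900, 9999, "Nonclassifiable"),
]


def _validated(sic_code: str) -> str:
    """Strip whitespace and check that the code is exactly four ASCII digits."""
    code = sic_code.strip()
    if len(code) != 4 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a four-digit SIC code, got {sic_code!r}")
    return code


def major_group(sic_code: str) -> str:
    """Return the two-digit major group, i.e. the first two digits of the code."""
    return _validated(sic_code)[:2]


def division(sic_code: str) -> Optional[str]:
    """Return the division name, or None if the code falls outside every range."""
    value = int(_validated(sic_code))
    for low, high, name in DIVISION_RANGES:
        if low <= value <= high:
            return name
    return None
```

For example, `major_group("8111")` gives `"81"` (Legal Services in the list above) and `division("8111")` gives `"Services"`.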

Our dataset for this experiment consisted of 320,000 small business merchants. We used a stratified split of 80-20% to create the training and test data. We also first converted all merchant names to lowercase and removed all punctuation as well as common stop words. A lot of merchant names contain a person’s name within them (e.g., Pete’s Pizza). We used a publicly available list of people’s names and replaced every occurrence of a person’s name with the tag "person".

"Pete’s Pizza" gets transformed into "person pizza"

We also replaced words that appeared fewer than n times in the dataset with the tag "rare_word" (n=3 in our experiments). Finally, we limited the number of words in a merchant name to eight. Most merchant names have far fewer than eight words; for those with more, we kept the first eight words and dropped the rest.
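The preprocessing steps can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the experiments: the stop-word set and name list are tiny hypothetical stand-ins, stripping a trailing possessive "'s" before the name lookup is an assumption, and so is counting word frequencies after the "person" replacement.

```python
import string
from collections import Counter
from typing import Iterable, List

STOP_WORDS = {"the", "of", "and", "at"}   # hypothetical stand-in for a stop-word list
PERSON_NAMES = {"pete", "joe", "krish"}   # hypothetical stand-in for a public name list
MIN_COUNT = 3                              # n in the text
MAX_WORDS = 8                              # maximum words kept per merchant name
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(name: str) -> List[str]:
    """Lowercase, strip punctuation and stop words, and tag person names."""
    words = []
    for raw in name.lower().split():
        raw = raw.replace("’", "'")        # normalize curly apostrophes
        if raw.endswith("'s"):             # assumption: drop the possessive suffix
            raw = raw[:-2]
        word = raw.translate(PUNCT_TABLE)
        if word and word not in STOP_WORDS:
            words.append("person" if word in PERSON_NAMES else word)
    return words


def preprocess(names: Iterable[str], min_count: int = MIN_COUNT,
               max_words: int = MAX_WORDS) -> List[str]:
    """Apply tokenization, rare-word replacement, and truncation to every name."""
    tokenized = [tokenize(name) for name in names]
    counts = Counter(word for words in tokenized for word in words)
    cleaned = []
    for words in tokenized:
        kept = [w if counts[w] >= min_count else "rare_word" for w in words]
        cleaned.append(" ".join(kept[:max_words]))
    return cleaned


sample = ["Pete’s Pizza", "Joe's Pizza", "Pizza Palace", "Krish’s Pizza Kitchen"]
cleaned = preprocess(sample)
```

In this toy sample, "palace" and "kitchen" each occur only once, so both become "rare_word", while "person" and "pizza" occur at least three times and are kept.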

Testing with four classes

As a proof-of-concept, we started with just four industry types:

legal services
automotive repair, services and parking
health services
eating and drinking places

The BoW model produced an accuracy of 88.72% on this reduced dataset. Its confusion matrix is shown in Figure 1, where we can see that eating and drinking places are correctly identified 91% of the time, automotive industries are correctly identified 86% of the time, and so on.

The LSTM model produced an overall accuracy of 88.53% on the test data. Its confusion matrix is shown in Figure 2. As expected, the LSTM does not show any noticeable advantage, because word ordering does not really matter with this dataset.

Figure 1: Confusion matrix for the BoW model on the reduced 4-class dataset.
Figure 2: Confusion matrix for the LSTM model on the reduced 4-class dataset.

Testing with all 83 classes

When tested with all 83 classes, the BoW model produced an overall accuracy of 59.26%. This is not surprising given the large number of classes and the relatively small size of our dataset. The confusion matrix for the BoW model is shown in Figure 3. In addition to overall accuracy, we also computed the top-5 accuracy for the BoW model, which came in at 81.57%. The LSTM model performed on par with the BoW model: its overall accuracy was 59.34% and its top-5 accuracy was 81.65%. Its confusion matrix is shown in Figure 4.
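Top-5 accuracy counts a prediction as correct when the true class is among the five classes with the highest predicted probability. A minimal sketch of the metric, with a hypothetical function name and made-up probabilities over three classes, could look like this:

```python
from typing import Sequence


def top_k_accuracy(prob_rows: Sequence[Sequence[float]],
                   true_labels: Sequence[int], k: int = 5) -> float:
    """Fraction of examples whose true class index is among the k most probable."""
    if len(prob_rows) != len(true_labels):
        raise ValueError("need exactly one true label per probability row")
    if not prob_rows:
        raise ValueError("need at least one example")
    hits = 0
    for probs, truth in zip(prob_rows, true_labels):
        # Class indices sorted by descending probability, truncated to the top k.
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        hits += truth in ranked[:k]
    return hits / len(true_labels)


# Made-up predictions for three merchants over three classes.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
labels = [0, 2, 0]

top1 = top_k_accuracy(probs, labels, k=1)  # ordinary accuracy
top2 = top_k_accuracy(probs, labels, k=2)
```

With k=1 the metric reduces to ordinary accuracy, which is why the top-5 figure is always at least as high as the overall accuracy.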

Figure 3: Confusion matrix for the BoW model on the larger dataset.
Figure 4: Confusion matrix for the LSTM model on the larger dataset.

Conclusions

As mentioned in the introduction, the industry type is a useful feature for many fraud detection and customer behavior models. Our hypothesis going into this research was that the representation produced by our approach should aid those models, and verifying this is part of our ongoing work. The solution presented here is also useful for entity resolution, where we have partial or incorrect information about the same merchant from two different data sources and need to identify the records as belonging to the same merchant. Here, knowing that both records belong to the same industry type increases the confidence of the match. Another application is marketing, where there are restrictions on marketing certain products to certain industry types. If a product should not be marketed to a particular industry type, and that type shows up in the top-5 predictions for a given merchant, then the system can decide not to market to that merchant.
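The marketing rule described above can be sketched as a simple check against the top-k predicted industry types. The function name, industry labels, and probabilities below are hypothetical and serve only to illustrate the rule.

```python
from typing import Dict, Set


def should_market(probs_by_type: Dict[str, float], restricted: Set[str],
                  k: int = 5) -> bool:
    """Return False if any restricted industry type is among the top-k predictions."""
    top_k = sorted(probs_by_type, key=probs_by_type.get, reverse=True)[:k]
    return not any(industry in restricted for industry in top_k)


# Made-up predicted distribution for one merchant.
predicted = {
    "health services": 0.40,
    "eating and drinking places": 0.30,
    "automotive repair": 0.25,
    "legal services": 0.05,
}
restricted_types = {"legal services"}

allowed_top2 = should_market(predicted, restricted_types, k=2)
allowed_top5 = should_market(predicted, restricted_types, k=5)
```

A larger k makes the rule more conservative: here the restricted type is outside the top 2 but inside the top 5, so the merchant is excluded only under the top-5 rule.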

What we presented here is work in progress, and we are exploring several options to improve the performance of the model. The full paper was presented at the KDD 2020 Machine Learning in Finance Workshop.

References

[1] NAICS Association. [n.d.]. SIC codes and counts by division. https://www.naics.com/sic-codes-counts-division
[2] Ralf C. Staudemeyer and Eric Rothstein Morris. 2019. Understanding LSTM – a tutorial into Long Short-Term Memory Recurrent Neural Networks. arXiv:1909.09586 [cs.NE]
[3] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. 2010. Understanding Bag-of-Words Model: A Statistical Framework. International Journal of Machine Learning and Cybernetics 1, 1 (2010), 43–52.

Behrouz Saghafi, Senior Software Engineer

Behrouz Saghafi is a Machine Learning Engineer in the ScaleML Plano team at the Center for Machine Learning at Capital One. He develops data-centric Machine Learning solutions for various business problems across different LOBs and deploys them as a service to be used at scale. He also leads the efforts on GraphML use cases and services.