Resource management with Snowflake, Dask & PyTorch

Development of CharCNN models with Capital One’s Tech interns – Part 2

Michelle Kelman, Xaver Davey, Sonu Chopra Khullar, Mohammadamin (Amin) Dashti

June 15, 2023 |9 min read

We all know that training machine learning models requires large amounts of data, but what if the machines available to us can’t handle all the data and computations we need to perform? After all, you can only fit so many shots of espresso in a cappuccino mug without it overflowing. It may seem that training data must fit on the machine used for training, however distributed machine learning provides a workaround!

Data overflowing ram by example of coffee overflowing a coffee cup

Photo by Ephraim Mayrena on Unsplash (https://unsplash.com/photos/uyC4gPZXfKE)

Distributed machine learning allows training on hundreds of millions of data points with limited GPU / CPU resources. The model can handle data and computations that are larger than available resource memory through parallel computations and out-of-memory support.

While this is slower, it can be the difference between successfully working through epochs and training failure. Do note that you should avoid training with data that is multiple times larger than your RAM as this will be very slow.

However, if you are looking to experiment with out-of-memory training, you have come to the right place. Whether you have years of experience with machine learning or are just getting started, fret not, because small experience does not mean you cannot handle big data!

In this article, we will discuss the development of a distributed Character CNN model using three primary Python-supported technologies: Snowflake, PyTorch, and Dask, although these technologies can be used for just about any type of artificial neural network. While using these technologies isn’t easy, they aren’t too difficult to pick up, especially if you have experience with Python.

Before diving into each of these technologies, a few notes:

General SQL knowledge (a query language for relational databases) is helpful with Snowflake.
To replicate our model, follow along by putting the snippets into a .py file or by using a Jupyter notebook (.ipynb) and making modifications as necessary (check out the comments provided).
If your training data is many multiples times larger than RAM, consider using AWS or other cloud platforms to rent a machine with adequate specifications to train your machine learning model. Depending on your model size, this can be a very affordable option.

What is a CharCNN?

CharCNN model with feature mapping and classification

A Character Convolutional Neural Network (CharCNN) is a machine learning model used to classify text by analyzing its features.

The model can be broken down into four essential components:

Input: sample text passed into the model for training and testing
Convolutional and pooling layers: identify and map features of the text
Fully-connected layers: analyze the features of the text and classify them
Output: probabilistic distribution indicating the likelihood that the input contains or is one of the classification types

This model is only a small part of our out-of-memory data pipeline.

Snowflake: Getting the data

Snowflake is a cloud computing based company that offers data warehouses for big data. Our data pipeline begins with a SQL query to one of these warehouses.

There are two options for reading from Snowflake into Python code: snowflake-connector-python and dask-snowflake connector.

For our use case, we used the dask-snowflake connector. The technology is fairly new. However, with just a few lines of code, it is possible to read from Snowflake right into a Dask object, saving time and effort.

In the case that this method doesn't work for your use case, please consider checking the other alternative methods like Snowflake Python Connector, detailed here.

Additionally, if you aren’t even reading data from Snowflake, Dask offers other ready-to-go reading methods.

    # query from Snowflake using dask-snowflake connector
ddf = dask_snowflake.read_snowflake(query, conn_info)

# other query options (in the case you are not querying from Snowflake)
# ddf = dask.dataframe.read_csv(...)
# ddf = dask.dataframe.read_sql(...)
# ddf = dask.dataframe.read_parquet(...)

Dask: preprocessing the data

Dask is an open-source Python library for parallel and out-of memory computing leveraging local and cloud resources. Dask supports parallel computing using dask-gateway clusters with worker nodes that can perform multithreaded parallel computations, as well as out-of-memory computing using lazy execution and Dask Futures.

What is lazy execution?

Lazy execution is the concept of evaluating calculations only when needed. Dask uses delayed functions instantiated with the delayed keyword to implement lazy evaluation.

To then evaluate these functions, one of two functions can be used:

.compute(): executes the computations specified by a delayed function and returns results as a single solved object in memory
.persist(): executes the computations specified by a delayed function and returns the results in the type of the original Dask object running on the Dask cluster

What are Dask Futures?

Dask Futures allow asynchronous tasks to be run across a distributed cluster of machines. The tasks run in the background as needed, and can be distributed throughout multiple workers or threads.

Our implementation uses a combination of lazy execution and Dask Futures throughout the data loading, vectorizing, and tensorizing processes. When collecting data, we began with a Dask DataFrame.

However, we want to end the preprocessing step with a Dask Array as these are most similar in format and easy to convert to PyTorch Tensors, the backbone of PyTorch data representation.

While your needs may vary here based on the format of your raw data and your needs in how the data must be fed into your model, Dask’s filtering and mapping functions prove to be helpful throughout the preprocessing step.

    # filter any unwanted data from the initial Dask DataFrame
ddf = ddf.loc[...]

# map vectorization and tensorization functions to the X and Y columns of the DataFrame, returns a Dask Series (for other return types see Dask DataFrame.map_partitions() documentation)
xds = ddf.map_partitions(vectorize_x, meta=(None, 'object'))
yds = ddf.map_partitions(vectorize_y, meta=(None, 'object'))
# our vectorize functions make use of Dask DataFrame.apply() to perform elementwise operations

# convert the Dask Series to a Dask Array
xda = xds.to_dask_array()
yda = yds.to_dask_array()

When seeking to use these functions on out-of-memory data, do not run .compute() on any of the Dask objects you have defined. Computing the objects will evaluate the operations you have defined on all of your data at once

If your device doesn’t have enough memory to handle all the data and computations performed, your process may get killed. For purposes of testing, either run .compute() on small samples of data or on individual data points accessed through iterating through object indices.

PyTorch: Training the model

PyTorch is an open-source machine learning framework based on the Torch library. PyTorch can be used in combination with Dask, using the dask-ml library and Skorch, and CUDA for optimal training. Once you have a defined PyTorch artificial neural network, training can begin! In our case, we implemented a distributed CharCNN model, as described earlier.

One of the main changes that should be made to a standard model to allow for out-of-memory training is to customize the Pytorch Dataset and Pytorch DataLoader classes to support Dask lazy execution. More specifically, a custom __getitem__ method should be defined for the Dask-Pytorch Dataset, and the custom Dask-Pytorch DataLoader must be designed to support lazy data loading. At this point, .compute() will likely have to be used on individual, lazily-executed data points.

Evaluating our delayed data at this point allows us to retrieve data for training which is immediately passed to the model, making room for new data to be evaluated so computational memory doesn’t exceed available RAM. Of course, this will be slower than in memory computations, but is one of the better options for training with data larger than memory.

GPU training

When training the actual model, there are multiple training device options

The model can be trained locally on CPU
Remotely on GPU
Or on multiple GPUs

There are multiple benefits to using a multi-core GPU over CPU, especially when handling out-of-memory data and computations:

What’s next?

Although this software pipeline is functional, it isn’t the most efficient. Significant improvements can be made to reduce data loading, preprocessing, and training times. Thankfully, though, Snowflake, Dask, and PyTorch each have their own methods for accomplishing a streamlined infrastructure.

However, the fact that our out-of-memory machine learning can already be supported is promising for future research. Maybe in the future, it’ll be possible to fit one thousand shots of espresso in a single cappuccino cup.

Acknowledgements

We’d like to thank the Bank Architecture AI Team for all their support during our internships: Sonu Chopra-Khullar, Kenneth Au, Andrew Lin, Amin Dashti, Apoorva Rautela, Eddy Borera, Padma Maddi, and Sherry Stricko. An extra shout out to Sonu, Kenneth, and Andrew for mentoring us so closely over the summer and all the wisdom they provided over the course of our projects.

Another thank you to Brennan Gallamoza and Calvin Hu for completing our intern pod and making our summers so fun! An additional thank you to Shira Weissmann for helping us get this article published and to Dan Kerrigan and Michael Grogan for all their Dask expertise. We’d also like to thank the TIP Team for assigning us to the most amazing intern team we could have asked for!

Check out Part 1 of this blog series: Emerging vector databases: A comprehensive introduction.

Michelle Kelman, Xaver Davey, Sonu Chopra Khullar, Mohammadamin (Amin) Dashti, Retail Bank Machine Learning Architecture Team

Michelle and Xaver were part of Capital One’s Technology Intern Program (TIP), focusing on machine learning. At the time of their internship, Michelle was a rising senior at the University of Texas, Dallas studying computer science, and Xaver was a rising senior at the University of Wisconsin, Madison studying computer science. Sonu is a Director of Machine Learning at Capital One. Amin is a manager of Data Science at Capital One.

Resource management with Snowflake, Dask & PyTorch

Development of CharCNN models with Capital One’s Tech interns – Part 2

What is a CharCNN?

Snowflake: Getting the data

Dask: preprocessing the data

What is lazy execution?

What are Dask Futures?

PyTorch: Training the model

GPU training

The model can be trained locally on CPU

Remotely on GPU

Or on multiple GPUs

What’s next?

Explore #LifeAtCapitalOne

Explore #LifeAtCapitalOne

Related Content

Emerging vector databases: A comprehensive introduction

10 common Machine Learning mistakes and how to avoid them

Logistic Regression for Machine Learning