Your starter kit guide to launching into machine learning

Learn how to launch your team into solving critical problems with machine learning.

Kellye Rogers

October 16, 2023|15 min read

The way machine learning (ML) solves thorny problems might seem like magic to some, but ML is accessible to anybody who’s eager to learn. If you’ve already learned the fundamentals and embarked on your ML journey, this article will help you find an entry point to get involved in your organization. We’ll cover the essentials you need for your journey and equip you with a framework you can use to launch your ML practice and solve complex problems with a team.

Unpacking your ML starter kit

To help get you started, I've created an ML starter kit that includes the items you’ll need to succeed in ML. I’ve included a glossary of terms later in the post to be clear on how I am using those terms in the context of the starter kit. You need a good team around you to get results, especially in the early days, and these tools will help your team hit the ground running and start making an impact in your organization.

Building a team of data scientists and engineers

The larger body of ML is getting a lot of attention right now; you might’ve heard leaders in your company discuss AI strategies. You might be in the process of reskilling for AI yourself or leading an upskill effort at your organization. You might even be expanding your team of engineers and data scientists already. That's a good start, but to achieve success in ML, you'll need everybody facing the right direction and working on the right tasks. This starter kit will help you.

Data scientists collect, process and analyze data, then make predictions based on it. Their goal is to build models that can extract a clear, bright signal of insight from a cacophony of noise. This is where your data scientists contribute most to the growth and development of your ML practice. If AI is on your roadmap and you have a team of data scientists involved, it’s their responsibility to create algorithms that learn patterns and correlations in the data. That work is at the heart of machine learning, and it’s what AI uses to build models that help people and organizations draw insight and make predictions.

It's this powerful outcome that makes data scientists so valuable: They can use AI as a tool to gain insight from data that informs the decisions the business makes.

Storytelling skills

We don't normally think about soft skills like storytelling as part of hard sciences like machine learning, but we should. Machine learning is a collaborative effort, and everyone you talk to needs to know what you're doing and why. Some of your teammates may not be technical experts, but they still need a clear understanding of what your model does and why that’s valuable to people. Even those who don’t understand data science, ML or AI need to relate and understand what problem you’re solving for them.

A good story explains a problem the person is already familiar with and helps them understand why you believe you can solve it for them. Because these things matter so much, the power of storytelling is inseparable from the power of your AI.

Consider someone sharing the same information with you in two different scenarios:

The first is getting an email in your cluttered inbox that says, "My algorithm will remove your spam."

The second is meeting with an engineer who tells you, "You're probably like most people who opened their inbox this morning and had to manually delete 100+ emails that you already knew you didn't want to read. I created a way to get the emails you actually want to read to the top of your inbox and get rid of all the stuff you don’t care about that wastes your time every day."

Both scenarios describe a spam filter algorithm, but I want to talk to the person in scenario #2. That person is more likely to get buy-in from leaders, teammates and end-users to develop and deploy their algorithm.

That’s the power of storytelling, and you’ll need it if you want people to actually use and value your machine learning model.

All the data!

Machine learning algorithms are only as good as the data they're fed, and the algorithms need plenty of data. Because of this, you and your team need to have strong data management skills. There has to be a method to how you gather your datasets. Without some kind of defined process, you're bound to get lost in the information sources you're dealing with.

Exactly where you find your datasets will be slightly different from task to task and across industries, but the general workflow looks roughly the same across disciplines. In order to train your ML model, you're going to need access to big datasets from diverse sources. Based on the details of the problems you're trying to solve and the predictions you're going for, what's the data you need? Brainstorm on what sources can be good for you and where you would look if you didn't have ML. More is better here, so don't be shy about adding to your data stockpile from whatever reliable source you have.

This is for everybody. Your company might offer dedicated data collection tools or resources, but every member of your team can and should participate. Product managers might look for data in places a data scientist wouldn’t consider, and data analysts might have unique ways to harvest data. I find myself doing research on many of the projects I work on too. The important thing to remember is that anybody can do it. Once you get the hang of solid data collection, it'll become second nature.

After compiling your datasets, prepare to answer these basic questions:

Where is the data coming from?
How can we consume the data?
What types of data are present?
What ontology is there?
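
One way to keep the answers to these questions in one place is a small inventory record per dataset. The sketch below is only an illustration: the `DatasetRecord` class, its field names and the sample entries are all hypothetical, and your own data platform may already give you something better.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetRecord:
    """Hypothetical inventory entry; each field answers one question above."""
    name: str
    source: str          # Where is the data coming from?
    access_method: str   # How can we consume the data?
    data_types: List[str] = field(default_factory=list)  # What types of data are present?
    ontology_group: str = "unclassified"                  # What ontology is there?


def group_by_ontology(records: List[DatasetRecord]) -> Dict[str, List[str]]:
    """Group dataset names under their ontology label."""
    groups: Dict[str, List[str]] = {}
    for record in records:
        groups.setdefault(record.ontology_group, []).append(record.name)
    return groups


# Made-up sample entries, for illustration only.
inventory = [
    DatasetRecord("inbox_messages", "mail server export", "nightly CSV", ["text", "timestamp"], "email"),
    DatasetRecord("user_spam_reports", "support tool", "REST API", ["categorical"], "email"),
    DatasetRecord("login_events", "auth service logs", "event stream", ["timestamp", "categorical"], "security"),
]

print(group_by_ontology(inventory))
```

Even a lightweight record like this makes it easier to spot datasets that nobody can explain or consume.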

Managing datasets in machine learning

It’s also important to know how to manage these large datasets. Here are some best practices for mapping your data ecosystem in a way that will make sense to you and others:

Find the biggest datasets first and name them. You need to have some sort of ontology.
As you start, try to build a map of your data ecosystem in a way that will make sense to you and to the others working with your information.
Use plain language and apply basic grouping as much as possible. Name files in ways that accurately describe the data.
Keep your classifications basic and intelligible, and always keep the general patterns of your map in mind.
If you have access to data platforms, learn them and understand important data storage concepts.

Once your data is collected and labeled, get ready to use it. Collect your sets using the various tools you have available, and lean on your data stewards for detailed insights into the process. Working with the people on your team who have their hands directly on the data can give you a deeper understanding of what you're working with.

ML signals

The point of ML analysis is to separate signals from noise. Noise is all of the chaotic data jumbled together, which is probably how it comes to you, while the signal is the valuable insight you're looking for. Think of the pinging dot on a radar screen. All the random fuzz around that ping is meaningless noise, while the bright green dot is the object you're looking for. You can think of a signal as the useful information your algorithm tries to identify, such as a customer's likely intent, places where you can economize or even activity that points to money laundering.

Noise is inherent to signal processing. It is introduced during the capture, storage, transmission or processing of the information. We measure noise with the signal-to-noise ratio, a single number that compares the strength of the desired signal to the strength of the background noise. The more noise there is, the lower the ratio and the lower the quality of the signal. An improving signal-to-noise ratio over time, along with increasing signal strength, is a sign of a maturing machine learning practice. That’s the goal!
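
To make the ratio concrete, here is a minimal sketch using one common convention from signal processing: mean signal power divided by mean noise power, optionally expressed in decibels. It assumes you can cleanly separate signal from noise, which real data rarely allows, and the function names and sample values are made up for illustration. How you define "signal" for an ML output is a choice your team has to make.

```python
import math


def signal_to_noise_ratio(signal, noise):
    """Return mean signal power divided by mean noise power."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(x * x for x in noise) / len(noise)
    return signal_power / noise_power


def to_decibels(ratio):
    """Express a power ratio in decibels."""
    return 10 * math.log10(ratio)


# Made-up sample values: a strong signal and much weaker background noise.
signal = [2.0, -2.0, 2.0, -2.0]
noise = [0.5, -0.5, 0.5, -0.5]

snr = signal_to_noise_ratio(signal, noise)
print(snr)                         # 16.0
print(round(to_decibels(snr), 2))  # 12.04
```

A higher number means the signal stands out more clearly from the background.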

Training models

You’re building ML models to spot signals in all that noise, but first, they need training. There’s a lot of research and literature about training models. It’s your team’s responsibility to keep up with what’s out there and with the latest developments. Know the differences between the types of models and learning, and choose the latest techniques that make the most sense for your use case.

Cloud-based ML tech

A lot of your tech will be cloud-based and ML-focused. The new generation of technology powering the shift to cloud storage and the modern data stack has given us a golden age of accessible, affordable, ready-to-use ML tools. For example, Amazon SageMaker is one of the most popular cloud-based machine learning platforms you're bound to run across as you learn machine learning; it enables developers (or aspiring ones) to create, train and deploy machine learning models in the cloud. In addition to a large suite of ML-focused tools, Google’s MLOps products like Vertex AI and AutoML help streamline deployments. Azure’s automation capabilities and responsible AI solutions can help you build fair, explainable and performant applications, all keystones for building trust.

Somebody to talk data science with

There's something exciting about being able to talk data with somebody about a new thing you're learning and passionate about. I found a great group of people to talk data science with early in my own journey around 2020, and bouncing ideas off these great friends was extremely helpful for me. With that in mind, I strongly recommend getting involved in groups (or starting your own enthusiasts club) and making those contacts as early as you can. Every conversation about your ML programs helps you learn more, and the sooner you start, the better prepared you'll be for your journey.

Launching your ML practice

The point here is that there’s a lot that can go into your starter kit, and you might not need all of it right away. Still, the more you can cram into your bag and familiarize yourself with, the better prepared you are for the next stages of your journey!

The ML starter kit is just that — it’s a starting point. Once you have your starter kit made up, your focus should shift to how to launch an ML practice that effectively solves critical problems for your enterprise and generates value for your business. You need an ML strategy and a framework to support it.

Rationalizing your portfolio of models

The first step is finding a way to rationalize what you’re working on in a way that everyone can understand — engineers, data scientists, product executives, operation managers and more. Since you can’t solve all the problems all at once, there has to be a way to focus on what matters most to your enterprise. This is especially difficult because you're probably going to accumulate a lot of models in the beginning, some of which won’t work very well. This is okay! As you progress, you'll evaluate your models and pare away the ones you use the least, along with the ones showing the least accurate output.

These are some of the questions I use with my team to help us evaluate our portfolio:

What problem are we solving, and what's the criticality?
How are users consuming the signal we produce, and how is it solving the problem?
How are our models integrating together?
What's the cost in cash and headcount? Is there a more efficient use of resources?
What's the development cycle time?

These questions will help you assess which models to keep using or stop altogether. As you cut, try to evaluate each model based on tangible measurables. It’s a good idea to judge your models’ success based on metrics such as signal-to-noise ratio, accuracy, precision and F-score. This will help you answer one of the most important questions: Which models belong in your portfolio?

Before discussing ways to mature your ML practice, let’s quickly unpack some of the measurables we can use to evaluate performance.

What is accuracy in machine learning?

Accuracy is the most basic way to measure a model’s success. It is simply the share of predictions the model gets right across the entire dataset.

What counts as good accuracy in machine learning is subjective. After doing this for a few years, my team and I consider anything higher than 70% to be great model performance. An accuracy between 70% and 90% is realistic and achievable, and we obviously strive for excellence at 90% or better, staying consistent with industry standards. Accuracy is a decent rough measure, but you need finer control to make more informed keep/discard decisions.
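
Here is a minimal sketch of the calculation, using a made-up spam dataset and a hypothetical `accuracy` helper. It also hints at why accuracy alone is only a rough measure: a model that never flags anything as spam still scores well when spam is rare.

```python
def accuracy(y_true, y_pred):
    """Return the share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up data: 5 spam emails (1) and 95 legitimate emails (0).
y_true = [1] * 5 + [0] * 95

# A "model" that labels every email as legitimate.
always_legitimate = [0] * 100

print(accuracy(y_true, always_legitimate))  # 0.95
```

The 95% looks excellent, yet this model never catches a single spam email, which is exactly the gap the next two measures are meant to expose.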

What is precision in ML?

Precision is a bit more complex than accuracy, and it’s paired with a second concept, recall. Precision is the share of the model’s positive predictions that are actually correct, while recall is the share of all the true positives in the dataset that the model actually found. Together, these are a decent measure of how good your model is at finding what it’s looking for and correctly marking it.
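
As a rough sketch, both measures come from counting true positives, false positives and false negatives. The `precision_recall` helper and the labels below are made up for illustration, with 1 standing for "spam".

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Fall back to 0.0 when a denominator is zero (an assumption of this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 5 spam emails, of which the model catches 3,
# plus 1 legitimate email wrongly flagged as spam.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision)  # 0.75
print(recall)     # 0.6
```

Here three of the four emails flagged as spam really were spam (precision), and three of the five spam emails were caught (recall).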

What is an F-score?

The F-score balances precision and recall in a single number. In its most common form, the F1 score, it is the harmonic mean of the two, so it is only high when both are high. The more general F-beta score lets you weight recall more or less heavily than precision, depending on your goals. If you’re coding in Python, you can compute F-score metrics with a lot of flexibility depending on the parameters you want to experiment with.
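
As an illustration, the general F-beta score can be computed directly from precision and recall. The `f_beta` helper below is a hand-rolled sketch with made-up inputs; libraries such as scikit-learn offer an equivalent `fbeta_score` function that works from labels instead.

```python
def f_beta(precision, recall, beta=1.0):
    """Return the F-beta score.

    beta > 1 weights recall more heavily and beta < 1 favors precision;
    beta = 1 gives the familiar F1 score.
    """
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Made-up inputs: precision of 0.75 and recall of 0.6.
print(round(f_beta(0.75, 0.6), 3))            # 0.667 (F1)
print(round(f_beta(0.75, 0.6, beta=2), 3))    # 0.625 (recall matters more)
print(round(f_beta(0.75, 0.6, beta=0.5), 3))  # 0.714 (precision matters more)
```

Which beta to use depends on the cost of each kind of mistake. A spam filter that hides real email may deserve a precision-leaning score, for example.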

In addition to these measurables, we’ll also use opportunity scoring to evaluate our models and help us prioritize. We do this with a four-point score from 1-Low to 4-Critical in each of these five areas:

Priority of the task
Opportunity and process improvement
Data quality
Criticality
Signal-to-noise ratio for outputs
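
A scoring sheet like this is easy to sketch in code. The example below assumes each area is weighted equally and the scores are simply added, which is my simplification rather than a fixed rule. The model names and scores are hypothetical.

```python
# The five areas listed above, each scored from 1 (Low) to 4 (Critical).
AREAS = (
    "priority",
    "opportunity_and_process_improvement",
    "data_quality",
    "criticality",
    "signal_to_noise",
)


def opportunity_score(scores):
    """Sum the five area scores (equal weighting is an assumption)."""
    for area in AREAS:
        if area not in scores:
            raise ValueError(f"missing score for {area}")
        if scores[area] not in (1, 2, 3, 4):
            raise ValueError(f"{area} must be scored from 1 to 4")
    return sum(scores[area] for area in AREAS)


# Hypothetical portfolio with made-up scores.
portfolio = {
    "intent_predictor": {
        "priority": 2, "opportunity_and_process_improvement": 3,
        "data_quality": 2, "criticality": 2, "signal_to_noise": 2,
    },
    "spam_filter": {
        "priority": 4, "opportunity_and_process_improvement": 3,
        "data_quality": 3, "criticality": 4, "signal_to_noise": 3,
    },
}

# Rank models from highest to lowest total score.
ranked = sorted(portfolio, key=lambda name: opportunity_score(portfolio[name]), reverse=True)
for name in ranked:
    print(name, opportunity_score(portfolio[name]))
```

With five areas the totals range from 5 to 20, which gives you a quick way to order the portfolio before a deeper discussion.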

While this evaluation takes time, it’s crucial for long-term success. At the same time, you should focus on finding the win. In whatever domain you work in, identify a short-term, low-risk win to stake your team’s reputation on and prove that you can deploy ML value. Once you’ve proved value to stakeholders and evaluated, rationalized and prioritized your models, you can begin maturing your ML strategy.

Maturing your ML practice

Maturing your ML practice ensures you work on the problems that benefit most from your solutions, and it helps you get a better cost-to-value ratio. As both your ML practice and your models mature, you’ll zero in on much more efficient ways to solve problems.

Early on, you’re likely to be working with broad problem definitions, since your models haven’t found the most efficient constraints yet. You may not have the information you need to assess signal strength, so it can be hard to know how reliable your results are. There’s typically little feedback or assessment of process integration, no cost/benefit analysis yet and little to no formal leadership approval as you work through the processes.

A mature environment works on a clear and well-documented problem statement. It has a solid baseline assessment of data and pipeline. At this stage, you should have your process mapped and significant stakeholder use documented, as well as a reliable cost/benefit analysis to help guide your efforts. Leadership should be a formal part of the process now, as should an iterative stakeholder feedback loop to make continuous improvements going forward.

"Improved model quality" image provided by Ming Waters, Sr. Director, Data Engineering at Capital One, 2023.

Machine learning process flow

All this improvement helps ensure you're working on the most productive parts of your process flow. To keep this up over time, I’ve outlined a six-step cycle that you can use over and over to continue refining your ML processes:

Identify the problem: What signals do we need to solve the problem?
Define success: Do we want an enhanced signal, alerting, risk mitigation or increased confidence?
Validate signals and data: Is there stable and validated data to build high-quality models?
Model development and performance validation: What are the model validation metrics? Have the stakeholders had a chance to review? Where are we falling on the confusion matrix?
Process integration and validation: Are we integrating these processes successfully? Are we getting user feedback and KPI validation for what we're doing?
Monitor: Keep an eye on model performance as your team continues to refine.
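
The confusion matrix mentioned in the model development step is just a count of the four possible outcomes for a binary model. Here is a minimal sketch with a hypothetical `confusion_matrix` helper and made-up labels, where 1 stands for the positive class.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary model."""
    counts = Counter()
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return {key: counts[key] for key in ("tp", "fp", "fn", "tn")}


# Made-up labels for illustration.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

print(confusion_matrix(y_true, y_pred))
# {'tp': 3, 'fp': 1, 'fn': 2, 'tn': 4}
```

Reviewing these four counts with stakeholders shows which kind of error the model makes most often, and whether that error is one the business can live with.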

When one cycle is complete, decide whether to retire your model or run it through another cycle of retraining for continuous improvement.

Expanding your ML portfolio

The last step in our progression from novel use to a mature ML practice is to expand the portfolio. Scalability is one of the great advantages of ML models, but to grow fast, you need the good news to spread beyond your initial teams. Ask your company to set up a training program to get an ML-trained pipeline of colleagues established. Consider establishing a community of practice. As you learn more, your models will get more sophisticated and useful, which will cause them to grow along with you. That growth requires more people, and if nothing else, you should have fun training new data geeks to bounce ideas off.

You might also discover that your starting models for delivery and ownership need to shift. This is normal, and the changes occurring here can improve your success in a more mature model. You could, for instance, place your product team in a leadership role where they’re responsible for stating the problem your ML needs to solve and then defining the criteria for what counts as success.

Your own ML team can then work up the initial ask from the product team and build out the process mapping and the various model development elements. Process integration should also be an important part of the ML team's role.

Finally, you can share build and integration responsibilities with the ownership team, which should be working closely with your ML team. You can even shift responsibilities for monitoring and maintenance to them, which frees up your team to focus on core competencies, process integration and expanding the portfolio.

Expanding machine learning use cases

As you grow your ML practice, new uses for your ML models will inevitably arise. Whether you've just started to learn machine learning or you're already applying transfer learning to train models, the framework I’ve covered will help you systematize your approach to problem-solving with ML.

Using this framework, you can build your ML team, develop your first models, get started assessing their accuracy and nurture your most productive models to maturity. Eventually, even if it’s your first time using ML for automated problem solving, you will have developed a grasp of what machine learning can do for you, what problems it’s good at solving and how to efficiently deliver consistent, reliable results across multiple teams.

Glossary of terms

Resilience - The ability of systems to resist, absorb and recover from or adapt to an adverse occurrence during operation that may cause harm, destruction or loss of ability to perform mission-related functions.

Site Reliability Engineering (SRE) - An approach to perform operations for continuously delivered cloud-based applications.

Technology operations - The process of implementing, managing, delivering and supporting IT services to meet the business needs of internal and external users.

Artificial intelligence - The theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making and translation between languages.

Machine learning - The use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data.

Machine learning model - A program that can find patterns or make decisions from a previously unseen dataset.

Dataset - A collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer.

Kellye Rogers, Director, product management, cloud & productivity engineering

Kellye leads Capital One's data & insight product team in stability engineering and operations. Previously, she led tech transformations at MSNBC, Estee Lauder and Northwell Health.