Capital One Data Engineer Madison Schott recently came to us with an idea: create a series of blog posts introducing machine learning algorithms in a way that balances technical detail with approachable examples and language.

“I wanted to write introductions to different machine learning algorithms because I myself was looking for articles like these. I wanted to read something that was technical yet easy enough for beginners to understand. A lot of articles I read about these algorithms rambled on with technical jargon or lacked real world examples.” - Madison Schott

Originally published on our Medium publication, the series covers the following algorithms:

K-Nearest Neighbors (KNN)

If you’re familiar with machine learning and the basic algorithms used in the field, then you’ve probably heard of the k-nearest neighbors algorithm, or KNN. What is KNN? KNN is a model that classifies a data point based on the points that are most similar to it. It uses labeled training data to make an “educated guess” about how an unclassified point should be classified. Read more on K-Nearest Neighbors (KNN) on Medium.
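The idea fits in a few lines. Below is a minimal sketch in plain Python (not taken from the Medium article; the point set and labels are illustrative): find the k closest training points to a query and let them vote.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote of its k nearest training points.
    `train` is a list of ((x, y), label) pairs."""
    # Sort training points by Euclidean distance to the query point
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))
    # Take the labels of the k closest points and vote
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

points = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
          ((8, 8), "blue"), ((8, 9), "blue"), ((9, 8), "blue")]
print(knn_classify(points, (2, 2)))  # red
```

The choice of k matters: too small and the model is noisy, too large and distant points outvote the true neighbors.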

K-Means Clustering Algorithm

K-means clustering is another basic technique often used in machine learning. The k-means algorithm assigns each data point to the cluster whose mean (centroid) it is closest to, then recomputes each cluster’s mean and repeats, producing more accurate groupings over time. Since you must first choose the number of clusters, k, it is essential that you understand your data well enough to do this. Read more on K-Means Clustering on Medium.
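The assign-then-recompute loop can be sketched in plain Python (an illustrative toy implementation, not the article’s code): pick k starting centroids, assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat.

```python
import math
import random

def kmeans(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each point to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Move each centroid to the mean of its assigned points
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(pts, k=2)
```

On well-separated data like this, a handful of iterations is enough for the centroids to settle on the two groups.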

Naive Bayes Classifiers

Naive Bayes classifiers are a group of machine learning algorithms that all use Bayes’ Theorem to classify data points. They are called “naive” because they each assume the features of a data point are completely independent of one another. Naive Bayes classifiers use the probabilities of certain events being true — given that other events are true — in order to make predictions about new data points. Read more on Naive Bayes Classifiers on Medium.
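That “naive” independence assumption is what makes the math simple: you multiply per-feature probabilities together. Here is a small sketch in plain Python (the spam/ham training set and function names are illustrative, not from the article), using add-one smoothing so unseen words don’t zero out a score.

```python
from collections import Counter

# Tiny training set: each message is a set of words with a label
data = [({"win", "money"}, "spam"), ({"win", "prize"}, "spam"),
        ({"meeting", "today"}, "ham"), ({"lunch", "today"}, "ham")]
vocab = set().union(*(words for words, _ in data))
labels = Counter(label for _, label in data)

def likelihood(word, label):
    # P(word | label) with add-one (Laplace) smoothing
    hits = sum(1 for words, l in data if l == label and word in words)
    return (hits + 1) / (labels[label] + 2)

def classify(words):
    def score(label):
        # "Naive" step: multiply per-word probabilities as if independent
        p = labels[label] / sum(labels.values())  # prior P(label)
        for w in vocab:
            p *= likelihood(w, label) if w in words else 1 - likelihood(w, label)
        return p
    return max(labels, key=score)

print(classify({"win", "money"}))  # spam
```

Despite the unrealistic independence assumption, this kind of classifier works surprisingly well for text problems like spam filtering.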

Random Forest Algorithm

We all work through decision-making processes like this many times, every single day. In the machine learning world this process is modeled as a decision tree. You start with a node, which then branches to another node, repeating this process until you reach a leaf. A node asks a question in order to help classify the data. A branch represents the different possibilities that node could lead to. A leaf is the end of a decision tree, or a node that no longer has any branches. A random forest combines many decision trees, each trained a little differently, and lets them vote on the final classification. Read more on Random Forest Algorithm on Medium.
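The node/branch/leaf structure maps directly onto code. Below is a hand-built toy tree in plain Python (the iris-style feature names and thresholds are illustrative assumptions, not from the article): each internal node asks a yes/no question, and walking the branches ends at a leaf holding the class.

```python
# Each internal node asks a yes/no question; branches lead to more
# nodes; leaves (plain strings here) hold the final class.
tree = {
    "question": lambda row: row["petal_length"] < 2.5,
    "yes": "setosa",                       # leaf
    "no": {                                # another node
        "question": lambda row: row["petal_width"] < 1.8,
        "yes": "versicolor",
        "no": "virginica",
    },
}

def predict(node, row):
    # Walk from the root until we hit a leaf
    if isinstance(node, str):
        return node
    branch = "yes" if node["question"](row) else "no"
    return predict(node[branch], row)

print(predict(tree, {"petal_length": 4.5, "petal_width": 1.3}))  # versicolor
```

A random forest would build many trees like this on random subsets of the data and features, then take the majority vote of their predictions.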

Artificial Neural Networks

ANNs have been around since 1943, when they were first introduced in Warren McCulloch and Walter Pitts’ paper “A Logical Calculus of the Ideas Immanent in Nervous Activity”, describing how networks of artificial neurons could be used to solve various logic problems. However, they saw little practical use until backpropagation was popularized decades later. This technique allows the artificial neurons to adjust themselves when the solution they come up with is not the solution that was expected. Read more on Artificial Neural Networks on Medium.
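That “adjust when the answer is wrong” idea can be shown with a single artificial neuron. The sketch below is a minimal plain-Python illustration (the training set and learning rate are illustrative assumptions, not from the article): the neuron’s error is used to nudge its weights toward the expected output, the core step that backpropagation extends across whole networks.

```python
import math
import random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def train(samples, epochs=5000, lr=1.0, seed=0):
    """Train one sigmoid neuron by gradient descent."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(3)]  # two weights + bias
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = sigmoid(w[0] * x1 + w[1] * x2 + w[2])
            err = target - out
            # Scale the error by the sigmoid's slope, then adjust weights
            grad = err * out * (1 - out)
            w[0] += lr * grad * x1
            w[1] += lr * grad * x2
            w[2] += lr * grad
    return w

# Learn logical OR, which one neuron can represent
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train(data)
predict = lambda x1, x2: round(sigmoid(w[0] * x1 + w[1] * x2 + w[2]))
```

A full network stacks layers of such neurons and propagates the error backward through each layer, which is where backpropagation gets its name.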

“I hope these articles help people understand how different algorithms work and how they make certain decisions. I want anyone, with tons of technical experience or none at all, to be able to read these and understand what they are doing and the difference between the algorithms.”  - Madison Schott