What is a Cluster? An Overview of Clustering in the Cloud


Computer clusters, and in particular Kubernetes clusters, have seen a substantial rise in adoption in the last decade. Startups and tech giants alike are leveraging cluster-based architectures to deploy and manage their applications in the cloud. But what is a cluster? What is the relationship between clusters and containers? And why might you want to consider using a cluster to host your own application?

In this post, I’ll provide an overview of computer clusters, lay out the advantages and disadvantages of using a cluster in place of a single machine, and describe how enterprises are using clusters today.

What is a Cluster?

At a high level, a computer cluster is a group of two or more computers, or nodes, that run in parallel to achieve a common goal. This allows workloads consisting of a high number of individual, parallelizable tasks to be distributed among the nodes in the cluster. As a result, these tasks can leverage the combined memory and processing power of each computer to increase overall performance.

To build a computer cluster, the individual nodes should be connected in a network to enable internode communication. Computer cluster software can then be used to join the nodes together and form a cluster. The cluster may have a shared storage device, local storage on each node, or both. Typically, at least one node is designated as the leader node and acts as the entry point to the cluster. The leader node may be responsible for delegating incoming work to the other nodes and, if necessary, aggregating the results and returning a response to the user.
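To make the leader node's role concrete, here is a minimal Python sketch of the delegate-and-aggregate pattern. It is an illustration only: threads stand in for worker nodes, and the names leader_node and worker_node are hypothetical, not part of any real cluster software.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_node(chunk):
    """Stand-in for a worker node: sums the numbers it is handed."""
    return sum(chunk)


def leader_node(numbers, node_count):
    """Split the work, delegate one chunk per node, and aggregate the results."""
    # Deal the items out round-robin so every node gets a similar share.
    chunks = [numbers[i::node_count] for i in range(node_count)]
    # Threads simulate the nodes here (an assumption for this sketch);
    # a real cluster would send each chunk to a node over the network.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partial_sums = list(pool.map(worker_node, chunks))
    # Aggregate the partial results into a single response for the user.
    return sum(partial_sums)


if __name__ == "__main__":
    print(leader_node(list(range(1, 101)), node_count=4))
```

The caller only ever talks to leader_node and never needs to know how many workers sit behind it.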

Ideally, a cluster functions as if it were a single system. A user accessing the cluster should not need to know whether the system is a cluster or an individual machine. Furthermore, a cluster should be designed to minimize latency and prevent bottlenecks in node-to-node communication.
 

Types of Cluster Computing

Computer clusters can generally be categorized as three types:

  1. Highly available or fail-over
  2. Load balancing
  3. High performance computing

As you will see in the next section, the three types of clusters closely align with the potential benefits that clusters offer. When applicable, I’ll reference the related cluster type after explaining the particular benefit and how a cluster provides it. It’s also important to note that a cluster can be more than one of these three types. For example, a cluster hosting a web server is likely to be both a highly available and load balancing cluster.

Four Key Benefits of Cluster Computing

Cluster computing provides four key benefits: high availability (through fault tolerance and resilience), load balancing, scaling, and improved performance. Let’s expand upon each of these benefits and examine how clusters enable them.
 

1. High Availability

There are a few important terms to remember when discussing the robustness of a system:

  • Availability - the accessibility of a system or service over a period of time, usually expressed as a percentage of uptime during a given year (e.g. 99.999% availability, or five 9’s)
  • Resilience - how well a system recovers from failure
  • Fault tolerance - the ability of a system to continue providing a service in the event of a failure
  • Reliability - the probability that a system will function as expected
  • Redundancy - duplication of critical resources to improve system reliability
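To put the availability percentages above into perspective, this short Python sketch converts an availability target into a yearly downtime budget. The function name allowed_downtime_minutes is my own, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

Under these assumptions, five 9’s leaves a budget of roughly 5.26 minutes of downtime per year.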

An application running on a single machine has a single point of failure, which makes for poor system reliability. If the machine hosting the application goes down, there will almost always be downtime while the infrastructure recovers. Maintaining a level of redundancy, which helps improve reliability, can reduce the amount of time an application is unavailable. This can be achieved by preemptively running the application on a second system (that may or may not be receiving traffic) or by keeping a cold system (as in, not currently running) preconfigured with the application. These configurations are known as active-active and active-passive, respectively. When a failure is detected, an active-active system can immediately fail over to the second machine, while an active-passive system can only fail over once the second machine is live.

Computer clusters consist of more than one node running the same process simultaneously, and are therefore active-active systems. Active-active systems are usually fault-tolerant because the system is inherently designed to handle the loss of a node. If a node fails, the remaining node(s) are ready to take in the workload of the failed node. With that said, a cluster that requires a leader node should run a minimum of two leader nodes in an active-active configuration. This can prevent the cluster from becoming unavailable if a leader node fails.
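A quick back-of-the-envelope calculation shows why redundancy helps. The sketch below rests on two simplifying assumptions that real systems rarely meet exactly: nodes fail independently of one another, and a single surviving node is enough to keep the service up. The function name cluster_availability is hypothetical.

```python
def cluster_availability(node_availability, node_count):
    """Probability that at least one of `node_count` independent nodes is up.

    `node_availability` is a fraction between 0 and 1 (e.g. 0.99 for 99%).
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    # The cluster is only down when every node is down at the same time.
    all_down = (1 - node_availability) ** node_count
    return 1 - all_down


if __name__ == "__main__":
    for nodes in (1, 2, 3):
        print(nodes, cluster_availability(0.99, nodes))
```

With these assumptions, two nodes that are each 99% available combine to roughly 99.99% availability. Correlated failures, such as a shared power supply or network, would lower that figure.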

In addition to being more fault tolerant, clusters can improve resilience by making it easy for recovered nodes to rejoin the system and return the cluster to its optimal size. Any amount of system downtime is costly to an organization and can create a poor customer experience, so it is critical that a system be resilient and fault tolerant in the event of a failure. Using a cluster can improve the resilience and fault tolerance of the system, allowing for higher availability. Clusters with these characteristics are called “highly available” or “fail-over” clusters.
 

2. Load Balancing

Load balancing is the act of distributing traffic across the nodes of a cluster to optimize performance and prevent any single node from receiving a disproportionate amount of work. A load balancer can be installed on the leader node(s) or provisioned separately from the cluster. By performing periodic health checks on each node in the cluster, the load balancer is able to detect if a node has failed, and if so it will route incoming traffic to the other nodes in the cluster.
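The health-check behavior described above can be sketched in a few lines of Python. This is a simplified round-robin balancer, not the algorithm of any particular product; the class RoundRobinBalancer and the is_healthy probe are names I made up for illustration.

```python
class RoundRobinBalancer:
    """Round-robin load balancer that skips nodes marked unhealthy."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)  # assume every node starts healthy
        self._next = 0

    def run_health_checks(self, is_healthy):
        """Refresh the healthy set; `is_healthy` is a caller-supplied probe."""
        self.healthy = {node for node in self.nodes if is_healthy(node)}

    def route(self):
        """Return the next healthy node, or raise if none are available."""
        # Try each node at most once, starting where the last request left off.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["node-1", "node-2", "node-3"])
    balancer.run_health_checks(lambda node: node != "node-2")  # node-2 is down
    print([balancer.route() for _ in range(4)])
```

In a real deployment the probe would typically be a periodic network request to each node rather than a lambda.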

Although a computer cluster does not natively load balance, it enables load balancing to be performed across its nodes. This configuration is referred to as a “load balancing” cluster, and is often simultaneously a highly available cluster.
 

3. Scaling

There are two classifications of scaling: vertical and horizontal. Vertical scaling (also referred to as scaling up/down) involves increasing or decreasing the resources allocated to a process, such as the amount of memory, number of processor cores, or available storage. Horizontal scaling (scaling out/in), on the other hand, is when additional, parallel jobs are run on the system.

When maintaining a cluster, it's important to monitor resource usage and scale to ensure cluster resources are being appropriately utilized. Luckily, the very nature of a cluster makes horizontal scaling straightforward: the administrator simply needs to add or remove nodes as necessary, keeping in mind the minimum level of redundancy needed to ensure the cluster remains highly available.
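As a rough illustration of that scaling decision, the Python sketch below sizes a cluster with a simple proportional rule and a redundancy floor. The rule, the function name desired_node_count, and the default floor of two nodes are all assumptions made for this example, not a prescribed method.

```python
import math


def desired_node_count(current_nodes, current_utilization, target_utilization, min_nodes=2):
    """Size the cluster so average utilization lands near the target.

    Utilizations are whole-number percentages (e.g. 90 for 90%).
    `min_nodes` is the redundancy floor that keeps the cluster highly available.
    """
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    # Proportional rule: total load stays the same, so spread it over enough
    # nodes that each one runs at or below the target utilization.
    needed = math.ceil(current_nodes * current_utilization / target_utilization)
    # Never scale in below the redundancy floor.
    return max(needed, min_nodes)


if __name__ == "__main__":
    print(desired_node_count(4, 90, 60))  # busy cluster: scale out
    print(desired_node_count(4, 10, 60))  # idle cluster: scale in, but keep the floor
```

Keeping the floor separate from the utilization math means a quiet period can never scale the cluster down to a single point of failure.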


4. Performance

When it comes to parallelization, clusters can achieve higher performance levels than a single machine. This is because they’re not limited to the processor cores, memory, and other hardware of one machine. Additionally, horizontal scaling can maximize performance by preventing the system from running out of resources.

“High performance computing” (HPC) clusters leverage the parallelizability of computer clusters to reach the highest possible level of performance. A supercomputer is a common example of an HPC cluster.

Clustering Challenges

The most obvious challenge clustering presents is the increased complexity of installation and maintenance. An operating system, the application, and its dependencies must each be installed and updated on every node. This becomes even more complicated if the nodes in the cluster are not homogeneous. Resource utilization for each node must also be closely monitored, and logs should be aggregated to ensure software is behaving correctly. Additionally, storage becomes more difficult to manage: a shared storage device must prevent nodes from overwriting one another, and distributed data stores have to be kept in sync.

Containerizing the application and hosting the cluster in the cloud, as I’ll explain next, can help alleviate some of these challenges.

Clusters in the Cloud

Before the public cloud, computer clusters consisted of a set of physical machines communicating via a local area network. Building a computer cluster involved thoughtful planning to ensure it would meet present and future requirements, as scaling a physical cluster could take weeks or even months. Also, on-prem or self-managed clusters weren’t resilient in the event of regional disasters, so there had to be other safety measures in place to ensure redundancy. For example, leveraging a second energy supplier and hosting nodes across two physical locations would prevent supplier- or region-specific blackouts from taking down the cluster.

So, what is a cluster in cloud computing? Simply put, it is a group of nodes hosted on virtual machines and connected within a virtual private cloud. Using the cloud allows for much of the overhead involved in setting up a cluster to be entirely bypassed. Virtual machines can be provisioned on demand, allowing clusters to scale in minutes. Infrastructure can be quickly updated, providing the flexibility required for a cluster to adapt to changing needs. And finally, deploying nodes across multiple availability zones and regions can improve user latency and cluster resilience.

In short, clustering in the cloud can greatly reduce the time and effort needed to get up and running while also providing a long list of services to improve the availability, security, and maintainability of the cluster.

Containers and Their Relationship with Clusters

Containers have eliminated many of the burdens of deploying applications. Differences between local and remote environments can largely be ignored (with some exceptions, such as CPU architecture), application dependencies are shipped within the container, and security is improved by isolating the application from the host. The use of containers has also made it easier for teams to leverage a microservice architecture, where the application is broken down into small, loosely coupled services. But what do containers have to do with computer clusters?

Here's a common scenario. Your organization is developing a simple web-based application. The front-end and back-end are built as microservices, running independently from each other as standalone containers and communicating over HTTPS. Now it's time to deploy your application.

The first solution you try might be to provision a virtual machine on the cloud to run your containers. This works, but there are a number of drawbacks. Performance is limited by the resources provisioned to the VM, and scaling the application is likely to be difficult. Additionally, if the VM or the hardware hosting the VM fails, the application will be unavailable until either a new machine is provisioned or traffic is routed to a fail-over server.

Fortunately, a cluster solves both of these issues.

Deploying containerized applications across the nodes of a cluster can substantially improve the availability, scalability, and performance of your web application. Running multiple containers per node increases resource utilization, and ensuring an instance of each container is running on more than one node at a time prevents your application from having a single point of failure.
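Here is one way to picture that placement in Python. The sketch spreads each container across a fixed number of distinct nodes by rotating the starting node. It ignores resource limits entirely, and place_replicas and the node names are hypothetical; real schedulers weigh many more factors.

```python
def place_replicas(containers, nodes, replicas=2):
    """Assign each container to `replicas` distinct nodes, rotating the start node."""
    if replicas > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    placement = {node: [] for node in nodes}
    for index, container in enumerate(containers):
        # Consecutive offsets wrap around the node list, so the replicas of
        # one container always land on different nodes.
        for offset in range(replicas):
            node = nodes[(index + offset) % len(nodes)]
            placement[node].append(container)
    return placement


if __name__ == "__main__":
    layout = place_replicas(["frontend", "backend"], ["node-1", "node-2", "node-3"])
    for node, running in layout.items():
        print(node, running)
```

In this layout, losing any single node still leaves at least one running copy of both the front-end and the back-end.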

However, this leads to another problem: container management. Managing containers in a cluster of ten nodes can be tedious, but what do you do when the cluster reaches a hundred, or even a thousand nodes? Thankfully, there are a number of container orchestration systems, such as Kubernetes, that can help your application scale.

To read more about containers and how they compare to VMs, check out Containers vs. VMs: What’s the Difference & When to Use Them.

What is a Kubernetes Cluster?

Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. A Kubernetes cluster is a group of nodes running containerized applications that are deployed and managed by Kubernetes. It consists of a set of nodes that make up what’s called the control plane (similar to the leader node(s) in a generic cluster), and a second set of nodes, called worker nodes, that run one or more applications.

Kubernetes is a powerful tool that can simplify application deployment to a cluster, create additional “pods” (groups of one or more containers) as traffic increases, self-heal failed pods, dynamically react to networking changes, load balance, enforce security rules, and more. It is designed to be resilient, scalable, and performant, natively leveraging the benefits of cluster architecture.
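Behaviors like self-healing and scaling can be thought of as a loop that compares the desired state with the observed state and acts on the difference. The Python sketch below illustrates that idea only. It is not Kubernetes code, and the function reconcile, the pod names, and the action tuples are invented for this example.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions needed to move the observed state to the desired state.

    `running_pods` maps a pod name to True if it is healthy, False if it failed.
    """
    actions = []
    healthy = [name for name, ok in running_pods.items() if ok]
    # Failed pods are always removed so they can be replaced.
    for name, ok in running_pods.items():
        if not ok:
            actions.append(("delete", name))
    shortfall = desired_replicas - len(healthy)
    if shortfall > 0:
        # Too few healthy pods: create replacements (placeholder names).
        actions.extend(("create", f"pod-new-{i}") for i in range(shortfall))
    elif shortfall < 0:
        # Too many healthy pods, for example after traffic drops: remove extras.
        actions.extend(("delete", name) for name in healthy[desired_replicas:])
    return actions


if __name__ == "__main__":
    observed = {"pod-a": True, "pod-b": False, "pod-c": True}
    print(reconcile(3, observed))
```

Running a check like this repeatedly is what lets a system recover from failures without an operator stepping in.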

As a graduated project of the Cloud Native Computing Foundation, Kubernetes has seen incredible growth since its release by Google in 2014. Because it is an open-source system, individuals and organizations are free to use it as they please, with many developing their own open-source tools that work with or alongside Kubernetes. There are few restrictions on where a Kubernetes cluster can be deployed (on-prem or cloud), and with the release of the Container Runtime Interface, Kubernetes supports a variety of container runtimes.

Unfortunately, Kubernetes can be difficult to implement and even harder to maintain at scale. This is especially true for enterprises that have unique efficiency and complexity challenges. If you’d like to learn more about Kubernetes, containers, and clusters at scale, read Kubernetes at Enterprise Scale: What You Need to Know.

Containers and Clusters at Enterprise Scale

We’ve covered how a cluster-based architecture can help improve the resilience, scalability, and performance of a system, as well as the challenges that computer clusters introduce. We also discussed how containers and container orchestration systems can help mitigate some of these challenges. It’s all quite complex, and building a secure and stable system is especially difficult at enterprise scale.

Kubernetes in particular, despite its extensive list of features, lacks the reliability and ease of use necessary for enterprises to rely on it as their sole container orchestration solution. That’s why Capital One created its own. Built on top of Kubernetes, Critical Stack is a container orchestration platform that eliminates the configuration challenges that come with containerized applications, helping enterprises leverage the benefits of containers and clusters while meeting their own specific needs. If you’re interested in getting started with containers and clusters, container orchestration tools like Critical Stack may be the right starting point for you.

I hope this post provided enough information to kick-start your journey into clusters, containers, and Kubernetes. An understanding of each of these topics will prove valuable as the need for fast and reliable systems continues to grow.


Aaron Nordhoff, Software Engineer, Critical Stack Team

Aaron Nordhoff is a Software Engineer on the Critical Stack product development team. He graduated from Purdue University in 2018 with degrees in Mechanical Engineering and Computer Science, and joined Critical Stack at the start of his second year with Capital One. You can connect with Aaron on LinkedIn (https://www.linkedin.com/in/aaron-nordhoff/).


DISCLOSURE STATEMENT: © 2020 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.
