How to define a holistic container management strategy

Jason Burks

June 29, 2020

If you’ve clicked on this post, you’re probably here because you want to deploy software using Linux containers, as made popular by Docker in the last decade. Or, you might be trying to put your Windows software in containers, which has become increasingly popular as well. Either way, it’s important to recognize that using containers is about more than simply building a container image and running it somewhere. We use the term container management to describe all of the tools and processes that encompass the lifecycle of a containerized application.

When you only have a handful of containers to worry about, “container management” may amount to little more than learning some docker build and docker run syntax, creating an account on Docker Hub, and opening your favorite text editor. However, once you start to run larger applications or try to serve more users, managing software in containers requires careful planning and tool selection. Failure to plan your container management strategy up front can lead to developer and operator frustration, poor performance, and security issues: all of the problems you hoped to solve with containers in the first place.

Containers can be a wonderful tool for improving developer productivity, increasing deployment density, and limiting the blast radius of a vulnerability. But simply getting your software into a container brings you few, if any, of these benefits. Instead, you need to think about your container lifecycle holistically, from start to finish. The pipeline that supplies your containers and the scanning and observation tools that let you know they’re running properly are just as important as the container runtime and orchestration tools in your container management system.

In this post we’ll discuss:

Four major components of a container management strategy
Why you should approach container management holistically
Important things to consider for each component of your container management strategy
How to plan a container management strategy

Let’s get started!

The Four Major Components of a Container Management Strategy

To effectively understand container management solutions, it is helpful to define the various components that make up a container management strategy:

Container Image Supply Chain - how your containers are built
Container Infrastructure and Orchestration - where and how your containers run together
Container Runtime Security and Policy Enforcement - how you ensure your containers do what you want, and nothing else
Container Observability - runtime metrics and debugging

Let’s take a closer look at each of these:

Container Image Supply Chain

Before a workload can be run as a container, it needs to be packaged into a container image. The Image Supply Chain comprises all of the elements that make images available for your execution environment to pull and run. This includes any libraries or components that make up the containerized application, CI/CD tools that test your code and package it as a container image, application security testing tools (SAST, DAST, etc.) that check for vulnerabilities and logic errors, registries and mirroring tools for hosting container images, and attribution mechanisms like image signing to validate images in your registries.

Container Infrastructure and Orchestration

Once you have a container image that you want to run, you need somewhere to do that. This means both the computers on which your containers run and the software that schedules them to run. If you’re working with just a few containers, decisions about where to run the container image, what else should or should not run alongside it, and the best way to manage storage and network connectivity can be made manually. At scale, however, these sorts of decisions should be left to an orchestration tool like Kubernetes, Swarm, or Mesos. These platforms can take a request to run workloads, make decisions about where to run them based on resource requirements and constraints, and then actually start up those workloads on the chosen target. Even better, when a workload fails or runs low on resources, they can restart or move the workload as needed.
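To make the placement decision concrete, here is a deliberately simplified sketch of what a scheduler does: filter the nodes that satisfy a workload's resource requirements and constraints, then pick one. The Node and Workload types, the label-matching rule, and the "most free CPU" tie-breaker are assumptions made for illustration; real orchestrators such as Kubernetes use far richer scoring.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_free: float            # free CPU cores (hypothetical unit)
    mem_free_mb: int           # free memory in MiB
    labels: dict = field(default_factory=dict)


@dataclass
class Workload:
    name: str
    cpu: float                 # requested CPU cores
    mem_mb: int                # requested memory in MiB
    required_labels: dict = field(default_factory=dict)


def fits(node: Node, workload: Workload) -> bool:
    """True if the node has enough free resources and matches every label constraint."""
    has_resources = node.cpu_free >= workload.cpu and node.mem_free_mb >= workload.mem_mb
    meets_constraints = all(
        node.labels.get(key) == value for key, value in workload.required_labels.items()
    )
    return has_resources and meets_constraints


def schedule(workload: Workload, nodes: List[Node]) -> Optional[Node]:
    """Place the workload on the feasible node with the most free CPU, or return None."""
    candidates = [node for node in nodes if fits(node, workload)]
    if not candidates:
        return None
    target = max(candidates, key=lambda node: node.cpu_free)
    # Reserve the resources so later placements see the reduced capacity.
    target.cpu_free -= workload.cpu
    target.mem_free_mb -= workload.mem_mb
    return target
```

Rescheduling after a failure is the same operation run again: the failed node is removed from the list and schedule is called for each workload it was hosting.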

Container Runtime Security and Policy Enforcement

Making sure your container is running in a place that meets the resource requirements and constraints you’ve set for it is necessary, but not sufficient. It’s equally important for your container management solution to perform ongoing validation and ensure that your workload is complying with all of the security and other policy requirements of your organization. Runtime Security and Policy Enforcement tools include functionality for detecting vulnerabilities in running containers, handling vulnerabilities that are detected, confirming that workloads are not running with unnecessary or unintended privileges, and validating that they are only able to connect to other workloads that they should be able to talk to.
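As one small illustration of a privilege check, the sketch below inspects a Kubernetes-style pod spec (already parsed into a Python dict) and reports containers that run with more privilege than a strict policy would allow. The securityContext field names are the ones Kubernetes uses; the function name and the policy itself (flag anything not explicitly locked down, and look only at container-level settings) are assumptions for this example.

```python
def privilege_findings(pod_spec: dict) -> list:
    """Return human-readable findings for containers with broad privileges."""
    findings = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged") is True:
            findings.append(f"{name}: runs privileged")
        # Assumed strict policy: escalation must be explicitly disabled.
        if ctx.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: privilege escalation not explicitly disabled")
        # Simplification: pod-level securityContext is ignored here.
        if ctx.get("runAsNonRoot") is not True:
            findings.append(f"{name}: may run as root")
    return findings
```

A real enforcement tool would typically run a check like this at admission time and again against running workloads, since policies change after deployment.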

Container Observability

Finally, once your container image has been executed somewhere by your orchestration tool, and is being well managed by your security and policy enforcement tools, you need to know what your container is doing and how well it’s doing it. Container Observability covers logging and metrics collection for both your workloads and the tools that are running them.

The Importance of a Holistic Container Management Strategy

Container management platforms are usually assembled out of disparate tools from multiple sources. Some container management software vendors or container management services attempt to address all four major components of effective container management. However, many organizations already have tooling that provides at least some of the required functionality and would prefer to avoid wasting existing licenses or making wholesale changes across their infrastructure just to run containers.

When selecting tools from multiple sources, it is critical to understand what needs each tool does and does not meet. This holistic approach is necessary to prevent gaps and avoid duplication of effort. For example, scanning images as part of your build pipeline with DAST tools and then scanning them again with those same tools while the container is running is a waste of CPU cycles in your runtime environment. Likewise, aggregating logs or metrics with both your orchestration tools and separate host-based agents can be wasteful of CPU cycles as well as storage and network resources.

Let’s explore some of the questions you should be asking yourself when considering what pieces of container management software should make up your overall container management strategy. The following is not exhaustive, but it will hopefully get you thinking about just how many decision points go into designing or selecting container management services.

Four Things to Consider When Designing a Container Management Strategy

Image Supply Chain: Garbage In/Garbage Out

Since you are reading this post, chances are that your organization is already building, and/or at least using, containers. But do you know where they came from? Do you know what’s in them? Are your existing images still good?

You’ll notice that most of these are essentially policy questions, and there are tools to help you answer them. Your existing CI/CD tooling can likely be adapted to build container images, often as a final step after compiling your code. But even this compiling step should be considered as part of your supply chain: do you know where your compiler came from? Many organizations are turning to container-based toolchains to help solve this problem.

Important features of your supply chain include the ability to:

scan container images in your registry, both for security issues and policy compliance
cryptographically sign or otherwise verify that the image hash you’re using has been scanned and is approved for use
mirror images from well-known public registries so that you can perform your own scanning and also insulate yourself from outages of these services
attribute images to the team(s) that produced them
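The verification item in the list above can be sketched as a digest allowlist. This minimal example assumes that your scanning pipeline records the SHA-256 digest of every image it approves, and that deployments reference images by digest rather than by mutable tag. The function names and the registry.example.com host are hypothetical, and a production setup would verify cryptographic signatures rather than rely on a bare set lookup.

```python
import hashlib


def image_digest(manifest_bytes: bytes) -> str:
    """Compute a sha256 digest string for an image manifest's raw bytes."""
    return "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()


def is_approved(image_ref: str, approved_digests: set) -> bool:
    """Accept only references pinned by digest (name@sha256:...) that were approved."""
    _name, separator, digest = image_ref.partition("@")
    if not separator:
        # Tag-only reference: tags can be re-pointed, so there is nothing to verify.
        return False
    return digest in approved_digests
```

For example, is_approved("registry.example.com/team/app:latest", approved) is always False under this policy, which pushes teams toward digest-pinned deployments.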

Some more advanced container image supply chains also take into account multi-region redundancy, separation of image push endpoints from image pull endpoints (pull endpoints are read-only in this scenario), and the ability to revoke or remove images that no longer meet compliance or security requirements.

Finally, it’s important to consider disaster recovery for your image registry. As mentioned above, it is wise to insulate yourself from registry outages. Mirroring external registries is only part of the equation, though. You also want to make sure that you have a high-availability plan in place for your internal registries, as well as a proper backup and recovery process. A highly-available, fault-tolerant container management platform does not just mean the runtime environment!

Infrastructure and Orchestration: More Than Just The Computers

Compute

When thinking about container management, it is important to realize that there is more to the infrastructure story than standing up a handful of servers and installing something like Kubernetes or Docker Swarm. In fact, one of the most important decisions is what computers to use. This doesn’t mean which CPU or how much RAM so much as physical servers vs. VMs vs. cloud computing. While bare metal might yield the best raw performance, hardware fails, and replacement costs (both time and money) should also be factored into this decision. The VMs vs. cloud discussion could occupy a whole blog post in itself, but at its core, this is really a question of who manages the hypervisor. The ability to automatically replace failed nodes is a big plus, as long as your container management software can leverage these replacement resources without human intervention.

Storage

Storage is another critical consideration. This includes both the storage that the operating system uses and any storage used by the containers themselves. Many people target stateless applications for their first containerized workloads, but eventually you’ll need to store something somewhere. Data persistence needs to be treated differently when using containers. Whenever possible, containers should be run with read-only (root) filesystems. That does not, however, mean that a container should never be able to write data to storage. For those use cases where mutable storage is a necessity, there are a few different strategies to think through.

First, you should consider what type of storage you actually need. Can you outsource the storage concern to your cloud provider instead by using something like AWS RDS? If not, do you really need block storage (i.e., disk), or can an external object store like AWS S3 meet your needs? If an external object storage service can meet your performance and durability requirements and your governance and compliance needs, you’re in luck: you may not have to worry about managing persistent storage in your containers.

If you do need block storage, there are a few important things to know. Within your image, you should define mutable storage as separate volume(s) so that your root volume can remain read-only. Doing so also allows you to leverage local host-path mounting during development, which both simplifies the developer workflow and allows for re-use of data during rapid development cycles. In production, however, you should avoid host-path mounting (except where the workload explicitly needs access to something on the node). Host-path mounting explicitly ties a container to a node, which limits the ability of your orchestration tool to reschedule the workload elsewhere. Using an external storage provider allows your persistent storage to be re-mounted when a workload, or even an entire node, is replaced. Many external storage services can be provisioned on demand and support separate snapshotting, and some even allow dynamic expansion as needed.
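These storage guidelines are easy to check mechanically. The sketch below assumes a Kubernetes-style pod spec parsed into a dict and flags two things: volumes that use hostPath, and containers whose root filesystem is not marked read-only. The hostPath and readOnlyRootFilesystem field names come from Kubernetes; the function name and message wording are made up for this example.

```python
def storage_findings(pod_spec: dict) -> list:
    """Flag hostPath volumes and writable root filesystems in a pod spec."""
    findings = []
    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            findings.append(
                f"volume '{volume.get('name')}' uses hostPath and ties its data to one node"
            )
    for container in pod_spec.get("containers", []):
        security = container.get("securityContext") or {}
        if not security.get("readOnlyRootFilesystem", False):
            findings.append(
                f"container '{container.get('name')}' has a writable root filesystem"
            )
    return findings
```

A check like this fits naturally in a production deployment gate, while development environments can skip it so that host-path mounts stay available to developers.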

Networking

Network connectivity, both within the containerized environment and outside of it, is also a very important item. Kubernetes, for example, supports a number of different Container Network Interface (CNI) plugins that each provide different capabilities. Questions to consider here are whether they have the ability to set traffic control policies (and at what OSI layer), how encryption between workloads as well as between workloads and external entities is handled, and how to manage getting traffic to your containerized workloads. The performance implications of these decisions also play a key role. For example, inter-pod encryption in Kubernetes via the Cilium CNI might meet a security requirement, but is it worth the significant drop in networking throughput?

Backups

Backups are still an important part of operations in a containerized environment, but what you back up changes a little. Immutable, read-only container filesystems do not need to be backed up, as they can be recreated from the original container image quite easily. Backups or snapshots of persistent storage do still need to be considered. If you’re using a cloud provider, you also need to contemplate failure domain and regional recovery scenarios, depending on the provider’s capabilities. For instance, if you’re using AWS, you may need to copy your EBS snapshots to a second region to make sure they are available to restore there in the event of a total regional outage.

Container Security: You’re Only as Secure As Your Most Vulnerable Container

One of the great benefits of (well-implemented) containerized software is that it can reduce the attackable surface of your application. It does not, however, eliminate it completely. This means that it is necessary to think about how you will observe your application while it’s running to minimize security risk. Images that contain no vulnerabilities at build time can become vulnerable containers as new flaws are found in your code or supporting libraries, so simply scanning as part of your build pipeline is not enough.

Regular re-scanning of images in your registry is a good first step, as long as your container management platform allows you to understand where images with newly-discovered flaws are running so you can replace them. Container runtime scanning agents, much like traditional host-based agents, can detect signatures from known vulnerability databases, but signature-based systems result in a never-ending game of catch-up as new vulnerabilities are discovered. To combat this, some newer tools instead focus on detection of anomalous behavior at the system call level. As these types of tools mature, they could make a real difference in the security posture of your workloads because they rely on actual observed behavior and not an up-to-date signature file.
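Knowing where a newly flagged image is running comes down to joining two inventories: the digests your registry scanner has flagged and the digests your orchestrator reports as running. A minimal sketch follows, assuming both inventories are already available as plain Python data. The record shape (a dict with "workload" and "digest" keys) and the function name are hypothetical.

```python
from collections import defaultdict


def affected_workloads(running: list, flagged_digests: set) -> dict:
    """Map each flagged image digest to the sorted workloads currently running it."""
    hits = defaultdict(list)
    for item in running:
        if item["digest"] in flagged_digests:
            hits[item["digest"]].append(item["workload"])
    return {digest: sorted(workloads) for digest, workloads in hits.items()}
```

The output is the replacement work list: each workload named there needs to be redeployed from a rebuilt image.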

Containers also change the security patching landscape. Container images are immutable, so running containers should not be patched for vulnerabilities. Instead, an updated image which remediates the vulnerability must be built and tagged appropriately, and running instances of that container must be replaced with the new image. While it is, strictly speaking, possible to allow updating the binaries or libraries in a running container, this is an anti-pattern. If at all possible, containers should be run with a read-only filesystem to reduce attack surface. Additionally, the management effort to update hundreds of running containers and validate that they are all “patched” is significantly higher than the effort required to replace those containers from a new image with any modern container orchestration tool like Kubernetes.

Container Observability: What’s Going On In There?

Containers provide very portable packaging for an application and a level of abstraction from the underlying operating system that permits very dense deployment (lots of containers on a small[ish] number of systems). This makes traditional log and metrics gathering a lot more complicated. Add to that the fact that the orchestration tooling likely has its own logs and metrics, as do your networking layer and your security and compliance scanning tools, and there’s a lot to make sense of in a containerized environment. This topic is worthy of its own blog post, and a quick web search will yield several.

One very important element of observability is that it’s critical to externalize your logs and metrics in a containerized environment. Containers come and go, and in many cases even the nodes they run on come and go, so relying on local storage is not recommended. There are many solutions available for log and metrics aggregation, and newer products are even starting to apply machine learning to these data to provide valuable insights into how your applications are running.
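One common way to externalize logs is for the application to write structured records to standard output and let the platform's log collector ship them elsewhere, rather than writing files inside the container. The sketch below shows that pattern with Python's standard logging module. The JSON field names and the build_logger helper are choices made for this example, not a required schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for easy aggregation."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout (or a supplied stream)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]   # nothing is written to the container filesystem
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With this in place, build_logger("orders").info("order %s accepted", 42) emits one JSON line on stdout, and nothing is lost when the container or its node is replaced, provided a collector is reading that stream.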

How to Plan a Container Management Strategy

Having talked about the components of a container management system and what it means to look at the problem holistically, it’s time to review three recommendations for how to formulate a container management strategy. This is more than just a matter of picking Kubernetes or Docker, or cloud or bare metal. There are human factors to consider as well.

Developer Workflow Is Important

The single most important part of your container image supply chain is the developers producing the code that will run in those containers. While many software vendors are now providing their products in containers, this usually means your own software engineers who are building and shipping containerized applications. Assembling a bunch of best-of-breed tools for building, running, and managing containers won’t pay dividends unless the switch improves developer productivity and satisfaction, or at least does not make either suffer.

Adopting containers as part of your development pipeline does not necessarily mean that you have to reimagine your workflow from the ground up. In fact, many of the same considerations that existed in the pre-container days still apply. Most software engineers want to write good code, test that it works as expected, and then “ship it”. This means that your container management strategy should attempt to remove as many other steps from their workflow as possible. CI/CD has proliferated because it lets developers develop, and this is still the case in a containerized world.

It’s very likely that your existing CI/CD tooling already supports container image building, but it’s important to realize that building container images can introduce a few new twists when it comes to compiling code. For a container CI/CD pipeline to be successful, it needs to support multi-stage builds (this allows you to use containerized build chains) and properly leverage image layer caching. Re-pulling several megabytes from your registry may only add a few seconds to a build, but if your build runs repeatedly and those megabytes aren’t changing, that wasted time adds up.
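To see why step ordering matters for layer caching, consider a simplified model in which each build step's cache key depends on the step itself and on the key of the step before it. This is an illustrative assumption, not the exact algorithm of Docker or any other builder, and the step strings below (an instruction plus a stand-in content hash) are hypothetical.

```python
import hashlib


def layer_keys(steps: list) -> list:
    """Chain each step's key to its parent's key, as a layer cache might."""
    keys, parent = [], ""
    for step in steps:
        parent = hashlib.sha256((parent + "\n" + step).encode()).hexdigest()
        keys.append(parent)
    return keys


def reusable_layers(previous: list, current: list) -> int:
    """Count the leading steps whose cache keys still match the previous build."""
    count = 0
    for old, new in zip(layer_keys(previous), layer_keys(current)):
        if old != new:
            break
        count += 1
    return count
```

Under this model, a build that copies the dependency manifest and installs dependencies before copying source code keeps those earlier layers cached when only the source changes. A build that copies the source first invalidates every later step on each commit.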

See Best Practices for Building Containers for more information on container build optimization.

DIY, Managed Services, or Packaged Products

Developer satisfaction is important, but it’s also wise to consider the SRE team(s) operating your container management software. Migrating from bare metal or VM-based deployment methodologies to containers can involve a significant learning curve for SRE teams, so it’s worthwhile to choose tools that help smooth this curve.

DIY Kubernetes

In the world of container management, Kubernetes is quickly becoming the de facto standard for orchestration and scheduling of containers. Most products that solve for the other facets of container management we’ve discussed in this post (image supply chain, runtime security and policy enforcement, and observability) will integrate readily with Kubernetes. Kubernetes is, of course, open source software. If your SRE team has the technical ability and desire to implement it themselves, they can. That doesn’t, however, mean that you should automatically choose to build it yourself.

Managed Kubernetes

Kubernetes is hard to implement well. As a result, a number of solution providers offer packaged products or managed services to make adopting Kubernetes easier. All of the major cloud providers now offer a Kubernetes service that reduces the operational burden on SRE teams at the possible cost of limited configuration flexibility and/or vendor lock-in. Organizations that are heavily invested in a specific cloud provider’s ecosystem may find this path works well for them. Other organizations may find that a fully-managed service, where all they do is provide container images and let the service provider worry about running them, makes the most sense based on cost and organizational capacity.

Third Party Orchestration Products

A third approach is a packaged product from a vendor that can be installed on your infrastructure (cloud or otherwise), such as Capital One’s own Critical Stack. These products can offer several possible advantages over DIY or cloud provider offerings, such as access to additional configuration options or cluster components, enhanced features or functionality, implementation support and training, post-installation product support, and reduced risk of cloud provider lock-in.

Learn more about container orchestration with Critical Stack now.

Infrastructure Considerations

Lastly, it is important to account for your organization’s current and future infrastructure strategy and how it fits in with your container management strategy. If you’re all bare metal today but are planning to move to VMs or a cloud provider in the next year, your container management solution should be able to accommodate both the present and future environments. Likewise, if you’ve already moved all-in on public cloud, you probably want to make sure your choice of tools supports a few cloud options, but maybe bare metal support isn’t an important feature for you.

The infrastructure considerations extend beyond compute, as detailed above. Your choice of container management platform needs to support existing network infrastructure and the storage capabilities available to your organization. If you have policy enforcement, monitoring, and alerting tools in place today, your ideal solution should be able to leverage them. Moving to containers can be a big shift for your developers and operations teams, so reducing the complexity by continuing to use existing tools where possible can save time and money.

Closing Container Management Thoughts

Attempting to package the entire topic of container management into a tidy container (pun intended) is not simple, and the above has really only scratched the surface of effectively managing containers in a large production environment. There are many important facets of efficient container management, and this space is evolving rapidly. However, for those who like a list of important decision criteria, I’d offer the following top five items to consider when deciding your organization’s containerized future:

Developer and operator/SRE satisfaction is critical
Your choice of tools needs to support your infrastructure of today and tomorrow
Level of re-use of your existing tools and processes with containers
Where you fall on the build-buy-managed service spectrum
Security, patching, and monitoring look different with containers

I hope this post has helped you explore some of the many decision points involved in building an effective container management strategy. Thanks for reading!

Jason Burks, Director of Customer Success, Critical Stack

Jason Burks leads the Solutions Architecture, Customer Success, and Customer Support teams for Critical Stack at Capital One. He’s been working with containers since Solaris 10’s Zones, and with Docker since before the very first DockerCon in San Francisco. Before joining Capital One he was an Enterprise Cloud Architect at GE Appliances, where he helped design and build their internal container management platform on Apache Mesos and then helped migrate it to Amazon ECS. Jason holds a Computer Science degree from Rensselaer Polytechnic Institute. You can connect with him on LinkedIn (https://linkedin.com/in/jason-burks).