Cloud Cost Optimization: Definition and Strategies
These best practices can help you effectively manage resources, balancing spend and performance in the cloud
Migration to the cloud offers many benefits. Enterprise-grade infrastructure and services are available to everyone, not just large businesses with huge IT budgets. But at some point, customers of cloud infrastructure and service providers like AWS, Google Cloud, and Azure need to understand how to optimize cloud costs. Not every small business stays small, and thinking big early leads to better planning, useful insights, and potentially big wins.
What is Cloud Cost Optimization?
Cloud cost optimization is the practice of running applications in the cloud, performing work or providing value to the business, at the lowest possible cost while using cloud providers as cost-efficiently as possible. As a discipline, optimization ranges from simple business management to complex scientific and engineering areas like operations research, decision science and analytics, and modeling and forecasting.
Why Cost Optimization?
Every organization pursuing a goal, for profit or otherwise, needs to minimize overhead: the cost of producing its goods and services.
Consider a corporate data center ecosystem and a web app “stack” consisting of a web frontend, application layer, and database backend. Every component and communication channel for the application must be sized to meet a maximum-demand event, like payday or Black Friday. The web stack might include 20 web servers behind a load balancer, 20 application servers behind another load balancer, and a database cluster. All this infrastructure might be duplicated in a geographically separate data center in either an active-active or active-standby configuration. A data center web app shares communication infrastructure, like leased lines, routers, switches, load balancers, and firewalls, with other apps, and ultimately this shared infrastructure is sized to handle the heaviest traffic expected, with bandwidth and latency limits constrained by cost. Storage and backup systems, power and cooling, and physical space for all the IT equipment are planned for future user and workload growth. If you replicate this on-premises application to the cloud as quickly as possible, without any redesign for cloud computing capabilities, you end up paying for servers that spend most or all of their time sitting idle. A simple “lift and shift” to the cloud costs more than necessary, and when the dust settles, the benefits of cloud cost optimization become clear, because minimizing overhead is a fundamental business process.
A new business starting off might design better systems in the cloud with the pay-as-you-go cost model in mind, but changes, entropy, and lack of cost awareness will create opportunities for cost optimization.
Cloud Cost Optimization Early Days
As hundreds of data center-filling applications moved into the cloud, our software engineers realized how infrastructure as code could be managed with automation. AWS offers cost-saving suggestions through the Trusted Advisor tool. Although simple scripts can be written as needed to manage cloud resources, managed automation enabled us to report and act upon cloud resources at scale. Cleanup of unused resources was always a priority, and there was some anticipation of why cost controls were needed and how they could be implemented. For example, resources for most development environments are only used during the day, so we automated the ability to shut down resources in bulk at the end of the business day and restart them the next morning. With the capability to report resource properties, usage, and CloudWatch metrics for resources, we can search for and reduce or eliminate inefficiency and waste. Reviewing the pricing models, we learned that a small savings realized for hundreds or thousands of cloud resources could add up to thousands or millions of dollars over the course of a year.
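To illustrate how an off-hours shutdown adds up across a fleet, here is a minimal sketch of the arithmetic. The 12-hour, 5-day schedule, the hourly rate, and the fleet size are hypothetical inputs rather than figures from our environment, and the function names are illustrative only.

```python
HOURS_PER_WEEK = 24 * 7
HOURS_PER_YEAR = 24 * 365


def off_hours_fraction(on_hours_per_day: float, on_days_per_week: int) -> float:
    """Fraction of the week a resource is stopped under a fixed schedule."""
    running_hours = on_hours_per_day * on_days_per_week
    if not 0 <= running_hours <= HOURS_PER_WEEK:
        raise ValueError("schedule must fit within one week")
    return 1 - running_hours / HOURS_PER_WEEK


def annual_savings(hourly_rate: float, resource_count: int, stopped_fraction: float) -> float:
    """Approximate yearly savings for resources billed only while running."""
    return hourly_rate * HOURS_PER_YEAR * resource_count * stopped_fraction


if __name__ == "__main__":
    # Hypothetical fleet: 1,000 dev instances at $0.10/hour, on 12 hours a day, 5 days a week.
    fraction = off_hours_fraction(on_hours_per_day=12, on_days_per_week=5)
    savings = annual_savings(hourly_rate=0.10, resource_count=1000, stopped_fraction=fraction)
    print(f"stopped {fraction:.0%} of the week, saving about ${savings:,.0f} per year")
```

This estimate ignores charges that continue while a resource is stopped, such as attached storage, so it should be read as an upper bound.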
Cloud Cost Optimization Strategies
Here are some strategies for developing a cloud cost management plan or program. These strategies should be considered on an ongoing basis to achieve maximum and sustainable results.
- What to do
  - Cloud Native Design
  - Operational Refinement
  - Capacity Reservation and Volume Discounts
- Cloud Cost Management: How to do it
  - Managing, Organizing, Communicating, and Educating
  - Billing and Pricing Review
  - Plan and Prepare to Track Spending per Cost Center
  - Software and Automation Needs
Cloud Native Design
Design and build more cost-efficient systems to replace existing ones. The cloud native idea is to employ every cost advantage to be gained by leveraging capabilities that are unique to the cloud environment. An example of this is auto-scaling. It is unheard of for a traditional load-balanced server pool to be billed only for the servers in use. Every server purchased for the pool is paid for in advance, and ongoing; server hardware plus data center space, power, and connectivity. A great cloud native advantage is being billed only for the servers that are actively running in the pool. Cloud auto-scaling means that capacity paid for is not greatly in excess of capacity being used.
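A rough way to see the difference is to count server-hours for a pool sized to peak demand versus a pool that follows demand hour by hour. The sketch below uses a hypothetical daily traffic profile and a hypothetical per-server capacity; real auto-scaling reacts with some delay and keeps headroom, so actual savings would be smaller.

```python
import math


def servers_needed(requests_per_sec: float, capacity_per_server: float, minimum: int = 1) -> int:
    """Servers required to carry the load, never fewer than the pool minimum."""
    return max(minimum, math.ceil(requests_per_sec / capacity_per_server))


def fixed_pool_server_hours(hourly_demand: list[float], capacity_per_server: float) -> int:
    """A traditional pool is sized for the peak hour and runs all day."""
    peak = max(servers_needed(d, capacity_per_server) for d in hourly_demand)
    return peak * len(hourly_demand)


def autoscaled_server_hours(hourly_demand: list[float], capacity_per_server: float,
                            minimum: int = 1) -> int:
    """An auto-scaled pool is billed only for the servers running each hour."""
    return sum(servers_needed(d, capacity_per_server, minimum) for d in hourly_demand)


if __name__ == "__main__":
    # Hypothetical day: 20 quiet hours at 100 req/s, 4 busy hours at 2,000 req/s.
    demand = [100.0] * 20 + [2000.0] * 4
    fixed = fixed_pool_server_hours(demand, capacity_per_server=100.0)
    scaled = autoscaled_server_hours(demand, capacity_per_server=100.0, minimum=2)
    print(f"fixed pool: {fixed} server-hours, auto-scaled: {scaled} server-hours")
```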
The AWS Well-Architected Tool offers recommendations based on architectural best practices for the cloud. AWS also provides many architectural examples through whitepapers and documentation, and experts who can give thoughtful system design advice. Leveraging resources like these from cloud providers is a great strategy for optimizing cloud native design and reducing cloud costs.
Cloud native design is a great way to optimize costs, but it requires skills and experience. Like open-source software, existing cloud infrastructure designs provide guidance. Most cloud infrastructure designs are variations of existing designs rather than being radically innovative and unique.
On the general subject of systems design, seek clarity about functional vs. non-functional requirements; performance is not always the top priority! DevOps in the cloud enables speedy delivery and innovation, not necessarily cost savings. In the cloud, engineers have direct control over costs. Optimizing purely for cost sacrifices performance and/or quality, and lowest cost is rarely the primary goal for a new product or service. Cloud native design includes evaluating each design from a cost perspective.
Operational Refinement
The other approach is to maximize the cost efficiency of existing systems without design changes, including unused resource cleanup and rightsizing. Rightsizing analysis compares resource usage with capacity to determine whether you are paying too much for unused capacity or capabilities. This is usually a service-specific study, where one or more specific qualities of the service are taken into consideration. For example, one might question why every EC2 instance in the organization uses a 50 GB boot volume if only a small percentage of it is usually used. At enterprise scale, cleanups and rightsizing should make use of automatic remediation where possible. Reporting that gives high visibility to these no-regrets cost-saving opportunities is very valuable.
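The boot volume question can be turned into a simple report. The following is a minimal sketch, assuming usage figures have already been collected from some inventory source; the `VolumeUsage` record, the 25% utilization threshold, and the 1.5x headroom factor are all hypothetical choices for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class VolumeUsage:
    volume_id: str
    size_gb: int      # provisioned (billed) capacity
    used_gb: float    # observed usage from a hypothetical inventory report


def oversized_volumes(volumes: list[VolumeUsage], max_utilization: float = 0.25,
                      headroom: float = 1.5) -> list[tuple[str, int]]:
    """Return (volume_id, suggested_size_gb) for volumes that are mostly empty."""
    findings = []
    for v in volumes:
        if v.size_gb > 0 and v.used_gb / v.size_gb < max_utilization:
            # Suggest a size that leaves room for growth above current usage.
            findings.append((v.volume_id, math.ceil(v.used_gb * headroom)))
    return findings


if __name__ == "__main__":
    fleet = [VolumeUsage("vol-a", 50, 6.0), VolumeUsage("vol-b", 50, 30.0)]
    for volume_id, suggested in oversized_volumes(fleet):
        print(f"{volume_id}: consider about {suggested} GB instead of 50 GB")
```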
Capacity Reservation and Volume Discounts
Reserved instances and on-demand capacity reservations feel a bit like returning to the corporate data center, where forecasting and budgeting for capacity are needed. A big distinction is that the forecast is primarily for “baseline” utilization, since the goal is to pay upfront for capacity that will definitely be used. Service providers offer big discounts for capacity paid for in advance, which leads to even better cost optimization. Further, cloud service providers may provide volume discounts for bigger customers, so it is worth knowing your options.
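The baseline idea can be sketched as a small search: given an hourly usage history, try each possible reservation count and keep the cheapest. The rates and usage profile below are hypothetical, and the model assumes reserved capacity is billed every hour whether or not it is used.

```python
def total_cost(hourly_usage: list[int], reserved: int,
               on_demand_rate: float, reserved_rate: float) -> float:
    """Cost of covering usage with `reserved` instances plus on-demand overflow."""
    reserved_cost = reserved * reserved_rate * len(hourly_usage)
    overflow_hours = sum(max(0, used - reserved) for used in hourly_usage)
    return reserved_cost + overflow_hours * on_demand_rate


def best_reservation(hourly_usage: list[int], on_demand_rate: float,
                     reserved_rate: float) -> int:
    """Reservation count with the lowest total cost for the observed usage."""
    candidates = range(0, max(hourly_usage) + 1)
    return min(candidates,
               key=lambda r: total_cost(hourly_usage, r, on_demand_rate, reserved_rate))


if __name__ == "__main__":
    # Hypothetical day: 10 instances most of the time, 30 during a 6-hour peak.
    usage = [10] * 18 + [30] * 6
    count = best_reservation(usage, on_demand_rate=1.0, reserved_rate=0.6)
    print(f"reserve {count} instances; "
          f"cost {total_cost(usage, count, 1.0, 0.6):.2f} "
          f"vs {total_cost(usage, 0, 1.0, 0.6):.2f} all on-demand")
```

In this model a reservation pays off only when the instance would otherwise run on-demand for more than `reserved_rate / on_demand_rate` of the hours, which is why the search settles on the baseline rather than the peak.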
Managing, Organizing, Communicating, and Educating
Ultimately, people make cloud cost optimization happen or not. Establish a program to review, monitor, and control cloud costs, empowering technical, financial, and managerial team members in every line of business to work together, share accountability, and champion the cause. For example, a cost optimization program could create a consultancy that conducts seminars and trains the engineering community on critical topics. Our program held “hothouses”, where engineering teams were brought into a training and working session for one or more days, giving them space to learn about and implement a range of optimizations, from the simple to the complex. A training course was also developed to bring cloud cost awareness to the entire organization and was even integrated into the onboarding process for newcomers.
Billing and Pricing Review
Fortunately, cloud provider billing provides detailed information about what is being paid for. The high-level breakdown or itemization of costs is the map to savings. The highest spend will likely be compute, storage, and value-add managed services like RDS. Prioritize the highest-spend services for a detailed analysis. For example, AWS EC2 (compute) is often the highest-spend category on the bill. Likewise, prioritize cost optimization for the teams spending the most; the savings realized by the biggest spenders may be greater than the entire budget of the smaller ones. Understanding the pricing of everything the cloud vendor offers in great detail is very productive, because it allows better judgments to be made about what should be purchased or avoided. For example, in AWS, going down one instance size within the same class of instance, like m5.2xlarge to m5.xlarge, typically halves the hourly rate.
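As a small illustration of using the bill as a map, the sketch below totals line items per service and ranks them by share of spend. The `(service, cost)` tuples are a hypothetical simplification of a billing export, not the format of any particular provider's report.

```python
def rank_by_spend(line_items: list[tuple[str, float]]) -> list[tuple[str, float, float]]:
    """Return (service, total_cost, share_of_bill) sorted from highest spend down."""
    totals: dict[str, float] = {}
    for service, cost in line_items:
        totals[service] = totals.get(service, 0.0) + cost
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(service, cost, cost / grand_total if grand_total else 0.0)
            for service, cost in ranked]


if __name__ == "__main__":
    # Hypothetical monthly line items.
    bill = [("EC2", 600.0), ("S3", 150.0), ("RDS", 200.0), ("EC2", 50.0)]
    for service, cost, share in rank_by_spend(bill):
        print(f"{service:<5} ${cost:>8.2f}  {share:.0%}")
```

The same grouping applied to a team or cost center column, where one is available, gives the ranking of the biggest spenders.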
Plan and Prepare to Track Spending per Cost Center
Teams that individually have financial responsibility for their own cloud budgeting and spending need a way to track it. One could give each cost center its own AWS account, in which case reporting is easiest. However, when a single account has many application teams, each with its own budget, there needs to be a way to tie the costs to the teams responsible. In this case it is imperative to mandate a standard method for identifying ownership of cloud resources. Resource naming standards eventually fail, so additional properties in the form of resource tags or labels are normally needed. Think of resource tags as the cloud equivalent of barcode labels. Resource tags are arbitrary key-value pairs, properties that can be added to cloud resources for descriptive purposes. For example, a tag called "Department" or "Cost Center" can be used to describe the ownership of every cloud resource. If you have the opportunity to plan this in advance, mandate a tagging standard that applies to everything, because retroactive tagging is difficult. Tough decisions have to be made if some problem component lacks ownership in a production environment. For a tagging standard, the minimum viable product is a set of cost centers aligned to how granular the cost reporting must be, anywhere from individual user to department.
It is common for a large enterprise to have a configuration management database, and a tag identifying that a cloud resource belongs to an app that has been permanently shut down is very useful. Metadata about cloud resources can also inform the intensity of cost optimization efforts, with the most critical resources being given the most tolerance for underutilization and idleness. Finally, an “Owner Contact” tag for every resource is useful when cost centers are very large and having a conversation about a resource is needed.
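A tagging standard is only useful if it can be checked and reported on. Here is a minimal sketch of both, assuming resources have already been exported as dictionaries with `tags` and `monthly_cost` fields; the tag keys `CostCenter` and `OwnerContact` and the record layout are hypothetical.

```python
from collections import defaultdict

REQUIRED_TAGS = ("CostCenter", "OwnerContact")  # hypothetical tagging standard


def missing_tags(tags: dict[str, str], required: tuple[str, ...] = REQUIRED_TAGS) -> list[str]:
    """Return the required tag keys that are absent or blank."""
    return [key for key in required if not tags.get(key, "").strip()]


def spend_by_cost_center(resources: list[dict]) -> dict[str, float]:
    """Sum monthly cost per cost center; untagged spend gets its own bucket."""
    totals: dict[str, float] = defaultdict(float)
    for resource in resources:
        center = resource["tags"].get("CostCenter", "").strip() or "UNTAGGED"
        totals[center] += resource["monthly_cost"]
    return dict(totals)


if __name__ == "__main__":
    inventory = [
        {"id": "i-1", "tags": {"CostCenter": "1001", "OwnerContact": "a@example.com"},
         "monthly_cost": 120.0},
        {"id": "i-2", "tags": {"CostCenter": "1001"}, "monthly_cost": 30.0},
        {"id": "vol-3", "tags": {}, "monthly_cost": 8.0},
    ]
    for item in inventory:
        gaps = missing_tags(item["tags"])
        if gaps:
            print(f"{item['id']} is missing tags: {', '.join(gaps)}")
    print(spend_by_cost_center(inventory))
```

Keeping untagged spend visible as its own line, rather than dropping it, makes the size of the ownership gap obvious in every report.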
Software and Automation Needs
Graphing cloud spend over time by expense type and cost center is vital for understanding trends and progress with cloud cost optimization. Daily, monthly, and yearly data and graphs made available to everyone in the organization encourage healthy transparency and competition.
Metrics for every aspect of cloud resource utilization vs. capacity are needed for effective optimization. Capital One developed a tool specifically for rightsizing EC2 instances using history for the “four corners of utilization metrics”: CPU, memory, disk, and network. Emphasis is on recent history, but utilization peaks are captured and factored into an instance type recommendation. Alerting on unusual cost spikes or exceeded thresholds is also very useful.
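The following sketch shows one way recent history and peaks could both feed a recommendation. It is not the Capital One tool described above; the decay weighting, the thresholds, and the assumption that moving down one instance size roughly doubles utilization are all illustrative choices.

```python
CORNERS = ("cpu", "memory", "disk", "network")


def weighted_recent_mean(samples: list[float], decay: float = 0.9) -> float:
    """Mean of samples (oldest first) where each older sample counts `decay` times less."""
    newest = len(samples) - 1
    weights = [decay ** (newest - i) for i in range(len(samples))]
    return sum(w * s for w, s in zip(weights, samples)) / sum(weights)


def recommend(metrics: dict[str, list[float]], mean_limit: float = 60.0,
              peak_limit: float = 90.0) -> str:
    """Suggest 'downsize' only if every corner stays within limits at half the capacity.

    Utilization values are percentages. Halving capacity is assumed to double them.
    """
    for corner in CORNERS:
        samples = metrics[corner]
        projected_mean = 2 * weighted_recent_mean(samples)
        projected_peak = 2 * max(samples)
        if projected_mean > mean_limit or projected_peak > peak_limit:
            return "keep"
    return "downsize"


if __name__ == "__main__":
    # Hypothetical utilization history, oldest sample first.
    history = {
        "cpu": [10.0, 12.0, 15.0],
        "memory": [20.0, 22.0, 25.0],
        "disk": [5.0, 5.0, 5.0],
        "network": [8.0, 9.0, 10.0],
    }
    print(recommend(history))
```

Checking all four corners matters because an instance that looks idle on CPU can still be constrained by memory, disk, or network.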
Management, Reporting, and Cleanup
Ensure that everything in your cloud is being used, and that nothing is unidentified or forgotten! Examining the overall cloud fleet composition also leads to insights for capacity reservations and potential opportunities for further cost optimization. For example, knowing that over half your EC2 fleet is m4.4xlarge suggests a potential cost savings of up to 75%, depending on the reservation term and payment option, if those instances are reserved for the coming year. It also raises questions like “Are the m4.4xlarge used because this is the correct size for the workload, or is some widely-copied infrastructure template spitting out m4.4xlarge because it only uses one size?” Note that the EMR service has minimum instance size limitations, so there will be cases where some larger instance types are required.
Using automation, one can implement cost-saving “levers”, cost-saving changes that do not impact infrastructure design. The following examples are specific to AWS but the principles should apply with other cloud providers.
- Off-hour Shutdown for ASG, EC2, RDS
- ASG Dimmer - reduce ASG size during non-working hours in nonprod
- EC2 Fleet Upgrade - push to latest generation instances. Promote cost-efficient platforms like AMD, ARM
- S3 - Educate teams to understand their specific S3 data access requirements, and to design bucket lifecycles to expire objects or move them to less-expensive access tiers
- Idle resources - Study a resource and decide what properties, events, and metrics constitute an unused resource. Create automation to find and remove idle resources.
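The first two levers come down to a schedule decision that automation can evaluate for each resource. Below is a minimal sketch, assuming a weekday 07:00 to 19:00 working window in the resource's local time and environment labels such as "prod" and "dev"; the window, the labels, and the 25% off-hours fraction are hypothetical.

```python
from datetime import datetime


def in_working_hours(now: datetime, start_hour: int = 7, end_hour: int = 19) -> bool:
    """True on Monday through Friday between start_hour (inclusive) and end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def desired_state(environment: str, now: datetime) -> str:
    """Off-hour shutdown: production always runs, everything else follows the schedule."""
    if environment == "prod":
        return "running"
    return "running" if in_working_hours(now) else "stopped"


def dimmed_capacity(normal_capacity: int, environment: str, now: datetime,
                    off_hours_fraction: float = 0.25, minimum: int = 1) -> int:
    """ASG dimmer: shrink non-production groups outside working hours, never below minimum."""
    if environment == "prod" or in_working_hours(now):
        return normal_capacity
    return max(minimum, int(normal_capacity * off_hours_fraction))


if __name__ == "__main__":
    saturday_morning = datetime(2024, 1, 6, 10, 0)
    print(desired_state("dev", saturday_morning))
    print(dimmed_capacity(8, "dev", saturday_morning))
```

A scheduled job could compare each resource's actual state with the desired one and act only on the differences, which keeps the automation safe to run repeatedly.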
Also recognize that two types of automation are useful:
- Event-driven actions. An AWS Lambda function can perform a process in reaction to an API event.
- Batch operations. Scheduled or ad hoc.
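As an example of the batch style, a scheduled sweep for idle resources might look like the sketch below. The idle rule (a storage volume left detached for 14 days), the record fields, and the function names are hypothetical; each service needs its own definition of "unused".

```python
from datetime import datetime, timedelta


def is_idle(volume: dict, now: datetime, max_unattached_days: int = 14) -> bool:
    """Hypothetical rule: a volume is idle once it has been detached long enough."""
    if volume["attached_to"] is not None:
        return False
    return now - volume["detached_at"] >= timedelta(days=max_unattached_days)


def sweep(volumes: list[dict], now: datetime) -> list[str]:
    """Batch pass over an inventory; returns the IDs to report or remove."""
    return [volume["id"] for volume in volumes if is_idle(volume, now)]


if __name__ == "__main__":
    now = datetime(2024, 3, 1)
    inventory = [
        {"id": "vol-1", "attached_to": "i-123", "detached_at": None},
        {"id": "vol-2", "attached_to": None, "detached_at": datetime(2024, 1, 15)},
        {"id": "vol-3", "attached_to": None, "detached_at": datetime(2024, 2, 25)},
    ]
    print(sweep(inventory, now))
```

The same `is_idle` predicate could be reused by an event-driven handler. Starting with a report-only sweep before enabling any automated removal is a cautious way to build trust in the rule.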
Well-trained, if not experienced, cloud architects, engineers, and developers can affect costs directly. A business transitioning to cloud technology cannot assume that the learn-as-you-go approach for its staff will successfully deliver robust and cost-effective solutions. Cloud provider training introduces cost awareness and cloud native design principles. Keep an eye out for new information from cloud providers as well. New services, whitepapers and cloud cost management best practices are changing the landscape almost daily. Professional certifications like AWS Solutions Architect are an excellent investment for the entire staff. "Cloud Native" philosophy is about applying a broad and deep understanding of cloud provider service and resource products, primarily what functionality is provided, and how it optimizes cloud spending. Managed service offerings like RDS beat self-managed solutions by reducing complexity, toil, and therefore labor costs.
Final Thoughts on Cloud Cost Optimization
When it comes to cloud cost optimization, there are many levers to pull and lots of data, so remember that it is a set of processes to be managed over time.
Observing the cloud and associated spend for an organization, one can get the lay of the land and prioritize actions. Delete unused and forgotten resources. Perform rightsizing reviews to understand whether the correct amount of capacity is being paid for. Review tradeoffs between spending, performance, reliability, redundancy, and spare capacity, looking at basic services like compute and storage. Engage with cloud vendors for advice, capacity reservations, and volume discounts. Include cost efficiency among the considerations when designing and engineering cloud applications, using the cloud-native philosophy. Unite and employ the perspectives and strengths of management, finance, analytics, and engineering with the common goal of cost efficiency.
Learn from the experts!
DISCLOSURE STATEMENT: © 2022 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.