15 cloud cost optimization best practices

Learn 15 cloud cost optimization best practices to manage resources efficiently and optimize performance in the cloud.

Dana Lutz

December 18, 2024

Migration to the cloud offers many benefits. Enterprise-grade infrastructure and services are available to everyone, not just large businesses with huge IT budgets. But at some point, customers of cloud infrastructure and service providers like AWS, Google Cloud and Azure need to understand how to optimize cloud costs. Not every small business stays small, and planning for growth early can yield useful insights and potentially big wins.

What is cloud cost optimization?

Cloud cost optimization is the practice of running applications in the cloud, performing work or providing value to the business, at the lowest possible cost while using cloud providers as cost-efficiently as possible.

Cloud cost optimization strategies can involve managing operational efficiency, reserving capacity for higher discounts and selecting the right size and type of computing services. This requires continuous monitoring, analysis and adjustment to ensure that a company's cloud spending is aligned with its business needs.

Optimization as a practice ranges from simple business management to complex scientific and engineering areas like operations research, decision science and analytics, and modeling and forecasting. With the right talent and tools in place, companies can gain visibility into their cloud spend and make informed business decisions.

Why is cloud cost optimization important?

Every organization trying to meet some goal, profit or otherwise, needs to minimize overhead and the cost of the goods and services it produces.

Consider a corporate data center ecosystem and a web app “stack” consisting of a web frontend, application layer and database backend. Every component and communication channel for the application must be sized to meet a maximum-demand event, like payday or Black Friday. The web stack might include 20 web servers behind a load balancer, 20 application servers behind another load balancer, and a database cluster. All this infrastructure might be duplicated in a geographically separate data center in either an active-active or active-standby configuration. A data center web app also shares communication infrastructure, like leased lines, routers, switches, load balancers and firewalls, with other apps. Ultimately, this shared infrastructure is sized to handle the heaviest traffic expected, with bandwidth and latency constrained by cost. Storage and backup systems, power and cooling, and physical space for all the IT equipment are planned for future user and workload growth.

When an existing on-premises application is replicated to the cloud as quickly as possible, without any redesign for cloud computing capabilities, the organization ends up paying for servers that spend most or all of their time sitting idle. Simple “lift and shift” to the cloud costs more than necessary, and when the dust settles, the benefits of cloud cost optimization become clear because minimizing overhead is a fundamental business process.

A new business starting off might design better systems in the cloud with the pay-as-you-go cost model in mind, but changes, entropy and lack of cost awareness will create opportunities for cost optimization.

Cloud cost optimization early days

As hundreds of data center-filling applications moved into the cloud, our software engineers realized how infrastructure as code could be managed with automation. AWS offers cost-saving suggestions through the Trusted Advisor tool. Although simple scripts can be written as needed to manage cloud resources, managed automation enabled us to report and act upon cloud resources at scale. Cleanup of unused resources was always a priority, and there was some anticipation of why cost controls were needed and how they could be implemented.

For example, resources for most development environments are only used during the day, so we automated the ability to shut down resources in bulk at the end of the business day and restart them the next morning. With the capability to report resource properties, usage and CloudWatch metrics for resources, we could search for and reduce or eliminate inefficiency and waste. Reviewing the pricing models, we learned that a small saving realized across hundreds or thousands of cloud resources could add up to thousands or millions of dollars over the course of a year.
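To make the arithmetic behind off-hours shutdown concrete, here is a minimal sketch that estimates the yearly saving from stopping development resources outside a working schedule. The fleet size, the hourly rate and the 12-hour, 5-day schedule are made-up illustrative values, and the estimate ignores charges that may continue while a resource is stopped, such as attached storage.

```python
from dataclasses import dataclass

HOURS_PER_WEEK = 24 * 7
WEEKS_PER_YEAR = 52


@dataclass(frozen=True)
class Schedule:
    """Hours a development resource is kept running each week."""
    hours_per_day: int   # e.g. 12 for 07:00 to 19:00
    days_per_week: int   # e.g. 5 for Monday to Friday

    @property
    def running_hours(self) -> int:
        return self.hours_per_day * self.days_per_week


def annual_savings(resource_count: int, hourly_rate: float, schedule: Schedule) -> float:
    """Dollars saved per year by stopping resources outside the schedule."""
    stopped_hours_per_week = HOURS_PER_WEEK - schedule.running_hours
    return resource_count * hourly_rate * stopped_hours_per_week * WEEKS_PER_YEAR


# Hypothetical inputs: 1,000 dev instances at an assumed $0.10 per hour.
business_hours = Schedule(hours_per_day=12, days_per_week=5)
print(round(annual_savings(1000, 0.10, business_hours)))
```

With these assumed inputs each instance is stopped for 108 of the 168 hours in a week, which is where most of the saving comes from.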

Cloud cost optimization best practices

Here are 15 best practices for developing a cloud cost optimization program. These best practices should be considered on an ongoing basis to achieve maximum and sustainable results.

What to do
- Cloud native design adoption
- Operational refinement
- Capacity reservations and volume discounts
How to do it
- Organize and manage a cloud cost optimization program
- Plan and prepare to track spending per cost center
- Review billing and pricing
- Leverage software and automation
- Provide skill training

1. Adopt cloud native design to leverage cloud-specific capabilities

Design and build more cost-efficient systems to replace existing ones. The cloud native idea is to employ every cost advantage to be gained by leveraging capabilities that are unique to the cloud environment.

The AWS Well-Architected Tool offers recommendations based on architectural best practices for the cloud. AWS also provides many architectural examples through whitepapers and documentation and experts who can give thoughtful system design advice. Leveraging resources like these from cloud providers is a great strategy for optimizing cloud native design and reducing cloud costs.

Cloud native design is a great way to optimize costs, but it requires skills and experience. Like open-source software, existing cloud infrastructure designs provide guidance. Most cloud infrastructure designs are variations of existing designs rather than being radically innovative and unique.

On the general subject of systems design, seek clarity about functional vs. non-functional requirements; performance is not always the top priority! DevOps in the cloud enables speedy delivery and innovation, not necessarily cost savings. In the cloud, engineers have control over costs. Optimizing purely for cost sacrifices performance and/or quality, and rarely is the lowest cost the primary goal for a new product or service. Cloud native design is part of designing and evaluating designs from a cost perspective.

2. Enable auto-scaling to handle dynamic workloads

Auto-scaling is a key strategy for managing cloud resources efficiently, allowing systems to automatically adjust resource allocation based on demand. Instead of provisioning excess resources to handle peak traffic or workload spikes, auto-scaling dynamically expands capacity during high-demand periods and reduces it when demand subsides. This ensures you’re only paying for the resources you’re actively using, which helps minimize the costs associated with idle infrastructure and over-provisioning.

It is unheard of for a traditional load-balanced server pool to be billed only for the servers in use. Every server purchased for the pool is paid for in advance and on an ongoing basis: server hardware plus data center space, power and connectivity. A great cloud native advantage is being billed only for the servers that are actively running in the pool. Cloud auto-scaling means that the capacity paid for is not greatly in excess of the capacity being used.
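The difference between a pool sized for peak and an auto-scaled pool can be sketched by counting billed server-hours over one day. The demand profile, the per-server capacity and the minimum pool size below are hypothetical, and the model assumes scaling reacts instantly at hour boundaries, which real auto-scaling does not.

```python
import math
from typing import Sequence


def servers_needed(requests_per_sec: float, capacity_per_server: float, minimum: int = 2) -> int:
    """Servers required for a load level, never dropping below a minimum pool size."""
    return max(minimum, math.ceil(requests_per_sec / capacity_per_server))


def server_hours_fixed(hourly_demand: Sequence[float], capacity_per_server: float) -> int:
    """A fixed pool is sized for the peak hour and billed for every hour."""
    peak_servers = servers_needed(max(hourly_demand), capacity_per_server)
    return peak_servers * len(hourly_demand)


def server_hours_autoscaled(hourly_demand: Sequence[float], capacity_per_server: float) -> int:
    """An auto-scaled pool is billed only for the servers each hour requires."""
    return sum(servers_needed(d, capacity_per_server) for d in hourly_demand)


# Hypothetical 24-hour demand in requests per second; each server handles 100.
demand = [100] * 8 + [900] * 8 + [2000] * 2 + [400] * 6
print(server_hours_fixed(demand, 100))       # 480
print(server_hours_autoscaled(demand, 100))  # 152
```

In this made-up profile the peak lasts only two hours, so the fixed pool of 20 servers is billed for roughly three times the server-hours the auto-scaled pool needs.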

3. Clean up unused resources

Maximizing cost efficiency involves a thorough cleanup of unused resources within your cloud environment. Regularly reviewing your infrastructure helps identify resources that are no longer in use or needed, which can significantly reduce unnecessary expenses. Automating the cleanup process where possible is key, as it allows for the ongoing identification and removal of these resources without requiring constant manual oversight. Implementing reporting mechanisms provides high visibility into these unused resources, making it easier to manage and streamline costs effectively.
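As an illustration of automated cleanup reporting, the sketch below flags storage volumes that are unattached and have not been used for a set number of days. The `Volume` record, its fields and the 30-day threshold are assumptions for the example; a real report would be fed from the cloud provider's inventory and metrics.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Volume:
    """Hypothetical inventory record for a storage volume."""
    volume_id: str
    attached_to: Optional[str]  # instance ID, or None when unattached
    last_used: date


def find_unused(volumes: List[Volume], today: date, idle_days: int = 30) -> List[Volume]:
    """Return unattached volumes whose last use is older than the idle threshold."""
    cutoff = today - timedelta(days=idle_days)
    return [v for v in volumes if v.attached_to is None and v.last_used < cutoff]


inventory = [
    Volume("vol-a", "i-123", date(2024, 12, 1)),  # attached, keep
    Volume("vol-b", None, date(2024, 12, 10)),    # unattached but recently used, keep
    Volume("vol-c", None, date(2024, 9, 1)),      # unattached and stale, report
]
for volume in find_unused(inventory, today=date(2024, 12, 18)):
    print(volume.volume_id)
```

Reporting candidates first and deleting only after an owner has had a chance to object is a cautious default for this kind of automation.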

4. Rightsize resources for optimal efficiency

Rightsizing is a crucial strategy for ensuring that your cloud resources are appropriately allocated to match actual usage. This involves analyzing resource consumption against capacity to identify whether you are over- or under-utilizing certain capabilities. For instance, it might be unnecessary for all EC2 instances to use a standard 50 GB boot volume if most instances only require a fraction of that size. Conducting service-specific rightsizing analyses allows organizations to adjust resource allocations based on actual demand, ultimately leading to cost savings and improved efficiency in resource management.
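A rightsizing check can be sketched as a comparison of peak utilization against a target ceiling. This example assumes a hypothetical size ladder in which each step down halves capacity, so utilization doubles, and it assumes an 80% target; neither is a provider rule.

```python
# Hypothetical size ladder, smallest first; each step up is assumed to double capacity.
SIZES = ["large", "xlarge", "2xlarge", "4xlarge"]


def recommend_size(current: str, peak_utilization: float, target: float = 0.8) -> str:
    """Step down the ladder while the observed peak would still fit under the target.

    peak_utilization is the highest observed fraction of capacity in use (0.0 to 1.0).
    """
    index = SIZES.index(current)
    utilization = peak_utilization
    # Halving capacity doubles utilization, so check the doubled value before each step.
    while index > 0 and utilization * 2 <= target:
        index -= 1
        utilization *= 2
    return SIZES[index]


print(recommend_size("4xlarge", 0.15))  # xlarge
print(recommend_size("2xlarge", 0.55))  # 2xlarge
```

Using the peak rather than the average keeps the recommendation conservative, at the price of leaving some savings unclaimed for workloads with rare, short spikes.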

5. Leverage capacity reservation and volume discounts

Reserved instances and on-demand capacity reservations feel a bit like returning to the corporate data center, where forecasting and budgeting for capacity are needed. A big distinction is that the forecast is primarily for “baseline” utilization, since the goal is to pay upfront for capacity that will definitely be used. Service providers offer big discounts for capacity paid for in advance, which leads to even better cost optimization. Further, cloud service providers may provide volume discounts for bigger customers, so it is worth knowing your options.
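The point about reserving only the baseline can be checked with a small cost model. In this sketch reserved capacity is billed every hour at a discounted rate whether or not it is used, and anything above it is billed at the full on-demand rate. The usage profile, the rate of 1.0 and the 40% discount are hypothetical.

```python
from typing import Sequence


def total_cost(hourly_usage: Sequence[int], reserved: int,
               on_demand_rate: float, discount: float) -> float:
    """Cost of a usage profile for a given number of reserved instances."""
    reserved_rate = on_demand_rate * (1 - discount)
    cost = 0.0
    for used in hourly_usage:
        cost += reserved * reserved_rate                   # paid whether used or not
        cost += max(0, used - reserved) * on_demand_rate   # overflow at full price
    return cost


def best_reservation(hourly_usage: Sequence[int], on_demand_rate: float, discount: float) -> int:
    """Try every reservation count from zero to the peak and keep the cheapest."""
    candidates = range(0, max(hourly_usage) + 1)
    return min(candidates, key=lambda r: total_cost(hourly_usage, r, on_demand_rate, discount))


# Hypothetical day: 10 instances for 16 hours, 30 instances for 8 hours.
usage = [10] * 16 + [30] * 8
print(best_reservation(usage, on_demand_rate=1.0, discount=0.4))  # 10
```

With these assumed numbers the cheapest choice is to reserve the 10-instance baseline. Reserving for the 30-instance peak would cost more than using no reservations at all, because the discount does not make up for paying for idle reserved capacity two thirds of the day.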

6. Choose the right storage solutions

Cloud storage costs can accumulate quickly if the appropriate storage options are not carefully selected for your data. It’s important to match storage solutions to the frequency of data access, the size of your datasets and the performance requirements of your workloads. Many cloud platforms offer tiered storage solutions, allowing you to allocate frequently accessed data to faster, more expensive storage and archival data to lower-cost, slower tiers. By selecting the right storage tier and implementing data lifecycle policies, you can ensure efficient use of storage without compromising on performance or accessibility.
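A data lifecycle policy is essentially a rule that maps access recency to a storage tier. The sketch below uses hypothetical tier names ("hot", "cool", "archive") and hypothetical 30-day and 180-day thresholds; real tier names, thresholds and retrieval fees differ by provider.

```python
from collections import Counter
from typing import Dict


def choose_tier(days_since_last_access: int) -> str:
    """Map access recency to a tier using assumed thresholds."""
    if days_since_last_access <= 30:
        return "hot"       # frequently accessed, fastest and most expensive
    if days_since_last_access <= 180:
        return "cool"      # infrequent access, cheaper storage
    return "archive"       # rarely accessed, cheapest and slowest to retrieve


def gigabytes_per_tier(objects: Dict[str, tuple]) -> Counter:
    """Sum object sizes per recommended tier.

    objects maps a name to (size_gb, days_since_last_access).
    """
    totals = Counter()
    for size_gb, idle_days in objects.values():
        totals[choose_tier(idle_days)] += size_gb
    return totals


# Hypothetical datasets.
datasets = {
    "daily-reports": (50, 2),
    "q1-exports": (400, 90),
    "2019-logs": (2000, 1500),
}
print(gigabytes_per_tier(datasets))
```

A summary like this shows how much data a lifecycle rule would move before the rule is enabled.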

7. Define and set a budget for cloud spend

Defining a clear budget for cloud spend is key to ensuring your cloud resources are used cost-effectively. Start by setting a budget that outlines how much your organization is prepared to invest in cloud services based on current needs and future growth plans. Define spending limits for each team or project, and consider forecasting future costs as your usage scales. By setting these financial boundaries upfront, you can manage expectations and prevent over-provisioning. Use a tool that allows you to track costs more transparently, so you can easily adjust your budget as needed while still keeping your cloud spend aligned with your business goals.
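Tracking spend against a budget usually includes a simple projection. The sketch below extrapolates month-to-date spend linearly to the end of the month and classifies the result. Both the linear forecast and the status labels are simplifying assumptions for illustration.

```python
import calendar
from datetime import date


def forecast_month_end(spend_to_date: float, today: date) -> float:
    """Project month-end spend by assuming the daily run rate so far continues."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def budget_status(spend_to_date: float, budget: float, today: date) -> str:
    """Classify a team's spend against its monthly budget."""
    if spend_to_date > budget:
        return "over budget"
    if forecast_month_end(spend_to_date, today) > budget:
        return "on track to exceed"
    return "within budget"


# Hypothetical team: $4,000 spent by December 10 against a $10,000 budget.
print(forecast_month_end(4000, date(2024, 12, 10)))     # 12400.0
print(budget_status(4000, 10000, date(2024, 12, 10)))   # on track to exceed
```

A warning raised on the projection gives a team time to act before the budget is actually exceeded.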

8. Establish a program to monitor cloud costs and educate all lines of business

Ultimately, people make cloud cost optimization happen or not. Establish a program to review, monitor, and control cloud costs, empowering technical, financial and managerial team members in every line of business to work together, share accountability and champion the cause.

A cost optimization program could create a consultancy which conducts seminars and trains the engineering community on critical topics. For example, at Capital One our program held “hothouses”, where engineering teams were brought into a training and working session for one or more days, giving them space to learn about and implement a range of optimizations, from the simple to complex. A training course was also developed to bring cloud cost awareness to the entire organization and was even integrated into the onboarding process for newcomers.

9. Review billing and pricing from cloud vendors to make informed purchasing decisions

Fortunately, cloud provider billing provides detailed information about what is being paid for. The high-level breakdown or itemization of costs is the map to savings. Likely the highest spend will be compute, storage and value-add managed services like RDS.

- Prioritize the highest-spend services for a detailed analysis. For example, AWS EC2 (compute) is often the highest-spend category on the bill.
- Prioritize cost optimization for the teams spending the most. The savings realized by the biggest spenders may be greater than the entire budget of the smaller ones. For example, in AWS, going down one instance size within the same class of instance, like m5.2xlarge to m5.xlarge, typically cuts the hourly rate roughly in half.

Understanding the pricing of everything the cloud vendor offers in great detail is very productive, because this allows better judgments to be made about what should be purchased or avoided.

10. Plan and prepare to track spending per cost center

Teams that individually have financial responsibility for their own cloud budgeting and spending need a way to track it. One could give each cost center its own AWS account, and in this case reporting is easiest.

However, when a single account has many application teams, each with its own budget, there will need to be a way to tie the costs to the teams responsible. In this case it is imperative to mandate a standard method for identifying ownership of cloud resources. Resource naming standards eventually fail, so additional properties in the form of resource tags or labels are normally needed. Think of resource tags as the cloud equivalent of barcode labels. Resource tags are arbitrary key:value pairs and properties that can be added to cloud resources for descriptive purposes. For example, a tag called "Department" or "Cost Center" can be used for describing the ownership of every cloud resource. If you have the opportunity to plan this all in advance, mandate a tagging standard that applies to everything as retroactive tagging is difficult. Tough decisions have to be made if some problem component lacks ownership in a production environment. For a tagging standard, the minimum viable product is cost centers aligned to how granular the cost reporting must be, anywhere from individual user to department.

It is common for a large enterprise to have a configuration management database, and a tag identifying that a cloud resource belongs to an app that has been permanently shut down is very useful. Metadata about cloud resources can also inform the intensity of cost optimization efforts, with the most critical resources being given the most tolerance for underutilization and idleness. Finally, an “Owner Contact” tag for every resource is useful when cost centers are very large and having a conversation about a resource is needed.
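A tagging standard becomes useful once it is enforced and used for reporting. The sketch below checks resources for required tags and rolls up cost per cost center. The tag keys `CostCenter` and `OwnerContact`, the resource dictionaries and the costs are hypothetical stand-ins for whatever standard and billing data an organization actually has.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed tagging standard for this example.
REQUIRED_TAGS = {"CostCenter", "OwnerContact"}


def missing_tags(resource: dict) -> List[str]:
    """Return the required tag keys a resource lacks, sorted for stable reporting."""
    return sorted(REQUIRED_TAGS - set(resource["tags"]))


def cost_by_cost_center(resources: List[dict]) -> Dict[str, float]:
    """Sum monthly cost per cost center; untagged spend gets its own bucket."""
    totals = defaultdict(float)
    for resource in resources:
        owner = resource["tags"].get("CostCenter", "UNTAGGED")
        totals[owner] += resource["monthly_cost"]
    return dict(totals)


# Hypothetical resources and costs.
resources = [
    {"id": "i-1", "monthly_cost": 120.0,
     "tags": {"CostCenter": "payments", "OwnerContact": "team-a@example.com"}},
    {"id": "i-2", "monthly_cost": 80.0, "tags": {"CostCenter": "payments"}},
    {"id": "i-3", "monthly_cost": 45.5, "tags": {}},
]
print(cost_by_cost_center(resources))
print({r["id"]: missing_tags(r) for r in resources if missing_tags(r)})
```

Keeping an explicit "UNTAGGED" bucket makes unattributed spend visible instead of silently dropping it from the report.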

11. Use software for financial tracking and resource monitoring

a. Financial tracking

Graphing cloud spend over time by expense type and cost center is vital for understanding trends and progress with cloud cost optimization. Daily, monthly and yearly data and graphs made available to everyone in the organization encourage healthy transparency and competition.

b. Resource monitoring

Metrics for every aspect of cloud resource utilization vs. capacity are needed for effective optimization. Capital One developed a tool specifically for rightsizing EC2 instances using history for the “four corners of utilization metrics”: CPU, memory, disk and network. Emphasis is on the recent history, but utilization peaks are captured and factored into an instance type recommendation. Alerting on unusual cost spikes or exceeded thresholds is also very useful.
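Alerting on unusual cost spikes can be as simple as comparing each day with a trailing average. The sketch below is one possible approach, not a description of any particular tool; the 7-day window and the 1.5x threshold are assumed values that would need tuning.

```python
from statistics import mean
from typing import List, Sequence


def cost_spikes(daily_costs: Sequence[float], window: int = 7, threshold: float = 1.5) -> List[int]:
    """Return indexes of days whose cost exceeds threshold times the trailing average."""
    spikes = []
    for i in range(window, len(daily_costs)):
        baseline = mean(daily_costs[i - window:i])  # average of the previous `window` days
        if daily_costs[i] > baseline * threshold:
            spikes.append(i)
    return spikes


# Hypothetical daily spend in dollars; day 8 jumps to 240.
costs = [100, 100, 100, 100, 100, 100, 100, 105, 240, 110]
print(cost_spikes(costs))  # [8]
```

The first `window` days are never flagged because there is not yet enough history to form a baseline.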

12. Automate management, reporting and cleanup

Ensure that everything in your cloud is being used and that nothing is unidentified or forgotten! Examining the overall cloud fleet composition also leads to insights for capacity reservations and potential opportunities for further cost optimization. For example, knowing that over half your EC2 fleet is m4.4xlarge suggests a significant potential cost savings if those instances are reserved for the coming year; AWS advertises Reserved Instance discounts of up to 72% compared to On-Demand pricing, with the actual figure depending on term length and payment option. It also raises questions like “Are the m4.4xlarge used because this is the correct size for the workload, or is some widely-copied infrastructure template spitting out m4.4xlarge because it only uses one size?” Note that the EMR service has minimum instance size limitations, so there will be cases where some larger instance types are required.
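A fleet composition report of the kind described here can be sketched with a counter over instance types. The fleet below is made up; in practice the list would come from an inventory export.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def fleet_composition(instance_types: Sequence[str]) -> List[Tuple[str, int, float]]:
    """Return (instance type, count, share of fleet), most common first."""
    counts = Counter(instance_types)
    total = sum(counts.values())
    return [(itype, n, n / total) for itype, n in counts.most_common()]


# Hypothetical fleet of ten instances.
fleet = ["m4.4xlarge"] * 6 + ["m5.xlarge"] * 3 + ["t3.medium"]
for itype, count, share in fleet_composition(fleet):
    print(f"{itype:12} {count:3} {share:.0%}")
```

A type that dominates the report is a candidate both for a reservation and for a closer look at whether a shared template is choosing the size by default.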

Using automation, one can implement cost-saving “levers” and cost-saving changes that do not impact infrastructure design. The following examples are specific to AWS but the principles should apply to other cloud providers.

- Off-hour shutdown for ASG, EC2, RDS
- ASG Dimmer: Reduce ASG size during non-working hours in nonprod
- EC2 Fleet Upgrade: Push to latest generation instances; promote cost-efficient platforms like AMD, ARM
- S3: Educate teams to understand their specific S3 data access requirements and to design bucket lifecycles to expire objects or move them to less-expensive access tiers
- Idle resources: Study a resource and decide what properties, events and metrics constitute an unused resource; create automation to find and remove idle resources

Additionally, there are two types of automation that are useful:

- Event-driven actions: AWS Lambda functions can perform a process in reaction to an API event
- Batch operations: Scheduled or ad hoc

13. Provide ongoing training for cloud architects, engineers and developers

Well-trained, if not experienced, cloud architects, engineers and developers can affect costs directly. A business transitioning to cloud technology cannot assume that the learn-as-you-go approach for its staff will successfully deliver robust and cost-effective solutions. Cloud provider training introduces cost awareness and cloud native design principles. Keep an eye out for new information from cloud providers as well. New services, whitepapers and cloud cost management best practices are changing the landscape almost daily. Professional certifications like AWS Solutions Architect are an excellent investment for the entire staff.

The cloud native philosophy is about applying a broad and deep understanding of cloud provider service and resource products, primarily what functionality is provided and how it optimizes cloud spending. Managed service offerings like RDS beat self-managed solutions by reducing complexity, toil and therefore labor costs.

14. Consider a multi-cloud strategy

Adopting a multi-cloud strategy could allow organizations to optimize costs and avoid dependence on a single cloud provider. By leveraging the strengths and pricing models of multiple providers, businesses can select the most cost-effective services for their specific needs. This approach helps in balancing costs across platforms, enabling organizations to take advantage of discounts, promotions or pricing structures that are favorable for certain workloads. For example, with Snowflake's multi-cloud capabilities, data can be seamlessly processed across different cloud environments, giving businesses the flexibility to choose where their data workloads run.

15. Minimize data transfer costs

Data transfer costs can escalate rapidly and become a significant line item in your cloud budget, particularly for applications that necessitate large volumes of data movement between different regions or services. To minimize expenses, aim to keep data transfers within the same region whenever possible. In many cloud platforms, data transfer between services or resources located in the same region is often free or incurs significantly lower charges compared to transfers between different regions. This approach not only helps reduce costs but can also enhance performance and lower latency since data does not have to travel over long distances.
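The effect of keeping traffic inside one region can be estimated from a list of data flows. The sketch below assumes same-region transfer is free and charges a flat hypothetical rate of $0.02 per GB for cross-region transfer; actual rates vary by provider, service and region pair.

```python
from typing import List, Tuple

# (gigabytes per month, source region, destination region)
Flow = Tuple[float, str, str]


def monthly_transfer_cost(flows: List[Flow], cross_region_rate: float,
                          same_region_rate: float = 0.0) -> float:
    """Sum transfer charges, pricing each flow by whether it crosses a region boundary."""
    total = 0.0
    for gigabytes, source, destination in flows:
        rate = same_region_rate if source == destination else cross_region_rate
        total += gigabytes * rate
    return total


# Hypothetical flows and an assumed $0.02 per GB cross-region rate.
flows = [
    (50_000, "us-east-1", "us-west-2"),  # cross-region replication
    (80_000, "us-east-1", "us-east-1"),  # app to database in the same region
]
print(round(monthly_transfer_cost(flows, cross_region_rate=0.02), 2))  # 1000.0
```

Listing flows this way also shows which cross-region paths are worth redesigning first.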

Developing an efficient cloud cost optimization strategy

When it comes to cloud cost optimization, there are many levers to pull and a lot of data, so remember that it is a set of processes to be managed over time. Below is a summary of cloud cost optimization best practices to manage your resources efficiently:

- Observe the cloud and associated spend for the organization to get the lay of the land and prioritize actions.
- Delete unused and forgotten resources.
- Perform rightsizing reviews to understand whether the correct amount of capacity is being paid for.
- Review tradeoffs between spending, performance, reliability, redundancy and spare capacity, looking at basic services like compute and storage.
- Engage with cloud vendors for advice, capacity reservations and volume discounts.
- Include cost efficiency in the design and engineering of cloud applications, using the cloud-native philosophy.
- Unite and employ the perspectives and strengths of management, finance, analytics and engineering with the common goal of cost efficiency.

At Capital One, we went all-in on the cloud in 2020, becoming the first U.S. bank to exit data centers. These cloud cost optimization best practices have been key to helping us optimize our cloud environment and save 27% on our Snowflake costs.

Additional resources

FinOps Foundation - Cloud financial operations, a new paradigm!
AWS Cost Optimization Best Practices
AWS Billing and Cost Management Documentation
EC2 pricing
- On-demand pricing
- Instance Type Explorer

Dana Lutz, Master Software Engineer, US Card

Managing Capital One’s cloud since 2016, including evangelism, automation and reporting for cost optimization.