Caches Are Key to Scaling
Choosing the correct cache technology for… profit
July 19, 2018
As data and applications continue to get larger and faster, sometimes we need to make the data readily available. Depending on the need, we may store, or cache, that data in different ways.
Today, I want to bring you, my readers, together and talk about the concept of extremely fast data access, using caches to back up high traffic APIs and message consumer/producers… to make cash. (Get it? Cache? Cash? Yeah, I did that.)
A primary reason to set up caching outside of your database is to reduce load within your database engine. While scaling is easier than ever in the cloud, it still costs money to scale. Even if you’re using open source databases, you’re still going to pay for compute and storage needs. Caching can help reduce the load, saving cash right off the bat. (See how I tied that in there?)
At Capital One I lead a large group of engineers across an enterprise program that delivers customer’s digital messages. That’s a fancy way of saying email, SMS and push notifications to the major phone platforms. We send an amazing number of messages every day on behalf of a wide range of internal applications. My guidance across the board is, “Cache. Cache all the things.”We process a lot of data and we need as fast of access to this data as possible. Since we have so many applications and data sources, we have different patterns and different data available depending on the process that triggers a message.
Let’s walk through a few examples of caching architectures to assist in building blazing fast message bus-based code or extremely responsive APIs. But first, the key to scaling a data processing engine is to set up queues and pass data asynchronously between applications. A big distinction in the asynchronous pattern would be that it allows us to scale in a very different way than if things were synchronous, or ‘blocking’ from start to finish.
When your code is blocking, you’ll tend to have to scale vertically (bigger machines) and horizontally (more machines). When we go asynchronous, with a bunch of tiny microservices, we’re able to scale horizontally without as much of a need to scale vertically. This is ripe for a lot of modern patterns including containers or serverless functions-as-a-service (FaaS e.g. AWS Lambda).
Follow me here if you want to keep an eye on more details of scaling horizontally with micro services or functions-as-a-service!
The different implementation styles below are different ways to build caches. We’ll look at building a local cache for a single application first. One application uses one local/in memory caching service to process the data it needs to process. Then we’ll scale a little bit more by using an external application, such as Redis, to store our cached data. This can be really useful when scaling a piece of code horizontally. Finally, we’ll look at building a strategically placed, distributed cache that will give advantages to all consuming components.
Let’s dive in.
A local cache is the easiest thing you can build to speed up your application. Simply grab external data, store it locally, and then look it up later using some sort of primary key. This works for applications that are continuously running like a web application or a streaming data platform. There are multiple ways this could work.
My previous article, Blazing Fast Data Lookups in a Microservices World, focused on building a blacklist. We were primarily concerned with the existence of data in a set. We weren’t associating data with a key or using storing data beyond checking whether or not we had a specific value. All we needed to know was if data existed, and based on that we could make a decision.
This is a very common use case, but we can do more to add value in our application by solving other problems. As developers, we would typically associate that kind of ‘question’ to a boolean data type. Some examples of when we could use a data structure like the trie we built in the last article.
- Is this email in my blacklist?
- Is this a known bad domain name?
- Is this email sender in my spam list?
- Is this file hash in my allowed list?
Since we’re using pretty generic data structures, we can also attach more data at the end of each path. That data can be used as a lookup. For a blacklist, you can add meta-data such as when the email address was added to the list or maybe a counter of how many times it was seen.
When choosing the right structure to store all your data, there are things to consider.
- How large is your data set?
- How long are your ‘keys’?
- Can they be hashed to consistent length?
- What data type are you referencing?
- Do you need to add and remove items over time?
Look at other structures like the y-fast trie or a double array trie. If your data set is small, then use a simple in-memory cache. This could be a Python frozenset or a Java HashSet. The important component is being able to find the data efficiently. Remember Big O notation? Let’s refresh our memory.
Big O is a way of notating the number of iterations or operations needed to find a piece of data in a collection. O(1) means it takes a constant number of operations no matter how many items are in a collection. That’s the best. O(n) means it could take up to n iterations, where n represents a linear increase in operations based on the number of items in the collection. Finding an element in a standard array would be notated as O(n) since the time it takes to find an item in the array increases linearly when more items are added. O(n²) is bad, as the lookup time isn’t linear with the number of elements in the list, but it grows exponentially.
While we’re at it, O(log N) is also good. This translates to using a tree of some sort, typically a binary tree structure to limit the branches your code must iterate to find a piece of data.
The best way to find data in a set is when you can represent it with O(1). One piece of data, one element. This is accomplished by using an index of data. Let’s say we needed to look up a value in a huge list of integers. Here’s a naive implementation in C:
Ok, that won’t work. To get one element in the array, we had to pre-allocate 64,535 memory slots. I’m not even sure that code works, but hopefully you get the point. It’s the same issue with the trie implementation in my other article — there is just too much overhead for a small data set.
Smart Hashing Algorithm
Now, consider a smart hashing algorithm. If each element of a data set can be passed through a hashing algorithm, then put in the Set, the data can be found. The assumption here is that the hash set can be bound in a much smaller size. Once the data is bound down, then the set can be stored in memory.
The trie structure is considered O(log N) since the searching is not a simple one operation lookup, but the application needs to recurse (or iterate) down different branches of the data structure when finding something. This picture is a good reference from the previous article to demonstrate looking up data in a trie.
These are good, but they’re only locally available. In a microservices, or horizontally scaled environment, that can cause some issues. First, each instance of the application needs to read the data to have it in a ready state. If we’re building a large reference data set we’ll typically pull it from a database. A basic implementation would need each instance to connect to a central database, query all the data, and then be ready for business. For a very large data set, that could put a pretty big load on your database.
Local — Remote Cache
One of the huge advantages of a centralized cache is that you only have to load it once and many clients can connect to the same data store to pull cached values. In this article, we’ll talk about using Redis or Elasticsearch for this. For single key (String) value pairs, let’s look at Redis.
Redis is a key-value store with a few great features built in. For our purposes here, we’ll focus on the standard string value (String key, String value) and the hash value (String key, String subkey, String value). For instance, if you want to store a user’s email address mapped to their web session, you would set the key to the session id, and the value to the email address using the ‘SET’ command.
`redis-cli> set session-[session_id] [email]`
`redis-cli> set session-jsessionid23420934820394234 firstname.lastname@example.org`
For all of your applications, you can have them connect to Redis, and pull the user’s email address by the session id.
`redis-cli> get session-[session_id]`
`redis-cli> get session-jsessionid1234`
Using the standard key-value (string, string) inputs, you can map a key to a JSON string, encrypted data in ASCII/Unicode, an integer with atomic operations or just a string.
Redis can support an amazing throughput and number of transactions per second, making it an ideal distributed cache. I’ll call out one caveat. I’ve never used the distributed deployments of Redis for high availability or failover, so unless you dive into that realm don’t consider a single Redis node to be failure proof. In a cloud environment like AWS, consider it to be a cache for your Availability Zone (AZ).
More Distributed, Yet Not — Distributed Cache
If the local distributed cache idea can be thought of as a shared cache for consuming applications, I want to throw out another idea.
Using Redis as a back end for caching is excellent. Your application can even use a local cache with a technology like Guava, or a simple map data structure. Typically, the code will check its local cache, then check the distributed cache, and if there’s a miss there it can go to the source; often a database. That’s code you will have to write over and over (I’m ignoring libraries, wrappers and all that jazz for now). I’ve dealt with a lot in enterprise applications where you’ll see many applications reaching out to the same database to retrieve/update data. This is an anti-pattern in today’s microservice focused architectures, but it’s still a pattern that will likely survive for years to come. The worst-case caching scenario here would include your three web apps, your call center app, as well as your IVR all implementing a caching layer.
In this scenario, it would be a big challenge understanding which code makes changes. Each application needs to know about naming schemas. Each application has to be modified when a DNS change is made. This is not maintainable for an extended period.
Let’s change it up a little and put an API layer that I like to call the “CRUD” (Create, Read, Update, Delete) layer. This CRUD layer will force all of the consuming applications through a single point to access the database. Don’t worry, you can still distribute this API layer for resiliency. The CRUD application becomes an intentional bottleneck for traffic to ensure a consistent set of business operations on your data.
This is the single application that needs to rely on caching of data. The cache can be local, distributed etc., but now your end applications are unaware of the data coming from the database. There are a ton of benefits for having a data access layer like this. In addition to a single source of business logic, you’ll have a significantly easier time making any application changes if you have to change your database. Did I mentioned a single point to add a caching layer?
The latency here will be reduced from network overhead plus database query time, to network overhead only when the cache hits. For high volume applications, this can save a huge amount of workload on your database.
Let me give you a few more considerations before wrapping up. Caching data is critical to speedy response. Every time a database has to read data off disk, you’re going to hit a disk IO bottleneck. There are fewer things slower than that. All engineers should be familiar with:
These numbers are critically important when speed is a concern. Network and disk are slow. Memory is fast. Databases will cache frequently accessed records. Know your database, and determine if you’re wasting too much time building a cache on top of it. Keep it in mind that cache, by definition, means it’s not the true source of data. This would manifest in a real world environment by having the cache get loaded at time n, the database getting updated at time n+1, and the cache being read at time n+2 with a successful cache hit. The data is now stale.
If that’s acceptable in your environment, then great! Caching is for you. Two things to consider:
How long does your data live in the cache?
- Come up with an expiration policy, and set a time to live (TTL) on each record. Psst. Redis has an ‘expire’ command to set a TTL on a key.
Do you need to worry about invalidating data in the cache?
- If you’ve built a CRUD layer like I explained above, your code can update the cache pro-actively when a POST, PUT or PATCH API call is made to update your data. Choose to either update the cache, or delete the record from the cache, so it can be lazy loaded later!
To quote my Capital One colleague Jon Bodner
It’s not just what cache to use, but what to cache as well
What’s the point of caching? I did some foreshadowing with the title — caches save cash. The more time it takes to get data from a database, the ‘slower’ each application is. To support the throughput we need, we need to scale horizontally. When we cache things, we don’t have to wait as long. Therefore we don’t have to scale as wide. Therefore, we save money.
Tactically, there are as many design options as there are words in this article. Hopefully you have some new ideas or insights into building your application to be as performant as your customers need it to be.
Links and Articles
- I (https://medium.com/@chrisfauerbach) recently posted about super high speed data types for building a black list. https://medium.com/capital-one-developers/blazing-fast-data-lookup-in-a-microservices-world-dd3ae548ca45
- James Higginbothom (https://medium.com/@launchany) just posted an awesome article about determining when to use an API vs a streaming data solution. https://medium.com/capital-one-developers/choosing-between-rest-web-apis-and-message-streaming-8e2f4813a058