Resiliency at the Edge
Resiliency Patterns leveraged by Capital One’s Edge Engineering Team
September 19, 2018
Capital One leads the financial industry in transitioning from brick-and-mortar to cloud-based, digital products. Traditionally, the financial industry has tended to rely on private data centers and antiquated mainframe systems, and as a result has been slow to adopt modern technologies. As part of our technology evolution, Capital One is working to incorporate agile methods, including delivery automation, decoupled systems (i.e., via well-defined APIs), and microservice architectures. As with any system-wide transformation of highly active transactional systems, challenges will arise in this transition.
Most current architectures address the risk of service failures by employing techniques such as load balancing, clustering, and infrastructure redundancy. These patterns, while effective, only account for a full outage of a particular service or component. Such outages are easy to detect and resolve compared with the nuanced issues found in feature-based disruptions. Finding out when a microservice is not working as intended, performing poorly, or contributing to downstream failures can be a challenge. Yet services need to identify and correct these conditions automatically in order to minimize disruption to the customer experience.
The mobile teams at Capital One proudly provide our customers with award-winning digital experiences via products like the Capital One mobile application. When designing distributed microexperiences (i.e., user experience-focused microservices), the Edge engineers evaluated pioneers in the field such as Netflix and Amazon Web Services to understand how they implemented resiliency in their environments.
In this blog, we’ll focus on various resiliency patterns that we implemented at the Edge — the smart technology layer between the external-facing infrastructure and our business services — to improve our customer experience and defend our downstream services.
We will consider four of our resiliency patterns that allow services to serve our customer experience and protect our internal resources. These resiliency patterns help to prevent issues from cascading to upstream or downstream systems. These four patterns are Client-Side Load Balancing, Circuit Breaking, Fallback, and Bulkhead.
Client-Side Load Balancing Pattern
With microservice architectures, the client-side load balancing pattern is preferred over server-side load balancing because it scales easily, handles updates efficiently, and eliminates bottlenecks and single points of failure. The load balancing capability is pushed to each of the clients, distributing the responsibility of load balancing.
This pattern includes a discovery service, such as Netflix Eureka or HashiCorp Consul. On startup, a service instance registers with the discovery server, publicizing its location. The discovery service will cache the location and health status of the service instance. A discovery client will then look up the instances of a particular service from the pool of healthy instances.
If a service instance is responding slowly or throwing errors due to load or an issue with a downstream service, the client-side load balancer component will detect the problem and take corrective action. The unhealthy instance is removed from the resource’s pool, preventing it from being consumed in the future. Even if the discovery service goes down, a local copy of the instance list is maintained on the client, so connectivity can continue using the essentially current information in the cached copy.
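The behavior described above can be illustrated with a minimal sketch. It is not how Eureka, Consul, or any particular client library is implemented. The class name `ClientSideLoadBalancer`, the `fetch_instances` callable standing in for a discovery lookup, and the consecutive-failure threshold are all assumptions made for this example.

```python
class ClientSideLoadBalancer:
    """Round-robin over a locally cached list of service instances.

    `fetch_instances` is a hypothetical callable that stands in for a
    discovery-service lookup and returns the currently healthy instances.
    """

    def __init__(self, fetch_instances, failure_threshold=3):
        self._fetch_instances = fetch_instances
        self._failure_threshold = failure_threshold
        self._instances = []   # local cached copy of the instance list
        self._failures = {}    # consecutive failures per instance
        self._next = 0
        self.refresh()

    def refresh(self):
        """Refresh the cache; keep the old copy if discovery is unreachable."""
        try:
            fresh = list(self._fetch_instances())
        except Exception:
            return False       # discovery is down: keep serving from cache
        self._instances = fresh
        self._failures = {instance: 0 for instance in fresh}
        return True

    def choose(self):
        """Pick the next instance in round-robin order."""
        if not self._instances:
            raise RuntimeError("no healthy instances available")
        instance = self._instances[self._next % len(self._instances)]
        self._next += 1
        return instance

    def report_success(self, instance):
        """Reset the failure count after a successful call."""
        if instance in self._failures:
            self._failures[instance] = 0

    def report_failure(self, instance):
        """Evict an instance from the pool after repeated failures."""
        if instance not in self._failures:
            return
        self._failures[instance] += 1
        if self._failures[instance] >= self._failure_threshold:
            self._instances.remove(instance)
            del self._failures[instance]
```

A caller would invoke `choose()` before each request and report the outcome with `report_success()` or `report_failure()`. A later successful `refresh()` restores any instance that the discovery service again lists as healthy.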
Circuit Breaker Pattern
This pattern prevents a client from continuing to call a service that is failing or experiencing performance issues. Software circuit breaker patterns are modeled after electrical circuit breakers. Electrical circuit breakers detect power surges and break connections to prevent the propagation of a power overload that can damage the circuit, or the devices connected to the circuit.
In the case of software services, a circuit breaker is used in conjunction with consuming a resource. If the call takes too long, the circuit breaker will cut the call, allowing it to “fail fast.” If there are too many failing requests to a remote resource for a given call, the circuit opens. The open circuit allows the service to continue to operate, prevents the failure from cascading to other systems, and provides the failing service time to recover.
Currently, we use Netflix’s Hystrix as our circuit breaking technology. Hystrix wraps all calls to external systems, resources, or dependencies in a HystrixCommand or HystrixObservableCommand object, which typically executes within a separate thread. If a call takes longer than the established threshold, it times out; the threshold is based on the 99.5th percentile performance of all requests for a given resource. Hystrix maintains a pre-configured thread pool for each dependency. If the thread pool becomes full, requests directed to that resource are immediately rejected instead of queued up, preventing the overloading of downstream services. Hystrix continuously measures successes, failures, timeouts, and thread rejections so it knows when to open or close the circuit. Once the circuit is closed, requests to the resource automatically resume.
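The open and closed behavior of a circuit breaker can be sketched in a few lines. This is an illustrative simplification and not Hystrix itself: it counts consecutive failures rather than measuring an error percentage over a rolling window, and it does not enforce timeouts or use thread pools. The names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are assumptions chosen for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout   # seconds to wait before a trial call
        self._clock = clock                  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None               # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"               # allow a trial call through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast without touching the struggling dependency.
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed trial call re-opens the circuit immediately; otherwise
        # the circuit opens once the failure threshold is reached.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

While the circuit is open, callers receive `CircuitOpenError` immediately, which gives the failing dependency time to recover. After `reset_timeout` elapses, one successful trial call closes the circuit again.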
Fallback Pattern
The Fallback Pattern consists of detecting a problem and then executing an alternative code path. This pattern is used when the original code path fails, and it provides a mechanism that allows the service client to respond through alternative means. Alternative paths may include static responses, stubbed fallbacks, cached responses, or even alternative services that provide similar information. Once a failure is detected, perhaps through one of the other resiliency patterns, the system can fall back.
For example, some of Capital One’s digital products use a map feature that displays the locations of our ATMs, bank branches, and other physical service outlets. Typically, this service will return a list of locations dynamically. However, in the case of a failure, our services will revert to a local cached list of locations. The user experiences a map display with the locations, instead of an error. In most situations this will result in a better experience for the user.
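A rough sketch of the map example follows. It is an illustration only, not Capital One’s actual service code: the helper `with_fallback`, the two fetch functions, and the placeholder data in `CACHED_LOCATIONS` are all hypothetical, and the live lookup is hard-coded to fail in order to simulate an outage.

```python
def with_fallback(primary, fallback):
    """Return a callable that tries `primary` and, if it raises,
    answers through `fallback` instead."""
    def wrapper(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # The original code path failed; respond through the alternative.
            return fallback(*args, **kwargs)
    return wrapper


# Hypothetical locally cached locations (placeholder data, not real sites).
CACHED_LOCATIONS = {
    "00000": [{"name": "Placeholder ATM", "type": "atm"}],
}


def fetch_locations_live(zip_code):
    """Stand-in for the dynamic location service; simulates an outage."""
    raise ConnectionError("location service unavailable")


def fetch_locations_cached(zip_code):
    """Serve the last known locations from the local cache."""
    return CACHED_LOCATIONS.get(zip_code, [])


get_locations = with_fallback(fetch_locations_live, fetch_locations_cached)
locations = get_locations("00000")   # served from the cache, no error raised
```

The caller of `get_locations` does not need to know which path produced the answer, which is what lets the map render instead of showing an error.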
Bulkhead Pattern
Services can use multiple distributed resources in order to build the response to a user request. The Bulkhead Pattern compartmentalizes these calls so that poor performance of one service does not negatively impact the results of other services and, in the end, the user experience. The Bulkhead Pattern is based on a familiar concept from ship design. Ships are divided into watertight compartments in order to keep water from spreading from one compartment to other areas of the ship during a hull breach. The partitions that separate these compartments are called “bulkheads.” This way, if the ship’s hull is compromised, the risk of the ship sinking is reduced.
Hystrix provides an implementation of the Bulkhead Pattern by limiting the number of concurrent calls to a component. This way, the number of threads that are waiting for a reply from the component is limited. In a system without this type of segregation, the threads could run out, causing a cascading effect that disturbs the performance of other dependencies. The separate thread pools in this scenario act as bulkheads. If a service stops responding (i.e., hangs) or starts performing poorly, the threads for that particular service will be depleted, but other services will continue to be responsive, thus minimizing the impact.
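The idea of limiting concurrent calls per dependency can be sketched with a semaphore. This is a simplification of the thread pool isolation described above, not Hystrix’s implementation, and the names `Bulkhead`, `BulkheadFullError`, and `max_concurrent` are assumptions made for this example.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency limit is already reached."""


class Bulkhead:
    """Caps the number of concurrent calls to one dependency."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing when every slot is busy.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("bulkhead full; rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` instance means that a hanging dependency can exhaust only its own slots. Calls to other dependencies keep their capacity and stay responsive.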
These four patterns (Client-Side Load Balancing, Circuit Breaking, Fallbacks, and Bulkheads) are essential to promoting resiliency, but it can be a challenge to visualize their status during operations. Our team uses a counterpart capability from Netflix’s Hystrix libraries, called the Hystrix Dashboard. This dashboard provides a view of each call or HystrixCommand in real time, reducing the time it takes to gain an understanding of operational events. The dashboard provides a single view, sorted by specific criteria, that helps draw attention to important events based on volume, error rates, circuit-breaker status, and overall health of a circuit. As a further extension, we enabled alerting based on the HystrixCommand’s circuit breaking frequency, providing automatic notification of operational conditions.
Figure 1.0 depicts the graphical representation of one circuit from the dashboard, along with an explanation of the data points available. The dashboard is also able to aggregate metrics from several microexperiences or microservices and can show overall traffic per cluster of instances by using the Hystrix Turbine Service.
Figure 1.0 Hystrix Circuit Representation from the Dashboard
Life at the Edge
As part of Capital One’s Edge engineering team, our mobile microexperiences orchestrate millions of requests a day, which translates into billions of calls through the Capital One ecosystem, all done in the name of providing our customers with award-winning financial products and services.
Our job is to provide users a seamless experience that maintains high availability and resilience even when major or minor failures are encountered. We continuously work on innovative solutions to advance distributed applications and client resiliency within a high-volume environment.