The Five Principles of Successful Software
Follow these practices to mitigate failures and build systems that last
What is Engineering?
We call ourselves “software engineers,” but I don’t know how much time we spend thinking about what engineering is. We think about engineers being people who design and build things, whether they are bridges or circuits or software. While that is true, it’s not the building part that makes it engineering.
From an early age, people are driven to build things, whether it’s a Lego house or a “Hello, world” program. As we grow and learn, we build more complicated things, like real houses and programs that process critical data. What separates the engineers is that they consider how to build things that survive no matter what the world throws at them. At its heart, engineering is all about how to handle failure.
Because no matter how perfect our plans are, things are going to go wrong.
Failures come in many sizes. Not every engineering failure is a rocket exploding or a healthcare.gov web site that can't launch. If you have a service that needs to be rebooted periodically due to a memory leak, that's still a failure. If you have a web site that is so hard to navigate that it drives customers away, that's also a failure. As engineers, it's up to us to do everything we can to make sure we've designed and built our software as correctly and robustly as possible.
That brings us back to system design. It's not enough to design systems that meet the functional requirements outlined by product managers. We also have to design systems that are resilient to failure. So, how do we do that?
There are three magic words that you'll hear from senior engineers: "Well, it depends." The longer you spend writing software, the more you realize that there are no hard and fast rules that guarantee success. However, there are five principles that I've found helpful for building resilient systems. They are:
- Boring is Good
- Stick to Your Core Competencies
- Non-functional Requirements are Requirements
- Design for Testability
- Nothing is Forever
You might notice that there's nothing here about microservices or Kubernetes or any particular technology at all. I'm not going to tell you how many availability zones to deploy to or whether you should pick REST or GraphQL. That's because system design principles are at a more fundamental level. They guide your technology choices, not the other way around.
So, let's go through those principles one by one.
1. Boring is Good
First, boring is good. I've written about boring technologies before. Being boring is fundamental to good engineering. It means relying on the accumulated knowledge of everyone who has come before you. Chances are, you aren't the first person to come across a design problem. Research common patterns and successful systems. See how other people have solved this problem and how they decided between implementation options. Find reports on failures and how people redesigned their systems in response. In short, stand on the shoulders of giants.
The originator of the idea of being boring might be the English writer and philosopher G. K. Chesterton. He is known for a thought experiment called Chesterton's Fence. I'll paraphrase what he said, because 150-year-old English philosophers are wordy.
Suppose there's a fence blocking a road. You might say, "This fence serves no purpose, it gets in my way, let's tear it down." But Chesterton said that's wrong. If you don't understand why the fence is there, you absolutely should not tear it down. You need to figure out why the fence was put there. Once you do, then you can consider taking it down.
How does this apply to system design? First, be sure you need to build a new system. If, once you understand what the old system is doing, you realize that it's functioning as designed and as needed, leave it alone. There's a tendency for engineers to rebuild things using the latest technologies, just because they are interesting. Some engineers even pick their tech stacks so they can add something new and cool to their resume (but not you, I’m sure). Avoid the temptation.
Second, if you do build something, be careful when adopting new techniques and technologies. Take a look at Choose Boring Technology, a great blog post written by Dan McKinley in 2015. He introduced the idea of "innovation tokens" to limit the number of new things you try out. He thought that a small company could safely spend up to three innovation tokens. For a team at a big enterprise, I wouldn’t spend more than one on a project per year.
There are people who can push the limits on technology. However, for most of the projects happening at enterprises - where stability is valued over coolness - those innovation tokens are going to matter.
2. Stick to Your Core Competencies
Choosing boring technologies isn’t enough. You also must focus your team on building things that no one else can. While it can be a lot of fun to build software and infrastructure, don’t build anything that was already created by someone else, either within your company or outside.
Be merciless in your quest for reusing other people's stuff, because reuse buys you time. Time is the most precious thing on a project because there's no way to get it back. Time spent researching how to solve a problem is an investment, because it's likely to save you a great deal of time later. Meanwhile, time spent rebuilding something that already exists means you have less time for everything else - including the things that only you can do.
The cost of recreating existing systems isn’t a one-time loss of time. Over the not-so-long term, maintaining your system costs your team far more time than the initial development. If you are using technology that's widely shared, your costs are spread out over all of those projects. Everything that's unique to your project is a cost in time that is only carried by your team.
Reusing other people's work is so important that you should go out of your way to adapt the systems you must design to the systems and code that you can reuse. One of the best ways to avoid surprise failures is to take advantage of the experience of others. Some problems seem deceptively simple and you might think it's easier to recreate solutions than to adapt your design ideas, but those existing systems have already been battle tested against corner cases that you haven't even imagined.
There are trade-offs when reusing existing technologies. The more you depend on a particular library, API, database, or operating system, the harder it becomes to switch away from it. Do your best to prepare for a replacement. If a technology no longer meets your needs or becomes unsupported, you'll need to find a way to decouple. Be sure there are clear boundaries between your stuff and your dependencies. Think of it as drawing up a legal contract. You need to specify not only how you will work together, but also how you will break things apart.
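One way to draw that boundary is to have your own code depend on a small contract that you define, and keep each dependency behind an adapter that fulfills it. The following is a minimal sketch in Python, not a prescribed design. The names BlobStore, InMemoryBlobStore, save_report, and load_report are hypothetical and exist only for illustration.

```python
from typing import Protocol


class BlobStore(Protocol):
    """The contract our code depends on (hypothetical name)."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore:
    """One implementation of the contract.

    An adapter around a vendor SDK would expose the same two methods.
    """

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._items[key] = data

    def get(self, key: str) -> bytes:
        return self._items[key]


def save_report(store: BlobStore, report_id: str, body: str) -> None:
    # Business logic only knows the BlobStore contract, never the vendor.
    store.put(f"reports/{report_id}", body.encode("utf-8"))


def load_report(store: BlobStore, report_id: str) -> str:
    return store.get(f"reports/{report_id}").decode("utf-8")
```

If the storage technology has to be replaced, only the adapter changes. The functions that hold the business logic stay as they are, which is the "how we break things apart" clause of the contract.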
3. Non-functional Requirements are Requirements
Non-functional requirements are the things that must be true in order to support the functional requirements. These include:
- How much time does your system have to return a response?
- How many simultaneous requests does your system need to handle?
- What should the system do if it can't handle any more requests?
- What should the system do when one of its dependencies is not responding?
- How much is running this system going to cost and how do I minimize that cost?
- Is it OK to compromise on the speed of responses or the number of simultaneous requests to save some money?
It's important to properly define these non-functional requirements. Engineers should provide input on feasibility (you can’t require a 50 ms response time on requests that go from the US to Europe, because networks aren’t that fast), but the product owner is responsible for knowing what the performance, uptime, and cost requirements are.
Be careful to avoid unintended consequences when specifying non-functional requirements. Imagine you are defining the non-functional requirements for an API. You want to make sure that it responds quickly to user requests, so you specify that the system must return a response in an average of 50 milliseconds. That sounds OK, but averages hide variance. A system where half the requests take 1 ms and half take 99 ms meets that requirement, but it is not what you wanted.
A better approach is to use a percentile, saying that you want 99% of requests to take 50 ms or less, but even this is incomplete. You don’t want a system where 99% of the requests take 50 ms and 1% take a second.
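To make the difference concrete, here is a small sketch that computes the mean, the 99th percentile, and the worst case for the two hypothetical latency distributions described above. The percentile helper is an illustrative nearest-rank implementation written for this example, not a function from a monitoring library.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Half the requests take 1 ms and half take 99 ms.
bimodal = [1.0] * 50 + [99.0] * 50
# 99% of the requests take 50 ms and 1% take a full second.
long_tail = [50.0] * 99 + [1000.0]

for name, latencies in [("bimodal", bimodal), ("long_tail", long_tail)]:
    print(
        name,
        "mean:", statistics.mean(latencies),
        "p99:", percentile(latencies, 99),
        "max:", max(latencies),
    )
# bimodal mean: 50.0 p99: 99.0 max: 99.0
# long_tail mean: 59.5 p99: 50.0 max: 1000.0
```

The first distribution passes a 50 ms average target while half its users wait 99 ms. The second passes a 50 ms p99 target while one request in a hundred waits a full second. Neither number on its own describes the experience you want.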
Instead, you should bound your worst-case time, specifying that 99% of requests should take 50 ms or less and that any request taking longer than 100 ms should return an error. That helps you define a performance envelope, the total amount of time allowed for a request made by an end-user. When the performance envelope is known, product managers can split up the available time among all the services required to handle a request. This tells you if you can meet the performance needs of your users.
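A hard upper bound like this can be enforced in code rather than left as a hope. The sketch below uses Python's asyncio.wait_for to cut off work at the deadline. The handler name, the response dictionary, and the choice of a 504 status code are assumptions made for illustration, and the two lookup functions merely simulate a fast and a stalled dependency.

```python
import asyncio

DEADLINE_SECONDS = 0.100  # the 100 ms worst-case bound from the text


async def handle_with_deadline(work, deadline: float = DEADLINE_SECONDS) -> dict:
    """Run work(), but answer with an error once the deadline passes."""
    try:
        result = await asyncio.wait_for(work(), timeout=deadline)
        return {"status": 200, "body": result}
    except asyncio.TimeoutError:
        # Failing fast keeps the worst case bounded for every caller.
        return {"status": 504, "body": "deadline exceeded"}


async def fast_lookup() -> str:
    await asyncio.sleep(0.01)  # stands in for a quick downstream call
    return "ok"


async def slow_lookup() -> str:
    await asyncio.sleep(1.0)  # stands in for a dependency that has stalled
    return "too late"


if __name__ == "__main__":
    print(asyncio.run(handle_with_deadline(fast_lookup)))
    print(asyncio.run(handle_with_deadline(slow_lookup)))
```

The same idea applies one level down: each service involved in a request gets its own slice of the envelope as a timeout, and the slices have to add up to less than the total.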
Non-functional requirements are hard because they are all about how your system handles failure. Describing the golden path is easy, as is building it out. Thinking about all the ways things could go wrong means thinking about the ways in which you could make a mistake. Be aware that the mistakes are not limited to bugs in your code. You also have to think about the ways that your system's third-party dependencies could fail you. It is hard to not be defensive. It's human to worry that admitting a potential mistake makes you look bad. But it's better to figure these things out before the failures happen, because failures will always happen.
4. Design for Testability
So how do you know if your system meets your requirements, both functional and non-functional? You test it. When you're under time constraints, and we are always under time constraints, the temptation is to do as little testing as possible. The thing is that testing always happens. The only question is whether it happens in a QA environment or in production, done by your customers. Users are great at finding all the bugs that you never looked for. All that time you thought you saved by not testing or only testing the golden path was an illusion. If you thought you were time constrained before, imagine how much less time you have to fix a problem when it's live.
Even if you do want to avoid having your customers find your bugs first, you can't spend forever testing your system before releasing it. One reason why people don't test before production is that it can be hard to configure all of the things your system depends on. So you need to figure out how to design a system that's easy to test. It comes back to thinking about modularity and how your software systems fit together.
Many people recommend using Test-Driven Development, or TDD, to ensure that code is testable. The idea behind TDD is that you should write your tests first, then write stub code that fails the test, and finally write the implementation that makes your test pass. While this might be ideal, it is still uncommon for teams to follow this strictly.
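As a small illustration of that sequence, here is what the end state might look like for a hypothetical parse_port function, using Python's built-in unittest module. The function and its rules are invented for this example.

```python
import unittest


# Step 1: the tests are written first and describe the behavior we want.
class ParsePortTest(unittest.TestCase):
    def test_accepts_valid_port(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_rejects_out_of_range_port(self):
        with self.assertRaises(ValueError):
            parse_port("70000")

    def test_rejects_non_numeric_input(self):
        with self.assertRaises(ValueError):
            parse_port("http")


# Step 2 was a stub that only raised NotImplementedError, so every test failed.
# Step 3: this implementation replaces the stub and makes the tests pass.
def parse_port(text: str) -> int:
    port = int(text)  # int() raises ValueError for non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


if __name__ == "__main__":
    unittest.main()
```

Note that two of the three tests cover bad input. Writing the tests first makes it harder to forget the cases that are not on the golden path.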
Whether or not you use TDD, you need to write testable code. So, how do you do it? Here are some tips:
- Don’t hard code configuration information. Store it in configuration files or environment variables.
- Separate the code that loads configuration information from your business logic, so you can test logic independent of the environment where it will eventually run.
- Avoid global state, because it’s harder (but not impossible) to put in different values to test failures.
- Just as you should make sure that configuration information is injected into your application when it starts, you should use dependency injection to connect your application’s components. This allows you to wire in mocks and stubs for testing.
- Most importantly, make sure your tests cover things going wrong. What happens if a service doesn’t respond? What happens if a user provides incorrect input? What happens if the load gets higher than expected? All of these things will happen in production. It’s up to you to make sure that you know what will happen before then.
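The tips above can be sketched together in a few lines of Python. Everything here is hypothetical and for illustration only: the Settings fields, the PRICE_CURRENCY and RATE_TIMEOUT_SECONDS variable names, the PriceConverter class, and the rate lookup functions. The point is the shape: configuration is loaded in one place, the dependency is passed in, and one of the tests covers the dependency failing.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Settings:
    currency: str
    timeout_seconds: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Reads configuration from the environment; no business logic here."""
    return Settings(
        currency=env.get("PRICE_CURRENCY", "USD"),
        timeout_seconds=float(env.get("RATE_TIMEOUT_SECONDS", "0.5")),
    )


class RateUnavailable(Exception):
    """Raised when the rate service cannot answer in time."""


class PriceConverter:
    """Business logic; settings and the rate lookup are both injected."""

    def __init__(
        self, settings: Settings, fetch_rate: Callable[[str, float], float]
    ) -> None:
        self._settings = settings
        self._fetch_rate = fetch_rate

    def convert(self, amount: float) -> float:
        try:
            rate = self._fetch_rate(
                self._settings.currency, self._settings.timeout_seconds
            )
        except TimeoutError as exc:
            # A known, tested outcome for "the service doesn't respond".
            raise RateUnavailable(self._settings.currency) from exc
        return round(amount * rate, 2)


# Test doubles: one healthy dependency and one that never answers.
def stub_rate(currency: str, timeout_seconds: float) -> float:
    return 2.0


def unresponsive_rate(currency: str, timeout_seconds: float) -> float:
    raise TimeoutError("rate service did not respond")


def test_converts_with_injected_rate() -> None:
    settings = load_settings({"PRICE_CURRENCY": "EUR"})
    assert PriceConverter(settings, stub_rate).convert(10.0) == 20.0


def test_reports_unavailable_rate_service() -> None:
    converter = PriceConverter(load_settings({}), unresponsive_rate)
    try:
        converter.convert(10.0)
    except RateUnavailable:
        return
    raise AssertionError("expected RateUnavailable")
```

Because load_settings accepts any mapping, the tests pass in a plain dictionary and never touch the real environment. Because the rate lookup is a constructor argument, simulating an outage takes one small function instead of a broken network.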
If you aren’t sure what kinds of failures can happen and your system involves multiple processes or services, look at the fallacies of distributed computing. These principles were first described at Sun Microsystems in the early 90s. The fallacies are things that people take for granted when designing systems that come back to bite them in production. The first three in particular are ripe for testing:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
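Testing against the first fallacy does not require a real unreliable network. A test double that fails on purpose is often enough. The sketch below is one possible approach, with hypothetical names throughout: UnreliableNetwork fails a fixed number of calls before delegating to the real function, and send_with_retry is the code under test.

```python
class UnreliableNetwork:
    """Test double: fails the first `failures` calls, then delegates to send."""

    def __init__(self, send, failures: int) -> None:
        self._send = send
        self._failures_left = failures
        self.calls = 0

    def __call__(self, payload: str) -> str:
        self.calls += 1
        if self._failures_left > 0:
            self._failures_left -= 1
            raise ConnectionError("simulated network failure")
        return self._send(payload)


def send_with_retry(send, payload: str, attempts: int = 3) -> str:
    """Code under test: retry a bounded number of times, then give up."""
    last_error = None
    for _ in range(attempts):
        try:
            return send(payload)
        except ConnectionError as exc:
            last_error = exc
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error


def echo(payload: str) -> str:
    return payload.upper()


def test_recovers_from_transient_failures() -> None:
    network = UnreliableNetwork(echo, failures=2)
    assert send_with_retry(network, "ping") == "PING"
    assert network.calls == 3


def test_gives_up_when_network_stays_down() -> None:
    network = UnreliableNetwork(echo, failures=3)
    try:
        send_with_retry(network, "ping")
    except ConnectionError:
        assert network.calls == 3
        return
    raise AssertionError("expected ConnectionError")
```

A similar wrapper that sleeps before delegating, or that truncates large payloads, would exercise the latency and bandwidth assumptions in the same way.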
5. Nothing is Forever
Finally, let's talk about change. The blessing and curse of software is that it's easy to change. I love using bridges for analogies, because they are so different from software. Changing physical infrastructure is slow, painful, and expensive. Once a bridge is built, no one comes back six months later and tells the builder they have a week to hook it up to a different road 10 miles down the river. However, every software engineer can tell a story about adapting a working system or library to work with an unexpected component at the last minute. The only software systems that aren't modified are the ones that no one uses.
You need to design your systems for change, or there are going to be weird failures when, not if, you are asked to change them. Luckily, all of the things we've talked about so far enable us to make changes. If you've used well-understood technologies, properly isolated third-party dependencies, considered non-functional requirements, and have tests to validate that you've got everything right, this becomes vastly easier. Don't compromise these principles when the inevitable new features come along. It's easy to make one-off exceptions for changes, but you should figure out how to best make the new components and code fit into what's already there.
Another reason to design for change is that you might need to change back. Here's something else that all of us will experience at some point: releasing an update that breaks production. Once this happens, you need to redeploy the old system while you figure out what went wrong. If backwards compatibility is broken, you're going to have additional downtime. In order to ensure backwards compatibility, support old APIs alongside new ones. If you have a database, make sure that any database changes are backwards-compatible by never redefining or removing columns or tables.
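Here is a minimal sketch of what a backwards-compatible database change looks like, using Python's built-in sqlite3 module and an in-memory database. The users table and the two insert functions are hypothetical stand-ins for the old and new releases of an application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def old_release_insert(db: sqlite3.Connection, name: str) -> None:
    # The previous release only knows about the name column.
    db.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_release_insert(db: sqlite3.Connection, name: str, email: str) -> None:
    db.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))


# Additive migration: a new nullable column, nothing renamed or removed.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

new_release_insert(conn, "Ada", "ada@example.com")
# After a rollback, the old code still runs against the migrated schema.
old_release_insert(conn, "Grace")

rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
print(rows)  # [('Ada', 'ada@example.com'), ('Grace', None)]
```

Had the migration renamed or dropped the name column instead, the old release's insert would fail the moment it was redeployed, and the rollback would turn into a second outage.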
There's one other change that you need to design for: the eventual replacement of your system. Software is such a new field that we are still figuring out how to do things correctly. Don't take it as an insult when someone replaces your code. If you create a system that lasts five years, that's impressive. One day, every bit of code that you write will be turned off and part of your job in designing your system is to make that as seamless as possible. Hopefully, the person replacing your system is following Chesterton's Fence and understands your system and what it does so that the replacement is both justified and an actual improvement. If not, send them a link to this article.