A 5-Point Framework For Python Performance Management

Performance testing -- like sailboat racing -- depends on the conditions along the racecourse

I used to live on a boat. A question that comes up is "How fast does it go?" It's a pleasant enough question with a tricky, evasive-sounding answer. It's a sailboat, meaning the speed varies -- to an extent -- with the wind. It also varies with the sails being used, weather and sea state, and the relative angle between boat and wind. So answering "it depends" always felt awkward.

It's not that I wanted to be evasive, but measuring performance is nuanced. There's no simple "it goes this fast" for boats. Nor is there one when writing software in Python. When we add libraries like numpy and Dask into the mix, what are we really measuring? What should we be measuring?

What's perhaps more central here is that "speed" may not even be the right question to ask. The use cases for a big, live-aboard sailboat are focused on where it can go, not how fast it gets there. And even the "Where can it go?" question turns out to be more complex than it seems at first. Failing to ask the right questions can leave you high and dry. Literally. There are hard lessons to be learned when talking about the performance of sailboats. Or Python applications.

I want to suggest that Python's performance, like a boat's, is intimately tied to configuration choices -- i.e., sails -- and specific use cases -- i.e., wind direction. Further, questions about volume of data and scalability are also important, and sometimes overlooked. We should also ask about simplicity and maintainability. Python is a package deal: a low barrier to entry, an easy-to-use language, and a rich library of add-on packages. Narrowing the conversation to performance alone can miss the additional benefits of a language like Python.

In this post I want to outline a five-step process for managing performance testing. I think it's helpful for Python applications, where performance is often brought up, but it applies widely. We'll start the process with a definition of some use cases, then formalize them in Gherkin, review them with stakeholders, implement them in Python, and finally, make the whole thing an ongoing part of our deployment process. Why use the Gherkin language? I think a language for describing tests applies to many test varieties, including integration, system, acceptance, and performance testing. I'd like to start by looking at the goals of performance testing, and why it's sometimes hard to ask appropriate questions. From there, we'll be able to move into a method for measuring Python application performance.

Performance Testing Goals

We often start our Python performance journey from a safe harbor. We have built an application and now we want to be sure it will perform well at scale. This should lead to the pretty obvious question of "What scale are we talking about?"

The question of "scale" is a bit vague. It seems to be more important than simply trying to measure speed. This leads to several approaches to scalability testing. Here are two extreme positions:

  1. Run the application or service or platform at increasing workloads until performance is "bad." I'm going to call this exploration.
  2. Set levels for a workload and targets for performance to see if the application meets the performance requirement under the given workload. I want to call this testing to distinguish it from the less constrained exploration.

I've worked with folks who didn't like reducing performance or scalability to a few pass/fail test cases. They preferred the data-gathering and engineering aspects of a more exploratory approach. I don't want to label exploration as wrong, but I do want to distinguish open-ended exploration from a test case with a defined finish line.

Three sailboats floating on blue water. Photo by me.

Exploration is why poking around in a kayak or stand-up paddleboard is fun. You can find lots of cool, secluded places, and do some serious birding. But you don't sail a 42' sailboat that's your home into anything without a clear, deep, wide, and marked channel. These criteria are essential, and often established by government agencies like the US Coast Guard. Test cases include "Is the channel on a published nautical chart?" and "Are the marks in good repair?" and "Is the charted depth adequate?" and "Is there anything in the way?" Any failure for these test cases, and we have to turn the boat around and head for deep, safe water.

For a big sailboat, the question isn't "How fast?" It's more like "How far?" It's a different performance question, related to destinations, more than speed. As with deploying application software, we have to be sure we're asking the right questions.

When we shift our focus a bit, it's clear that "speed" is one aspect of the more inclusive concept of scale. We have to consider workload, transaction mix, network overheads and latencies, and the ordinary chaos of a web-based architecture.

When getting started on a new project, it's hard to be sure everything is accounted for. I want to provide a framework that allows for -- and encourages -- active growth and learning. We want our test cases to reflect our best understanding of the problem, the solution, and the user experience.

Python Performance Testing

I want to suggest a five-step process for doing Python performance testing.

  1. Define the use cases and the performance targets. This creates a kind of "box" that our application has to live in. In some cases, the box metaphor is really apt because there's a rate vs. volume tradeoff, making them inverses of each other.
  2. Codify the performance targets into Gherkin. These are often relatively simple-looking scenario outlines. In some cases, there may be multiple performance-related features. In other cases, performance is part of an overall acceptance test suite, and there's only one feature.
  3. Review with product owners to be sure they understand how the Gherkin describes the user experience. This involves an ongoing effort to prioritize and change what we know and what we'll define as acceptable. And yes, this is a lot like moving the finish line during a race. It's common in sailboat racing to adjust the course to reflect evolving weather conditions.
  4. Now comes the programming part. The behave tool (https://behave.readthedocs.io/en/latest/index.html) relies on Python functions to implement the text of the Gherkin test scenarios. With the Gherkin step definitions and an environment module in place, you can run the behave tool to produce your performance results.
  5. Make the performance testing part of the CICD pipeline. I think continuous integration requires continuous performance testing. Otherwise, how can you be sure the application is acceptable before deploying it?

We'll take up the steps individually.

Step 1 - Define the Use Cases

For software performance, it's imperative to define concrete thresholds for volume of data, number of users, transaction mixes, and other specific attributes. As with a sailing race, there needs to be a starting line, a course, and a finish line. Some courses involve a long or complex layout requiring a lot of maneuvering; other courses may have a shorter, simpler layout but require the racers to do more laps.

Our software application's testing thresholds might start as a single, simplistic, lofty goal like hoping for 20 milliseconds for most transactions. As we consider the actual work the application does, we may adjust this target to allow response times under peak workloads to slip to 500 milliseconds. The more we identify potential complications in the scenario details -- things like database access, network delays, and cache collisions -- the more nuanced and complex our time thresholds and workloads become.

It can be difficult to assign specific thresholds. How slow is too slow? While it's nice for some enterprise-oriented analytical work to be in the decision-maker's in-box by 9 AM each day, slipping by as much as an hour might not cause too many problems. How late is too late?

These questions can be difficult to answer, making it hard to establish specific measurements and thresholds. Setting thresholds can be so hard that some folks prefer to give up on performance testing and switch over to "exploration" mode, where they poke at the software until they feel like it might be called "fast enough" or "too slow."

I prefer to define a finish line. And then move it around.

It seems unfair, but I think defining a concrete test case fits better with automated CICD. Even though the finish line needs to be moved, having a target is not the same as having no finish line at all. Without a target, engineering effort can be wasted on things that were already "fast enough."

We might wind up with a table describing a workload metric, n, and a test case result, s, to be sure things really did work. We might add more to this table, like workload names, or perhaps timing objectives. For the purposes of this blog post, we'll start with a single time threshold for all of these workload levels.

|         n |         s |
|        89 |        44 |
|    10_000 |     3_382 |
|   100_000 |    60_696 |
| 1_000_000 | 1_089_154 |

Your application may have more complex workload factors. You may skip the "did it work?" test results. You may have different times for each case. But the idea is to define a box into which the application must fit. Once the box is defined, the shape and size may change as lessons are learned.

Step 2 - Codify in Gherkin

Every activity has its own language. Sailing is no different from software development or testing. "Helm's a-lee" doesn't mean much to non-sailors. For sailors, however, this is a signal that the boat's about to turn: the direction of heel will change, the boom(s) will move, there will be a noisy luffing of sails, and people will start hauling sheets through winches. There are a lot of technical details, all in the language of sailors and boats.

Gherkin is a great language for describing the test case setup and the expected results. The essential unit of description is a Scenario, which has three kinds of steps.

  • Given steps describe setup and pre-conditions. The language used (other than the first word, "Given") should be the language that's appropriate to the application and the problem domain. It should reflect the user experience we're trying to gauge. In a later section, I'll show you how the language used here is mapped to Python code. The mapping from natural language to implementation is flexible, so the important rule is "be consistent in your terminology and usage."
  • A When step (there's often only one) describes running the operation or transaction or set of transactions. Again, the natural language element is mapped to a step definition written in Python. Be precise, but also try to stick to terminology focused on your users.
  • Then steps describe expected results.

Here’s what a scenario might look like in practice:

Scenario: Better Fibonacci Generator
  Given An upper limit of 89
  And   The Better Fibo Generator
  When  Sum is computed
  Then  Answer is 44
  And   Performance is under 500ms

The scenario's Given steps describe two pre-conditions: an upper limit and a specific algorithm to exercise. The When step is a vague-sounding "Sum is computed"; this relies on context to provide the details. The final Then steps describe two expected outcomes: one is a correct answer, and the other is a performance threshold.

Scenarios have titles to provide needed context information. Scenarios are also grouped into Features, which lets us provide additional context. The idea is to keep the scenario steps small and flexible.

Pragmatically, performance testing should involve a number of scenarios at different scales. Gherkin lets us write a Scenario Outline where we can use placeholders to expand the outline into concrete scenarios. This saves us from copying and pasting superficially similar scenarios.

Here's an example of a Feature with one lonely Scenario Outline. This fills in placeholders in the scenario from the Examples table. Note that the threshold for the tests is a blanket 500 ms (we’ll come back to this).

Feature: Project Euler Problem #2 Command Line App

  Scenario Outline: Better Fibonacci Generator
    Given An upper limit of <n>
    And   The Better Fibo Generator
    When  Sum is computed
    Then  Answer is <s>
    And   Performance is under 500ms

    Examples:
    |         n |         s |
    |        89 |        44 |
    |    10_000 |     3_382 |
    |   100_000 |    60_696 |
    | 1_000_000 | 1_089_154 |

You can see where a Given step takes a value of <n> from the table, and a Then step takes the corresponding value of <s> from the table of examples. The natural language text (like "Sum is computed") will be handled by step definitions written later; the essential rule is to be consistent.

(I've omitted punctuation. Some folks prefer to end clauses with commas or semicolons.)

The performance threshold, 500 ms, is simply written into a Then step, and applies to each of the scenarios. Is this appropriate? If we need to shift the finish line, we may need to expand the table to include the threshold as another column. We might use two separate scenario outlines for slow, complex transactions and fast, simple transactions. We might want to have separate feature files for different parts of a complex application, with different kinds of scenarios.
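For example -- with purely illustrative numbers -- a per-scenario threshold could become another column in the Examples table:

  Scenario Outline: Better Fibonacci Generator
    Given An upper limit of <n>
    And   The Better Fibo Generator
    When  Sum is computed
    Then  Answer is <s>
    And   Performance is under <ms>ms

    Examples:
    |         n |         s |  ms |
    |        89 |        44 | 100 |
    | 1_000_000 | 1_089_154 | 500 |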

What's essential is the step of formalizing performance, of creating a specific workload and a specific performance target. This makes it sensible to run an automated test suite and look at the failures to see what more we can learn. Perhaps there's a bug to fix. Or perhaps, the finish line was simply too far away from the starting line and didn't match the experience of our users.

This takes us from defining the test scenarios to reviewing them with product owners and other stakeholders.

Step 3 - Review with Stakeholders

I think product owners need to be comfortable reading and commenting on Gherkin. They should be able to write it as well. This can help them refine and rewrite the language for the tests. This kind of thinking is often helpful because it captures deeper insights into how the application solves the user's problem.

(If we change the Gherkin, we may also have to change the step definitions in Python. We'll look at the Gherkin mappings in the next section.)

The interaction with product owners and users via the concrete Gherkin language can be helpful for formalizing vague notions and clarifying nebulous expectations. It also helps to identify and organize special cases, exceptions, and movable finish lines. Often these nuances are important, and tracking them in feature files makes them visible and concrete.

A sailboat, no matter what it looks like, rarely has any ropes; a rope is something used on land. Each piece of cordage you see around the boat is generally some kind of line, most of them are sheets or halyards. Yes, sailors have specialized terminology, and that's because the job each line does is dramatically different. Releasing a sheet is kind of annoying because the sail makes a lot of noise and the boat slows down. Release a halyard and you let a fifty-foot-tall pile of dacron fall all over everything. Language matters, and it's important to be sure everyone agrees on the meanings of the words we're using.

Gherkin lets us use features as one way to organize scenarios. Most tools, like behave, execute tests from directories of Gherkin-language feature files, allowing sophisticated hierarchies of features and scenarios. The behave tool also lets us supply tags for features and individual scenarios, giving us even more control over which tests must pass, and which tests are still part of exploration, with finish lines that may need to be moved.

The behave tool even lets us define multiple environments. All this flexibility gives us a lot of ways to describe (and test) our application software. I think this flexibility is a bridge between formalized pass/fail testing and less structured exploration. We can have features that aren't tested, but are used to explore performance.

I'm a big fan of using tags of the form @rel_x as shorthand for "Release x" to set priorities around features and scenarios. All the @rel_2 features need to be ready for the next release. The @rel_3 feature tests may not all pass right now. The @wip tag is helpful for marking tests that cannot pass yet because the work is in progress.
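As a small illustration of tag-driven control, here's a sketch of an environment.py hook that skips scenarios tagged @perf unless they're explicitly requested. The @perf tag and the RUN_PERF environment variable are my own conventions for this sketch, not anything built into behave.

import os

def before_scenario(context, scenario):
    # Skip long-running performance scenarios unless explicitly requested.
    # The @perf tag and RUN_PERF variable are assumed conventions.
    if "perf" in scenario.tags and not os.environ.get("RUN_PERF"):
        scenario.skip("set RUN_PERF=1 to run performance scenarios")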

Step 4 - Build Step Definitions and Environment

Once we have some Gherkin, we need to implement it. The behave tool splits the implementation details into two parts:

  • Step definition modules
  • An environment module

I like to have a directory tree that looks like this:

the_app
 +-- benches
 |   +-- features
 |   |   +-- perf1.feature
 |   +-- steps
 |   |   +-- definitions.py
 |   +-- environment.py
 +-- src
 |   +-- my_app.py
 +-- tests
 |   +-- test_my_app.py
 +-- pyproject.toml

I've piled the performance tests in benches, as in "benchmarks." The tests directory has unit tests. The name "tests" seems very inclusive, so it can be a little misleading. I prefer keeping the unit tests and the performance tests separate from each other because the unit tests need to be run every time there's a change, whereas the performance benchmarks might be deferred until release time.

The idea of a common configuration applies to a sailboat's layout as well. On most boats, the main halyard emerges from the main mast, runs through turning blocks and fairleads before arriving at a rope clutch and a winch in the cockpit. On many boats the lines are visible on deck making the layout easier to understand. We want our project directory to be similarly simple and common.

Benchmark tests seem to belong with integration and acceptance tests. Because stakeholders are involved, the audience for these tests is different from the audience for unit tests. They're subject to a certain amount of finish-line movement as user knowledge and product experience grow. This, too, suggests they should be kept separate from unit tests.

The features files are Gherkin, as shown above. The environment.py file can be empty. The behave tool needs to have a file with this name, but we don't need to put anything in it. The steps directory, similarly, can start out empty. The behave tool can provide templates for the step definitions we need to write.

We can use behave to generate a template for the step definitions. Each Gherkin step that doesn't have a matching definition in a module in the steps directory will be reported as undefined. The behave tool will also suggest what the step definition should look like.

benches % PYTHONPATH=../src behave

Feature: Project Euler Problem #2 Command Line App # features/perf1.feature:1

  Scenario Outline: Better Fibonacci Generator -- @1.1   # features/perf1.feature:12
    Given An upper limit of 89                           # None
    And The Better Fibo Generator                        # None
    When Sum is computed                                 # None
    Then Answer is 44                                    # None
    And Performance is under 500ms                       # None

This will show all the scenarios. At the end of the scenarios, there's this summary:

0 features passed, 1 failed, 0 skipped
0 scenarios passed, 10 failed, 0 skipped
0 steps passed, 0 failed, 0 skipped, 50 undefined
Took 0m0.000s

You can implement step definitions for undefined steps with these snippets:

@given(u'An upper limit of 89')
def step_impl(context):
    raise NotImplementedError(u'STEP: Given An upper limit of 89')

The last batch of lines in the output from behave is a report on the needed step definitions. The report details each Gherkin step that isn't yet defined; they're provided as functions that can be pasted into a module in the steps directory. The first time you run behave, there may be a large number of these little blocks; copy and paste them into a module to get started.

When we look at the function place-holders, there will be @given, @when, and @then decorators in the step definitions. In the original Gherkin, we used And to inherit the step type from the previous step. This creates a slightly more natural flow: "Given this And that And the other". We could use only Given in the Gherkin, but it reads as stilted when we do. The Gherkin "Given this And that" leads to @given(u'this') and @given(u'that') step definition decorators. The idea is to have a relatively transparent mapping from the test cases to the implementation in Python.

(The u'' string uses a prefix to state unambiguously that the string is Unicode. This is the default, so it seems superfluous.)

It's important to note that behave isn't very clever about handling quotes. If your Gherkin text for a step uses quotes, it's best to use " to avoid a tiny bit of complexity from dealing with '. Specifically, the first time you use a generated step with a ' in it, you may see syntax errors in the generated code. It's not hard to fix the generated step definitions, and it only has to be done once, since they're never going to be generated again.
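For instance, imagine a step written as "Given the user's limit is 89" -- a hypothetical step, not one from the feature file above. The generated snippet wraps the step text in single quotes, which breaks on the apostrophe; re-quoting the string is the one-time fix:

from behave import given

# behave's generated snippet would look like this, which is a SyntaxError:
#     @given(u'the user's limit is 89')
# Switching to double quotes fixes it:
@given(u"the user's limit is 89")
def step_impl(context):
    raise NotImplementedError(u"STEP: Given the user's limit is 89")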

Also, the function names are all step_impl. The names don't matter and aren't really used; the decorator is what registers each function with behave so it can be matched to the step text.

Once you have all the steps defined, you can rerun behave and see that all your steps are now failing because they're not implemented. This is a big step forward. Now we're ready for the next leg of the journey, which is to more fully define how we do the performance testing and gather benchmark information.

Step 5 - Build Testing Details

The behave tool works by examining each step in a scenario and locating a step definition that matches the text of the step. When we have a Scenario Outline, each unique substitution from the Examples table leads to a unique copy of the step from the outline.

In our example, this processing approach means there will be four individual Given variants, including @given(u'An upper limit of 89'). The behave tool gives us several alternative matchers that can be used to extract meaningful values from the step text, avoiding the need to have four nearly identical steps.

The default matcher lets us change the decorator and function to look like this:

@given(u'An upper limit of {limit}')
def step_impl(context, limit):
    context.cli_options.extend(["--limit", str(int(limit))])

The {limit} placeholder extracts that part of the step's text. The extracted text is provided to the implementation function, along with the testing context. We do an int() conversion to make sure it's an integer, then convert it back to a string because that's what's required later. We save the value in our testing application's cli_options attribute inside the test context object.

Because the context argument value is an ordinary Python object, we can add attributes to it all we want. We do have to avoid those used by behave.

We can use a before_scenario() function in the environment.py module to create an empty context.cli_options list. This isn't part of the step definitions; it's separate, and so far it's the only code we need in the environment.py module.

def before_scenario(context, scenario):
    context.cli_options = []

When defining the steps, there are a few common patterns to follow. Generally, our Given steps will save setup information in the context. As the setup becomes more complex, the associated before_scenario() function in the environment module may also become more complex.
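For example, the scenario's other Given step might record which algorithm to exercise. Here's a sketch; the --generator command-line option is my assumption about the application under test, not something behave defines.

from behave import given

@given(u'The {generator} Generator')
def step_impl(context, generator):
    # Pass the requested algorithm along as another command-line option.
    # "--generator" is an assumed option of the application under test.
    context.cli_options.extend(["--generator", generator])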

Our When step will use the setup information to run the application or service or framework we're testing. In our case, it might look like this:

import contextlib
import io
import time

from behave import when
from my_app import main

@when(u'Sum is computed')
def step_impl(context):
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        start = time.perf_counter()
        main(context.cli_options)
        end = time.perf_counter()
    context.output_text = buffer.getvalue().splitlines()
    context.run_time = end - start

Since the application we're testing has a main() function, we can leverage that. There are a lot of alternatives, depending on what kind of code is being tested and how you're testing it. Here are a few alternatives:

  1. Use the subprocess module to run the application as if it were being run from the command line (there's a sketch of this approach below). This can be essential for non-Python applications.
  2. Use the subprocess module to run Docker commands to start a Docker container running the application. The performance measurements then reflect the containerized application, possibly running on another host.
  3. Use a tool like Terraform to build cloud resources, deploy, and start the application. This kind of cloud-native testing can take time to start and stop, so a large suite of performance tests may take a long time to run. This may lead to creating a subset of scenarios for certain critical features.
  4. If we're using AWS, we might use the boto3 library to build cloud resources for our tests.

This isn't all. These are just a few of the many techniques available. What's important is that we're going to leverage the power of Python to run the application. Python offers a number of libraries, and we may wind up doing some fairly clever stuff.
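Here's a minimal sketch of the first alternative: timing the application as a separate process. The step text is deliberately different from the in-process "Sum is computed" step so the two definitions don't collide, and running the app with "python -m my_app" is my assumption about how it's packaged.

import subprocess
import sys
import time

from behave import when

@when(u'Sum is computed in a separate process')
def step_impl(context):
    # Run the application the way a user would, and time the whole process,
    # including interpreter start-up.
    command = [sys.executable, "-m", "my_app", *context.cli_options]
    start = time.perf_counter()
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    end = time.perf_counter()
    context.output_text = completed.stdout.splitlines()
    context.run_time = end - start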

A good practice is to capture the output in the When step, stuffing it into the context. The Then steps can then examine these results to see whether the test passes or fails.
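To close the loop, here's a sketch of the two Then steps. The assumption that the application prints its answer as the last line of output is mine; adjust the check to match what your application actually writes.

from behave import then

@then(u'Answer is {expected}')
def step_impl(context, expected):
    # Assumes the application prints the answer as its last line of output.
    assert int(context.output_text[-1]) == int(expected)

@then(u'Performance is under {threshold}ms')
def step_impl(context, threshold):
    # context.run_time was captured in seconds by the When step.
    assert context.run_time * 1000.0 < int(threshold)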

Having a test suite is only part of the solution to performance testing. Using the test suite is also essential; we'll turn to that next.

Make Performance Testing Central

A performance test, like unit tests and acceptance tests, is easy to set aside. A product owner might try to argue that functionality matters and tests can come later after there's a working product. This argument seems to suggest a profound failure to understand the "working" part of "working product."

If there are no automated tests, it's difficult to say the product "works."

Automated unit tests, performance tests, and acceptance tests are the minimum required before offering someone software (or a service, or a platform). This is similar to sailing a 23,000 pound boat into someplace new. There are essential test cases that must be passed, including "Is the channel on a published nautical chart?" and "Are the marks in good repair?" and "Is the charted depth adequate?" and "Is there anything in the way?"

On my boat, I've failed to have adequate testing in some pretty alarming ways. One of the worst was sailing in the inlet at Ocean City, Maryland. The entrance is wide open, and it was a beautiful day. BUT. We had not really thought through all of our test cases for navigability.

"Can you match the marks with the chart?" was something that wasn't one of our test cases at the time. We blundered in, a great hippopotamus of a boat, and failed to honor a buoy that marked a division in the channel. A buoy that guarded a patch of shallow water. A buoy that guarded a patch of water getting shallower as the tide ran out.

Some passing fishermen helped pull us off the shoal. It worked out well for us. No damage to the boat, and important lessons learned on test cases to apply next time we were confronting unknown waters.

I think it's important to phrase everything as a test, and to use tools -- like Gherkin and behave -- to formalize those tests. We want to avoid "it depends" answers; we WANT concrete, measured results. We want automated tests that provide consistent, unassailable results.

Since we didn't take the time to fully match the buoys and marks to the chart, we were sailing into unknown waters. We can avoid this with a five-step process: define the acceptable outcomes, formalize the scenarios in a language like Gherkin, review the scenarios with people that answer for the user's experience, implement automation with tools like behave, and integrate the automation into the development process.

We need to avoid misleading questions like "How fast is it?" and focus on more useful questions related to the user experience. We shouldn't tolerate subjective guesses as answers to questions like "Will this scale to our projected workload?" We can't accept a subjective answer to "Do you think the boat can make it?" For us, the answer turned out to be "no." A few minutes of additional testing would have saved an hour of being pulled off the shoal.


Steven F. Lott, Programmer. Writer. Whitby 42 Sailor.
