20 Mar 2024
Ahmet Ildirim
Senior Backend Engineer at Talon.One
In our series of tech blog posts, we’re hearing from our R&D team members on the decisions made and lessons learned from the past eight years of building the Talon.One product.
In this article, Ahmet Ildirim, Senior Backend Engineer at Talon.One, shares practical unit testing principles and patterns that help identify what makes a valuable test.
At times, I find myself spending an excessive amount of time fixing tests after making minor changes in the code. This has led me to question the fragility of tests and whether there’s a method to analyze a test to determine if there is something wrong with it.
So, I started searching for resources that specifically address this topic. To my surprise, most articles and books focus on the fundamentals of testing and do not go beyond that. But at last, in the search results, I came across a passage from the book Unit Testing Principles, Practices, and Patterns by Vladimir Khorikov:
“This book takes you to that next level. It teaches a scientific, precise definition of the ideal unit test. That definition provides a universal frame of reference, which will help you look at many of your tests in a new light and see which contribute to the project and which must be refactored or removed.”
This was exactly what I was looking for: an objective method to tell whether a test is useful. In this post, I will share what I learned from the book, specifically a method for determining the value of a test.
Unit Testing Principles, Practices, and Patterns by Vladimir Khorikov
Almost all the ideas and concepts I will share are from the book; of course, I can only cover the small portion I found most valuable. If these ideas pique your interest by the end of the read, I encourage you to check out the book for a full understanding.
The examples in the book are in C#, and some chapters likely cover things you are already familiar with. Still, it is a great read, because it teaches you how to reason about the value of tests.
Khorikov proposes that a good test has four foundational attributes/metrics that we can use to analyze any automated test, whether unit, integration, or end-to-end:
Protection against regressions
Resistance to refactoring
Fast feedback
Maintainability
We assign a score between 0 and 1 to each attribute of a test. These four scores, when multiplied together, determine the value of the test. If a test gets zero on one of the attributes, its value turns to zero as well:

Value estimate = [0..1] * [0..1] * [0..1] * [0..1]
Of course, it is impossible to measure these scores precisely. But we can still use the attributes to articulate the bad feeling we get when we look at a test. It is a bit of educated guess and instinct!
func TestValue() {
	// Protection against regressions: VERY BAD
	// Resistance to refactoring: GOOD
	// Fast feedback: GOOD
	// Maintainability: GOOD
	...
}
The value of the test above would be close to 0, even though it scores well on three of the four metrics. Scoring very badly on any single metric is enough to fail.
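To make the arithmetic concrete, here is a toy calculation of my own (the scores are made up, not taken from the book), showing how one very weak metric sinks the whole estimate:

package main

import "fmt"

func main() {
	// Made-up scores: very bad protection against regressions, good everything else.
	protection, refactoring, feedback, maintainability := 0.1, 0.9, 0.9, 0.9
	value := protection * refactoring * feedback * maintainability
	fmt.Printf("value estimate: %.3f\n", value) // prints 0.073 despite three good scores
}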
So, what do we do when a test can’t pass this evaluation? We either remove or refactor it.
Let’s now concentrate on each attribute and understand their meanings.
The first attribute of a good unit test, protection against regressions, is the ability of the test to reveal the existence of a bug.
Regression is a type of software bug that occurs when a previously functional feature stops working as intended after the introduction of new code. One of the main purposes of writing unit tests is to detect regression bugs in the first place.
The larger the codebase, the greater the likelihood of encountering a regression bug. That is why having good protection against regression is a must, not up for debate. Without this protection, you will be overwhelmed by a growing number of software errors.
To evaluate how well a test scores on the metric of protecting against regressions, you need to take into account the following:
The amount of code that is executed during the test
The complexity of that code
The code’s domain significance
Generally, the larger the amount of code that gets executed, the higher the chance that the test will reveal a regression.
Complexity and domain significance also matter. A test that validates the complex logic of a significant domain will score high, while a test of trivial code that is unlikely to fail will score low. Tests that cover trivial code don't have much of a chance of finding a regression error, because there's not a lot of room for a mistake.
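As a rough illustration, here is a sketch of my own (the Coupon getter and the TierDiscount function are hypothetical, not from the article or the book) contrasting a test of trivial code with a test of domain logic:

package coupons

import "testing"

type Coupon struct {
	code string
}

func (c Coupon) Code() string { return c.code }

// A trivial getter leaves almost no room for a mistake, so this test is
// unlikely to ever reveal a regression and scores low on this metric.
func TestCoupon_Code(t *testing.T) {
	c := Coupon{code: "SAVE10"}
	if c.Code() != "SAVE10" {
		t.Errorf("expected SAVE10, got %s", c.Code())
	}
}

// Tiered discount logic has domain significance and some complexity
// (amounts are in cents), so a test exercising it has a real chance of
// catching a regression.
func TierDiscount(totalCents int) int {
	switch {
	case totalCents >= 50000:
		return totalCents * 80 / 100
	case totalCents >= 10000:
		return totalCents * 90 / 100
	default:
		return totalCents
	}
}

func TestTierDiscount(t *testing.T) {
	cases := map[int]int{5000: 5000, 10000: 9000, 50000: 40000}
	for in, want := range cases {
		if got := TierDiscount(in); got != want {
			t.Errorf("TierDiscount(%d) = %d, want %d", in, got, want)
		}
	}
}

The second test executes more code with real domain weight, so it protects against regressions far better than the first.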
The second attribute of a good unit test is resistance to refactoring—the degree to which a test can sustain a refactoring of the underlying application code without failing.
Imagine you develop a new feature and the tests pass. Then you refactor the corresponding code to improve it. It looks great now, except the tests are failing. You look closely to see what is broken, but the code works perfectly. The test is written in a way that even the smallest change to the codebase breaks it. This situation is called a false positive.
To evaluate how well a test scores on the metric of resisting refactoring, you need to look at how likely a test is to generate false positives. The fewer, the better.
What causes false positives
To reduce false positives, a test must be decoupled from the implementation details of the underlying code. You need to make sure that the test verifies only the result of the system: its observable behavior, not the steps to get there. Tests should view the application code from the end user’s perspective and verify only the outcome meaningful to that end user.
Let’s check two examples that look very different but are surprisingly similar.
Example 1: absurdly fragile
This is a simple function in Go that calculates factorial:
// factorial.go
package factorial

func CalculateFactorial(n int) int {
	if n == 0 {
		return 1
	}
	return n * CalculateFactorial(n-1)
}
Here is the test for this function:
// factorial_test.go
func TestCalculateFactorial(t *testing.T) {
	// Read the code file
	file := readFile("factorial.go")
	assert.Equal(t, file, `package factorial

func CalculateFactorial(n int) int {
	if n == 0 {
		return 1
	}
	return n * CalculateFactorial(n-1)
}`)
}
This absurd test will fail after even the simplest change to the code; it is the worst kind of false-positive-generating test. I know no one would write a test like this. But bear with me for a second example that doesn't look as silly yet is, in practice, not that different.
Example 2: Reasonable looking but still useless
The renderer code below is simply responsible for printing the message header, body, and footer.
package main

import (
	"fmt"
	"strings"
)

type Message struct {
	Header string
	Body   string
	Footer string
}

type SubRenderer interface {
	Render(message Message) string
}

type MessageRenderer struct {
	SubRenderers []SubRenderer
}

func NewMessageRenderer() *MessageRenderer {
	subRenderers := []SubRenderer{
		NewHeaderRenderer(),
		NewBodyRenderer(),
		NewFooterRenderer(),
	}
	return &MessageRenderer{
		SubRenderers: subRenderers,
	}
}

func (mr *MessageRenderer) Render(message Message) string {
	var result strings.Builder
	for _, renderer := range mr.SubRenderers {
		result.WriteString(renderer.Render(message))
	}
	return result.String()
}

type HeaderRenderer struct{}

func NewHeaderRenderer() *HeaderRenderer {
	return &HeaderRenderer{}
}

func (hr *HeaderRenderer) Render(message Message) string {
	return fmt.Sprintf("<h1>%s</h1>", message.Header)
}

type BodyRenderer struct{}

func NewBodyRenderer() *BodyRenderer {
	return &BodyRenderer{}
}

func (br *BodyRenderer) Render(message Message) string {
	return fmt.Sprintf("<b>%s</b>", message.Body)
}

type FooterRenderer struct{}

func NewFooterRenderer() *FooterRenderer {
	return &FooterRenderer{}
}

func (fr *FooterRenderer) Render(message Message) string {
	return fmt.Sprintf("<i>%s</i>", message.Footer)
}
Here is the test that verifies the order of sub-renderers:
func TestMessageRenderer_UsesCorrectSubRenderers(t *testing.T) {
	mr := NewMessageRenderer()
	renderers := mr.SubRenderers

	if len(renderers) != 3 {
		t.Errorf("Expected 3 sub renderers, but got %d", len(renderers))
	}
	if _, ok := renderers[0].(*HeaderRenderer); !ok {
		t.Errorf("Expected renderers[0] to be of type *HeaderRenderer")
	}
	if _, ok := renderers[1].(*BodyRenderer); !ok {
		t.Errorf("Expected renderers[1] to be of type *BodyRenderer")
	}
	if _, ok := renderers[2].(*FooterRenderer); !ok {
		t.Errorf("Expected renderers[2] to be of type *FooterRenderer")
	}
}
This test looks fine at first, but it verifies the implementation details of the underlying code, which makes it very susceptible to false positives. Try changing the position of the sub-renderers in the constructor: the code will work, but the test will fail. This test is useless because it does not test the observable behavior of the system under test.
An ideal test for this code would look like this:
func TestMessageRenderer_RendersCorrectly(t *testing.T) {
	mr := NewMessageRenderer()
	message := Message{
		Header: "Hello",
		Body:   "World",
		Footer: "Goodbye",
	}

	expected := "<h1>Hello</h1><b>World</b><i>Goodbye</i>"
	actual := mr.Render(message)

	assert.Equal(t, expected, actual)
}
This test verifies only the output of the system under test, and it scores well on the metric of resistance to refactoring.
The third attribute of a good unit test is fast feedback, an essential property of a unit test. The faster the tests, the more of them we can have in the test suite and the more often we can run them, which in turn speeds up the development cycle.
When tests are designed to run quickly, the feedback loop is shortened, and the tests can detect bugs immediately after the code is broken. The sooner the bugs are detected, the easier and cheaper it is to fix them. In other words, fast-running tests can significantly reduce the cost of fixing bugs.
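One common Go idiom for keeping the quick feedback loop quick is to let slow tests opt out when the suite runs with go test -short. This is a sketch of my own, not something from the article:

package reports

import (
	"testing"
	"time"
)

func TestSlowReportGeneration(t *testing.T) {
	// Slow tests step aside in short mode, so the everyday run stays fast.
	if testing.Short() {
		t.Skip("skipping slow test in short mode")
	}
	time.Sleep(2 * time.Second) // stand-in for genuinely slow work
	// ...assertions on the generated report would go here...
}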
Finally, the fourth pillar of a good unit test, the maintainability metric, evaluates maintenance costs. This metric consists of two major components:
How hard it is to understand the test - This is related to the size of a test and how readable it is. Generally, tests that are longer tend to be more difficult to read and maintain.
How hard it is to run the test - If a test requires out-of-process dependencies like a database or another service, it will require extra effort to keep them running and ready.
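To illustrate the second point, here is a sketch of my own (the TEST_DATABASE_DSN variable and the repository test are hypothetical) contrasting a test that needs out-of-process infrastructure with a purely in-process one:

package orders

import (
	"os"
	"testing"
)

// This test can only run when a real database is reachable, so someone has to
// keep that database provisioned, migrated, and running: ongoing maintenance cost.
func TestOrderRepository_Save(t *testing.T) {
	dsn := os.Getenv("TEST_DATABASE_DSN")
	if dsn == "" {
		t.Skip("requires a running database; set TEST_DATABASE_DSN to enable")
	}
	_ = dsn // connect to the database and exercise the repository here
}

// The in-process counterpart needs no setup and no infrastructure, which keeps
// its maintenance cost close to zero.
func TestSumPrices(t *testing.T) {
	prices := []int{10, 5}
	total := 0
	for _, p := range prices {
		total += p
	}
	if total != 15 {
		t.Errorf("expected 15, got %d", total)
	}
}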
In an ideal world, we would like to maximize all four metrics. Unfortunately, these attributes are interconnected. This means that to maximize one, we need to sacrifice another.
The first three attributes are mutually exclusive. Just like the CAP theorem, we can’t maximize all of them at the same time.
Imagine you want the best possible protection against regressions. To achieve that, you will most likely couple the test to the implementation details of the code. This means you get a very low score on the metric of resistance to refactoring.
So, instead of maximizing a single metric, we aim to find a healthy balance between them. The first two attributes are essential, which means we want a decent score for both of them. If we have to compromise on something, it should be between fast feedback and maintainability.
Let’s see how the decision to make a sacrifice will impact various testing concepts.
You are probably familiar with the test pyramid principle. It suggests that an optimal test suite should consist mostly of unit tests, followed by integration tests, with the fewest number of end-to-end tests. Why is that? Let's try to explain it using the four attributes of a valuable test.
Unit tests are the most balanced type of test: they are fast and maintainable, and if implemented correctly, they provide both protection against regressions and resistance to refactoring. This means we get the most value out of them in a test suite, which is why we can have relatively more of them.
Integration tests are not as fast or maintainable as unit tests, because you need to maintain their dependencies. They provide good protection against regressions and resistance to refactoring. We need them, but not as many as unit tests.
End-to-end tests are slow and hard to maintain. They provide the best resistance to refactoring but come at a large cost. They can be necessary to ensure the functionality of the system's critical parts, a necessary evil, but we need to keep their number to a minimum.
When we test a part of the system, we use three styles of testing:
Output based testing
State based testing
Communication based testing
Let’s analyze these styles using the four attributes of a good test.
In output-based testing, tests verify the output the system generates. This style of testing assumes there are no side effects. It produces tests of the highest quality. Such tests rarely couple to implementation details and thus are resistant to refactoring. They are also small, so more maintainable.
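A minimal sketch of the style, my own example rather than one from the book (the Discount function is hypothetical):

package pricing

import "testing"

// Discount returns the discounted total in cents. It is a pure function: the
// output depends only on the input, with no side effects and no hidden state.
func Discount(totalCents int) int {
	if totalCents >= 10000 {
		return totalCents * 90 / 100
	}
	return totalCents
}

// Output-based: feed an input, assert only on the value that comes back.
func TestDiscount(t *testing.T) {
	if got := Discount(20000); got != 18000 {
		t.Errorf("expected 18000, got %d", got)
	}
}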
In state-based testing, tests verify the final state of the system after an operation is complete. Extra caution is required to avoid exposing private state just to enable unit testing, and state-based tests tend to be larger and less maintainable than output-based tests.
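A minimal sketch of state-based testing, again my own (the Cart type is hypothetical):

package cart

import "testing"

type Cart struct {
	items []string
}

func (c *Cart) Add(item string) {
	c.items = append(c.items, item)
}

func (c *Cart) Items() []string {
	return c.items
}

// State-based: perform an operation, then verify the resulting state of the
// system rather than a returned value.
func TestCart_Add(t *testing.T) {
	c := &Cart{}
	c.Add("shirt")

	items := c.Items()
	if len(items) != 1 || items[0] != "shirt" {
		t.Errorf("expected cart to contain only shirt, got %v", items)
	}
}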
In communication-based testing, tests substitute the system's dependencies with mocks and verify that the system calls those collaborators correctly. Extra care is needed to prevent fragility: verify only communications that cross the application boundary and have side effects visible to the external world. Communication-based tests are less maintainable than output-based and state-based tests, because the mocks they require make them harder to read.
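A minimal sketch of the style, my own example (the Greeter, the EmailSender interface, and the hand-rolled mock are hypothetical):

package greeter

import "testing"

// EmailSender is the out-of-process collaborator the system talks to.
type EmailSender interface {
	Send(to, body string) error
}

// Greeter welcomes new users by sending them an email.
type Greeter struct {
	sender EmailSender
}

func (g *Greeter) GreetUser(email string) error {
	return g.sender.Send(email, "Welcome!")
}

// senderMock records the calls it receives so the test can verify them.
type senderMock struct {
	calls []string
}

func (m *senderMock) Send(to, body string) error {
	m.calls = append(m.calls, to)
	return nil
}

// Communication-based: substitute the collaborator with a mock and verify
// that the system called it correctly.
func TestGreeter_GreetUser(t *testing.T) {
	mock := &senderMock{}
	g := &Greeter{sender: mock}

	if err := g.GreetUser("user@example.com"); err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if len(mock.calls) != 1 || mock.calls[0] != "user@example.com" {
		t.Errorf("expected one Send call to user@example.com, got %v", mock.calls)
	}
}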
In his book "Unit Testing Principles, Practices, and Patterns", Khorikov emphasizes that tests are first-class citizens of a project, but they aren't only assets; they are also a liability. So, we should aim to get the maximum value from our tests at the minimum maintenance cost.
Using the four pillars of a good test, we can measure the value of any test objectively. The more valuable our tests are, the more of them we can keep in the suite and the more often we can run them.
Looking to learn more about our R&D practices at Talon.One? Get an inside look at how the R&D department is structured, plus how we build & scale our teams, in our blog post.
Plus, we're hiring! Check out our open positions here.