
The Properties of QuickCheck, Hedgehog and Hypothesis

If you're writing software, chances are that you're also writing tests for that software in some form or another (if you're not, you really should be). As far as testing frameworks go, there are quite a lot to choose from. They are often language specific, with various setup, tear-down and assertion steps, but most boil down to "run my code with some specific input and see that it works".

This is at face value a good approach, but limiting your testing to a handful of known cases can severely limit your ability to detect failures on unexpected input. This is where "fuzz testing" comes in; in a nutshell, fuzz testing is about exercising your code with random input and checking whether the code still works as expected. There is a downside to this, however: due to the random nature, failure cases can be extremely difficult to trace & debug, since the generated inputs are often not well-structured. This is where a technique called "Property Based Testing" (abbreviated as "PBT" in the following) comes in: by generating input that conforms to a given set of properties, the random input can be "tamed" a bit, e.g. by ensuring that a given test case only ever receives random even numbers. While this makes the input better behaved, in the sense that failures due to known-bad input become less likely, the generated test examples can still be very complicated to human eyes.

This problem of "too complicated test examples" was first tackled by QuickCheck in 1999, with more recent work in the form of Hedgehog and Hypothesis refining the concept through different approaches. Various others have tackled this problem in the literature as well, but in the interest of brevity (and because I'm not familiar with them), I'm going to focus on the three most well-known libraries: QuickCheck, Hedgehog and Hypothesis. This article is not intended to give an introduction to the libraries themselves, or to how to use them to write tests - there are numerous resources online for that purpose already. Rather, I'll try to build some intuition for how each of these testing frameworks works conceptually and what their trade-offs are, in the hope of making the topic more accessible to others.

If you're just here for a quick "what are the differences" kind of thing, there is a TL;DR at the bottom. How much that will help you if you're not already familiar with these frameworks is left as an exercise for the reader :)

Naming

There are a number of (sometimes conflicting) definitions of "property testing". For the purposes of this article, property based testing refers not only to testing that some predicate holds, but also to generating the input in some manner smarter than plain random generation, e.g. by taking previous failures into account to guide the generation of new values. I'd classify simple random generation of input values as pure fuzzing.

The general flow

The general flow of property based testing frameworks is:

while should_generate
    # generate some object
    obj = generate_obj()

    # test the predicate
    holds = predicate(obj)

    # if it holds, just generate a new one
    if holds
        continue
    end

    # if it doesn't hold, shrink, and test the shrunk object until we're done
    while should_shrink
        shrunk_obj = next_shrink(obj)
        holds = predicate(shrunk_obj)
        if holds
            continue
        else
            # in order to find shrinks of shrunk objects
            obj = shrunk_obj
        end
    end
end

We start out with an initial example generated by some generate_obj function and test the predicate. If it holds, we restart; if it doesn't, we repeatedly shrink & test the shrunk objects.

Shrinking? (Added on 2024-03-14)

"Shrinking" may sound a bit odd, but in the context of PBT it refers to transforming one object into a "smaller" one. "Smaller" is pretty arbitrary here - it can refer to literally a smaller footprint in memory, it can refer to fewer set bits, or any other metric you might want to come up with. The only important thing is that shrinking an object ends up at a fixed point that can't shrink anymore. There may be multiple such fixed points, in which case it's also important that any one given object always deterministically shrinks to the same fixed point (for a given seed used for the pseudo random generator), lest the shrinking process is no longer reproducible.

Both should_generate and should_shrink are some form of cancellation of the entire process. This can be a timeout, an upper limit for how many samples should be taken, a combination of the two, or something else entirely. Suffice it to say, their purpose is to not get stuck in an infinite loop. If the user code in predicate loops endlessly, this will of course need to be cancelled through other means, but that is outside the purposes of this article.

Not production ready

The above pseudocode is only intended to illustrate the general flow of generation, testing & shrinking - no library I'm aware of actually uses a loop like that, in part because next_shrink is really ambiguous. Please don't use this as a template for how your own library should work.

QuickCheck and type based generation

QuickCheck is the "grandfather" of property based testing frameworks. Originally developed for the Haskell programming language, it was (to my knowledge) the first property based testing framework in widespread use that featured shrinking of examples (the act of minimizing a failing example to a smaller one), at least in the Haskell ecosystem. At its core, QuickCheck works with types, both for generating values and for shrinking them. To understand how this is done, we'll first have to take a (very small) excursion into type theory, to learn what a "type" even is for the purposes of this article (don't worry, this won't go very deep).

Generally speaking, a type is a (potentially infinitely large) collection of values. For example, the Int32 type in Julia (corresponding to int32_t in C) represents all possible 32-bit integers from the interval [-(2^31), 2^31 - 1], encoded in two's complement. Since that interval is finite, so is the type. As another example, the String type in Julia represents all possible UTF-8 strings, i.e. sequences of Unicode code points stored as UTF-8. Because there is a minimum length but no (theoretical) maximum, String is an example of a type whose collection of values is infinite, memory constraints notwithstanding.
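To make the Int32 case concrete, Julia can report those bounds directly:

julia> typemin(Int32), typemax(Int32)
(-2147483648, 2147483647)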

For Int32, generating an example is relatively trivial: just pick a random number from all valid instances. For String, you might start out with the empty string "" and flip a coin to decide whether to generate more characters or to stop and return what you have. If you decide to generate more, you generate a random character, concatenate it onto the end of the result and flip the coin again. This can of course be biased towards longer strings. Another way to generate Strings would be to first generate a length and then generate that many characters.

julia> function gen_string(;bias=0.9)
           res = ""
           while true
               # with probability 1 - bias, stop and return what we have so far
               if rand(Float64) >= bias
                   return res
               end
               # otherwise append one random character and flip the coin again
               res *= rand(Char)
           end
       end
gen_string (generic function with 1 method)

julia> using Random # for randstring

julia> function gen_string_len(;min_len=0, max_len=20)
           # pick a length uniformly at random, then that many (alphanumeric) characters
           len = rand(min_len:max_len)
           randstring(len)
       end
gen_string_len (generic function with 1 method)

Both of these functions return a String instance, but the strings they generate have wildly different properties. For instance, the first function can return arbitrary Unicode characters, while the second (as written) defaults to only returning alphanumeric characters. Additionally, the strings returned by gen_string can in principle be arbitrarily long, while the ones from gen_string_len are not only bounded by max_len, but also have their length sampled uniformly from [min_len, max_len] ([0, 20] by default). gen_string behaves very differently: longer strings are increasingly unlikely (the length follows a geometric distribution; with the default bias of 0.9, the expected length is 9 characters), so short strings dominate.

This is a problem - which of these is better as the default for String? Or might an entirely different generator be better? In some cases, the alphanumeric nature of gen_string_len might be more appropriate than the arbitrary unicodeness of gen_string. In other cases, the potential for infinite length of gen_string could be more desirable than the limited nature of gen_string_len.

Further, because both generating and shrinking are driven by the type alone - all the shrinker knows is "this is a String" - strings from both generators shrink in exactly the same way. For String, we can shrink the length by dropping any character, or we can shrink any individual character. The guarantee that strings produced by gen_string_len are only ever alphanumeric is lost, and shrinking one of its strings might produce something like "\u0\u4" (a string containing a NUL character, followed by an End Of Transmission character) - which could be catastrophic if the result is passed to a C function expecting an alphanumeric string of length 2.
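To make that concrete, here's a rough sketch of what a purely type-driven shrinker for String could look like in Julia. shrink_string is a hypothetical helper, not QuickCheck's actual implementation, and for simplicity it shrinks characters straight to '\0':

# this shrinker only knows "the input is a String" - it has no idea how the
# string was generated, so any invariant of the generator is silently lost
function shrink_string(s::String)
    cs = collect(s)
    shrinks = String[]
    # shrink the length: drop any single character
    for i in eachindex(cs)
        push!(shrinks, String(deleteat!(copy(cs), i)))
    end
    # shrink the contents: replace any character with the "smallest" one, '\0' -
    # fine for an arbitrary String, catastrophic for "alphanumeric only"
    for i in eachindex(cs)
        cs[i] == '\0' && continue
        smaller = copy(cs)
        smaller[i] = '\0'
        push!(shrinks, String(smaller))
    end
    return shrinks
end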

You can solve this dilemma in QuickCheck by not actually returning a String, but a type that encodes these additional properties while inheriting all behaviour of String through the type classes prevalent in Haskell. This way, these generators can be distinguished, and their shrinking behaviour can be customized. The downside is that for each of these types, both a generator and a shrinker need to be written, and care has to be taken that the shrinker actually preserves the properties guaranteed by the generator.
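Translated loosely to Julia (AlphaNumString, gen_alphanum and shrink_alphanum are made-up names, purely to illustrate the idea), the duplication looks roughly like this:

# a wrapper type encoding "this String is alphanumeric" in the type itself
struct AlphaNumString
    s::String
end

const ALPHANUM = ['a':'z'; 'A':'Z'; '0':'9']

# the generator has to know about the restricted alphabet...
gen_alphanum(len) = AlphaNumString(String(rand(ALPHANUM, len)))

# ...and so does the hand-written shrinker, which must be kept in sync with it:
# here it only ever drops characters, so it can never leave the alphabet
function shrink_alphanum(a::AlphaNumString)
    cs = collect(a.s)
    return [AlphaNumString(String(deleteat!(copy(cs), i))) for i in eachindex(cs)]
end

Unlike a Haskell newtype with derived instances, this wrapper doesn't automatically behave like a String anywhere else, which already hints at the next problem.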

Languages that don't have type level support for type classes also can't easily use this approach. Strongly typed languages won't accept anything other than String in their function signatures (if the function is limited to that type), so wrapping the output of either gen_string or gen_string_len in a new type isn't an option there. This means a different approach is needed - one that preserves the implicit properties of the generator function, while returning objects of the same type, ideally without having to duplicate the generation logic into shrinking.

Hedgehog and integrated shrinking

In 2017, the Haskell library Hedgehog set out to solve this dilemma through an approach called "integrated shrinking". The core idea is that instead of generating an object, testing the property you want to test, and then (separately) shrinking the object to a smaller instance, you generate all possible shrinks ahead of time. Since this is generally not feasible (e.g. for the arbitrarily long strings above), it is done lazily. The shrinking behaviour is thus integrated into the generating behaviour, (almost) eliminating code duplication between the generator and the shrinker, while simultaneously preserving all implicit properties that were present during generation. In practice, this is done not by naively combining functions that directly generate random instances of whatever you want to generate, but by combining objects that represent those generators and the shrunk values they produce.

In the end, an Int32Generator represents a generator for a tree of values: the root is the initially generated value, and the children of each node are its shrunk values.
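As a rough mental model (a toy sketch, not PropCheck.jl's or Hedgehog's actual implementation), such a lazy tree could look like this:

# a lazily evaluated shrink tree: the root is the generated value, the children
# (only computed on demand) are the shrink trees of its shrunk values
struct ShrinkTree{T}
    root::T
    children::Function   # () -> Vector{ShrinkTree{T}}
end

# build a shrink tree for an integer that shrinks towards 0 by halving
int_tree(n::Int) = ShrinkTree(n, () -> n == 0 ? ShrinkTree{Int}[] : [int_tree(n ÷ 2)])

t = int_tree(433258103)
t.root                    # 433258103
first(t.children()).root  # 216629051 - the first (and here, only) shrink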

For example, instead of having a type EvenInt32 for generating & shrinking even 32-bit integers (which themselves are of type EvenInt32!), you can simply take an existing iseven predicate and combine that with an Int32Generator to produce a new generator that also generates Int32:

julia> using PropCheck # my Hedgehog-like PBT library for Julia

julia> Int32Generator = itype(Int32);

# just generating a value gives us the entire (lazy) shrink tree
julia> generate(Int32Generator)
Tree(433258103)

# The tree contains Int32
julia> generate(Int32Generator) |> eltype
Int32

# generate some examples & look at their roots
julia> examples = [ root(generate(Int32Generator)) for _ in 1:1_000 ];

julia> count(isodd, examples)
509

julia> count(iseven, examples)
491

# now do the same for a filtered generator
julia> even_Int32_generator = filter(iseven, Int32Generator);

# the filtered tree still contains Int32
julia> generate(even_Int32_generator) |> eltype
Int32

# all examples are even
julia> examples = [ root(generate(even_Int32_generator)) for _ in 1:1_000 ];

julia> count(isodd, examples)
0

julia> count(iseven, examples)
1000

Generating a value from this filtered generator then ensures the predicate holds not only for the initial value, but also for any shrunk values. This eliminates the need for a specialized shrinking function for even integers: you can simply reuse the shrinking function for Int32 and apply the filtering step before producing a value, trying the next available shrunk value if the predicate passed to filter doesn't hold. Moreover, even_Int32_generator still generates proper Int32 objects that all just "happen to be" even.

# look at the first 1000 subtrees
julia> examples = map(root, Iterators.take(PropCheck.subtrees(generate(even_Int32_generator)), 1_000));

julia> count(isodd, examples)
0

julia> count(iseven, examples)
1000
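Conceptually, filter doesn't need to know anything about integers or evenness - it only has to prune the shrink tree. Here's a minimal sketch over a toy, eagerly built tree (again, not PropCheck.jl's actual lazy implementation):

# a toy eager shrink tree
struct EagerTree{T}
    root::T
    children::Vector{EagerTree{T}}
end

# keep only values satisfying `pred`: a subtree whose root fails the predicate
# is skipped, but its own (filtered) children are spliced in as replacements,
# so useful shrinks further down aren't lost
function filter_children(pred, trees::Vector{EagerTree{T}}) where T
    kept = EagerTree{T}[]
    for t in trees
        if pred(t.root)
            push!(kept, EagerTree{T}(t.root, filter_children(pred, t.children)))
        else
            append!(kept, filter_children(pred, t.children))
        end
    end
    return kept
end

# if even the root fails the predicate, the generator simply retries with a new value
filter_tree(pred, t::EagerTree) =
    pred(t.root) ? EagerTree(t.root, filter_children(pred, t.children)) : nothing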

This makes the approach suitable for strongly typed languages without type classes! If you want to play around with this in Julia, check out PropCheck.jl, my Hedgehog-inspired implementation of this approach. Note that I'm working on a replacement for PropCheck.jl, so while it's good to get your feet wet with property based testing, better things are on the horizon ;)

Downsides of combinations

This approach is not all wonderful though. While Hedgehog enables developers to combine generators without the need for additional wrapper types, the approach taken exposes a flaw that also shows up for more complicated objects in QuickCheck: When you want to generate an object that depends on more than one generated parent object (say, the length of a string and the possible characters for each position in that string), how should the resulting object shrink? How can Hedgehog generically combine the implicit shrinking trees of two such generators?

To illustrate, take the String example from earlier, where we first generated a length and then the individual elements for each index in the string. If we want to shrink the resulting string, we can either shrink the length or any given character in the string. Naively picking one or the other first means that we can't later go back and shrink something else without sacrificing the progress we've already made in the shrinking process - after all, the entire shrink tree is determined at the point of generation, not when shrinking the value. If we first shrink the length, we have to continue shrinking the length. If we first shrink the generator that produces elements, we have to continue shrinking the elements and can't start shrinking the length. This can lead to bad shrinking results if we get stuck in a local minimum of the shrinking process. With objects combined from more and more parts, this quickly becomes infeasible in terms of shrink quality.

Instead, in order to shrink properly, Hedgehog must look into the generators, take out their generation & shrinking functions, combine them and return an entirely new generator with a new (combined) shrinking function, creating an entirely different shrink tree. This works, but is very complicated to implement. I suspect that this is the main reason Hedgehog is seldom used outside of Haskell. There are a number of ports made by the community, but they seem to be less sophisticated/well maintained than the Haskell original. The link to the Scala package from the main Hedgehog website seems to be dead now.

It should be noted that the same problem also occurs when combining generators & shrinkers in QuickCheck. In some sense the problem is worse there, because when shrinkers are combined, the combined shrinker has no knowledge of the properties that were implicit in the individual generators. Everything needs a bespoke solution.

If you want to read more about this problem, I can recommend this article by Edsko de Vries, which was instrumental in my understanding of how Hedgehog works. The article goes into much more depth and very likely explains the subtleties involved with this approach better than this short description ever could. Be aware that it's dense with Haskell language though.

Hypothesis and choice sequences

A few years before Hedgehog was released, there was another contender for property based testing. Unlike Hedgehog & QuickCheck, which came from the functional programming community, Hypothesis is a Python project - and perhaps because of that, it takes an entirely different approach compared to the mathematically derived inner workings of those Haskell libraries.

Unlike Hedgehog & QuickCheck, Hypothesis internally leans into the fact that it is a testing framework, and places great emphasis on explicitly managing the state associated with a single test execution. This is taken advantage of to enable a feature that neither Hedgehog nor QuickCheck has: dynamically manipulating value generation and the entire shrinking state, even while inside of a test. In particular, this means creating new values that depend on previously generated ones, and having their influence on the outcome of the test case also influence the shrinking behaviour of the original values.

For QuickCheck it is fairly easy to see why it can't do this: shrinking & generation are separated completely, and the way a given test value was generated is not exposed inside of the test at all. For Hedgehog, this is a bit more subtle - it's possible to get access to the original shrink tree, but merging that with an entirely new shrink tree is highly nontrivial and might require reaching across different call stacks. The resulting API would also be incredibly unwieldy, since a user would need to be intimately familiar with how Hedgehog works internally in order to correctly combine these shrink trees by hand. Of course, this is again somewhat easier in Haskell because of its laziness, which gives more convenient access to the original shrink tree than most other statically typed languages offer.

Hypothesis solves this issue elegantly, by decoupling the choices taken during generation from the actual generation of values entirely - it tracks the entire decision history explicitly as a choice sequence, which is also passed into each test case. This allows simply taking that choice sequence, appending arbitrary new choices and using those to generate new values, which in turn correctly influence all other choices taken.

To do this, Hypothesis has users compose so-called "strategies". These strategies describe how to generate objects of a given type, with a given distribution, etc. One such strategy is data(), which gives access to all other strategies while the test case itself is running. Here's an example, taken from the Hypothesis documentation:

@given(data())
def test_draw_sequentially(data):
    x = data.draw(integers())
    y = data.draw(integers(min_value=x))
    assert x < y

Here, data gives access to any possible drawing strategy; each draw effectively says "add another choice, made with the given strategy, to the choice sequence". The assert at the end serves as the property that Hypothesis tries to falsify.

Ultimately, what shrinks is not the objects themselves, but the choice sequence leading to those objects! Replaying a modified choice sequence (and thus constructing entirely new instances of the required objects) leads to shrunken values being passed into - and produced inside of - the test case. An extensive description of this approach can be found in "Test-Case Reduction via Test-Case Generation: Insights From the Hypothesis Reducer" by MacIver (the original author of Hypothesis) & Donaldson - an absolutely fantastic and easy to follow paper.
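To build some intuition for what this looks like mechanically, here's a toy re-creation of the idea in Julia - emphatically not Hypothesis' actual internals, and all names (ChoiceSource, draw, generate_digits) are made up for this sketch:

# every random decision is recorded; shrinking later replays a modified record
struct ChoiceSource
    choices::Vector{Int}   # prerecorded choices to replay (empty = fresh run)
    used::Vector{Int}      # every choice that was actually made
end

# replay the next recorded choice if one is left, otherwise make a fresh one
function draw(cs::ChoiceSource, range)
    idx = length(cs.used) + 1
    c = idx <= length(cs.choices) ? clamp(cs.choices[idx], first(range), last(range)) : rand(range)
    push!(cs.used, c)
    return c
end

# the "test case" body: draw a length, then that many digits
function generate_digits(cs::ChoiceSource)
    len = draw(cs, 0:10)
    return [draw(cs, 0:9) for _ in 1:len]
end

# generate once, recording all choices along the way
cs = ChoiceSource(Int[], Int[])
example = generate_digits(cs)

# "shrink" by replaying a modified choice sequence: lowering the first choice
# (the length) automatically re-derives all dependent choices as well
shrunk_choices = copy(cs.used)
shrunk_choices[1] = max(0, shrunk_choices[1] - 1)
smaller = generate_digits(ChoiceSource(shrunk_choices, Int[]))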

This kind of internal test-case reduction, as the authors call it, has other advantages as well: because the choice sequence is completely decoupled from actually creating any objects, a number of different ways of judging how well an example does open up. For instance, Hypothesis lets the test report an arbitrary score during execution, which it will then try to maximize. This can be extremely useful for guiding Hypothesis towards examples that are more likely to be good counterexamples, in cases where failing inputs are complex enough that randomly generated examples don't actually fail the test. In a sense, this combats the "curse of dimensionality" often seen in machine learning, since "find the minimal counterexample reproducing a failure" really is an optimization problem in disguise. It's not difficult to imagine that the choice sequence could even be used as input to an LLM, asking it to produce similar sequences that also lead to failures.

One downside, of course, is that choice sequences are coupled to the test case that generated them, so very little of what is learned transfers between test cases. Neither of these optimization approaches is per se possible with the fixed tree structure of QuickCheck and Hedgehog, since there is no way to transmit this additional information to the testing framework, let alone have it act on that information to steer shrinking. The state is implicit, after all, not explicit.

However, it should in principle be possible to modify the expected signature of these frameworks to return a more complicated structure, including such a targeting score alongside the result of the predicate, to inform decisions about which branches are more interesting for future expansion. Whether that information can be integrated into the respective shrinking processes is another matter though - I suspect it would be easier to do in QuickCheck than in Hedgehog, since in Hedgehog the shrink trees (and their order!) are already predetermined before the test starts.

Conclusion

Phew, that was a lot! Let's summarize a bit:

Overall, Hypothesis ends up being the most powerful & most extensible of the three. The historical relation between QuickCheck and Hedgehog, and how the latter tries to address problems of the former, is evident. At the same time, the differences in approach make it clear that Hypothesis has little, if any, connection to the design concerns & features that the Haskell projects take advantage of. This should not be surprising, considering just how different the approaches to problem solving in Haskell and Python truly are.

Since this article has gotten quite long, I'll end it here. I'm definitely going to explore choice-sequence-based shrinking more in the future though! Concretely, I think this approach pairs very well with the powerful compile-time reflection capabilities we have in Julia... :) Much better than QuickCheck- or Hedgehog-style shrinking, at least.