An introduction to property-based testing
Or, why you should be using FsCheck and QuickCheck
Last updated
Was this helpful?
Or, why you should be using FsCheck and QuickCheck
Last updated
Was this helpful?
This post is part of the project. Check out all the other great posts there! And special thanks to Sergey Tihon for organizing this.
UPDATE: I did a talk on property-based testing based on these posts.
Let's start with a discussion that I hope never to have:
But seriously, my imaginary co-worker's complaint has some validity: How many tests are enough?
So now imagine that rather than being a developer, you are a test engineer who is responsible for testing that the "add" function is implemented correctly.
You are practising test-driven-development, enterprise-style, which means that you write a test, and then the EDFH implements code that passes the test.
So you start with a test like this (using vanilla NUnit style):
The EDFH then implements the add
function like this:
And your test passes!
Fair enough. So you write another test:
The EDFH then changes the implementation of the add
function to this:
At this point, you start thinking that the EDFH is being malicious, and that this back-and-forth could go on forever!
So the question is, what kind of test could you write so that a malicious programmer could not create an incorrect implementation, even if they wanted to?
Well, you could start with a much larger list of known results, and mix them up a bit.
But the EDFH is tireless, and will update the implementation to include all of these cases as well.
A much better approach is to generate random numbers and use those for inputs, so that a malicious programmer could not possibly know what to do in advance.
If the test looks like this, then the EDFH will be forced to implement the add
function correctly!
One final improvement -- the EDFH might just get lucky and have picked numbers that work by chance, so let's repeat the random number test a number of times, say 100 times.
So now we're done!
Or are we?
There's just one problem. In order to test the add
function, you're making use of the +
function. In other words, you are using one implementation to test another.
In some cases that is acceptable (see the use of "test oracles" in a following post), but in general, it's a bad idea to have your tests duplicate the code that you are testing! It's a waste of time and effort, and now you have two implementations to build and keep up to date.
So if you can't test by using +
, how can you test?
The answer is to create tests that focus on the properties of the function -- the "requirements". These properties should be things that are true for any correct implementation.
So let's think about what the properties of an add
function are.
One way of getting started is to think about how add
differs from other similar functions.
So for example, what is the difference between add
and subtract
? Well, for subtract
, the order of the parameters makes a difference, while for add
it doesn't.
So there's a good property to start with. It doesn't depend on addition itself, but it does eliminate a whole class of incorrect implementations.
That's a good start, but it doesn't stop the EDFH. The EDFH could still implement add
using x * y
and this test would pass!
So now what about the difference between add
and multiply
? What does addition really mean?
We could start by testing with something like this, which says that x + x
should the same as x * 2
:
But now we are assuming the existence of multiplication! Can we define a property that only depends on add
itself?
One very useful approach is to see what happens when the function is repeated more than once. That is, what if you add
and then add
to the result of that?
That leads to the idea that two add 1
s is the same as one add 2
. Here's the test:
That's great! add
works perfectly with this test, while multiply
doesn't.
However, note that the EDFH could still implement add
using y - x
and this test would pass!
Luckily, we have the "parameter order" test above as well. So the combination of both of these tests should narrow it down so that there is only one correct implementation, surely?
After submitting this test suite we find out the EDFH has written an implementation that passes both these tests. Let's have a look:
Aarrghh! What happened? Where did our approach go wrong?
Well, we forgot to force the implementation to actually use the random numbers we were generating!
So we need to ensure that the implementation does indeed do something with the parameters that are passed into it. We're going to have to check that the result is somehow connected to the input in a specific way.
Is there a trivial property of add
that we know the answer to without reimplementing our own version?
Yes!
What happens when you add zero to a number? You always get the same number back.
So now we have a set of properties that can be used to test any implementation of add
, and that force the EDFH to create a correct implementation:
There's quite a bit of duplicated code in these three tests. Let's do some refactoring.
First, we'll write a function called propertyCheck
that does the work of generating 100 pairs of random ints.
propertyCheck
will also need a parameter for the property itself. This will be a function that takes two ints and returns a bool:
With this in place, we can redefine one of the tests by pulling out the property into a separate function, like this:
We can also do the same thing for the other two properties.
After the refactoring, the complete code looks like this:
We have defined a set of properties that any implementation of add
should satisfy:
The parameter order doesn't matter ("commutativity" property)
Doing add
twice with 1 is the same as doing add
once with 2
Adding zero does nothing ("identity" property)
What's nice about these properties is that they work with all inputs, not just special magic numbers. But more importantly, they show us the core essence of addition.
In fact, you can take this approach to the logical conclusion and actually define addition as anything that has these properties.
You'll note that in our experiment, we missed defining "associativity", but instead created a weaker property (x+1+1 = x+2
). We'll see later that the EDFH can indeed write a malicious implementation that satisfies this property, and that associativity is better.
Alas, it's hard to get properties perfect on the first attempt, but even so, by using the three properties we came up with, we have got a much higher confidence that the implementation is correct, and in fact, we have learned something too -- we have understood the requirements in a deeper way.
A collection of properties like this can be considered a specification.
You might be thinking that only mathematical kinds of functions can be specified this way, but in future posts, we'll see how this approach can be used to test web services and databases too.
You also might be thinking that designing all these properties is a lot of work -- and you'd be right! It is the hardest part. In a follow-up post, I'll present some tips for coming up with properties which might reduce the effort somewhat.
But even with the extra effort involved upfront (the technical term for this activity is called "thinking about the problem", by the way) the overall time saved by having automated tests and unambigous specifications will more than pay for the upfront cost later.
In fact, the arguments that are used to promote the benefits of unit testing can equally well be applied to property-based testing! So if a TDD fan tells you that they don't have the time to come up with property-based tests, then they might not be looking at the big picture.
We have implemented our own property checking system, but there are quite a few problems with it:
It only works with integer functions.
It would be nice if we could use the same approach for functions that had string parameters, or in fact any type of parameter, including ones we defined ourselves.
It only works with two parameter functions (and we had to ignore one of them for the adding1TwiceIsAdding2OnceProperty
and identity
properties).
It would be nice if we could use the same approach for functions with any number of parameters.
When there is a counter-example to the property, we don't know what it is! Not very helpful when the tests fail!
There's no logging of the random numbers that we generated, and there's no way to set the seed, which means that we can't debug and reproduce errors easily.
It's not configurable. For example, we can't easily change the number of loops from 100 to something else.
It would be nice if there was a framework that did all that for us!
So let's look at how FsCheck would do the same thing as our homemade property-testing system.
First, you need to install FsCheck and load the DLL (FsCheck can be a bit finicky -- see the bottom of this page for instructions and troubleshooting).
The top of your script file should look something like this:
Once FsCheck is loaded, you can use Check.Quick
and pass in any "property" function. For now, let's just say that a "property" function is any function (with any parameters) that returns a boolean.
If you check one of the properties interactively, say with Check.Quick commutativeProperty
, you'll see the message:
Let's see what happens when we have a malicious implementation of add
. In the code below, the EDFH implements add
as multiplication!
That implementation will satisfy the commutative property, but what about the adding1TwiceIsAdding2OnceProperty
?
The result from FsCheck is:
That means that using 1
as the input to adding1TwiceIsAdding2OnceProperty
will result in false
, which you can easily see that it does.
By using random testing, we have made it harder for a malicious implementor. They will have to change tactics now!
The EDFH notes that we are still using some magic numbers in the adding1TwiceIsAdding2OnceProperty
-- namely 1 and 2, and decides to create an implementation that exploits this. They'll use a correct implementation for low input values and an incorrect implementation for high input values:
Oh no! If we retest all our properties, they all pass now!
That'll teach us to use magic numbers in our tests!
What's the alternative? Well, let's steal from the mathematicians and create an associative property test.
Aha! Now we get a falsification:
That means that using (8+2)+10
is not the same as 8+(2+10)
.
Note that not only has FsCheck found some inputs that break the property, but it has found a lowest example. It knows that the inputs 8,2,9
pass but going one higher (8,2,10
) fails. That's very nice!
Now that we have used FsCheck for real, let's pause and have a look at how it works.
The first thing that FsCheck does is generate random inputs for you. This is called "generation", and for each type, there is an associated generator.
For example, to generate a list of sample data, you use the generator along with two parameters: the number of elements in the list and a "size". The precise meaning of "size" depends on the type being generated and the context. Examples of things "size" is used for are: the maximum value of an int; the length of a list; the depth of a tree; etc.
Here's some code that generates ints:
In this example, the ints are not generated uniformly, but clustered around zero. You can see this for yourself with a little code:
The result is something like this:
You can see that most of the values are in the center (0 is generated 181 times, 1 is generated 104 times), and the outlying values are rare (10 is generated only 3 times).
You can repeat with larger samples too. This one generates 10000 elements in the range [-30,30]
What's great about the generator logic is that it will automatically generate compound values as well.
For example, here is a generator for a tuple of three ints:
Once you have a generator for a base type, option
and list
generators follow. Here is a generator for int option
s:
And here is a generator for int list
s:
And of course you can generate random strings too!
The best thing is that the generator will work with your own user-defined types too!
Here's one that generates a user-defined record type containing another user-defined type.
There are ways to have more fine-grained control over how your types are generated, but that will have to wait for another post!
Creating minimum counter-examples is one of the cool things about QuickCheck-style testing.
How does it do this?
There are two parts to the process that FsCheck uses:
First it generates a sequence of random inputs, starting small and getting bigger. This is the "generator" phase as described above.
If any inputs cause the property to fail, it starts "shrinking" the first parameter to find a smaller number. The exact process for shrinking varies depending on the type (and you can override it too), but let's say that for numbers, they get smaller in a sensible way.
For example, let's say that you have a silly property isSmallerThan80
:
You have generated random numbers and found that then property fails for 100
, and you want to try a smaller number. Arb.shrink
will generate a sequence of ints, all of which are smaller than 100. Each one of these is tried with the property in turn until the property fails again.
For each element in the list, test the property against it until you find another failure:
The property failed with 88
, so shrink again using that as a starting point:
The property failed with 83
now, so shrink again using that as a starting point:
The property failed with 81
, so shrink again using that as a starting point:
After this point, shrinking on 80 doesn't work -- no smaller value will be found.
In this case then, FsCheck will report that 80
falsifies the property and that 4 shrinks were needed.
Just as with generators, FsCheck will generate shrink sequences for almost any type:
And, as with generators, there are ways to customize how shrinking works if needed.
I mentioned a silly property isSmallerThan80
-- let's see how FsCheck does with it.
Oh dear! FsCheck didn't find a counter-example!
At this point, we can try a few things. First, we can try increasing the number of tests.
We do this by changing the default ("Quick") configuration. There is a field called MaxTest
that we can set. The default is 100, so let's increase it to 1000.
Finally, to use a specific config, you'll need to use Check.One(config,property)
rather than just Check.Quick(property)
.
Oops! FsCheck didn't find a counter-example with 1000 tests either! Let's try once more with 10000 tests:
Ok, so we finally got it to work. But why did it take so many tests?
The answer lies in some other configuration settings: StartSize
and EndSize
.
Remember that the generators start with small numbers and gradually increase them. This is controlled by the StartSize
and EndSize
settings. By default, StartSize
is 1 and EndSize
is 100. So at the end of the test, the "size" parameter to the generator will be 100.
But, as we saw, even if the size is 100, very few numbers are generated at the extremes. In this case it means that numbers greater than 80 are unlikely to be generated.
So let's change the EndSize
to something larger and see what happens!
That's more like it! Only 21 tests needed now rather than 8660 tests!
I mentioned that one of the benefits of FsCheck over a home-grown solution is the logging and reproducibility, so let's have a look at that.
We'll tweak the malicious implementation to have a boundary of 25
. Let's see how FsCheck detects this boundary via logging.
The result is:
Again, FsCheck has found that 25
is the exact boundary point quite quickly. But how did it do it?
First, the simplest way to see what FsCheck is doing is to use "verbose" mode. That is, use Check.Verbose
rather than Check.Quick
:
When do this, you'll see an output like that shown below. I've added all the comments to explain the various elements.
This display takes up a lot of space! Can we make it more compact?
Yes -- you can control how each test and shrink is displayed by writing your own custom functions, and telling FsCheck to use them via its Config
structure.
These functions are generic, and the list of parameters is represented by a list of unknown length (obj list
). But since I know I am testing a three parameter property I can hard-code a three-element list parameter and print them all on one line.
The configuration also has a slot called Replay
which is normally None
, which means that each run will be different.
If you set Replay
to Some seed
, then the test will be replayed exactly the same way. The seed looks like StdGen (someInt,someInt)
and is printed on each run, so if you want to preserve a run all you need to do is paste that seed into the config.
And again, to use a specific config, you'll need to use Check.One(config,property)
rather than just Check.Quick(property)
.
Here's the code with the default tracing functions changed, and the replay seed set explicitly.
The output is now much more compact, and looks like this:
So there you go -- it's quite easy to customize the FsCheck logging if you need to.
Let's look at how the shrinking was done in detail. The last set of inputs (46,-4,50) was false, so shrinking started.
We'll loop through the list [0; 23; 35; 41; 44; 45]
stopping at the first element that causes the property to fail:
The first element that caused a failure was x=35
, as part of the inputs (35, -4, 50)
.
So now we start at 35 and shrink that:
The first element that caused a failure was now x=27
, as part of the inputs (27, -4, 50)
.
So now we start at 27 and keep going:
At this point, x=25
is as low as you can go. None of its shrink sequence caused a failure. So we're finished with the x
parameter!
Now we just repeat this process with the y
parameter
At this point, y=1
is as low as you can go. None of its shrink sequence caused a failure. So we're finished with the y
parameter!
Finally, we repeat this process with the z
parameter
And now we're finished with all the parameters!
The final counter-example after shrinking is (25,1,26)
.
Let's say that we have a new idea for a property to check. We'll create a property called addition is not multiplication
which will help to stop any malicious (or even accidental) mixup in the implementations.
Here's our first attempt:
Bt when we run this test, we get a failure!
Well duh, obviously 0+0
and 0*0
are equal. But how can we tell FsCheck to ignore just those inputs and leave all the other ones alone?
This is done via a "condition" or filter expression that is prepended to the property function using ==>
(an operator defined by FsCheck).
Here's an example:
The new property is additionIsNotMultiplication_withPreCondition
and can be passed to Check.Quick
just like any other property.
Oops! We forgot another case! Let's fix up our precondition again:
And now this works.
This kind of precondition should only be used if you want to filter out a small number of cases.
If most of the inputs will be invalid, then this filtering will be expensive. In this case there is a better way to do it, which will be discussed in a future post.
These properties functions have a different purpose from "normal" functions, so how should we name them?
In the Haskell and Erlang world, properties are given a prop_
prefix by convention. In the .NET world, it is more common to use a suffix like AbcProperty
.
Also, in F# we have namespaces, modules, and attributes (like [<Test>]
) that we can use to organize properties and distinguish them from other functions.
Once you have a set of properties, you can combine them into a group (or even, gasp, a specification!), by adding them as static members of a class type.
You can then do Check.QuickAll
and pass in the name of the class.
For example, here are our three addition properties:
And here's the corresponding static class to be used with Check.QuickAll
:
At the beginning of this post, I was dismissive of tests that used "magic" numbers to test a very small part of the input space.
However, I do think that example-based tests have a role that complements property-based tests.
An example-based test is often easier to understand because it is less abstract, and so provides a good entry point and documentation in conjuction with the properties.
Here's an example:
You can use FsCheck from NUnit and other test frameworks, with an extra plugin (e.g. FsCheck.NUnit
for Nunit).
Rather than marking a test with Test
or Fact
, you use the Property
attribute. And unlike normal tests, these tests can have parameters!
Here's an example of some tests.
As you can see, you can change the configuration for each test (such as Verbose
and EndSize
) via properties of the annotation.
And the QuietOnSuccess
flag is available to make FsCheck compatible with standard test frameworks, which are silent on success and only show messages if something goes wrong.
In this post I've introduced you to the basics of property-based checking.
There's much more to cover though! In future posts I will cover topics such as:
We'll look at more general properties such as inverses (for testing serialization/deserialization), idempotence (for safe handling of multiple updates or duplicate messages),
and also look at test oracles.
How to create your own generators and shrinkers. We've seen that FsCheck can generate random values nicely.
But what about values with constraints such as positive numbers, or valid email addresses, or phone numbers. FsCheck gives you the tools to build your own.
How to do model-based testing, and in particular, how to test for concurrency issues.
I've also introduced the notion of an evil malicious programmer. You might think that such a malicious programmer is unrealistic and over-the-top.
But in many cases you act like an unintentionally malicious programmer. You happily create a implementation that works for some special cases, but doesn't work more generally, not out of evil intent, but out of unawareness and blindness.
Like fish unaware of water, we are often unaware of the assumptions we make. Property-based testing can force us to become aware of them.
Until next time -- happy testing!
The easiest way to make FsCheck available to you is to create an F# project and add the NuGet package "FsCheck.NUnit". This will install both FsCheck and NUnit in the packages
directory.
If you are using a FSX script file for interactive development, you'll need to load the DLLs from the appropriate package location, like this:
Next, test that FsCheck is working correctly by running the following:
If you get no errors, then everything is good.
If you do get errors, it's probably because you are on an older version of Visual Studio. Upgrade to VS2013 or failing that, do the following:
Make sure that your NUnit assemblies are being referenced locally rather than from the GAC.
These steps should ensure that compiled code works.
With F# interactive, it can be trickier. If you are not using VS2013, you might run into errors such as System.InvalidCastException: Unable to cast object of type 'Arrow'
.
The best cure for this is to upgrade to VS2013! Failing that, you can use an older version of FsCheck, such as 0.9.2 (which I have tested successfully with VS2012)
Unfortunately for you, the implementation is being written by a burned-out, always lazy and often malicious programmer, who I will call The Enterprise Developer From Hell, or "EDFH". (The EDFH has a ).
When you complain to the EDFH, they say that they are doing TDD properly, and only .
When you again complain to the EDFH, they point out that this approach is actually a best practice. Apparently it's called .
This is exactly what mathematicians do. If you look up , you'll see that it is defined entirely in terms of commutativity, associativity, identity, and so on.
Historically, unit tests, as well as being functional tests, have been as well. But an approach to specification using properties instead of tests with "magic" data is an alternative which I think is often shorter and less ambiguous.
Of course, not every business requirement can be expressed as properties like this, and we must not neglect the social component of software development. and domain driven design can play a valuable role when working with non-technical customers.
Thankfully there is! The library was originally developed for Haskell by Koen Claessen and John Hughes, and has been ported to many other languages.
The version of QuickCheck used in F# (and C# too) is the excellent library created by Kurt Schelfthout. Although based on the Haskell QuickCheck, it has some nice additional features, including integration with test frameworks such as NUnit and xUnit.
There are plenty of other generator functions available as well as Gen.sample
(more documentation ).
The FsCheck documentation has more on how you can tweak properties .
. The properties don't have to be mathematical.
The code samples used in this post are .
Want more? I have written
UPDATE: I did a talk on property-based testing based on these posts.
First make sure you have the latest F# core installed ().
Make sure your that your app.config
has the .