Trees in the real world

Examples using databases, JSON and error handling

This post is the sixth in a series.

In the previous post, we briefly looked at some generic types.

In this post, we'll do some deeper dives into some real-world examples of using trees and folds.

Series contents

Here's the contents of this series:

Defining a generic Tree type

In this post, we'll be working with a generic Tree inspired by the FileSystem domain that we explored earlier.

Here was the original design:

We can separate out the data from the recursion, and create a generic Tree type like this:

Notice that I have used seq to represent the subitems rather than list. The reason for this will become apparent shortly.

The file system domain can then be modelled using Tree by specifying FileInfo as data associated with a leaf node and DirectoryInfo as data associated with an internal node:

cata and fold for Tree

We can define cata and fold in the usual way:

Note that I am not going to implement foldBack for the Tree type, because it's unlikely that the tree will get so deep as to cause a stack overflow. Functions that need inner data can use cata.

Modelling the File System domain with Tree

Let's test it with the same values that we used before:

The totalSize function is almost identical to the one in the previous post:

And so is the largestFile function:

The source code for this section is available at this gistarrow-up-right.

The Tree type in the real world

We can use the Tree to model the real file system too! To do this, just set the leaf node type to System.IO.FileInfo and the internal node type to System.IO.DirectoryInfo.

And let's create some helper methods to create the various nodes:

Now you can see why I used seq rather than list for the subitems. The seq is lazy, which means that we can create nodes without actually hitting the disk.

Here's the totalSize function again, this time using the real file information:

Let's see what the size of the current directory is:

Similarly, we can get the largest file:

So that's one big benefit of using generic recursive types. If we can turn a real-world hierarchy into our tree structure, we can get all the benefits of fold "for free".

Mapping with generic types

One other advantage of using generic types is that you can do things like map -- converting every element to a new type without changing the structure.

We can see this in action with the real world file system. But first we need to define map for the Tree type!

The implementation of map can also be done mechanically, using the following rules:

  • Create a function parameter to handle each case in the structure.

  • For non-recursive cases

    • First, use the function parameter to transform the non-recursive data associated with that case

    • Then wrap the result in the same case constructor

  • For recursive cases, perform two steps:

    • First, use the function parameter to transform the non-recursive data associated with that case

    • Next, recursively map the nested values.

    • Finally, wrap the results in the same case constructor

Here's the implementation of map for Tree, created by following those rules:

If we look at the signature of Tree.map, we can see that all the leaf data is transformed to type 'a, all the node data is transformed to type 'b, and the final result is a Tree<'a,'b>.

We can define Tree.iter in a similar way:

Example: Creating a directory listing

Let's say we want to use map to transform the file system into a directory listing - a tree of strings where each string has information about the corresponding file or directory. Here's how we could do it:

And then we can print the strings out like this:

The results will look something like this:

The source code for this example is available at this gistarrow-up-right.

Example: Creating a parallel grep

Let's look at a more complex example. I'll demonstrate how to create a parallel "grep" style search using fold.

The logic will be like this:

  • Use fold to iterate through the files.

  • For each file, if its name doesn't match the desired file pattern, return None.

  • If the file is to be processed, then return an async that returns all the line matches in the file.

  • Next, all these asyncs -- the output of the fold -- are aggregated into a sequence.

  • The sequence of asyncs is transformed into a single one using Async.Parallel which returns a list of results.

Before we start writing the main code, we'll need some helper functions.

First, a generic function that folds over the lines in a file asynchronously. This will be the basis of the pattern matching.

Next, a little helper that allows us to map over Async values:

Now for the central logic. We will create a function that, given a textPattern and a FileInfo, will return a list of lines that match the textPattern, but asynchronously:

And now for the grep function itself:

Let's test it!

The result will look something like this:

That's not bad for about 40 lines of code. This conciseness is because we are using various kinds of fold and map which hide the recursion, allowing us to focus on the pattern matching logic itself.

Of course, this is not at all efficient or optimized (an async for every line!), and so I wouldn't use it as a real implementation, but it does give you an idea of the power of fold.

The source code for this example is available at this gistarrow-up-right.

Example: Storing the file system in a database

For the next example, let's look at how to store a file system tree in a database. I don't really know why you would want to do that, but the principles would work equally well for storing any hierarchical structure, so I will demonstrate it anyway!

To model the file system hierarchy in the database, say that we have four tables:

  • DbDir stores information about each directory.

  • DbFile stores information about each file.

  • DbDir_File stores the relationship between a directory and a file.

  • DbDir_Dir stores the relationship between a parent directory and a child directory.

Here are the database table definitions:

That's simple enough. But note that in order to save a directory completely along with its relationships to its child items, we first need the ids of all its children, and each child directory needs the ids of its children, and so on.

This implies that we should use cata instead of fold, so that we have access to the data from the lower levels of the hierarchy.

Implementing the database functions

We're not wise enough to be using the SQL Providerarrow-up-right and so we have written our own table insertion functions, like this dummy one:

In a real database, the identity column would be automatically generated for you, but for this example, I'll use a little helper function nextIdentity:

Now in order to insert a directory, we need to first know all the ids of the files in the directory. This implies that the insertDbFile function should return the id that was generated.

But that logic applies to the directories too:

But that's still not good enough. When the child ids are passed to the parent directory, it needs to distinguish between files and directories, because the relations are stored in different tables.

No problem -- we'll just use a choice type to distinguish between them!

With this in place, we can complete the implementation of the database functions:

Working with the catamorphism

As noted above, we need to use cata instead of fold, because we need the inner ids at each step.

The function to handle the File case is easy -- just insert it and return the PrimaryKey.

The function to handle the Directory case will be passed the DirectoryInfo and a sequence of PrimaryKeys from the children that have already been inserted.

It should insert the main directory record, then insert the children, and then return the PrimaryKey for the next higher level:

After inserting the directory record and getting its id, for each child id, we insert either into the DbDir_File table or the DbDir_Dir, depending on the type of the childId.

Note that I've also created a little helper function pkToInt that extracts the integer id from the PrimaryKey type.

Here is all the code in one chunk:

Now let's test it:

The output should look something like this:

You can see that the ids are being generated as the files are iterated over, and that each DbFile insert is followed by a DbDir_File insert.

The source code for this example is available at this gistarrow-up-right.

Example: Serializing a Tree to JSON

Let's look at another common challenge: serializing and deserializing a tree to JSON, XML, or some other format.

Let's use the Gift domain again, but this time, we'll model the Gift type as a tree. That means we get to put more than one thing in a box!

Modelling the Gift domain as a tree

Here are the main types again, but notice that the final Gift type is defined as a tree:

As usual, we can create some helper functions to assist with constructing a Gift:

And we can create some sample data:

Functions like description now need to handle a list of inner texts, rather than one. We'll just concat the strings together with an & separator:

Finally, we can check that the function still works as before, and that multiple items are handled correctly:

Step 1: Defining GiftDto

Our Gift type consists of many discriminated unions. In my experience, these do not serialize well. In fact, most complex types do not serialize well!

So what I like to do is define DTOarrow-up-right types that are explicitly designed to be serialized well. In practice this means that the DTO types are constrained as follows:

  • Only record types should be used.

  • The record fields should consist only primitive values such as int, string and bool.

By doing this, we also get some other advantages:

We gain control of the serialization output. These kinds of data types are handled the same by most serializers, while "strange" things such as unions can be interpreted differently by different libraries.

We have better control of error handling. My number one rule when dealing with serialized data is "trust no one". It's very common that the data is structured correctly but is invalid for the domain: supposedly non-null strings are null, strings are too long, integers are outside the correct bounds, and so on.

By using DTOs, we can be sure that the deserialization step itself will work. Then, when we convert the DTO to a domain type, we can do proper validation.

So, let's define some DTO types for out domain. Each DTO type will correspond to a domain type, so let's start with GiftContents. We'll define a corresponding DTO type called GiftContentsDto as follows:

Obviously, this quite different from the original GiftContents, so let's look at the differences:

  • First, it has the CLIMutableAttribute, which allows deserializers to construct them using reflection.

  • Second, it has a discriminator which indicates which case of the original union type is being used. Obviously, this string could be set to anything,

    so when converting from the DTO back to the domain type, we'll have to check that carefully!

  • Next is a series of fields, one for every possible item of data that needs to be stored. For example, in the Book case, we need a bookTitle,

    while in the Chocolate case, we need the chocolate type. And finally the price field which is in both types.

    Note that the chocolate type is stored as a string as well, and so will also need special treatment when we convert from DTO to domain.

The GiftDecorationDto type is created in the same way, with a discriminator and strings rather than unions.

Finally, we can define a GiftDto type as being a tree that is composed of the two DTO types:

Step 2: Transforming a Gift to a GiftDto

Now that we have this DTO type, all we need to do is use Tree.map to convert from a Gift to a GiftDto. And in order to do that, we need to create two functions: one that converts from GiftContents to GiftContentsDto and one that converts from GiftDecoration to GiftDecorationDto.

Here's the complete code for giftToDto, which should be self-explanatory:

You can see that the case (Book, Chocolate, etc.) is turned into a discriminator string and the chocolateType is also turned into a string, just as explained above.

Step 3: Defining a TreeDto

I said above that a good DTO should be a record type. Well we have converted the nodes of the tree, but the tree itself is a union type! We need to transform the Tree type as well, into say a TreeDto type.

How can we do this? Just as for the gift DTO types, we will create a record type which contains all the data for both cases. We could use a discriminator field as we did before, but this time, since there are only two choices, leaf and internal node, I'll just check whether the values are null or not when deserializing. If the leaf value is not null, then the record must represent the LeafNode case, otherwise the record must represent the InternalNode case.

Here's the definition of the data type:

As before, the type has the CLIMutableAttribute. And as before, the type has fields to store the data from all possible choices. The subtrees are stored as an array rather than a seq -- this makes the serializer happy!

To create a TreeDto, we use our old friend cata to assemble the record from a regular Tree.

Note that in F#, records are not nullable, so I am using Unchecked.defaultof<'NodeData> rather than null to indicate missing data.

Note also that I am assuming that LeafData or NodeData are reference types. If LeafData or NodeData are ever value types like int or bool, then this approach will break down, because you won't be able to tell the difference between a default value and a missing value. In which case, I'd switch to a discriminator field as before.

Alternatively, I could have used an IDictionary. That would be less convenient to deserialize, but would avoid the need for null-checking.

Step 4: Serializing a TreeDto

Finally we can serialize the TreeDto using a JSON serializer.

For this example, I am using the built-in DataContractJsonSerializer so that I don't need to take a dependency on a NuGet package. There are other JSON serializers that might be better for a serious project.

Step 5: Assembling the pipeline

So, putting it all together, we have the following pipeline:

  • Transform Gift to GiftDto using giftToDto,

    that is, use Tree.map to go from Tree<GiftContents,GiftDecoration> to Tree<GiftContentsDto,GiftDecorationDto>

  • Transform Tree to TreeDto using treeToDto,

    that is, use Tree.cata to go from Tree<GiftContentsDto,GiftDecorationDto> to TreeDto<GiftContentsDto,GiftDecorationDto>

  • Serialize TreeDto to a JSON string

Here's some example code:

And here is what the JSON output looks like:

The ugly @ signs on the field names are an artifact of serializing the F# record type. This can be corrected with a bit of effort, but I'm not going to bother right now!

The source code for this example is available at this gistarrow-up-right

Example: Deserializing a Tree from JSON

Now that we have created the JSON, what about going the other way and loading it into a Gift?

Simple! We just need to reverse the pipeline:

  • Deserialize a JSON string into a TreeDto.

  • Transform a TreeDto into a Tree to using dtoToTree,

    that is, go from TreeDto<GiftContentsDto,GiftDecorationDto> to Tree<GiftContentsDto,GiftDecorationDto>.

    We can't use cata for this -- we'll have to create a little recursive loop.

  • Transform GiftDto to Gift using dtoToGift,

    that is, use Tree.map to go from Tree<GiftContentsDto,GiftDecorationDto> to Tree<GiftContents,GiftDecoration>.

Step 1: Deserializing a TreeDto

We can deserialize the TreeDto using a JSON serializer.

What if the deserialization fails? For now, we will ignore any error handling and let the exception propagate.

Step 2: Transforming a TreeDto into a Tree

To transform a TreeDto into a Tree we recursively loop through the record and its subtrees, turning each one into a InternalNode or a LeafNode, based on whether the appropriate field is null or not.

As you can see, a number of things could go wrong:

  • What if the leafData and nodeData fields are both null?

  • What if the nodeData field is not null but the subtrees field is null?

Again, we will ignore any error handling and just throw exceptions (for now).

Question: Could we create a cata for TreeDto that would make this code simpler? Would it be worth it?

Step 3: Transforming a GiftDto into Gift

Now we have a proper tree, we can use Tree.map again to convert each leaf and internal node from a DTO to the proper domain type.

That means we need functions that map a GiftContentsDto into a GiftContents and a GiftDecorationDto into a GiftDecoration.

Here's the complete code -- it's a lot more complicated than going in the other direction!

The code can be grouped as follows:

  • Helper methods (such as strToChocolateType) that convert a string into a proper domain type and throw an exception if the input is invalid.

  • Case converter methods (such as bookFromDto) that convert an entire DTO into a case.

  • And finally, the dtoToGift function itself. It looks at the discriminator field to see which case converter to call,

    and throws an exception if the discriminator value is not recognized.

Step 4: Assembling the pipeline

We can now assemble the pipeline that takes a JSON string and creates a Gift.

This works fine, but the error handling is terrible!

Look what happens if we corrupt the JSON a little:

We get an ugly exception.

Or what if a discriminator is wrong?

or one of the values for the WrappingPaperStyle DU?

We get lots of exceptions, and as as functional programmers, we should try to remove them whenever we can.

How we can do that will be discussed in the next section.

The source code for this example is available at this gistarrow-up-right.

Example: Deserializing a Tree from JSON - with error handling

To address the error handling issue, we're going use the Result type shown below:

I'm not going to explain how it works here. If you are not familar with this approach, please read my post or watch my talkarrow-up-right on the topic of functional error handling.

Let's revisit all the steps from the previous section, and use Result rather than throwing exceptions.

Step 1: Deserializing a TreeDto

When we deserialize the TreeDto using a JSON serializer we will trap exceptions and turn them into a Result.

The signature of fromJson is now string -> Result<'a>.

Step 2: Transforming a TreeDto into a Tree

As before, we transform a TreeDto into a Tree by recursively looping through the record and its subtrees, turning each one into a InternalNode or a LeafNode. This time, though, we use Result to handle any errors.

But uh-oh, we now have a Tree where every internal node and leaf is wrapped in a Result. It's a tree of Results! The actual ugly signature is this: Tree<Result<'Leaf>,Result<'Node>>.

But this type is useless as it stands -- what we really want is to merge all the errors together and return a Result containing a Tree.

How can we transform a Tree of Results into a Result of Tree?

The answer is to use a sequence function which "swaps" the two types. You can read much more about sequence in my series on elevated worlds.

Note that we could also use the slightly more complicated traverse variant to combine the map and sequence into one step, but for the purposes of this demonstration, it's easier to understand if the steps are kept separate.

We need to create our own sequence function for the Tree/Result combination. Luckily the creation of a sequence function is a mechanical process:

  • For the lower type (Result) we need to define apply and return functions. See here for more details on what apply means.

  • For the higher type (Tree) we need to have a cata function, which we do.

  • In the catamorphism, each constructor of the higher type (LeafNode and InternalNode in this case) is replaced by an equivalent that is "lifted" to the Result type (e.g. retn LeafNode <*> data)

Here is the actual code -- don't worry if you can't understand it immediately. Luckily, we only need to write it once for each combination of types, so for any kind of Tree/Result combination in the future, we're set!

Finally, the actual dtoToTree function is simple -- just send the treeDto through dtoToTreeOfResults and then use sequenceTreeOfResult to convert the final result into a Result<Tree<..>>, which is just what we need.

Step 3: Transforming a GiftDto into a Gift

Again we can use Tree.map to convert each leaf and internal node from a DTO to the proper domain type.

But our functions will handle errors, so they need to map a GiftContentsDto into a Result<GiftContents> and a GiftDecorationDto into a Result<GiftDecoration>. This results in a Tree of Results again, and so we'll have to use sequenceTreeOfResult again to get it back into the correct Result<Tree<..>> shape.

Let's start with the helper methods (such as strToChocolateType) that convert a string into a proper domain type. This time, they return a Result rather than throwing an exception.

The case converter methods have to build a Book or Chocolate from parameters that are Results rather than normal values. This is where lifting functions like Result.lift2 can help. For details on how this works, see this post on lifting and this one on validation with applicatives.

And finally, the dtoToGift function itself is changed to return a Result if the discriminator is invalid.

As before, this mapping creates a Tree of Results, so we pipe the output of the Tree.map through sequenceTreeOfResult ...

... to return a Result of Tree.

Here's the complete code for dtoToGift:

The type signature of dtoToGift has changed -- it now returns a Result<Gift> rather than just a Gift.

Step 4: Assembling the pipeline

We can now reassemble the pipeline that takes a JSON string and creates a Gift.

But changes are needed to work with the new error handling code:

  • The fromJson function returns a Result<TreeDto> but the next function in the pipeline (dtoToTree) expects a regular TreeDto as input.

  • Similarly dtoToTree returns a Result<Tree> but the next function in the pipeline (dtoToGift) expects a regular Tree as input.

In both case, Result.bind can be used to solve that problem of mis-matched output/input. See here for a more detailed discussion of bind.

Ok, let's try deserializing the goodJson string we created earlier.

That's fine.

Let's see if the error handling has improved now. We'll corrupt the JSON again:

Great! We get an nice Failure case.

Or what if a discriminator is wrong?

or one of the values for the WrappingPaperStyle DU?

Again, nice Failure cases.

What's very nice (and this is something that the exception handling approach can't offer) is that if there is more than one error, the various errors can be aggregated so that we get a list of all the things that went wrong, rather than just one error at a time.

Let's see this in action by introducing two errors into the JSON string:

So overall, I'd say that's a success!

The source code for this example is available at this gistarrow-up-right.

Summary

We've seen in this series how to define catamorphisms, folds, and in this post in particular, how to use them to solve real world problems. I hope these posts have been useful, and have provided you with some tips and insights that you can apply to your own code.

This series turned out to be a lot longer that I intended, so thanks for making it to the end! Cheers!

Last updated

Was this helpful?