Building a useful set of parser combinators
15 or so combinators that can be combined to parse almost anything
UPDATE: Slides and video from my talk on this topic
In this series, we are looking at how applicative parsers and parser combinators work.
In the first post, we created the foundations of a parsing library.
In this post, we'll build out the library with many other useful combinators.
The combinator names will be copied from those used by FParsec, so that you can easily migrate to it.
1. map -- transforming the contents of a parser
map -- transforming the contents of a parserWhen parsing, we often we want to match a particular string, such as a reserved word like "if" or "where". A string is just a sequence of characters, so surely we could use the same technique that we used to define anyOf in the first post, but using andThen instead of orElse?
Here's a (failed) attempt to create a pstring parser using that approach:
let pstring str =
str
|> Seq.map pchar // convert into parsers
|> Seq.reduce andThenThis doesn't work, because the output of andThen is different from the input (a tuple, not a char) and so the reduce approach fails.
In order to solve this, we'll need to use a different technique.
To get started, let's try just matching a string of a specific length. Say, for example, that we want to match a three digits in a row. Well, we can do that using andThen:
let parseDigit =
anyOf ['0'..'9']
let parseThreeDigits =
parseDigit .>>. parseDigit .>>. parseDigitIf we run it like this:
then we get the result:
It does work, but the result contains a tuple inside a tuple (('1', '2'), '3') which is fugly and hard to use. It would be so much more convenient to just have a simple string ("123").
But in order to turn ('1', '2'), '3') into "123", we'll need a function that can reach inside of the parser and transform the result using an arbitrary passed in function.
Of course, what we need is the functional programmer's best friend, map.
To understand map and similar functions, I like to think of there being two worlds: a "Normal World", where regular things live, and "Parser World", where Parsers live.
You can think of Parser World as a sort of "mirror" of Normal World because it obeys the following rules:
Every type in Normal World (say
char) has a corresponding type in Parser World (Parser<char>).

And:
Every value in Normal World (say
"ABC") has a corresponding value in Parser World (that is, someParser<string>that returns"ABC").
And:
Every function in Normal World (say
char -> string) has a corresponding function in Parser World (Parser<char> -> Parser<string>).

Using this metaphor then, map transforms (or "lifts") a function in Normal World into a function in Parser World.

And by the way, if you like this metaphor, I have a whole series of posts that develop it further.
So that's what map does; how do we implement it?
The logic is:
Inside the
innerFn, run the parser to get the result.If the result was a success, apply the specified function to the success value to get a new, transformed value, and...
...return the new, mapped, value instead of the original value.
Here's the code (I've named the map function mapP to avoid confusion with other map functions):
If we look at the signature of mapP:
we can see that it has exactly the signature we want, transforming a function 'a -> 'b into a function Parser<'a> -> Parser<'b>.
It's common to define an infix version of map as well:
And in the context of parsing, we'll often want to put the mapping function after the parser, with the parameters flipped. This makes using map with the pipeline idiom much more convenient:
Parsing three digits with mapP
mapPWith mapP available, we can revisit parseThreeDigits and turn the tuple into a string.
Here's the code:
Or, if you prefer a more compact implementation:
And if we test it, we get a string in the result now, rather than a tuple:
We can go further, and map the string into an int:
If we test this, we get an int in the Success branch.
Let's check the type of parseThreeDigitsAsInt:
It's a Parser<int> now, not a Parser<char> or Parser<string>. The fact that a Parser can contain any type, not just a char or string, is a key feature that will be very valuable when we need to build more complex parsers.
2. apply and return -- lifting functions to the world of Parsers
apply and return -- lifting functions to the world of ParsersTo achieve our goal of creating a parser that matches a list of characters, we need two more helper functions which I will call returnP and applyP.
returnPsimply transforms a normal value into a value in Parser WorldapplyPtransforms a Parser containing a function (Parser< 'a->'b >) into a function in Parser World (Parser<'a> -> Parser<'b >)
Here's a diagram of returnP:

And here is the implementation of returnP:
The signature of returnP is just as we want:
Now here's a diagram of applyP:

And here is the implementation of applyP, which uses .>>. and map:
The infix version of applyP is written as <*>:
Again, the signature of applyP is just as we want:
Why do we need these two functions? Well, map will lift functions in Normal World into functions in Parser World, but only for one-parameter functions.
What's great about returnP and applyP is that, together, they can lift any function in Normal World into a function in Parser World, no matter how many parameters it has.
For example, we now can define a lift2 function that will lift a two parameter function into Parser World like this:
The signature of lift2 is:
Here's a diagram of lift2:

If you want to know more about how this works, check out my "man page" post on lift2 or my explanation that involves the "Monadster".
Let's see some examples of using lift2 in practice. First, lifting integer addition to addition of Parsers:
The signature is:
which shows that addP does indeed take two Parser<int> parameters and returns another Parser<int>.
And here's the startsWith function being lifted to Parser World:
Again, the signature of startsWithP is parallel to the signature of startsWith, but lifted to the world of Parsers.
3. sequence -- transforming a list of Parsers into a single Parser
sequence -- transforming a list of Parsers into a single ParserWe now have the tools we need to implement our sequencing combinator! The logic will be:
Start with the list "cons" operator. This is the two-parameter function that prepends a "head" element onto a "tail" of elements to make a new list.
Lift
consinto the world of Parsers usinglift2.We now have a a function that prepends a head
Parserto a tail list ofParsers to make a new list ofParsers, where:The head Parser is the first element in the list of parsers that has been passed in.
The tail is generated by calling the same function recursively with the next parser in the list.
When the input list is empty, just return a
Parsercontaining an empty list.
Here's the implementation:
The signature of sequence is:
which shows that the input is a list of Parsers and the output is a Parser containing a list of elements.
Let's test it by creating a list of three parsers, and then combining them into one:
As you can see, when we run it we get back a list of characters, one for each parser in the original list.
Implementing the pstring parser
pstring parserAt last, we can implement the parser that matches a string, which we'll call pstring.
The logic is:
Convert the string into a list of characters.
Convert each character into a
Parser<char>.Use
sequenceto convert the list ofParser<char>into a singleParser<char list>.And finally, use
mapto convert theParser<char list>into aParser<string>.
Here's the code:
Let's test it:
It works as expected. Phew!
4. many and many1 -- matching a parser multiple times
many and many1 -- matching a parser multiple timesAnother common need is to match a particular parser as many times as you can. For example:
When matching an integer, you want to match as many digit characters as you can.
When matching a run of whitespace, you want to match as many whitespace characters as you can.
There are slightly different requirements for these two cases.
When matching whitespace, it is often optional, so we want a "zero or more" matcher, which we'll call
many.On the other hand, when matching digits for an integer, you want to match at least one digit, so we want a "one or more" matcher, which we'll call
many1.
Before creating these, we'll define a helper function which matches a parser zero or more times. The logic is:
Run the parser.
If the parser returns
Failure(and this is key) just return an empty list. That is, this function can never fail!If the parser succeeds:
Call the function recursively to get the remaining values (which could also be an empty list).
Then combine the first value and the remaining values.
Here's the code:
With this helper function, we can easily define many now -- it's just a wrapper over parseZeroOrMore:
The signature of many shows that the output is indeed a list of values wrapped in a Parser:
Now let's test many:
Note that in the last case, even when there is nothing to match, the function succeeds.
There's nothing about many that restricts its use to single characters. For example, we can use it to match repetitive string sequences too:
Finally, let's implement the original example of matching whitespace:
Defining many1
many1We can also define the "one or more" combinator many1, using the following logic:
Run the parser.
If it fails, return the failure.
If it succeeds:
Call the helper function
parseZeroOrMoreto get the remaining values.Then combine the first value and the remaining values.
Again, the signature of many1 shows that the output is indeed a list of values wrapped in a Parser:
Now let's test many1:
As we saw in an earlier example, the last case gives a misleading error. It says "Expecting '9'" when it really should say "Expecting a digit". In the next post we'll fix this.
Parsing an integer
Using many1, we can create a parser for an integer. The implementation logic is:
Create a parser for a digit.
Use
many1to get a list of digits.Using
map, transform the result (a list of digits) into a string and then into an int.
Here's the code:
And let's test it:
5. opt -- matching a parser zero or one time
opt -- matching a parser zero or one timeSometimes we only want to match a parser zero or one time. For example, the pint parser above does not handle negative values. To correct this, we need to be able to handle an optional minus sign.
We can define an opt combinator easily:
Change the result of a specified parser to an option by mapping the result to
Some.Create another parser that always returns
None.Use
<|>to choose the second ("None") parser if the first fails.
Here's the code:
Here's an example of it in use -- we match a digit followed by an optional semicolon:
And here is pint rewritten to handle an optional minus sign:
Note that the resultToInt helper function now needs to handle the sign option as well as the list of digits.
And here it is in action:
6. Throwing results away
We often want to match something in the input, but we don't care about the parsed value itself. For example:
For a quoted string, we need to parse the quotes, but we don't need the quotes themselves.
For a statement ending in a semicolon, we need to ensure the semicolon is there, but we don't need the semicolon itself.
For whitespace separators, we need to ensure the whitespace is there, but we don't need the actual whitespace data.
To handle these requirements, we will define some new combinators that throw away the results of a parser:
p1 >>. p2will applyp1andp2in sequence, just like.>>., but throw away the result ofp1and keep the result ofp2.p1 .>> p2will applyp1andp2in sequence, just like.>>., but keep the result ofp1and throw away the result ofp2.
These are easy to define -- just map over the result of .>>., which is a tuple, and keep only one element of the pair.
These combinators allow us to simplify the digitThenSemicolon example shown earlier:
You can see that the result now is the same, whether or not the semicolon was present.
How about an example with whitespace?
The following code creates a parser that looks for "AB" followed by one or more whitespace chars, followed by "CD".
The result contains "AB" and "CD" only. The whitespace between them has been discarded.
Introducing between
betweenA particularly common requirement is to look for a parser between delimiters such as quotes or brackets.
Creating a combinator for this is trivial:
And here it is in use, to parse a quoted integer:
7. Parsing lists with separators
Another common requirement is parsing lists, seperated by something like commas or whitespace.
To implement a "one or more" list, we need to:
First combine the separator and parser into one combined parser, but using
>>.to throw away the separator value.Next, look for a list of the separator/parser combo using
many.Then prefix that with the first parser and combine the results.
Here's the code:
For the "zero or more" version, we can choose the empty list as an alternate if sepBy1 does not find any matches:
Here's some tests for sepBy1 and sepBy, with results shown in the comments:
What about bind?
bind?One combinator that we haven't implemented so far is bind (or >>=).
If you know anything about functional programming, or have seen my talk on FP patterns, you'll know that bind is a powerful tool that can be used to implement many functions.
Up to this point, I thought that it would be better to show implementations for combinators such as map and .>>. that were explicit and thus, hopefully, easier to understand.
But now that we have have some experience, let's implement bind and see what we can do with it.
Here's the implementation of bindP (as I'll call it)
The signature of bindP is:
which conforms to a standard bind signature. The input f is a "diagonal" function ('a -> Parser<'b>) and the output is a "horizontal" function (Parser<'a> -> Parser<'b>). See this post for more details on how bind works.
The infix version of bind is >>=. Note that the parameters are flipped: f is now the second parameter which makes it more convenient for F#'s pipeline idiom.
Reimplementing other combinators with bindP and returnP
bindP and returnPThe combination of bindP and returnP can be used to re-implement many of the other combinators. Here are some examples:
Note that the combinators that check the Failure path can not be implemented using bind. These include orElse and many.
Review
We could keep building combinators for ever, but I think we have everything we need to build a JSON parser now, so let's stop and review what we have done.
In the previous post we created these combinators:
.>>.(andThen) applies the two parsers in sequence and returns the results in a tuple.<|>(orElse) applies the first parser, and if that fails, the second parsers.choiceextendsorElseto choose from a list of parsers.
And in this post we created the following additional combinators:
bindPchains the result of a parser to another parser-producing function.mapPtransforms the result of a parser.returnPlifts an normal value into the world of parsers.applyPallows us to lift multi-parameter functions into functions that work on Parsers.lift2usesapplyPto lift two-parameter functions into Parser World.sequenceconverts a list of Parsers into a Parser containing a list.manymatches zero or more occurences of the specified parser.many1matches one or more occurences of the specified parser.optmatches an optional occurrence of the specified parser..>>keeps only the result of the left side parser.>>.keeps only the result of the right side parser.betweenkeeps only the result of the middle parser.sepByparses zero or more occurrences of a parser with a separator.sepBy1parses one or more occurrences of a parser with a separator.
I hope you can see why the concept of "combinators" is so powerful; given just a few basic functions, we have built up a library of useful functions quickly and concisely.
Listing of the parser library so far
Here's the complete listing for the parsing library so far -- it's about 200 lines of code now!
The source code displayed below is also available at this gist.
Summary
In this post, we have built on the basic parsing code from last time to create a library of a 15 or so combinators that can be combined to parse almost anything.
Soon, we'll use them to build a JSON parser, but before that, let's pause and take time to clean up the error messages. That will be the topic of the next post.
The source code for this post is available at this gist.
Last updated
Was this helpful?