Building a useful set of parser combinators
15 or so combinators that can be combined to parse almost anything
Last updated
Was this helpful?
15 or so combinators that can be combined to parse almost anything
Last updated
Was this helpful?
UPDATE:
In this series, we are looking at how applicative parsers and parser combinators work.
In the , we created the foundations of a parsing library.
In this post, we'll build out the library with many other useful combinators.
The combinator names will be copied from those used by , so that you can easily migrate to it.
map
-- transforming the contents of a parserWhen parsing, we often we want to match a particular string, such as a reserved word like "if" or "where". A string is just a sequence of characters, so surely we could use the same technique that we used to define anyOf
in the first post, but using andThen
instead of orElse
?
Here's a (failed) attempt to create a pstring
parser using that approach:
This doesn't work, because the output of andThen
is different from the input (a tuple, not a char) and so the reduce
approach fails.
In order to solve this, we'll need to use a different technique.
To get started, let's try just matching a string of a specific length. Say, for example, that we want to match a three digits in a row. Well, we can do that using andThen
:
If we run it like this:
then we get the result:
It does work, but the result contains a tuple inside a tuple (('1', '2'), '3')
which is fugly and hard to use. It would be so much more convenient to just have a simple string ("123"
).
But in order to turn ('1', '2'), '3')
into "123"
, we'll need a function that can reach inside of the parser and transform the result using an arbitrary passed in function.
Of course, what we need is the functional programmer's best friend, map
.
To understand map
and similar functions, I like to think of there being two worlds: a "Normal World", where regular things live, and "Parser World", where Parser
s live.
You can think of Parser World as a sort of "mirror" of Normal World because it obeys the following rules:
Every type in Normal World (say char
) has a corresponding type in Parser World (Parser<char>
).
And:
Every value in Normal World (say "ABC"
) has a corresponding value in Parser World (that is, some Parser<string>
that returns "ABC"
).
And:
Every function in Normal World (say char -> string
) has a corresponding function in Parser World (Parser<char> -> Parser<string>
).
Using this metaphor then, map
transforms (or "lifts") a function in Normal World into a function in Parser World.
So that's what map
does; how do we implement it?
The logic is:
Inside the innerFn
, run the parser to get the result.
If the result was a success, apply the specified function to the success value to get a new, transformed value, and...
...return the new, mapped, value instead of the original value.
Here's the code (I've named the map function mapP
to avoid confusion with other map functions):
If we look at the signature of mapP
:
we can see that it has exactly the signature we want, transforming a function 'a -> 'b
into a function Parser<'a> -> Parser<'b>
.
It's common to define an infix version of map
as well:
And in the context of parsing, we'll often want to put the mapping function after the parser, with the parameters flipped. This makes using map
with the pipeline idiom much more convenient:
mapP
With mapP
available, we can revisit parseThreeDigits
and turn the tuple into a string.
Here's the code:
Or, if you prefer a more compact implementation:
And if we test it, we get a string in the result now, rather than a tuple:
We can go further, and map the string into an int:
If we test this, we get an int
in the Success branch.
Let's check the type of parseThreeDigitsAsInt
:
It's a Parser<int>
now, not a Parser<char>
or Parser<string>
. The fact that a Parser
can contain any type, not just a char or string, is a key feature that will be very valuable when we need to build more complex parsers.
apply
and return
-- lifting functions to the world of ParsersTo achieve our goal of creating a parser that matches a list of characters, we need two more helper functions which I will call returnP
and applyP
.
returnP
simply transforms a normal value into a value in Parser World
applyP
transforms a Parser containing a function (Parser< 'a->'b >
) into a function in Parser World (Parser<'a> -> Parser<'b >
)
Here's a diagram of returnP
:
And here is the implementation of returnP
:
The signature of returnP
is just as we want:
Now here's a diagram of applyP
:
And here is the implementation of applyP
, which uses .>>.
and map
:
The infix version of applyP
is written as <*>
:
Again, the signature of applyP
is just as we want:
Why do we need these two functions? Well, map
will lift functions in Normal World into functions in Parser World, but only for one-parameter functions.
What's great about returnP
and applyP
is that, together, they can lift any function in Normal World into a function in Parser World, no matter how many parameters it has.
For example, we now can define a lift2
function that will lift a two parameter function into Parser World like this:
The signature of lift2
is:
Here's a diagram of lift2
:
Let's see some examples of using lift2
in practice. First, lifting integer addition to addition of Parsers:
The signature is:
which shows that addP
does indeed take two Parser<int>
parameters and returns another Parser<int>
.
And here's the startsWith
function being lifted to Parser World:
Again, the signature of startsWithP
is parallel to the signature of startsWith
, but lifted to the world of Parsers.
sequence
-- transforming a list of Parsers into a single ParserWe now have the tools we need to implement our sequencing combinator! The logic will be:
Start with the list "cons" operator. This is the two-parameter function that prepends a "head" element onto a "tail" of elements to make a new list.
Lift cons
into the world of Parsers using lift2
.
We now have a a function that prepends a head Parser
to a tail list of Parser
s to make a new list of Parser
s, where:
The head Parser is the first element in the list of parsers that has been passed in.
The tail is generated by calling the same function recursively with the next parser in the list.
When the input list is empty, just return a Parser
containing an empty list.
Here's the implementation:
The signature of sequence
is:
which shows that the input is a list of Parser
s and the output is a Parser
containing a list of elements.
Let's test it by creating a list of three parsers, and then combining them into one:
As you can see, when we run it we get back a list of characters, one for each parser in the original list.
pstring
parserAt last, we can implement the parser that matches a string, which we'll call pstring
.
The logic is:
Convert the string into a list of characters.
Convert each character into a Parser<char>
.
Use sequence
to convert the list of Parser<char>
into a single Parser<char list>
.
And finally, use map
to convert the Parser<char list>
into a Parser<string>
.
Here's the code:
Let's test it:
It works as expected. Phew!
many
and many1
-- matching a parser multiple timesAnother common need is to match a particular parser as many times as you can. For example:
When matching an integer, you want to match as many digit characters as you can.
When matching a run of whitespace, you want to match as many whitespace characters as you can.
There are slightly different requirements for these two cases.
When matching whitespace, it is often optional, so we want a "zero or more" matcher, which we'll call many
.
On the other hand, when matching digits for an integer, you want to match at least one digit, so we want a "one or more" matcher, which we'll call many1
.
Before creating these, we'll define a helper function which matches a parser zero or more times. The logic is:
Run the parser.
If the parser returns Failure
(and this is key) just return an empty list. That is, this function can never fail!
If the parser succeeds:
Call the function recursively to get the remaining values (which could also be an empty list).
Then combine the first value and the remaining values.
Here's the code:
With this helper function, we can easily define many
now -- it's just a wrapper over parseZeroOrMore
:
The signature of many
shows that the output is indeed a list of values wrapped in a Parser
:
Now let's test many
:
Note that in the last case, even when there is nothing to match, the function succeeds.
There's nothing about many
that restricts its use to single characters. For example, we can use it to match repetitive string sequences too:
Finally, let's implement the original example of matching whitespace:
many1
We can also define the "one or more" combinator many1
, using the following logic:
Run the parser.
If it fails, return the failure.
If it succeeds:
Call the helper function parseZeroOrMore
to get the remaining values.
Then combine the first value and the remaining values.
Again, the signature of many1
shows that the output is indeed a list of values wrapped in a Parser
:
Now let's test many1
:
As we saw in an earlier example, the last case gives a misleading error. It says "Expecting '9'" when it really should say "Expecting a digit". In the next post we'll fix this.
Using many1
, we can create a parser for an integer. The implementation logic is:
Create a parser for a digit.
Use many1
to get a list of digits.
Using map
, transform the result (a list of digits) into a string and then into an int.
Here's the code:
And let's test it:
opt
-- matching a parser zero or one timeSometimes we only want to match a parser zero or one time. For example, the pint
parser above does not handle negative values. To correct this, we need to be able to handle an optional minus sign.
We can define an opt
combinator easily:
Change the result of a specified parser to an option by mapping the result to Some
.
Create another parser that always returns None
.
Use <|>
to choose the second ("None") parser if the first fails.
Here's the code:
Here's an example of it in use -- we match a digit followed by an optional semicolon:
And here is pint
rewritten to handle an optional minus sign:
Note that the resultToInt
helper function now needs to handle the sign option as well as the list of digits.
And here it is in action:
We often want to match something in the input, but we don't care about the parsed value itself. For example:
For a quoted string, we need to parse the quotes, but we don't need the quotes themselves.
For a statement ending in a semicolon, we need to ensure the semicolon is there, but we don't need the semicolon itself.
For whitespace separators, we need to ensure the whitespace is there, but we don't need the actual whitespace data.
To handle these requirements, we will define some new combinators that throw away the results of a parser:
p1 >>. p2
will apply p1
and p2
in sequence, just like .>>.
, but throw away the result of p1
and keep the result of p2
.
p1 .>> p2
will apply p1
and p2
in sequence, just like .>>.
, but keep the result of p1
and throw away the result of p2
.
These are easy to define -- just map over the result of .>>.
, which is a tuple, and keep only one element of the pair.
These combinators allow us to simplify the digitThenSemicolon
example shown earlier:
You can see that the result now is the same, whether or not the semicolon was present.
How about an example with whitespace?
The following code creates a parser that looks for "AB" followed by one or more whitespace chars, followed by "CD".
The result contains "AB" and "CD" only. The whitespace between them has been discarded.
between
A particularly common requirement is to look for a parser between delimiters such as quotes or brackets.
Creating a combinator for this is trivial:
And here it is in use, to parse a quoted integer:
Another common requirement is parsing lists, seperated by something like commas or whitespace.
To implement a "one or more" list, we need to:
First combine the separator and parser into one combined parser, but using >>.
to throw away the separator value.
Next, look for a list of the separator/parser combo using many
.
Then prefix that with the first parser and combine the results.
Here's the code:
For the "zero or more" version, we can choose the empty list as an alternate if sepBy1
does not find any matches:
Here's some tests for sepBy1
and sepBy
, with results shown in the comments:
bind
?One combinator that we haven't implemented so far is bind
(or >>=
).
Up to this point, I thought that it would be better to show implementations for combinators such as map
and .>>.
that were explicit and thus, hopefully, easier to understand.
But now that we have have some experience, let's implement bind
and see what we can do with it.
Here's the implementation of bindP
(as I'll call it)
The signature of bindP
is:
The infix version of bind
is >>=
. Note that the parameters are flipped: f
is now the second parameter which makes it more convenient for F#'s pipeline idiom.
bindP
and returnP
The combination of bindP
and returnP
can be used to re-implement many of the other combinators. Here are some examples:
Note that the combinators that check the Failure
path can not be implemented using bind
. These include orElse
and many
.
We could keep building combinators for ever, but I think we have everything we need to build a JSON parser now, so let's stop and review what we have done.
In the previous post we created these combinators:
.>>.
(andThen
) applies the two parsers in sequence and returns the results in a tuple.
<|>
(orElse
) applies the first parser, and if that fails, the second parsers.
choice
extends orElse
to choose from a list of parsers.
And in this post we created the following additional combinators:
bindP
chains the result of a parser to another parser-producing function.
mapP
transforms the result of a parser.
returnP
lifts an normal value into the world of parsers.
applyP
allows us to lift multi-parameter functions into functions that work on Parsers.
lift2
uses applyP
to lift two-parameter functions into Parser World.
sequence
converts a list of Parsers into a Parser containing a list.
many
matches zero or more occurences of the specified parser.
many1
matches one or more occurences of the specified parser.
opt
matches an optional occurrence of the specified parser.
.>>
keeps only the result of the left side parser.
>>.
keeps only the result of the right side parser.
between
keeps only the result of the middle parser.
sepBy
parses zero or more occurrences of a parser with a separator.
sepBy1
parses one or more occurrences of a parser with a separator.
I hope you can see why the concept of "combinators" is so powerful; given just a few basic functions, we have built up a library of useful functions quickly and concisely.
Here's the complete listing for the parsing library so far -- it's about 200 lines of code now!
In this post, we have built on the basic parsing code from last time to create a library of a 15 or so combinators that can be combined to parse almost anything.
And by the way, if you like this metaphor, I have a .
If you want to know more about how this works, check out my or .
If you know anything about functional programming, or have seen my talk on , you'll know that bind
is a powerful tool that can be used to implement many functions.
which conforms to a standard bind signature. The input f
is a "diagonal" function ('a -> Parser<'b>
) and the output is a "horizontal" function (Parser<'a> -> Parser<'b>
). See .
The source code displayed below is also available at .
Soon, we'll use them to build a JSON parser, but before that, let's pause and take time to clean up the error messages. That will be the topic of the .
The source code for this post is available at .