Improving the parser library
Adding more informative errors
UPDATE: Slides and video from my talk on this topic
In this series, we are looking at how applicative parsers and parser combinators work.
In the first post, we created the foundations of a parsing library.
In the second post, we built out the library with many other useful combinators.
In this post, we'll rework the library to provide more helpful error messages.
1. Labelling a Parser
In some of the failing code examples from earlier posts, we got confusing errors:
let parseDigit = anyOf ['0'..'9']
run parseDigit "|ABC" // Failure "Expecting '9'. Got '|'"parseDigit is defined as a choice of digit characters, so when the last choice ('9') fails, that is the error message we receive.
But that message is quite confusing. What we really want is to receive is an error that mentions "digit", something like: Failure "Expecting digit. Got '|'".
That is, what we need is a way of labeling parsers with a word like "digit" and then showing that label when a failure occurs.
As a reminder, this is how the Parser type was defined in earlier posts:
type Parser<'a> = Parser of (string -> Result<'a * string>)In order to add a label, we need to change it into a record structure:
type ParserLabel = string
/// A Parser structure has a parsing function & label
type Parser<'a> = {
parseFn : (string -> Result<'a * string>)
label: ParserLabel
}The record contains two fields: the parsing function (parseFn) and the label.
One problem is that the label is in the parser itself, but not in the Result, which means that clients will not know how to display the label along with the error.
So let's add it to the Failure case of Result as well, in addition to the error message:
And while we are at it, let's define a helper function to display the result of a parse:
Updating the code
With this change to the definition of Parser and Result, we have to change some of the basic functions, such as bindP:
We have to make similar changes to returnP, orElse, and many. For the complete code, see the gist linked to below.
Updating the label
When we use a combinator to build a new compound parser, we will often want to assign a new label to it. In order to do this, we replace the original parseFn with another one that returns the new label.
Here's the code:
And let's create an infix version of this called <?>:
Let's test our new toy!
And the output is:
The error message is now Error parsing digit rather than Expecting '9'. Much better!
Setting default labels:
We can also set the default labels for certain combinators such as andThen and orElse based on the inputs:
2. Replacing "pchar" with "satisfy"
One thing that has bothered me about all the implementations so far is pchar, the basic primitive that all the other functions have built on.
I don't like that it is so tightly coupled to the input model. What happens if we want to parse bytes from a binary format, or other kinds of input. All the combinators other than pchar are loosely coupled. If we could decouple pchar as well, we would be set up for parsing any stream of tokens, and that would make me happy!
At this point, I'll repeat one of my favorite FP slogans: "parameterize all the things!" In the case of pchar, we'll remove the charToMatch parameter and replace it with a function -- a predicate. We'll call the new function satisfy:
Other than the parameters, the only thing that has changed from the pchar implementation is this one line:
With satisfy available, we can rewrite pchar:
Note that we are setting the label to be the charToMatch. This refactoring would not have been as convenient before, because we didn't have the concept of "labels" yet, and so pchar would not have been able to return a useful error message.
The satisfy function also lets us write more efficient versions of other parsers. For example, parsing a digit looked like this originally:
But now we can rewrite it using a predicate directly, making it a lot more efficient:
Similarly, we can create a more efficient whitespace parser too:
3. Adding position and context to error messages
Another way to improve the error messages is to show the line and column that the error occurred on.
Obviously, for simple one-liners, keeping track of the error location is not a problem, but when you are parsing a 100 line JSON file, it will be very helpful.
In order to track the line and column we are going to have to abandon the simple string input and replace it with something more complex, so let's start with that.
Defining a input that tracks position
First, we will need a Position type to store the line and column, with helper functions to increment one column and one line:
Next, we'll need to combine the input string with a position into a single "input state" type. Since we are line oriented, we can make our lives easier and store the input string as a array of lines rather than as one giant string:
We will also need a way to convert a string into a initial InputState:
Finally, and most importantly, we need a way to read the next character from the input -- let's call it nextChar.
We know what the input for nextChar will be (an InputState) but what should the output look like?
If the input is at the end, we need a way to indicate that there is no next character, so in that case return
None.Therefore in the case when a character is available, we will return
Some.In addition, the input state will have changed because the column (or line) will have been incremented as well.
So, putting this together, the input for nextChar is an InputState and the output is a pair char option * InputState.
The logic for returning the next char will be as follows then:
If we are at the last character of the input, return EOF (
None) and don't change the state.If the current column is not at the end of a line, return the character at that position and change the state by incrementing the column position.
If the current column is at the end of a line, return a newline character and change the state by incrementing the line position.
Here's the code:
Unlike the earlier string implementation, the underlying array of lines is never altered or copied -- only the position is changed. This means that making a new state each time the position changes should be reasonably efficient, because the text is shared everywhere.
Let's quickly test that the implementation works. We'll create a helper function readAllChars and then see what it returns for different inputs:
Here it is with some example inputs:
Note that the implementation returns a newline at the end of the input, even if the input doesn't have one. I think that this is a feature, not a bug!
Changing the parser to use the input
We now need to change the Parser type again.
To start with, the Failure case needs to return some kind of data that indicates the position, so we can show it in an error message.
We could just use the InputState as is, but let's be good and define a new type specially for this use, called ParserPosition:
We'll need some way to convert a InputState into a ParserPosition:
And finally, we can update the Result type to include ParserPosition:
In addition, the Parser type needs to change from string to InputState:
With all this extra information available, the printResult function can be enhanced to print the text of the current line, along with a caret where the error is:
Let's test printResult with a dummy error value:
The output is shown below:
Much nicer than before!
Fixing up the run function
run functionThe run function now needs to take an InputState not a string. But we also want the convenience of running against string input, so let's create two run functions, one that takes an InputState and one that takes a string:
Fixing up the combinators
We now have three items in the Failure case rather than two. This breaks some code but is easy to fix. I'm tempted to create a special ParserError type so that it never happens again, but for now, I'll just fix up the errors.
Here's a new version of satisfy:
Note that the failure case code is now Failure (label,err,pos) where the parser position is built from the input state.
And here is bindP:
We can fix up the other functions in the same way.
Testing the positional errors
Let's test with a real parser now:
And the output is:
Excellent! I think we can stop now.
4. Adding some standard parsers to the library
In the previous posts, we've built parsers for strings and ints in passing, but now let's add them to the core library, so that clients don't have to reinvent the wheel.
These parsers are based on those in the the FParsec library.
Let's start with some string-related parsers. I will present them without comment -- I hope that the code is self-explanatory by now.
Let's test pstring, for example:
Whitespace parsers
Whitespace is important in parsing, even if we do end up mostly throwing it away!
And here's some whitespace tests:
Numeric parsers
Finally, we need a parser for ints and floats.
And some tests:
5. Backtracking
One more topic that we should discuss is "backtracking".
Let's say that you have two parsers: one to match the string A-1 and and another to match the string A-2. If the input is A-2 then the first parser will fail at the third character and the second parser will be attempted.
Now the second parser must start at the beginning of the original sequence of characters, not at the third character. That is, we need to undo the current position in the input stream and go back to the first position.
If we were using a mutable input stream then this might be a tricky problem, but thankfully we are using immutable data, and so "undoing" the position just means using the original input value. And of course, this is exactly what combinators such as orElse (<|>) do.
In other words, we get backtracking "for free" when we use immutable input state. Yay!
Sometimes however, we don't want to backtrack. For example, let's say we have these parsers:
let
forExpression= the "for" keyword, then an identifier, then the "in" keyword, etc.let
ifExpression= the "if" keyword, then an identifier, then the "then" keyword, etc.
and we then create a combined expression parser that chooses between them:
let
expression=forExpression <|> ifExpression
Now, if the input stream is for &&& in something then the forExpression parser will error when it hits the sequence &&&, because it is expecting a valid identifier. At this point we don't want to backtrack and try the ifExpression -- we want to show an error such as "identifier expected after 'for'".
The rule then is that: if input has been consumed successfully (in this case, the for keyword was matched successfully) then do not backtrack.
We're not going to implement this rule in our simple library, but a proper library like FParsec does implement this and also has support for bypassing it when needed.
Listing of the final parser library
The parsing library is up to 500 lines of code now, so I won't show it here. You can see it at this gist.
Summary
In this post, we added better error handling and some more parsers.
Now we have everything we need to build a JSON parser! That will be the topic of the next post.
The source code for this post is available at this gist.
Last updated
Was this helpful?