Writing a JSON parser from scratch

In 250 lines of code

UPDATE: Slides and video from my talk on this topic

In this series, we are looking at how applicative parsers and parser combinators work.

  • In the first post, we created the foundations of a parsing library.

  • In the second post, we built out the library with many other useful combinators.

  • In the third post, we improved the error messages.

  • In this last post, we'll use the library we've written to build a JSON parser.

First, before we do anything else, we need to load the parser library script that we developed over the last few posts, and then open the ParserLibrary namespace:

#load "ParserLibrary.fsx"

open System
open ParserLibrary

You can download ParserLibrary.fsx from here.

1. Building a model to represent the JSON spec

The JSON spec is available at json.org. I'll paraphase it here:

  • A value can be a string or a number or a bool or null or an object or an array.

    • These structures can be nested.

  • A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes.

  • A number is very much like a C or Java number, except that the octal and hexadecimal formats are not used.

  • A boolean is the literal true or false

  • A null is the literal null

  • An object is an unordered set of name/value pairs.

    • An object begins with { (left brace) and ends with } (right brace).

    • Each name is followed by : (colon) and the name/value pairs are separated by , (comma).

  • An array is an ordered collection of values.

    • An array begins with [ (left bracket) and ends with ] (right bracket).

    • Values are separated by , (comma).

  • Whitespace can be inserted between any pair of tokens.

In F#, this definition can be modelled naturally as:

So the goal of our JSON parser is:

  • Given a string, we want to output a JValue value.

2. Getting started with Null and Bool

Let's start with the simplest tasks -- parsing the literal values for null and the booleans.

Parsing Null

Parsing the null literal is trivial. The logic will be:

  • Match the string "null".

  • Map the result to the JNull case.

Here's the code:

Note that we don't actually care about the value returned by the parser because we know in advance that it is going to be "null"!

This is a common situation, so let's write a little utility function, >>% to make this look nicer:

Now we can rewrite jNull as follows:

Let's test:

That looks good. Let's try another one!

Parsing Bool

The bool parser will be similar to null:

  • Create a parser to match "true".

  • Create a parser to match "false".

  • And then choose between them using <|>.

Here's the code:

And here are some tests:

Note that the error is misleading due to the backtracking issue discussed in the previous post. Since "true" failed, it is trying to parse "false" now, and "t" is an unexpected character.

3. Parsing String

Now for something more complicated -- strings.

The spec for string parsing is available as a "railway diagram" like this:

All diagrams sourced from json.org.

To build a parser from a diagram like this, we work from the bottom up, building small "primitive" parsers which we then combine into larger ones.

Let's start with "any unicode character other than quote and backslash". We have a simple condition to test, so we can just use the satisfy function:

We can test it immediately:

Ok, good.

Escaped characters

Now what about the next case, the escaped characters?

In this case we have a list of strings to match ("\"", "\n", etc) and for each of these, a character to use as the result.

The logic will be:

  • First define a list of pairs in the form (stringToMatch, resultChar).

  • For each of these, build a parser using pstring stringToMatch >>% resultChar).

  • Finally, combine all these parsers together using the choice function.

Here's the code:

And again, let's test it immediately:

It works nicely!

Unicode characters

The final case is the parsing of unicode characters with hex digits.

The logic will be:

  • First define the primitives for backslash, u and hexdigit.

  • Combine them together, using four hexdigits.

  • The output of the parser will be a nested, ugly tuple, so we need a helper function to convert the

    digits to an int, and then a char.

Here's the code:

And let's test with a smiley face -- \u263A.

The complete String parser

Putting it all together now:

  • Define a primitive for quote

  • Define a jchar as a choice between jUnescapedChar, jEscapedChar, and jUnicodeChar.

  • The whole parser is then zero or many jchar between two quotes.

One more thing, which is to wrap the quoted string in a JString case and give it a label:

Let's test the complete jString function:

4. Parsing Number

The "railway diagram" for Number parsing is:

Again, we'll work bottom up. Let's start with the most primitive components, the single chars and digits:

Now let's build the "integer" part of the number. This is either:

  • The digit zero, or,

  • A nonZeroInt, which is a digitOneNine followed by zero or more normal digits.

Note that, for the nonZeroInt parser, we have to combine the output of digitOneNine (a char) with manyChars digit (a string) so a simple map function is needed.

The optional fractional part is a decimal point followed by one or more digits:

And the exponent part is an e followed by an optional sign, followed by one or more digits:

With these components, we can assemble the whole number:

We haven't defined convertToJNumber yet though. This function will take the four-tuple output by the parser and convert it into a float.

Now rather than writing custom float logic, we're going to be lazy and let the .NET framework to the conversion for us! That is, each of the components will be turned into a string, concatenated, and the whole string parsed into a float.

The problem is that some of the components (like the sign and exponent) are optional. Let's write a helper that converts an option to a string using a passed in function, but if the option is None return the empty string.

I'm going to call it |>? but it doesn't really matter because it is only used locally within the jNumber parser.

Now we can create convertToJNumber:

  • The sign is converted to a string.

  • The fractional part is converted to a string, prefixed with a decimal point.

  • The exponent part is converted to a string, with the sign of the exponent also being converted to a string.

It's pretty crude, and converting things to strings can be slow, so feel free to write a better version.

With that, we have everything we need for the complete jNumber function:

It's a bit long-winded, but each component follows the spec, so I think it is still quite readable.

Let's start testing it:

And what about some failing cases?

Hmm. Something went wrong! These cases should fail, surely?

Well, no. What's happening in the -123. case is that the parser is consuming everything up the to decimal point and then stopping, leaving the decimal point to be matched by the next parser! So, not an error.

Similarly, in the 00.1 case, the parser is consuming only the first 0 then stopping, leaving the rest of the input (0.4) to be matched by the next parser. Again, not an error.

To fix this properly is out of scope, so let's just add some whitespace to the parser to force it to terminate.

Now let's test again:

and we find the error is being detected properly now.

Let's test the fractional part:

and the exponent part now:

It's all looking good so far. Onwards and upwards!

5. Parsing Array

Next up is the Array case. Again, we can use the railway diagram to guide the implementation:

We will start with the primitives again. Note that we are adding optional whitespace after each token:

And then we create a list of values separated by a comma, with the whole list between the left and right brackets.

Hold on -- what is this jValue?

Well, the spec says that an Array can contain a list of values, so we'll assume that we have a jValue parser that can parse them.

But to parse a JValue, we need to parse a Array first!

We have hit a common problem in parsing -- mutually recursive definitions. We need a JValue parser to build an Array, but we need an Array parser to build a JValue.

How can we deal with this?

Forward references

The trick is to create a forward reference, a dummy JValue parser that we can use right now to define the Array parser, and then later on, we will fix up the forward reference with the "real" JValue parser.

This is one time where mutable references come in handy!

We will need a helper function to assist us with this, and the logic will be as follows:

  • Define a dummy parser that will be replaced later.

  • Define a real parser that forwards the input stream to the dummy parser.

  • Return both the real parser and a reference to the dummy parser.

Now when the client fixes up the reference, the real parser will forward the input to the new parser that has replaced the dummy parser.

Here's the code:

With this in place, we can create a placeholder for a parser of type JValue:

Finishing up the Array parser

Going back to the Array parser, we can now compile it successfully, using the jValue placeholder:

If we try to test it now, we get an exception because we haven't fixed up the reference:

So for now, let's fix up the reference to use one of the parsers that we have already created, such as jNumber:

Now we can successfully test the jArray function, as long as we are careful to only use numbers in our array!

6. Parsing Object

The parser for Object is very similar to the one for Array.

First, the railway diagram:

Using this, we can create the parser directly, so I'll present it here without comment:

A bit of testing to make sure it works (but remember, only numbers are supported as values for now).

7. Putting it all together

Finally, we can combine all six of the parsers using the choice combinator, and we can assign this to the JValue parser reference that we created earlier:

And now we are ready to rock and roll!

Testing the complete parser: example 1

Here's an example of a JSON string that we can attempt to parse:

And here is the result:

Testing the complete parser: example 2

Here's one from the example page on json.org:

And here is the result:

Complete listing of the JSON parser

Here's the complete listing for the JSON parser -- it's about 250 lines of useful code.

The source code displayed below is also available at this gist.

Summary

In this post, we built a JSON parser using the parser library that we have developed over the previous posts.

I hope that, by building both the parser library and a real-world parser from scratch, you have gained a good appreciation for how parser combinators work, and how useful they are.

I'll repeat what I said in the first post: if you are interesting in using this technique in production, be sure to investigate the FParsec library for F#, which is optimized for real-world usage.

And if you are using languages other than F#, there is almost certainly a parser combinator library available to use.

Thanks!

The source code for this post is available at this gist.

Last updated

Was this helpful?