Writing a Markdown Compiler – Part 2

Hello, and welcome to the second part of the Markdown Compiler series! In case you’ve missed it, here is the first part.

In this part we’ll talk about the second step in compiling: Parsing – also known as Syntactic Analysis. This part has a bit more theory, so it might take some time to digest. Sip some coffee, relax, take yout time, and as long as you don’t rush it you’ll find it’s not hard at all 🙂

If you recall from our first part, we talked about Tokenizing, which is making sense of a bunch of characters by transforming them into tokens. Parsing is organizing those tokens into a tree data structure called Abstract Syntax Tree, or AST for short.

For example, say were to design a language where you can assign a variable like this:

We could establish that assignment are made of a word token, an equals token, and a number token. The following are invalid assignments:

You can see we only accept a small numbers of token sequences. In fact, the accepted sequences must be carefully ordered in order to be valid. A common solution to this problem – matching sequences – is regular expressions. A not-so-common solution is writing a Parser.

The parser itself is just a program which returns true if a sequence is valid, and false otherwise. Our parser will also return a Node of an AST, we’ll learn more about that in a bit.

Theory and Grammar

Now it’s time for a little theory. Don’t worry, I promise it won’t be that bad: A grammar is a set of rules which together define all possible valid character sequences. They look like this:

Rules can consist only of other rules and/or a terminal. For example, if we wanted to match a tiny language L = { a, b, c } we could write:

L = { a, b, c } can be read as “The language named L is made of 3 elements: a, b, and c“.

In this case, a, b and c are all terminals, they match themselves. We could use any symbol to represent or, we used | as it’s quite common. There is not an unified grammar representation – every library makes it’s own choices – nevertheless they all come from something called Backus Naur Form.

We’ll also use something called Kleene star:

It simply means: Match this 0 or more times. It will match the empty string "", a, aa, aaa and so on. A very similar helper is Kleen plus, which means: Match this 1 or more times.

Will match b, bb, bbb and so on, but not the empty string.

The order in which the grammar tries the rules is not formally defined, any possible match is a valid match. For example, consider the following grammar:

In that grammar, we have two ways of generating ‘aba’, one is by using the first branch of the or, and the other is using the second brach.

In our implementation we’ll use a top-down approach and just match the first branch. This means we would ignore the second branch, so be careful with your rules.

Languages generated by a grammar are called formal languages, you already know several formal languages, some of them are HTML, XML, CSS, JavaScript, Ruby, Swift, Java, and C.

Also, we won’t write just any grammar, we’ll limit our rules in the grammar a bit, that way we’ll only match Context-Free Languages. Why? Because they represent the best compromise between power of expression and ease of implementation. You can learn more about grammars here.

Which limitations are we talking about exactly? Not many really, we just need to avoid left-recursion:

A rule which calls itself before calling another rule or a terminal. Why this limitation? Well, one is because it’s harder to implement. Because we’ll use functions to implement rules, the implemenation of a left-recursive rule looks like this:

We’ve got an infinite loop! The good news is that all grammars with left-recursion can be written as a different equivalent grammar without left-recursion. In the next section we’ll convert a left-recursive grammar into a non-left-recursive one.

Just one more thing before we move on, I just want to show you how to to evaluate a grammar by hand. Let’s this tiny grammar as an example:

In the grammar above, I want to match an Identifier rule, a token of type EQUALS (also known as terminal), and a Number. As you can see, we’ve defined them using some building blocks called Terminals or Tokens. In our code, we’ll tell the WORD token to match [a-z]+ and the NUMBER token will match just [0-9].

To try out this grammar, all we need to know is the substitution model. We just replace rules with their definition until all we have are terminals. Let’s say I want to match foo = 1. We must start from the initial rule and see if we can get to that:

We were able to get to foo = 1, so it belongs to our language.

On Abstract Syntax Trees

Now, just some more theory before I let you go 🙂 The whole point of the grammar is to get an Abstract Syntax Tree representation – or AST for short, of our input. For example, a markdown grammar might parse hello __world__. as:

NOTE If you’ve never seen a tree data structure before, you might want to
check that out
.

Our parent node is PARAGRAPH. That node has a single child, SENTENCES, which in turn has 3 children nodes, TEXT, BOLD and another TEXT. The starting rule in our parser will be the top-most parent in our tree.

The thing about getting a tree out of a grammar is that we can remove ambiguity. Consider the following grammar:

If we were to manually build an AST for 2 + 2 - 4, we get

The way to read an AST is reading all the leafs of the tree – nodes without children – from left ro right. If we do that, you can see we matched (2 + 2) - 4. The problem, is that an equally valid representation could be:

This time, we end up with 2 + (2 - 4). We have two possible ASTs to choose from! Because we want our programs to be deterministic (given the same input, always return the same output), looks like we have some issues.

Well not really. Luckly for us, we happen to use left-recursive grammars. Those grammars have a nice property which is they never have ambiguity! Let’s transform the old grammar:

As you can see, we explicitly set the order of the operations to be performed, which in this case is multiplication, division, adition, and subtraction – just like C. The generated AST will now always be the same. Let’s see the way the grammar evaluates 2 + 2 so you get the hang of this trick:

This trick of transforming a left-recursive grammar to a non-left-recursive grammar works for all grammars. For more on this, you might want to check this article.

A simple Markdown grammar

Okay, enough theory, let’s start coding already! This is the grammar we’ll implement:

Our starting rule is Body, which just matches 0 or more Paragraphs. Each paragraph is made of either a SentenceAndNewline rule or a SentenceAndEOF rule. A Sentence is just text, bold text, or emphasized text.

Note that the Text rule seems quite silly. It’s just so it makes the implementation easier, we could easily get rid of it and just replace it with TEXT.

Implementation

The approach we’ll take is creating several small objects, each matching a single rule in our grammar. We’ll call those objects parsers, each parser might call other parsers in order to match the rule, including itself.

The source code for this whole compiler is on GitHub, feel free to clone it and see the parser files in your favorite editor as you read this, the snippets here are just for concept representation.

Let’s tart with the TEXT rule, which matches a single TEXT token.

All parsers behave very similarly. They take a list of tokens and peek the list to see if the sequence is correct. They then return the matched node, or an empty node (a null node). We call the result node because we want to build an abstract syntax tree.

Let’s see a parser a bit more complicated, __bold__ **text**:

Once again, we just check the token sequence is valid and return a node. The peek_or method lives in a TokenList object, it takes any amount of arrays as input and tries the tokens defined in each array one by one. It stops whenever it finds a match, returning true, otherwise it returns false. As you might imagine, the order of the arrays are very important, as it’s first-in-first-matched.

The emphasis parser is quite similar to this one, so let’s move onto something more interesting: The sentence parser. Our rule is Sentence := EmphasizedText | BoldText | Text. Seems simple enough, match_first does the trick fos us:

match_first is a concern which is included whenever needed. It’s somewhat like an or, it will try the given parsers and return the first valid node it finds. As usual, the order of the parsers is very important as they get tested in the given order.

Now, onto the next rule: SentenceAndNewline := Sentence+ NEWLINE NEWLINE.

Similar to match_first, we now have another concern, match_star, which matches something 0 or more times. Because we are matching a + (Kleen Plus), we want to match something once or more, so we error if we got nothing from our match_star method. Then we just match the two NEWLINE and return the node.

NOTE We could make a new helper, match_plus here. To keep things simple, I decided to do it “manually”. If you want to play around with the code, implementing match_plus is a good exercise.

Our little concerns take away most of the job, as the remaining parsers are quite trivial. For example, this is our Body parser:

Now what’s missing is something to start calling up parsers. Let’s wrap the whole parsing functionality into a Parser object which abstracts away all the complicated stuff into a simple API:

That’s it!

We’ve made it a long way so far. We can now transform __Foo__ and *bar*.\n\nAnother paragraph. into a tree data structure:

That is no easy feat! We’ve talked a lot about parsers, grammars, tokens and all that stuff. More than enough to sip in for a day. Remember the source code is on GitHub, feel free to clone and play around with it.

Next time, we’ll look at the final part of our compiler: The code generation layer. Hope you find these series useful, and see you in the next and final part!

Leave a Reply