Writing a Markdown Compiler – Part 3

Have you ever wanted to make your own programming language? Maybe a template engine? A JSON parser? If you have ever built any of those, you might have noticed it’s not exactly easy to get started. We’d like to help with that.

Welcome to Part 3, the final in this series on Writing a Markdown Compiler!

If you have made it this far, congratulations are in order 🙂

The last step, code generation, is quite simple. You'll have your very own markdown subset compiler in no time. In case you haven't read the other parts yet, here's part 1, Intro/Tokenizer, and part 2, Parsing/Implementation.

Visitor Pattern

Now that we have our little language represented in an AST, we'll use something called the Visitor pattern to visit each node and emit the proper HTML. I could point you to an article explaining this pattern, but it's not really worth it; it's easier to just see it in action.

The core idea of our implementation is that each visitor will return some valid HTML. Once we have visited the whole tree, we'll join all of the visitors' generated code into a full HTML translation of our markdown subset.

Not much theory in this one, let’s just dive into the code!
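The generator's entry point can be sketched like this. The `Generator` and `BodyVisitor` names, and the `Node` struct, are assumptions based on the AST we built in Part 2, not necessarily the repository's exact code:

```ruby
# Hypothetical AST node shape from Part 2: a type, an optional value,
# and optional child nodes.
Node = Struct.new(:type, :value, :children)

class Generator
  # Kick off code generation by visiting the root (body) node.
  def generate(body_node)
    BodyVisitor.new.visit(body_node)
  end
end

# A minimal body visitor stub so this sketch runs on its own.
class BodyVisitor
  def visit(node)
    node.children.map { |paragraph| "<p>#{paragraph.value}</p>" }.join
  end
end
```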

As you can see, we start with the top-most node, the Body. For each node in our AST we have a corresponding visitor. The body visitor is quite trivial, let’s take a look:
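A sketch of the body visitor, assuming the node shape from Part 2 (the stand-in `ParagraphVisitor` here is just enough to make the sketch self-contained):

```ruby
Node = Struct.new(:type, :value, :children)

# A stand-in paragraph visitor so this sketch runs on its own.
class ParagraphVisitor
  def visit(node)
    "<p>#{node.children.map(&:value).join}</p>"
  end
end

class BodyVisitor
  # A body is just a collection of paragraphs: visit each one and
  # join the generated HTML together.
  def visit(body_node)
    body_node.children
             .map { |paragraph| ParagraphVisitor.new.visit(paragraph) }
             .join
  end
end
```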

Because body_node is just a collection of paragraphs, we simply visit each paragraph and join the results together. The ParagraphVisitor is a bit more interesting:
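A sketch of the paragraph visitor, again with illustrative names (the trivial `SentenceVisitor` stand-in keeps the sketch runnable on its own):

```ruby
Node = Struct.new(:type, :value, :children)

# A stand-in sentence visitor so this sketch runs on its own.
class SentenceVisitor
  def visit(node)
    node.value
  end
end

class ParagraphVisitor
  # Visit each sentence in the paragraph, then wrap the joined
  # result in p tags.
  def visit(paragraph_node)
    sentences = paragraph_node.children.map do |sentence|
      SentenceVisitor.new.visit(sentence)
    end
    "<p>#{sentences.join}</p>"
  end
end
```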

We are doing basically the same thing; the only difference is that we finally add in some HTML! Because paragraphs are surrounded by p tags, we wrap whatever is inside with those tags. We defer the work of translating the paragraph's contents to the SentenceVisitor:
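A sketch of the sentence visitor, dispatching on node.type. The token type names (bold, emphasis, text) and the leaf visitors here are assumptions based on the grammar from the earlier parts:

```ruby
Node = Struct.new(:type, :value)

# Minimal leaf visitors so this sketch runs on its own.
class BoldVisitor
  def visit(node)
    "<strong>#{node.value}</strong>"
  end
end

class EmphasisVisitor
  def visit(node)
    "<em>#{node.value}</em>"
  end
end

class TextVisitor
  def visit(node)
    node.value
  end
end

class SentenceVisitor
  VISITORS = {
    bold:     BoldVisitor,
    emphasis: EmphasisVisitor,
    text:     TextVisitor
  }.freeze

  # Pick the right visitor for the node's type and delegate to it.
  def visit(node)
    visitor = VISITORS.fetch(node.type) do
      raise "Unknown node type: #{node.type}"
    end
    visitor.new.visit(node)
  end
end
```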

Sentences are made of three different possible token combinations. In this visitor we make sure to get a valid visitor instance given node.type. The remaining visitors are quite small, so I'll just show them all together:
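A sketch of the three leaf visitors (illustrative names; each one simply wraps, or passes through, the node's value):

```ruby
Node = Struct.new(:type, :value)

class BoldVisitor
  def visit(node)
    "<strong>#{node.value}</strong>"
  end
end

class EmphasisVisitor
  def visit(node)
    "<em>#{node.value}</em>"
  end
end

class TextVisitor
  # Plain text needs no tags at all.
  def visit(node)
    node.value
  end
end
```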

As you can see we have chosen to translate bold using strong tags and emphasis using em tags. We could of course use b and i tags or basically whatever we want.

And that’s it! Let’s take a look at the generator tests to get a glimpse at how everything fits together:
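In the spirit of those tests, here is a compact, self-contained end-to-end check. Everything in it is illustrative rather than the repository's exact API, assuming `__bold__` and `*emphasis*` markers:

```ruby
# Split the input into bold, emphasis and plain-text chunks.
def tokenize(input)
  input.scan(/__[^_]+__|\*[^*]+\*|[^_*]+/).map do |chunk|
    case chunk
    when /\A__(.+)__\z/ then [:bold, $1]
    when /\A\*(.+)\*\z/ then [:emphasis, $1]
    else                     [:text, chunk]
    end
  end
end

# Build a trivial body > paragraph > sentence tree from the tokens.
def parse(tokens)
  { type: :body, children: [{ type: :paragraph, children:
      tokens.map { |type, value| { type: type, value: value } } }] }
end

# Walk the tree, emitting HTML for each node type.
def generate(node)
  case node[:type]
  when :body      then node[:children].map { |c| generate(c) }.join
  when :paragraph then "<p>#{node[:children].map { |c| generate(c) }.join}</p>"
  when :bold      then "<strong>#{node[:value]}</strong>"
  when :emphasis  then "<em>#{node[:value]}</em>"
  when :text      then node[:value]
  end
end

puts generate(parse(tokenize("__Hello__ *world*")))
# => <p><strong>Hello</strong> <em>world</em></p>
```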

It’s simply function composition: generate(parse(tokenize("my input"))).

Let's just add one more object to our compiler: a pretty frontend object that abstracts all the internals from our API consumers. We should always expose as little complexity as possible:
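A sketch of such a facade. The class and method names are assumptions, and the internal classes here are trivial stand-ins so the sketch runs on its own; in the real compiler they would be the Tokenizer, Parser and Generator from the series:

```ruby
# Trivial stand-ins for the real pipeline stages.
class Tokenizer
  def tokenize(input)
    input.split("\n\n")
  end
end

class Parser
  def parse(tokens)
    tokens
  end
end

class Generator
  def generate(ast)
    ast.map { |paragraph| "<p>#{paragraph}</p>" }.join
  end
end

# The only object API consumers need to know about.
class MarkdownCompiler
  def compile(markdown)
    tokens = Tokenizer.new.tokenize(markdown)
    ast    = Parser.new.parse(tokens)
    Generator.new.generate(ast)
  end
end
```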

You did it!

Congratulations! Hopefully, by now, you have a deeper understanding of compilers: how they work, what they do, and a common compiler architecture. Please note that this is a very minimalistic compiler. A lot of things are missing, like good error reporting and more markdown features; nevertheless, the core concepts are there. So feel free to play around with the code, change whatever you want, or even share your own project in the comments section!

The cool thing about this structure is that you can write another code-gen layer for, say, XML or ODT just by making a new generator. You could also design a new syntax for markdown by building a new tokenizer + parser and reusing the codegen layer.

It doesn't really matter if you don't understand everything; the point is having a bird's-eye view of the way compilers work. Hopefully this series served you as a solid base for any project requiring some of these skills, or at least just scratched an itch 🙂 Remember the full source code is on GitHub, feel free to play around with it.

Where to go from here

What I’ve shown you is the hard way of writing compilers. In my opinion, it’s very important to do things “by hand” in order to fully understand the underlying concepts. Of course, nowadays there are some tools which make life easier.

One of the oldest and most popular tools out there is Yacc. It's a parser generator that takes in a grammar file and produces a C program able to recognize the given grammar.

Because Yacc was the first widely used tool of its kind, a lot of similar tools exist for other languages: there is Jison for JavaScript and Racc for Ruby. Note that both Yacc and Jison allow us to use left recursion in our rules, which lets us write more expressive, concise rules that still match the same language. Nothing is without a downside, though: left recursion introduces some complexity into our grammars, needed to remove ambiguity, like having to define operator precedence.

Personally, I like Peg.js. PEG stands for Parsing Expression Grammar, and it has the same limitations we had in our grammar. It's quite useful and easy to use for quick prototyping. ANTLR is another popular tool, from the Java world; I've never used it myself, but I've heard great things about it, and it's quite powerful.

If you like functional programming, there's a whole functional approach to this problem: parser combinators. They play with the idea of combining small parsers together to build bigger ones. Here you can find a Ruby parser combinator to get an idea of how they work.
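To give a feel for the idea, here is a tiny, hypothetical parser-combinator sketch in Ruby: each parser is a lambda from input to a `[result, rest]` pair (or nil on failure), and combinators build bigger parsers out of smaller ones:

```ruby
# A parser that matches a single given character.
def char(c)
  ->(input) { input.start_with?(c) ? [c, input[1..-1]] : nil }
end

# Sequence: run each parser in turn, threading the remaining input.
def seq(*parsers)
  ->(input) do
    results = []
    rest = input
    parsers.each do |parser|
      result = parser.call(rest)
      return nil unless result
      results << result[0]
      rest = result[1]
    end
    [results, rest]
  end
end

# Alternation: return the first parser that succeeds.
def alt(*parsers)
  ->(input) do
    parsers.each do |parser|
      result = parser.call(input)
      return result if result
    end
    nil
  end
end

ab = seq(char("a"), char("b"))
p ab.call("abc") # => [["a", "b"], "c"]
```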

It's important to note, though, that most popular programming languages, like Lua and C#, implement their own hand-written tokenizers and parsers in order to be as efficient as possible.

NOTE If you want to build a toy programming language, I recommend this little book, How To Create Your Own Freaking Awesome Programming Language, which is a nice summary of the techniques shown here and adds a lot of new topics you can investigate further on your own. The book itself is a rough overview of language development rather than a deep explanation of each topic. It's a great place to start.

That's all! Thanks for reading! Hopefully this was at least a somewhat useful intro. Let us know in the comments section what you think. Cheers!
