The ANTLR-Haskell project: parsing made easy

Nov. 19th, 2018

Motivating reasons for a new parser generator

Ultimately I want to be able to write self-documentable and self-contained parsers for languages across the spectrum of programming languages, ranging from declarative specification languages to compact and minimal functional programming languages, and even to the occasional imperative language to describe tasks to perform on my computer at regular intervals (think cron but with in-line integration for a non-bash language).

By self-documentable I mean that the parser-generator I get out from writing down a formal grammar in some BNF-like syntax comes with human-readable documentation for the functions, automatons, and whatever data structures these magic parsing tools generate. Why? Well who doesn't want to understand why their parser is choking on some input that is clearly in the language as defined by their grammar.

By self-contained, I'm largely talking about the environment as well as package management ecosystem in which the parser-generator operates. A parser generator like Happy is all well and nice when you need a lot of control over how you build and run your parser (i.e. at the terminal and from arbitrary bash scripts in linguistically heterogeneous software projects). But, there's also plenty of value that comes from operating within community-controlled package management systems, as well as natively supported compiler frameworks like QuasiQuoters. One parser-generator library I've used in the past that gets close to these requirements is Peggy, an eDSL for defining language grammars as PEGs. My own qualms aside about PEGs as a disambiguation mechanism for context-free grammars, the Peggy library embodies much of what I want to have in a parser-generator:

  • It's deeply embedded
  • Bootstrapped (Peggy syntax defined in Peggy)
  • [Claims to] generate fast & modern Haskell code

Unfortunately though, Peggy has not been regularly maintained (6+ years since the latest commits to the project), and the documentation and example grammars are not particularly extensive. Example grammars where present make no claim of generality or adherence to community language standards, or at the very least need to draw uponmore popular language definition repos.

So why might a library like Peggy not catch on? Well the backstory of the project itself aside (of which I have no first-hand knowledge), the isolated nature of its presentation immediately jumps out at me. Peggy, for all its principality in appropriately labeling itself formally as being a PEG-based parser-generator, strikes me as well-engineered but lacking in the open-source nourishment it needed to thrive as a tool.

Thus with the above discussion in mind, I give you Version 1.0 of antlr-haskell. With antlr-haskell I want to bring language parsing as found in the Haskell ecosystem into the modern age of tool support in the form of a deeply-embedded quasiquoter. I also want to help bring an industry-standard language parsing framework into the world of Haskell by porting one of the largest collections of programming language grammars to a functional-programming-friendly environment. To join me on this adventure, scroll to the bottom of this post to read about contributing, or just jump right in to the series of github issues we have piling up.

State-of-the-art parsing in Haskell

Aside from the Happy parser generator tool and the Peggy Haskell library, in my experience the most commonly referred to parsing library in the Haskell ecosystem is parsec. The general usage pattern I find myself in when using parsec is the following:
  • Copy an old module for which I've written a parser, typically from here by:
    • Keeping the 10+ parsec imports at the top of the file.
    • Modifying the lexer at the bottom with any reservedNames for the new language I want to parse.
  • Define a parser in the parsec Parser monad for each non-terminal in my language.
  • Tweak these functions until they parse a few small minimal example languages.
  • Triage all the weird parsing bugs that crop up due to recursive grammar rules, the try-fail semantics of parsec's combinators, and monadic error propagation.
  • Sigh exasperatedly when I need to figure out how buildExpresionParser in Text.Parsec.Expr works.
Aside from the boot parser for antlr-haskell, I've worked on both the PADS parser and the DeckBuild parser for a simple playing-card definition language.

Symbolic expressions example

A symbolic expression, or s-expression for short, is a notation for defining an arbitrarily nested tree-like list structure. At the leaves of an s-expression are the atoms which can be quoted strings, plain symbols, or numeric values. These atoms are composed into a tree structure by defining a rooted n-ary tree structure called an item. Let's start with a basic representation of s-expressions as Haskell data structures as follows:

data Atom
  = Str String
  | Symb String
  | Number String
  deriving (Eq, Ord, Show)

data Item
  = Atm   Atom
  | List  [Item]
  | Field Item Item
  deriving (Eq, Ord, Show)

These data structures can be used to define what it means for the following piece of code to be an s-expression:

((milk juice 3.1) . (honey marmalade "jam"))

Namely the above s-expression represents a pair of items each consisting of a list of three sub-items. In Haskell data structure form, we want this code to parse as follows:

Field
  (List [Atm (Symb "milk"), Atm (Sym "juice"), Atm (Number 3.1)])
  (List [Atm (Symb "honey"), Atm (Sym "marmalade"), Atm (Str "jam")])

In order to define an atom in antlr-haskell we write the following:

[g4|
  atom  : STRING -> Str
        ;

  STRING : '"' ( ('\\' .) | ~ ["\\] )* '"' -> String;
|]

First, note the funny [g4| ... |] syntax. If you're not familiar with the notion of a QuasiQuoter in Haskell, you can just think of this syntax as a way to sandbox the syntax of defining an antlr-haskell grammar from that of defining regular Haskell code. The name of the language of our sandbox goes inside the opening bracket delimiter (g4 in our case, ANTLR's language for defining a context-free grammar). And the subsequent closing bracket goes at the end of our sandbox to indicate we want to go back to coding in Haskell.

Inside the g4-quoted code block above, we've defined the non-terminal symbol atom, which parses correctly when the current input to our parser matches exactly on the terminal symbol for a STRING. To disambiguate, a non-terminal g4 symbol must begin with a lower-case letter and a terminal symbol must begin with an upper-case letter. When actually consuming some input string these two classes of identifiers in g4 (terminals and non-terminals) correspond to Haskell data types constructed by the lexical analyzer and the parser respectively. Also important to note is that semicolons delimit g4 production rules.

 

Let's break down our example a little more though. Taking the definition of a STRING terminal symbol we have:

  STRING : '"' ( ('\\' .) | ~ ["\\] )* '"' -> String;

The colon (:) syntax above defines the terminal symbol as named on the left of the colon, which lexes when the regular expression (regex) to the right-hand side of the colon matches the beginning of the current input string. This particular regex for a STRING can be read as follows:

  1. Match a literal double-quote character: '"'. The single-quotes are G4 syntax for a literal string consisting of one or more characters,
  2. The outer-most parenthesized regex, ( ('\\' .) | ~ ["\\] )*, is G4 syntax for grouping of regular expressions. In this case the asterisk suffixed to the parentheses says repeat the innards of the parentheses zero or more times, namely Kleene star.
    • A string may contain a literal backslash ('\\') followed by any character whatsoever (including a non-closing double-quote) as denoted by the period / dot symbol.
    • Or (as denoted by the vertical bar) a string may contain any character that is not (as denoted by the tilde) a double quote or a single literal backslash. Namely characters inside of square brackets like ["\\] defines a set of characters to match on.
  3. Once the lexer reaches a double-quote character, '"', without a backslash before it, the lexer says "I have successfully parsed a STRING".
  4. Lastly we jump back into G4 (not G4 regex) syntax with the arrow symbol. To the right of the arrow goes a Haskell type or type constructor currently in scope. In this case, the String type is treated specially by the antlr-haskell lexer but more generally we can put any type for which the Read typeclass is instantiated, or we can put any type constructor that expects exactly one String argument.

The other piece of the puzzle we have to look at is the atom declaration in G4, namely the following G4 snippet:

atom  : STRING -> Str ;

As with the lexeme syntax we see a familiar structure where a colon defines the non-terminal symbol named atom, where an atom is precisely just the STRING lexeme that we just defined. Then with the arrow symbol we say that when the antlr-haskell parser sees a STRING lexeme in a context where we're expecting an atom, we stuff it in the Str data constructor that we defined earlier.

But what about SYMB and NUMBER? Well, we need to beef up our terminal and non-terminal definitions as follows:

[g4|
  atom
     : STRING -> Str
     | SYMBOL -> Symb
     | NUMBER -> Number
     ;

  STRING : '"' ( ('\\' .) | ~ ["\\] )* '"' -> String;

  WHITESPACE : [ \n\t\r]+ -> String;

  NUMBER : ('+' | '-')? DIGIT+ ('.' DIGIT+)? -> Double;

  SYMBOL : SYMBOL_START (SYMBOL_START | DIGIT)* -> String;

  fragment SYMBOL_START : [a-zA-Z+\-*/] ;

  fragment DIGIT : [0-9] ;
|]

The above code snippet now introduces the ability to define alternatives of a non-terminal production rule. Namely an atom parses if and only if the current input token produced by the lexer was a STRING, SYMBOL, or NUMBER. We've also introduced the ability to refer to other lexemes from some lexical rule (such as NUMBER being composed of DIGITs). Note that, as is the case with the Java version of ANTLR, antlr-haskell's implementation of g4 does not support recursively defined lexical rules.

Furthermore, lexical fragments (lexical productions defined with the fragment annotation just before their name) are simply a nice shorthand to indicate regular-expression fragments that get copied-and-pasted into other lexical rules that refer to them. This distinction is important for parsers which do an initialize tokenization pass before starting parsing. Namely the default antlr-haskell tokenizer only looks for non-fragment lexemes in the input, producing tokens that are greedily chosen based on the longest-matching regular expression for the current input.

What can we do with atoms?

So far all we've really done is defined a glorified pattern-matching function that takes in a list of tokens and constructs a list of the corresponding atoms. If we were to define some Token type, we might just well have written the above grammar as the following Haskell code:

data Token = STRING String | SYMBOL String | NUMBER Double

tokenToAtom :: Token -> Atom
tokenToAtom (STRING s) = Str s
tokenToAtom (SYMBOL s) = Symb s
tokenToAtom (NUMBER n) = Number n

Clearly, the novelty of defining a language in a BNF-like syntax such as g4 comes into play when non-terminal production rules form a graph-structure rather than just a tree. Now that we know how to parse the leaves of an s-expression AST, we can subsequently define a structure over those leaves in the form of lists, items, and sexprs as follows:

[g4|
  grammar Sexpression;

  sexpr
     : item*
     ;

  item
     : atom                   -> ${\a -> Atm a}
     | list                   -> List
     | '(' item '.' item ')'  -> Field
     ;

  list
     : '(' item* ')'
     ;

  atom
     : STRING -> Str
     | SYMBOL -> Symb
     | NUMBER -> Number
     ;

  STRING : '"' ( ('\\' .) | ~ ["\\] )* '"' -> String;

  WHITESPACE : [ \n\t\r]+ -> String;

  NUMBER : ('+' | '-')? DIGIT+ ('.' DIGIT+)? -> Double;

  SYMBOL : SYMBOL_START (SYMBOL_START | DIGIT)* -> String;

  fragment SYMBOL_START : [a-zA-Z+\-*/] ;

  fragment DIGIT : [0-9] ;

|]

The main addition over the previous code snippets is that this one shows us how to define an inline Haskell lambda function to process the result of parsing a particular alternative of a production rule. In the atom case of an item above, we see the ${\a -> Atm a} syntax being used to define a lambda. This is what we call in quasiquoters an antiquotation. That is, anything delimited within an opening ${ and the subsequent closing curly brace can be any arbitrary Haskell expression. In order for the compiler to generate code that type-checks, however, this expression must be a function which takes a number of curried arguments and returns some type that's the same as all other branches of the production rule.

So for instance in the item non-terminal definition above, all of the production alternatives (atom, list, and field of two items) have a Haskell expression to the right of the arrow which expects a number of arguments equal to the number of non-terminals on the left-hand side of the arrow. That is, atom and list are each non-terminals which produce a Haskell Atom and a Haskell list of Item respectively. Correspondingly the Atm and List data constructors we defined for an Atom type-check in conjunction with the item production rules. For a look "underneath the hood" at how antlr-haskell actually constructs values upon parsing a particular production rule alternative, take a look at the spliced Haskell code generated by antlr-haskell here. The functions of note are the ones that start with ast2* meaning

Finally the grammar declaration on the first line of the quasiquoted g4 snippet above defines the name of our grammar. The name given can be useful if you want to programmatically inspect the grammar data structure that our antlr-haskell quasiquoter generates, or even build on top of other functions and data structures that the compiler generates. As of this writing, the antlr-haskell compiler generates the following grammar for our sexpression definition:

data SexpressionNTSymbol
  = NT_sexpr | NT_item | NT_atom | NT_list | NT_item_star
  deriving (Eq, Ord, Show, Hashable, Generic, Bounded, Enum, Data, Lift)
data SexpressionTSymbol
  = T_0 | T_1 | T_2 | T_STRING | T_WHITESPACE | T_NUMBER | T_SYMBOL
  deriving (Eq, Ord, Show, Hashable, Generic, Bounded, Enum, Data, Lift)
sexpressionGrammar' ::
  Prettify s_a7nA =>
  Grammar s_a7nA SexpressionNTSymbol SexpressionTSymbol
sexpressionGrammar'
  = defaultGrammar NT_sexpr ::
      Grammar s_a7nA SexpressionNTSymbol SexpressionTSymbol
    {ns = Data.HashSet.fromList
            [minBound .. maxBound :: SexpressionNTSymbol],
     ts = Data.HashSet.fromList
            [minBound .. maxBound :: SexpressionTSymbol],
     ps = [(Production NT_sexpr) ((Prod Pass) [NT NT_item_star]),
           (Production NT_item) ((Prod Pass) [NT NT_atom]),
           (Production NT_item) ((Prod Pass) [NT NT_list]),
           (Production NT_item)
             ((Prod Pass) [T T_0, NT NT_item, T T_1, NT NT_item, T T_2]),
           (Production NT_list)
             ((Prod Pass) [T T_0, NT NT_item_star, T T_2]),
           (Production NT_atom) ((Prod Pass) [T T_STRING]),
           (Production NT_atom) ((Prod Pass) [T T_SYMBOL]),
           (Production NT_atom) ((Prod Pass) [T T_NUMBER]),
           (Production NT_item_star)
             ((Prod Pass) [NT NT_item, NT NT_item_star]),
           (Production NT_item_star) ((Prod Pass) [NT NT_item]),
           (Production NT_item_star) ((Prod Pass) [Eps]),
           (Production NT_item_star)
             ((Prod Pass) [NT NT_item, NT NT_item_star]),
           (Production NT_item_star) ((Prod Pass) [NT NT_item]),
           (Production NT_item_star) ((Prod Pass) [Eps])]}
sexpressionGrammar ::
  Grammar () SexpressionNTSymbol SexpressionTSymbol
sexpressionGrammar
  = Text.ANTLR.LL1.removeEpsilons sexpressionGrammar'

Running the parser

There are three primary parsing functions that the antlr-haskell quasiquoter generates: glrParse, slrParse, and allstarParse with the following type signatures:

glrParse :: (t -> Bool) -> [t] -> ast
slrParse :: [t] -> ast
allstarParse :: [t] -> ast

The GLR parser first takes a function for filtering out certain tokens, especially whitespace so that a G4-defined grammar non-terminal doesn't need to. Then, as with slrParse and allstarParse, it takes the list of input tokens and produces some AST type based on what type gets constructed by the entrypoint to the grammar (the first non-terminal rule listed in the source file, top-down).

Contributing

The primary github repo for the antlr-haskell project can be found here: https://github.com/cronburg/antlr-haskell

Github issues

When submitting an issue it is preferable that you include a minimal grammar which triggers the unexpected behavior, what you expect the output to be, and a stack/cabal test-suite defining the correct behavior as such. If an existing test-suite doesn't logically relate to the bug, add a new one based on the test-suite template here.

Academic software notice

This project has been tested to a limited extent by a small number of academic researchers at a single University.