Sebastian Millies: Dev-Blog : DSLs with Graham Hutton's Functional Parser

One of the best introductions to functional programming is Graham Hutton's excellent book Programming in Haskell. Although it is primarily a course on the Haskell programming language, it also provides an introduction to the underpinnings of functional programming in general. It is very readable, aimed at the undergraduate level. I think it should be on the shelf of every student of functional programming.

I had thought of writing a book review, but there's really no need for it. You can find in-depth technical reviews on the book's site, and more: You can download slides for each chapter, watch video lectures, and get the complete Haskell source code of all the examples. That is for free, without even buying the book - a tremendous service.

Instead of a review, I'll show you what I did for an exercise. I ported the functional parser that Hutton presents in chapter 8 of his book to Java. My aim is to demonstrate the following:

Mastering functional idioms will help you to create powerful interfaces supporting a fluent programming style in Java
Of particular use in abstracting from tedious glue code are monadic interfaces, i. e. basically those that have a flatMap method
There is a special way of encapsulating internal state that is sometimes useful

The parser itself is a recursive descent parser of the kind you wouldn't necessarily use for large and complex grammars. (You'd perhaps use a compiler-compiler or parser generator tool like ANTLR or Xtext/Xtend.) However, you might use it to quickly spin-off a little DSL or demo coding. The real point is not the parser, but the approach to application design.

In the following, some of the text is taken verbatim from Hutton's book, some is my own. I shall not bother with quotes and attributions, because I am not claiming any original ideas here. Remember: this is just an exercise in porting Haskell to Java. However, I'll try to be helpful by interspersing a few extra examples and comments.

Nevertheless, this post is going to be a bit tedious, in the way textbooks are. I apologize in advance. If you're feeling impatient, perhaps you might want to scroll down and look at the final example, a little parser for arithmetic expressions. I summarize my experiences and give some advice in the concluding section.

Basic parsers

So what is a parser? It is a function that takes an input character sequence and produces a pair comprising a result value and an output character sequence, the unconsumed rest of the input. Let's abstract over the type of the result value and define a corresponding functional interface:

@FunctionalInterface
public interface SimpleParser<T> {   
    abstract Tuple<T, CharSequence> parse(CharSequence cs);
}

Hutton's parser in fact returns a list, to include the possibility of returning more than one result if the input sequence can be parsed in more than one way. For simplicity, however, he only considers parsers that return at most one result. For that reason, I have opted for the simpler signature. It follows that our parser cannot handle ambiguity. You might want to do the generalization yourself.

I shall assume the existence of a special empty tuple, which has no components, as an analogue to the empty list, which signifies that the input cannot be parsed at all. (Otherwise, the type Tuple is just what you might expect.)

We will now define some basic parsers, and then see how we can arrange them in a hierarchy of ever higher-level derived parsers. First, here's a method that produces a basic parser that always succeeds with the given result value without consuming any input:

default <V> SimpleParser<V> unit(V v) {
    return inp -> tuple(v, inp);
}

In Haskell, this method would be called "return", but of course that's not possible in Java. We'll also provide output, a static analogue to unit, and failure, a way to produce a parser that always fails regardless of the input. In fact, we'll define quite a few static factory methods for creating parsers. These I have collected in a separate utility class SimpleParserPrimitives. The parser interface itself will contain only the (non-static) methods to combine and manipulate parsers. It will turn out that these methods are quite general. You might re-use them, and each time you might well provide a different set of parsing "primitives" for a different parsing purpose. The primitives I will be discussing are geared towards parsing programming languages.

public static <T> SimpleParser<T> output(T v) {
    return inp -> tuple(v, inp); 
}

public static <T> SimpleParser<T> failure() {
    return _inp -> empty();
}

The method empty returns the empty tuple, representing failure. Our final basic parser is item, which fails if the input string is empty, and succeeds with the first character as the result value otherwise.

public static SimpleParser<Character> item() {
    return inp -> inp.length() == 0 ? empty() : tuple(inp.charAt(0), inp.subSequence(1, inp.length()));
}

Sequencing

Suppose we want to parse an arithmetic expression like "5 + 4 * 3". One way to decompose the problem is to create a parser for the numeral "5", and a parser for the rest of the expression "+ 4 * 3" (which we would again decompose into a parsers for the symbol "+" and the expression "4 * 3" and so on.) We then look for a way to combine these parsers into a parser for the whole expression. The combined parser shall execute the parsing steps one after the other. In the process of doing that we need to make decisions based on the result of the previous step. One such decision is if and how to continue parsing the rest of the input or not. If we want to continue, we must pass on the intermediate result to the following step.

Here's the signature of such a method that takes a description of the future parsing steps, and returns a new parser that will execute the current step combined with those future ones:

default <V> SimpleParser<V> then(Function<? super T, SimpleParser<V>> f)

The returned parser fails if the application of the current parser to the input string fails, and otherwise applies the function f to the result value to give a second parser q, which is then applied to the output string to give the final result. Thus, the result value produced by this parser is made directly available for processing in the remaining steps by binding it to a lambda variable in a function describing those remaining steps. For example, the expression

p.then(x -> output(x));

may be read as: Parse the input using the parser p, and then, calling the result "x", apply output() to x.

The method then corresponds to flatMap on Java Stream and Optional, resp. to CompletableFuture.thenCompose(), which are the three "monadic" interfaces built into Java 8. These methods all have the property of chaining computations together (in the case of streams like nested for-loops, as I have discussed in several earlier posts.) Here's the implementation. You can see how each line corresponds to a part of the description we have given above.

default <V> SimpleParser<V> then(Function<? super T, SimpleParser<V>> f) {
    Objects.requireNonNull(f);
    SimpleParser<V> pf = inp -> {
        Tuple<T, CharSequence> t = parse(inp);
        if (t.isEmpty()) {
            return empty();
        }
        return f.apply(t.first()).parse(t.second());
    };
    return pf;
}

Let's look what we can do with the building blocks we have assembled so far. Here's a contrived example. The parser p consumes three characters, discards the second, and returns the distance between the first and the third:

@Test 
public void composition() {
    SimpleParser<Integer> p = 
            item().then( x -> 
            item().then( y ->
            item().then( z ->
            output(z - x)
            )));
    assertEquals(tuple(2,"def"), p.parse("abcdef"));
    assertEquals(empty(), p.parse("ab"));
}

Notice how the formatting goes some way to hide the nesting of the constructs. I find the code better readable that way, because conceptually, we are just executing parsing steps in sequence. Note, too, how we are ignoring the variable y in the last two steps. Let's introduce a variant on then that allows us to dispense with this superfluous variable.

default <V> SimpleParser<V> then(Supplier<SimpleParser<V>> p) {
    Objects.requireNonNull(p);
    return then(_v -> p.get());
}

The above parser can now be rewritten without the unused variable:

    SimpleParser<Integer> p = 
            item().then( x -> 
            item().then( () ->
            item().then( z ->
            output(z - x)
            )));

Note the use of Supplier to enforce laziness: One could not simply overload then to accept a parser instance, because that would imply that the argument parser p were immediately constructed by the caller, while otherwise it is constructed only after the current parser has completed successfully. But at least we do no longer have to litter our code with dummy variables. Haskell has nice syntactic sugar ("do-notation") that allows just leaving out the unused variables, and also hides the nesting of the calls (what I have tried to achieve by code formatting).

Also note how there's no reference to the input that is being parsed in the definition of our little parser. There is no explicit variable of type CharSequence being threaded through the calls, or anything like that. All the mechanics of state handling are neatly encapsulated inside then.

Derived parsers

We can now start combining our basic parsers in meaningful ways to derive more parsing primitives. First, we build upon item to define a parser that accepts a certain character only if it satisfies some predicate. And with the help of the utility methods in java.lang.Character we can go on to define parsers for all kinds of character classes, like digits, lower and upper case letters, alphanumeric characters or whitespace. For example:

/** A parser that returns the next character in the input if it satisfies a predicate. */
public static SimpleParser<Character> sat(Predicate<Character> pred) {
    Objects.requireNonNull(pred);
    return item().then(x -> pred.test(x) ? output(x) : failure());
}

/** A parser that returns the next character in the input if it is the given character. */
public static SimpleParser<Character> character(Character c) {
    return sat(x -> c.equals(x));
}

/** A parser that returns the next character in the input if it is a digit. */
public static SimpleParser<Character> digit() {
    return sat(Character::isDigit);
}

/** A parser that returns the next character in the input if it is a whitespace character. */
public static SimpleParser<Character> whitespace() {
    return sat(Character::isWhitespace);
}

With character we can recursively recognize strings.

public static SimpleParser<String> string(String s) {
    Objects.requireNonNull(s);
    return s.isEmpty() ? output("") : character(s.charAt(0)).then(() -> string(s.substring(1)).then(() -> output(s)));

The base case states that the empty string can always be parsed. The recursive case states that a non-empty string can be parsed by parsing the first character, parsing the remaining characters, and returning the entire string as the result value.

Choice and repetition

There are more ways to combine parsers than simple sequencing. We often need to express alternative ways of parsing one string ("choice"), and we need repetition, where a string is parsed by the repeated application of the same parser.

/** Applies the given parser if this parser fails. */
default SimpleParser<T> orElse(SimpleParser<T> p) {
    Objects.requireNonNull(p);
    return inp -> {
        Tuple<T, CharSequence> out = parse(inp);
        return out.isEmpty() ? p.parse(inp) : out;
    };

/** Applies this parser at least once or not at all. */
default SimpleParser<List<T>> many() {
    return many1().orElse(unit(emptyList()));
}

/** Applies this parser once and then zero or more times. */
default SimpleParser<List<T>> many1() {
    return then(c -> many().then(cs -> unit(cons(c, cs))));
}

(cons is a utility that adds an element at the front of a persistent list. Think of copying the cs to a new list and adding c at index 0.)

Note that many and many1 are mutually recursive. In particular, the definition for many states that a parser can either be applied at least once or not at all, while the definition for many1 states that it can be applied once and then zero or more times. The result of applying a parser repeatedly is a list of the collected results. At some point, we may want to convert this list back to a single value, e. g. we'd perhaps like to convert a sequence of digits to an integer value.

public static SimpleParser<Integer> nat() {
    return digit().many1()
                  .then(cs -> output(ParserUtils.charsToString(cs)))
                  .then(s -> output(Integer.parseInt(s)));
}

This looks a bit clumsy, after all we are only transforming the result value inside a parser without actually consuming any input. It would be more natural if we had a method map to do that. (BTW, Hutton does not include such a method.) Fortunately, map is easy to define in terms of unit and then. In fact, it is axiomatic that this definition should be valid, otherwise we have made a mistake in the definition of our flatMap-analogue then (probably not with unit, since it is so simple).

default <V> SimpleParser<V> map(Function<? super T, V> f) {
    Objects.requireNonNull(f);
    Function<V, SimpleParser<V>> unit = this::unit;
    return then(unit.compose(f));
}

A new parser is "wrapped" around the output of the function f, as opposed to then, which already takes a parser-bearing function and does not wrap it with another parser. The analogy with map and flatMap on Stream is obvious. Our code for parsing a natural number now becomes more readable. And we can now also use method references.

public static SimpleParser<Integer> nat() {
    return digit().many1()
                  .map(ParserUtils::charsToString)
                  .map(Integer::parseInt);
}

Tokenization

In order to handle spacing, we introduce a new parser token that ignores any space before and after applying some given parser to an input string.

/** A parser that collapses as many whitespace characters as possible. */
public static SimpleParser<String> space() {
    return whitespace().many().map(s -> "");
}

/** A parser that ignores any space around a given parser for a token. */ 
public static <T> SimpleParser<T> token(SimpleParser<T> p) {
    Objects.requireNonNull(p);
    return space().then(() -> p.then(v -> space().then(() -> output(v)))); 
}

We may be interested in a diversity of tokens, according to our target language, such as tokens for variable names and other identifiers, reserved words, operators, numbers etc. For example, we can define

/** A parser for a natural number token. */ 
public static SimpleParser<Integer> natural() {
    return token(nat());
} 

/** A parser for a symbol token. */
public static SimpleParser<String> symbol(String sym) {
    Objects.requireNonNull(sym);
    return token(string(sym));
}

This completes our overview of the parsing primitives. Let's see a few applications.

Lists of numbers

The following test defines a parser for a non-empty list of natural numbers that ignores spacing around tokens. This definition states that such a list begins with an opening square bracket and a natural number, followed by zero or more commas and natural numbers, and concludes with a closing square bracket. Note that the parser should only succeed if a complete list in precisely this format is consumed.

@Test
public void tokenization() {
    SimpleParser<List<Integer>> numList = 
                            symbol("[").then(() ->
                            natural().then(n ->
                                symbol(",").then(() -> natural()).many().then(ns ->
                            symbol("]").then(() ->
                            output(cons(n,ns))))));

    assertEquals(tuple(asList(1,2,3),""), numList.parse("[1,2,3]"));
    assertEquals(tuple(asList(11,22,33),""), numList.parse(" [11,  22, 33 ] "));
    assertEquals(tuple(asList(11,22,33),"abc"), numList.parse(" [11,  22, 33 ] abc"));
    assertEquals(empty(), numList.parse("[1,  2,]"));
}

You get the idea. There is one shortcoming: Detecting failure on account of ill-formed or unused input involves examining the output tuples in caller code. This may not always be what we want.

Representing failure

What would one expect to happen when some input is not accepted? In Java one would certainly expect an exception to be thrown. However, that is not so easy here. Of course, we cannot make then or failure throw an exception instead of returning empty(), because that would abort all possible search paths, not just the one we're currently on. Putting exception handling code for regular control flow into orElse I also consider bad. The best thing to do, I suppose, is to add a top-level method to the parser that would examine the parsing result.

default T eval(CharSequence cs) {
    Objects.requireNonNull(cs);
    Tuple<T, CharSequence> t = parse(cs);
    if (t.isEmpty()) {
        throw new IllegalArgumentException("Invalid input: " + cs);
    }
    CharSequence unusedInput = t.second();
    if (unusedInput.length() > 0) {
        throw new IllegalArgumentException("Unused input: " + unusedInput);
    }
    return t.first();
}

This way, we have also successfully hidden the internal use of tuples from callers of the parser, which is a good thing.

Arithmetic expressions

Hutton presents an extended example for parsing arithmetic expressions. The example serves to show how easy it can be to create a little DSL.

Consider a simple form of arithmetic expressions built up from natural numbers using addition, multiplication, and parentheses. We assume that addition and multiplication associate to the right, and that multiplication has higher priority than addition. Here's an unambiguous BNF grammar for such expressions.

expr ::= term (+ expr | nil)
term ::= factor (∗ term | nil)
factor ::= natural | (expr)
natural ::= 0 | 1 | 2 | · · ·

It is now straightforward to translate this grammar into a parser for expressions, by simply rewriting the rules using our parsing primitives. We choose to have the parser itself evaluate the expression to its integer value, rather than returning some form of tree.

SimpleParser<Integer> expr() {
    return term().then(t ->
                    symbol("+").then(() -> expr().then(e -> output(t + e)))
                    .orElse(output(t)));
}
    
SimpleParser<Integer> term() {
    return factor().then(f ->
                      symbol("*").then(() -> term().then(t -> output(f * t)))
                      .orElse(output(f)));
}
    
SimpleParser<Integer> factor() {
    return natural()
           .orElse(
               symbol("(").then(() ->
               expr().then(e ->
               symbol(")").then(() ->
               output(e)))));
}

I like the look of that. Once you get used to the formalism, writing the parser is as easy as writing the grammar in the first place. Let's test the whole thing:

@Test
public void multiplicationTakesPrecedence() {
    assertThat( expr().eval("2+3*5"), is(17) );
}

Laziness

Notice how we have created functions at each step that only get evaluated when we actually use them. Only then is a new parser instantiated. The construction of the parser is interleaved with its execution. As mentioned above, it is important in this context that then always takes a function, if necessary even a no-argument function, that yields a parser, never directly a parser instance. If that were different, we might actually have an infinite loop, e. g. expr() → term() → factor() → expr(), where the last call to expr() would be that in the "orElse"-clause, because Java as a strict language evaluates method arguments immediately.

Recursive lambdas

You may wonder why I have written the expression parser in terms of methods, instead of declaring instance variables for the functions expr, term, and factor.The reason is that lambdas capture local variables by value when they are created, and we would therefore get a compile-time error: "Cannot reference a field before it is defined". The only way to write recursive lambdas is to make everything static and use fully qualified names (see the Lambda FAQ).

Lessons learned

I feel I have learned something from this exercise, especially about the feel of functional programming in Java. A few particular lessons have been:

"Monadic" interfaces are easy to design once you get your head around flatMap and its relatives. This method in its various guises, in Stream, Optional, CompletableFuture or our SimpleParser, is a powerful tool you need to understand.
When you need to add functionality to your lambdas, default methods in interfaces are the way to go. You can instantiate your functions with lambda expressions, and then combine them fluently in a variety of interesting ways.This is also what the JDK does, e. g. the Function interface has a default method compose for functional composition.
Our parser interface has rather many (in fact, eight) default methods. All these default methods are strongly coupled. It is generally accepted that this is bad when designing for inheritance (cf. what Josh Bloch has to say in Effective Java, 2nd ed., items 17 and 18). Perhaps that is a reason to feel a bit queasy about the approach.Which leads me to the next point.
Functional interfaces need not be designed for inheritance, so that in contrast to "ordinary" interfaces, strong coupling is acceptable: I have adopted a practice according to which I do not extend functional interfaces and implement them only with lambdas. (It's nevertheless OK to extend another functional interface in order to specialize a generic type variable or add a few little static utilities. That is what UnaryOperator and BinaryOperator do, and they are the only two of the 57 functional interfaces in the JDK to extend another interface.)
Functions are great for re-usability. Being implemented via interfaces, they are usually public. As a consequence, data structures tend to leak all over the place. (We had to take special care to hide the use of tuples.) I guess that this is part of the functional mind-set: While in the OO-world, we encapsulate and hide as much we can (with reason), this is less important in the functional world.
Don't clutter your interfaces with too many static methods. It is sometimes difficult to decide where a static method should go (e. g. output and failure are so essential, should they have been put into the parser interface?).
Java is really not a functional programming language.Stay with idiomatic Java as far as possible. Some of the things that I found to be missing in Java to make it more "functional" are

persistent collections, tuples (you can pull-in third party libraries)
some important syntactic sugar like do-notation, inconspicuous unused variables (perhaps underscore only) etc. Nested constructs can quickly become difficult to read.
the ability to write non-static recursive lambdas.
a nice syntax for functional application.

Beware of Java's eagerness, prefer lazy idioms.
Don't kill your backtracking with exceptions.
Do not throw checked exceptions from your code. Sooner or later (probably sooner) it's going to be a pain when you want it to work with lambdas, because all the most important standard functional methods in the JDK - like Function#apply() - do not have a throws-clause.

In my IDE (Eclipse Mars) I have been annoyed by two things in particular. I don't know if other IDEs offer a better experience.

Debugging lambdas is a pain. (As an exercise, you might replace equals with == in the character parser, pretend you didn't know that you made this silly mistake, and try to fix it.) Unit testing becomes all the more important. Here are a few negative experiences I have made:

Being restricted to line breakpoints is bad when you're interested only in a part of those nice one-liners.
In order to set breakpoints inside a lambda, you need blocks, again bad for elegance.
When you're inside the lambda, you may see synthetic members named arg$1 etc. without being really able to tell what they are. It helps a bit when you at least show the declared type in the Variables view.
Stepping into a lambda is not really an option, because of all the synthetic stuff in between.

Formatting lambdas is not always easy (I often do it manually, case by case, because I don't always like what Eclipse does automatically).

And do have a look at Hutton's book!

Sebastian Millies: Dev-Blog

SyntaxHighlighter

10 Jan 2016

DSLs with Graham Hutton's Functional Parser