Friday, October 16, 2009

Building Parser Combinators in Scheme

I do not write code for the sake of writing code, but sometimes the best way to understand a concept is to write it up. I can think of many situations where I have no clue what the heck is going on by reading the documentations or even the tutorials, but I start to understand what's going on once I start to hack up with a solution.

Parser combinator is one of those things for me. Based on what I can find, Haskell was where the term is coined and expanded, but unfortunately I cannot read Haskell well enough to fully follow the excellent monadic parser combinator paper, let along trying to understand Parsec, and think about how it would be translated into scheme.

PLT Scheme contains a parser combinator library, but its documentation is written for someone who is completely familiar with the parser combinator and how it would work within scheme, so I couldn't start with it (in contrast, I was able to figure out parser-tools in short order, even though my previous experience with lex & yacc isn't a lot either).

Searching online, there are a couple of parser combinator written in scheme:
It wasn't much, but it was something that I can build on. Let's get started...

Immediate Obstacle - Reading from Port Instead of String or List

Parser combinator promises top-down parser development in almost BNF-like fashion, including infinitely look ahead and unlimited backtracking, all of which are extremely fascinating powers.

But reading the monadic parser combinator leaves me scratching my head wondering how this would work with ports. The parser combinator appears to derive its backtracking power via reading the string into a list of characters. This is nice and all but it doesn't work with ports, since once a character is read from a port it is gone and you lose the ability to backtrack (yes - with some ports it is possible to backtrack but this does not work with all ports).

I wasn't able to find an answer to this problem via the above URLs either, so I'll have to come up with my own solution.

What I came up with was the following:

  • we'll use the peek-* functions instead of the read-* functions to access the port data
  • we'll keep track of the byte counts that we have accessed so far, and we'll pass the byte count as a skip-ahead to the next parser
  • we'll only do a single read at the end to remove everything we have found so far - i.e. if the parse failed we'll never remove a single character from the port
By adhering to the above approach we now can backtrack without side-effect on ports as well (the side-effect only occur once at the end of the parse when we have a successful match).

With the above in mind - let's start building our parser combinators.

A Simple Example

Let's try to parse a very simple example - a symbol that contains only alphanumeric characters, with the first character being alpha:

(parse-symbol (open-input-string "asymbol1 ")) ;; => "asymbol1" 
So what should parse token look like? If it looks something like the following it would be good.

(make-parser (seq alpha (zero-more (one-of alpha numeric))) 
  (lambda (a1 lst) 
    (list->string (cons a1 lst))) 
Look at the above bolded line - you can almost read it as the definition of the token: a sequence of an alpha character, followed by zero or more of either an alpha or a numeric character. All we need to do is to build the seq, alpha, numeric, one-of, and zero-more.

We should start with the simplest of the above, which would be the numeric parser - we can implement it as the following:

(define (numeric in (skip 0)) 
  (let ((c (peek-char in skip)))
    (if (char<=? #\0 c #\9)
        (values c (add1 skip)) 
        (values #f #f)))) 
Basically - if the next character is one of the numeric characters, we will return the character and the updated count (which is increment of the skip count). Otherwise we return #f for both, with the second indicating that we did not read anything. Note it is the second value indicating whether or not we have a successful parse, because we want to allow #f as a legal return value from a successful parse. This means if you want to backtrack you'll have to keep track of the skip in your calling function in case the parse fails.

Although this example is "simple" - it already tells us exactly what a parser would look like:
  1. peek something from the port
  2. do some test to determine whether the peek returns the desired data
  3. if it does - return the data and update the count appropriately - this is a "success"
  4. otherwise return #f #f - this is a "fail"
All of the parsers will basically follow this structure.

Now - both the success and the fail can be refactored out as their own parsers:

(define (fail in (skip 0))
  (values #f #f)) 

(define (return v) ;; we return the value without consuming from the port 
  (lambda (in (skip 0)) 
    (values v skip))) 
Then numeric can be rewritten as:

(define (numeric in (skip 0)) 
  (let ((c (peek-char in skip)))
    (if (char<=? #\0 c #\9)
        ((return c) in (add1 skip))
        (fail in skip)))) 
This makes it more flexible to construct simpler higher level parser combinators.

The alpha can be written as follows:

(define (alpha in (skip 0)) 
  (let ((c (peek-char in skip))) 
    (if (or (char<=? #\a c #\z) (char<=? #\A c #\Z))
        ((return c) in (add1 skip)) 
        (fail in skip)))) 
It ought to be clear that we can refactor alpha and numeric to make it more succinct:

(define (char-test test?) 
  (lambda (in (skip 0)) 
    (let ((c (peek-char in skip)))
      (if (and (char? c) (test? c)) 
          ((return c) in (add1 skip)) 
          (fail in skip)))))

(define alpha 
  (char-test (lambda (c) 
               (or (char=<? #\a c #\z) (char=<? #\A c #\Z)))))

(define numeric 
  (char-test (lambda (c)
               (char=<? #\0 c #\9))))
In this case - char-test is a higher order parser that takes in a predicate to create a parser. We now have a base to create more character-based parsers. We'll do so in the next post. Stay tuned.


  1. Another approach is to use the stream SRFI.

    (port->stream input)

    Will give you a stream, which resembles a list in Haskell (the members aren't evaluated until you request them, and they are memoised).

    That's what I did when I explored this stuff.

  2. Bob - thanks for the idea on using stream. Appreciated.