Friday, January 8, 2010

BZLIB/PARSEQ.plt - (4) Token-Based Parsers API

Previously we have looked at fundamental parsers and the combinators API, as well as common parsers for character, number, and string, now it is time to look at token-based parsers provided by bzlib/parseq.

A huge class of parsing involves tokenizing the streams by skipping over whitespaces. For example, if we want to parse for a list of 3 integer, separated by comma, we currently have to write:

(define three-int-by-comma 
  (seq whitespaces 
       i1 <- integer 
       whitespaces 
       #\, 
       whitespaces 
       i2 <- integer 
       whitespaces 
       #\, 
       whitespaces 
       i3 <- integer 
       (return (list i1 i2 i3)))) 
The code above looks messy, and it would be nice if we do not have to explicitly specify the parsing of whitespaces. token allows us to abstract away the parsing of whitespaces:

(define (token parser (delim whitespaces)) 
  (seq delim 
       t <- parser 
       (return t))) 
The above code can now be rewritten as:

(define three-int-by-comma2 
  (seq i1 <- (token integer) 
       (token #\,) 
       i2 <- (token integer) 
       (token #\,) 
       i3 <- (token integer) 
       (return (list i1 i2 i3)))) 
Which looks a lot better. But given tokenizing is such a common parsing task, we have a shorthand for the above called tokens:

(define-macro (tokens . exps) 
  (define (body exps) 
    (match exps 
      ((list exp) (list exp)) 
      ((list-rest v '<- exp rest) 
       `(,v <- (token ,exp) . ,(body rest)))
      ((list-rest exp rest) 
       `((token ,exp) . ,(body rest)))))
  `(seq . ,(body exps)))
Which will reduce the above parsing to the following:

(define three-int-by-comma3
  (tokens i1 <- integer 
          #\,
          i2 <- integer 
          #\, 
          i3 <- integer 
          (return (list i1 i2 i3)))) 
There is a case insensitive version of tokens called tokens-ci that allows the character and string token to be parsed in case insensitive fashion.

Besides tokenizing, another common need in token-based parsing is to handle delimited sets. In the above example, the 3 integers are delimited by commas. delimited generalize the pattern:

(define (delimited parser delim) 
  (tokens v <- parser 
          v2 <- (zero-many (tokens v3 <- delim
                                   v4 <- parser
                                   (return v4)))
          (return (cons v v2))))
The following parses a list of comma-delimited integers:

(delimited integer #\,) 
Another common pattern is to parse for brackets that surrounds the value that you need. Just about all programming languages have such constructs. And bracket handles such parses:

(define (bracket open parser close) 
  (tokens open
          v <- parser 
          close 
          (return v))) 
And bracket/delimited combines the case where you need to parse a bracketed delimited values:

(define (bracket/delimited open parser delim close) 
  (tokens open ;; even the parser is optional...  
          v <- (zero-one (delimited parser delim) '()) 
          close 
          (return v)))

That's it for the bzlib/parseq API. If you find anything missing, please let me know.

Enjoy.

2 comments:

  1. This library is quite an awesome find! I'm currently playing with combinator parsing in conjunction with a system very similar to google's "protobuf"s as a way of implementing a simple bittorrent client in scheme, and I had implemented a decent chunk of my own combinator parsing library before I thought of checking PLaneT. Thanks for writing this!

    ReplyDelete