Wednesday, January 6, 2010

BZLIB/PARSEQ.PLT - (2) Parser Combinators API

[Continuing the previous post on the API of bzlib/parseq]

Previously we have looked at the basic parsers that peeked into inputs, now it's time to look at the combinators.

Basic Parser Combinators

bind is the "bind operator" for the parsers, with the following signature:

(-> Parser/c (-> any/c Parser/c) Parser/c) 
It takes in a parser, a "transform" function that will consume the returned value from the first parser and transform the value into another parser, and combine both into another parser.

As there are other more specialized combinators, you should not need to use bind directly unless you are trying to compose functions during run-time.

result simplifies the transform function you need to write so your transform function only need to return the value, not another parser. Below is the definition of result:

(define (result parser helper)
  (bind parser 
        (lambda (v) 
          (if v 
              (return (helper v))  
              fail))))
result* works like result, but it only works with a return value that is a list, and it applies the transform against the value. This comes in handy when you know the parser returns a list and you want to bind the list to individual arguments in your transform function. Below is an example (sequence* combinator is explained later):

(result* (sequence* (char= #\b) (char= #\a) (char= #\r)) 
         (lambda (b a r) ;; b maps to #\b, a maps to #\a, and r maps to #\r 
            (list->string (list b a r))))
Multi-Parser Combinators

seq is a macro-based combinator that bounds multiple parsers in succession, and returns the results only if all parsers succeed. The design of this parser is inspired by Shaurz's Haskell-style parser combinator.

The default form of seq is (seq <parser>), which is equivalent of <parser>. An seq expression must end in the first form.

The second form of seq allows you to create lexical binding against the value of the intermediate parser so you can manipulate the result with the last parser. An example to parse 2 separate digit and sum by with the first being multiplied by 2 and the second multiplied by 4:

(seq a <- (char= #\0 #\9) 
     b <- (char= #\0 #\9) 
     (return (+ (* 2 (string->number (string a))) 
                (* 4 (string->number (string b)))))) 
This form, however, cannot be the last statement in seq, i.e., you must have other parser statements after a binding (otherwise why do you need the binding?).

The last form of seq behaves like the second form but does not create a binding, so it looks like the first form.

An example is to parse away the parentheses surrounding a digit:

(seq (char= #\() ;; 3rd form - no lexical binding 
     digit <- (char= #\0 #\9) ;; 2nd form - binds digit 
     (char= #\)) ;; 3rd form - no lexical binding 
     (return (string->number (string digit))) ;; 1st form 
     )
seq is perhaps the most versatile combinator that you will use to create other combinators and parsers.

sequence is the functional version of seq that takes a list of parsers as its argument:

(sequence (list parser ...)) 
sequence has the following limitations:
  • sequence do not allow for custom bindings
  • sequence cannot be used to define recursive parser (its inner parser must be defined first)
  • sequence do not handle custom transformation of the results - you should use it with result* to bind the transformation (see above).

sequence* is the variable args version of sequence, i.e. (sequence* parser ...).

choice is a macro combinator that combines mutliple parsers so the first one that succeeds will return the value:

;; returns if the next char is either #\a, #\b, or #\c.  Else fail. 
(choice (char= #\a) (char= #\b) (char= #\c)) 

one-of is the functional version of choice that takes a list of parsers as its argument:

(one-of (list parser ...)) 
Since this is an function combinator, unlike choice, you cannot define recursive parser with it.

one-of* is the variable arg version of one-of, i.e., (one-of* parser ...).

all-of requires all of the inner parsers to succeed, and it returns the result from the last parser:

;; an contrived example: parse successfully if the next char is #\e 
(all-of (list alpha (char-between #\d #\f) (char= #\e))) 
Remember it is the last result being returned, so if you inner parsers requires different amount of bytes matched, make sure the one that peek the most bytes gets matched last.

all-of* is the variable arg list version of all-of.

Repetition-Based Combinators

repeat is the most flexible repetition-based combinators:

(repeat <parser> <minimum> <maximum>) 
The minimum argument defaults to 1, and the maximum argument defaults to positive infinity, which corresponds to the one-many combinator:

(define (one-many parser) 
  (repeat parser 1 +inf.0)) 
And zero-many maps to:

(define (zero-many parser) 
  (repeat parser 0 +inf.0)) 
Note that all three parsers will return a list that contains each of the successful parses. zero-many will return a list even if the inner parser does not match, so zero-many will always successfully parse.

zero-one is a special repetition-based combinator in that it does not return a list. Instead, it allows you to parse for a single occurence, and if it fails, you can specify a default value to substitute for the failure:

;; example of parsing for "foo" but returns "bar" if the parse failed. 
(zero-one (string= "foo") "bar") 
Hence zero-one will always succeed as well.

That's it for the combinators. We'll discuss some pre-defined parsers in the next post. Stay tuned.

No comments:

Post a Comment