Monday, October 19, 2009

Building Parser Combinators in Scheme (2) - Higher Order Combinators

Previously we started building parser combinators for parsing a symbol that has the following signature (seq alpha (zero-more (one-of alpha numeric))), and we stopped at refactoring the numeric and alpha parsers, and ended up with return, fail, and char-test.

char-test Expanded

char-test gives us a good base for building a bunch of other character-based parser:

;; char= returns a parser to test whether the next char equals c
(define (char= c)
  (char-test (lambda (it)
               (char=? it c)))) 

;; in-chars returns a parser to match against the list of chars
(define (in-chars lst)
  (char-test (lambda (it) 
               (member it lst)))) 

;; not-in-chars is the reverse of in-chars 
(define (not-in-chars lst)
  (char-test (lambda (it) 
               (not (member it lst))))) 

;; in-char-range returns a parser to match char between from & to chars
(define (in-char-range from to)
  (char-test (lambda (it)
               (char<=? from it to))))

;; the oppposite of in-char-range
(define (not-in-char-range from to)
  (char-test (lambda (it)
               (not (char<=? from it to)))))
So we can build parsers such as the following:

;; a parser to test for the backslash
(define backslash (char= #\\)) 
;; writing numeric using in-chars
(define numeric (in-chars '(#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)))
;; writing equivalent to regexp \S (non-whitespace) 
(define non-whitespace (not-in-chars '(#\tab #\return #\newline #\space #\vtab)))
;; lower-case alpha using in-char-range 
(define lower-case (in-char-range #\a #\z)))
;; upper-case alpha using in-char-range
(define upper-case (in-char-range #\A #\Z))) 
It would be nice to write alpha in terms of upper-case and lower-case as defined above:

(define alpha (one-of upper-case lower-case)) 
So let's see how we can write one-of.

Higher order Parsers

The idea of one-of is straight forward - take in a list of the parsers to test against the input, one parser at a time. The first one that succeeded would be returned:


(define (one-of . test)
  (lambda (in (skip 0)) 
    (let loop ((rest test))
      (if (null? rest)
          (fail in skip)
          (let-values (((v count)
                        ((car rest) in skip)))
            (if (not count)
                (loop (cdr rest))
                ((return v) in count)))))))
Now we can then write the following:

(define alpha (one-of (in-char-range #\a #\z)
                      (in-char-range #\A #\Z)))

(define alpha-numeric (one-of numeric alpha)) 
one-of is basically a parser combinator that acts like an or expression. We can also have a combinator that acts like an and:

(define (all-of . test)
  (lambda (in (skip 0)) 
    (let loop ((rest test)
               (v #f)
               (count skip))
      (if (null? rest)
          ((return v) in count)
          (let-values (((v count)
                        ((car rest) in skip)))
            (if (not count)
                (fail in skip)
                (loop (cdr rest) v count)))))))
Notice that all-of will return the value and the count from the last match (the same behavior as and), but all the previous tests also need to match. It would of course be the user's responsibility to construct valid combination that will pass.

If we want to parse multiple matches in succession (i.e. one after another), we need a sequence combinator:

(define (seq . test)
  (lambda (in (skip 0)) 
    (let loop ((rest test)
               (acc '()) 
               (count skip))
      (if (null? rest)
          ((return (reverse acc)) in count)
          (let-values (((v count)
                        ((car rest) in count)))
            (if (not count) ;; we are done! 
                (fail in skip)
                (loop (cdr rest) (cons v acc) count)))))))
This will allow us to parse a sequence of tokens, for example, the below demonstrates parsing a social security number in nnn-nn-nnnn form:

(define dash (char= #\-)) 
(define ssn (seq numeric numeric numeric dash numeric numeric dash numeric numeric numeric numeric)) 
Now it can be bothersome to write repeated numerics like above, so we can have a repeat parser:

(define (repeat test n) 
  (lambda (in (skip 0)) 
    (let loop ((i 0)
               (acc '())
               (count skip))
      (let-values (((v count)
                    (test in count))) 
        (cond ((not count) ;; failed before reaching n
               (fail in skip))
              ((= i (sub1 n)) ;; succeeded
               ((return (reverse (cons v acc))) in count))
              (else
               (loop (add1 i) (cons v acc) count)))))))
Then ssn can be written as:

(define NN (repeat numeric 2)) 
(define NNN (repeat numeric 3)) 
(define NNNN (repeat numeric 4)) 
(define ssn (seq NNN dash NN dash NNNNN)) 
repeat parses for fixed numbers of repeats - what if we want to have unbounded repeats? Let's try to build zero-many that'll match for zero or more occurrences:

(define (zero-many test) 
  (lambda (in (skip 0)) 
    (let loop ((acc '())
               (count skip)) 
      (let-values (((v new-count)
                    (test in count)))
        (if (not new-count) ;; we are done... 
            ((return (reverse acc)) in count)
            (loop (cons v acc) new-count))))))

Which we can then use to build one-many that must have at least one match:

(define (one-many test)
  (lambda (in (skip 0))
    (let-values (((v count)
                  (test in skip))) 
      (if (not count)
          (fail in skip)
          (let-values (((out count)
                        ((zero-many test) in count)))
            ((return (cons v out)) in count))))))
And of course there is a special test of zero or one occurrence:

(define (zero-one test)
  (lambda (in (skip 0)) 
    (let-values (((v count)
                  (test in skip))) 
      (if (not count) 
          ((return #f) in skip)
          ((return v) in count)))))
With all the above we now finally can construct the symbol parse:

(define symbol (seq alpha (zero-many (one-of alpha numeric))))
Which will return the following when parsing a symbol:

> (symbol (open-input-string "asymbol1 "))
(#\a (#\s #\y #\m #\b #\o #\l #\1))
8
The first is the read value - notice that they are listed according to the position within the sequence (#\a matches alpha, and (#\s #\y #\m #\b #\o #\l #\1) matches (zero-many (one-of alpha numeric))). And the second value indicating the bytes "peeked", which is 8.

The final step is to have a parser that allow us to take in the value and transform them according to our needs:

(define (make-parser test return)
  (lambda (in (skip 0)) 
    (let-values (((v count)
                  (test in skip))) 
      (cond ((not count)
             (values #f #f))
            (else
             (read-bytes count in)
             (values (apply return v) count))))))
So we can build the final symbol parser as following:

(define parse-symbol 
  (make-parser (seq alpha (zero-many (one-of alpha numeric)))
    (lambda (alpha lst) 
      (list->string (cons alpha lst)))))
which will return a string for us instead of the above args:

> (parse-symbol (open-input-string "asymbol1 "))
"asymbol1"
8
At this point we have the most of the basic parser combinators constructed, and the rest is to fill in the details as necessary. We'll take a look at how to improve the parser to handle more complex scenarios in the future posts. Stay tuned.

2 comments:

  1. Hi YC,

    do you also plan to put this on planet (when it's ready)?
    Or do you regard it as exploratory code, given that there seems to be already a combinator parser library for PLT (you link to a documentation file, but I don't know where the library itself would be - also I agree that if this is already ALL the documentation available for it, I'd have a problem :-;) )
    In any case, thanks for writing posts like this - this one motivated me to start a first look at something I'd have regarded as "not yet accessible for me at the time present" before :-)

    cheers
    Sigrid

    ReplyDelete
  2. Sigrid -

    yes when the code is ready I will put it on planet. Macros gets all the glamor for the ability to create DSL, but I am more excited about the ease of creating parsers that can parse any arbitrary input (this is not as easy as macro of course), and that's why I am putting effort into it. I am glad you find it useful.

    ReplyDelete