Friday, January 8, 2010

BZLIB/PARSEQ.plt - (3) Common Parsers API

Previously we looked at the fundamental parsers and the combinators API. Now it is time to look at some common parsers provided by bzlib/parseq.

Since these parsers are built on top of the fundamental parsers and combinators, we will show each definition along with its description.

Character Category Parsers

digit is a character between #\0 and #\9.

(define digit (char-between #\0 #\9)) 
not-digit is a character not between #\0 and #\9.

(define not-digit (char-not-between #\0 #\9))
lower-case is a character between #\a and #\z.

(define lower-case (char-between #\a #\z)) 
upper-case is a character between #\A and #\Z.

(define upper-case (char-between #\A #\Z))
alpha is either a lower-case or an upper-case character.

(define alpha (choice lower-case upper-case)) 
alphanumeric is either an alpha character or a digit character.

(define alphanumeric (choice alpha digit)) 
whitespace is either a space, return, newline, tab, or vertical tab.

(define whitespace (char-in '(#\space #\return #\newline #\tab #\vtab)))
not-whitespace is a character that is not a whitespace.

(define not-whitespace (char-not-in '(#\space #\return #\newline #\tab #\vtab)))
whitespaces parses for zero or more whitespace characters:

(define whitespaces (zero-many whitespace))
ascii is a character with a code between 0 and 127:

(define ascii (char-between (integer->char 0) (integer->char 127)))
word is either an alphanumeric or an underscore:

(define word (choice alphanumeric (char= #\_)))
not-word is a character that is not a word:

(define not-word (char-when (lambda (c) 
                              (not (or (char<=? #\a c #\z)
                                       (char<=? #\A c #\Z)
                                       (char<=? #\0 c #\9) 
                                       (char=? c #\_))))))
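To see which characters not-word accepts without loading the library, the test it passes to char-when can be pulled out as a plain predicate. This is only a sketch: not-word-char? is a hypothetical name, and `#lang racket` is assumed so the file runs on its own.

```scheme
#lang racket
;; Hypothetical helper: the same test that not-word hands to char-when,
;; written as a standalone predicate.
(define (not-word-char? c)
  (not (or (char<=? #\a c #\z)
           (char<=? #\A c #\Z)
           (char<=? #\0 c #\9)
           (char=? c #\_))))

(not-word-char? #\-)      ; #t, a hyphen is not a word character
(not-word-char? #\space)  ; #t, neither is whitespace
(not-word-char? #\_)      ; #f, underscore counts as a word character
(not-word-char? #\q)      ; #f, so does any ASCII letter
```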
Finally, newline parses for either CR, LF, or CRLF:


(define newline 
  (choice (seq r <- (char= #\return) 
               n <- (char= #\newline)
               (return (list r n)))
          (char= #\return)
          (char= #\newline)))
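The CRLF alternative is listed first in newline. Assuming choice tries its alternatives in order and keeps the first success, a lone CR listed first would stop after one character. The toy model below is not bzlib/parseq's implementation: a parser here is just a function from a character list to a (value . rest) pair or #f, and char=*, pair* and choice* are hypothetical stand-ins.

```scheme
#lang racket
;; Toy model only: parser = (list of chars) -> (cons value rest) or #f.
(define ((char=* ch) in)
  (and (pair? in) (char=? (car in) ch) (cons (car in) (cdr in))))

;; Run p1 then p2 and collect both values in a list.
(define ((pair* p1 p2) in)
  (let ([r1 (p1 in)])
    (and r1
         (let ([r2 (p2 (cdr r1))])
           (and r2 (cons (list (car r1) (car r2)) (cdr r2)))))))

;; Ordered choice: the first alternative that succeeds wins.
(define ((choice* . ps) in)
  (ormap (lambda (p) (p in)) ps))

(define crlf-first
  (choice* (pair* (char=* #\return) (char=* #\newline))
           (char=* #\return)
           (char=* #\newline)))

(define cr-first
  (choice* (char=* #\return)
           (pair* (char=* #\return) (char=* #\newline))
           (char=* #\newline)))

(define input (string->list "\r\nabc"))
(crlf-first input) ; consumes CR and LF, leaving a b c
(cr-first input)   ; consumes only CR, leaving LF at the front of the input
```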

Number Parsers

sign parses for an optional - and defaults to + when it is absent (as written, an explicit leading + is not consumed).

(define sign (zero-one (char= #\-) #\+))
natural parses for 1+ digits:

(define natural (one-many digit)) 
decimal parses for a number with a decimal point:

(define decimal (seq number <- (zero-many digit)
                     point <- (char= #\.)
                     decimals <- natural 
                     (return (append number (cons point decimals)))))
positive parses for either natural or decimal. Note decimal needs to be placed first since natural will succeed when parsing a decimal:

(define positive (choice decimal natural)) 
The above parsers return the characters that represent positive numbers. To make them return numbers, and to parse both positive and negative numbers, we have a couple of helpers:

;; make-signed will parse for the sign and the number.
(define (make-signed parser)
  (seq +/- <- sign
       number <- parser 
       (return (cons +/- number)))) 

;; make-number will convert the parsed digits into number. 
(define (make-number parser)
  (seq n <- parser 
       (return (string->number (list->string n)))))
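The conversion step in make-number can be tried by hand with standard functions. In the sketch below the character lists are written out as illustrations of what natural, decimal and make-signed would return. They are not output captured from the library, and `#lang racket` is assumed.

```scheme
#lang racket
;; Illustrative character lists, shaped like the results of the parsers above.
(define natural-chars (list #\1 #\2 #\3))                     ; as from natural
(define decimal-chars (append '(#\3) (cons #\. '(#\1 #\4))))  ; as built by decimal
(define signed-chars  (cons #\- natural-chars))               ; as built by make-signed

;; make-number applies exactly this conversion to the parsed characters.
(string->number (list->string natural-chars))             ; 123
(string->number (list->string decimal-chars))             ; 3.14
(string->number (list->string signed-chars))              ; -123
(string->number (list->string (cons #\+ natural-chars)))  ; 123, with the default sign
```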
Then natural-number parses and returns a natural number:

(define natural-number (make-number natural))
integer will parse and return an integer (signed):

(define integer (make-number (make-signed natural))) 
positive-number will parse and return a positive number (integer or real):

(define positive-number (make-number positive)) 
real-number will parse and return a signed number, integer or real:

(define real-number (make-number (make-signed positive)))

String Parsers

The following parsers parse for a quoted string and return the inner content as a string.

escaped-char parses for characters that are part of an escape sequence. This exists for sequences such as \n (which should return a #\newline), and for sequences such as \" (which should return just "):

(define (escaped-char escape char (as #f)) 
  (seq (char= escape) 
       c <- (if (char? char) (char= char) char)
       (return (if as as c)))) 

;; e-newline 
(define e-newline (escaped-char #\\ #\n #\newline)) 

;; e-return 
(define e-return (escaped-char #\\ #\r #\return)) 

;; e-tab 
(define e-tab (escaped-char #\\ #\t #\tab)) 

;; e-backslash 
(define e-backslash (escaped-char #\\ #\\))
quoted parses for the quoted string pattern (including escapes):

;; quoted 
;; a specific string-based bracket parser 
(define (quoted open close escape)
  (seq (char= open) 
       atoms <- (zero-many (choice e-newline 
                                   e-return 
                                   e-tab 
                                   e-backslash 
                                   (escaped-char escape close) 
                                   (char-not-in  (list close #\\)))) 
       (char= close)
       (return atoms)))
make-quoted-string abstracts the use of quoted.

(define (make-quoted-string open (close #f) (escape #\\)) 
  (seq v <- (quoted open (if close close open) escape)
       (return (list->string v))))
Then single-quoted-string and double-quoted-string look like the following:

(define single-quoted-string (make-quoted-string #\'))

(define double-quoted-string (make-quoted-string #\"))
Finally, quoted-string will parse both single-quoted-string and double-quoted-string:

(define quoted-string 
  (choice single-quoted-string double-quoted-string))
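To make the escape handling concrete, here is a hand-written loop that decodes the body of a quoted string the way the alternatives in quoted are described above. It is a different technique from the combinator version and not part of bzlib/parseq: unescape-body is a hypothetical helper, the escape character is assumed to be a backslash, and the input is assumed to be the characters between the quotes.

```scheme
#lang racket
;; Hypothetical helper: decode \n, \r, \t, \\ and the escaped close quote.
(define (unescape-body chars close)
  (let loop ([cs chars] [acc '()])
    (cond
      [(null? cs) (list->string (reverse acc))]
      [(char=? (car cs) #\\)
       (when (null? (cdr cs))
         (error 'unescape-body "dangling escape"))
       (let* ([next (cadr cs)]
              [decoded (cond [(char=? next #\n) #\newline]
                             [(char=? next #\r) #\return]
                             [(char=? next #\t) #\tab]
                             [(char=? next #\\) #\\]
                             [(char=? next close) close]
                             [else (error 'unescape-body
                                          "unknown escape: ~a" next)])])
         (loop (cddr cs) (cons decoded acc)))]
      ;; Any other character is kept as-is.
      [else (loop (cdr cs) (cons (car cs) acc))])))

(unescape-body (string->list "a\\nb") #\")           ; a, newline, b
(unescape-body (string->list "say \\\"hi\\\"") #\")  ; say "hi"
```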

That is it for now - we will talk about parsing tokens next. Enjoy.
