Tuesday, January 5, 2010

BZLIB/PARSEQ.plt - a Monadic Parser Combinator Library

BZLIB/PARSEQ.plt is now available via PLANET. Inspired by Haskell's Parsec and Shaurz's Haskell-style parser combinator, bzlib/parsec provides a monadic parser combinator library that can handle both character and binary data parsing.

If you need a refresher on parser combinators, you can read my previous posts for a quick tour.


(require (planet bzlib/parseq)) 

The package includes the following examples that you can inspect for the usage of the API:

(require (planet bzlib/parseq/example/csv) ;; a customizable csv parser 
         (planet bzlib/parseq/example/calc) ;; a simple calculator with parens 
         (planet bzlib/parseq/example/regex) ;; a simplified regular expression parser & evaluator 
         (planet bzlib/parseq/example/sql) ;; parsing SQL's create table statement 
         (planet bzlib/parseq/example/json) ;; parses JSON data format 
Parser Type Signature & Input
A parser is a function that has the following signature:

(-> Input/c (values any/c Input/c)) ;; returns the value and the next input 
If the value is #f then the parse has failed. This might be changed in the future to another value so you can return #f as a parsed value.

The input is a struct with the following structure:

(define-struct input (source pos) #:prefab) 
It is an abstraction over an input-port so you can keep track of the current position on the port. The function make-input will take in an input-port, a string, or a byte and return an input struct with the position initiated to 0.

During the parsing, the values are peeked instead of read so we can backtrack to the beginning. That means when you finished parsing, all of the data are still in the port. make-reader wraps over a parser so you can just pass an input-port instead of needing to create an input struct, and it also consumed the bytes if the parse is successful:

(make-reader parser) ;; => (-> input-port? any/c) 
Fundamental Parsers (input not peeked)

The following parsers do not peek into the input.

(return <v>) returns <v> that you specify.

fail returns a failed struct that includes the position of the parse when the parser fails. The failed struct is currently defined as follows:

(define-struct failed (pos) #:prefab) 
succeeded? and failed? tests whether the returned value is a failed struct (succeeded? equals (compose not failed?)).

SOF (start of file) will return 'sof if it is the start of the input (i.e., position = 0).

Fundamental Parsers (input peeked)

item peeks the input, test the input to ensure it satisfies a criteria, and if so, returns the value and advance the port by the size of the peeked value:

(item <peek> <isa?> <satisfy?> <size>) 
peek => (-> Input/c any/c) 
isa? => (-> any/c boolean?) 
satisfy? => (-> any/c any) 
size => (-> any/c exact-integer?) 
isa? tests for the return value's type so you can simplify the writing of satisfy?, which can assume the value is of the right type.

You use item only when you need to create new parsers that the library do not already provide.

Non-Character Type Parsers

bzlib/parseq allows non-character parsers so you can use it to parse binary data instead of just text streams. You can mix them together of course.

(bytes= <bytes>) returns when the next set of bytes equals the passed in bytes. For example:

> ((bytes= #"foo") (make-input "foo bar"))
#s(input # 3)
(string= <string>) returns when the next set of bytes (in string) equals the passed in string.

(string-ci= <string>) is the case-insensitive version of string=.

(byte= <byte>) returns when the next byte equals the passed in byte.

(bits= <bits>) returns the next byte when it equals the passed in bits (a list of 8 0's or 1's).

byte= and bits= are built on top of byte-when, which is built on top of item. byte-when has the following signature:

(byte-when <satisfy?> (<isa?> byte?) (<size> (the-number 1))) 
;; (the-number <n>) returns a lambda that returns <n>  
EOF is also built on top of byte-when as (byte-when identity eof-object? (the-number 0)).

Character-Based Parsers

The counterpart to byte-when for character-based parsers is char-when, which has the following signature:

(item <satisfy?> (<isa?> char?) char-utf-8-length)

The following are built on top of char-when:

(char= <c>) returns <c> when the next character equals <c>.

(char-not= <c>) is the opposite of char=.

(char-ci=? <c>) is the ci (case-insensitive) version of char=.

(char-not-ci=? <c>) is the opposite of char-ci=?.

(char-between <lc> <hc>) returns the next char when it falls between <lc> and <hc>.

(char-not-between <lc> <hc>) is the opposite of char-between.

(char-ci-between <lc> <hc>) is the ci version of char-between.

(char-ci-not-between <lc> <hc>) is the opposite of char-ci-between.

(char-in <chars>) returns the next char when it is one of the characters in <chars>.

(char-not-in <chars>) is the opposite of char-in.

(char-ci-in <chars>) is the ci version of char-in.

(char-ci-not-in <chars>) is the opposite of char-ci-in.

Literal Parsers

literal is used to abstract the parsers that basically performs an equal comparison (e.g., string=, byte=, etc), as well as allowing an inner parser to pass through, so you do not have to explicitly choose between char=, string=, etc., based on the argument. Example:

(literal #\a) ;; => (char= #\a) 
(literal "abc") ;; => (string= "abc") 
(literal any-byte) => any-byte 

literal-ci is the case-insensitive version of literal, the difference is that it will return the case-insensitive parser for character and string:

(literal-ci #\a) ;; => (char-ci= #\a) 
(literal-ci "abc") ;; => (string-ci= "abc") 
(literal-ci #"abc") ;; => (literal #"abc") 

That about sums it up for the basic parsers. The next post will document the combinators. Stay tuned.

1 comment:

  1. Dear all

    I am a student at AL-QUDS University in Palestine, I need your help
    as possible .

    Please help me by showing me how can I use Scheme or by DrRacket software to
    build the parser for this Grammar:

    expression -> number expression_tail
    expression_tail -> + expression expression_tail

    expression_tail -> Ø

    Example on valid expression: "5 + 4 + 3 + 6 +7”