Monday, January 18, 2010

BZLIB/PLANET.plt - A Local PLANET Proxy/Repository Server

PLT's planet system is great - if you want to install a planet module, you just declare it in the code, and it will automatically manage the download and the install for you, without you having to separately run another module installation programs like Perl's CPAN, Ruby's GEM, PHP's PEAR, etc.

However, planet can still be improved. Specifically, as there is currently only a single central repository, you might experience some inconvenient server outage from time to time. And it is difficult to take advantage of the planet's automatic code distribution power unless you plan on releasing the code for public consumption.

That is - until today.

BZLIB/PLANET.plt is designed to solve exactly the issue of a single planet repository. Going forward, you can install bzlib/planet and run a local planet proxy and repository. bzlib/planet is, of course as usual, available via planet under LGPL ;)


bzlib/planet contains a proxy server that you can setup and run. The first thing to do is to install it via require:

(require (planet bzlib/planet/proxy)) 

Friday, January 8, 2010

BZLIB/PARSEQ.plt - (4) Token-Based Parsers API

Previously we have looked at fundamental parsers and the combinators API, as well as common parsers for character, number, and string, now it is time to look at token-based parsers provided by bzlib/parseq.

A huge class of parsing involves tokenizing the streams by skipping over whitespaces. For example, if we want to parse for a list of 3 integer, separated by comma, we currently have to write:

(define three-int-by-comma 
  (seq whitespaces 
       i1 <- integer 
       i2 <- integer 
       i3 <- integer 
       (return (list i1 i2 i3)))) 
The code above looks messy, and it would be nice if we do not have to explicitly specify the parsing of whitespaces. token allows us to abstract away the parsing of whitespaces:

(define (token parser (delim whitespaces)) 
  (seq delim 
       t <- parser 
       (return t))) 
The above code can now be rewritten as:

(define three-int-by-comma2 
  (seq i1 <- (token integer) 
       (token #\,) 
       i2 <- (token integer) 
       (token #\,) 
       i3 <- (token integer) 
       (return (list i1 i2 i3)))) 
Which looks a lot better. But given tokenizing is such a common parsing task, we have a shorthand for the above called tokens:

(define-macro (tokens . exps) 
  (define (body exps) 
    (match exps 
      ((list exp) (list exp)) 
      ((list-rest v '<- exp rest) 
       `(,v <- (token ,exp) . ,(body rest)))
      ((list-rest exp rest) 
       `((token ,exp) . ,(body rest)))))
  `(seq . ,(body exps)))
Which will reduce the above parsing to the following:

(define three-int-by-comma3
  (tokens i1 <- integer 
          i2 <- integer 
          i3 <- integer 
          (return (list i1 i2 i3)))) 
There is a case insensitive version of tokens called tokens-ci that allows the character and string token to be parsed in case insensitive fashion.

Besides tokenizing, another common need in token-based parsing is to handle delimited sets. In the above example, the 3 integers are delimited by commas. delimited generalize the pattern:

(define (delimited parser delim) 
  (tokens v <- parser 
          v2 <- (zero-many (tokens v3 <- delim
                                   v4 <- parser
                                   (return v4)))
          (return (cons v v2))))
The following parses a list of comma-delimited integers:

(delimited integer #\,) 
Another common pattern is to parse for brackets that surrounds the value that you need. Just about all programming languages have such constructs. And bracket handles such parses:

(define (bracket open parser close) 
  (tokens open
          v <- parser 
          (return v))) 
And bracket/delimited combines the case where you need to parse a bracketed delimited values:

(define (bracket/delimited open parser delim close) 
  (tokens open ;; even the parser is optional...  
          v <- (zero-one (delimited parser delim) '()) 
          (return v)))

That's it for the bzlib/parseq API. If you find anything missing, please let me know.


BZLIB/PARSEQ.plt - (3) Common Parsers API

Previously we have looked at fundamental parsers and the combinators API, now it is time to look at some common parsers provided by bzlib/parseq.

In this case, since we are constructing these parsers on top of the fundamental parsers and combinators, we will show the definitions accordingly.

Character Category Parsers

digit is a character between #\0 and #\9.

(define digit (char-between #\0 #\9)) 
not-digit is a character not between #\0 and #\9.

(define not-digit (char-not-between #\0 #\9))
lower-case is a character beween #\a and #\z.

(define lower-case (char-between #\a #\z)) 
upper-case is a character between #\A and #\Z.

(define upper-case (char-between #\A #\Z))
alpha is either an lower-case or upper-case character.

(define alpha (choice lower-case upper-case)) 
alphanumeric is either an alpha character or a digit character.

(define alphanumeric (choice alpha digit)) 
whitespace is either a space, return, newline, tab, or vertical tab.

(define whitespace (char-in '(#\space #\return #\newline #\tab #\vtab)))
not-whitespace is a character that is not a whitespace.

(define not-whitespace (char-not-in '(#\space #\return #\newline #\tab #\vtab)))
whitespaces parses for zero or more whitespace characters:

(define whitespaces (zero-many whitespace))
ascii is a charater bewteen 0 to 127:

(define ascii (char-between (integer->char 0) (integer->char 127)))
word is either an alphanumeric or an underscore:

(define word (choice alphanumeric (char= #\_)))
not-word is a character that is not a word:

(define not-word (char-when (lambda (c) 
                              (not (or (char<=? #\a c #\z)
                                       (char<=? #\A c #\Z)
                                       (char<=? #\0 c #\9) 
                                       (char=? c #\_))))))
Finally, newline parses for either CR, LF, or CRLF:

(define newline 
  (choice (seq r <- (char= #\return) 
               n <- (char= #\newline)
               (return (list r n)))
          (char= #\return)
          (char= #\newline)))

Number Parsers

sign parses for either + or -, and defaults to +.

(define sign (zero-one (char= #\-) #\+))
natural parses for 1+ digits:

(define natural (one-many digit)) 
decimal parses for a number with decimal points:

(define decimal (seq number <- (zero-many digit)
                     point <- (char= #\.)
                     decimals <- natural 
                     (return (append number (cons point decimals)))))
positive parses for either natural or decimal. Note decimal needs to be placed first since natural will succeed when parsing a decimal:

(define positive (choice decimal natural)) 
The above parsers returns the characters that represents the positive numbers. To get it to return numbers, as well as parsing for both positive and negative numbers, we have a couple of helpers:

;; make-signed will parse for the sign and the number.
(define (make-signed parser)
  (seq +/- <- sign
       number <- parser 
       (return (cons +/- number)))) 

;; make-number will convert the parsed digits into number. 
(define (make-number parser)
  (seq n <- parser 
       (return (string->number (list->string n)))))
Then natural-number parses and returns a natural number:

(define natural-number (make-number natural))
integer will parse and returns an integer (signed):

(define integer (make-number (make-signed natural))) 
positive-number will parse and return a positive number (integer or real):

(define positive-number (make-number positive)) 
real-number will parse and return a signed number, integer or real:

(define positive-number (make-number positive)) 

String Parsers

The following parsers parses for quoted string and returns the inner content as a string.

escaped-char parses for characters that were part of an escaped sequence. This exists for characters such as \n (which should return a #\newline), and character such as \" (which should return just "):

(define (escaped-char escape char (as #f)) 
  (seq (char= escape) 
       c <- (if (char? char) (char= char) char)
       (return (if as as c)))) 

;; e-newline 
(define e-newline (escaped-char #\\ #\n #\newline)) 

;; e-return 
(define e-return (escaped-char #\\ #\r #\return)) 

;; e-tab 
(define e-tab (escaped-char #\\ #\t #\tab)) 

;; e-backslash 
(define e-backslash (escaped-char #\\ #\\))
quoted parses for the quoted string pattern (including escapes):

;; quoted 
;; a specific string-based bracket parser 
(define (quoted open close escape)
  (seq (char= open) 
       atoms <- (zero-many (choice e-newline 
                                   (escaped-char escape close) 
                                   (char-not-in  (list close #\\)))) 
       (char= close)
       (return atoms)))
make-quoted-string abstracts the use of quoted.

(define (make-quoted-string open (close #f) (escape #\\)) 
  (seq v <- (quoted open (if close close open) escape)
       (return (list->string v))))
Then single-quoted-string and double-quoted-string look like the following:

(define single-quoted-string (make-quoted-string #\'))

(define double-quoted-string (make-quoted-string #\"))
Finally, quoted-string will parse both single-quoted-string and double-quoted-string:

(define quoted-string 
  (choice single-quoted-string double-quoted-string))

That is it for now - we will talk about parsing tokens next. Enjoy.

Wednesday, January 6, 2010

BZLIB/PARSEQ.PLT - (2) Parser Combinators API

[Continuing the previous post on the API of bzlib/parseq]

Previously we have looked at the basic parsers that peeked into inputs, now it's time to look at the combinators.

Basic Parser Combinators

bind is the "bind operator" for the parsers, with the following signature:

(-> Parser/c (-> any/c Parser/c) Parser/c) 
It takes in a parser, a "transform" function that will consume the returned value from the first parser and transform the value into another parser, and combine both into another parser.

As there are other more specialized combinators, you should not need to use bind directly unless you are trying to compose functions during run-time.

result simplifies the transform function you need to write so your transform function only need to return the value, not another parser. Below is the definition of result:

(define (result parser helper)
  (bind parser 
        (lambda (v) 
          (if v 
              (return (helper v))  
result* works like result, but it only works with a return value that is a list, and it applies the transform against the value. This comes in handy when you know the parser returns a list and you want to bind the list to individual arguments in your transform function. Below is an example (sequence* combinator is explained later):

(result* (sequence* (char= #\b) (char= #\a) (char= #\r)) 
         (lambda (b a r) ;; b maps to #\b, a maps to #\a, and r maps to #\r 
            (list->string (list b a r))))
Multi-Parser Combinators

seq is a macro-based combinator that bounds multiple parsers in succession, and returns the results only if all parsers succeed. The design of this parser is inspired by Shaurz's Haskell-style parser combinator.

Tuesday, January 5, 2010

BZLIB/PARSEQ.plt - a Monadic Parser Combinator Library

BZLIB/PARSEQ.plt is now available via PLANET. Inspired by Haskell's Parsec and Shaurz's Haskell-style parser combinator, bzlib/parsec provides a monadic parser combinator library that can handle both character and binary data parsing.

If you need a refresher on parser combinators, you can read my previous posts for a quick tour.


(require (planet bzlib/parseq)) 

The package includes the following examples that you can inspect for the usage of the API:

(require (planet bzlib/parseq/example/csv) ;; a customizable csv parser 
         (planet bzlib/parseq/example/calc) ;; a simple calculator with parens 
         (planet bzlib/parseq/example/regex) ;; a simplified regular expression parser & evaluator 
         (planet bzlib/parseq/example/sql) ;; parsing SQL's create table statement 
         (planet bzlib/parseq/example/json) ;; parses JSON data format 
Parser Type Signature & Input
A parser is a function that has the following signature:

(-> Input/c (values any/c Input/c)) ;; returns the value and the next input 
If the value is #f then the parse has failed. This might be changed in the future to another value so you can return #f as a parsed value.

The input is a struct with the following structure:

(define-struct input (source pos) #:prefab) 
It is an abstraction over an input-port so you can keep track of the current position on the port. The function make-input will take in an input-port, a string, or a byte and return an input struct with the position initiated to 0.

During the parsing, the values are peeked instead of read so we can backtrack to the beginning. That means when you finished parsing, all of the data are still in the port. make-reader wraps over a parser so you can just pass an input-port instead of needing to create an input struct, and it also consumed the bytes if the parse is successful:

(make-reader parser) ;; => (-> input-port? any/c) 
Fundamental Parsers (input not peeked)

The following parsers do not peek into the input.

(return <v>) returns <v> that you specify.

fail returns a failed struct that includes the position of the parse when the parser fails. The failed struct is currently defined as follows:

(define-struct failed (pos) #:prefab) 
succeeded? and failed? tests whether the returned value is a failed struct (succeeded? equals (compose not failed?)).

SOF (start of file) will return 'sof if it is the start of the input (i.e., position = 0).

Fundamental Parsers (input peeked)

item peeks the input, test the input to ensure it satisfies a criteria, and if so, returns the value and advance the port by the size of the peeked value:

(item <peek> <isa?> <satisfy?> <size>) 
peek => (-> Input/c any/c) 
isa? => (-> any/c boolean?) 
satisfy? => (-> any/c any) 
size => (-> any/c exact-integer?) 
isa? tests for the return value's type so you can simplify the writing of satisfy?, which can assume the value is of the right type.

You use item only when you need to create new parsers that the library do not already provide.

Non-Character Type Parsers

bzlib/parseq allows non-character parsers so you can use it to parse binary data instead of just text streams. You can mix them together of course.

(bytes= <bytes>) returns when the next set of bytes equals the passed in bytes. For example: