Wednesday, September 30, 2009

DBI and SQL Escape

Scott Hickey has discovered a bug in DBI:

;; assume you have a table1 with an id and a date field. 
(exec h "insert into table1 values (?id , ?date)" `((id . 1) (date . ,(srfi19:current-date))))
;; => regexp-replace*: expects type <string> as 2nd argument, given:
;;    #(struct:tm:date 9150000 8 19 0 30 9 2009 -18000); other arguments were: #px"\\'" "''"
This issue is now fixed, and a newer version of DBI is now available through planet. This post documents the issue and the resolution of the bug.

This issue is caused by the default SQL escape code, which does not know how to handle srfi date objects. The default SQL escape code is quite primitive, for the reasons discussed below.

To work around the problem - you can use prepared statements:

(prepare h 'insert-table1 "insert into table1 values (?id , ?date)")
(exec h 'insert-table1 `((id . 1) (date . ,(srfi19:current-date))))
The default preparation code exists as a prepared statement proxy for those database drivers that have no prepared statement capability. This is one of the selling points of DBI over other APIs - queries always allow parameterization. But that means DBI cannot delegate the task of SQL escaping back to the user.

Because my usage of databases has always centered around prepared statements, I never wrote an extensive SQL escaping library (and hence the bug). Plus, there are technical reasons why prepared statements are superior:
  • SQL escapes do not work uniformly across all databases, especially for types such as blobs and date objects, for which each database has its own syntax (and some basically discourage using SQL escapes for blobs)
  • SQL escapes are prone to SQL injection if done poorly. One of my previous gigs was to weed out SQL injection bugs in a client code base, and while the concept is simple, many implementations still got it wrong
  • In general, prepared statements will have better performance for multiple uses (though this is unfortunately not always true if the cached query plan results in a miss by the server)
Prepared statements (and stored procedures) are superior to SQL escapes in just about all aspects, including performance and security. There are only three downsides that I am aware of for prepared statements:
  • it might cause the database to hold onto the referenced objects so they cannot be dropped - this mainly impacts development environments, since in production that actually helps prevent tragic accidents such as dropping tables, views, etc.
  • it might not work well for code that creates dynamic SQL statements referring to tables with unique prefixes (WordPress and a fair amount of PHP code fall into this design style), since there might be thousands of such unique prefixes in a given database. In general, such a design really should be discouraged, since databases are designed for a few large tables rather than many small tables
  • it is not all that useful and can potentially be slower for one-off statements, but most of the time this is a non-issue
Anyhow - the reason I am highlighting the merits of prepared statements over SQL escapes is that I believe prepared statements are the way to go, especially for databases that already have them. So I decided to make the database drivers dbd-spgsql, dbd-jsqlite, and dbd-jazmysql implicitly create prepared statements when you do not explicitly name the query via the prepare call.

So - you can just write:

(exec h "insert into table1 values (?id , ?date)" `((id . 1) (date . ,(srfi19:current-date))))
And it will behave as if you do the following:

;; note - prepare only takes a symbol as the key - so you cannot do this manually yourself
(prepare h "insert into table1 values (?id , ?date)" "insert into table1 values (?id , ?date)")
(exec h "insert into table1 values (?id , ?date)" `((id . 1) (date . ,(srfi19:current-date))))
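Under the hood, the idea is simply to treat the query string itself as the cache key. Below is a minimal self-contained sketch of the idea (illustrative only - low-level-prepare and low-level-exec are hypothetical stand-ins for the actual driver internals):

;; keep a per-handle table mapping query strings to prepared statements
(define prepared (make-hash)) ;; query string -> prepared statement

(define (exec/implicit-prepare handle stmt args)
  ;; prepare on first use, then reuse the cached prepared statement
  (unless (hash-ref prepared stmt #f)
    (hash-set! prepared stmt (low-level-prepare handle stmt)))
  (low-level-exec handle (hash-ref prepared stmt) args))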
So going forward, dbd-spgsql, dbd-jazmysql, and dbd-jsqlite will no longer use SQL escapes for parameterization. The updated packages are available via planet.

Thank you Scott for discovering and reporting this issue.

Handling Time Zone in Scheme (4): Rule Conversion

Previously we have discussed timezone handling in a series of posts:
  1. motivation and overview
  2. parsing zoneinfo database
  3. calculating offsets
Continuing from the third post, where we were in the midst of calculating daylight saving offsets: we have figured out the applicable rules, and we now need to convert them into date structs so we can determine the exact boundaries.

Going back to our two rule examples for America/Los_Angeles:

  (2007 +inf.0 - 3 (match 0 >= 8) (2 0 0 w) 3600 "D")
  (2007 +inf.0 - 11 (match 0 >= 1) (2 0 0 w) 0 "S")
We want to convert them into the applicable values (for both 2009 and the previous year - 2008):

2009/3/8 02:00:00-08:00 
2009/11/1 02:00:00-07:00
2008/3/9 02:00:00-08:00 
2008/11/2 02:00:00-07:00 
In order to do so, we'll first have to convert the ON (day of the month) field into the correct date value, and then convert the AT (time of day) field into the correct time value. Let's get started.

Day of the Month

The simplest ON format is a day number (ranging from 1-31), and for that we do not have to do much. But there are also two other formats that are based on weekdays:

'(last 0) ;; => last sunday (sunday = 0, monday = 1 ..., saturday = 6) 
'(match 0 >= 5) ;; => sunday on or after 5th 
'(match 2 <= 10) ;; => tuesday on or before 10th 

That means we need to be able to convert them to the appropriate day of the month based on the year and the month.

Doomsday Algorithm and Weekday Calculation

To calculate the weekday-based date values, we first need to be able to calculate the weekday of a particular date. For that we can make use of the doomsday algorithm, which is based on the concept that there is a doomsday every month, and they are easy to remember via a mnemonic (4/4, 6/6, 8/8, 10/10, 12/12, ...). The linked explanation makes it sound more complicated than it actually is - below is the doomsday anchor calculation in scheme:

(define (doomsday y)
  ;; note: each year component must be floored separately - flooring
  ;; the sum gives the wrong anchor for years such as 2008
  (modulo (+ 2 y (quotient y 4) (- (quotient y 100)) (quotient y 400)) 7))
Then with doomsday we can calculate the weekday of a date:
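The archive cuts the post off here, but the rest of the calculation follows directly: anchor on the month's doomsday and offset by the day difference. A hedged sketch (the per-month doomsday table below is the standard one, not necessarily the post's own code):

;; the day of each month that falls on the doomsday weekday
;; (january and february shift in leap years)
(define (leap-year? y)
  (and (zero? (modulo y 4))
       (or (not (zero? (modulo y 100))) (zero? (modulo y 400)))))

(define (doomsday-of-month y m)
  (vector-ref (if (leap-year? y)
                  #(4 29 7 4 9 6 11 8 5 10 7 12)
                  #(3 28 7 4 9 6 11 8 5 10 7 12))
              (sub1 m)))

;; weekday of y/m/d (sunday = 0 ... saturday = 6)
(define (weekday y m d)
  (modulo (+ (doomsday y) (- d (doomsday-of-month y m))) 7))

;; (weekday 2009 3 8) => 0, i.e. 2009/3/8 is indeed a sunday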

Tuesday, September 29, 2009

Handling Time Zone in Scheme (3): Computing Offsets

This is the third post of the timezone handling series. Please see the previous posts for additional details:
  1. motivation and overview
  2. parsing zoneinfo database
At this point we are able to parse the zoneinfo database files into a giant structure that contains all of the applicable zones and rules. Let's see how to utilize the data.

There are two main purposes for using the time zone information. The first is to figure out the actual offset for a given zone on a particular date:
(tz-offset (build-date 2009 7 1 ...) "America/Los_Angeles") ;; => -25200 
(tz-offset (build-date 2009 12 1 ...) "America/Los_Angeles") ;; => -28800 
And the second is to convert the same date between different timezones:

(tz-convert (build-date 2009 7 1 0 0 0) "America/Los_Angeles" "GMT") 
;; => (date 2009 7 1 7 0 0 0 0) 
(tz-convert (build-date 2009 7 1 0 0 0) "America/Los_Angeles" "Asia/Taipei") 
;; => (date 2009 7 1 15 0 0 0 28800)

It should be clear that tz-convert can be built on top of tz-offset, so we first should figure out how tz-offset computes the offsets.
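As a quick illustration of that layering, here is a hedged sketch of tz-convert (date->seconds and seconds->date stand in for whatever wall-clock/epoch helpers the library actually uses):

;; treat date as wall-clock time in the from zone, recover the UTC
;; instant, then re-express that instant in the to zone
(define (tz-convert date from to)
  (let* ((utc (- (date->seconds date) (tz-offset date from)))
         (offset (tz-offset (seconds->date utc) to)))
    (seconds->date (+ utc offset))))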

Basic Idea
The zoneinfo database structure more or less denotes how we should be making use of the library:
  1. look up the zone - find the zone that you are looking for
  2. based on the zone, map to the correct set of rules by testing which rules are applicable (based on the UNTIL field of the zone value)
  3. apply the set of rules against the date to find the proper daylight saving offset

With that in mind, let's first load the parsed structure into something more usable: a hashtable holding the zones by name, where each zone in turn holds references to its applicable rules, which we can then use to compute the actual offsets. Assuming the zoneinfo database has been serialized to a file, the following loads the data back in:

(define (make-zoneinfo zi)
  (make-links (make-zones (caddr zi) (make-rules (cadr zi)))
              (cadddr zi)))

(make-zoneinfo (call-with-input-file path read))
Basically - we first load the rules, then pass them into make-zones, whose result is then passed to make-links to complete the hashtable.

With the following structure definitions for zone and rule, we have:

(define-struct zone (offset rule format until) #:transparent)

(define-struct rule (from to type month date time offset format) #:transparent)

(define (make-rules rules)
  (make-immutable-hash (map (lambda (kv)
                              (cons (car kv)
                                    (map (lambda (rule)
                                           (apply make-rule rule)) 
                                         (cdr kv))))
                            (cdr rules))))

(define (make-zones zones rules)
  (make-immutable-hash (map (lambda (kv)
                              (cons (car kv)
                                    (map (lambda (zone)
                                           (define (helper offset rule format until)
                                             (make-zone offset (hash-ref rules rule #f) format until))
                                           (apply helper zone))
                                         (cdr kv))))
                            (cdr zones))))

And make-links left-folds over the zones and the links list:
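The original listing is cut off in the archive; below is a hedged sketch of make-links, assuming each link record is a (target alias) pair per the zic Link format:

(define (make-links zones links)
  ;; alias each link name to the same zone records as its target
  (foldl (lambda (link zones)
           (hash-set zones (cadr link) (hash-ref zones (car link))))
         zones
         (cdr links)))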

Monday, September 28, 2009

Handling Time Zone in Scheme (2): Parsing The Zoneinfo Database

Continuing from the motivation post on the difficulty of handling timezones, we'll now focus on solving the first problem - parsing the zoneinfo database.

As previously discussed, the zoneinfo database format is pretty straightforward:
  • comments start with # and run until the end of the line
  • records are line oriented, and there are no line continuations
  • there are 3 main record types:
    • rule - describes daylight saving rules
    • zone - describes different time zones and how the offset (and the applicable daylight saving rules) change over the years
    • link - describes aliases for the different zones

What we want to accomplish is to parse the zoneinfo database and convert it into scheme values that are more consumable. Let's try to map them out.

Overall Structure

It would be nice if we could parse the files into a structure that looks like:

(zoneinfo (rule ("US" rule1 ...) ...) 
          (zone ("America/New_York" zone-rule1 ...) ...) 
          (link (Link1 ...) ...)) 
The above value would be easy to load back into memory (just a single read operation), and all of the related rules are grouped together.

Since the records are line oriented, when we parse them they will look like the following:

(rule "US" ...)
(rule "US" ...) 
...
(zone "America/New_York" ...) 
(zone "America/New_York" ...) 
...
(link ...) 
...
Assuming we are able to generate the above records, we can group them via the group function we introduced during the development of the memcached client (a sketch follows the code below): first group on rule, zone, and link; then within each group, further group on the sub key; and finally cons 'zoneinfo in front.

;; 2-layer grouping 
(cons 'zoneinfo 
      (map (lambda (kv)
             (cons (car kv)
                   (group (cdr kv))))
           (group zis)))
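For reference, here is a hedged sketch of the group helper (the actual version comes from the memcached series): it buckets a list of records by their first element, strips that element from each record, and preserves first-seen key order.

(define (group lsts)
  (if (null? lsts)
      '()
      (let ((key (caar lsts)))
        (cons (cons key (map cdr (filter (lambda (l) (equal? (car l) key))
                                         lsts)))
              (group (filter (lambda (l) (not (equal? (car l) key)))
                             lsts))))))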
Once the transformation is done - we just need to serialize the data out to a file.

(call-with-output-file <target>
  (lambda (out)
     (print zoneinfo out)) 
  #:exists 'replace) 
Then we just need to make sure that we can parse the files (yes - the zoneinfo database consists of multiple files) so we can generate the records to feed into the 2-layer grouping. Since appending the results from parsing multiple files is trivial, we'll just discuss the parsing of a single file.

Parsing The Zoneinfo File

PLT Scheme offers a lot of fantastic support for parsing files. We will not go through it all here, but at a high level there are the parser tools, which are basically lex and yacc in scheme, and there are also parser combinators that you can use to construct monadic combinator parsers. At a low level, you can use the read-* functions to read just about anything you want from an input port.
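As a hedged illustration of the low-level route (not necessarily the post's actual parser), a minimal line reader that strips comments, skips blank lines, and splits each record into its whitespace-separated fields might look like:

(define (parse-zoneinfo-port in)
  (let loop ((acc '()))
    (let ((line (read-line in 'any)))
      (if (eof-object? line)
          (reverse acc)
          (let* ((stripped (regexp-replace #px"#.*" line ""))
                 (fields (filter (lambda (s) (> (string-length s) 0))
                                 (regexp-split #px"[ \t]+" stripped))))
            (loop (if (null? fields) acc (cons fields acc))))))))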

Friday, September 25, 2009

Handling Time Zone in Scheme: Motivation & Overview

Timezones are a difficult problem - more difficult than they have to be. Probably the biggest challenge is daylight saving time, which changes with the whims of politicians and governments. Once in a while, a city or state will just arbitrarily decide to change its timezone. How should timezone software handle such changes?

In such a case, we might naturally assume the answer is to keep all of the same hours but for a different timezone. That answer would have been sufficient back when things were not global - but what if some of the appointments are with people from other timezones?

Luckily such situations do not arise very often, and neither does the changing of the daylight saving switch dates. But they illustrate the difficulty of timezone handling.

Insufficient Solution

Most software stores an offset along with a time to denote the timezone offset from GMT, and sometimes also a flag to determine whether daylight saving time is in effect. In PLT Scheme, the PLT date object has both.

(define-struct date (second minute hour day month year week-day year-day dst? time-zone-offset))
But such a solution is brittle in the face of date manipulations, even without the drastic circumstances above. What if you want to calculate the date four months ahead? How do you know whether or not the time should have the same daylight saving offset applied?

A possible solution is to call the C date functions, which can handle the date calculation correctly based on the TZ environment variable. The drawback of that approach is that if your problem needs to be timezone aware, you'll be constantly swapping your environment - and besides serializing all of your threads, that is just undesirable.

Let's see if we can bring timezone handling into scheme.

Zoneinfo Database

How do the C date functions know how to calculate dates? They consult the timezone information in a database called zoneinfo. This database contains all of the past and current timezones and their corresponding offsets, and it is the authoritative source if you need to handle timezones.

On linux/mac, type man zic and you will get the details of the format of the zoneinfo files. The zoneinfo files are line oriented, with comment lines starting with #. The two main constructs we are interested in from the files are the zones and the rules:
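For a flavor, here is a representative (abridged) excerpt in the zic source format - these are the very US rules we will parse in the later posts:

# Rule NAME FROM TO   TYPE IN  ON     AT   SAVE LETTER
Rule   US   2007 max  -    Mar Sun>=8 2:00 1:00 D
Rule   US   2007 max  -    Nov Sun>=1 2:00 0    S
# Zone NAME                GMTOFF RULES FORMAT [UNTIL]
Zone   America/Los_Angeles -8:00  US    P%sT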

Thursday, September 24, 2009

Develop a Memcached Client (4) - BZLIB/DBD-MEMCACHED

This is the "dramatic" conclusion of the memcached client development. You can find additional details in the previous posts:
  1. network development overview
  2. memcached client - storage API
  3. memcached client - retrieval and deletion API, and the rest
  4. distributed hash table frontend for memcached

Alright - it's not that dramatic, but I've made the memcached client available as a DBI driver via planet, so you can use it with the DBI interface.

The nicest thing about DBI is the ease of extension. You can, say, create your own driver that wraps around both the memcached driver and the postgresql driver, so you do not have to sprinkle your code with calls to both drivers everywhere.

You'll find the tutorial for using bzlib/dbd-memcached after the integration section.

Integrate with DBI

The last stop is to integrate both the single instance and the distributed hashtable instance into DBI, so we can use them with the DBI interface.

As you may remember from the create-a-driver-for-bzlib/dbi series, we basically have to create the following functions and register them as a driver with DBI:
  • connect - wrap around the creation of the connection
  • disconnect - disconnect the underlying connection
  • prepare - for memcached it's a NOOP
  • query - the majority of work is here
  • begin-trans (optional) - for memcached it's a NOOP
  • commit (optional) - for memcached it's a NOOP
  • rollback (optional) - for memcached it's a NOOP
Below is the code for the single instance memcached client driver functions, with the registration sketched after the listing - integration of memcached/dht is left as an exercise:

(define (m-connect driver host port)
  (make-handle driver (memcached-connect host port) (make-immutable-hash-registry) 0))

(define (m-disconnect handle)
  (memcached-disconnect (handle-conn handle)))

(define (m-prepare handle stmt)
  (void)) 

(define (make-query set! add! replace! append! prepend! cas! get gets delete! incr! decr! flush-all!)
  (lambda (handle stmt (args '()))
    (let ((client (handle-conn handle)))
      (case stmt 
        ((set! add! replace! append! prepend!)
         (let/assert! ((key (assoc/cdr 'key args))
                       (value (assoc/cdr 'value args))
                       (flags (assoc/cdr 'flags args 0))
                       (exp-time (assoc/cdr 'exp-time args 0)))
                      ((case stmt
                         ((set!) set!)
                         ((add!) add!)
                         ((replace!) replace!)
                         ((append!) append!)
                         ((prepend!) prepend!))
                       client key value 
                       #:exp-time exp-time #:flags flags 
                       #:noreply? (assoc/cdr 'noreply? args))))
        ((cas!)
         (let/assert! ((key (assoc/cdr 'key args))
                       (value (assoc/cdr 'value args))
                       (cas (assoc/cdr 'cas args))
                       (flags (assoc/cdr 'flags args 0))
                       (exp-time (assoc/cdr 'exp-time args 0)))
                      (cas! client key value cas
                            #:exp-time exp-time #:flags flags
                            #:noreply? (assoc/cdr 'noreply? args))))
        ((get gets)
         (cons (list "key" "value" "flags" "cas")
               (apply (case stmt
                        ((get) get)
                        ((gets) gets))
                      client (map cdr (filter (lambda (kv)
                                                (equal? (car kv) 'key))
                                              args)))))
        ((delete!)
         (let/assert! ((key (assoc/cdr 'key args))
                       (delay (assoc/cdr 'delay args 0)))
                      (delete! client key delay (assoc/cdr 'noreply? args))))
        ((incr! decr!)
         (let/assert! ((key (assoc/cdr 'key args))
                       (value (assoc/cdr 'value args)))
                      ((case stmt
                         ((incr!) incr!)
                         ((decr!) decr!))
                       client key value (assoc/cdr 'noreply? args))))
        ((flush-all!) 
         (let/assert! ((delay (assoc/cdr 'delay args 10)))
                      (flush-all! client delay (assoc/cdr 'noreply? args))))
        (else
         (error 'query "invalid stmt: ~a" stmt))))))
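
And the registration itself mirrors the registry-set! pattern from the dbd-file posts. A hedged sketch (the memcached-* bindings are assumed to be the client functions from the earlier installments, and the transaction NOOPs are written out for completeness):

(define (m-begin handle) (void))
(define (m-commit handle) (void))
(define (m-rollback handle) (void))

(registry-set! drivers 'memcached
               (make-driver m-connect
                            m-disconnect
                            (make-query memcached-set! memcached-add!
                                        memcached-replace! memcached-append!
                                        memcached-prepend! memcached-cas!
                                        memcached-get memcached-gets
                                        memcached-delete! memcached-incr!
                                        memcached-decr! memcached-flush-all!)
                            m-prepare
                            m-begin m-commit m-rollback))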

Develop a Memcached Client (3) - Distributed Hash Table

This is a continuation of the network development in PLT series. See the previous posts for more details:
  1. network development overview
  2. memcached client - storage API
  3. memcached client - retrieval and deletion API, and the rest
At this point we have a functional memcached client, but since people generally use memcached as a distributed hash table, what we have is not yet sufficient - we need a frontend to multiple instances of memcached.

The idea is simple - have a structure holding multiple memcached clients:

(define-struct memcached/dht-client (clients)) ;; clients is a vector of memcached clients.
Then we need to ensure that the key gets consistently mapped to the client that holds the data, and ensure the keys are distributed uniformly.

Uniform Hash Distribution

While the concept is pretty straightforward, the algorithm behind hashing is non-trivial. A good string hash function is djb2; below is a scheme implementation:

(define (djb2-hash key)
  (define (convert key)
    (cond ((string? key) (convert (string->bytes/utf-8 key)))
          ((bytes? key) (bytes->list key))
          ((symbol? key) (convert (symbol->string key)))))
  (define (helper bytes hash)
    (cond ((null? bytes) hash)
          (else
           (helper (cdr bytes) (+ (* hash 33) (car bytes))))))
  (helper (convert key) 5381))
Once we convert the key into a hash code, we just need to take the remainder against the number of available memcached instances:

;; get the hash code 
(define (memcached/dht-hash client key)
  (remainder (djb2-hash key) (vector-length (memcached/dht-client-clients client))))
As long as the count and the order of the clients remain the same, we will hash the key to the same client:

;; return the particular client by the key 
(define (memcached/dht-target client key)
  (vector-ref (memcached/dht-client-clients client)
              (memcached/dht-hash client key)))
With these, we can now wrap all of our APIs - we basically maintain the same interface as the single instance API, except that we dispatch through the dht client:
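The archive cuts the listing off here, but each wrapper just routes to the target instance and delegates. A hedged sketch of one of them (the keyword signature of memcached-set! is assumed from the earlier installments):

(define (memcached/dht-set! client key value
                            #:exp-time (exp-time 0)
                            #:flags (flags 0)
                            #:noreply? (noreply? #f))
  (memcached-set! (memcached/dht-target client key) key value
                  #:exp-time exp-time #:flags flags #:noreply? noreply?))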

Wednesday, September 23, 2009

Develop a Memcached Client (2) - Filling Out the API

This is the continuation of the network programming series - see the previous installments for more details:
  1. network development overview
  2. memcached client - storage API
Now that we have the ability to store data into memcached, we want to retrieve the stored data.

Generating Requests

Memcached offers two APIs for this purpose:
  • get - returns the data 
  • gets - returns the data along with the cas id 
(yes - it's only a one character difference... oh well)

Let's take a look at the details of the API:

get <key>*\r\n
gets <key>*\r\n
The <key>* means there can be multiple keys, separated by spaces. We can generate the request with the following:

(define (cmd-get out type keys) 
  (define (keys-helper)
    (let ((out (format "~a" keys)))
      (substring out 1 (sub1 (string-length out)))))
  (display-line out "~a ~a" (case type
                              ((get gets) type)
                              (else (error 'cmd-get "unknown get type: ~a" type)))
                (keys-helper)))
The next step is to parse the input.

Parse Response

The response is a bit harder, and it takes the following form:
  • there can be multiple values returned (one for each key found), and the ending is marked by END\r\n
  • each value starts with a line of VALUE <key> <flags> <bytes-length> [<cas-unique>]\r\n
    • <cas-unique> is a 64-bit integer that uniquely identifies the object; it is only returned if you use gets instead of get
  • each value line is followed by the data block, terminated by \r\n; the data block should have the same length as indicated by <bytes-length>
The following algorithm will handle the above responses:
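The post's own listing is cut off in the archive; below is a hedged sketch of such a loop - read VALUE headers until END, pulling <bytes-length> bytes of data per value:

(define (read-values in)
  (let loop ((acc '()))
    (let ((line (read-line in 'return-linefeed)))
      (cond ((equal? line "END") (reverse acc))
            ((regexp-match #px"^VALUE (\\S+) (\\d+) (\\d+)(?: (\\d+))?$" line)
             => (lambda (m)
                  (let ((data (read-bytes (string->number (list-ref m 3)) in)))
                    (read-line in 'return-linefeed) ;; consume the trailing CRLF
                    (loop (cons (list (list-ref m 1) data
                                      (string->number (list-ref m 2))
                                      (and (list-ref m 4)
                                           (string->number (list-ref m 4))))
                                acc)))))
            (else (error 'read-values "unexpected response: ~a" line))))))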

Developing a Memcached Client (1) - Storage

[This is a continuation of the network development in PLT post]

Memcached is all the rage these days with the high scalability crowd.  Legend has it that as soon as you slap a couple of memcached instances between your web app and your database server, all of your scalability issues go away.  Even if that's glossing over the details, memcached certainly offers advantages over hitting the discs all the time, and we want it in PLT.

Let's try to build a memcached client.

Steps to Build a Client (or a Server)

The steps to build either a client or a server are pretty straightforward (independent of the complexity of the protocol):
  1. study the protocol 
  2. write the "output functions" that sends the info to the other party 
  3. write the "input functions" that parses the input from the other party 
  4. combine the input & output functions, and manage the states of the connection (if applicable)
You are likely to mix the steps up instead of doing them in the order prescribed above, and that's okay.  The steps are numbered to help us think of them as an orderly process, but in practice the process is more iterative.

Alright - let's get started.

Study the Protocol 

The key to building a network interface is to understand the protocol.  It is best if the protocol is open and published, so you have something written that you can study (hopefully in a language you understand).  The next best option is to reverse engineer from an open source implementation, and the worst is to reverse engineer by sniffing the network interactions between the clients and the server.  I am glad memcached's protocol is already published so we do not have to go down the other routes.

Please refer to the memcached protocol doc throughout this process as the official reference.  We'll sprinkle information from the official doc into the posts as necessary.

First a quick overview of the memcached protocol.
  • request/response protocol - there are no unexpected responses that the client has to handle 
  • multiple requests and responses within a single session 
  • *mostly* line-based protocol - the commands are single lines terminated by CRLF; the data is also terminated by CRLF, but its extent is actually determined by a length parameter
  • can store, retrieve, and delete the content (as well as flush) 
  • there are also statistics that can be retrieved, but the protocol appears in flux for the stats (so it'll be out of scope for now) 
  • the protocol itself is *stateless*, as the requests and responses are independent of each other (each request is atomic) 
Overall the protocol is simple yet non-trivial, which gives us a great example for network development.  We'll start with the storage part of the protocol.

Storage API

Memcached has multiple storage commands for different situations:
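The archive cuts the post off here. Per the protocol doc, the storage commands are set, add, replace, append, prepend, and cas, all sharing the shape <command> <key> <flags> <exptime> <bytes> [noreply]\r\n<data>\r\n. A hedged sketch of generating one such request:

(define (cmd-store out cmd key data
                   #:flags (flags 0) #:exp-time (exp-time 0)
                   #:noreply? (noreply? #f))
  ;; emit the command line, then the data block, each CRLF-terminated
  (display (format "~a ~a ~a ~a ~a~a\r\n" cmd key flags exp-time
                   (bytes-length data) (if noreply? " noreply" ""))
           out)
  (display data out)
  (display "\r\n" out)
  (flush-output out))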

Tuesday, September 22, 2009

Network Development in PLT: Overview

One of the nice things about PLT Scheme is that you can write network apps in native scheme, rather than having to use the FFI to interface with C libraries. The nice thing about the native scheme approach is that it works with PLT's threads, unlike the FFI, which blocks all PLT threads (and effectively synchronizes the network access). And according to Geoffrey, the FFI also has the disadvantage of having to worry about whether it expects 32-bit or 64-bit pointers.

The challenge with network development, though, is that we seldom have to do it, so it's hard to know all of the ins and outs. This series of posts will provide a tutorial on network development. The goal is to keep the principles as general as possible so they apply to other programming languages, but obviously this is a scheme blog.

General Concept

We'll quickly go over the basic network architecture concept for the sake of completeness.

In general, networking involves clients and servers interacting by sending information to each other and (if necessary) waiting for responses from the other party.

Clients are programs that send out requests, and servers are programs that "fulfill" the requests (and possibly send back a response).  A program can be both a client and a server at the same time.

The information they send each other are serialized to bytes, which will be reconstructed by the receivers into meaningful representations that they can interpret.

Depending on the nature of the work, clients and servers might only need to send information to each other once and be done (HTTP is one such protocol), but sometimes they need to communicate back and forth to accomplish the task (SMTP is one such protocol), and in such situations it might be necessary for the client and the server to keep track of state in order to manage the work.

So, at a high level architecture, we need to focus on the following in order to do network development:

  • manage the connection (initiation, termination, etc) 
  • serialize and send information to the other party 
  • receive information from the other party and interpret the meaning (and act accordingly)
  • track and manage the state if the protocol requires it 
The above applies to all network development.  The specific details and the associated complexity come down to the specific protocol you are developing.

Let's take a look at what PLT offers to help us handle each of the needs.

Network Connection Management 

By default PLT offers network programming in TCP, UDP, and SSL.  You can require them into your module:

(require scheme/tcp scheme/udp openssl)

We'll focus on just TCP network development in this tutorial, since most network protocols are TCP-based.

Initiate Connections 

If you are developing a client, you can initiate a client connection with:
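The archive cuts the post off here; the standard call is tcp-connect:

;; e.g. open a TCP session - tcp-connect returns an input port and an
;; output port for the connection
(define-values (in out) (tcp-connect "localhost" 11211))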

Monday, September 21, 2009

bzlib/dbd-file - filesystem-based database driver - now available

As previously promised, bzlib/dbd-file, a demonstration of how to extend bzlib/dbi around the filesystem, is now available via planet.

You can revisit the posts that chronicled its development:
  1. Overview of the DBI internals 
  2. a draft driver - list directories and open files
  3. enhance the driver - save files atomically
  4. enhance the driver - delete files atomically 
Prerequisite 

There isn't a prerequisite for this module, but there are a few caveats:
  • No transaction support (only atomicity)
  • Atomicity not guaranteed on Windows (it'll almost work most of the time)
  • No prepared queries 
  • No SQL - this is not a SQL database driver
Installation 

The installation over planet is straightforward:

;; in REPL  
(require (planet bzlib/dbi))
(require (planet bzlib/dbd-file))


Usage 

You should set aside a particular directory as the root for the database.  Let's call that <root-path>.  As with any database, you should not manually access the files within that directory, to minimize the chance of destroying the database.

To connect to the database:

;; use the 'file driver  
(define handle (connect 'file <root-path>))  
;; or use the 'file/rs driver 
(define handle (connect 'file/rs <root-path>))

The difference between the two drivers is that 'file/rs converts the underlying return values into recordset structures, which are compatible with DBI query helper functions such as rows, rows/false, and cell/false, at the cost of being a bit less efficient.

To disconnect from the database:

(disconnect handle) ;; NOOP  

The call itself is unnecessary since there are no external resources to clean up.  The prepare call is also a NOOP.

To "select" the data from the files by paths:

(query handle 'open  `((path . "/path1") (path . "/path/to/file2") ...)) 
;; => a list of bytes, each bytes is the total content of the file 

The paths used with the driver (either returned values or passed arguments) are always jailed absolute paths, which are merged with the root directory to form the actual underlying path.

To "select" the paths under a particular base path:

(query handle 'list `()) ;; list the files in the root directory
(query handle 'list `((path . "/some/path"))) ;; list the files in some path 
;; => returns a list of the jailed paths  

To "insert" or "update" a particular path:

(query handle 'save! `((path . "/some/path") (content . #"some bytes")))  

Both the path and content parameters are required in this particular case.  The file will be saved atomically.

To "delete" files or directories:

;; delete files (not directories)
(query handle 'delete! `((path . "/some/path") (path . "/some/other/path") ...))
;; delete empty directories
(query handle 'rmdir! `((path . "/some/path") (path . "/some/other/path") ...))
;; delete either files or directories 
(query handle 'rm-rf! `((path . "/some/path") (path . "/some/other/path") ...))

'delete! will only delete files, and will error if asked to delete directories. 'rmdir! will only delete empty directories, and will error on files or non-empty directories.  'rm-rf! will delete either files or directories (whether empty or not).  Deletions are atomic on non-Windows platforms.

Notes About Windows

Since Windows locks opened filehandles, and generally has background processes such as antivirus software that randomly open files for inspection, you might find the save and deletion operations erroring out intermittently.  You can of course handle the errors and retry the operations to get around the issue, but fully supporting atomic file-based save/deletion on Windows is out of scope.

That's it for now - enjoy.

BZLIB/FLEXER 0.1 - FLEX Integration with SHP - Now Available

bzlib/flexer is now available through planet.

As described in the FLEX/SHP posts, bzlib/flexer provides integration between FLEX and SHP, allowing you to do the following:
  1. directly embed flash video files by calling (flash <url> ...)
  2. directly code in MXML within SHP and compile it into flash video with (mxml ...)
  3. optimizing the compilation of MXML
The package is released under LGPL. 

Prerequisite

You'll need the Adobe Flex SDK 3 downloaded and installed separately, and make sure that the flex compiler mxmlc is on the system path.

You'll also need bzlib/shp, with the planet version >= 1.2. 

Installation & Configuration 

Install via planet with:

(require (planet bzlib/shp:1:2))
(require (planet bzlib/flexer))

Make sure to require bzlib/flexer in your SHP required script as well:
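;; in the SHP required script (a hedged example)
(require (planet bzlib/flexer))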

Saturday, September 19, 2009

Create a Driver for bzlib/dbi (4) - Concluding Filesystem Driver

This is the fourth installment of the extending-bzlib/dbi series - for a refresher, see the previous posts:
  1. Overview of the DBI internals 
  2. a draft driver - list directories and open files
  3. enhance the driver - save files atomically
We now basically have the SQL equivalent of select, insert, and update.  The next stop is to provide the ability to delete files and directories.

The Deletion Capability 

PLT Scheme offers the following for deleting files and directories:
  • (delete-file file-path) - deletes the path if it is a file; otherwise raises an error
  • (delete-directory dir-path) - deletes the path if it is an empty directory; otherwise raises an error 
  • (delete-directory/files file-or-dir-path) - deletes the path whether it is a file or a directory; if it is a non-empty directory, the sub files and directories are first deleted recursively 
delete-directory/files seems the most convenient to use.  However, there are situations where you want to ensure that you are deleting either a file or a directory, and when deleting a directory you may only want to delete an empty one.  We want our API to reflect that, so we will have 3 separate calls:

;; delete files use delete! 
(query handle 'delete! `((path . path1) ...))
;; delete empty-only directories uses rmdir!
(query handle 'rmdir! `((path . path1) ...))
;; delete either file or directories use rm-rf! 
(query handle 'rm-rf! `((path . path1) ...)) 

Similar to SQL's delete statement, the above queries can delete multiple paths at once.  However, given that transactions are not supported, the deletes are done independently - if any delete fails, the rest stay undeleted.

The following addition to file-query accomplishes the above:
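The archive cuts the listing off here; a hedged sketch of the three new cases, reusing the path-helper jailing helper from the earlier posts (these slot into file-query's case expression):

    ((delete!)
     (for-each (lambda (kv) (delete-file (path-helper (cdr kv))))
               (filter (lambda (kv) (equal? (car kv) 'path)) args)))
    ((rmdir!)
     (for-each (lambda (kv) (delete-directory (path-helper (cdr kv))))
               (filter (lambda (kv) (equal? (car kv) 'path)) args)))
    ((rm-rf!)
     (for-each (lambda (kv) (delete-directory/files (path-helper (cdr kv))))
               (filter (lambda (kv) (equal? (car kv) 'path)) args)))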

Friday, September 18, 2009

Create a Driver for bzlib/dbi (3) - Continuing Filesystem Driver

Previously we discussed the internals of DBI and how to extend it, and crafted the first draft of the filesystem database driver.  We are now on #3 of the DBI extension series.  We'll continue by adding the ability to save data through the driver.

Save The Data

Saving the data is equivalent to both insert and update in SQL vernacular.  The simplest version looks like the below:

(define (file-query handle stmt (args '())) 
  ... 
  (case stmt
    ... 
    ((save!)
     (call-with-output-file (path-helper (assoc/cdr 'path args))
       (lambda (out) 
         (write-bytes (assoc/cdr 'content args) out))
       #:exists 'replace))
    (else 
     (error 'file-query "unknown statement: ~a" stmt))))

With the above we now can save data into a particular file with the following usage:

(query handle 'save! `((path . "/foo/bar/baz.txt") (content . #"this is the content")))  
But unfortunately, there are a few hiccups:
  • there is no verification that the path and content key/value pairs are passed in (for 'list and 'open queries the path key/value pairs are optional) 
  • there is no guarantee that the directory of the path exists (and if it does not, the save will result in an error) 
  • the above save is not an atomic operation and can corrupt the data
So we'll address each of these issues to ensure we have a solid implementation for saving data.

Argument Verification 

To verify the arguments, we can use let/assert! from the next version of bzlib/base (which will be released together with this driver) as follows:

(define (file-query handle stmt (args '())) 
  ... 
  (case stmt
    ... 
    ((save!)
     (let/assert! ((path (assoc/cdr 'path args))
                   (content (assoc/cdr 'content args)))
                  (call-with-output-file (path-helper path)
                    (lambda (out) 
                      (write-bytes content out))
                    #:exists 'replace)))
    (else 
     (error 'file-query "unknown statement: ~a" stmt))))

let/assert! checks whether the bound values are false; if any are, it raises an error, and otherwise it binds the variables and evaluates the inner expressions.  It also behaves like let* rather than let, in that subsequent bindings can see the previous ones.

Guarantee of Directory Path

To ensure the directory for the path already exists, we can utilize make-directory* to create the parent directory for the path; but first we need to make sure the parent path is not a file (so that we can create a directory):
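The listing is cut off in the archive; below is a hedged sketch, where assert! comes from bzlib/base and the helper name is illustrative:

(define (ensure-parent-directory! path)
  (let-values (((dir name must-be-dir?) (split-path path)))
    (when (path? dir)
      (assert! (not (file-exists? dir))) ;; the parent must not be a file
      (make-directory* dir))))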

Thursday, September 17, 2009

Create a Driver for bzlib/dbi (2) - Filesystem Driver

In our previous post - Create a Driver for bzlib/dbi (1) - DBI Internals - we discussed the motivation and the DBI internals so we could get started implementing our filesystem-based driver.  If you find some of the following a bit opaque, read the previous post for explanations.


Caveats about Filesystems and the Driver

Since each filesystem has different behaviors, it is difficult to guarantee particular performance characteristics for the driver. For now the caveats are the following:
  • No transactional support - it takes more effort to build transactional support on a bare filesystem - something we will not tackle for now 
  • No atomic writes on the Windows filesystem - as Windows does not fully support atomic rename (a Windows program holding the file open during the rename will cause an error), we cannot guarantee that writes will succeed on Windows.  Under Unix variants we can guarantee atomic writes 
Furthermore, since our goal is to have a simple driver, the following are also out of scope:
  • SQL statement mappings - for now we will not support SQL statements
  • Prepared statements - since we are not supporting complex SQL mappings, there is no reason to have prepared statement capabilities; i.e. the prepare call will be a no op.
With caveats out of the way - let's determine what we should be able to do at a minimum:
  • manage all files within a single directory as the database data 
  • open files by path and return their contents 
  • listing the files in a particular directory (within the base directory) 
  • save data against a particular path (either an "insert" or an "update" operation) atomically on unix platforms (this might cause errors on Windows, given the limitations of that platform) 
  • delete a particular file 
  • delete a particular directory 
  • create a particular directory (along with any missing intermediate directories) 
Alright - let's get started.

Connect, Disconnect, Prepare, and Transaction Handling 

Since we want to have a directory representing the root of the database, our database connection is really the root directory:

#lang scheme/base 
(require (planet bzlib/base)
         (planet bzlib/dbi)
         )

(define (file-connect driver path) 
  (assert! (directory-exists? path)) ;; assert! comes from bzlib/base 
  (make-handle driver path (make-immutable-hash-registry) 0)) 

Disconnect is even more straightforward, since there aren't any external resources to release:

(define (file-disconnect handle)
  (void)) 

And since prepare is out of scope, it is also a NOOP:

(define (file-prepare handle stmt)
  (void)) 

Furthermore, transaction support is also out of scope - we have more NOOPs:

(define (file-begin handle)
  (void))
(define (file-commit handle)
  (void))
(define (file-rollback handle)
  (void)) 

The default transaction functions will not suffice here since they issue the corresponding SQL statements against the handle.

Assuming we have the corresponding file-query defined we now have a complete driver with:

(registry-set! drivers 'file
               (make-driver file-connect
                            file-disconnect
                            file-query
                            file-prepare
                            file-begin
                            file-commit
                            file-rollback))
    
Now we just need to flesh out file-query, which is the meat of the driver:
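The archive cuts the listing off here; below is a hedged skeleton of the 'open and 'list cases (the save!/delete! cases are developed in the follow-up posts). assoc/cdr comes from bzlib/base, and file->bytes is assumed from scheme/file:

(define (file-query handle stmt (args '()))
  ;; jail the requested path under the root directory held in the handle
  (define (path-helper path)
    (if (equal? path "/")
        (handle-conn handle)
        (build-path (handle-conn handle) (substring path 1))))
  (case stmt
    ((open) ;; return the contents of each requested path
     (map (lambda (kv) (file->bytes (path-helper (cdr kv))))
          (filter (lambda (kv) (equal? (car kv) 'path)) args)))
    ((list) ;; list the files under the given path (or the root)
     (directory-list (path-helper (assoc/cdr 'path args "/"))))
    (else
     (error 'file-query "unknown statement: ~a" stmt))))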

Create a Driver for bzlib/dbi (1) - DBI Internals

Since we claimed bzlib/dbi is an extensible abstract database interface for PLT Scheme, let's take a look at how to extend it.  As an exercise we'll create a database driver that uses the filesystem as the database. 

Motivation 

What we want is to use the filesystem as a simple form of database, where the path of the file is the key, and the content is the value.  We should be able to select, update, delete, and insert data.  Let's see what we can come up with.

Overview of Creating a Database Driver

The first step is to require bzlib/base (contains bzlib/base/registry) and bzlib/dbi (contains the interface that we need to extend):

#lang scheme/base
(require (planet bzlib/base) (planet bzlib/dbi)) 

Then we need to create four functions for the driver that matches the following signature:

(connect driver . args) ;; a connect function that takes in the database driver definition, and a variable list of arguments
(disconnect handle) ;; a disconnect function that takes in the database handle
(prepare handle statement) ;; a prepare function that takes in the database handle and a statement
(query handle statement args) ;; a query function that takes in the database handle, the statement (or the key to a prepared statement), and the args as a key/value list

The above four functions are sufficient for creating a database driver.  If we need custom transaction handling, we can provide three additional functions:

(begin-trans handle) ;; to provide the begin transaction clause for the database handle
(commit handle) ;; to provide the commit transaction clause for the database handle
(rollback handle) ;; to provide the rollback transaction clause for the database handle 

Overriding the transaction functions is optional, as you can use the default versions, named default-begin, default-commit, and default-rollback.  The default versions basically issue the corresponding SQL statements to the underlying database connection.

Then we just need to register the functions as a database driver:
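The listing is cut off here; the registration looks like the following (the completed version appears in the next post):

(registry-set! drivers 'file
               (make-driver file-connect
                            file-disconnect
                            file-query
                            file-prepare
                            file-begin
                            file-commit
                            file-rollback))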

Tuesday, September 15, 2009

Flex Compiler Integration Optimization

This is a continuation of the FLEX Integration Series - please refer to the previous posts for more details:

  1. Building FLEX Integration with SHP
  2. FLEX Compiler Integration with SHP

Now we'll cover how to optimize the integration for performance.

The easiest way to determine whether a new compilation is necessary is by comparing the timestamp of the source code against the timestamp of the object code - even the venerable make utilizes this simple check.

So we need to determine the timestamp of the shp script from which we call mxml, and compare it against that of the flash video file.  Since mxmlc compilation takes a while, if the shp source script has a later timestamp than the flash video, it is quite safe to assume the source script has changed.

Determining the Timestamp 

PLT Scheme offers (this-expression-source-directory) and (this-expression-source-file-name) to help determine the location.  But as those are macros that apply to the location where they appear, they are not exactly helpful to our cause, unless we want to write them out every time we use mxml, like this:

(mxml (this-expression-source-directory) path ...)  
We want the value to be automatically available rather than manually supplied, so we'll have to improve our SHP handler.

(define __PATH__ (make-parameter (shp-handler-path ($server)))) ;; looks like a C macro... 

(define (evaluate-script path)
  (evaluate-terms (file->values path) path))

(define (evaluate-terms terms path)
  (require-modules! terms) ;; first register the required modules 
  ;; then we filter out the require statements and evaluate the rest of the terms as a proc 
  (eval `(lambda ,(terms->args terms) 
           (parameterize ((__PATH__ path))
             . ,(terms->exps terms)))
        handler-namespace))
Now we have the path available within the SHP script scope, and we can use it to check the timestamp.
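For example, a staleness check along these lines (the helper name is illustrative) tells us whether to re-run mxmlc:

(define (stale? swf-path)
  (or (not (file-exists? swf-path))
      (> (file-or-directory-modify-seconds (__PATH__))
         (file-or-directory-modify-seconds swf-path))))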

Flex Compiler Integration with SHP

This is a continuation of Building FLEX Integration with SHP - please refer to it for refresher.

The next step of the integration is to do FLEX development directly within SHP. If you are used to developing FLEX apps in an IDE such as Flex Builder, you might not necessarily need this capability.  But consider the following:
  • potential for modularizing the flex scripts
  • potential for graceful degradations into ajax or pure html 
  • possibility for automating the generation of simple flex-based solutions 
All of these are powerful building blocks for web development, so we'll give it a shot.  Let's get started.

Goal

We want to be able to write MXML and ActionScript in SHP, and have the script compiled in real time into the flash movie and served to the browser transparently.

Example:

Here's an mxml shp script for hello world:

(mx:app (mx:script "private function clickHandler(evt:Event):void {
    messageDisplay.text = \"I am Glad, it does.\";
}")
     (mx:label "Flex without Flex Builder")
     (mx:button #:label "Yes, It Works!" #:click "clickHandler(event)")
     (mx:label #:id "messageDisplay"))
Which should be compiled into the following mxml:

<mx:Application xmlns:mx="http://www.adobe.com/2006/mxml">
<mx:Script>
<![CDATA[
private function clickHandler(evt:Event):void {
    messageDisplay.text = "I am Glad, it does.";
}
 ]]>
</mx:Script>
<mx:Label text="Flex without Flex Builder" />
<mx:Button label="Yes, It Works!" click="clickHandler(event)" />
<mx:Label id="messageDisplay" />
</mx:Application> 
And then compiled into a flash video file, and finally served to the client via the (flash) inclusion.  We should also check the timestamp of the SHP script against the compiled flash video so we don't waste CPU cycles compiling the same script over and over again.

The first thing to do is to ensure we can generate the correct MXML.  It is mostly straightforward:


(define (mx:app . widgets) 
  `(mx:Application ((xmlns:mx "http://www.adobe.com/2006/mxml"))
                   . ,widgets))

(define (mx:label (text "") #:id (id #f))
  `(mx:Label ((text ,text)
              ,@(if (not id) '() `((id ,id))))))

(define (mx:button #:label (label "") #:click (click #f))
  `(mx:Button ((label ,label) 
               ,@(if (not click) '() `((click ,click))))))

;; more mxml widget definitions  

With the above we can now generate the MXML, albeit in an incomplete form.  The next step is to get the generated mxml *saved* to a predefined location, instead of serving it.
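A hedged sketch of that step, serializing the xexpr with xexpr->string from PLT's xml collection to a predefined .mxml location:

(require xml)

(define (save-mxml! xexpr path)
  (call-with-output-file path
    (lambda (out)
      (display "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n" out)
      (display (xexpr->string xexpr) out))
    #:exists 'replace))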

Monday, September 14, 2009

Building Flex Integration with SHP

Besides AJAX, FLEX is the other major player in the RIA world (Silverlight might be a future contender, but it currently is not).  We already provide basic integration with JavaScript; now we want to provide the same level of integration with FLEX.


The lightest layer of integration would be to manage the flex files as static resources and just make sure they work well within SHP.  But we can also opt for a closer integration by having SHP compile the flash video files and make the compilation seamless.  Let's develop the light integration first.

What we want are the following:
  • keep the flex video files in a static directory similar to other resources, i.e. in the $htdocs/flex/ directory
  • simplify the generation of the <object> tag to load the appropriate video file
The tag should be as simple as the following SHP script:

;;  foo.shp ;; loads the foo.swf flash video
(flash "foo" #:width 550 #:height 400) 

Which should then be translated into (the xexpr equivalent of) the following markup:

<object width="550" height="400" id="foo">
<param name="allowscriptaccess" value="always" /> 
<param name="movie" value="/flex/foo.swf" />
<embed src="/flex/foo.swf" width="550" height="400" name="foo" 
      type="application/x-shockwave-flash" allowscriptaccess="always" >
</embed></object>

This is quite straightforward:

(define flash-base-path (make-parameter "/flex")) ;; matches the $htdocs/flex directory above

(define (path->id path)
  (string-join (path-helper path) "_"))

(define (path-helper path)
  (cond ((pair? path) path)
        ((string? path) (regexp-split #px"\\/" path ))
        ((path? path) (path-helper (path->string path)))))

(define (flash-url path)
  (string-join (cons (flash-base-path) (path-helper path)) "/"))

(define (flash (path ($pathinfo)) #:id (id #f) #:base (base (flash-base-path)) #:height (height 400) #:width (width 400))
  (define (helper path id width height)
    `(object ((width ,width) 
              (height ,height) 
              (id ,id))
             (param ((name "movie") (value ,path)) "")
             (param ((name "allowscriptaccess") (value "always")) "")
             (embed ((src ,path)
                     (width ,width)
                     (height ,height)
                     (name ,id)
                     (type "application/x-shockwave-flash")
                     (allowscriptaccess "always")) "")))
  (helper (flash-url path)
          (if (not id) (path->id path) id)
          (number->string width)
          (number->string height)))
With the ability to generate the flash embedding xexpr, we now just need to create the static flash directory, require the module (we'll call it bzlib/flexer), and then use the above shp script in our code.

We'll talk about the tighter integration next time.  Cheers.

Saturday, September 12, 2009

Extensible Abstract Database Interface for PLT Scheme

There are not a lot of choices of relational database interfaces for PLT - part of the reason is just that the community is not yet the size of Perl's or Python's, so there are not as many infrastructure packages yet.

However, PLT does have interfaces for three of the most important open source databases - postgresql, mysql, and sqlite - so your open source database needs are likely addressed. The challenge is that each sports a different interface, so you cannot really abstract the database away to the degree that you can in other languages.

sqlid was probably the closest thing PLT had to a single interface with multiple drivers, but it does not appear to have been active for a while, so I decided to take the plunge and create an abstract database interface so all three packages can function uniformly. Hopefully it can attract collaboration from driver writers.

Introducing bzlib/dbi - an extensible abstract database interface for PLT

Inspired by Perl's DBI, the package bzlib/dbi separates the interface from the driver, so if you want to use a particular driver, you just need to require:

(require (planet bzlib/dbi)
         (planet bzlib/dbd-<driver>))
Currently there are three such drivers available: bzlib/dbd-spgsql (wrapping schematics/spgsql), bzlib/dbd-jsqlite (wrapping jaymccarthy/sqlite), and bzlib/dbd-jazmysql (wrapping jaz/mysql).
All four packages are released under LGPL.

To connect to a database, you just issue connect, passing the symbol that identifies the driver along with the arguments expected by the underlying driver. For example, once you have required bzlib/dbi and the appropriate driver, to connect against schematics/spgsql:

(define handle (connect 'spgsql '#:server "localhost" '#:port 5432 ...))

And to connect against jaymccarthy/sqlite:

(define handle (connect 'jsqlite ':temp:)) ;; creates a temp database

To connect against jaz/mysql:

(define handle (connect 'jazmysql "localhost" ...))

Note that if the underlying driver expects keyword arguments, pass in each keyword with a quote, so PLT treats it as an argument rather than a keyword for connect itself (which would otherwise result in an exception, since connect has no keyword arguments).

To disconnect, just do:

(disconnect handle)

Query and Named Parameters

The goal of bzlib/dbi is to keep the interface as simple as possible, so there is currently only one function that handles the queries:

(query handle "select * from ..." args)


The args parameter is a list of pairs of symbols and values. It is currently not optional, so if you have no arguments you must still pass in '().

The query statements take named parameters instead of ordinal parameters, so instead of writing:

select * from table1 where c1 = ? and c2 = ?

You write

select * from table1 where c1 = ?c1 and c2 = ?c2 -- ?c1 & ?c2 are named parameters.

Then you pass in '((c1 . val1) (c2 . val2)) as the argument.

The nice thing about named parameters is that you do not have to worry about matching the order of the parameters, and you only need to specify a value once even if it appears in multiple positions. A name can consist of regular alphanumeric characters, underscores, dashes, and exclamation marks.

bzlib/dbi currently standardizes on using question marks as placeholders and translates the query and the parameters to the corresponding form for each of the drivers. This can be customized.

Prepared Statements

bzlib/dbi also provides the ability to create prepared statements:

;; (prepare handle name-symbol query)
;; example
(prepare handle 'select-table1 "select * from table1 where c1 = ?c1 and c2 = ?c2")
(query handle 'select-table1 `((c1 . ,val1) (c2 . ,val2)))
Basically - the prepared statement is created and kept with the handle, identified by the name symbol, and run via the same query interface by passing in the name symbol (instead of the query text) along with the arguments.

Unlike other database interfaces, where the prepared statement object is returned to the caller to pass around, bzlib/dbi hides it, to 1) simplify the interface, and 2) minimize the chance of error when the same underlying handle is used by multiple threads at once. I settled on this design after running into issues where a prepared statement was run by a separate thread after the handle's originating thread had terminated; this way you do not have to manage the same object across multiple threads.

Transactions

You can manage transactions manually with the following calls:

(begin-trans handle)
(commit handle)
(rollback handle)

Or you can use the with-trans syntax, which wraps the transaction management with exception handling - so if there are errors triggered it would automatically cause rollback of the transaction.

(with-trans (handle1 ...)
  exp ...)

You can wrap multiple database handles, and if one errors, all will be rolled back - a simple form of distributed transaction.

Data Type Mapping

The data type mapping depends on the underlying driver, but the following are the mappings for the general types:
  • integer <-> sql integer
  • number <-> sql float
  • srfi/19 date time <-> sql date/time
  • string <-> sql string/text
  • bytes <-> sql binary/blobs
  • null <-> sql null
For bzlib/dbd-spgsql, there is also a mapping for the array type as a list.

The recordset is returned as a list of lists. The first row holds the column names (so there will always be at least one row in a select query result, even if there are no data rows).

Additional Query Helpers

The following provides additional helpers that wraps around query to simplify the developer's tasks in certain situations:
  • exec - used for non-select queries
  • rows - strips the column names and returns just the rows
  • row - returns the first row (and raises an error if there are no rows)
  • row/false - returns the first row, or #f if there are no rows
  • exists? - a shorthand for row/false (reads more naturally for people used to SQL)
  • cell - returns the first cell of the first row, or throws an error if there are no rows
  • cell/null - returns the first cell of the first row, or null
  • cell/false - returns the first cell of the first row, or #f
All of the above helpers also take an optional converter parameter right after the args parameter, so you can process the result directly before it is returned. The converter should be a function that takes as many arguments as there are columns in the query. The full signature of these helpers looks like:

(query-helper handle symbol-or-query key/val-args converter)
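
For example, a minimal sketch that turns each row into a pair, assuming the converter is applied to every row (table1 and its columns are hypothetical):

(rows h "select c1, c2 from table1" '()
      (lambda (c1 c2) (cons c1 c2))) ;; each row arrives as one argument per column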


Extensible Database Interface

An important principle of the design is to make the database interface extensible, so code can be reused and new database drivers can be developed more easily (of course, doing so still requires intimate knowledge of the particular database's protocol). The extensibility comes from the chaining of drivers.

The following drivers are currently available for extending the capability of the three database drivers.
  • app - provides a kill-safe interface over the raw database driver
  • pool - provides a database connection pool interface over the raw database driver

Kill-Safe App Driver

To make use of the kill-safe app driver, you need to pass in the 'app driver identifier to connect:

(define kill-safe-handle (connect 'app #f <inner-driver-args> ...))

So for example, if you want to make schematics/spgsql kill-safe, use the following:

(define handle (connect 'app #f 'spgsql '#:server "localhost" '#:port 5432 ...))

And if you want to do so with jaymccarthy/sqlite (which doesn't do much since jaymccarthy/sqlite is FFI-based and already serializes all threads):

(define handle (connect 'app #f 'jsqlite "database.db"))

What happens is that bzlib/dbi initiates the app driver, which then initiates the underlying database driver. The #f arg tells the app driver whether there is a pool driver helping to handle the response; you only need to pass this value explicitly if you are initiating the app driver directly.

Database Connection Pool

The database connection pool driver further builds on top of the app driver and manages the database connections. It works in conjunction with the app driver. For example - to initiate a database pool of maximum 10 database connections:

(define pool-handle (connect 'pool 10 'spgsql '#:server "localhost" '#:port 5432 ...))

You do not need to specify the app driver in this case, since the pool driver does that for you internally.

The pool driver is a bit special in that it does not directly handle the queries; instead it offers up one of its connections to the caller and lets the caller use the connection until the caller thread dies. To retrieve a connection from the pool driver, do:

(query pool-handle 'connect '())

which would return the underlying app driver handle.
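For example, a minimal sketch (table1 is hypothetical) - the checked-out handle is used like any other handle:

(define h (query pool-handle 'connect '()))
(rows h "select * from table1" '())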

The app and pool drivers demonstrate how bzlib/dbi can be extended. You can certainly extend it further to fit your needs (one possible future extension would be a load-balancer handle that sits in front of a database cluster).

TODO

There is a lot more the design of bzlib/dbi can accommodate, but one of the important things to do is to collaborate with the driver writers so there can be a joint effort. Since bzlib/dbi is the new kid on the block, that effort will have to come later, after people have had a chance to test out the interface. But you can now switch between the three major databases with relative ease.

Erlang-Style Programming in PLT

Similar to Erlang, PLT Scheme also offers microthreads. While PLT's concurrency implementation might not yet be as capable as Erlang's (PLT is currently not multicore-enabled), we can still employ a similar development style.

The base of PLT's thread primitives is documented in the PLT reference documentation. Specifically, to spawn a thread, we just call (thread <procedure>). To send the thread a message, use (thread-send <thd> <msg>). The target thread calls (thread-receive) to retrieve the message. This basic pattern works well with two threads communicating with each other. Once more than two threads are involved, we run into issues with bare (thread-receive), since we have no way to verify which thread sent which message, let alone send back the right response to the right recipient. We'll need something more.

Introducing bzlib/thread - implementing the erlang selective receive pattern in PLT Scheme. The code is released under LGPL.

In Erlang, receive is pattern-match enabled, and we want something similar in PLT Scheme. bzlib/thread provides a pattern-match-enabled receive for PLT Scheme, with the syntax receive/match:

(require (planet bzlib/thread)) ;; load the package
;; default syntax - with pattern match
(receive/match
  (match-pattern exp ...) ...)

;; timer syntax - just the timer
(receive/match
  (after time exp ...))

;; extended syntax - combine the pattern match with the timer
(receive/match
  (match-pattern exp ...) ...
  (after time exp ...))

The match-pattern has the exact same syntax as scheme/match, since scheme/match provides the underlying matching capability. You can also specify an after clause with the number of seconds (it does not have to be an integer) after which a timer-based event triggers even when no thread message arrives.
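For example, a minimal sketch assuming a hypothetical 'ping protocol between two threads:

(receive/match
  ((list 'ping (? thread? sender))  ;; match a ping message carrying the sender
   (thread-send sender 'pong))
  (after 5                          ;; fire if no message arrives within 5 seconds
   (printf "no ping received~n")))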

Beyond Erlang's Receive Capability

While PLT Scheme's concurrency capability is not yet as capable as Erlang's, its language facility has more to offer than Erlang's. By default PLT Scheme provides a powerful synchronization framework that includes many different types of events (the after clause is built on top of the alarm event), and it would be a shame if we could not take advantage of all those event capabilities within our receive/match. Hence receive/match is enhanced with a sync clause:

;; with the sync clause - it is also pattern-match enabled...
(receive/match
  (match-pattern exp ...) ...
  (after time exp ...)
  (sync (pattern exp ...) ...))

Hence you can pass in a list of custom events, and the syntax will handle all of the synchronizations within one clause. The sync clause also has pattern matching, so you can use it to match against the specific event that was triggered and dispatch to the correct branch.

receive/match forms the basis of multi-thread communication.

Communication Between Multiple Threads

As stated earlier, the challenge with the bare thread-send and thread-receive is that you have to handle your own message dispatching when there are multiple threads trying to communicate with each other. While receive/match provides the additional pattern-matching capability, it only simplifies the effort of coordinating all of the threads; you still have to devise a scheme to identify the sending thread (so you can send back an appropriate response). To do so, we need to codify the structure of the message so it includes information about the sender. The simplest way is to add the sending thread into the message as follows:

(thread-send thd (list (current-thread) args))

Then you just need to ensure your receive/match matches the signature:

(receive/match ((list (? thread? thd) args) (do-whatever) ...) ...)

bzlib/thread codifies the above signature in thread-call, so all you have to do is:

(thread-call thd args)
;; or with a timeout
(thread-call thd args timeout)

and the above message signature will be sent.

Once you've received the message via receive/match and extracted your args for processing, you can then use thread-reply to send a message back to the sender:

(thread-reply sender result)
;; or you can reply on behalf of another thread
(thread-reply sender result thread-on-behalf-of)


Handling Exceptions and Responding to the Sender

What if the processing raises an exception and you need to send the exception back to the sender? To avoid confusion between a regular reply (which holds legitimate values) and an exception, you should use send-exn-to to return the exception, which does the following:

(thread-send thd (cons exn thread) #f) ;; #f means no error is raised if the receiving thread is dead.

Then you just match for the exception pattern to check whether an exception was raised:

(receive/match ((cons (? exn? e) (? thread? sender)) (do-something...)) ...)
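
Putting the pieces together, here is a minimal sketch of a worker thread. The work function is hypothetical, thread-call is assumed to block for the reply, and the argument order of send-exn-to is an assumption:

(define worker
  (thread
   (lambda ()
     (let loop ()
       (receive/match
         ((list (? thread? sender) args)
          ;; send any raised exception back instead of a regular reply
          (with-handlers ((exn? (lambda (e) (send-exn-to sender e))))
            (thread-reply sender (work args)))))
       (loop)))))

(thread-call worker '(1 2 3)) ;; sends (list (current-thread) '(1 2 3))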


Casting Messages

The above thread-call, thread-reply, and send-exn-to codify the communication signatures between threads so the correct reply can be sent back to the originator. But what if you do not need to respond to the originator? In that case you have a "cast" pattern, and bzlib/thread provides thread-cast and thread-cast* for you.

(thread-cast thd arg)
(thread-cast* thd arg arg1 ...) ;; equals (thread-cast thd (list arg arg1 ...))

They are very similar to the bare thread-send, except they are written with the kill-safe pattern: they first attempt to wake the receiving thread if it was suspended, and then send over the arguments. Use thread-cast or thread-cast* if you do not need a response back.

Applications and Getting Closer to OTP

Erlang is famous for its nine nines of uptime, and its OTP modules have a lot to do with it. It takes a lot of effort to implement OTP and get all of the bugs out, so OTP is not coming to PLT soon. But app, which is part of bzlib/thread, is a first step toward constructing OTP for PLT.

App basically provides a simple structure over the receiving thread, so you have a structure that you can pass around and manipulate. The function make-application simplifies the creation of an application:

(make-application call cast init-state)

You just need to pass the call function (run when triggered via thread-call) and the cast function (run when triggered via thread-cast), as well as the init-state, which holds the values that the application maintains internally between thread calls or casts.

The call function should have the following signature:

(sender-thread passed-in-args app-state . -> . (cons/c result app-state))

The returned app-state does not have to be the same as the passed-in app-state, but it needs to be compatible with the function for the next call.

The cast function should have the following signature:

(passed-in-args app-state . -> . (cons/c result app-state))

The result will be discarded, since there is no response back to the sender.

app-call and app-cast are provided as wrappers over thread-call and thread-cast. Unlike thread-call and thread-cast, app-call and app-cast take variable parameter lists.

(app-call app cmd #:timeout (number? +inf.0) arg1 ...)
(app-cast app cmd arg1 ...)

cmd is a symbol that you can use to dispatch to the correct function within your app, assuming your app exposes multiple functions as its API.
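As an illustration, here is a minimal sketch of a counter app. The exact shape of the args seen by the call and cast functions (here assumed to be a list beginning with cmd) is an assumption:

(require scheme/match)

(define counter
  (make-application
   ;; call function: (sender args state) -> (cons result new-state)
   (lambda (sender args state)
     (match args
       ((list 'get)   (cons state state))
       ((list 'add n) (cons (+ state n) (+ state n)))))
   ;; cast function: (args state) -> (cons result new-state); result is discarded
   (lambda (args state)
     (match args
       ((list 'reset) (cons (void) 0))))
   0)) ;; init-state

(app-call counter 'add 5) ;; => 5 (blocks for the reply)
(app-cast counter 'reset) ;; no response expected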

That's it for now - have fun programming in Erlang style in PLT Scheme.

Wednesday, September 2, 2009

Conditional Module Inclusion and Compilation

Inspired by the "conditional module inclusion and compilation" thread on the PLT mailing list, bzlib/os provides a way to conditionally run code depending on the system type.

To start, require bzlib/os:
(require (planet bzlib/os))


Conditional Require

To require modules only within windows, use require/windows.

(require/windows <require-spec> ...)

The <require-spec> takes the same form as in regular require statements. On Mac OS or Unix, the above statement evaluates to (void).

To do the same for Mac OS, use require/macosx. And use require/unix for Unix.

To specify multiple platforms in one statement, use require/os.

(require/os (:windows <require-spec> ...)
            (:macosx <require-spec> ...)
            (:unix <require-spec> ...)
            (else <require-spec> ...))

Use :windows to specify the Windows branch. Use :macosx for Macs, and :unix for Unix. All are optional.

Use else to specify non-platform specific includes. If the else branch exists, it must be the last branch.

Conditional Provide

There are also OS-dependent provide statements, which mirror the require statements (even though they are probably less likely to be used).
  • provide/windows evals to provide on Windows
  • provide/macosx evals to provide on Macs
  • provide/unix evals to provide on Unix
  • provide/os has the same structure as require/os - use :windows, :macosx, and :unix to write OS-dependent provide branches, and use else to write the OS-independent branch. If the else branch exists it must be the last branch


Both the require/* and provide/* forms behave the same as regular require and provide in that they can only be used at the top level or module level (provide only works at module level).

Conditional Expressions

The most general form is conditional expressions, which can be called at any position.

For windows, use +:windows:

(+:windows <exp> <exp2>) evaluates to <exp> on Windows, and to <exp2> otherwise.

(+:windows <exp>) evaluates to <exp> on Windows, and to (void) otherwise. It is equivalent to (+:windows <exp> (void)).

For Macs, use +:macosx in the same way as above. For Unix, use +:unix.

If you need to write more than two OS branches, use +:os, which is the foundation of all the above macros:

(+:os (:windows exp)
      (:macosx exp)
      (:unix exp)
      (else exp))

Similar to require/os and provide/os, the branch labeled :windows, :macosx, or :unix gets evaluated on the respective OS. The else branch is evaluated for platform-independent expressions; if it exists, it must be the last branch.

It is possible to write (+:os), which gets evaluated to (void).
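
For example, a minimal sketch that picks a platform-specific temporary directory (the paths are illustrative):

(define temp-dir
  (+:os (:windows "C:\\Temp")
        (else "/tmp")))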

Caveats

Eli has cautioned that, when using this pattern, one must make sure no .zo files are copied across platforms.

In general one should write platform-independent code, but of course that is not possible in all situations. It is always better to isolate the OS-specific code into its own modules and make sure those modules all expose the same signatures; units can aid with this pattern.

Example:
  • bar.ss exposes bar and baz
  • bar-windows.ss exposes bar and baz implementations that work specifically for Windows
  • foo.ss includes bar.ss & bar-windows.ss via (require/os (:windows "bar-windows.ss") (else "bar.ss"))