Tuesday, October 20, 2009

BZLIB/DATE & BZLIB/DATE-TZ 0.2 Now Available

bzlib/date and bzlib/date-tz are now available via planet. They are released under LGPL. You can refer to the previous documentation for the earlier usage.

The changes included are:
  • re-exports for SRFI-19 functions
  • wrapper functions for PLT date objects (you can use the functions with PLT date objects instead of SRFI date objects)
  • day comparison functions
  • RFC822 date parsers and generators
  • additional date manipulation functions

SRFI19 Re-export
Previously you had to explicitly require srfi/19 to use the SRFI-19 functions, as follows:

(require srfi/19 (planet bzlib/date)) 
Now you just have to do the following:

(require (planet bzlib/date/srfi)) 
And almost all srfi/19 functions will be re-exported along with the functions within bzlib/date.

The exceptions are date->string and string->date, neither of which is re-exported from srfi/19. This is because we may want to use those names for our own date parser and generator functions. I'll examine the details before deciding whether to re-export them or create our own.

PLT Date Wrappers

You can now use PLT date objects instead of srfi/19 date objects (I do not really know why they are different date structs in the first place...). Just do the following:

(require (planet bzlib/date/plt)) 
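For example (a hypothetical snippet, not from the announcement; it assumes the plt variant exposes the same function names over PLT date structs, which is what the wrapper provides):

(require (planet bzlib/date/plt))

;; build-date and the comparisons now operate on PLT date structs
(define d (build-date 2009 10 20))
(date>? d (build-date 2009 10 19)) ;; => #t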

Monday, October 19, 2009

Building Parser Combinators in Scheme (2) - Higher Order Combinators

Previously we started building parser combinators for parsing a symbol, whose grammar has the signature (seq alpha (zero-more (one-of alpha numeric))). We stopped after refactoring the numeric and alpha parsers, ending up with return, fail, and char-test.

char-test Expanded

char-test gives us a good base for building a bunch of other character-based parsers:

;; char= returns a parser to test whether the next char equals c
(define (char= c)
  (char-test (lambda (it)
               (char=? it c)))) 

;; in-chars returns a parser to match against the list of chars
(define (in-chars lst)
  (char-test (lambda (it) 
               (member it lst)))) 

;; not-in-chars is the reverse of in-chars 
(define (not-in-chars lst)
  (char-test (lambda (it) 
               (not (member it lst))))) 

;; in-char-range returns a parser to match char between from & to chars
(define (in-char-range from to)
  (char-test (lambda (it)
               (char<=? from it to))))

;; the opposite of in-char-range
(define (not-in-char-range from to)
  (char-test (lambda (it)
               (not (char<=? from it to)))))
So we can build parsers such as the following:

;; a parser to test for the backslash
(define backslash (char= #\\)) 
;; writing numeric using in-chars
(define numeric (in-chars '(#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)))
;; writing equivalent to regexp \S (non-whitespace) 
(define non-whitespace (not-in-chars '(#\tab #\return #\newline #\space #\vtab)))
;; lower-case alpha using in-char-range 
(define lower-case (in-char-range #\a #\z))
;; upper-case alpha using in-char-range
(define upper-case (in-char-range #\A #\Z)) 
It would be nice to write alpha in terms of upper-case and lower-case as defined above:

(define alpha (one-of upper-case lower-case)) 
So let's see how we can write one-of.

Higher Order Parsers

The idea of one-of is straightforward: take in a list of parsers to test against the input, one parser at a time. The first one that succeeds wins:
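A minimal sketch of that idea (this is not necessarily the combinator as defined in this series; it assumes a parser is simply a procedure that takes the input and returns #f on failure, and it ignores backtracking over ports for now):

;; try each parser in turn against the same input; the first
;; non-#f result wins, otherwise the whole thing fails
(define (one-of . parsers)
  (lambda (in)
    (let loop ((parsers parsers))
      (cond ((null? parsers) #f)
            (((car parsers) in) => values)
            (else (loop (cdr parsers)))))))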

Friday, October 16, 2009

Building Parser Combinators in Scheme

I do not write code for the sake of writing code, but sometimes the best way to understand a concept is to write it up. I can think of many situations where I had no clue what the heck was going on after reading the documentation or even the tutorials, but started to understand once I began hacking up a solution.

Parser combinators are one of those things for me. Based on what I can find, Haskell is where the term was coined and expanded, but unfortunately I cannot read Haskell well enough to fully follow the excellent monadic parser combinator paper, let alone understand Parsec and think about how it would translate into scheme.

PLT Scheme contains a parser combinator library, but its documentation is written for someone who is already completely familiar with parser combinators and how they work within scheme, so I couldn't start with it (in contrast, I was able to figure out parser-tools in short order, even though I do not have much prior experience with lex & yacc either).

Searching online, there are a couple of parser combinators written in scheme:
It wasn't much, but it was something I could build on. Let's get started...

Immediate Obstacle - Reading from Port Instead of String or List

Parser combinators promise top-down parser development in an almost BNF-like fashion, including infinite lookahead and unlimited backtracking, which are extremely fascinating powers.

But reading the monadic parser combinator paper leaves me scratching my head over how this would work with ports. The parser combinator appears to derive its backtracking power from reading the string into a list of characters. This is nice and all, but it doesn't work with ports, since once a character is read from a port it is gone and you lose the ability to backtrack (yes - with some ports it is possible to backtrack, but not with all of them).
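As a small illustration (not from the post), PLT Scheme does let you look ahead on a port without consuming anything by peeking at a byte offset, which hints at one way around the problem:

(define in (open-input-string "abc"))
(peek-char in 0) ;; => #\a, port position unchanged
(peek-char in 1) ;; => #\b, still nothing consumed
(read-char in)   ;; => #\a, now consumed for good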

I wasn't able to find an answer to this problem via the above URLs either, so I'll have to come up with my own solution.

What I came up with was the following:

Thursday, October 15, 2009

Parsing "Encoded Word" in RFC Headers (3) - Parsing & Decoding

In the previous two posts we built the capability to generate encoded words; now it's time to decode them. We'll start with detecting whether a string is an encoded word.

Testing for Encoded Word

Remember that an encoded word has the following format:

=?<charset>?<Q or B>?<encoded data>?=
The above format can be succinctly specified via a regular expression:

(define encoded-word-regexp #px"=\\?([^\\?]+)\\?(?i:(b|q))\\?([^\\?]+)\\?=")

(define (encoded-word? str)
  (regexp-match encoded-word-regexp str))
Since the format is not recursive, a regular expression is good enough, even if it is a bit ugly. You can certainly try other approaches, such as a hand-written lexer or parser-tools; for now we keep things simple.
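For example (a hypothetical call, using one of the example headers from RFC1342):

(encoded-word? "=?ISO-8859-1?Q?Andr=E9_?=")
;; => ("=?ISO-8859-1?Q?Andr=E9_?=" "ISO-8859-1" "Q" "Andr=E9_")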

Decoding

Once we can test whether a string is an encoded word, we can then use it to handle the decoding:

(define (encoded-word->string str)
  (if-it (encoded-word? str)
         (apply decode-encoded-word (cdr it))
         str))
If the string is an encoded word, we decode it; otherwise we return it verbatim. This way regular strings can be passed into this function as well.

The decode-encoded-word function looks like the following:
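A rough sketch of one possible shape for it (not necessarily the actual definition; decode-q below is a helper written purely for illustration, base64-decode comes from net/base64, and bytes/charset->string comes from the charset-handling post):

(require net/base64)

;; the Q encoding is quoted-printable-like: "_" stands for a space
;; and "=XX" for a hex-encoded byte
(define (decode-q str)
  (let loop ((chars (string->list str)) (acc '()))
    (cond ((null? chars) (list->bytes (reverse acc)))
          ((char=? (car chars) #\_)
           (loop (cdr chars) (cons 32 acc)))
          ((char=? (car chars) #\=)
           (loop (cdddr chars)
                 (cons (string->number (string (cadr chars) (caddr chars)) 16)
                       acc)))
          (else
           (loop (cdr chars) (cons (char->integer (car chars)) acc))))))

(define (decode-encoded-word charset encoding data)
  (bytes/charset->string
   (case (string->symbol (string-downcase encoding))
     ((q) (decode-q data))
     ((b) (base64-decode (string->bytes/utf-8 data))))
   charset))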

Wednesday, October 14, 2009

Parsing "Encoded Word" in RFC Headers (2) - Charset Handling & Multiple Encoded Words

In the previous post we discussed the Q and B encodings, and ended with a bug: the charset is mismatched when it is not utf-8. Let's try to fix that bug here.

It would be nice if we could use local charsets such as iso-8859-1 or big5 when we know for sure that the charset contains all of the characters that appear in the string (of course, it is the developer's responsibility to choose the right charset; the code will error out if the charset does not match the data).

PLT Scheme provides convert-stream (from scheme/port) to help convert bytes from one charset to another. On top of this function we can build helpers that take strings or bytes and return strings or bytes. What we want is something like:

(bytes/charset->string #"this is a string" "ascii") ;; => returns a string
(bytes/charset->bytes/utf-8 <bytes> <charset>) ;; => returns a bytes
The idea is that we'll turn the input data into an input port, and then retrieve the converted data from the output port, which will be a bytes port.

So let's start with a helper function that'll take in an input-port, and the charsets and then return a bytes:

(define (port->bytes/charset in charset-in charset-out)
  (call-with-output-bytes 
   (lambda (out)
     (convert-stream charset-in in charset-out out))))
Then we can have the following:

(define (bytes->bytes/charset bytes charset-in charset-out)
  (port->bytes/charset (open-input-bytes bytes) charset-in charset-out))
And we can define converting bytes to and from utf-8:

(define (bytes/charset->bytes/utf-8 bytes charset)
  (bytes->bytes/charset bytes charset "utf-8")) 

(define (bytes/utf-8->bytes/charset bytes charset)
  (bytes->bytes/charset bytes "utf-8" charset))
And finally, we can layer the string conversions on top of these two functions:

;; there are more to handle (specifically charsets).
(define (bytes/charset->string bytes charset)
  (bytes->string/utf-8 (bytes/charset->bytes/utf-8 bytes charset)))

(define (string->bytes/charset string charset)
  (bytes/utf-8->bytes/charset (string->bytes/utf-8 string) charset))
With the above functions, we can now make sure the encoded word is converted into the correct charset:
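For example (a hypothetical call, not from the post), the string can be converted into the declared charset's bytes before it is Q- or B-encoded:

(string->bytes/charset "André" "iso-8859-1") ;; => #"Andr\351"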

Tuesday, October 13, 2009

Parsing "Encoded Word" in RFC Headers

If you want to correctly handle internet message headers as defined in RFC822 (and as improved by RFC2822), you'll find that there is currently no way of handling encoded words, which are defined separately in RFC1342.

Below is an example of encoded words in message headers, taken from RFC1342:

From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
 =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
which should be decoded into

From: Keith Moore <moore@cs.utk.edu>
To: Keld Jørn Simonsen <keld@dkuug.dk>
CC: André Pirard <PIRARD@vm1.ulg.ac.be>
Subject: If you can read this you understand the example.
But currently, net/head cannot handle the encoded words, so they are left undecoded:

(extract-all-fields <the-above-string>)
;; => 
'(("From" . "=?US-ASCII?Q?Keith_Moore?= ")
 ("To" . "=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= ")
 ("CC" . "=?ISO-8859-1?Q?Andr=E9_?= Pirard ")
 ("Subject"
  .
  "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\r\n =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?="))
So we'll need to handle it ourselves. Let's get started.

The Format of an Encoded Word

An encoded word has the following format:

=?<charset>?<Q or B>?<encoded data>?=
An encoded word should not exceed 75 bytes (including all the delimiters). If the string being encoded cannot fit within that length, it must be split into multiple encoded words separated by a space or linear folding whitespace (\r\n\s*). Encoded words can also coexist with plain text in the same header (as shown above in the CC header).
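As a quick illustration of the format (not the post's code; encoded-word/b/utf-8 is a hypothetical helper, and base64-encode comes from net/base64), a B-encoded word is just base64 over the charset-encoded bytes wrapped in the delimiters:

(require net/base64)

(define (encoded-word/b/utf-8 str)
  (string-append "=?utf-8?B?"
                 ;; base64-encode appends a trailing newline,
                 ;; which we strip before wrapping
                 (regexp-replace #px"\\s+$"
                                 (bytes->string/latin-1
                                  (base64-encode (string->bytes/utf-8 str)))
                                 "")
                 "?="))

(encoded-word/b/utf-8 "André") ;; => "=?utf-8?B?QW5kcsOp?="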

Monday, October 5, 2009

Introducing BZLIB/DATE & BZLIB/DATE-TZ - Date Time Manipulation Libraries

BZLIB/DATE & BZLIB/DATE-TZ are now available on planet. They provide additional date manipulation capability on top of SRFI/19, including timezone manipulation.

They are released under LGPL.

Usage and Installation

(require (planet bzlib/date)) 
(require (planet bzlib/date-tz)) 
bzlib/date provides date manipulations; bzlib/date-tz provides timezone manipulations. Their usages are discussed separately below.

bzlib/date

To create a date object, you can use build-date, which provides a more natural year/month/day/hour/minute/second parameter order:

(build-date <year> <month> <day> <hour> <minute> <second> #:tz <offset>) 
And it would do the right thing if you enter February 31st:

(build-date 2009 2 31) ;; => #(struct:tm:date 0 0 0 0 3 3 2009 0) 
By default, the tz offset is 0, which equates to GMT (see below for timezone support). Only year, month, and day are required.

Date Comparisons
The following functions compare two dates to determine their order:
  • (date>? <d1> <d2>)
  • (date<? <d1> <d2>)
  • (date>=? <d1> <d2>)
  • (date<=? <d1> <d2>)
  • (day=? <d1> <d2>)
  • (date!=? <d1> <d2>)
  • (date===? <d1> <d2>)
day=? only compares the year/month/day values, and date===? means they are the same date with the same tz offset.
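For example (hypothetical calls, not from the announcement):

(date<? (build-date 2009 10 5) (build-date 2009 10 20))    ;; => #t
(day=? (build-date 2009 10 5 8) (build-date 2009 10 5 17)) ;; => #t (same day, different hours)
(date===? (build-date 2009 10 5) (build-date 2009 10 5 #:tz -25200)) ;; => #f (different tz offsets)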

Conversions

You can convert between date and seconds with (date->seconds <date>) and (seconds->date <seconds>).

You can add to a date with (date+ <date> <number-of-days>). The number of days can be a non-integer.

You can find out the gap between two dates with (date- <date1> <date2>).

You can create an alarm event from a date via (date->alarm <date>) or (date->future-alarm <date>). The difference is that date->future-alarm returns false if the date is in the past.
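For example (hypothetical calls; the return values are only sketched in the comments):

(date->seconds (build-date 2009 10 5))                   ;; seconds since the epoch
(date+ (build-date 2009 10 5) 1.5)                       ;; a date a day and a half later
(date- (build-date 2009 10 20) (build-date 2009 10 5))   ;; the gap between the two dates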

Dealing with Weekdays

To find out the weekday of a particular date, you can use (week-day <date>).

To find out the date of the nth weekday (e.g., first Sunday, third Wednesday, last Friday) of a particular month, use nth-week-day:

(nth-week-day <year> <month> <week-day> <nth> <hour> <minute> <second> #:tz <offset>) 
For the week-day argument, use 0 for Sunday through 6 for Saturday. For the nth argument, use 1, 2, 3, 4, 5, or 'last. The hour, minute, second, and offset arguments are optional (as with build-date and the other functions below that take them).

To find out the date of a particular weekday relative to another date, use one of the following:
  • week-day>=mday
  • week-day<=mday
  • week-day>mday
  • week-day<mday
They all share the same arguments, which are year, month, week-day, month-day, hour, minute, second, and offset.
The usage is something like:

;; the sunday after May 15th, 2009
(week-day>mday 2009 5 0 15) ;; 5/17/2009 
;; the friday before September 22nd, 2009 
(week-day<mday 2009 9 5 22) ;; 9/19/2009 
The hour, minute, second, and offset parameters are there for you to customize the return values:

;; the sunday after May 15th, 2009
(week-day>mday 2009 5 0 15 15 0 0 #:tz -28800) ;; 5/17/2009 15:00:00-28800
;; the friday before September 22nd, 2009 
(week-day<mday 2009 9 5 22 8 30 25 #:tz 14400) ;; 9/19/2009 08:30:25+14400

bzlib/date-tz

The timezone-aware functions are controlled by the current-tz parameter, which defaults to America/Los_Angeles; parameterize it to change the timezone. The timezone names are the names available in the olson database.

To determine the offset of any date for a particular timezone, use tz-offset:

(parameterize ((current-tz "America/New_York")) 
  (tz-offset (build-date 2008 3 9))) ;; => -18000 
(parameterize ((current-tz "America/New_York")) 
  (tz-offset (build-date 2008 3 10))) ;; => -14400 
If you want to distinguish the standard offset from the daylight saving offset, you can use tz-standard-offset or tz-daylight-saving-offset:

(let ((d1 (build-date 2008 3 9))
      (d2 (build-date 2008 3 10)))
  (parameterize ((current-tz "America/New_York"))
    (values (tz-standard-offset d1)
            (tz-daylight-saving-offset d1)
            (tz-daylight-saving-offset d2))))
;; => -18000 (std)
;; => 0 (dst on 3/9/2008)
;; => 3600 (dst on 3/10/2008) 

Conversion

To fix up a date's tz offset, you can use the helper function date->tz, which resets the offset for you:

(let ((date (build-date 2008 3 10 #:tz -18000))) 
  (parameterize ((current-tz "America/New_York")) 
    (date->tz date))) 
;; => #(struct:tm:date 0 0 0 0 10 3 2008 -14400)
This function is meant to fix the offset for dates that belong to a particular timezone but whose offset was not set correctly - it does not switch the timezone for you.

A couple of other functions make it even easier to work with timezones.

(build-date/tz <year> <month> ...)
(date+/tz <date> <number-of-days>)
They basically create the date object (or add the days) and then call date->tz, so the offset is properly adjusted for the timezone.
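For example (a hypothetical call, not from the announcement):

(parameterize ((current-tz "America/New_York"))
  (build-date/tz 2008 3 10)) ;; the offset comes out as -14400 (EDT)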

Besides using current-tz, you can also pass the timezone explicitly to tz-offset, tz-standard-offset, tz-daylight-saving-offset, date->tz, build-date/tz, and date+/tz, in the following forms:

(tz-offset <date> "America/Los_Angeles")
(tz-daylight-saving-offset <date> "Asia/Kolkata")
(tz-standard-offset <date> "Europe/London")
(date->tz <date> "Europe/London")
(build-date/tz <year> <month> <day> #:tz "America/New_York")
(date+/tz <date> <number-of-days> "America/Los_Angeles")

Convert from One Timezone to Another

To convert the timezone of a date, so you get the same instant expressed in a different timezone, use tz-convert:

(tz-convert <date> <from-timezone> <to-timezone>)
All parameters are required.

(tz-convert (build-date 2008 3 10 15) "America/New_York" "America/Los_Angeles")
;; => #(struct:tm:date 0 0 0 12 10 3 2008 -25200) ;; 2008/3/10 12:00:00-25200
(tz-convert (build-date 2008 3 10 15) "America/New_York" "GMT")
;; => #(struct:tm:date 0 0 0 19 10 3 2008 0) ;; 2008/3/10 19:00:00+00:00

That's it for now - enjoy.

Thursday, October 1, 2009

Handling Time Zone in Scheme (5): Pre-Convert the Rules

This is part five of the timezone series - you can find the previous posts on this subject to get up to speed on the details:
  1. motivation and overview
  2. parsing the zoneinfo database
  3. compute the offsets
  4. convert the rules into dates

We previously finished the first draft of tz-offset, and it works correctly in the majority of situations. But unfortunately, there is one issue with it:
The code will not work correctly when there are gaps between the years where the rules are applicable.
Fortunately, under regular use of the code, we will not encounter this issue. But the issue definitely exists. Below is the US rule:

  ("US"
   (1918 1919 - 3 (last 0) (2 0 0 w) 3600 "D")
   (1918 1919 - 10 (last 0) (2 0 0 w) 0 "S")
   (1942 1942 - 2 9 (2 0 0 w) 3600 "W")
   (1945 1945 - 8 14 (23 0 0 u) 3600 "P")
   (1945 1945 - 9 30 (2 0 0 w) 0 "S")
   (1967 2006 - 10 (last 0) (2 0 0 w) 0 "S")
   (1967 1973 - 4 (last 0) (2 0 0 w) 3600 "D")
   (1974 1974 - 1 6 (2 0 0 w) 3600 "D")
   (1975 1975 - 2 23 (2 0 0 w) 3600 "D")
   (1976 1986 - 4 (last 0) (2 0 0 w) 3600 "D")
   (1987 2006 - 4 (match 0 >= 1) (2 0 0 w) 3600 "D")
   (2007 +inf.0 - 3 (match 0 >= 8) (2 0 0 w) 3600 "D")
   (2007 +inf.0 - 11 (match 0 >= 1) (2 0 0 w) 0 "S"))
You can see that there is a gap in the applicable rules between 1945 and 1967. You are probably not going to calculate the offsets for 1948 most of the time, but it would be nice not to have to face the potential issue at all.

What we want, in this case, is to have the last rule of 1945 continue to apply until 1967. That means we need to be able to skip over multiple years, which our current design does not account for. The nice thing is that the situation is not as dire as it sounds, since most of the timezones that I inspected visually do not use the "US" rule during this gap. But it would be nice to know that such a potential bug cannot exist.

As we found out in the previous posts, inferring the "previous" rules from the data format is difficult, while computing the applicable rules for a given year is easy. So the easiest solution is to pre-compute the dates for every year that we need to worry about; that way, any gap is automatically filled with the previously applicable rules for the years in question.
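A sketch of the idea (not the post's actual code; rule-applies? and rules-for-year are illustrative helpers that rely only on the parsed rule format shown above, where the first two elements of an entry are the from/to years and to may be +inf.0):

(define (rule-applies? rule year)
  (and (<= (car rule) year) (<= year (cadr rule))))

(define (rules-for-year rules year)
  (filter (lambda (rule) (rule-applies? rule year)) rules))

;; walk every year in the range; when a year has no applicable rules,
;; carry forward the rules from the most recent year that had some, so
;; a gap such as 1946-1966 in the "US" rule is filled automatically
(define (precompute-rules rules from-year to-year)
  (let loop ((year from-year) (previous '()) (acc '()))
    (if (> year to-year)
        (reverse acc)
        (let* ((applicable (rules-for-year rules year))
               (effective (if (null? applicable) previous applicable)))
          (loop (+ year 1) effective (cons (cons year effective) acc))))))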

The Applicable Years

It should be obvious that the timezone concept has a definite bound toward the past, as GMT was not established until 1675, and the US did not adopt time zones until 1918. The minimum year in the zoneinfo database is 1916: