Monday, September 28, 2009

Handling Time Zone in Scheme (2): Parsing The Zoneinfo Database

Continuing from the motivation post on the difficulty of handling timezone, we'll now focus on solving the first problem - parsing the zoneinfo database.

As previously discussed, the zoneinfo database format is pretty straight forward:
  • comments starts from # until the end of the line
  • records are line oriented, and there are no line continuation
  • there are 3 main record types:
    • rule - describes daylight saving rules
    • zone - describes different time zones and how the offset (and the applicable daylight saving rules) change over the years
    • link - describes aliases for the different zones

What we want to accomplish is to parse the zoneinfo database and convert into scheme values that are more consumable. Let's try to map them out.

Overall Structure

It would be nice if we can parse the files into structure that looks like:

(zoneinfo (rule ("US" rule1 ...) ...) 
          (zone ("America/New_York" zone-rule1 ...) ...) 
          (link (Link1 ...) ...)) 
The above value would be easy to load back into memory (just a singe read operation). And all of the related rules are all grouped together.

Since the records are line oriented - when we parse them it would look like the following:

(rule "US" ...)
(rule "US" ...) 
...
(zone "America/New_York" ...) 
(zone "America/New_York" ...) 
...
(link ...) 
...
Assuming we are able to generate the above records, we can group them via the group we introduced during the development of the memcached client, by first grouping on rule, zone, and link, and within each group we then further group based on the sub key, and then cons 'zoneinfo in front.

;; 2-layer grouping 
(cons 'zoneinfo 
      (map (lambda (kv)
             (cons (car kv)
                   (group (cdr kv))))
           (group zis)))
Once the transformation is done - we just need to serialize the data out to a file.

(call-with-output-file <target>
  (lambda (out)
     (print zoneinfo out)) 
  #:exists 'replace) 
Then we just need to make sure that we can parse the files (yes - the zoneinfo database consists of multiple files) so we can generate the records to feed into the 2-layer grouping. Since append the results from parsing multiple files is trivial, we'll just discuss the parsing of a single file.

Parsing The Zoneinfo File

PLT Scheme offers a lot of fantastic support for parsing files. We will not go through them all here, but at the high level, there is the parser tools that are basically lex and yacc in scheme, and there is also the parser combinators that you can use to construct your monadic combinator parsers. At a low level, you can use all of the read-* functions to read just about anything you want from an input port.

But if your data are mostly symbol-like (i.e. tokens separated by whitespaces), then the plain-old read can get you very far. If necessary, you can customize the reader by customizing the underlying readtable.

PLT Scheme use a readtable internally to determine how to make sense of the input. For example, by default scheme will treat the characters after a semicolon (;) as comments (unless it exists in a string). And we can customize it to treat the characters appearing after the pound (#, or hash if you prefer) as comments. All we need to do is to make a new readtable and tell the new readtable to treat # as ; in the default readtable (the default readtable is #f in PLT Scheme).

(make-readtable (current-readtable) #\# #\; #f)) 
And this turns out to be the only difference we need to deal with in order to parse the zoneinfo database.

Given that the zoneinfo database is a line-oriented, we'll then first read in the individual lines, and then use read to read in the tokens from the lines, and finally convert the tokens into their appropriate representations.

(define (read-zoneinfo in)
  (define (read-helper in)
    (define (helper acc)
      (parameterize ((current-readtable
                      (make-readtable (current-readtable)
                                      #\# #\; #f)))
        (let ((term (read in)))
          (cond ((eof-object? term)
                 (reverse acc))
                (else (helper (cons term acc)))))))
    (helper '()))
  (define (helper acc)
    (let ((l (read-line in 'any)))
      (cond ((eof-object? l)
             (reverse acc))
            ((equal? l "")
             (helper acc))
            (else
             (let ((res (read-helper (open-input-string l))))
               (cond ((null? res)
                      (helper acc))
                     ((equal? (car res) 'Rule)
                      (helper (cons (apply parse-rule (cdr res))
                                    acc)))
                     ((equal? (car res) 'Zone)
                      (helper (cons (parse-zone (cdr res)) acc)))
                     ((equal? (car res) 'Link)
                      (helper (cons (apply parse-link (cdr res)) acc)))
                     ((equal? (caar acc) 'zone)
                      (helper (cons (parse-zone res (cdar acc)) acc)))
                     (else
                      (helper (cons res acc)))))))))
  (helper '()))
read-zoneinfo completes the bases of the zoneinfo parser. What's left to do is to fill out the details of parse-zone, parse-rule, and parse-link.

Rule Parsing

The man page of zic states the below as the format of rule:

Rule  NAME  FROM  TO    TYPE  IN   ON       AT    SAVE  LETTER/S

NAME is an arbitrary name - and we just need to convert it into string (you can argue this is not even necessary, but I like names in strings...).

FROM and TO specifies the years in integer that the rules are in effect, and in place of numbers, maximum, minimum and their abbreviations might be used to denote positive and negative infinity (TO might also have a only to mean repeat the value of FROM).

(define (from-year-helper year)
  (case year 
    ((max maximum) +inf.0) 
    ((min minimum) -inf.0) 
    (else year))) 

(define (to-year-helper from to)
    (case to
      ((maximum max) +inf.0)
      ((min minimum) -inf.0)
      ((only) from)
      (else to)))

TYPE has either a - to denote the default type (which means the time between FROM to TO are all applicable), or can be used to extend the type by executing external commands. So far there are nothing else but - in the zoneinfo database:
(define (type-helper type from/y to/y)
    (case type
      ((-) type)
      (else
       (error 'type-helper
              "unknown type ~a in ~a"
              type
              (list name from to type in on at save letter/s)))))
IN denotes the month in English (abbreviation possible) for which the rule applies - we want to convert it to the equivalent number:

  (define (month-helper in)
    (if-it (month->num in #t)
           it
           (error 'in-month-helper
                  "unknown month ~a in ~a"
                  in
                  (list name from to type in on at save letter/s))))
month->num simply converts a month in English into its equivalent numbers:

(define (month->num month (number-ok? #f))
  (define (helper mon)
    (cond ((symbol? mon)
           (helper (symbol->string mon)))
          ((string? mon)
           (string-downcase mon))
          (else #f)))
  (if (and number-ok? (number? month))
      (let ((it (modulo month 12)))
        (if (= it 0) 12 it))
      (assoc/cdr (helper month)
                 '(("jan" . 1)
                   ("january" . 1)
                   ("feb" . 2)
                   ("feburary" . 2)
                   ("mar" . 3)
                   ("march" . 3)
                   ("april" . 4)
                   ("apr" . 4)
                   ("may" . 5)
                   ("jun" . 6)
                   ("june" . 6)
                   ("jul" . 7)
                   ("july" . 7)
                   ("aug" . 8)
                   ("august" . 8)
                   ("sep" . 9)
                   ("sept" . 9)
                   ("september" . 9)
                   ("oct" . 10)
                   ("october" . 10)
                   ("nov" . 11)
                   ("november" . 11)
                   ("dec" . 12)
                   ("december" . 12)))))
ON holds the date (within the month) when the rule takes place. It can hold one of the following forms:
  • 5 - 5th of the month
  • lastSun - last sunday of the month
  • Sun>=8 - sunday on or after 8th
  • Sun<=25 - sunday on or after 25th
For the format of lastSun - we want to convert to '(last 0); and for the format of Sun>=8 - we'll convert to '(match 0 >= 8):

(define (on-the-day-helper on month)
  (define (days-greater-helper days)
    ;; we need to spit the day up
    (if-it (regexp-match #px"(Sun|Mon|Tue|Wed|Thu|Fri|Sat)(>|<|>=|<=)(\\d+)"
                         days)
           (let ((wday
                  (if-it (weekday->num (cadr it) #t)
                         it
                         (error 'wday-helper "Unkonwn weekday ~a" (cadr it))))
                 (match (string->symbol (caddr it)))
                 (day (string->number (cadddr it))))
             (list 'match wday match day))
           (error 'on-the-day-helper
                  "Unrecognized pattern ~a in ~a"
                  days
                  (list on month))))
  (cond ((symbol? on)
         (case on
           ((lastSun) ;; means the last sunday.
            (list 'last 0))
           ((lastMon) ;; means last monday.
            (list 'last 1))
           ((lastTue) ;; means last monday.
            (list 'last 2))
           ((lastWed) ;; means last monday.
            (list 'last 3))
           ((lastThu) ;; means last monday.
            (list 'last 4))
           ((lastFri) ;; means last monday.
            (list 'last 5))
           ((lastSat) ;; means last monday.
            (list 'last 6))
           (else ;; we need to split it up...
            (days-greater-helper (symbol->string on)))))
        ((number? on)
         ;; (list 'day on)
         on)
        (else
         (error 'on-the-day-helper
                "unknown on the day ~a in ~a"
                on
                (list on month)))))
AT holds the time of the date when the rule takes place, it can take one of the following forms:
  • 2 - 2:00:00
  • 2:00 - 2:00:00
  • 15:00 - 24-hour format
  • 1:28:14 - hour:minute:second
  • - - 0:00:00
And the date might be followed by w, which means wall clock time (possibly including daylight saving offsets), s means standard time (only the timezone offset; no daylight saving offsets), and g, u, and z means the GMT (no offsets). Without the indicator it means wall-clock time.

(define-struct time (hour minute second nano tz))

(define (parse-time str (make-if-not-time? #t))
  (if-it (regexp-match #px"^[+-]?(\\d+)(:(\\d+)(:(\\d+))?)?" str)
         (apply make-time 
                (map (lambda (n)
                       (if (not n) 0
                           (string->number n)))
                     (list (second it)
                           (fourth it)
                           (sixth it) #f #f)))
         (if (not make-if-not-time?) #f
             (make-time 0 0 0 0 0))))

(define (parse-extended-time str)
  (let ((time (parse-time str))
        (unit (regexp-match #px"\\d+\\s*(w|u|g|z|s)?\\s*$" str)))
    (list (time-hour time)
          (time-minute time)
          (time-second time)
          (if unit
              (if-it (second unit)
                     (string->symbol it)
                     'w)
              'w))))
;; this ought to produce a particular time (no tz attached)
(define (extended-time-helper t)
  (if (number? t)
      (list t 0 0 #f)
      (case t
        ((-) (list 0 0 0 #f))
        (else
         (parse-extended-time (symbol->string t))))))
SAVE holds the same data format as the AT field, except that there isn't a suffix to indicate whether it's a wall or standard time, and it has a sign (to indicate positive or negative offsets):

(define (parse-offset str (make-if-not-time? #t))
  (let ((time (parse-time str make-if-not-time?))
        (sign (regexp-match #px"^\\s*(\\+|-)?\\d+" str)))
    (if (not time) #f
        ((cond ((not sign) +)
               ((equal? (second sign) "-") -)
               (else +)) (h->s (time-hour time)
                               (time-minute time)
                               (time-second time))))))
  (define (save-helper save)
    (if (number? save)
        (h->s save)
        (case save
          ((-) 0)
          (else
           (parse-offset (symbol->string save)
                         )))))
With the above, then parse-rule can be written as follows:

(define (parse-rule name from to type in on at save letter/s)
  (define name-helper symbol->string)
  (define (letter/s-helper letter/s)
    (case letter/s
      ((-) "")
      (else (symbol->string letter/s))))
  (let* ((name (name-helper name))
         (from-year (from-year-helper from))
         (to-year (to-year-helper from-year to))
         (type (type-helper type from-year to-year))
         (in-month (month-helper in))
         (on-the-day (on-the-day-helper on in-month))
         (at-when (extended-time-helper at))
         (offsets (save-helper save))
         (letter/s (letter/s-helper letter/s)))
    (list 'rule name from-year to-year type in-month on-the-day
          at-when offsets letter/s)))
parse-zone and parse-link can be structured similar and will be left as reader exercises.

At this point we are able to parse the zoneinfo database, and the next step would be to manipulate the parsed values into data structures that we can use to calculate offsets. Stay tuned.

No comments:

Post a Comment