Tuesday, October 13, 2009

Parsing "Encoded Word" in RFC Headers

If you want to correctly handle internet message headers as defined in RFC822 and revised by RFC2822, you'll find that net/head currently has no way of handling encoded words, which are defined separately in RFC1342 (since superseded by RFC2047).

Below is the example of encoded words in message headers from RFC1342:

From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
 =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

which should be decoded into:

From: Keith Moore <moore@cs.utk.edu>
To: Keld Jørn Simonsen <keld@dkuug.dk>
CC: André Pirard <PIRARD@vm1.ulg.ac.be>
Subject: If you can read this you understand the example.

But currently, net/head cannot handle encoded words, so they are left unparsed:

(extract-all-fields <the-above-string>)
;; => 
'(("From" . "=?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>")
  ("To" . "=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>")
  ("CC" . "=?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>")
  ("Subject" . "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\r\n =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?="))

So we'll need to handle it ourselves. Let's get started.

The Format of an Encoded Word

An encoded word has the following format:

=?<charset>?<Q or B>?<encoded data>?=

An encoded word should not exceed 75 bytes, including the charset, the encoding, and all the delimiters. If the string being encoded does not fit within that length, it is split into multiple encoded words separated by a space or by linear folding whitespace (\r\n\s+). Encoded words can coexist with plain text in the same header (as shown above in the CC header).
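
As a rough illustration of the format (this is not code from net/head), a single encoded word can be split into its three parts with a regular expression. The names encoded-word-rx and split-encoded-word are hypothetical, and the pattern assumes the word contains no embedded spaces:

```racket
#lang racket/base

;; Hypothetical pattern: =?<charset>?<Q or B>?<encoded data>?=
(define encoded-word-rx
  #px"^=\\?([^? ]+)\\?([QqBb])\\?([^? ]*)\\?=$")

;; Returns (list charset encoding data), or #f when str is not a single
;; encoded word or is longer than the 75-byte limit.
(define (split-encoded-word str)
  (let ((m (regexp-match encoded-word-rx str)))
    (and m
         (<= (string-length str) 75) ; the limit includes the delimiters
         (cdr m))))

(split-encoded-word "=?ISO-8859-1?Q?Andr=E9_?=")
;; => '("ISO-8859-1" "Q" "Andr=E9_")
```

Plain text such as "Pirard" does not match the pattern, so the helper returns #f for it.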

There are only two encodings defined for encoded words, Q and B. They are almost identical to quoted-printable and base64, with some minor exceptions:
  • Q uses _ as a substitute for space
  • B is not terminated by \r\n

We'll first generate encoded words, and then we'll parse them back.

Q Encoding
Since Q works more or less the same as quoted-printable, we can use net/qp as the base and wrap qp-encode and qp-decode.

Decoding is the more straightforward direction, since we only have to replace _ with #x20 (the ASCII space) before decoding:

(require net/qp)

(define (q-decode bstr)
  ;; convert every _ to a space before handing the bytes to qp-decode
  (qp-decode (regexp-replace* #rx#"_" bstr #" ")))
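
As a quick check, here is q-decode applied to the payload of the To header from the RFC example. This is only a usage sketch: q-decode returns raw bytes, and converting them with bytes->string/latin-1 is an assumption that holds here because the header declares ISO-8859-1.

```racket
#lang racket/base
(require net/qp)

;; q-decode as described above: _ becomes a space, then qp-decode does the rest
(define (q-decode bstr)
  (qp-decode (regexp-replace* #rx#"_" bstr #" ")))

;; The result is a byte string in the charset named by the encoded word,
;; so it still has to be converted to a string with the matching decoder.
(bytes->string/latin-1 (q-decode #"Keld_J=F8rn_Simonsen"))
;; => "Keld Jørn Simonsen"
```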

The encoding works similarly, except that we need to escape more characters than qp-encode does: the result must not conflict with the encoded-word delimiters (= and ?), and it cannot contain spaces, tabs, newlines, etc.:

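;; convert a character to its Q escape: = followed by two uppercase hex digits
(define (char->q-bytes c)
  (let ((hex (string-upcase (number->string (char->integer c) 16))))
    (bytes->list
     (string->bytes/utf-8
      (string-append "=" (if (< (string-length hex) 2) "0" "") hex)))))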

(define BYTE:_ (char->integer #\_))
(define Q-BYTES:_ (char->q-bytes #\_))
(define BYTE:space (char->integer #\space))
(define Q-BYTES:space (list (char->integer #\_)))
(define BYTE:tab (char->integer #\tab))
(define Q-BYTES:tab (char->q-bytes #\tab))
(define BYTE:open-paren (char->integer #\())
(define Q-BYTES:open-paren (char->q-bytes #\())
(define BYTE:close-paren (char->integer #\)))
(define Q-BYTES:close-paren (char->q-bytes #\)))
(define BYTE:? (char->integer #\?))
(define Q-BYTES:? (char->q-bytes #\?))

(define (q-encode bstr)
  ;; push the bytes onto acc one at a time (acc is reversed at the end)
  (define (push bytes acc)
    (cond ((null? bytes) acc)
          (else (push (cdr bytes) (cons (car bytes) acc)))))
  (define (helper in acc)
    (let ((c (read-byte in)))
      (cond ((eof-object? c) ;; we are done...
             (list->bytes (reverse acc)))
            ((= c BYTE:_)
             (helper in (push Q-BYTES:_ acc)))
            ((= c BYTE:space)
             (helper in (push Q-BYTES:space acc)))
            ((= c BYTE:tab)
             (helper in (push Q-BYTES:tab acc)))
            ((= c BYTE:open-paren)
             (helper in (push Q-BYTES:open-paren acc)))
            ((= c BYTE:close-paren)
             (helper in (push Q-BYTES:close-paren acc)))
            ((= c BYTE:?)
             (helper in (push Q-BYTES:? acc)))
            (else (helper in (cons c acc))))))
  (helper (open-input-bytes (qp-encode bstr)) '()))

B Encoding

Similarly, B works almost the same as base64, which is provided by net/base64. Decoding works exactly the same, so we just rename base64-decode to b-decode. For encoding, we only need to trim the \r\n at the end of the base64 output:

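(require net/base64)
(define b-decode base64-decode)
(define (b-encode bstr)
  ;; base64-encode ends its output with \r\n, so drop the last two bytes
  (let ((bout (base64-encode bstr)))
    (subbytes bout 0 (- (bytes-length bout) 2))))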
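
A small round trip shows the trimming at work. The sample bytes are the text behind the first Subject word in the RFC example; the name sample is just for illustration.

```racket
#lang racket/base
(require net/base64)

;; b-encode as defined in this post: base64 minus the trailing \r\n
(define (b-encode bstr)
  (let ((bout (base64-encode bstr)))
    (subbytes bout 0 (- (bytes-length bout) 2))))

(define b-decode base64-decode)

(define sample #"If you can read this yo")

(b-encode sample)
;; => #"SWYgeW91IGNhbiByZWFkIHRoaXMgeW8="
(equal? (b-decode (b-encode sample)) sample)
;; => #t
```

Note that this trimming assumes the input is non-empty and short enough to fit on a single base64 line, since base64-encode breaks longer output into multiple lines. Both assumptions hold for a payload that fits in one 75-byte encoded word.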

With the encoding mechanisms now available, it is straightforward to generate a single encoded word:

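(define (encode-encoded-word charset encode str)
  (format "=?~a?~a?~a?="
          (string-downcase charset)
          (string-upcase encode)
          ;; pick the encoder by name; anything other than Q or B is an error
          ((cond ((string-ci=? encode "q") q-encode)
                 ((string-ci=? encode "b") b-encode)
                 (else (error 'encode-encoded-word "unknown encoding: ~a" encode)))
           (string->bytes/utf-8 str))))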
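
Calling it would generate the following results:

> (encode-encoded-word "utf-8" "q" "Keld Jørn Simonsen")
"=?utf-8?Q?Keld_J=C3=B8rn_Simonsen?="
> (encode-encoded-word "utf-8" "b" "If you can read this you understand the example.")
"=?utf-8?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW91IHVuZGVyc3RhbmQgdGhlIGV4YW1wbGUu?="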

The above code, however, has a bug: the string is always converted to bytes as utf-8, so the charset label will not match the actual charset of the encoded data whenever the charset is not utf-8.

Generally speaking this is not a big issue, since utf-8 is superior to just about every other charset and is the proper choice in normal situations. However, we should not have such a bug in our code, so we'll fix it in the next post. Stay tuned.
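
In the meantime, one possible direction is sketched below. It is not necessarily the fix the follow-up post uses. The helper name string->bytes/charset is hypothetical, and it only covers the two charsets that the built-in string->bytes/utf-8 and string->bytes/latin-1 can handle; other charsets would need a real converter, such as one obtained from bytes-open-converter.

```racket
#lang racket/base

;; Hypothetical helper: pick the byte conversion that matches the charset
;; label, so the label and the encoded data agree.
(define (string->bytes/charset str charset)
  (cond ((string-ci=? charset "utf-8") (string->bytes/utf-8 str))
        ((string-ci=? charset "iso-8859-1") (string->bytes/latin-1 str))
        (else (error 'string->bytes/charset
                     "unsupported charset: ~a" charset))))

(string->bytes/charset "Jørn" "iso-8859-1") ; 4 bytes: ø is the single byte #xF8
(string->bytes/charset "Jørn" "utf-8")      ; 5 bytes: ø is #xC3 #xB8
```

encode-encoded-word could then call such a helper with its charset argument instead of always calling string->bytes/utf-8.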
