Wednesday, October 14, 2009

Parsing "Encoded Word" in RFC Headers (2) - Charset Handling & Multiple Encoded Words

In the previous post we discussed the Q and B encodings, and ended with a charset mismatch bug that appears when the charset is not utf-8. Let's try to fix that bug here.

It would be nice to be able to use local charsets such as iso-8859-1 or big5 when we know for sure that the charset contains all of the characters that appear in the string. (Of course, it is the developer's responsibility to choose the right charset; the code will raise an error if the charset does not match the data.)

PLT Scheme provides convert-stream to convert bytes from one charset to another. On top of this function we can build helpers that take strings or bytes and return strings or bytes. What we want is something like:

(bytes/charset->string #"this is a string" "ascii") ;; => returns a string
(bytes/charset->bytes/utf-8 <bytes> <charset>) ;; => returns a bytes
The idea is to wrap the input data in an input port, run the conversion, and collect the result from an output port backed by a byte string.

So let's start with a helper function that takes an input port and the two charsets, and returns the converted bytes:

(define (port->bytes/charset in charset-in charset-out)
  (call-with-output-bytes 
   (lambda (out)
     (convert-stream charset-in in charset-out out))))
Then we can have the following:

(define (bytes->bytes/charset bytes charset-in charset-out)
  (port->bytes/charset (open-input-bytes bytes) charset-in charset-out))
And we can define converting bytes to and from utf-8:

(define (bytes/charset->bytes/utf-8 bytes charset)
  (bytes->bytes/charset bytes charset "utf-8")) 

(define (bytes/utf-8->bytes/charset bytes charset)
  (bytes->bytes/charset bytes "utf-8" charset))
Finally, we can build the string versions on top of these two functions:

;; there are more to handle (specifically charsets).
(define (bytes/charset->string bytes charset)
  (bytes->string/utf-8 (bytes/charset->bytes/utf-8 bytes charset)))

(define (string->bytes/charset string charset)
  (bytes/utf-8->bytes/charset (string->bytes/utf-8 string) charset))
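
As a quick check of the helpers above, here is a minimal sketch of a round trip through iso-8859-1. It assumes a current Racket (#lang racket, where convert-stream comes from racket/port) and an iconv installation that knows the charset. The helpers are restated compactly so the block runs on its own, and the sample strings are made up for illustration.

```racket
#lang racket
;; Compact restatement of the helpers above (same names, same behavior).
(define (bytes->bytes/charset bs charset-in charset-out)
  (call-with-output-bytes
   (lambda (out)
     (convert-stream charset-in (open-input-bytes bs) charset-out out))))

(define (string->bytes/charset str charset)
  (bytes->bytes/charset (string->bytes/utf-8 str) "utf-8" charset))

(define (bytes/charset->string bs charset)
  (bytes->string/utf-8 (bytes->bytes/charset bs charset "utf-8")))

;; "ø" is the single byte #xF8 in iso-8859-1 (it takes two bytes in utf-8).
(define latin-1-bytes (string->bytes/charset "Jørn" "iso-8859-1"))
(bytes-length latin-1-bytes)                        ; => 4
(bytes/charset->string latin-1-bytes "iso-8859-1")  ; => "Jørn"

;; convert-stream is documented to raise exn:fail on a conversion error.
;; The handler turns that into a symbol so the outcome is easy to see.
(with-handlers ((exn:fail? (lambda (e) 'conversion-failed)))
  (string->bytes/charset "孫" "iso-8859-1"))
```

Whether the last expression actually fails depends on how the underlying iconv treats characters it cannot represent, so treat it as something to try rather than a guaranteed result.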
With the above functions, we can now make sure the string is converted to the correct charset before it is encoded:


(define (encode-encoded-word charset encode str)
  (format "=?~a?~a?~a?=" 
          (string-downcase charset)
          (string-upcase encode)
          ((cond ((string-ci=? encode "q") q-encode)
                 ((string-ci=? encode "b") b-encode)
                 (else (error 'encode-encoded-word "unknown encoding: ~a" encode)))
           (string->bytes/charset str charset))))
Notice that encoding the same string with different charsets now results in different encoded words:

> (encode-encoded-word "iso-8859-1" "q" "Keld Jørn Simonsen")
"=?iso-8859-1?Q?Keld_J=F8rn_Simonsen?="
> (encode-encoded-word "utf-8" "q" "Keld Jørn Simonsen")
"=?utf-8?Q?Keld_J=C3=B8rn_Simonsen?="
So now the bug is fixed.

Convert a String of Arbitrary Length into an Encoded Word String

An encoded word may be at most 75 characters long. When a string needs more than that, we have to convert it into multiple encoded words, separated by folding whitespace (CRLF followed by a space, i.e. \r\n\s).

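Since both Q and B encoding lengthen the data (B by 33%, because Base64 emits 4 characters for every 3 bytes), we cannot fit 75 bytes of input into one encoded word. For utf-8 the delimiters take 12 characters (=?utf-8?B? plus ?=), which leaves 63 for the encoded text, and 63 divided by 1.33 is roughly 47 bytes. The code below uses 48 bytes per encoded word, which is slightly generous: 48 bytes Base64-encode to 64 characters, so a full utf-8 chunk yields a 76-character encoded word.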
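
The byte budget can also be worked out mechanically. The following is a small sketch, not code from the original post; the function names are hypothetical, and it only covers B encoding, where the output length is predictable.

```racket
#lang racket
;; Characters used by everything except the encoded text:
;; "=?" + charset + "?" + encoding letter + "?" + "?=", i.e. charset length + 7.
(define (encoded-word-overhead charset)
  (+ 7 (string-length charset)))

;; Base64 emits 4 characters per 3 input bytes, so only whole
;; 4-character groups that fit in the remaining room can be used.
(define (max-b-encoded-bytes charset (limit 75))
  (* 3 (quotient (- limit (encoded-word-overhead charset)) 4)))

;; Total length of a B-encoded word carrying n bytes (padding included).
(define (b-encoded-word-length charset n)
  (+ (encoded-word-overhead charset) (* 4 (ceiling (/ n 3)))))

(max-b-encoded-bytes "utf-8")        ; => 45
(b-encoded-word-length "utf-8" 45)   ; => 72
(b-encoded-word-length "utf-8" 48)   ; => 76
```

So for utf-8, 45 bytes is the largest chunk that stays within 75 characters, while a 48-byte chunk lands one character over. A longer charset name shrinks the budget further.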

Also, since some of the characters will be multi-byte, we must not break the string in the middle of a character; every split has to fall on a character boundary.

Let's get started.

The following function splits a string according to a maximum byte length:

(define (split-string-by-bytes-count str num)
  (define (maker chars)
    (list->string (reverse chars)))
  (define (helper str i chars blen acc)
    (if (= i (string-length str)) ;; end of input: flush the pending chunk, if any
        (reverse (if (null? chars) acc
                     (cons (maker chars) acc)))
        (let* ((c (string-ref str i))
               (count (char-utf-8-length c))) 
          (if (> (+ count blen) num) ;; c does not fit in the current chunk
              (if (= blen 0) ;; the chunk is empty, so c alone exceeds num: emit it by itself
                  (helper str (add1 i) '() 0 (cons (maker (cons c chars)) acc))
                  (helper str i '() 0 (cons (maker chars) acc)))
              (helper str (add1 i) (cons c chars) (+ count blen) acc)))))
  (helper str 0 '() 0 '()))
It accumulates characters until adding the next character's byte length would exceed the maximum byte count; that character then starts the next chunk instead. When a single character's byte length is itself larger than the maximum, the character gets a string of its own (so passing 0 gives a per-character split). Byte lengths are measured in UTF-8, via char-utf-8-length.

> (split-string-by-bytes-count "孫中山畢業於香港西醫書院" 0)
("孫" "中" "山" "畢" "業" "於" "香" "港" "西" "醫" "書" "院")
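
To sanity-check a split, one can verify two properties: the chunks concatenate back to the original string, and every chunk either fits the byte limit or is a single character. The helper below is a hypothetical sketch (valid-split? is not part of the original code); it takes the chunk list as data, so it runs without the splitter itself.

```racket
#lang racket
;; Hypothetical checker: #t when chunks are a lossless split of original
;; and each chunk is within max-bytes (UTF-8) or is one lone character.
(define (valid-split? original chunks max-bytes)
  (and (string=? original (apply string-append chunks))
       (andmap (lambda (chunk)
                 (or (= (string-length chunk) 1)
                     (<= (string-utf-8-length chunk) max-bytes)))
               chunks)))

;; The per-character split shown above, with a limit of 0.
(valid-split? "孫中山畢業於香港西醫書院"
              '("孫" "中" "山" "畢" "業" "於" "香" "港" "西" "醫" "書" "院")
              0)                                  ; => #t

;; Each of these characters takes 3 bytes in UTF-8.
(valid-split? "孫中山" '("孫中" "山") 6)           ; => #t
(valid-split? "孫中山" '("孫中山") 6)              ; => #f
```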
Once we can split the string by maximum byte count, we can encode each of the split strings separately and then join them together with \r\n\s:

(define (string->encoded-words s charset)
  (define (helper s)
    (case (string-type s)
      ((ascii) s)
      ((latin-1) (encode-encoded-word "iso-8859-1" "q" s))
      (else (encode-encoded-word charset "b" s))))
  (map helper (split-string-by-bytes-count s 48))) 

(define (string->encoded-word-string s (charset "utf-8"))
  (string-join (string->encoded-words s charset) "\r\n "))
Notice that the code above tests whether each chunk is an ascii string or a latin-1 string, because ascii does not need to be encoded, and Q is a better encoding for latin-1 strings. Also notice that charset only affects chunks that contain characters outside of latin-1.

string-type is defined as follows:

(define (char-type c)
  (let ((i (char->integer c))) 
    (cond ((< i 128) 'ascii)
          ((< i 256) 'latin-1)
          (else 'unicode))))

(define (string-type s)
  (define (helper len i prev)
    (if (= len i) prev
        (let ((type (char-type (string-ref s i))))
          (case type 
            ((unicode) type)
            ((latin-1) 
             (helper len (add1 i) (case prev
                                    ((ascii) type)
                                    (else prev))))
            (else (helper len (add1 i) prev))))))
  (helper (string-length s) 0 'ascii))
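
Another way to read this classification: the type of a string is decided by its highest code point. The sketch below expresses that directly. string-type/max is a hypothetical name, not a replacement the original post proposes, and unlike the version above it always scans the whole string instead of stopping at the first character beyond latin-1.

```racket
#lang racket
;; Hypothetical variant: classify a string by its highest code point.
(define (string-type/max s)
  (define highest
    (for/fold ((m 0)) ((c (in-string s)))
      (max m (char->integer c))))
  (cond ((< highest 128) 'ascii)
        ((< highest 256) 'latin-1)
        (else 'unicode)))

(string-type/max "Keld Simonsen")       ; => ascii
(string-type/max "Keld Jørn Simonsen")  ; => latin-1
(string-type/max "中國")                 ; => unicode
(string-type/max "")                    ; => ascii
```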
With the above, we can now encode strings into encoded words:

> (string->encoded-word-string "Keld Jørn Simonsen")
;; => 
=?iso-8859-1?Q?Keld_J=F8rn_Simonsen?=
> (string->encoded-word-string "伦敦(英文:London,讀音:/ˈlʌndən/ 文件-播放)是英格蘭和英國的首都、第一大城及第一大港")
;; => 
=?utf-8?B?5Lym5pWmKOiLseaWhzpMb25kb24s6K6A6Z+zOi/LiGzKjG5kyZluLyDmlofku7Yt?=
 =?utf-8?B?5pKt5pS+KeaYr+iLseagvOiYreWSjOiLseWci+eahOmmlumDveOAgeesrOS4gA==?=
 =?utf-8?B?5aSn5Z+O5Y+K56ys5LiA5aSn5riv?=
> (string->encoded-word-string "China (simplified Chinese: 中国; traditional Chinese: 中國; Hanyu Pinyin: zh-zhongguo.ogg Zhōngguó (help·info); Tongyong Pinyin: Jhongguó; Wade-Giles: Chung1kuo2) is a cultural region, an ancient civilization, and, depending on perspective, a national or multinational entity extending over a large area in East Asia.")
;; => 
=?utf-8?B?Q2hpbmEgKHNpbXBsaWZpZWQgQ2hpbmVzZTog5Lit5Zu9OyB0cmFkaXRpb25hbCBD?=
 =?utf-8?B?aGluZXNlOiDkuK3lnIs7IEhhbnl1IFBpbnlpbjogemgtemhvbmdndW8ub2dnIFpo?=
 =?utf-8?B?xY1uZ2d1w7MgKGhlbHDCt2luZm8pOyBUb25neW9uZyBQaW55aW46IEpob25nZ3U=?=
 =?iso-8859-1?Q?=F3;_Wade-Giles:_Chung1kuo2=)_is_a_cultural_region?=
 , an ancient civilization, and, depending on per
 spective, a national or multinational entity ext
 ending over a large area in East Asia.
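
One thing to keep in mind when reading the last example: a receiver unfolds a header by removing each CRLF that is followed by whitespace, and the whitespace itself stays. Whitespace between two adjacent encoded words is ignored when they are decoded, but a plain ascii chunk that was cut in the middle of a word keeps the extra space. The sketch below illustrates this with a hypothetical unfold-header helper; it is only a rough model of unfolding, not a full header parser.

```racket
#lang racket
;; Hypothetical helper: drop each CRLF that is followed by a space or tab,
;; keeping the whitespace character itself.
(define (unfold-header s)
  (regexp-replace* #rx"\r\n([ \t])" s "\\1"))

(unfold-header "depending on per\r\n spective")
;; => "depending on per spective"
```

Splitting plain ascii text at existing spaces, or encoding those chunks as well, would avoid the stray space.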
At this point, the generation of the encoded word string is complete. Our next step is to parse such an encoded word string back into its original form. Stay tuned.
