Below is the example of encoded words in message headers from RFC1342:
From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
 =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
which should be decoded into
From: Keith Moore <moore@cs.utk.edu>
To: Keld Jørn Simonsen <keld@dkuug.dk>
CC: André Pirard <PIRARD@vm1.ulg.ac.be>
Subject: If you can read this you understand the example.
But currently, net/head cannot handle encoded words, so they are left unparsed:
(extract-all-fields <the-above-string>)
;; =>
'(("From" . "=?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>")
  ("To" . "=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>")
  ("CC" . "=?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>")
  ("Subject"
   .
   "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\r\n =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?="))
So we'll need to handle it ourselves. Let's get started.

The Format of an Encoded Word
An encoded word has the following format:
=?<charset>?<Q or B>?<encoded data>?=
An encoded word should not exceed 75 bytes (including all the delimiters). If the string being encoded does not fit within that length, it has to be split into multiple encoded words separated by a space or by linear folding whitespace (\r\n\s+). Encoded words can coexist with plain text in the same header (shown above in the CC header).
There are only two encodings defined for encoded words, Q and B. They are almost identical to quoted-printable and base64, with two minor exceptions:
Q uses _ to substitute for a space
B is not terminated by \r\n
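To make the format concrete, here is a minimal sketch that splits a single encoded word into its three parts. The name split-encoded-word and the exact regular expression are assumptions made for illustration; they are not part of net/head, and a real parser needs to be more careful about which characters each part may contain.

```racket
#lang racket/base

;; Hypothetical helper: split one encoded word into its three parts.
;; Returns (list charset encoding data), or #f when str is not a
;; well-formed encoded word of at most 75 characters.
(define (split-encoded-word str)
  (define m
    (regexp-match #rx"^=[?]([^? ]+)[?]([QqBb])[?]([^? ]*)[?]=$" str))
  (and m
       (<= (string-length str) 75)
       (cdr m)))

(split-encoded-word "=?ISO-8859-1?Q?Andr=E9_?=")
;; => '("ISO-8859-1" "Q" "Andr=E9_")
(split-encoded-word "Pirard")
;; => #f
```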
We'll first generate encoded words, and then we'll parse them back.
Q Encoding
Since Q works more or less the same as quoted-printable, we can use net/qp as the base and wrap qp-encode and qp-decode.
Decoding is the more straightforward of the two, since we only have to replace _ with #x20 (the space character in ASCII) first:
(define (q-decode bstr)
  ;; convert all _ to #\space first...
  (qp-decode (regexp-replace* #px"_" bstr (list->bytes (list #x20)))))
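A quick check of q-decode against the To header from the RFC example. This is only a sketch: the definition is repeated so that the snippet runs on its own (with a byte regexp, which behaves the same for this input), and bytes->string/latin-1 is used only because this particular word is labelled ISO-8859-1.

```racket
#lang racket/base
(require net/qp)

;; Same idea as q-decode above, repeated so the snippet is self-contained.
(define (q-decode bstr)
  (qp-decode (regexp-replace* #rx#"_" bstr #" ")))

;; The decoded result is raw octets in the word's charset,
;; so it still has to be converted to a string.
(bytes->string/latin-1 (q-decode #"Keld_J=F8rn_Simonsen"))
;; => "Keld Jørn Simonsen"
```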
Encoding works similarly, except that we need to escape more characters than qp-encode does, since the output must not conflict with the encoded-word delimiters (=, ?) and cannot include spaces, tabs, newlines, etc.:
;; convert the character to its =XX escape (two uppercase hex digits),
;; returned as a list of bytes...
(define (char->q-bytes c)
  (let* ((n (char->integer c))
         (hex (string-upcase (number->string n 16))))
    (bytes->list
     (string->bytes/utf-8
      (string-append "=" (if (< n 16) "0" "") hex)))))
(define BYTE:_ (char->integer #\_))
(define Q-BYTES:_ (char->q-bytes #\_))
(define BYTE:space (char->integer #\space))
(define Q-BYTES:space (list (char->integer #\_)))
(define BYTE:tab (char->integer #\tab))
(define Q-BYTES:tab (char->q-bytes #\tab))
(define BYTE:open-paren (char->integer #\())
(define Q-BYTES:open-paren (char->q-bytes #\())
(define BYTE:close-paren (char->integer #\)))
(define Q-BYTES:close-paren (char->q-bytes #\)))
(define BYTE:? (char->integer #\?))
(define Q-BYTES:? (char->q-bytes #\?))
(define (q-encode bstr)
  ;; push the bytes onto acc; acc is reversed at the end,
  ;; so the bytes come out in their original order...
  (define (push bytes acc)
    (cond ((null? bytes) acc)
          (else
           (push (cdr bytes) (cons (car bytes) acc)))))
  (define (helper in acc)
    (let ((c (read-byte in)))
      (cond ((eof-object? c) ;; we are done...
             (list->bytes (reverse acc)))
            ((= c BYTE:_)
             (helper in (push Q-BYTES:_ acc)))
            ((= c BYTE:space)
             (helper in (push Q-BYTES:space acc)))
            ((= c BYTE:tab)
             (helper in (push Q-BYTES:tab acc)))
            ((= c BYTE:open-paren)
             (helper in (push Q-BYTES:open-paren acc)))
            ((= c BYTE:close-paren)
             (helper in (push Q-BYTES:close-paren acc)))
            ((= c BYTE:?)
             (helper in (push Q-BYTES:? acc)))
            (else (helper in (cons c acc))))))
  (helper (open-input-bytes (qp-encode bstr)) '()))
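For comparison, the same escaping rules can be applied directly to the input bytes, without going through qp-encode. This is a sketch under stated assumptions, not a replacement for the code above: the name q-encode/direct is hypothetical, and the set of bytes left unescaped (letters, digits, and ! * + - /) is a deliberately conservative choice, so it escapes more than strictly necessary.

```racket
#lang racket/base

;; Bytes that are emitted literally: ASCII letters, digits, and ! * + - /
;; (a conservative set chosen for this sketch).
(define (safe-byte? b)
  (or (<= 48 b 57)    ; 0-9
      (<= 65 b 90)    ; A-Z
      (<= 97 b 122)   ; a-z
      (and (memv b (map char->integer '(#\! #\* #\+ #\- #\/))) #t)))

;; =XX with two uppercase hex digits, as a byte string.
(define (byte->q-bytes b)
  (define hex (string-upcase (number->string b 16)))
  (string->bytes/utf-8
   (string-append "=" (if (< b 16) "0" "") hex)))

;; Hypothetical name; not part of net/qp.
(define (q-encode/direct bstr)
  (apply bytes-append
         (for/list ([b (in-bytes bstr)])
           (cond [(= b 32) #"_"]             ; space becomes _
                 [(safe-byte? b) (bytes b)]  ; safe bytes pass through
                 [else (byte->q-bytes b)]))))

(q-encode/direct (string->bytes/utf-8 "Keld J\u00F8rn Simonsen"))
;; => #"Keld_J=C3=B8rn_Simonsen"
```

Because every byte is handled individually, this variant never emits the soft line breaks that quoted-printable uses for long lines.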
B Encoding
Similarly, B works almost the same as base64, which is provided by net/base64. Decoding works exactly the same, so we just rename base64-decode to b-decode. For encoding, we only need to trim the \r\n at the end of the base64 output:
(define (b-encode bstr)
  (let ((bout (base64-encode bstr)))
    (subbytes bout 0 (- (bytes-length bout) 2))))
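The rename mentioned above is not shown as code, so here is a small sketch of it, together with an alternative encoder. As documented for net/base64, base64-encode also breaks its output into 72-character lines, so trimming the last two bytes is only enough for short inputs, and an empty input has no \r\n to trim at all. Passing an empty byte string as the optional line separator avoids both cases. The name b-encode/no-breaks is hypothetical.

```racket
#lang racket/base
(require net/base64)

;; Decoding needs no changes, so b-decode is just another name.
(define b-decode base64-decode)

;; Hypothetical variant: with an empty line separator,
;; no \r\n is emitted anywhere in the output.
(define (b-encode/no-breaks bstr)
  (base64-encode bstr #""))

(b-decode #"SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=")
;; => #"If you can read this yo"
(b-encode/no-breaks #"If you can read this yo")
;; => #"SWYgeW91IGNhbiByZWFkIHRoaXMgeW8="
```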
With both encodings available, it is straightforward to generate a single encoded word:
(define (encode-encoded-word charset encode str)
  (format "=?~a?~a?~a?="
          (string-downcase charset)
          (string-upcase encode)
          ((cond ((string-ci=? encode "q") q-encode)
                 ((string-ci=? encode "b") b-encode)
                 ;; anything other than Q or B is an error...
                 (else (error 'encode-encoded-word "unknown encoding: ~a" encode)))
           (string->bytes/utf-8 str))))
Calling it would generate the following result:
> (encode-encoded-word "utf-8" "q" "Keld Jørn Simonsen")
"=?utf-8?Q?Keld_J=C3=B8rn_Simonsen?="
> (encode-encoded-word "utf-8" "b" "If you can read this you understand the example.")
"=?utf-8?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW91IHVuZGVyc3RhbmQgdGhlIGV4YW1wbGUu?="
The above code, however, has a bug: the string is always converted to bytes as utf-8, so the declared charset will not match the actual bytes whenever the charset is not utf-8. Generally speaking this is not a big issue, since utf-8 is superior to just about every other charset and is the proper choice in most situations. However, we should not leave such a bug in our code, so we'll fix it in the next post. Stay tuned.
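To see the mismatch concretely, the sketch below converts the same string to bytes under two charsets. It is only an illustration of the problem, not the fix the next post will present; the helper name string->bytes/charset is hypothetical, and it handles just the two charsets for which racket/base has a built-in conversion.

```racket
#lang racket/base

;; Hypothetical helper: produce the octets that match the declared charset.
;; Only utf-8 and iso-8859-1 are handled; anything else is rejected.
(define (string->bytes/charset charset str)
  (cond [(string-ci=? charset "utf-8") (string->bytes/utf-8 str)]
        [(string-ci=? charset "iso-8859-1") (string->bytes/latin-1 str)]
        [else (error 'string->bytes/charset
                     "unsupported charset: ~a" charset)]))

;; The same character becomes two octets in utf-8 but one in iso-8859-1.
(string->bytes/charset "utf-8" "J\u00F8rn")      ; => #"J\303\270rn"
(string->bytes/charset "iso-8859-1" "J\u00F8rn") ; => #"J\370rn"
```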