Thursday, October 15, 2009

Parsing "Encoded Word" in RFC Headers (3) - Parsing & Decoding

In the previous two posts we built the ability to generate encoded words; now it's time to decode them. We'll start by detecting whether a string is an encoded word.

Testing for Encoded Word

Remember that an encoded word has the following format:

=?<charset>?<Q or B>?<encoded data>?=
The above format can be succinctly specified with a regular expression:

(define encoded-word-regexp #px"=\\?([^\\?]+)\\?(?i:(b|q))\\?([^\\?]+)\\?=")

(define (encoded-word? str)
  (regexp-match encoded-word-regexp str))
Since the format is not recursive, a regular expression is good enough, even if it looks ugly. You could certainly try other approaches, such as a hand-written lexer or parser-tools. For now we keep things simple.
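To make the behavior of the test concrete, here is a small sketch that repeats the two definitions above and runs them on a few inputs. The sample strings are only illustrations; the point is to show what the match list looks like, and that the pattern is not anchored to the start or end of the string.

```racket
#lang racket/base

;; Same definitions as in the post, repeated so the snippet runs on its own.
(define encoded-word-regexp #px"=\\?([^\\?]+)\\?(?i:(b|q))\\?([^\\?]+)\\?=")

(define (encoded-word? str)
  (regexp-match encoded-word-regexp str))

;; A lone encoded word: the full match, then charset, encoding, and data.
(encoded-word? "=?ISO-8859-1?Q?Andr=E9_?=")
;; => '("=?ISO-8859-1?Q?Andr=E9_?=" "ISO-8859-1" "Q" "Andr=E9_")

;; Plain text does not match.
(encoded-word? "Pirard")
;; => #f

;; The pattern is unanchored, so text around the encoded word is skipped
;; and the result is the same list as in the first call.
(encoded-word? "CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard")
```

Note that encoded-word? returns the match list rather than a plain boolean, so the captured pieces are available to whoever calls it.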

Decoding

Once we can test whether a string is an encoded word, we can use that test to drive the decoding:

(define (encoded-word->string str)
  (if-it (encoded-word? str)
         (apply decode-encoded-word (cdr it))
         str))
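If the string is an encoded word, we decode it (if-it binds the result of the test to it, so (cdr it) holds the captured charset, encoding, and data); otherwise we return it verbatim. This lets ordinary strings pass through the function unchanged.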

The decode-encoded-word function looks like the following:


(define (decode-encoded-word charset encode str)
  (bytes/charset->string
   ((cond ((string-ci=? encode "q") q-decode)
          ((string-ci=? encode "b") b-decode))
    (string->bytes/utf-8 str))
   (string-downcase charset)))
Of course, if the charset and the bytes do not match, the conversion will raise an error. That is a sensible choice, since a mismatch can only come from a bug in whatever generated the encoded word.
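For readers who do not have the helpers from the earlier posts at hand, the following is a minimal, self-contained sketch of the same decoding step. The starred names (q-decode*, bytes/charset->string*, decode-encoded-word*) are hypothetical stand-ins, not the functions from this series: Q decoding is done with two regexp replacements, B decoding uses base64-decode from net/base64, and the charset step only knows UTF-8 and ISO-8859-1 through Racket's built-in conversions.

```racket
#lang racket/base
(require net/base64)

;; Stand-in Q decoder: "_" means space, "=XX" is a hex-encoded byte.
;; Underscores are replaced first so that a decoded "=5F" stays "_".
(define (q-decode* bstr)
  (regexp-replace* #rx#"=([0-9A-Fa-f][0-9A-Fa-f])"
                   (regexp-replace* #rx#"_" bstr #" ")
                   (lambda (all hex)
                     (bytes (string->number (bytes->string/latin-1 hex) 16)))))

;; Stand-in charset conversion: only two charsets are supported here.
(define (bytes/charset->string* bstr charset)
  (cond ((string=? charset "utf-8") (bytes->string/utf-8 bstr))
        ((member charset '("iso-8859-1" "latin1")) (bytes->string/latin-1 bstr))
        (else (error 'bytes/charset->string* "unsupported charset: ~a" charset))))

(define (decode-encoded-word* charset encode str)
  (let ((decode (cond ((string-ci=? encode "q") q-decode*)
                      ((string-ci=? encode "b") base64-decode)
                      (else (error 'decode-encoded-word*
                                   "unknown encoding: ~a" encode)))))
    (bytes/charset->string* (decode (string->bytes/utf-8 str))
                            (string-downcase charset))))

(decode-encoded-word* "ISO-8859-1" "Q" "Andr=E9_")
;; => "André "
(decode-encoded-word* "UTF-8" "B" "SGVsbG8=")
;; => "Hello"
```

Declaring the wrong charset, for example calling it with "UTF-8" on the Q-encoded data above, raises an exception from bytes->string/utf-8, which is the error behavior described in the text.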

Now that we can decode a single encoded word, we need to handle a string in which multiple encoded words are intermixed with unencoded text.

Decoding Multiple Encoded Words

While RFC822 does not define an actual maximum length for header values, it treats headers longer than 72 characters as "long": back then, users wanted to be able to read headers in a terminal, so the RFC built in the ability to "fold" a line into multiple lines using folding whitespace (\r\n followed by a space or tab).

So a line of

"this is a line and it continues \r\n
 on the next line"
should be unfolded into

"this is a line and it continues on the next line"
And since an encoded word can be at most 75 characters long, having several of them means that the header will most likely be folded, with a high likelihood that each physical line holds a single encoded word (or is not encoded at all).
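The unfolding shown above can be sketched with a single regular expression. This is only a shortcut for illustration, not the read-folded-line function from the earlier post; the name unfold is made up here, and the sketch assumes that the whitespace starting a continuation line is dropped, as in the example.

```racket
#lang racket/base

;; Remove each CRLF together with the spaces or tabs that follow it.
;; Assumption: the leading whitespace of a continuation line is discarded.
(define (unfold str)
  (regexp-replace* #px"\r\n[ \t]+" str ""))

(unfold "this is a line and it continues \r\n on the next line")
;; => "this is a line and it continues on the next line"
```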

We have previously discussed how to unfold such a line with read-folded-line, so we could use it to read in the whole logical line first and then parse the encoded words out of it. But this requires quite a bit of work, since:
  • our regex test for an encoded word will consume and throw away any bytes that are not part of an encoded word, which is not what we want
  • if we do not want to throw away those bytes, we have to look for a different approach: either writing a custom lexer or using parser-tools
  • if we take that approach, then what we have written so far is useless
Or is it? Let's see how much of what we have can be salvaged before we look for another solution.

As stated above, a very likely scenario for a line with multiple encoded words is that each encoded word sits on its own physical line (and a line that is not encoded contains no encoded words at all). So a very simple approach is to let encoded-word->string handle the conversion of each line while read-folded-line accumulates and unfolds the lines. This requires us to modify read-folded-line so that it accepts a convert function:

(define (read-folded-line in (convert identity)) 
  (define (folding? c)
    (or (equal? c #\space)
        (equal? c #\tab)))
  (define (return lines) 
    (apply string-append "" (reverse lines)))
  (define (convert-folding lines)
    (let ((c (peek-char in)))
      (cond ((folding? c) 
             (read-char in)
             (convert-folding lines))
            (else
             (helper lines)))))
  (define (helper lines)
    (let ((l (read-line in 'return-linefeed)))
      (if (eof-object? l) 
          (return lines)
          (let ((c (peek-char in)))
            (if (folding? c) ;; we should keep going but first let's convert all folding whitespaces... 
                (convert-folding (cons (convert l) lines))
                ;; otherwise we are done... 
                (return (cons (convert l) lines)))))))
  (helper '()))
Then we can write the decoder as follows:

(define (encoded-word-string->string str)
  (read-folded-line (open-input-string str) encoded-word->string))
This handles encoded-word strings generated in the "normal" way, where each encoded word resides on its own line.

Handling General Case of Multiple Encoded Words on the same Line

While the above encoded-word-string->string should handle most normally generated encoded-word strings, it still cannot handle multiple encoded words on the same line, or encoded words mixed with unencoded words on the same line. Such a situation can occur if the generation strategy is to encode each word individually (in a way, this is why it's called an "encoded word"). There is an example in RFC1342:

... 
CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard 
... 
If we try to decode it with what we have, we'll lose the unencoded word:

> (encoded-word-string->string "=?ISO-8859-1?Q?Andr=E9_?= Pirard")
"André " ;; we lost Pirard
How can we solve this problem? Can we push what we have further or do we need to buckle down and look at using parser-tools?

Fortunately, the format of encoded words helps us out. As defined in RFC1342, the only way the above situation can arise is if the words are separated by spaces (which are significant) on the same line. Hence we can split the line on spaces, decode each word individually, and then join the results back together with spaces:

(define (encoded-word-string->string str)
  (define (helper line)
    (string-join (map encoded-word->string (regexp-split #px" " line)) 
                 " "))
  (read-folded-line (open-input-string str) helper))
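As a quick check of the split-and-join idea on a single line, here is a self-contained sketch that runs without the helpers from the earlier posts. The names decode-q-latin-1, word->string, and decode-line are hypothetical, and the stand-in decoder is deliberately limited: it ignores the declared charset and encoding and treats every encoded word as Q-encoded ISO-8859-1.

```racket
#lang racket/base
(require racket/string) ; for string-join

(define encoded-word-regexp #px"=\\?([^\\?]+)\\?(?i:(b|q))\\?([^\\?]+)\\?=")

;; Stand-in decoder: assumes Q encoding and ISO-8859-1, nothing else.
(define (decode-q-latin-1 data)
  (bytes->string/latin-1
   (regexp-replace* #rx#"=([0-9A-Fa-f][0-9A-Fa-f])"
                    (regexp-replace* #rx#"_" (string->bytes/utf-8 data) #" ")
                    (lambda (all hex)
                      (bytes (string->number (bytes->string/latin-1 hex) 16))))))

;; Decode a token if it is an encoded word, otherwise return it unchanged.
(define (word->string word)
  (let ((m (regexp-match encoded-word-regexp word)))
    (if m
        (decode-q-latin-1 (cadddr m)) ; fourth element is the encoded data
        word)))

;; Split on single spaces, decode each token, and join with spaces again.
(define (decode-line line)
  (string-join (map word->string (regexp-split #px" " line)) " "))

(decode-line "=?ISO-8859-1?Q?Andr=E9_?= Pirard")
;; => "André  Pirard"
```

The result contains two spaces between the words: one comes from the decoded underscore inside the encoded word, the other is the separator that was split on and joined back.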
That's it - now we can generate and parse encoded words in RFC message headers. Enjoy.
