UTF-7: help me understand

From: Mike Brown (mbrown@corp.webb.net)
Date: Tue Apr 11 2000 - 22:48:23 EDT


In the absence of charset information, IE 5.0, when in encoding auto-detect
mode, treats a document containing "+" followed by characters from the
Base64 alphabet followed by "-" as UTF-7 encoded.
Because of this, I am now on a mission to understand UTF-7.

I am having trouble.

GNU recode tells me that the single character U+2022 BULLET is, in UTF-7,
"+ICI". (echo -n "0x2022" | recode u2/x2..utf7)

The example at http://www.hut.fi/u/jkorpela/chars.html#utf7ex tells me that
the character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS is "+AOQ-"
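
Both results can be cross-checked against an independent encoder. The
snippet below is only a sketch: it assumes Python's built-in "utf-7" codec
is a fair stand-in for a conformant encoder, and the variable names are
purely illustrative. Note that this codec always emits the closing "-".

```python
# Cross-check the two quoted encodings with Python's standard "utf-7" codec.
# Assumption: this codec behaves like any other conformant UTF-7 encoder.
bullet = "\u2022".encode("utf-7")        # U+2022 BULLET
a_diaeresis = "\u00e4".encode("utf-7")   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

print(bullet)        # b'+ICI-'
print(a_diaeresis)   # b'+AOQ-'
```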

After going over RFC 1521 and RFC 1642 a million times, I still can't figure
out how those are correct.

The + as a beginning of a Modified Base64 sequence I understand. The - or
newline as the end of such a sequence I understand. What I do not understand
is how having less than 4 characters in the sequence ("ICI" or "AOQ") is
well-formed.

Here is what I am trying:

1. U+2022 = hex 2022
2. hex 2022 = binary octets 00100000 00100010
3. those octets are divided into sextets with zero-padding out to 24 bits as
per the RFCs. I'll use "o" here to represent the 0's used for padding:
001000 000010 0010oo oooooo
4. the scalar values of these sextets: 8 2 8 0
5. those values are indexes into the array of [encoded] characters in the
Base64 alphabet: I C I A
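
The five steps above can be sketched in Python to make the arithmetic easy
to re-run. The helper names (sextets, to_b64) and the pad_to_bits parameter
are hypothetical, introduced only for this illustration. Padding to a 6-bit
boundary instead of a 24-bit one is shown purely as an assumption that
happens to reproduce the three-character output; it is not presented here
as what the RFCs require.

```python
# The Base64 alphabet, indexed by sextet value.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def sextets(code_point, pad_to_bits):
    """Split one 16-bit code unit into 6-bit values, zero-padding the bit
    string up to a multiple of pad_to_bits (hypothetical helper, BMP only)."""
    bits = format(code_point, "016b")           # steps 1-2: 16 bits, big-endian
    bits += "0" * (-len(bits) % pad_to_bits)    # step 3: zero-padding
    return [int(bits[i:i + 6], 2) for i in range(0, len(bits), 6)]

def to_b64(code_point, pad_to_bits):
    """Map each sextet value to its Base64 character (steps 4-5)."""
    return "".join(B64[v] for v in sextets(code_point, pad_to_bits))

padded_24 = to_b64(0x2022, 24)   # padding out to 24 bits, as in step 3
padded_6 = to_b64(0x2022, 6)     # assumption: pad only to the next sextet

print(sextets(0x2022, 24), padded_24)   # [8, 2, 8, 0] ICIA
print(sextets(0x2022, 6), padded_6)     # [8, 2, 8] ICI
print(to_b64(0x00E4, 6))                # AOQ
```

Under the 6-bit assumption the 16 data bits need only two padding bits, so
the fourth, all-padding sextet never appears.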

So I don't see how "+ICI" is well-formed and "+ICIA" isn't. I also don't see
how you could ever have less than 4 characters in a Modified Base64
representation of a single Unicode character.

What am I missing?

   - Mike
___________________________________________________________
Mike J. Brown, software engineer, Webb Interactive Services
XML/XSL stuff: http://www.skew.org/ http://www.webb.net/


