UTF-7: help me understand

From: Mike Brown (mbrown@corp.webb.net)
Date: Tue Apr 11 2000 - 22:48:23 EDT

Next message: Yves Arrouye: "Re: UTF-8 code in HTML"
Previous message: Markus Scherer: "Re: UTF-8 code in HTML"
Next in thread: Deborah Goldsmith: "Re: UTF-7: help me understand"
Maybe reply: Deborah Goldsmith: "Re: UTF-7: help me understand"

In the absence of charset information, IE 5.0, when in encoding auto-detect
mode, treats a document as UTF-7 encoded if it contains "+" followed by
characters in the Base64 alphabet followed by "-".
Because of this, I am now on a mission to understand UTF-7.
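
To make that pattern concrete, here is a rough Python sketch of a check for
"+", then Base64-alphabet characters, then "-". The regular expression and the
name looks_like_utf7 are illustrative assumptions; this only mirrors the
description above and is not IE's actual detection logic.

```python
import re

# Hypothetical pattern: "+", one or more Base64-alphabet characters, then "-".
# This mirrors the description above only; it is not IE's real algorithm.
UTF7_RUN = re.compile(rb"\+[A-Za-z0-9+/]+-")

def looks_like_utf7(data: bytes) -> bool:
    """Return True if the bytes contain something shaped like a UTF-7 shifted run."""
    return UTF7_RUN.search(data) is not None

print(looks_like_utf7(b"+AOQ-"))      # the U+00E4 example from this message
print(looks_like_utf7(b"1 + 1 = 2"))  # "+" followed by a space
print(looks_like_utf7(b"a+b-c"))      # plain ASCII that happens to fit the shape
```

Note that ordinary ASCII text such as "a+b-c" fits the same shape, so a check
this simple can misfire on documents that were never meant to be UTF-7.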

I am having trouble.

GNU recode tells me that the single character U+2022 BULLET is, in UTF-7,
"+ICI". (echo -n "0x2022" | recode u2/x2..utf7)

The example at http://www.hut.fi/u/jkorpela/chars.html#utf7ex tells me that
the character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS is "+AOQ-".
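
As a cross-check that does not depend on recode or on that page, the same two
characters can be run through Python's built-in "utf-7" codec. This is only a
sketch using a different tool from the ones named above; the helper name
utf7_of is an assumption made for illustration.

```python
def utf7_of(text: str) -> str:
    """Encode text with Python's built-in UTF-7 codec and return it as ASCII text."""
    return text.encode("utf-7").decode("ascii")

# U+2022 BULLET and U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
for ch in ("\u2022", "\u00e4"):
    print(f"U+{ord(ch):04X} -> {utf7_of(ch)}")
```

Both results carry three Base64 characters between the "+" and the "-", in
line with what recode and the web page report.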

After going over RFC 1521 and RFC 1642 a million times, I still can't figure
out how those are correct.

The + as the beginning of a Modified Base64 sequence I understand. The - or
newline as the end of such a sequence I understand. What I do not understand
is how having fewer than 4 characters in the sequence ("ICI" or "AOQ") is
well-formed.

Here is what I am trying:

1. U+2022 = hex 2022
2. hex 2022 = binary octets 00100000 00100010
3. Those octets are divided into sextets, with zero-padding out to 24 bits as
   per the RFCs. I'll use "o" here to represent the 0's used for padding:
   001000 000010 0010oo oooooo
4. The scalar values of these sextets: 8 2 8 0
5. Those values are indexes into the array of [encoded] characters in the
   Base64 alphabet: I C I A
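
The five steps above can be reproduced mechanically. The following Python
sketch follows them literally, including the padding out to 24 bits that the
question is about; the function names are assumptions made for illustration,
and the code mirrors my reading of the steps rather than any reference
implementation.

```python
# Base64 alphabet: A-Z, a-z, 0-9, "+", "/"
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def sextets_padded_to_24(code_point: int) -> list[str]:
    # Steps 1-2: write the 16-bit value as a string of binary digits.
    bits = format(code_point, "016b")
    # Step 3: zero-pad to a multiple of 24 bits, then cut into 6-bit groups.
    padded = bits + "0" * (-len(bits) % 24)
    return [padded[i:i + 6] for i in range(0, len(padded), 6)]

def to_base64_chars(sextets: list[str]) -> str:
    # Steps 4-5: each sextet's value is an index into the Base64 alphabet.
    return "".join(B64[int(s, 2)] for s in sextets)

groups = sextets_padded_to_24(0x2022)
print(groups)
print([int(s, 2) for s in groups])
print(to_base64_chars(groups))
```

Run on U+2022 this yields the four sextet values 8 2 8 0 and the characters
I C I A, the last of which comes entirely from padding bits.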

So I don't see how "+ICI" is well-formed and "+ICIA" isn't. I also don't see
how you could ever have fewer than 4 characters in a Modified Base64
representation of a single Unicode character.

What am I missing?

- Mike
___________________________________________________________
Mike J. Brown, software engineer, Webb Interactive Services
XML/XSL stuff: http://www.skew.org/ http://www.webb.net/

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:01 EDT