RE: UTF-7 - I'm not really smarter

From: Keutgen, Walter (walter.keutgen@be.unisys.com)
Date: Tue Mar 28 2006 - 11:31:17 CST

Next message: Richard Wordingham: "Re: UTF-7 - I'm not really smarter"

Previous message: Otto Stolz: "Re: UTF-7 - I'm not really smarter"
Maybe in reply to: Kornkreismuster@web.de: "UTF-7 - I'm not really smarter"

Otto,

The reason is that Unicode rejects escape-sequence mechanisms of any kind; in UTF-7, the + ... - bracketing is such a mechanism.

Walter


-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Otto Stolz
Sent: Tuesday, 28 March 2006 19:15
To: Kornkreismuster@web.de
Cc: unicode@unicode.org
Subject: Re: UTF-7 - I'm not really smarter

Hello,

Kornkreismuster@web.de wrote:
> Reading this [RFC 2152], I got the feeling it only encodes UTF-16 encoded texts,
> but I think that's not true.

The description in RFC 2152, chapter 4, is probably misleading to
the uninitiated. The key to understanding is that all UTFs are
equivalent: they encode the same character set, viz. the whole of
Unicode, and any string encoded in one UTF can be easily transformed
into any other.

So, all references in chapter 4 of RFC 2152 to UTF-16, and to
16-bit code elements, are only meant to facilitate the description
of the algorithm. You can describe the UTF-7 encoding algorithm
(with a grain of salt) thus:
1. encode the source string in UTF-16 (regardless of its previous
encoding);
2. convert every three UTF-16 code units (48 bits) into 8 ASCII
characters using a modified base-64 algorithm (hence, every
character encodes 6 bits);
3. enclose the result between a plus and a minus sign.
Alternatively, runs of "harmless" characters may be written directly
in ASCII, instead of applying steps 1..3, above.
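
As a rough illustration of steps 1..3, here is a minimal Python sketch. The
function name utf7_shift_encode is made up for this example, and the sketch
assumes that "modified base-64" simply means standard base-64 with the
trailing "=" padding dropped. It always encodes the whole string as one
shifted run, so it ignores the direct-ASCII alternative and is not a complete
RFC 2152 encoder.

```python
import base64


def utf7_shift_encode(text: str) -> bytes:
    """Encode text as a single UTF-7 shifted run (hypothetical helper)."""
    if not text:
        # An empty run would give "+-", which UTF-7 reads as a literal "+".
        return b""
    utf16 = text.encode("utf-16-be")             # step 1: UTF-16, big-endian
    b64 = base64.b64encode(utf16).rstrip(b"=")   # step 2: base-64, padding dropped
    return b"+" + b64 + b"-"                     # step 3: enclose in "+" ... "-"


encoded = utf7_shift_encode("Hi \u00e4")
# Python's built-in "utf-7" codec should decode the result to the same text.
print(encoded, encoded.decode("utf-7") == "Hi \u00e4")
```

Round-tripping through Python's own "utf-7" decoder is only a plausibility
check of the sketch, not a proof of conformance.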

The latter alternative makes UTF-7 non-unique: one and the same
character string may be encoded in several ways, cf. my example in
<http://www.systems.uni-konstanz.de/Otto/Vortrag/Charset/Unicode-Grundlagen.html#UU-7>
-- in contrast to UTF-8, UTF-16, and UTF-32, where every string has
exactly one encoding. I guess this is the main reason for not having
UTF-7 in the Unicode standard.
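
To make the ambiguity concrete, here is a small Python sketch that relies on
Python's built-in "utf-7" codec. The three byte strings are hand-built for
this illustration (they are not taken from the page linked above); each one
shifts a different part of the text into base-64.

```python
# Three different UTF-7 byte sequences intended to spell the text "Hi ä".
candidates = [
    b"Hi +AOQ-",        # ASCII run written directly, only "ä" shifted
    b"H+AGkAIADk-",     # "H" direct, "i ä" inside one shifted run
    b"+AEgAaQAgAOQ-",   # the whole string inside one shifted run
]

decoded = [c.decode("utf-7") for c in candidates]

# All three decode to the same string, although the bytes differ.
all_same = all(d == "Hi \u00e4" for d in decoded)
distinct_encodings = len(set(candidates))
print(all_same, distinct_encodings)
```

A comparable experiment with UTF-8 would find exactly one valid byte sequence
for the string.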

Regards,
Otto Stolz


This archive was generated by hypermail 2.1.5 : Tue Mar 28 2006 - 11:34:03 CST