Re: UTF-5 specification

From: Robert A. Rosenberg (bob.rosenberg@digitscorp.com)
Date: Thu Mar 02 2000 - 18:14:42 EST


At 07:30 AM 03/02/2000 -0800, Doug Ewell wrote:
>I've been working on implementing a UTF-5 encoder and decoder based on
>the specifications in the file
>
>http://ftp.univie.ac.at/netinfo/internet-drafts/draft-jseng-utf5-01.txt
>
>and I am running into problems with what I will call "UTF-5 mode,"
>which I apparently need to be able to switch into and out of, but which
>is not mentioned anywhere in the spec.
>
>Section 3, "Examples of UTF-5," states:
>
> > The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262,
> > 0391, 002E) may be encoded as follows:
> >
> > "K1I262J91IE"
>
>In this example, the two ASCII characters 'A' and '.' are encoded in
>UTF-5 along with the non-ASCII characters, U+2262 and U+0391.
>
>Section 4.b, "Internationalization of Simple Mail Transfer Protocol
>Address," states:
>
> > For example, an SMTP Email address for "yamaguchi@asahi.ninhon"
> > (5C71 53E3 '@' 671D 65E5 '.' 65E5 672C) can be represented in
> > UTF-5 "LC71L3E3@M71DM5E5.M5E5M72C". This is a valid [RFC822] Email
> > address which will not be rejected. It will then be the responsibility
> > of the user interface to render "LC71L3E3@M71DM5E5.M5E5M72C" properly
> > as "yamaguchi@asahi.ninhon".
>
>In this example, the two ASCII characters '@' and '.' are NOT encoded
>in UTF-5 along with everything else, but remain in ASCII.
>
>So in the same document, we are told first that the character U+002E
>should be encoded in UTF-5, and then that it should not. This creates
>a problem for encoders, since they must know when to encode characters
>like U+002E and when not to. It also creates a problem for decoders,
>which must figure out how and when to switch into and out of UTF-5 mode
>within a "UTF-5" string or document.
>
>This notion is inconsistent with Section 2.5, "Detecting a UTF-5
>string," which states:
>
> > Nevertheless, if the string is sufficiently long, it is possible to
> > do some detection of UTF-5 string based on the fact that
> > 1. UTF-5 strings only have characters within '0'-'9' and 'A'-'V'.
> > 2. UTF-5 strings have a well-defined initial octet of 'G' to 'V'.
> > 3. The 'G' character always occurs as the initial and only octet.
> > In other words, the shortest UTF-5 sequence is "G". For example,
> > "GF" is not a valid UTF-5 sequence.
>
>The encoded e-mail address "LC71L3E3@M71DM5E5.M5E5M72C" in Section 4.b
>violates rules 1 and 2, and thus would not qualify as a UTF-5 string
>according to these criteria.
>
>There are other potential ambiguities. The specification says that
>characters in the range U+0000 through U+000F are represented by
>quintets in the range 10000 through 11111 (binary), and converted
>thereby to characters in the range 'G' through 'V'. This would seem
>to imply that control characters like Carriage Return (U+000D) and
>Line Feed (U+000A) should be encoded as 'T' and 'Q' respectively. The
>extreme example of this is trying to store pure UTF-5 strings in C or
>C++ null-terminated character arrays, while encoding the null character
>U+0000 itself as the letter 'G' as specified in Section 3.
>
>It appears that UTF-5 was designed solely to allow non-ASCII characters
>in Internet domain names and e-mail addresses, and the problem of what
>to do about characters like '@' and '.' was ignored. But in a proper
>specification, these ambiguities should not exist. Compare the UTF-5
>document to the specification of UTF-7 (RFC 2152). In that document,
>it is specified clearly which characters are encoded and which are not,
>and when and by what means it is necessary to switch modes. (This is
>not to imply that UTF-7 is all that simple to implement, but at least
>the specification is complete.)
>
>In short, the UTF-5 specification needs to acknowledge the need to
>switch into and out of UTF-5 mode. It should specify when certain
>characters are to be left in ASCII rather than being encoded into
>UTF-5, and it should provide guidelines for decoders about how
>"invalid" UTF-5 characters ([^0-9A-V]) are to be handled in a UTF-5
>stream. These details must be covered explicitly by the spec, not
>left as "undefined" for each implementation to handle differently.
>
>-Doug Ewell
> Fullerton, California

You are not looking at the problem correctly. In the case of an Email
Address, the syntax is name@domain. In the example shown, only the CONTENTS
of name and domain are rendered in UTF-5, NOT the full string. Thus you pass
the 3 sections of the address (which are delimited by the "@" and the ".")
through the converter SEPARATELY. IOW: You must parse the string based on
its format to extract the UTF-5 sections (as well as to validate its syntax).
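
A minimal sketch of that approach in Python, assuming the quintet mapping
described in the quoted draft (values 0-31 map to '0'-'9' and 'A'-'V', and the
first hex digit of each character has 16 added to mark the start of a
sequence). The names utf5_encode, utf5_decode, encode_address and
decode_address are hypothetical, not taken from the draft. The decoder is
deliberately lenient: it does not reject over-long forms such as "GF".

```python
import re

# Quintet value 0..31 -> output character ('0'-'9', then 'A'-'V').
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def utf5_encode(text):
    """Encode each code point as its hex digits; the first digit gets +16."""
    out = []
    for ch in text:
        nibbles = [int(d, 16) for d in format(ord(ch), "X")]
        out.append(DIGITS[nibbles[0] + 16])         # leading quintet: 'G'..'V'
        out.extend(DIGITS[n] for n in nibbles[1:])  # following quintets: '0'..'F'
    return "".join(out)


def utf5_decode(data):
    """Decode a pure UTF-5 string; raises ValueError on any other character."""
    chars = []
    value = None
    for c in data:
        q = DIGITS.index(c)  # ValueError for anything outside 0-9, A-V
        if q >= 16:          # 'G'..'V' starts a new character
            if value is not None:
                chars.append(chr(value))
            value = q - 16
        else:                # '0'..'F' continues the current character
            if value is None:
                raise ValueError("continuation quintet before a leading one")
            value = value * 16 + q
    if value is not None:
        chars.append(chr(value))
    return "".join(chars)


def _convert_parts(address, convert):
    # Split on '@' and '.', keeping the delimiters, and convert only the
    # sections between them. Syntax validation is left out of this sketch.
    parts = re.split(r"([@.])", address)
    return "".join(p if p in ("@", ".") else convert(p) for p in parts)


def encode_address(address):
    return _convert_parts(address, utf5_encode)


def decode_address(address):
    return _convert_parts(address, utf5_decode)
```

With these definitions, utf5_encode("A\u2262\u0391.") gives "K1I262J91IE"
(the Section 3 example, where the whole string goes through the converter),
while encode_address("\u5c71\u53e3@\u671d\u65e5.\u65e5\u672c") gives
"LC71L3E3@M71DM5E5.M5E5M72C" (the Section 4.b example, where "@" and "."
stay in ASCII because they are syntax, not content).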


