I've been working on implementing a UTF-5 encoder and decoder based on
the specifications in the file
and I am running into problems with what I will call "UTF-5 mode,"
which I apparently need to be able to switch into and out of, but which
is not mentioned anywhere in the spec.
Section 3, "Examples of UTF-5," states:
> The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262,
> 0391, 002E) may be encoded as follows:
In this example, the two ASCII characters 'A' and '.' are encoded in
UTF-5 along with the non-ASCII characters, U+2262 and U+0391.
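
For reference, here is how I read the encoding rule, as a minimal Python sketch. It assumes each character is written as its hexadecimal code point with the leading hex digit replaced by the corresponding letter 'G' through 'V'. The function names are my own and do not come from the spec.

```python
def utf5_encode_char(cp: int) -> str:
    """Encode one code point: the leading hex digit becomes 'G'..'V'."""
    digits = format(cp, "X")                    # e.g. 0x2262 -> "2262"
    lead = chr(ord("G") + int(digits[0], 16))   # '2' -> 'I'
    return lead + digits[1:]


def utf5_encode(text: str) -> str:
    """Encode every character, ASCII included (the Section 3 behavior)."""
    return "".join(utf5_encode_char(ord(ch)) for ch in text)


print(utf5_encode("A\u2262\u0391."))            # K1I262J91IE
```

Under this reading, 'A' becomes "K1" and '.' becomes "IE"; nothing is left in ASCII.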
Section 4.b, "Internationalization of Simple Mail Transfer Protocol," states:
> For example, an SMTP Email address for "firstname.lastname@example.org"
> (5C71 53J3 '@' 671D 65E5 '.' 65E5 672C) can be represented in
> UTF-5 "LC71L3E3@M71DM5E5.M5E5M72C". This is a valid [RFC822] Email
> address which will not be rejected. It will then be the responsiblity
> of the user interface to render "LC71L3E3@M71DM5E5.M5E5M72C" properly
> as "email@example.com".
In this example, the two ASCII characters '@' and '.' are NOT encoded
in UTF-5 along with everything else, but remain in ASCII.
So in the same document, we are told first that the character U+002E
should be encoded in UTF-5, and then that it should not. This creates
a problem for encoders, since they must know when to encode characters
like U+002E and when not to. It also creates a problem for decoders,
which must figure out how and when to switch into and out of UTF-5 mode
within a "UTF-5" string or document.
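
To make the conflict concrete, here is a sketch of an encoder that can produce either behavior. The `literal` parameter is purely hypothetical: it stands in for whatever rule the spec would need to state. The second code point, U+53E3, is inferred from the encoded form "L3E3" in the quoted example.

```python
def utf5_encode(text: str, literal: str = "") -> str:
    """Encode text as UTF-5, copying any character in `literal` unchanged.

    `literal` is a hypothetical knob; the spec defines nothing like it.
    """
    out = []
    for ch in text:
        if ch in literal:
            out.append(ch)                      # left in ASCII
        else:
            digits = format(ord(ch), "X")
            out.append(chr(ord("G") + int(digits[0], 16)) + digits[1:])
    return "".join(out)


addr = "\u5C71\u53E3@\u671D\u65E5.\u65E5\u672C"
print(utf5_encode(addr))                  # LC71L3E3K0M71DM5E5IEM5E5M72C
print(utf5_encode(addr, literal="@."))    # LC71L3E3@M71DM5E5.M5E5M72C
```

Both outputs are defensible from the text of the spec, and an encoder has no way to know which one is wanted.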
This notion of mode switching is inconsistent with Section 2.5, "Detecting a UTF-5
string," which states:
> Nevertheless, if the string is sufficiently long, it is possible to
> do some detection of UTF-5 string based on the fact that
> 1. UTF-5 strings only have characters within '0'-'9' and 'A'-'V'.
> 2. UTF-5 strings have a well-defined inital octet of 'G' to 'V'.
> 3. The 'G' character always occurs as the inital and only octet.
> In other word, the shortest UTF-5 sequence is "G". For example,
> "GF" is not a valid UTF-5 sequence.
The encoded e-mail address "LC71L3E3@M71DM5E5.M5E5M72C" in Section 4.b
violates rules 1 and 2, since '@' and '.' lie outside '0'-'9' and
'A'-'V' and therefore cannot be initial octets either. It thus would
not qualify as a UTF-5 string according to these criteria.
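
The three rules can be turned into a simple check. This is only a sketch: the grammar in the regular expression is my own reading of the rules (a lone 'G', or an initial 'H'-'V' followed by hex digits), and the name `looks_like_utf5` is made up for illustration.

```python
import re

# Assumed grammar: each character is either a lone 'G' (U+0000) or an
# initial 'H'-'V' followed by zero or more hex digits '0'-'9', 'A'-'F'.
UTF5_STRING = re.compile(r"(?:G|[H-V][0-9A-F]*)+")


def looks_like_utf5(s: str) -> bool:
    """Return True if the whole string fits the assumed UTF-5 grammar."""
    return UTF5_STRING.fullmatch(s) is not None


print(looks_like_utf5("K1I262J91IE"))                   # True
print(looks_like_utf5("GF"))                            # False
print(looks_like_utf5("LC71L3E3@M71DM5E5.M5E5M72C"))    # False
```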
There are other potential ambiguities. The specification says that
characters in the range U+0000 through U+000F are represented by
quintets in the range 10000 through 11111 (binary), and converted
thereby to characters in the range 'G' through 'V'. This would seem
to imply that control characters like Carriage Return (U+000D) and
Line Feed (U+000A) should be encoded as 'T' and 'Q' respectively. The
extreme example of this is trying to store pure UTF-5 strings in C or
C++ null-terminated character arrays, while encoding the null character
U+0000 itself as the letter 'G' as specified in Section 3.
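
The mapping for these sixteen code points is easy to tabulate. The sketch below assumes the quintet rule just described; `LOW_16` is a name I made up.

```python
# U+0000..U+000F map to the single letters 'G'..'V' (quintets 10000..11111).
LOW_16 = {cp: chr(ord("G") + cp) for cp in range(0x10)}

print(LOW_16[0x0D], LOW_16[0x0A], LOW_16[0x00])   # T Q G

# Encoding "A", NUL, "B" in full yields a string with no zero byte in it.
encoded = "K1" + LOW_16[0x00] + "K2"
print(encoded)                                    # K1GK2
```

So a fully encoded string is safe in a null-terminated array, but only if the control characters really are encoded rather than passed through, which is exactly what the spec leaves open.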
It appears that UTF-5 was designed solely to allow non-ASCII characters
in Internet domain names and e-mail addresses, and the problem of what
to do about characters like '@' and '.' was ignored. But in a proper
specification, these ambiguities should not exist. Compare the UTF-5
document to the specification of UTF-7 (RFC 2152). In that document,
it is specified clearly which characters are encoded and which are not,
and when and by what means it is necessary to switch modes. (This is
not to imply that UTF-7 is all that simple to implement, but at least
the specification is complete.)
In short, the UTF-5 specification needs to acknowledge the need to
switch into and out of UTF-5 mode. It should specify when certain
characters are to be left in ASCII rather than being encoded into
UTF-5, and it should provide guidelines for decoders about how
"invalid" UTF-5 characters ([^0-9A-V]) are to be handled in a UTF-5
stream. These details must be covered explicitly by the spec, not
left as "undefined" for each implementation to handle differently.
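
As an illustration of the choice every decoder currently has to make for itself, here is a sketch with the policy exposed as a parameter. The `on_invalid` parameter and its two values are hypothetical; the spec offers no such guidance, which is the point.

```python
HEX = "0123456789ABCDEF"


def utf5_decode(s: str, on_invalid: str = "error") -> str:
    """Decode UTF-5; `on_invalid` is a hypothetical policy knob.

    "error"   -> reject any character that cannot start a sequence
    "literal" -> copy such characters through unchanged
    """
    out = []
    i = 0
    while i < len(s):
        c = s[i]
        i += 1
        if "G" <= c <= "V":
            value = ord(c) - ord("G")
            # Rule 3 in Section 2.5: 'G' is always a complete sequence.
            while c != "G" and i < len(s) and s[i] in HEX:
                value = value * 16 + int(s[i], 16)
                i += 1
            out.append(chr(value))
        elif on_invalid == "literal":
            out.append(c)
        else:
            raise ValueError(f"cannot start a UTF-5 sequence: {c!r}")
    return "".join(out)


text = utf5_decode("LC71L3E3@M71DM5E5.M5E5M72C", on_invalid="literal")
print([f"U+{ord(ch):04X}" for ch in text])
```

With "literal" the Section 4.b address decodes; with "error" it is rejected at '@'. Note that even the "literal" policy cannot pass through ASCII letters 'G' through 'V' or a stray hex digit unambiguously, so a real specification would need a proper escape mechanism.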