On Thu, 5 Jun 1997, Timothy Pardridge wrote:
> In message <9706041245.AA11103@unicode.org> you recently said:
> > > Any comments on
> > > ftp://ds.internic.net/internet-drafts/draft-ietf-acap-mlsf-00.txt
> > > ?
> > > Language tags are encoded by mapping them to upper-case, then
> > > adding hexadecimal A0 to each octet. The result is broken up into
> > > groups of five octets followed by a final group of five or fewer
> > > octets. Each group is prefixed by a UTF-8-style length count with
> > > the low bits set to 0.
> > If I have not misunderstood UTF-8 or "MLSF" completely:
> > A.
> > 1. A UTF-8-style length count with the low bits set to 0 is
> > **not** an "illegal" UTF-8 "start character code" octet.
> I think they are unusual though because the low order bits (except
> the highest one) will have at least one bit set because of the
> character being represented.
> e.g. 00000yyyyyxxxxxx fits in two bytes. yyyyy is non zero since
> otherwise one byte could be used. The bytes are 110yyyyy and 10xxxxxx.
> MLSF would use 11000000 which can never occur in UTF-8.
We have just recently had a discussion initiated by a company that
wanted to have some of their implementation peculiarities standardized
as a UTF-8 variant. I don't yet know how this discussion has ended.
The standard and the code suggest that you accept encodings even
if they use one byte too many.
UTF-8 has some redundancy, and this is a very valuable thing.
That redundancy is obviously starting to become a favorite target for attacks.
But it should stay as is. If several parties bite a bit off here
and a bit there, chances are that we won't have anything left
in the end, and even worse, that those various parties will
badly bite each other.
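
To make the redundancy concrete, here is a minimal sketch in Python of
the two-octet case discussed above (110yyyyy 10xxxxxx). The function
names are my own and purely illustrative. The lenient decoder stands in
for one that accepts "one byte too many"; Python's built-in decoder is
shown only for comparison and happens to reject such sequences.

```python
def decode_two_octets(lead: int, trail: int) -> int:
    """Decode 110yyyyy 10xxxxxx leniently, with no shortest-form check."""
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        raise ValueError("not a two-octet UTF-8 sequence")
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def is_shortest_form(lead: int, trail: int) -> bool:
    """True if the value really needs two octets (i.e. is >= 0x80)."""
    return decode_two_octets(lead, trail) >= 0x80


# 'A' (U+0041) written with one octet too many: 11000001 10000001.
overlong_value = decode_two_octets(0xC1, 0x81)

# A strict decoder (Python's, here) refuses the same two octets.
strict_rejects = False
try:
    bytes([0xC1, 0x81]).decode("utf-8")
except UnicodeDecodeError:
    strict_rejects = True

print(hex(overlong_value), is_shortest_form(0xC1, 0x81), strict_rejects)
```

Lead octets 11000000 and 11000001 can only produce values below 0x80,
so they never appear in shortest-form UTF-8. That unused space is the
redundancy a proposal like MLSF would be biting into.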
> > 2. Adding hexadecimal A0 to the "ASCII" codes for A-Z produces
> > something that is an "illegal" UTF-8 continuation octet, but
> > *is* a legal "start character code" octet (111xxxxx, where
> > each x may be 1 or 0 independently of the others, with some
> > exclusions).
> > I think this would confuse most UTF-8 decoders, and is unlikely
> > to be silently ignored.
> He may well be assuming an implementation where a count byte triggers
> a loop which reads a number of following bytes. As you say there are
> other ways of implementing a decoder.
What we have to assume is not one or another decoder, but the
total of all decoders. And that doesn't leave much room.
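
As a rough illustration of point 2 above, the following Python sketch
shifts upper-cased letters by hexadecimal A0 and classifies each
resulting octet the way a generic UTF-8 decoder (including the original
five- and six-octet forms) would see it. The helper names are
hypothetical, and only the "add A0" step of the draft is modelled, not
the grouping or the length-count prefixes.

```python
def classify_octet(b: int) -> str:
    """Classify one octet by its high bits, as a UTF-8 decoder would."""
    if b < 0x80:
        return "ascii"         # 0xxxxxxx
    if b < 0xC0:
        return "continuation"  # 10xxxxxx
    if b < 0xE0:
        return "lead-2"        # 110xxxxx
    if b < 0xF0:
        return "lead-3"        # 1110xxxx
    if b < 0xF8:
        return "lead-4"        # 11110xxx
    if b < 0xFC:
        return "lead-5"        # 111110xx
    if b < 0xFE:
        return "lead-6"        # 1111110x
    return "invalid"           # FE and FF


def shift_letters(tag: str) -> bytes:
    """Upper-case the tag and add 0xA0 to each octet (letters only, by assumption)."""
    return bytes(ord(c) + 0xA0 for c in tag.upper())


shifted = shift_letters("en")  # 'E' -> 0xE5, 'N' -> 0xEE
kinds = [classify_octet(b) for b in shifted]

# Every shifted letter A-Z lands on some kind of lead octet.
all_leads = all(
    classify_octet(ord(c) + 0xA0).startswith("lead")
    for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
)
print(shifted.hex(), kinds, all_leads)
```

A decoder that resynchronizes on lead octets would therefore see each
shifted letter as the start of a new multi-octet character, not as
something it can skip over.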