Re: Over-long Control Characters in UTF-8

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Aug 02 1999 - 02:05:58 EDT


We have been busy at work cleaning this up in Unicode 3.0, but as it
happens, book publishing has its built-in time lags.

Anyway, in Unicode 3.0 the text is now quite clear that the shortest
encoding *must* be used. You are permitted, as a receiver, not to spend
time rejecting longer encodings, but senders must not produce them and cannot
count on receivers to understand them.

>I think, this is a big mistake. Adding a check for whether the unique
>shortest encoding has been used is trivial. Just check, whether a UTF-8
>sequence starts with any of the following illegal byte combinations:
>
> 11000000
> 11100000 100xxxxx
> 11110000 1000xxxx
> 11111000 10000xxx
> 11111100 100000xx
>

I would personally urge any writers of UTF-8 decoders to add the small
amount of code needed to reject non-minimal sequences, and to enable this
whenever possible. Your argument on the ease of checking this is a good one.
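
To illustrate how small that amount of code is, here is a rough sketch in C
of a check against the five illegal prefixes quoted above. The function and
test values are purely illustrative, not taken from any particular decoder.

#include <stdio.h>

/* Illustrative sketch: returns 1 if the bytes at p begin an over-long
 * (non-minimal) UTF-8 sequence, matching the five illegal byte
 * combinations quoted above; returns 0 otherwise. */
static int starts_overlong(const unsigned char *p, size_t len)
{
    if (len >= 1 && (p[0] & 0xFE) == 0xC0)                   /* 1100000x          */
        return 1;
    if (len >= 2 && p[0] == 0xE0 && (p[1] & 0xE0) == 0x80)   /* 11100000 100xxxxx */
        return 1;
    if (len >= 2 && p[0] == 0xF0 && (p[1] & 0xF0) == 0x80)   /* 11110000 1000xxxx */
        return 1;
    if (len >= 2 && p[0] == 0xF8 && (p[1] & 0xF8) == 0x80)   /* 11111000 10000xxx */
        return 1;
    if (len >= 2 && p[0] == 0xFC && (p[1] & 0xFC) == 0x80)   /* 11111100 100000xx */
        return 1;
    return 0;
}

int main(void)
{
    /* U+002F '/' in its minimal one-byte form vs. an over-long two-byte form. */
    const unsigned char minimal[]  = { 0x2F };
    const unsigned char overlong[] = { 0xC0, 0xAF };

    printf("minimal:  %d\n", starts_overlong(minimal, sizeof minimal));   /* prints 0 */
    printf("overlong: %d\n", starts_overlong(overlong, sizeof overlong)); /* prints 1 */
    return 0;
}

A decoder would run such a check at the start of each multi-byte sequence and
treat a match as a decoding error rather than mapping it to a character.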

It is too late to make any technical changes to the second edition of
ISO/IEC 10646-1 - at this stage there is no justification to hold its
publication for any single change. However, you raise a serious issue and
you should present this issue to WG2 either through your national body or
as an expert contribution.

A./


