Re: Over-long Control Characters in UTF-8

From: Francois Yergeau (yergeau@alis.com)
Date: Mon Aug 02 1999 - 09:53:39 EDT

Next message: John Cowan: "Re: Over-long Control Characters in UTF-8"
Previous message: Michael Everson: "Re: Latin-1's apostrophe, grave accent, acute accent"
Maybe in reply to: Markus Kuhn: "Over-long Control Characters in UTF-8"
Next in thread: John Cowan: "Re: Over-long Control Characters in UTF-8"

At 03:27 01/08/99 -0700, Markus Kuhn wrote:
>In general, are there any requirements in the Unicode and ISO 10646-1
>standard with regard to handling UTF-8 sequences that are longer than
>necessary? I haven't found any.

10646-1 Annex R.4 makes over-long sequences illegal: the left-hand column
of Table 4 lists mutually exclusive ranges of UCS-4 values, while the
right-hand column shows the corresponding (unique) UTF-8 encoding, so each
value has exactly one permitted encoding.
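
The uniqueness argument can be illustrated with a small sketch. The ranges
below are the ones commonly given for the original 31-bit UTF-8 definition;
they are an assumption here, since Table 4 itself is not reproduced in this
message, and the names UTF8_RANGES, shortest_length and is_overlong are
hypothetical.

```python
# Assumed ranges for the original 31-bit UTF-8 definition (not copied from
# the annex): (first UCS-4 value, last UCS-4 value, number of octets).
UTF8_RANGES = [
    (0x00000000, 0x0000007F, 1),
    (0x00000080, 0x000007FF, 2),
    (0x00000800, 0x0000FFFF, 3),
    (0x00010000, 0x001FFFFF, 4),
    (0x00200000, 0x03FFFFFF, 5),
    (0x04000000, 0x7FFFFFFF, 6),
]


def shortest_length(value):
    """Return the only octet count the table allows for this UCS-4 value."""
    for first, last, length in UTF8_RANGES:
        if first <= value <= last:
            return length
    raise ValueError("value outside the 31-bit UCS-4 range")


def is_overlong(value, octets_used):
    """True if the value was carried in more octets than the table allows."""
    return octets_used > shortest_length(value)
```

Because the ranges do not overlap, each value maps to exactly one length.
Under these assumptions, is_overlong(0x0A, 2) is true: a LINE FEED carried
in the two octets C0 8A uses more octets than its range permits.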

In turn, section 2.2 a) makes any CC-data-element containing such a
sequence non-conformant, since the chosen encoding scheme is not respected.

>All this leads me to the conclusion that it is probably a good idea to
>extend Annex R.7 in ISO 10646-1:2000 to also declare over-long UTF-8
>sequences as malformed and to require UTF-8 decoders to treat them like
>other malformed sequences, e.g. signal them as a transmission error,
>substitute U+FFFD for them, but under no circumstances treat them as the
>corresponding UCS value.

Unfortunately, section 2.3 (Conformance of devices) does not force a
receiving device to flag such a non-conforming sequence, because of the
hole Markus noted in R.7. That section identifies a few "malformed
sequences" (over-long sequences are not among them) and then says: "If a
receiving device that has adopted the UTF-8 form receives a malformed
sequence,... then it shall interpret that malformed sequence in the same way
that it interprets a character that is outside the adopted subset that has
been identified for the device (see 2.3c)." Section 2.3 itself does not
catch any other error cases.

Back to Markus' original question:
>If I write a UTF-8 decoder for a terminal emulator, shall I accept and
>execute control characters even if they are part of an UTF-8 sequence
>that is longer than necessary?

I'd say no: such a sequence is non-conformant, and interpreting it can have
harmful consequences (for example, a control character slipping past a
filter that checks only for its shortest form). It would be nice if the
standards *required* its rejection.
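
As a rough illustration of the behaviour recommended above, here is a
minimal decoder sketch that substitutes U+FFFD for over-long and otherwise
malformed sequences instead of delivering the decoded value. It assumes the
original 31-bit UTF-8 layout (up to six octets) and does not check for
surrogates or other restrictions; decode_strict, lead_info and MIN_VALUE are
hypothetical names, not taken from any standard or library.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER

# Smallest value that legitimately needs each sequence length (assumed
# ranges of the original 31-bit UTF-8 definition).
MIN_VALUE = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def lead_info(octet):
    """Return (sequence length, payload bits) for a lead octet, else None."""
    if octet < 0x80:
        return 1, octet
    if 0xC0 <= octet <= 0xDF:
        return 2, octet & 0x1F
    if 0xE0 <= octet <= 0xEF:
        return 3, octet & 0x0F
    if 0xF0 <= octet <= 0xF7:
        return 4, octet & 0x07
    if 0xF8 <= octet <= 0xFB:
        return 5, octet & 0x03
    if 0xFC <= octet <= 0xFD:
        return 6, octet & 0x01
    return None  # stray continuation octet, or FE/FF


def decode_strict(data):
    """Decode bytes to a list of code points, never passing over-long forms."""
    out = []
    i = 0
    while i < len(data):
        info = lead_info(data[i])
        if info is None:
            out.append(REPLACEMENT)
            i += 1
            continue
        length, value = info
        j = i + 1
        # Consume continuation octets (10xxxxxx) belonging to this sequence.
        while j < i + length and j < len(data) and 0x80 <= data[j] <= 0xBF:
            value = (value << 6) | (data[j] & 0x3F)
            j += 1
        if j - i < length or value < MIN_VALUE[length]:
            # Truncated or over-long: report it, do not deliver the value.
            out.append(REPLACEMENT)
        else:
            out.append(value)
        i = j
    return out
```

With this sketch, decode_strict(b"\xC0\x8A") yields a single U+FFFD rather
than a LINE FEED, so a terminal emulator built on it would never execute a
control character smuggled in as an over-long sequence.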

-- 
François Yergeau

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:50 EDT