From: Philippe VERDY (firstname.lastname@example.org)
Date: Mon Dec 13 2004 - 07:23:11 CST
Lars Kristan wrote:
> What I was talking about in the paragraph in question is what happens if you want to take unassigned codepoints and give them a new status.
You don't need to do that. No Unicode application should assign semantics to unassigned codepoints.
If a source sequence is invalid, and you want to preserve it, then this sequence must remain invalid if you change its encoding.
So there's no need for Unicode to assign valid code points for invalid source data.
There's enough space *assigned* as invalid (or assigned to non-characters) in all UTF forms to let an application create a local conversion scheme that performs a bijective conversion of invalid sequences:
- for example in UTF-8: trailing bytes 0x80 to 0xBF isolated or in excess, or even the invalid lead bytes 0xF8 to 0xFF
- for example in UTF-16: 0xFFFE, 0xFFFF
- for example in UTF-32: same as UTF-16, plus all code units above 0x10FFFF
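As a rough illustration of the UTF-32 case, here is a minimal Python sketch of such a local scheme. Everything in it is an assumption made for the example: the names ESCAPE_BASE, utf8_to_units and units_to_utf8 are hypothetical, and the choice of 0x110000 + byte as the escape value is arbitrary. Python's "surrogateescape" error handler is used only as a convenient way to locate the invalid bytes; the result is a private list of 32-bit code units, not valid UTF-32, which is the point.

```python
ESCAPE_BASE = 0x110000  # hypothetical: first 32-bit value above the Unicode range


def utf8_to_units(data):
    """Map UTF-8 bytes to 32-bit code units; each invalid byte b becomes ESCAPE_BASE + b."""
    units = []
    # surrogateescape turns each undecodable byte b into the lone surrogate U+DC00 + b.
    # A strict UTF-8 decoder never produces surrogates from valid input, so any code
    # point in U+DC80..U+DCFF seen here marks an invalid source byte.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            units.append(ESCAPE_BASE + (cp - 0xDC00))  # stays outside the valid range
        else:
            units.append(cp)
    return units


def units_to_utf8(units):
    """Inverse of utf8_to_units: restore the original bytes, invalid ones included."""
    out = bytearray()
    for u in units:
        if ESCAPE_BASE + 0x80 <= u <= ESCAPE_BASE + 0xFF:
            out.append(u - ESCAPE_BASE)      # escaped invalid byte, emitted as-is
        elif 0 <= u <= 0x10FFFF:
            out += chr(u).encode("utf-8")    # ordinary code point
        else:
            raise ValueError("unexpected code unit: %#x" % u)
    return bytes(out)


data = b"ok \xe2\x82\xac \xff\x80 end"       # valid text plus two invalid bytes
units = utf8_to_units(data)
restored = units_to_utf8(units)
```

Under these assumptions every byte string round-trips exactly, and the escaped units can never be mistaken for valid code points, so the text keeps its "incorrectly encoded" status after conversion.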
Using PUA space or some unassigned space in Unicode to represent invalid sequences present in a source text would be a severe design error in all cases, because that conversion would not be bijective and could map invalid sequences to valid ones without notice. That changes the status of the original text, which should be kept as incorrectly encoded until it is explicitly corrected or until the source text is reparsed with another, more appropriate encoding.
(In fact I think that mapping invalid sequences to U+FFFD is also an error, because U+FFFD is valid: the presence of the encoding error in the source is lost, and it will not throw exceptions in further processing of the remapped text, unless the application constantly checks for the presence of U+FFFD in the text stream, and all modules in the application explicitly forbid U+FFFD within their interfaces...)
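The loss of information can be seen in a few lines of Python, used here purely as an illustration; the variable names are made up for the example. Two different invalid byte strings decode to the same text once the errors are replaced with U+FFFD, and later stages accept that text without complaint.

```python
source_a = b"abc\xffdef"   # invalid byte 0xFF in the middle
source_b = b"abc\xfedef"   # a different invalid byte, 0xFE

# errors="replace" substitutes U+FFFD for each undecodable byte.
text_a = source_a.decode("utf-8", errors="replace")
text_b = source_b.decode("utf-8", errors="replace")

# The two distinct sources are now indistinguishable.
same_text = (text_a == text_b)

# The remapped text is valid, so re-encoding raises nothing,
# but it does not give back either original byte string.
reencoded = text_a.encode("utf-8")

# A strict decode of the original, by contrast, reports the error.
try:
    source_a.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```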