Re: Corrigendum #9

From: CE Whitehead <>
Date: Tue, 24 Jun 2014 09:16:00 -0400

What Markus Scherer said below sounds right to me as a recommendation (maybe it should be stated in Corrigendum 9):

From: Markus Scherer <>
Date: Thu, 12 Jun 2014 01:37:49 -0700
> If your library makes an explicit promise to remove noncharacters, then it
> should continue to do so.
> However, if your library is understood to pass through any strings, except
> for the advertised processing, then noncharacters should probably be
> preserved.
ME: Am I to understand from the above that,
regarding (which rejects the bold interpretation, though I don't think that's what Markus's email does) --
the "'bold interpretation' of internal exchange of noncharacters" may continue
where deletion of a noncharacter is never a good idea and should not happen, so that unrecognized noncharacters should simply be silently ignored,
with "all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points," thus "mapped to unique code unit sequences";
while, at the same time (albeit, as I understand things, only if the type of encoding is recognized),
noncharacters may be replaced with the replacement character (U+FFFD)? In this latter case the noncharacters are no longer mapped one-to-one, since all of them will have been replaced with U+FFFD. So is that one-to-one mapping recommendation going to be changed or not?
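
To make the two behaviours in this question concrete, here is a minimal Python sketch. It is only an illustration, not text from the Corrigendum or from any particular library; the helper names is_noncharacter and replace_noncharacters are hypothetical. It assumes the usual definition of the 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes) and relies on Python's built-in UTF-8 codec, which passes noncharacters through.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacter code points (assumed definition)."""
    in_arabic_block = 0xFDD0 <= cp <= 0xFDEF
    last_two_of_plane = (cp & 0xFFFE) == 0xFFFE and cp <= 0x10FFFF
    return in_arabic_block or last_two_of_plane


def replace_noncharacters(text: str) -> str:
    """Hypothetical policy: substitute U+FFFD for every noncharacter."""
    return "".join("\uFFFD" if is_noncharacter(ord(ch)) else ch for ch in text)


a = "x\uFFFFy"
b = "x\uFDD0y"

# Pass-through: each noncharacter gets its own UTF-8 code unit sequence
# and survives an encode/decode round trip unchanged.
assert a.encode("utf-8") != b.encode("utf-8")
assert a.encode("utf-8").decode("utf-8") == a

# Replacement: both strings collapse to the same result, so the original
# noncharacters can no longer be told apart.
assert replace_noncharacters(a) == replace_noncharacters(b) == "x\uFFFDy"
```

The first pair of assertions corresponds to the "mapped to unique code unit sequences" wording; the last one shows why a process that replaces noncharacters with U+FFFD is no longer one-to-one.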

* * *
I also have a question on Peter's notes on TUS 6.0 rule C7 (which, if I understand correctly, followed the Unicode 4.0 correction; maybe I should have sent this question as a separate email):
From: Peter Constable <>
Date: Fri, 13 Jun 2014 05:14:30 +0000
> TUS 6.0:
> C2 = TUS5.0, C2

"C7 When a process purports not to modify the interpretation of a valid coded character
sequence, it shall make no change to that coded character sequence other than the possible
replacement of character sequences by their canonical-equivalent sequences."

> Interestingly, the change to C7 does not permit non-characters to be replaced or removed at all while claiming not to have left the interpretation intact.
ME: If two sequences are canonically equivalent except that one has noncharacters in it, are they still canonically equivalent? (Just a wild question; it would be nice to have an answer in the FAQ on noncharacters or somewhere; maybe I missed the answer and it was there.)
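
One way to probe this question empirically is the following minimal Python sketch. It is not an authoritative answer: it only shows how the normalization data shipped with Python's standard unicodedata module treats a noncharacter, and the helper name canonically_equivalent is hypothetical. It assumes that comparing NFD forms is an acceptable test of canonical equivalence.

```python
import unicodedata

NONCHAR = "\uFFFF"  # one of the 66 noncharacters


def canonically_equivalent(s: str, t: str) -> bool:
    """Hypothetical helper: compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


composed = "\u00E9" + NONCHAR       # precomposed e-acute, then noncharacter
decomposed = "e\u0301" + NONCHAR    # e + combining acute, then noncharacter

# Same noncharacter in the same position on both sides.
print(canonically_equivalent(composed, decomposed))   # True

# Noncharacter present on one side only.
print(canonically_equivalent(composed, "\u00E9"))     # False

# Noncharacter inserted between the base letter and the combining mark.
print(canonically_equivalent("e" + NONCHAR + "\u0301", composed))  # False
```

In this sketch the noncharacter behaves like any other code point with no decomposition and combining class 0: it is carried through normalization unchanged, and it is never ignored when the two sides are compared.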
* * *
Sentinels, Security

Regarding the sentinels: I am an outsider, but I assume that with Corrigendum 9, U+FFFE will continue to be mentioned as having a generally (not always?) standard use throughout;
it is currently mentioned in Chapter 16.7, and I assume it still will be --
according to information in the FAQ and elsewhere:
 "U+FFFE. The 16-bit unsigned hexadecimal value U+FFFE is not a Unicode character value, and should be taken as a signal that Unicode characters should be byte-swapped before interpretation. U+FFFE should only be interpreted as an incorrectly byte-swapped version of U+FEFF"
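
The byte-swapping signal described in that quotation can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not code from the standard or the FAQ: the function names classify_first_unit and decode_utf16 are hypothetical, the input is assumed to be UTF-16 data, and the reader is assumed to start out expecting big-endian order.

```python
def classify_first_unit(data: bytes, assumed: str = "big") -> str:
    """Read the first 16-bit unit in the assumed byte order and classify it."""
    if len(data) < 2:
        return "no-signature"
    unit = int.from_bytes(data[:2], assumed)
    if unit == 0xFEFF:
        return "as-assumed"      # byte order mark read correctly
    if unit == 0xFFFE:
        return "byte-swapped"    # noncharacter U+FFFE: the bytes are reversed
    return "no-signature"


def decode_utf16(data: bytes, assumed: str = "big") -> str:
    """Decode UTF-16, flipping the byte order when U+FFFE is seen first."""
    verdict = classify_first_unit(data, assumed)
    order = assumed
    if verdict == "byte-swapped":
        order = "little" if assumed == "big" else "big"
    # Drop the signature only when one was actually found.
    body = data if verdict == "no-signature" else data[2:]
    return body.decode("utf-16-be" if order == "big" else "utf-16-le")


little_endian_text = b"\xff\xfe" + "hi".encode("utf-16-le")
print(classify_first_unit(little_endian_text))  # byte-swapped
print(decode_utf16(little_endian_text))         # hi
```

Here U+FFFE is never treated as content; seeing it as the first unit is only taken as evidence that the assumed byte order was wrong.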

Yes, I agree it would also be nice to have information about the security effects of any other sentinels, particularly U+FFFF and U+10FFFF
-- but I envision most security effects would be caused by removing one of these without replacing it (is that right?)
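
A small Python sketch of the kind of problem meant by "removing without replacing" follows. Everything in it is hypothetical and for illustration only: the filter looks_safe, the two clean-up policies, and the payload are invented here, and no real library is being described. It assumes a pipeline in which a check runs first and a later stage deals with noncharacters.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacter code points (assumed definition)."""
    return 0xFDD0 <= cp <= 0xFDEF or ((cp & 0xFFFE) == 0xFFFE and cp <= 0x10FFFF)


def looks_safe(text: str) -> bool:
    """Hypothetical early filter: reject text containing '<script'."""
    return "<script" not in text.lower()


def strip_noncharacters(text: str) -> str:
    """Policy A: delete noncharacters outright."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))


def replace_noncharacters(text: str) -> str:
    """Policy B: substitute U+FFFD for each noncharacter."""
    return "".join("\uFFFD" if is_noncharacter(ord(ch)) else ch for ch in text)


payload = "<scr\uFFFFipt>alert(1)</scr\uFFFFipt>"

# The filter sees the noncharacter in the middle and lets the text through.
assert looks_safe(payload)

# Deleting the noncharacter afterwards joins the pieces into the forbidden tag.
assert not looks_safe(strip_noncharacters(payload))

# Replacing it keeps the pieces apart, so the filter's verdict still holds.
assert looks_safe(replace_noncharacters(payload))
```

The point of the sketch is only that deletion can change what the surrounding text means after a check has already been made, whereas replacement (or pass-through) keeps the neighbouring characters separated.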

Hope these questions are helpful.

--C. E. Whitehead


Unicode mailing list
Received on Tue Jun 24 2014 - 08:18:00 CDT
