From: Lars Kristan (lars.kristan@hermes.si)
Date: Tue Dec 14 2004 - 06:51:05 CST
Doug Ewell wrote:
> Philippe VERDY wrote:
>
> > (In fact I also think that mapping invalid sequences to U+FFFD is also
> > an error, because U+FFFD is valid, and the presence of the encoding
> > error in the source is lost, and will not throw exceptions in further
> > processings of the remapped text, unless the application constantly
> > checks for the presence of U+FFFD in the text stream, and all modules
> > in the application explicitly forbids U+FFFD within its interface...)
>
> Mapping invalid sequences to U+FFFD is explicitly permitted by
> conformance clause C12a (TUS 4.0, p. 61):
>
> "When faced with [an] ill-formed code unit sequence while transforming
> or interpreting text, a conformant process must treat the first code
> unit... as an illegally terminated code unit sequence -- for example, by
> signaling an error, filtering the code unit out, or representing the
> code unit with a marker such as U+FFFD REPLACEMENT CHARACTER."
>
> Of course, any subsequent process that handles this text would have to
> understand this convention, and not choke if handed a U+FFFD.
Thank you, Doug. I had deliberately left this one unanswered, knowing it
wouldn't pass.
Actually, you are both right. Doug is right in saying that applications in
general MUST not treat U+FFFD any differently from any other valid
codepoint. And the standard is absolutely right in defining it so.
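To make both points concrete, here is a minimal sketch in Python, assuming its built-in UTF-8 decoder and the standard "strict", "ignore" and "replace" error handlers as stand-ins for the three options the quoted clause lists. The helper name decode_three_ways and the sample bytes are hypothetical, chosen only for illustration.

```python
RAW = b"ab\xffcd"  # 0xFF can never occur in well-formed UTF-8


def decode_three_ways(raw: bytes) -> dict:
    """Decode raw bytes with each of the three error-handling options."""
    results = {}
    # Option 1: signal an error.
    try:
        results["strict"] = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        results["strict"] = f"error at byte {exc.start}"
    # Option 2: filter the offending code unit out.
    results["ignore"] = raw.decode("utf-8", errors="ignore")
    # Option 3: represent it with U+FFFD REPLACEMENT CHARACTER.
    results["replace"] = raw.decode("utf-8", errors="replace")
    return results


def survives_strict_roundtrip(text: str) -> bool:
    """True if text re-encodes to UTF-8 that a strict decoder accepts."""
    try:
        return text.encode("utf-8").decode("utf-8", errors="strict") == text
    except UnicodeError:
        return False
```

The text produced with "replace" passes survives_strict_roundtrip, which is Philippe's point: once the marker is in place, later stages see perfectly valid text and nothing throws. It is also Doug's point: that text is conformant, and later stages have no grounds to choke on it.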
Philippe is right in saying that the presence of U+FFFD is fishy, but he
forgot to define the realm. In clear text, it is no more odd than any word
that doesn't pass the spell checker; it is just easier to find, nothing else.
But in security contexts, it could be treated differently. So, once you start
limiting the codepoints you allow (no spaces, no punctuation, no this or
that), you should perhaps also think about U+FFFD. But definitely not in
general.
So, in a security context U+FFFD might be rejected. Although, in a purely
centralized security sublayer (for example a filesystem), it doesn't need to
be (I was also discussing this elsewhere). But if one feels safer if it is,
then they're free to choose so. Again, that's for a security context, not in
general.
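A rough sketch of that distinction, in Python. The function accept, its restricted flag, and the particular set of forbidden characters are all hypothetical. They only illustrate the idea that U+FFFD joins the forbidden set once you are already restricting codepoints, and is left alone otherwise.

```python
import string

REPLACEMENT = "\ufffd"

# Hypothetical restriction list for a security-sensitive field:
# ASCII whitespace and ASCII punctuation ("no this or that").
FORBIDDEN = set(string.whitespace) | set(string.punctuation)


def accept(text: str, restricted: bool) -> bool:
    """Decide whether text is acceptable.

    restricted=False models ordinary clear text: U+FFFD is just another
    valid codepoint and is never a reason to reject.
    restricted=True models a context that already limits codepoints;
    there U+FFFD is added to the forbidden set.
    """
    if not restricted:
        return True
    forbidden = FORBIDDEN | {REPLACEMENT}
    return not any(ch in forbidden for ch in text)
```

With this sketch, accept("caf\ufffd", restricted=False) is true and accept("caf\ufffd", restricted=True) is false, while a plain name such as "report2004" is accepted either way.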
BTW, what are the properties of U+FFFD? In English please, do not point me
to the standard. For example, can it be part of an identifier? Is it
'alphanumeric'? Let me speculate. It should be a letter (more often than not,
the data it replaced was originally one). I would accept it for identifiers
(variables, filenames). It has no case properties. And it is obviously not a
space.
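These guesses can be checked without reading the standard, assuming Python's unicodedata module and str methods as a view onto the Unicode Character Database (which version of the database you get depends on the interpreter). The helper describe is hypothetical. Its "ok_in_identifier" entry reflects Python's own identifier rules, not filenames or any other language.

```python
import unicodedata

REPLACEMENT = "\ufffd"


def describe(ch: str) -> dict:
    """Collect a few basic properties of a single character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),  # General_Category
        "is_letter": ch.isalpha(),
        "is_alphanumeric": ch.isalnum(),
        "is_space": ch.isspace(),
        # "has case" here means upper/lower mapping changes the character
        "has_case": ch.upper() != ch or ch.lower() != ch,
        # Python identifier rules only (XID_Start / XID_Continue)
        "ok_in_identifier": ("a" + ch).isidentifier(),
    }
```

Calling describe(REPLACEMENT) reports the general category "So" (Symbol, other). By that measure it is not a letter, not alphanumeric, and not accepted in a Python identifier, while the guesses about case and space hold: it has no case mappings and is not whitespace.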
Lars