Validity and properties of U+FFFD (was RE: Roundtripping in Unicode)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Tue Dec 14 2004 - 06:51:05 CST

    Doug Ewell wrote:
    > Philippe VERDY wrote:
    >
    > > (In fact I also think that mapping invalid sequences to U+FFFD is also
    > > an error, because U+FFFD is valid, and the presence of the encoding
    > > error in the source is lost, and will not throw exceptions in further
    > > processings of the remapped text, unless the application constantly
    > > checks for the presence of U+FFFD in the text stream, and all modules
    > > in the application explicitly forbids U+FFFD within its interface...)
    >
    > Mapping invalid sequences to U+FFFD is explicitly permitted by
    > conformance clause C12a (TUS 4.0, p. 61):
    >
    > "When faced with [an] ill-formed code unit sequence while transforming
    > or interpreting text, a conformant process must treat the first code
    > unit... as an illegally terminated code unit sequence -- for example, by
    > signaling an error, filtering the code unit out, or representing the
    > code unit with a marker such as U+FFFD REPLACEMENT CHARACTER."
    >
    > Of course, any subsequent process that handles this text would have to
    > understand this convention, and not choke if handed a U+FFFD.

    Thank you Doug. I had deliberately left this one unanswered and knew it
    wouldn't pass.

    Actually, you are both right. Doug is right in saying that applications in
    general must not treat U+FFFD any differently from any other valid
    codepoint. And the standard is absolutely right in defining it so.
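
To make the three options in the quoted clause concrete (signal an error, filter the code unit out, or substitute a marker), here is a minimal sketch using Python's built-in UTF-8 decoder. Python is only my choice for illustration; nobody in this thread mentioned it, and the function names are made up for the example.

```python
# Illustrative sketch only: the helper names are hypothetical.
# 0xFF can never occur in well-formed UTF-8, so this input is ill-formed.
raw = b"ab\xffcd"


def decode_strict(data: bytes) -> str:
    """Option 1: signal an error (raises UnicodeDecodeError)."""
    return data.decode("utf-8", errors="strict")


def decode_filtered(data: bytes) -> str:
    """Option 2: filter the offending code unit out."""
    return data.decode("utf-8", errors="ignore")


def decode_marked(data: bytes) -> str:
    """Option 3: represent the offending code unit with U+FFFD."""
    return data.decode("utf-8", errors="replace")


try:
    decode_strict(raw)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# After option 3 the text is perfectly valid: U+FFFD encodes like any
# other code point, which is exactly why the original error is no longer
# detectable downstream.
reencoded = decode_marked(raw).encode("utf-8")
```

Note that `reencoded` is well-formed UTF-8, so a later strict decode of it succeeds. That is the point Philippe raised: once the marker is in, only an explicit check for U+FFFD can reveal that something was wrong in the source.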

    Philippe is right in saying that the presence of U+FFFD is fishy. But he
    forgot to define the realm. In clear text, it is no more odd than any word
    that doesn't pass the spell checker; it is just easier to find, nothing
    else. But in security contexts, it could be treated differently. So, once
    you start limiting the codepoints you allow, like no spaces, no
    punctuation, no this or that, then you should perhaps also think about
    U+FFFD. But definitely not in general.

    So, in a security context U+FFFD might be rejected. Although, in a purely
    centralized security sublayer (for example a filesystem), it doesn't need to
    be (I was also discussing this elsewhere). But if one feels safer if it is,
    then they're free to choose so. Again, that's for a security context, not in
    general.
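
As a rough sketch of what "also think about U+FFFD" could look like in a restricted context, here is a hypothetical validator. The policy (no whitespace, no U+FFFD) and the name `check_restricted_name` are assumptions made up for this example, not a recommendation from the standard or from anyone in this thread.

```python
# Hypothetical policy for a restricted (security-sensitive) context.
REPLACEMENT_CHARACTER = "\ufffd"


def check_restricted_name(name: str) -> str:
    """Return name unchanged if it passes the (assumed) policy, else raise.

    The policy already limits the allowed code points (no whitespace here),
    so rejecting U+FFFD is just one more entry on the same list.
    """
    if any(ch.isspace() for ch in name):
        raise ValueError("whitespace is not allowed in this context")
    if REPLACEMENT_CHARACTER in name:
        raise ValueError("U+FFFD is not allowed in this context")
    return name


def check_plain_text(text: str) -> str:
    """General text: U+FFFD is accepted like any other valid code point."""
    return text
```

The two functions are kept separate on purpose: the rejection lives only where the code points are already being limited, and general text handling stays untouched.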

    BTW, what are the properties of U+FFFD? In English please, do not point me
    to the standard. Like, can it be a part of an identifier, is it an
    'alphanumeric'? Let me speculate. It should be a letter (more often than
    not, the character it replaced probably was one). I would accept it for
    identifiers (variables, filenames). It has no case properties. And it is
    obviously not a space.
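
One way to check these guesses without reading the standard is to ask a copy of the Unicode Character Database. The sketch below uses Python's `unicodedata` module and string methods. The results reflect whatever UCD version ships with the interpreter, and the identifier test is Python's own identifier rule, not a rule for filenames or for other languages.

```python
import unicodedata

ch = "\ufffd"

# Collect the properties asked about above. The dictionary keys are
# ad hoc labels for this example, not official property names.
properties = {
    "name": unicodedata.name(ch),
    "general_category": unicodedata.category(ch),
    "is_alphabetic": ch.isalpha(),
    "is_alphanumeric": ch.isalnum(),
    "is_space": ch.isspace(),
    "has_case": ch.lower() != ch or ch.upper() != ch,
    # Python's rule for identifiers (XID_Start / XID_Continue based).
    "python_identifier_start": ch.isidentifier(),
    "python_identifier_continue": ("a" + ch).isidentifier(),
}

for key, value in properties.items():
    print(f"{key}: {value}")
```

As far as I can tell, the general category reported is `So` (Symbol, other), so the database does not treat it as a letter, and Python at least does not accept it in identifiers. The guesses about case and space do hold: no case mappings, and not whitespace.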

    Lars


