Re: Validity and properties of U+FFFD (was RE: Roundtripping in Unicode)

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 14 2004 - 15:02:40 CST


    Lars asked:

    > BTW, what are the properties of U+FFFD? In English please, do not point me
    > to the standard.

    ?!

    It has the general category of "Symbol Other" [gc=So].
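    For readers who want to check this themselves, here is a minimal sketch
    that looks the value up. It assumes Python and its standard unicodedata
    module, neither of which is part of the original message. The constant
    name REPLACEMENT is only illustrative.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Look up the character name and general category in the Unicode
# Character Database bundled with the Python interpreter.
name = unicodedata.name(REPLACEMENT)
category = unicodedata.category(REPLACEMENT)

print(f"U+{ord(REPLACEMENT):04X} {name}: gc={category}")
```

    The category should come back as "So", matching the value quoted above.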

    > Like, can it be a part of an identifier,

    It does not have the ID_Start or the ID_Continue property, which
    you could determine for yourself by referring to the standard:

    http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

    That doesn't prevent a formal syntax definition for a language
    from including it within the BNF for defining an identifier,
    but in general, no, it would not appear in identifiers, just as
    most other symbols would not.
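    As one concrete illustration, Python's own identifier syntax can be
    probed directly. This is a sketch under the assumption that Python is
    an acceptable stand-in for "a language whose identifiers are derived
    from the Unicode identifier properties" (it uses the closely related
    XID_Start and XID_Continue); the helper names are hypothetical and
    other languages may differ.

```python
def can_start_identifier(ch: str) -> bool:
    """Hypothetical helper: True if ch alone is a valid Python identifier."""
    return ch.isidentifier()


def can_continue_identifier(ch: str) -> bool:
    """Hypothetical helper: True if ch may follow a valid start character."""
    return ("a" + ch).isidentifier()


REPLACEMENT = "\ufffd"

# U+FFFD is accepted in neither position, like most other symbols.
print(can_start_identifier(REPLACEMENT), can_continue_identifier(REPLACEMENT))

# For comparison: a letter works in both positions, a digit only as a
# continuation, and the dollar sign (another symbol) in neither.
print(can_start_identifier("x"), can_continue_identifier("x"))
print(can_start_identifier("1"), can_continue_identifier("1"))
print(can_start_identifier("$"), can_continue_identifier("$"))
```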

    > is it an 'alphanumeric'?

    No.

    > Let me speculate. It should be a letter

    No.

    > (it probably more
    > often originally was than wasn't).

    You are referring here to speculation regarding what uninterpretable
    sequence in some other character encoding was *converted* to U+FFFD
    on conversion to Unicode. But that is irrelevant to the properties
    of U+FFFD itself.

    That is tantamount, for example, to claiming that the C0 control
    code 0x1A SUBSTITUTE should be defined as a "letter", simply because
    it is often used in signalling a conversion substitution in
    8-bit tables.
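    The distinction can be made concrete with a small sketch. It assumes
    Python's built-in UTF-8 decoder with errors="replace" as the
    converter, and a Latin-1 encoded "café" as the uninterpretable input;
    both are illustrative choices, not something from this thread.

```python
import unicodedata

# "café" encoded as ISO 8859-1: the final byte 0xE9 is the letter é there,
# but on its own it is not a valid UTF-8 sequence.
latin1_bytes = "caf\u00e9".encode("latin-1")

# Decoding as UTF-8 with replacement substitutes U+FFFD for the bad byte.
converted = latin1_bytes.decode("utf-8", errors="replace")

original_last = "\u00e9"
converted_last = converted[-1]

# The source character was a lowercase letter; the substitute is a symbol.
print(unicodedata.category(original_last), original_last.isalpha())
print(unicodedata.category(converted_last), converted_last.isalpha())
print(converted_last.isalnum())
```

    The properties reported for the last character are those of U+FFFD,
    whatever the replaced byte may have meant in its original encoding.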

    > I would accept it for identifiers (variables, filenames).

    If you are defining your own language, that would be your
    prerogative, of course. But if you are using standard languages
    like C, C++, Java, C#, SQL, etc., it is unlikely that you would
    be correct in that approach.

    > It has no case properties. And it is obviously not a
    > space.

    True.
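    Both points can be spot-checked with a short sketch, again assuming
    Python's built-in string methods as a convenient view onto the
    Unicode data. The variable names are illustrative only.

```python
ch = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# No case mappings: upper-, lower-, and case-folding all leave it unchanged.
unchanged_by_case = ch.upper() == ch and ch.lower() == ch and ch.casefold() == ch

# It is not itself a cased character either.
is_cased = ch.isupper() or ch.islower()

# And it is not whitespace.
is_space = ch.isspace()

print(unchanged_by_case, is_cased, is_space)
```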

    There is much, much more to know about Unicode character properties
    than just what can be inferred from an attempt to apply the
    POSIX model to UTF-8. A good place to start would be Unicode
    Technical Report #23, The Unicode Character Property Model:

    http://www.unicode.org/reports/tr23/

    And after that, yes, I would point you to the standard.

    --Ken


