Re: Stateful encoding mechanisms

From: Philippe VERDY (
Date: Fri May 20 2005 - 15:36:35 CDT

    > Message of 20/05/05 22:13
    > Philippe Verdy wrote at 5:04 PM on Friday, May 20, 2005:
    > >From: "Dean Snyder" <>
    > >> By the way, can you indeed tell us what the "unique status" of the code
    > >> unit 0xDF02 is? And if it has one, why it is not spelled out in the
    > >> standard?
    > >
    > >It is in the standard:
    > >* the code unit 0xDF02 is a surrogate.
    > >* the codepoint U+DF02 is permanently a non-character:
    > >* there's no assigned character on U+DF02, it will never be assigned to
    > >character by ISO/IEC 10646-1 or Unicode, because it is already bound to a
    > >non-character.
    > This does not define any unique status for 0xDF02; instead it defines a
    > status that 0xDF02 shares with all the other 1023 low surrogates. A
    > strange definition indeed of unique.
    > The interpretation of 0xDF02 is context-bound and that, by definition,
    > makes its "status" multiple, and therefore non-unique. Contrary to what
    > Ken has implied ["In UTF-16, 0xD800 does not set a "state" which then
    > alters the interpretation of a subsequent code unit"], the
    > interpretation of 0xDF02 IS directly influenced by its preceding high
    > surrogate. To put it another way, it is only the COMBINATIONS of high
    > and low surrogates that yield unique results.
    > Leaving out the BOM, the interpretations of all non-surrogate code units
    > in a UTF-16 text stream are context-free; the interpretations of all
    > surrogate code units in the same stream are context-bound. That is why I
    > am referring to surrogates as a stateful encoding mechanism, and subject
    > to fragment fragility.

    Whatever you think, you are trying to defeat something that is not a problem. You are discussing code units or bytes, but they are only needed for the serialization of text data. They are not plain text by themselves, and ABSOLUTELY NO code unit has any semantics. A code unit is just an integer. None of the semantics of characters in Unicode and in ISO/IEC 10646 lives there.

    So remember: code units are only one way to represent characters in a necessarily limited encoding space that needs fixed-size units. Any other encoding form or encoding scheme is possible in addition to the default encoding forms and schemes described in Unicode.
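
    To make this concrete, here is a small sketch (illustration only, not taken from the standard) that serializes one supplementary code point, U+10302, in three encoding forms. The helper name code_units is my own; the codecs are the ones built into Python. The code units differ completely, the character does not, and the UTF-16 form is exactly the pair 0xD800 0xDF02 discussed in this thread.

```python
# Illustration only: one code point, three different sequences of code units.
ch = "\U00010302"  # U+10302, a code point outside the BMP


def code_units(text: str, encoding: str, width: int) -> list[int]:
    """Encode `text` and split the bytes into big-endian units of `width` bytes.

    `code_units` is a hypothetical helper written for this example.
    """
    data = text.encode(encoding)
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]


utf8_units = code_units(ch, "utf-8", 1)       # four 8-bit code units
utf16_units = code_units(ch, "utf-16-be", 2)  # two 16-bit code units (a surrogate pair)
utf32_units = code_units(ch, "utf-32-be", 4)  # one 32-bit code unit

for name, units in (("UTF-8", utf8_units),
                    ("UTF-16", utf16_units),
                    ("UTF-32", utf32_units)):
    print(name, [hex(u) for u in units])
```

    Decoding any of the three byte sequences gives back the same single code point, which is the level where the character properties are defined.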

    Anyway, once you speak about serialization, you have a unidirectional stream with a start and an end, but this is not the proper processing level for Unicode and ISO/IEC 10646. So artefacts like the BOM and surrogates are not directly part of the character model, and they do not weaken it, because they are independent of it.

    You could say the same about other serialization formats, such as the MIME transfer-encoding syntaxes used to encapsulate data without altering it. UTFs are just like MIME envelopes: they don't define the content.

    But if you really think so, which is the correct level at which to keep the semantics of text:
    - Abstract characters (assigned code points)?
    - Combining sequences?
    - Grapheme clusters?
    - Syllables?
    - Words?
    - Phrases?
    - Sentences?
    - Paragraphs?
    - Chapters?

    You want to find a "solid" abstraction that spans several levels of text analysis. As far as I know, the only level at which a text keeps its semantics is the unbroken text itself as a whole, because everything else is necessarily context-dependent.

    Surrogates have their own semantics at the code unit level: each one is either a low or a high surrogate and has an ordinal number, and the combination of both gives it a unique identity in the code unit space, like all other "normal" code units. This is quite enough for this level of analysis, and really not complicated to parse, so it is perfect for actual (de)serializer implementations.

    (Yes, this requires a state variable to decode it, but it's so easy to manage algorithmically...)
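
    As a rough sketch of how little state is involved, the decoder below keeps one variable (the pending high surrogate) and nothing else. The names classify and decode_utf16 are hypothetical, and replacing an unpaired surrogate with U+FFFD is my own assumption about error handling; other policies (raising an error, for instance) are equally possible.

```python
from typing import Iterable, Iterator, Optional

REPLACEMENT = 0xFFFD  # assumed policy: substitute U+FFFD for unpaired surrogates


def classify(unit: int) -> tuple[str, int]:
    """Return the kind of a 16-bit code unit and its ordinal within that kind."""
    if 0xD800 <= unit <= 0xDBFF:
        return ("high", unit - 0xD800)
    if 0xDC00 <= unit <= 0xDFFF:
        return ("low", unit - 0xDC00)
    return ("other", unit)


def decode_utf16(units: Iterable[int]) -> Iterator[int]:
    """Yield code points from 16-bit code units, using a single state variable."""
    pending_high: Optional[int] = None  # ordinal of a high surrogate awaiting its low
    for unit in units:
        kind, index = classify(unit)
        if kind == "low" and pending_high is not None:
            # Well-formed pair: combine the two 10-bit ordinals.
            yield 0x10000 + (pending_high << 10) + index
            pending_high = None
            continue
        if pending_high is not None:
            # The previous high surrogate was not followed by a low one.
            yield REPLACEMENT
            pending_high = None
        if kind == "high":
            pending_high = index
        elif kind == "low":
            yield REPLACEMENT  # low surrogate with no preceding high surrogate
        else:
            yield unit
    if pending_high is not None:
        yield REPLACEMENT  # stream ended in the middle of a pair


print([hex(cp) for cp in decode_utf16([0x0041, 0xD800, 0xDF02, 0x0042])])
```

    Note that a broken fragment damages at most the one character at the cut: the decoder resynchronizes on the very next code unit, because high and low surrogates occupy disjoint ranges.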

    You cannot give code units more semantics at this level, because all character properties defined in Unicode are only defined in the space of code points or the bijective space of coded abstract characters.

    So I maintain: code units are NOT characters, and there's no surrogate character in Unicode, and no BOM character in Unicode, so they don't have character properties.
