Re: Invalid code points

From: Asmus Freytag (
Date: Thu Jun 04 2009 - 22:54:32 CDT


    On 6/4/2009 7:22 PM, verdy_p wrote:
    > I also agree that the only useful interest that I see for U+FFFC is as a placeholder when it is needed for indicating the position where an external binary object is to be inserted...

    So far so good.

    The rest of your message mixes some good points with a bit of speculation.

    You are correct that for an XML document, or any other "plain text
    encoded" higher level protocol, one would not use U+FFFC, but use the
    syntax constructs of that protocol.

    You are also correct that the information "an object was inserted here"
    is of limited use when a rich text file is converted to plain text.
    There might be users who would wish to have this information, but most
    systems don't insert a U+FFFC in that case.

    You get into speculation where you try to imagine the possible, actual
    uses for this character. It was encoded not primarily for data
    interchange, but to solve a common implementation problem: inline images
    and objects can be formatted like characters (for example, underlines
    might be applied to them). By providing an actual *character* in the
    text buffer, such text formatting can be kept regular (i.e. all
    character styling applies to actual character offsets).
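
    As a rough illustration of that point, here is a minimal Python sketch
    of a text buffer in which an inline object occupies one character
    position. The RichText class and its method names are hypothetical,
    invented for this example; they are not taken from any real
    implementation.

```python
from dataclasses import dataclass, field

OBJECT_REPLACEMENT = "\uFFFC"  # U+FFFC OBJECT REPLACEMENT CHARACTER


@dataclass
class RichText:
    """Hypothetical rich-text buffer: plain characters plus style runs."""
    text: str = ""
    objects: dict = field(default_factory=dict)  # char offset -> embedded object
    runs: list = field(default_factory=list)     # (start, end, style), end exclusive

    def append_text(self, s: str) -> None:
        self.text += s

    def append_object(self, obj) -> None:
        # The object occupies exactly one character position in the buffer.
        self.objects[len(self.text)] = obj
        self.text += OBJECT_REPLACEMENT

    def apply_style(self, start: int, end: int, style: str) -> None:
        self.runs.append((start, end, style))

    def styles_at(self, offset: int) -> list:
        # Styles are looked up by character offset only; objects need no special case.
        return [style for start, end, style in self.runs if start <= offset < end]


doc = RichText()
doc.append_text("See ")
doc.append_object("logo.png")  # stand-in for an inline image
doc.append_text(" here")
doc.apply_style(0, len(doc.text), "underline")
```

    Because the image sits at offset 4 as an ordinary character,
    doc.styles_at(4) reports the underline without any object-specific
    logic.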

    Most *binary* data interchange protocols are based more or less directly
    on the in-memory representation of a rich text document. For that
    reason, it is those protocols that are most likely to contain a U+FFFC
    in the text part of the binary data stream.

    I know that is not intuitive, but that's what was encoded.

    Later, much later, the UTC realized that there were other, similar needs
    to have "internal-use" code points that are stripped out during
    plain-text conversion. This has led to the concept of noncharacters,
    and the 34 existing, permanently reserved code points were augmented by
    32 newly designated noncharacters, to give a set of 66 codes that can be
    used for similar, internal-use placeholders.
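
    The count of 66 can be checked with a short Python sketch. The ranges
    used here come from the Unicode Standard rather than from this message:
    the last two code points of each of the 17 planes (34 in total) plus
    U+FDD0..U+FDEF (32). The function names are illustrative only.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the Unicode noncharacters."""
    # The 32 contiguous noncharacters U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFF) >= 0xFFFE


# Enumerate the whole code space to collect every noncharacter.
NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]


def strip_noncharacters(s: str) -> str:
    """Drop internal-use noncharacters, as a plain-text export might."""
    return "".join(ch for ch in s if not is_noncharacter(ord(ch)))
```

    Note that U+FFFC is not in this set, so strip_noncharacters leaves it
    in place.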

    The U+FFFC OBJECT REPLACEMENT CHARACTER was left as is - i.e. it's a
    character, not a noncharacter - which makes its use in plain text
    optional. You may use it to indicate where an object had to be stripped,
    but many implementations choose not to.
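
    A plain-text export could expose that choice as an option. The sketch
    below is only an assumption about how such a converter might look;
    to_plain_text and keep_object_markers are made-up names.

```python
OBJECT_REPLACEMENT = "\uFFFC"  # U+FFFC OBJECT REPLACEMENT CHARACTER


def to_plain_text(buffer: str, keep_object_markers: bool = False) -> str:
    """Convert an in-memory text buffer to plain text.

    By default the object placeholders are dropped; pass
    keep_object_markers=True to leave a U+FFFC where each object was.
    """
    if keep_object_markers:
        return buffer
    return buffer.replace(OBJECT_REPLACEMENT, "")
```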

