Re: Invalid code points

From: William_J_G Overington (wjgo_10009@btinternet.com)
Date: Thu Jun 04 2009 - 04:13:49 CDT


    On Wednesday 3 June 2009, Kenneth Whistler <kenw@sybase.com> wrote:

    > William Overington suggested:
    >
    > > The suggestion of using b64-encoded binary data could perhaps
    > > be adapted by placing a Unicode U+FFFC OBJECT REPLACEMENT CHARACTER
    > > in front of the b64-encoded binary data. That way, the parameter
    > > passing would always be in Unicode characters and the presence of
    > > a U+FFFC character would indicate that subsequent characters in
    > > the parameter should be interpreted as b64-encoded binary data.
    >
    > It may perhaps be belaboring the obvious, but U+FFFC OBJECT
    > REPLACEMENT CHARACTER is not defined that way, and would not
    > indicate that (or anything else) about subsequent characters
    > in a string parameter.

    Ken is correct.

    > Any attempt to use U+FFFC in that way would be very unlikely to
    > be interpreted as such by any Unicode-conformant system, and
    > in fact is nothing more than an arbitrary attempt to establish
    > a text convention which would consist of a higher-level protocol.

    Well, not quite arbitrary. The problem is to develop a demonstration of a new idea: passing objects using a text parameter. Ruszlán Gaszanov asked "What's wrong with passing b64-encoded binary data?" I replied that "Passing b64-encoded binary data could be ambiguous as to whether it was text or b64-encoded binary data." and suggested a way in which either text or b64-encoded binary data could be passed as a parameter.
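
    The ambiguity can be illustrated with a minimal Python sketch. It is not part of the original discussion, and the helper name looks_like_base64 is hypothetical. It only shows that an ordinary word can also be well-formed base64, so a receiver cannot tell from the characters alone which reading was intended.

```python
import base64
import binascii


def looks_like_base64(s: str) -> bool:
    """Return True if s decodes cleanly as standard base64 (hypothetical helper)."""
    try:
        base64.b64decode(s, validate=True)
        return True
    except (binascii.Error, ValueError):
        # binascii.Error: bad characters or padding; ValueError: non-ASCII input.
        return False


# The ordinary word "text" is four characters from the base64 alphabet,
# so it is also a well-formed base64 string encoding three bytes.
print(looks_like_base64("text"))
# A string containing a comma and a space is not valid base64.
print(looks_like_base64("Hello, world"))
```

    A string that passes this check might still have been meant as plain text, which is why some out-of-band or in-band signal is needed.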

    The Unicode Standard includes the following document.

    http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf

    The document has the following on page 26.

    quote

    U+FFFC. The U+FFFC object replacement character is used as an insertion point for objects located within a stream of text. All other information about the object is kept outside the character data stream. Internally it is a dummy character that acts as an anchor point for the object’s formatting information. In addition to assuring correct placement of an object in a data stream, the object replacement character allows the use of general stream-based algorithms for any textual aspects of embedded objects.

    end quote

    So, my suggestion needs to be altered so that the parameter passing mechanism, upon detecting a U+FFFC character, places all of the characters that follow the U+FFFC character into a separate storage place. The passed parameter is then true Unicode that may, but need not, contain a U+FFFC character.
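
    The altered mechanism might be sketched in Python as follows. This is only an illustration under stated assumptions: the function name split_parameter is hypothetical, only the first U+FFFC is treated as the anchor, and the "separate storage place" is modelled as a second return value. As noted in the quoted reply above, this is a higher-level protocol, not behaviour defined by the Unicode Standard.

```python
import base64
from typing import Optional, Tuple

OBJECT_REPLACEMENT = "\uFFFC"  # U+FFFC OBJECT REPLACEMENT CHARACTER


def split_parameter(param: str) -> Tuple[str, Optional[str]]:
    """Split a passed parameter at the first U+FFFC (hypothetical convention).

    Returns (text, stored). The text keeps the U+FFFC as an anchor point;
    everything after it is moved into separate storage. If no U+FFFC is
    present, the parameter is returned unchanged and stored is None.
    """
    head, marker, tail = param.partition(OBJECT_REPLACEMENT)
    if not marker:
        return param, None
    return head + marker, tail


# Usage: the sender places b64-encoded binary data after the U+FFFC.
payload = base64.b64encode(b"\x89PNG").decode("ascii")
text, stored = split_parameter("icon: " + OBJECT_REPLACEMENT + payload)
# The receiver decodes the separately stored characters on its own terms.
data = base64.b64decode(stored, validate=True) if stored is not None else None
```

    Keeping the decode step outside split_parameter means the text part can be handled by ordinary string code that knows nothing about the stored object.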

    > One could equally well (and probably with equal outcome) assert
    > that a U+25E7 SQUARE WITH LEFT HALF BLACK character would indicate
    > that subsequent characters in a parameter should be interpreted
    > as b64-encoded binary data.

    Well, no, because the choice of U+FFFC does give humans a clue as to what might be meant.

    > Or for that matter, that subsequent
    > characters in a string should be interpreted as a chocolate chip
    > cookie recipe.
    >
    > --Ken

    Well, U+003C LESS-THAN SIGN gets used for many purposes in some documents.

    William Overington

    4 June 2009


