Re: Encoding of Logos, Personal Gaiji et cetera for electronic library archiving (formerly Re: Hypersurrogates)

From: Doug Ewell
Date: Sat Aug 29 2009 - 12:09:46 CDT


    "William_J_G Overington" <wjgo underscore 10009 at btinternet dot com> wrote:

    > Allocation of the sequence of two codepoints for each glyph would be
    > done either by the Unicode Consortium and ISO jointly,

    Then it's "encoding," as Asmus said.

    > or maybe by delegated registration centres, such as government trade
    > mark offices and national libraries in the various jurisdictions of
    > the world, each being allocated a block of encoding space to use for
    > specific purposes.

    Then it's sort of like ISO 2022: still encoding, but decentralized so
    that implementations pick and choose only the pieces they care about and
    ignore the rest.

    > I know that various items such as logos and personal gaiji are not
    > encoded at present due to rules, not due to space considerations.
    > However, I feel that progress in information technology needs a way of
    > encoding all such items in plain text for electronic library
    > archiving.

    You may be in a bit of luck, because the encoding of emoji as used in
    Japanese cell phones shows that previous attitudes within WG2 and UTC
    against encoding novel and short-lived symbols, and even logos, may be
    relaxing. (A set of corporate logos, with representative glyphs
    excised, was included in early emoji proposals.)

    But the items to be encoded still need to be proposed explicitly, even
    if part of a block of hundreds. Even with emoji, there was never a
    proposal to set aside an empty block and let others populate it as they
    saw fit.

    You are far better off thinking of the graphical thingies of the world
    as belonging to one of two categories, just as UTC and WG2 do:

    1. The ones that merit formal encoding can be proposed and accepted
    into Unicode/10646, and will get their own code points, and can be used
    in any conformant Unicode/10646 implementation (preferably with a font
    that supports them).

    2. The others can be represented with one of the 137,468 available
    private-use code points, or with an inline image (as people used to
    represent ordinary characters outside their code page, before Unicode
    came along), or with a higher-level escape sequence, as Asmus described.
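    To make the 137,468 figure concrete, here is a minimal Python sketch of
    the three Private Use Areas the Unicode Standard defines. The ranges are
    standard; the helper name `is_private_use` is my own illustration, not a
    library API.

    ```python
    # The three Private Use Areas defined by the Unicode Standard.
    PUA_RANGES = [
        (0xE000, 0xF8FF),      # BMP Private Use Area (6,400 code points)
        (0xF0000, 0xFFFFD),    # Plane 15, Supplementary PUA-A (65,534)
        (0x100000, 0x10FFFD),  # Plane 16, Supplementary PUA-B (65,534)
    ]

    def is_private_use(cp: int) -> bool:
        """True if the code point lies in one of the Private Use Areas."""
        return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

    total = sum(hi - lo + 1 for lo, hi in PUA_RANGES)
    print(total)                   # 137468, the figure cited above
    print(is_private_use(0xE000))  # True: first BMP PUA code point
    print(is_private_use(0x0041))  # False: LATIN CAPITAL LETTER A
    ```

    Any agreement about what glyph a given PUA code point represents is
    strictly private between sender and receiver, which is exactly the
    point of the mechanism.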

    You are *not* better off proposing a third category between 1 and 2,
    things that Unicode "encodes but doesn't encode." Several people over
    the years have commented on how unlikely that is to happen.

    As a matter of fact, from time to time people in positions of authority
    state that private-use mechanisms (in Unicode or elsewhere) are
    inherently evil and problematic and should not be used. I find such
    statements frustrating and counterproductive: they lead some people to
    seek non-private-use solutions to essentially private-use problems.
    Perhaps we should be grateful that Unicode has not deprecated the PUAs
    altogether.

    > Hopefully this thread will raise awareness of this issue and hopefully
    > some people reading this will post agreement that something needs to
    > be done, not necessarily using my encoding idea, but that something
    > needs to be done. I feel that it is not very good if copied and
    > pasted text from archived documents needs strings with ampersand or
    > colon in them to signify the meaning of characters.

    Propose the characters. Don't just state that there are gazillions of
    characters that need to be represented in documents and are not.

    Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14

    This archive was generated by hypermail 2.1.5 : Sat Aug 29 2009 - 12:13:10 CDT