Re: Emoji and Search Engines

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jan 05 2009 - 19:34:14 CST

    André Szabolcs Szelp responded:

    > 2009/1/6 Kenneth Whistler <kenw@sybase.com>:
    > > The "proper solution" envisioned here would *obsolete* the need to
    > > resort to character-based, non-extensible hacks for transmitting
    > > pictographic symbols in the way the wireless carriers in Japan now
    > > are doing -- but it would not solve the *present problem* of
    > > dealing with the de facto existing characters *as* characters,
    > > which is what we are up against here.
    >
    > So are Ewellic, Verdurian, Røzhxh etc., etc. characters existing *as*
    > characters.

    But this contention ignores the essential difference between
    Ewellic (and all the other denizens of the PUA codes in the
    ConScript registry) and the wireless emoji sets. Mark
    Crispin pointed it out:

    crispin>> There are hundreds of millions of mobile phones with
    crispin>> the current [emoji] set.

    Implementation on hundreds of millions of devices connected
    to the internet and to the search engines and databases
    operating on those data streams makes this a quintessential
    case of an encoding requiring a public, *standard* solution.
    Ewellic, on the other hand -- as the author of the script
    himself has attested on this list -- is not widely used, nor
    does it require more than a private agreement for PUA code
    points for the few who might actually wish to exchange data.
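
To make the PUA distinction concrete, here is a minimal sketch (an illustration only, not part of the original message) of how software can tell that a code point is private use and therefore carries no standard meaning. It assumes Python's standard `unicodedata` module; the helper name `is_private_use` is hypothetical.

```python
import unicodedata

def is_private_use(ch: str) -> bool:
    """Return True if the single character ch has General_Category Co (Private Use)."""
    return unicodedata.category(ch) == "Co"

# Private-use code points are interpreted only by private agreement
# between the parties exchanging the data.
print(is_private_use("\uE000"))      # first code point of the BMP Private Use Area
print(is_private_use("\U000F0000"))  # first code point of Supplementary Private Use Area-A
print(is_private_use("A"))           # an ordinary, standardized letter
```

A search engine or database that sees such a code point cannot know which private agreement applies, which is the interoperability gap a public standard encoding is meant to close.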

    > And any future ad-hoc code-point assignment to any
    > possible ad-hoc entities, be them signs, letters, characters, sound
    > files, whatever.

    Ad hoc assignment of numbers to entities by somebody does not
    render such entities, ipso facto, candidates as characters
    in the UCS.

    Even in the cases (such as PUA encodings of ConScripts) where
    there clearly is both an existential and functional case to
    be made for those encodings as *character* encodings, there
    will always be an extended fringe of private use that will
    not rise to the level of appropriateness for encoding in
    Unicode, IMO.

    > I have understood the argument of the UTC why they want these emoji in
    > Unicode ("they are currently handled by operators as characters in a
    > particular encoding scheme"), but I have not heard answers how they
    > wish to proceed in future if someone wants to have arbitrary
    > characters accepted based on the same argument/precedent. ("something
    > handled in some context as characters in a particular encoding
    > scheme")
    >
    > I would be grateful if UTC could sketch an anticipated procedure.

    It is very straightforward, and has been detailed for years at:

    http://www.unicode.org/pending/proposals.html

    Anyone who thinks they have a case for something to be encoded
    as characters in Unicode can write up a proposal, submit it,
    and then be prepared to defend it through the several years it
    takes to reach consensus in both committees (UTC and WG2) and
    to shepherd it through the multiple layers of balloting
    involved.

    People who expect a deductive decision procedure which could
    be applied ahead of time to a candidate "entity" for encoding,
    to determine absolutely whether it should be encoded as
    a character or not, are, again IMO, likely to be sorely
    disappointed. Character encoding is not a rational science --
    it is one part politics, one part technology, and one part
    history, with a few dashes of randomness and whimsy tossed
    in for seasoning.

    And as Asmus just pointed out, that's why we have committees
    debating all this, instead of rule books and registries.

    --Ken


