Re: Emoji: emoticons vs. literacy

From: Michael D'Errico (mike-list@pobox.com)
Date: Mon Jan 05 2009 - 16:03:18 CST

    > one possible scenario is that a vendor migrates all of their text
    > messaging to being based on HTML rather than plain text, and the
    > emoji get represented in messages as proprietary URL references.

    Using URLs would suggest that the emoji are not plain text. Several
    key Unicode Consortium members are vehemently arguing that they are,
    in fact, plain text.

    Although I think they are not plain text, I do believe that they can
    be encoded in Unicode such that markup is not required. What I
    disagree with is the idea that each emoji needs its own code point.
    Think of the emoji as words and phrases; what is
    needed is an alphabet to "write" them. An alphabet is a) small,
    b) closed, and c) able to represent an unlimited number of "words".
    Think of how much mileage we get out of 26 Latin letters.

    The suggestion I made on the UnicoRe list was to provide the mobile
    phone companies with 11 code points: an emoji_base character to
    represent the start of an emoji, plus 10 special emoji digits to be
    used to indicate an index into a list from the Emoji Specification
    (to be published by the phone companies). You could eliminate the
    base character if distinct emoji are separated by spaces.
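
    A minimal sketch of the 11-code-point scheme, for illustration only.
    The message does not say where the code points would live, so
    EMOJI_BASE and EMOJI_DIGIT_0 below are hypothetical placements in an
    existing private use area. Only the variant with an explicit base
    character is shown, not the space-separated one.

```python
# Hypothetical placements (Supplementary Private Use Area-A); the original
# message does not assign actual code points.
EMOJI_BASE = 0xF0000      # marks the start of an emoji
EMOJI_DIGIT_0 = 0xF0030   # emoji digits 0-9 occupy 0xF0030..0xF0039


def encode_emoji(index):
    """Spell an emoji as emoji_base followed by the decimal digits of its index."""
    if index < 0:
        raise ValueError("index must be non-negative")
    digits = "".join(chr(EMOJI_DIGIT_0 + int(d)) for d in str(index))
    return chr(EMOJI_BASE) + digits


def decode_emoji(text):
    """Return the list indices of every emoji found in text, in order."""
    indices = []
    i = 0
    while i < len(text):
        if ord(text[i]) != EMOJI_BASE:
            i += 1
            continue
        # Collect the run of emoji digits that follows the base character.
        j = i + 1
        value = ""
        while j < len(text) and 0 <= ord(text[j]) - EMOJI_DIGIT_0 <= 9:
            value += str(ord(text[j]) - EMOJI_DIGIT_0)
            j += 1
        if value:
            indices.append(int(value))
        i = j
    return indices
```

    A receiver would look each decoded index up in the published Emoji
    Specification list; that lookup is outside this sketch.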

    After much thought, I think that this alphabet should be greatly
    expanded to include not just digits, but letters and punctuation.
    This would allow an emoji of a cow to be spelled emoji_C emoji_O
    emoji_W. If the UTC developed, say, a 256-element Unicode subset
    to be used for this Direct Unicode Markup "script" (Mike's DUM
    Proposal), then you could replicate it 255 times in Plane D for
    private use, and 255 times in Plane C for things such as emoji.
    One side effect of creating a common alphabet for this is that an
    application wouldn't need to know what any of them meant, yet the
    example emoji above could be displayed as "[cow]" (much better
    than "e-2FC"). This would be true even for the private use area.
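
    The "[cow]" fallback can be sketched as follows. This assumes the
    layout this message describes next (the letter "A" at U+C0041,
    U+C0141, and so on), where the low byte of each code point is the
    ASCII value and the next byte is the prefix. The function name and
    the rule of bracketing each run are illustrative assumptions.

```python
DUM_PLANES = (0xC, 0xD)  # plane C for assigned uses, plane D for private use


def fallback_display(text):
    """Render runs of DUM characters as "[letters]" without knowing their meaning."""
    out = []
    run = []           # letters of the identifier being collected
    run_prefix = None  # plane and prefix bits shared by the current run

    def flush():
        nonlocal run, run_prefix
        if run:
            out.append("[" + "".join(run) + "]")
        run, run_prefix = [], None

    for ch in text:
        cp = ord(ch)
        if (cp >> 16) in DUM_PLANES:
            prefix = cp >> 8
            if prefix != run_prefix:
                flush()  # a change of prefix starts a new identifier
                run_prefix = prefix
            run.append(chr(cp & 0xFF))  # low byte is the visible letter
        else:
            flush()
            out.append(ch)
    flush()
    return "".join(out)
```

    For example, the three characters U+C0263 U+C026F U+C0277 would be
    shown as "[cow]" by an application that has never heard of emoji.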

    If I were to design the script, I'd start with ASCII minus controls
    except TAB, CR, and LF, and then add in the most common diacritics
    (no sense wasting space with composite characters), and defer to the
    experts on what other common characters should be included. Thus
    you would have the letter "A" at U+D0041, U+D0141, U+D0241, etc. for
    private use, and also at U+C0041, U+C0141, etc. to be assigned to
    things like the emoji.
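
    The code point arithmetic implied by these examples can be sketched
    as below. Only the ASCII part of the alphabet is covered, since the
    positions of the diacritics and other characters are left open in
    the message; the function and constant names are made up for
    illustration.

```python
PRIVATE_PLANE = 0xD    # structured private use, per the message
ASSIGNED_PLANE = 0xC   # officially assigned uses such as emoji

ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # TAB, LF, CR


def dum_code_point(plane, prefix, char):
    """Return the code point for `char` in the copy of the alphabet selected by `prefix`."""
    low = ord(char)
    # ASCII minus controls, except TAB, CR, and LF.
    if not (0x20 <= low <= 0x7E or low in ALLOWED_CONTROLS):
        raise ValueError("only the ASCII part of the alphabet is sketched here")
    if not 0 <= prefix <= 0xFF:
        raise ValueError("prefix must fit in one byte")
    return (plane << 16) | (prefix << 8) | low
```

    With this layout, dum_code_point(PRIVATE_PLANE, 1, "A") is 0xD0141
    and dum_code_point(ASSIGNED_PLANE, 0, "A") is 0xC0041, matching the
    examples above.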

    Companies could design their products using the Structured Private
    Use Plane (plane D), with up to 255 custom object types with
    identifiers all spelled using the DUM alphabet (differing only in the
    prefix which identifies the object type). Then if the application
    enjoys enough success to get an official Unicode rubber stamp, the
    only thing that needs to be done is to change the prefix from one in
    plane D to one in Unicode proper. The requirements for encoding
    would be proper use of the DUM alphabet, evidence of enough use, and
    an organization assigned to handle publication of the identifiers.
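
    As a sketch of that migration step: moving from a private prefix in
    plane D to an assigned prefix in plane C only rewrites the upper
    bits of each code point and leaves the spelled identifier alone.
    The function name and the prefix numbers in the usage note are
    hypothetical.

```python
def promote(text, private_prefix, assigned_prefix):
    """Move identifiers from a plane D prefix to a plane C prefix, keeping the spelling."""
    old_base = (0xD << 16) | (private_prefix << 8)
    new_base = (0xC << 16) | (assigned_prefix << 8)
    out = []
    for ch in text:
        cp = ord(ch)
        if cp >> 8 == old_base >> 8:
            # Same low byte (the letter), new plane and prefix.
            out.append(chr(new_base | (cp & 0xFF)))
        else:
            out.append(ch)
    return "".join(out)
```

    For instance, promote(text, 0x07, 0x02) would turn every character
    in U+D0700..U+D07FF into the matching one in U+C0200..U+C02FF and
    leave all other text untouched.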

    A secondary use of the DUM alphabet would be to simply write text
    using it (using a particular prefix). A map application might use
    different alphabets to write the parts of an address: number, street,
    city, state/province, country. ISO would love this, since it would
    be similar to their X.520 naming system. The best thing about it in
    my view is that it allows you to do things that currently you can
    only do with XML. This scheme would bring plain text closer to the
    power of XML, yet with much less complexity.
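
    The address example might look like the sketch below, where each
    field is written in its own copy of the alphabet and recovered by
    looking only at the prefix. The prefix numbers in FIELDS are
    invented for illustration, and the sketch is limited to ASCII
    values.

```python
# Hypothetical prefix assignments in plane D, one per address field.
FIELDS = {"number": 0x10, "street": 0x11, "city": 0x12,
          "state": 0x13, "country": 0x14}
PREFIX_TO_FIELD = {prefix: name for name, prefix in FIELDS.items()}


def write_field(field, value):
    """Write `value` using the copy of the alphabet reserved for `field`."""
    base = (0xD << 16) | (FIELDS[field] << 8)
    if any(ord(ch) > 0x7F for ch in value):
        raise ValueError("this sketch handles ASCII values only")
    return "".join(chr(base | ord(ch)) for ch in value)


def read_fields(text):
    """Recover a field-name to value mapping from tagged plain text."""
    result = {}
    for ch in text:
        cp = ord(ch)
        if cp >> 16 != 0xD:
            continue  # not in the structured private use plane
        field = PREFIX_TO_FIELD.get((cp >> 8) & 0xFF)
        if field is not None:
            result[field] = result.get(field, "") + chr(cp & 0xFF)
    return result
```

    The tagged text stays a flat character string with no element
    nesting, which is where it differs from the XML equivalent.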

    Getting back to the emoji -- with the DUM alphabet and a specification
    from the phone companies, it is trivial to allow users to exchange
    their own custom emoji. As an example, they might decide to use
    base-64 encoding surrounded by curly braces. The base-64 data would
    somehow encode the custom graphic image, yet would be transmitted as
    plain text. Of course all characters would come from the emoji alphabet.
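
    One way this could work is sketched below. The block at U+C0000 is
    a hypothetical home for the emoji alphabet, and treating the image
    as an opaque byte string is an assumption; the message leaves the
    image encoding to the phone companies' specification.

```python
import base64

EMOJI_ALPHABET_BASE = 0xC0000  # hypothetical block assigned to emoji


def encode_custom_emoji(image_bytes):
    """Wrap base-64 image data in curly braces, spelled in the emoji alphabet."""
    payload = "{" + base64.b64encode(image_bytes).decode("ascii") + "}"
    return "".join(chr(EMOJI_ALPHABET_BASE | ord(ch)) for ch in payload)


def decode_custom_emoji(text):
    """Recover the image bytes from a custom emoji produced by encode_custom_emoji."""
    if any(ord(ch) >> 8 != EMOJI_ALPHABET_BASE >> 8 for ch in text):
        raise ValueError("all characters must come from the emoji alphabet")
    payload = "".join(chr(ord(ch) & 0xFF) for ch in text)
    if not (payload.startswith("{") and payload.endswith("}")):
        raise ValueError("custom emoji must be wrapped in curly braces")
    return base64.b64decode(payload[1:-1])
```

    Every character of the result lies in the emoji block, so software
    that does not understand custom emoji can still skip or filter it
    as a unit.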

    And for people that absolutely *HATE* the emoji, or any of the other
    yet-to-be-invented uses of the DUM alphabet, it would be a trivial
    task to filter out those code points based on their prefix.
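
    Such a filter might be as small as the sketch below, assuming the
    plane and prefix layout used in this message. The function name and
    the example prefix are hypothetical.

```python
def strip_prefixes(text, blocked):
    """Drop every character whose (plane, prefix) pair is in `blocked`."""
    kept = []
    for ch in text:
        cp = ord(ch)
        plane, prefix = cp >> 16, (cp >> 8) & 0xFF
        if (plane, prefix) not in blocked:
            kept.append(ch)
    return "".join(kept)
```

    A reader who dislikes emoji would call strip_prefixes(message,
    {(0xC, 0x02)}) if, hypothetically, emoji were assigned prefix 0x02
    in plane C.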

    That's all I have time to write for now. I can clarify anything that
    wasn't clear enough.

    Note: If you think this is the dumbest thing you've ever heard and
    want to comment on it, please provide a reason why you think it's
    dumb instead of just stating so.

    Mike


