Re: Emoji: emoticons vs. literacy

From: James Kass (
Date: Sat Jan 03 2009 - 10:53:23 CST

    Asmus Freytag wrote,

    >> The existence of a private agreement is a given, otherwise
    >> neither interpretation nor processing would be desired. In
    >> contexts where the nature of the private agreement cannot
    >> be determined, no interpretation is possible. Processing can
    >> be done on uninterpreted strings. I don't need to be able to
    >> speak Hindi in order to enter, store, search, and collate text
    >> written in Devanagari, and neither does my plain-text editor.
    >But your plain text database on your web server cannot present Hindi
    >words in the order a user of your website in India would expect them,
    >unless the text (that is its character codes) can be interpreted.

    Method # 1 (easy way)

    First download any of several free on-line Hindi dictionaries.
    Extract a word list to a database file. Just the one field:
    VAR1, character field, length to be determined by the longest dictionary word
    USE <mainfilename>
    INDEX ON <fieldname> TO <indexfilename>
    I'd store it as UTF-8 mojibake after first normalizing the
    data in the same manner that will be applied to any incoming
    word lists needing sorting. Then I'd store the incoming
    data in a two-field database:
    VAR1, character field, also UTF-8 mojibake after normalization
    REFNO, numeric field
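    * Work area 1: the dictionary with its index; work area 2: the new list.
    SELE 1
    USE <mainfilename> INDEX <indexfilename>
    SELE 2
    USE <newlistfilename>
    x = SPACE(40)
    y = 0
    DO WHIL .NOT. EOF()
         x = TRIM(VAR1)
         * Look the word up in the dictionary and note its record number.
         SELE 1
         FIND &x
         y = RECNO()
         * Store that record number against the word in the new list.
         SELE 2
         REPL REFNO WITH y
         * Move on to the next word; without SKIP the loop never ends.
         SKIP
    ENDDO
    USE <newlistfilename>
    SORT ON REFNO TO <newsortedlistfilename>
    USE <newsortedlistfilename>
    * and you're ready to use your new sorted list.
    * Caveat, above not tested, it seems straightforward.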

    That's a straight binary comparison, exact. The machine
    doesn't need to interpret text; the machine's programmer
    does that. Someone would probably throw you a word
    which wasn't in your dictionary, though.
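
    For what it's worth, here is a minimal Python sketch of the same
    idea as Method # 1. It is not a port of the code above. Each incoming
    word is ranked by its position in a reference word list that is
    assumed to be in dictionary order already, and the lookup is an
    exact comparison of normalized UTF-8 bytes. The function names, the
    choice of NFC, and the policy of putting unknown words last are all
    assumptions made for illustration.

```python
import unicodedata


def normalize(word):
    """Return the exact-match lookup key: NFC-normalized UTF-8 bytes.

    NFC is an assumption; any normalization works if applied to both lists.
    """
    return unicodedata.normalize("NFC", word.strip()).encode("utf-8")


def build_rank(reference_words):
    """Map each reference word to its 1-based position, like a record number."""
    rank = {}
    for position, word in enumerate(reference_words, start=1):
        # Keep the first position if the reference list repeats a word.
        rank.setdefault(normalize(word), position)
    return rank


def sort_by_reference(words, rank):
    """Sort words by reference position; words not in the list go last."""
    unknown = max(rank.values(), default=0) + 1

    def key(word):
        k = normalize(word)
        # Unknown words all share one rank, so they fall back to byte order.
        return (rank.get(k, unknown), k)

    return sorted(words, key=key)


if __name__ == "__main__":
    # A made-up reference order, deliberately not code point order.
    rank = build_rank(["b", "a", "c"])
    print(sort_by_reference(["c", "zz", "a", "b"], rank))
```

    The machine still only compares bytes. Whatever knowledge of the
    language there is lives in the order of the reference list.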

    Method # 2 (easy way)
    Download an existing set of collation rules for Hindi, study
    them, then write your sorting code accordingly.

    Method # 3 (easy way)
    Hire a database administrator fluent in Hindi, and pay your
    new employee to write the code.

    Method # 4 (easiest way)
    Use somebody else's Hindi subroutine and/or list. (It has been
    said that we all stand on the shoulders of giants.)

    Now, suppose that, instead of Devanagari Hindi, it's Verdurian
    ConScript PUA. It's *your* web server running *your* database.
    So you ought to know all about Verdurian and write your code
    accordingly.

    If your web server running your database is storing Verdurian
    from other web sites, you must first identify it as Verdurian.
    (Font-face mark-up clues looked up in a researched database of
    font names linked to researched information in other databases.
    Frequency counts might be used to determine which fonts would
    get researched and which fonts would be disregarded.)
    (Use the hypothetical tag identifying the PUA scheme, go to
    where the Verdurian PUA scheme information is hosted in
    a hypothetically consistent fashion [from one scheme hosting
    web site to the next] and get that information. Check periodically
    for updates. The hypothetical tag includes information about
    where to go for needed data, of course.)

    If, instead of Verdurian, it's some other unknown PUA script
    *without* any PUA scheme information (unknown font or
    lack of scheme tag or unknown scheme tag), then the page
    isn't *about* you, it's an example of public exchange of user
    defined characters which is a secret. You can still index this
    stuff as binary strings for search/comparison purposes, but
    that's about all you can do.
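
    A rough Python sketch of that last case: text containing private
    use characters is flagged, then indexed and searched as opaque
    UTF-8 byte strings with no attempt at interpretation. The function
    names and the whitespace tokenization are assumptions for
    illustration only.

```python
import unicodedata


def has_private_use(text):
    """True if any character has general category Co (private use)."""
    return any(unicodedata.category(ch) == "Co" for ch in text)


def build_index(pages):
    """Map each whitespace-separated token, as UTF-8 bytes, to its page ids."""
    index = {}
    for page_id, text in pages.items():
        for token in text.split():
            index.setdefault(token.encode("utf-8"), set()).add(page_id)
    return index


def search(index, query):
    """Exact byte-for-byte lookup; the characters are never interpreted."""
    return index.get(query.encode("utf-8"), set())


if __name__ == "__main__":
    pages = {"p1": "\ue000\ue001 hello", "p2": "\ue001\ue000"}
    index = build_index(pages)
    print(has_private_use(pages["p1"]), search(index, "\ue000\ue001"))
```

    Exact matching works without knowing the scheme. Ordering, case
    folding, or word breaking that a reader of the script would expect
    does not.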

    Even if your database is grabbing data straight from a cell
    phone, the cell phone message protocol should be sufficient to
    determine the vendor, and by extension, its scheme.

    >> Success in interpreting the text, then, lies in determining the
    >> nature of the private agreement. This is not a new concept,
    >> it has been discussed here previously, unless I'm mistaken.
    >> Mark-up was one method mentioned, if I recall correctly.
    >> Search engines can interpret mark-up.
    >If that was as easy and straightforward, we wouldn't have a Unicode
    >Standard.

    It's as easy and straightforward to write first-time code to
    interpret Klingon text as it was to write the first-time code
    to interpret Telugu. (Indeed, it's even easier: Klingon is not
    a complex script!)

    We have a Unicode Standard for standardized plain-text
    exchange. Search engines mainly are indexing rich-text
    pages. Cell phone vendors in Japan are exchanging icons
    using PUA plain-text as mark-up. The rich-text exchanged
    by Japanese phone vendors sometimes ends up on rich-
    text web pages where it might be grabbed by a rich-text
    search engine.

    Any perceived rich-text problem here requires a
    rich-text solution.

    The Japanese phone vendors may well continue to use their
    PUA characters for icon exchange. Suppose they want to
    enter a non-Japanese market, say, Latin America. Wouldn't
    the new subscribers to their service want their *own* icons,
    reflecting their own cultures? And wouldn't those vendors
    whip some up and extend their proprietary user-defined
    icon sets? How about expanding sales and service in
    southeast Asia?

    If UTC has a working relationship with these icon-making
    vendors, wouldn't it be better to work with them to switch
    from Shift-JIS to Unicode? To help them understand and
    implement complex script shaping rules on their sets?
    Imagine, if they switched to Unicode and got the complex
    shaping worked out, they might have a good chance in
    southeast Asian markets. Isn't *that* the proper role of
    Unicode -- education about and promotion of the computer
    plain-text encoding standard?

    Or would it be better, as those vendors increase their icon
    sets when new markets are added and existing fads change,
    to eagerly await each icon addition so that it can be
    promoted into Unicode?

    >If I remember correctly, before Unicode, everybody had their own
    >character sets, and in Japan, every vendor had their own. In order to
    >communicate you had to know what character set the other party was
    >using. ISO 2022 even had internal markup (control sequences) to allow
    >switching of character sets on the fly.
    >Interestingly enough, vendors, users and implementers voted with their
    >feet to abandon such systems and go to a unified encoding where the
    >semantics of each code are unambiguous on the character level, where
    >there's no need to switch on the fly, and where the processes can be
    >written without undue complication.

    This is revisionist, in my opinion. It wasn't that easy;
    it was an uphill climb. There are even still some people
    out there stuck with Unix systems locked into ISO 8859-1.
    Pasting a Malayalam Unicode text word into the search
    box on Unicode's own mail list archive page results in
    some kind of conversion of the material into ASCII
    mojibake, for which the expected match isn't found.

    >Suitability requirements are different between ordinary and
    >compatibility characters - that's a long held design principle for the
    >Unicode Standard.

    These aren't compatibility characters unless the book definition
    in 5.0 has been trashed/revised.

    >> We shouldn't exclude text-like characters from being included
    >> in a plain-text encoding standard as long as all the criteria are
    >> met.
    >Criteria for encoding are different between ordinary and compatibility
    >characters. Requiring that the criteria for ordinary character are to be
    >met, is tantamount to freezing all encoding of compatibility characters.
    >That's not a useful starting point.

    Compatibility characters are variants of characters which
    already exist in Unicode or they have a compatibility decomposition.
    This has shifted around some over the years, but that's it, isn't it?
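
    The second half of that description can at least be checked
    mechanically. A small Python sketch, using the character database
    that ships with Python (so the answer depends on that Unicode
    version); the function name is made up for illustration, and the
    check says nothing about the "variants" half of the description:

```python
import unicodedata


def has_compatibility_decomposition(ch):
    """True if the character's decomposition mapping carries a <tag>.

    Canonical decompositions have no tag; characters with no
    decomposition at all give an empty string.
    """
    return unicodedata.decomposition(ch).startswith("<")


if __name__ == "__main__":
    # U+FB01 LATIN SMALL LIGATURE FI, U+00E9 e with acute, U+E000 private use
    for ch in ("\ufb01", "\u00e9", "\ue000"):
        print(f"U+{ord(ch):04X}", has_compatibility_decomposition(ch))
```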

    >But the ones that are not ordinary characters are not immediately out of
    >consideration. You need to triage these further and make a careful
    >deliberation whether they qualify (or not) as compatibility characters.

    Can you please point me to the new definition of compatibility
    in this regard?

    >> The vendors who invented this icon set should continue to use
    >> the PUA to exchange them. They are icons/signage and are
    >> being exchanged and interpreted by humans as icons/signage.
    >> Any machine interpretation of them should emulate what
    >> people are doing. It's OK for there to be some overlap between
    >> icons/signage and plain-text characters, after all, many of
    >> those icons are pictures of those characters.
    >This sounds like you are confusing the emoticon and the emoji discussion.

    An alternative would be that the emoticon and emoji discussion
    may be confusing.

    An emoticon is plain-text when it is a plain-text string, usually ASCII,
    interpreted by the reader. Some emoticon plain-text characters exist.
    An emoticon is rich-text (*.ICO, *.GIF) when an application replaces
    a text string with an icon or bypasses text strings altogether.
    Emoji are rich-text only; I think the Japanese use a different word
    for ASCII/text strings used as emoticons in the plain-text sense.

    Emoji and emoticon-as-rich-text are identical concepts. Do a web
    search on keywords emoji emoticon and see how many pages equate
    the two as opposed to how many distinguish them.

    >The fact that the request to provide a solution using non-PUA character
    >codes is so strongly supported by leading search engine manufacturer(s)
    >should give you pause here.

    Of course it does. Search engines deal with rich-text.

    (Both leading search engine manufacturers are strongly
    supporting this? Or just the one? How strongly?)

    How much support do the phone vendors have for this?
    Wouldn't they be chatting it up on their web sites?
    How about the user community, anyone ever take a poll
    to see if they think they're picking icons from an
    icon-picker and inserting them into their rich-text,
    or if they think they're exchanging plain-text?
    Do they understand or care about the difference?
    How about the designers of these icons and the programmers
    who chose, sensibly, to exchange these icons referenced as
    single user defined characters -- what is their conception?
    How does the government of Japan regard these items?
    Does the government of Japan push for plain-text encoding?
    Did the government of Japan standardize these in JIS?

    If the phone vendors want to work this out, they'll do
    so. Apparently they already have. If the search engines
    want to process/interpret emoji, they'll have to work it
    out. Plenty of alternatives.

    (To anyone who might have made it through, apologies
    for length.)

    Best regards,

    James Kass

    This archive was generated by hypermail 2.1.5 : Sat Jan 03 2009 - 10:55:32 CST