Re: Emoji: Public Review December 2008

From: Doug Ewell (doug@ewellic.org)
Date: Sun Dec 21 2008 - 11:59:01 CST


Asmus Freytag <asmusf at ix dot netcom dot com> wrote:

>> My input has been based on:
>>
>> (a) where the WG2 "Principles and Procedures" document (N3452) says
>> the line should be drawn,
>>
>> (b) where the Unicode Consortium and WG2 have drawn the line for the
>> past 15 years, and
>>
>> (c) what the most respected authorities within UTC, including Asmus,
>> have said for the past 10-plus years about where the line should be
>> drawn.
>
> The curious thing is that a, b and c are all evolving along with the
> standard. It's a myth to think that there was ever a bright line that
> was so clearly cut that you could evaluate a character's encodability
> as if by plugging in a few attributes into a formula and getting a
> YES/NO answer.

N3452 goes justifiably far out of its way to point out that there's no
bright line, and it's obvious from years and years of debates that
controversy can and does exist as to what characters should be encoded.
Anyone remember the wars over Phoenician?

But when N3452 specifically says, "the line is fuzzy, but here are some
examples of symbols that fall way over on the 'do not encode' side,"
it's fairly surprising when a set containing those examples gathers
such wide support among Unicode experts.

>> I'm curious to see where HAPPY FACE WITH OPEN MOUTH AND CLOSED EYES
>> and BLOOD TYPE A and ROASTED SWEET POTATO and POOP fit into the
>> modern shared understanding of what is plain text.
>
> Value judgment?

The concept of "modern shared understanding" is indeed a value judgment.
Reasonable people continue to differ on what characters are "essential"
for plain text. Microsoft Word automatically converts "straight quotes"
to "smart quotes." The directional "smart" quotes have always been part
of Unicode. Word also automatically takes an ordinal number like "4th"
and superscripts the "th". There is no mechanism in Unicode for this.

Several of the emoji images are more like directional quotes, where most
people's value judgments would say they are plain text. Others are like
superscript "th", where some would say yes and some would say no. Then
there is ROASTED SWEET POTATO, waaay over on the other side -- again, as
most people's value judgments would have it.

Perhaps I should rephrase: "I'm curious to see how this massive
existing user base uses HAPPY FACE WITH OPEN MOUTH AND CLOSED EYES [and
BLOOD TYPE A] and ROASTED SWEET POTATO and POOP as plain text, not just
as a picture of the thing." That is one non-ridiculous exemplification
of plain text. (On reflection, I could see how people would use BLOOD
TYPE A and friends in text databases and such, though there are many
other examples of letter-based symbols, like MPAA movie ratings, that do
not get equal treatment here.)

>> As to whether a symbol like ROASTED SWEET POTATO carries any
>> communicative value, beyond being a picture of a roasted sweet
>> potato, there can be plenty of disagreement.
>
> Irrelevant. perhaps. We never question the mathematicians (or the
> linguists, or the philologists) about what they are putting in
> their notational systems or whether the ancients should or should not
> have used certain characters. I think that there can be no
> disagreement about the fact that these have communicative value to
> their users.

I thought it was explicitly stated, at the time the mathematical
alphabets and musical symbols were added, that they were not to be taken
as setting a precedent for other notational systems. Mathematics and
musical symbols were special, we were told. It seems that much of the
pro-emoji argumentation now cites these as typical examples of what
Unicode needs to support.

>> N3452 specifically mentions "pictures of cows" and "stop sign" as
>> examples of symbols that should not be encoded. Naturally it is a
>> bit of a surprise to see so much official and expert support behind
>> the encoding of COW and TRAFFIC LIGHT.
>
> Right. And as I wrote before, subject to change. Therefore, a future
> revision of this document is likely to use different examples.

That's encouraging -- the examples of what not to encode (presumably not
chosen as edge cases) are changed to reflect ephemeral fashions. Will
the new examples be obsolete in 5 years when another powerful industry
group encodes them and establishes its own user base?

> The Unicode Standard has contained language trying to define the
> scope. This language has had to be changed over time, because the
> understanding of what is and isn't plain text has evolved. It's still
> the case that one doesn't need the catalog of street signs as Unicode,
> because nobody is using this full set to communicate in text. The STOP
> sign is a different matter - it's becoming something that I can
> definitely imagine being used in interchange without literally being
> an encoding of a traffic sign.

So the text in N3452 about traffic engineers not using a stop-sign image
to talk about stop signs is now irrelevant?

> The decisive difference in the case of the emoji is that some other
> entities have created means for the interchange of such symbols as
> part of (functional) plain text.

Translation: These images are being encoded in Unicode because a large
and powerful industry consortium put them in their SJIS extensions and
shipped jillions of cell phones that support them, and they have now
reached critical mass such that they can force Unicode's hand.

There was once a principle, which I assume may now be removed from
future Unicode references, to the effect that Unicode was not obligated
to encode things appearing in national standards published *after* the
first version of Unicode (I think the magic year was 1993). One would
think that industry collections of pictures would be at least as
restricted, and of course Unicode always retained the *option* to encode
such things anyway; but the gist of the present "some other entities"
argument is that Unicode is obligated to encode the emoji because of the
Japanese industry definitions.

I'm reminded of DPRK 9566-97, the North Korean standard DBCS, which was
presented to (I believe) WG2 with the intent of adding all the thingies
that were not already in 10646. Most of the missing characters were
added, but there were some exceptions: the special extra-bold Hangul
characters for KIM IL SUNG and KIM JONG IL (yes, two separate KIMs and
ILs) and the "military victory" markers that essentially meant PLACE ON
MAP WHERE THE PROUD AND NOBLE NORTH ARMY CRUSHED THE SOUTHERN INFIDELS.
These were not encoded in Unicode, creating a potential round-tripping
gap that was determined to be acceptable.

But *all* of the symbols in this Japanese industry collection, including
those not even supported by one or two members of the consortium, must
apparently be encoded.

> You see, there's a difference between somebody saying, in effect, it
> would be cute to have a picture of a cow, so we could write about cows
> using the picture, and a usergroup who already is off creating texts
> with cows standing in for whatever they stand in for in their texts.
> In one case, Unicode would standardize ahead of actual usage - and
> that's a very dicey game, best avoided - and in the other case, it's
> trailing actual usage, and dragging its feet - and that's not good.

The argument here is that all of these symbols have achieved widespread
use because of their prevalence in Japanese cell phone character sets.
It might be interesting (though perhaps impossible) to find out how much
usage some of these images have actually achieved in the real world.

As before, I'd personally be interested to see how people are using
ROASTED SWEET POTATO and what it is standing in for in text messages:

"We're having [RSP] [RSP] for dinner 2nite! Yum!! [emoji of cat
licking lips]"

> The concern about the emoji characters is not driven by concerns for
> the language of N3452, but based on value judgments about the
> entities. Otherwise, people would take a more dispassionate attitude
> and reflect that characters widely used as plain text in existing,
> commercially viable implementations, are clearly subject to encoding
> by Unicode and that they fit under the overall universalist mandate.
> Therefore, along with encoding the characters, the guidelines have to
> be updated (again) to match.

If "universality" here means what I think it is being used to mean, then
the most appropriate action would be to remove the relevant
"appropriateness" section of the guidelines and state that all sets of
characters deemed to have attained sufficient "wide use" shall be
encoded.

> A default mapping of hatching to colors would be a great thing to
> propose - it could come in handy in other situations.

I hope the decision is made to use (and extend in a consistent way) the
existing scheme used in heraldry, instead of making one up.

This does not change my opinion that WHITE HEART and BLACK HEART and RED
HEART and YELLOW HEART and GREEN HEART (renamed though they may have
been) do not belong in Unicode.

> I think "hatching chick" has a definite semantics that does not
> *require* animation. I would be generally in favor of encoding as many
> of these in a way that does not burden Unicode with encoding the
> animation aspect as such.

That's a scary sentence, actually. It appears to leave the door open
for some future emoji-like, animation-rich proposal to actually push the
animation aspect onto Unicode. A less scary version might have been: "I
would be in favor of encoding as many of these as possible, so long as
none of them require Unicode to encode the animation aspect as such."

--
Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

