Re: Emoji mappings in Shift JIS / CP932/943

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Sat, 3 Dec 2016 15:21:25 -0800

On Sat, Dec 3, 2016 at 2:37 PM, Christoph Päper
<christoph.paeper_at_crissov.de> wrote:

> If an existing character encoding forms the (sole) base of an addition to
> Unicode, shouldn’t it be part of the UTC’s job to document these sources?
> This was obviously done in the case of Japanese emoji, hence the existence
> of EmojiSources.txt, but for some reason that’s been kept separate from
> related mapping data files.
>

For the Japanese carriers, we had information about their Shift-JIS VDC
assignments (Vendor-Defined Characters) but not about their non-VDC
Shift-JIS usage. We only documented the VDC assignments, in a form that
documented our decisions on unifying symbols across the three main carriers
(which was in turn based on their 2006 cross-mapping agreement) and
encoding of Unicode code points. But I think you are right, there probably
was not really a good reason to put EmojiSources.txt into the UCD rather
than into MAPPINGS.
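
For readers who want to work with those VDC assignments, here is a minimal Python sketch of reading them. It assumes the EmojiSources.txt layout of four semicolon-separated fields (Unicode code point or sequence, then the DoCoMo, KDDI and SoftBank Shift-JIS codes, any of which may be empty) with "#" comments. The function name and the sample lines are illustrative only; the last sample line is synthetic, and the others should be checked against the actual file.

```python
def parse_emoji_sources(lines):
    """Yield (unicode_string, {carrier: sjis_code or None}) per data line.

    Assumed layout: code point(s);DoCoMo;KDDI;SoftBank, hex values,
    '#' starts a comment, empty field means "no code for this carrier".
    """
    carriers = ("docomo", "kddi", "softbank")
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 4:
            raise ValueError(f"unexpected field count: {raw!r}")
        # Field 0 may hold a sequence of code points separated by spaces.
        text = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        codes = {
            name: int(code, 16) if code else None
            for name, code in zip(carriers, fields[1:])
        }
        yield text, codes


# Illustrative input, not an authoritative copy of the data file.
SAMPLE = [
    "# code point(s);DoCoMo;KDDI;SoftBank",
    "2600;F89F;F660;F98B",
    "0023 20E3;F985;F489;F7B0",
    "F0000;;F340;  # synthetic line showing empty carrier fields",
]

for text, codes in parse_emoji_sources(SAMPLE):
    print([f"U+{ord(ch):04X}" for ch in text], codes)
```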

You could submit a proposal to move EmojiSources.txt to MAPPINGS.

> I’m not sure the documentation is equally well available for emojis (also)
> taken from ARIB, W*dings etc. (cf.
> https://twitter.com/FakeUnicode/status/801740535073361920)

The W*dings are not charsets but symbol fonts which were used with a
generic Unicode PUA range. (After standardization, they may have gained
mappings for the new assignments.)

I think the ARIB symbols were lists of characters that wanted to be encoded
in Unicode so that PUA and VDCs could be avoided, so there probably was no
charset to map to either.

In any case, there might well be examples of characters from other charsets
whose mappings are documented in the proposal docs rather than in MAPPINGS.
If you find examples of such, you could collect the data and propose their
additions to MAPPINGS.

Remember that the Unicode Consortium is run by volunteers. Yes, many of us
work for member companies, but we often do Unicode work in addition to our
"main jobs". (Some continue to contribute even after retirement!)

> and I have never seen an authoritative mapping from ASCII emoticons and
> line-art or from kaomojis to Unicode emojis. (There are plenty
> implementations of conversion routines, some open-source or well
> documented, others not.)
>

I would say that, as far as Unicode is concerned, the canonical "mappings"
for those are the Unicode *sequences* that are used to represent them.

> > At this point, the Emoji vendor mappings are not very relevant any more
> > because Unicode has added many Emoji symbols that are not in the old vendor
> > charsets.
>
> Sure, but hardly anybody will ever want to convert Unicode emojis to Shift
> JIS, just (still rarely) the other way around.
>

Good point. I assume most do something like what we (Google) do: Take a
base Shift-JIS mapping (we use windows-932 I think), remove the VDC-range
mappings that conflict with a vendor's emoji range, and add the vendor's
emoji mappings. You can see examples for this in Android's ICU source tree.
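
The three steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the actual Android/ICU build process: the function name, the toy tables, and the DoCoMo range 0xF89F..0xF9FC are all assumptions, and a real table would be generated from the full windows-932 and vendor data.

```python
def build_vendor_table(base, vendor_emoji, vendor_range):
    """Return a new Shift-JIS -> Unicode table for one vendor.

    base:         dict of Shift-JIS code -> Unicode string (e.g. windows-932)
    vendor_emoji: dict of Shift-JIS code -> Unicode string for the vendor
    vendor_range: (first, last) inclusive Shift-JIS range the vendor uses
    """
    first, last = vendor_range
    # Steps 1 and 2: start from the base table, dropping every mapping
    # that falls inside the vendor's emoji range.
    table = {
        code: text for code, text in base.items()
        if not first <= code <= last
    }
    # Step 3: add the vendor's own emoji mappings.
    table.update(vendor_emoji)
    return table


# Toy data for illustration; a real base table has thousands of entries.
BASE = {
    0x8140: "\u3000",  # ideographic space
    0x82A0: "\u3042",  # hiragana A
    0xF040: "\ue000",  # VDC cell mapped to the PUA, outside the emoji range
    0xF89F: "\ue63e",  # VDC cell mapped to the PUA, inside the emoji range
    0xF8A0: "\ue63f",  # VDC cell mapped to the PUA, inside the emoji range
}
DOCOMO_EMOJI = {0xF89F: "\u2600"}  # sun emoji (assumed sample entry)
DOCOMO_RANGE = (0xF89F, 0xF9FC)    # assumed range, check against vendor data

table = build_vendor_table(BASE, DOCOMO_EMOJI, DOCOMO_RANGE)
for code in sorted(table):
    print(f"{code:04X} -> U+{ord(table[code]):04X}")
```

Note that a base cell inside the vendor range with no vendor mapping (0xF8A0 here) simply becomes unmapped, while VDC cells outside the range keep their PUA mappings.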

> For __ML at least, there seem to be more up-to-date mappings available at
> <https://www.w3.org/2003/entities/2007/htmlmathml.ent> or
> <https://html.spec.whatwg.org/multipage/entities.json>, but not in a CSV
> format as preferred at Unicode.
>
> I haven’t gone through all of them, but I think most entries claiming a
> missing equivalent character in Unicode are outdated.

Maybe the user community is better served via w3.org and/or whatwg.org; if
so, we could add a link from the MAPPINGS files to there.

markus
Received on Sat Dec 03 2016 - 17:22:27 CST
