L2/01-179

Title: Disposition of East Asian mapping tables

Action: For discussion and decision by the UTC

Author: Ken Whistler, Asmus Freytag

Date: May 2, 2001

We keep getting reports about inconsistencies in the East Asian mapping
tables; for an example, see document L2/01-192.

Mark Davis has suggested just eliminating the mapping tables under
MAPPINGS/EASTASIA completely. He apparently thought there was already 
some UTC consensus to do so.

It is time for a clear, once-and-for-all UTC decision.  These tables have
drifted for seven years, since 1994, and are the source of all kinds of
problems.

Some have asserted that "we need to fix them". But who would that be?  No
one has stepped up to own that problem since Glenn Adams and John Jenkins
initially submitted most of these tables in late 1993/early 1994.

Maybe it should belong to the "mapping folks", but to date the main
author of the mapping TR has indicated the contrary preference: to just
nuke the files.

However, retaining the mapping files would allow the UTC to provide some
guidance as to which mappings fit with the assignment of other properties,
most importantly the EastAsianWidth. See documents L2/01-189 and L2/01-192.

Finally, even if we remove the tables themselves, we may want to document
some of the existing inconsistencies among other sources of mapping tables,
perhaps in the manner of a simple table, as done in document L2/01-192.

This needs to be decided by the UTC with a clear up-or-down decision,
combined with a commitment and a name signed up for the action item. Either
we:

1. Delete the MAPPINGS/EASTASIA files as unmaintainable.

2. Archive the MAPPINGS/EASTASIA files in the OBSOLETE directory
    and put up bigger "as is" "buyer beware" signs and explicitly
    disown them, but leave them available for what they are worth.

3. The UTC claims ownership of the mapping from the Unicode Standard
    to various East Asian standards and then actively updates and
    maintains those files.

The problem with number 3 is that it is a political hot potato. Each of
the relevant NBs may itself claim ownership of the mapping problem, and if
there are any points of disagreement, then we have a contentious issue to
deal with.

It has been asserted that the UTC should maintain "canonical" versions of
mapping tables to Big-5, and so on, and then let the vendors hash out any
differences in their own mapping tables. Perhaps this is what the UTC
should do, but so far the UTC has not shown the stomach to formally claim
that kind of role in the mappings.

UTC should make a determination one way or the other.



Below is some analysis by Asmus Freytag of specific problems raised by T.
Kubota in this document:
	http://www.debian.or.jp/~kubota/unicode-symbols.html

The analysis as it pertains to the EastAsianWidth file is primarily
contained in document L2/01-189. What follows are those characters where
the mapping is in question, even though T. Kubota reported them as EAW
problems.

In addition to mapping issues, we need to decide the status of JIS X0212 and
JIS X0213 as input legacy encodings to the EAW assignments.

The following characters are available as Full Width characters in the FFxx
range. Therefore, the mappings of these characters are incorrect. This
appears to be a *mapping file issue* as far as these characters are
concerned.

FILE JIS0208.TXT------
0x2140  U+005C  Na  # REVERSE SOLIDUS
0x215D  U+2212  N  # MINUS SIGN
0x2171  U+00A2  Na  # CENT SIGN
0x2172  U+00A3  Na  # POUND SIGN
0x224C  U+00AC  Na  # NOT SIGN


FILE CHINSIMP.TXT------
0x815F  U+005C  Na  # REVERSE SOLIDUS
0x817C  U+2212  N  # MINUS SIGN
0x8191  U+00A2  Na  # CENT SIGN
0x8192  U+00A3  Na  # POUND SIGN
0x81CA  U+00AC  Na  # NOT SIGN


FILE JIS0212.TXT------
0x2243  U+00A6  Na  # BROKEN BAR
0x2234  U+00AF  Na  # MACRON
0x2237  U+007E  Na  # TILDE
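
The "available as Full Width in the FFxx range" check above can be
approximated mechanically. What follows is only a sketch, assuming the
annotated line format used in these listings (legacy code, U+ value, EAW
class, name) and assuming that a fullwidth counterpart is identified by a
"<wide>" compatibility decomposition in the Unicode Character Database. The
helper names (parse_entries, wide_counterparts, flag_fullwidth_candidates)
are hypothetical. Note that this test does not flag U+2212 MINUS SIGN,
because U+FF0D decomposes to U+002D HYPHEN-MINUS; that pairing would have
to be established by other means.

```python
import unicodedata

# Annotated entries copied from the JIS0208.TXT listing above.
LISTING = """\
0x2140  U+005C  Na  # REVERSE SOLIDUS
0x215D  U+2212  N  # MINUS SIGN
0x2171  U+00A2  Na  # CENT SIGN
0x2172  U+00A3  Na  # POUND SIGN
0x224C  U+00AC  Na  # NOT SIGN
"""


def parse_entries(text):
    """Yield (legacy_code, code_point, eaw, name) for each annotated line."""
    for line in text.splitlines():
        data, _, name = line.partition("#")
        fields = data.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        legacy, ucs, eaw = fields
        yield int(legacy, 16), int(ucs[2:], 16), eaw, name.strip()


def wide_counterparts():
    """Map each code point to the FFxx character that decomposes to it
    with a <wide> compatibility tag (an assumption about what counts as
    "the" fullwidth form)."""
    table = {}
    for cp in range(0xFF00, 0xFFF0):
        decomp = unicodedata.decomposition(chr(cp)).split()
        if len(decomp) == 2 and decomp[0] == "<wide>":
            table[int(decomp[1], 16)] = cp
    return table


def flag_fullwidth_candidates(text):
    """Return (legacy, mapped, fullwidth) for entries whose mapped
    character has a fullwidth form in the FFxx range."""
    wide = wide_counterparts()
    return [
        (legacy, cp, wide[cp])
        for legacy, cp, _eaw, _name in parse_entries(text)
        if cp in wide
    ]


for legacy, cp, fw in flag_fullwidth_candidates(LISTING):
    print(f"0x{legacy:04X}  U+{cp:04X}  ->  U+{fw:04X}  "
          f"{unicodedata.name(chr(fw))}")
```

The same function can be run over the CHINSIMP.TXT and JIS0212.TXT listings
by substituting their lines for LISTING.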

(This section should have been in L2/01-189, since it is more
of an EAW issue than a mapping issue.)

The following characters show up only in X0212. The relevant
issue to decide is "To what extent are there widespread
JIS X0212 implementations?" My understanding is that this
standard is not widely supported, and with X0213 coming it
will be even less supported. Therefore I tend to exclude
it from the set of legacy encodings under consideration.

The decision to make is whether this belongs in the class
of legacy encodings supported by EAW and also whether we
need to consider X0213 as well.

0x226D  U+00A9  N  # COPYRIGHT SIGN
0x2238  U+0384  N  # GREEK TONOS
0x2239  U+0385  N  # GREEK DIALYTIKA TONOS
...
0x2B77  U+017C  N  # LATIN SMALL LETTER Z WITH DOT ABOVE

The following are again questionable mappings:

The next one is almost correct; it should be Na (?) if it
is used to map a single-byte character in an EA legacy encoding.
This mapping should be confirmed, if possible, and the EAW
adjusted accordingly. The vendor mappings are font dependent,
i.e. they treat 0x7E as TILDE except in (some) fonts.

FILE SHIFTJIS.TXT------
0x7E  U+203E  N  # OVERLINE

This may be a mapping table error.
Microsoft maps this location to Hyphenation point, which is "A".

FILE BIG5.TXT------
0xA145  U+2022  N  # BULLET

This may be a mapping table error.
The Microsoft mapping is to SMALL IDEOGRAPHIC COMMA, which is "W".

0xA14E  U+FF64  H  # HALFWIDTH IDEOGRAPHIC COMMA

This may be a mapping table error.
The Microsoft mapping is to CIRCLED PLUS, which is "A".

0xA1F2  U+2641  N  # EARTH

Is this the same as Macron, or is it different? Should it be mapped to FW
Macron? Microsoft maps it to 00AF, which means that 00AF would need to be
"A". 0xA1C3 is mapped to FW Macron in the MS mappings.

0xA1C2  U+203E  N  # OVERLINE
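
If the inconsistencies are to be documented as a simple table, the Big-5
disagreements noted above could be tabulated along the following lines.
This is only a sketch: the BIG5.TXT values are copied from the listing
above, the Microsoft targets are resolved from the character names cited in
the text (via the Unicode name lookup), and the names UNICODE_TABLE,
VENDOR_NAMES, and differences are hypothetical. It makes no claim about
which side is correct.

```python
import unicodedata

# BIG5.TXT targets as they appear in the listing above.
UNICODE_TABLE = {
    0xA145: 0x2022,  # BULLET
    0xA14E: 0xFF64,  # HALFWIDTH IDEOGRAPHIC COMMA
    0xA1F2: 0x2641,  # EARTH
    0xA1C2: 0x203E,  # OVERLINE
}

# Microsoft targets, by the character names cited in the text.
VENDOR_NAMES = {
    0xA145: "HYPHENATION POINT",
    0xA14E: "SMALL IDEOGRAPHIC COMMA",
    0xA1F2: "CIRCLED PLUS",
    0xA1C2: "MACRON",
}
VENDOR_TABLE = {
    code: ord(unicodedata.lookup(name))
    for code, name in VENDOR_NAMES.items()
}


def differences(a, b):
    """Return (legacy_code, target_in_a, target_in_b) for every legacy
    code present in both tables whose targets disagree, in code order."""
    return [
        (code, a[code], b[code])
        for code in sorted(a.keys() & b.keys())
        if a[code] != b[code]
    ]


print("Big-5   BIG5.TXT                            Microsoft")
for code, ours, theirs in differences(UNICODE_TABLE, VENDOR_TABLE):
    ours_label = f"U+{ours:04X} {unicodedata.name(chr(ours))}"
    theirs_label = f"U+{theirs:04X} {unicodedata.name(chr(theirs))}"
    print(f"0x{code:04X}  {ours_label:<34}  {theirs_label}")
```

Feeding the function two complete tables, rather than these four entries,
would produce the kind of summary table done in document L2/01-192.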