Re: Round trip mapping - SJIS to Unicode

From: Ken Lunde (lunde@adobe.com)
Date: Mon Aug 24 1998 - 10:43:09 EDT


Abhishek,

You wrote:

>> It does not matter because the characters are just duplicates.
>> For e.g. in the case of 0x879c --> Ux222a --> 0x81be, 0x879c, Ux222a and
>> 0x879c represent the same character. 0x879c and 0x81be are just duplicates.

The fundamental issue here is how to handle such beasts. Luckily, in
the case of Microsoft's Japanese character set, for every case of
duplicate encoding there is *always* a preferred code point for
round-trip conversion. For example, it is easy to force both 0x81BE and
0x879C to become U+222A. But when you convert back into Shift-JIS,
which code point should you use? In this case, 0x81BE is the preferred
code point. 0x879C is a character from NEC Row 13, which was developed
to work with JIS78. However, several characters in NEC Row 13 were
later added to JIS83 (in Row 2), which makes some NEC Row 13 characters
duplicates whenever the rest of the character set conforms to JIS83
(or JIS90 or JIS97), as is the case for the Windows-J character set.
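
To make the asymmetry concrete, here is a minimal sketch in Python. It
uses only the three code points from the example above; the table and
function names are hypothetical, and a real converter would load
complete tables rather than these two-entry stand-ins.

```python
# Decoding is many-to-one: both Shift-JIS code points yield the same
# Unicode character (values taken from the example in this message).
SJIS_TO_UNICODE = {
    0x81BE: 0x222A,  # JIS83 Row 2 position
    0x879C: 0x222A,  # NEC Row 13 duplicate
}

# Encoding must be one-to-one: each Unicode character maps back to
# exactly one preferred Shift-JIS code point.
UNICODE_TO_SJIS = {
    0x222A: 0x81BE,
}


def round_trip(sjis_code: int) -> int:
    """Convert a Shift-JIS code point to Unicode and back again."""
    return UNICODE_TO_SJIS[SJIS_TO_UNICODE[sjis_code]]
```

With these tables, 0x81BE survives the round trip unchanged, while
0x879C comes back as 0x81BE.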

Anyway, the XKP specification defines what the preferred mappings
should be for such cases. See:

http://www.xkp.or.jp/

The basic rules are:

o If the character is in both JIS83 and NEC Row 13 (and possibly an
IBM Selected character), the JIS83 code point is preferred.

o If the character is in the IBM Selected set (NEC and IBM positions),
the IBM position is preferred.

o If the character is in NEC Row 13 and the IBM Selected set, the NEC
Row 13 code point is preferred.
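
One way to read the three rules above is as a single priority order:
JIS83 first, then NEC Row 13, then the IBM position, then the NEC
position of an IBM Selected character. The sketch below assumes that
reading. It also assumes that the source of a code point can be
guessed from its lead byte (0x87 for NEC Row 13, 0xED-0xEE for the NEC
positions of IBM Selected characters, 0xFA-0xFC for the IBM
positions); those ranges and the function names are illustrative
assumptions, not something taken from the XKP specification.

```python
def source_rank(sjis_code: int) -> int:
    """Return a priority for a Shift-JIS code point; lower is preferred.

    The lead-byte ranges here are assumptions made for illustration.
    """
    lead = sjis_code >> 8
    if lead == 0x87:
        return 1  # NEC Row 13
    if 0xFA <= lead <= 0xFC:
        return 2  # IBM Selected, IBM position
    if 0xED <= lead <= 0xEE:
        return 3  # IBM Selected, NEC position
    return 0      # everything else is treated as JIS83


def preferred(candidates):
    """Pick the preferred code point among duplicates of one character."""
    return min(candidates, key=source_rank)
```

For example, preferred([0x879C, 0x81BE]) picks 0x81BE, which matches
the preferred code point named earlier in this message.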

I have developed a machine-readable file that contains these mappings,
and demonstrates what the preferred ones are. If anyone is interested,
I can send it privately (it is about 100K).

Interestingly, there is one case in which three code points map to the
same character:

0x81CA, 0xEEF9, 0xFA54 -> U+FFE2 -> 0x81CA

Hope this helps...

-- Ken


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:41 EDT