Re: Fun with UDCs in Shift-JIS

From: Lars Marius Garshol (larsga@garshol.priv.no)
Date: Fri Jan 18 2002 - 09:32:01 EST


* Lars Marius Garshol
|
| I've just discovered that it seems that Shift-JIS encodes a number
| of User-Defined Characters in the 0xF040 to 0xFCFC range, and that
| these

* Markus Scherer
|
| Yes, and every implementor may assign characters to them as they see
| fit.
 
I know. My problem is that people use them in web pages, and I need to
display them the way those people expect.

| The problem being that most likely they are all tagged as
| charset="Shift_JIS", without distinguishing the variant of what's in
| the Shift-JIS encoding. Unreliable tagging is very common. That's
| one good reason why we all advocate Unicode...

Sure, but none of that helps me in any way. People are publishing
these pages and I need to support them "correctly".
 
| Given how many Windows machines there are, and given that Shift-JIS
| seems to be more popular on Windows than on Unixes, let's look at
| the Shift-JIS<->Unicode mapping table for windows-932:
| http://oss.software.ibm.com/cvs/icu/charset/data/xml/windows-932-2000.xml?rev=1.1&content-type=text/x-cvsweb-markup
| (From our collection of mapping tables at
| http://oss.software.ibm.com/icu/charset/)
|
| Shift-JIS F040..F9FC appears to be contiguously and linearly mapped
| to U+E000..U+E757.

I'm afraid that's not what users expect to see. I know it's the right
solution in the general case, but it seems that I need to do whatever
MSIE does, since that is in effect what users expect to see.
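
For reference, the linear mapping Markus describes can be written down
in a few lines. This is only a sketch: the function name and the error
handling are my own, and it assumes the usual Shift-JIS trail byte set
(0x40..0x7E and 0x80..0xFC, i.e. 188 cells per lead byte), which is what
makes F040..F9FC come out at exactly U+E000..U+E757.

```python
# Sketch of the linear windows-932 UDC mapping quoted above:
# Shift-JIS F040..F9FC -> U+E000..U+E757.
# udc_to_pua and its error handling are illustrative, not from any library.

CELLS_PER_ROW = 188  # trail bytes 0x40..0x7E (63) plus 0x80..0xFC (125)


def udc_to_pua(lead: int, trail: int) -> int:
    """Return the PUA code point for a two-byte UDC code, or raise ValueError."""
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError(f"lead byte 0x{lead:02X} is outside F0..F9")
    if 0x40 <= trail <= 0x7E:
        cell = trail - 0x40
    elif 0x80 <= trail <= 0xFC:
        cell = trail - 0x41  # 0x7F is never a trail byte, so close the gap
    else:
        raise ValueError(f"0x{trail:02X} is not a valid trail byte")
    return 0xE000 + (lead - 0xF0) * CELLS_PER_ROW + cell


if __name__ == "__main__":
    for lead, trail in [(0xF0, 0x40), (0xF0, 0x80), (0xF9, 0xFC)]:
        print(f"{lead:02X}{trail:02X} -> U+{udc_to_pua(lead, trail):04X}")
```

Ten lead bytes times 188 cells is 1880 (0x758) code points, so the two
ends of the range land on U+E000 and U+E757, as in the quoted table.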

| Other Shift-JIS variants from different platforms will use a
| different assignment, but I would try the Windows variant first for
| whatever web page you are looking at. As a receiver, maybe you can
| figure out which platform generated the file, from a <meta> tag or
| an http server identification.

In general that's impossible. The pages will generally not be labeled
at all, and if they are labeled they will be labeled with anything
from "shift-jis" to "x-sjis" or even "iso-8859-1". Some pages even do
"helpful" things like sticking comments with Shift-JIS-typical byte
signatures near the top of the pages to help auto-detection, rather
than actually revealing what charset they used.

So while there are many different Shift-JISes, my chances of finding
out which one was used in each case are essentially nil. I need to find
the most common one and support that.
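
In code, "support the most common one" amounts to something like the
sketch below: every label that looks like Shift-JIS is sent to one
preferred table, whatever variant the label claims. The alias set, the
function name and the default of "cp932" (Python's name for its
windows-932 codec) are assumptions for illustration; which table really
is the most common is exactly the open question.

```python
# Sketch: ignore the variant implied by a charset label and decode every
# Shift-JIS-looking label with one preferred table.
# SJIS_ALIASES, pick_codec and the "cp932" default are assumptions.

SJIS_ALIASES = {"shift_jis", "shift-jis", "sjis", "x-sjis"}


def pick_codec(label, preferred="cp932"):
    """Map a possibly missing or sloppy charset label to a decoder name."""
    if label is None:
        return None  # unlabeled page: leave the choice to auto-detection
    normalized = label.strip().lower()
    if normalized in SJIS_ALIASES:
        return preferred
    return normalized  # not a Shift-JIS label; pass it through unchanged


if __name__ == "__main__":
    for label in ["Shift_JIS", " x-sjis ", "iso-8859-1", None]:
        print(repr(label), "->", pick_codec(label))
```

This does nothing for pages mislabeled as "iso-8859-1"; those still
need detection from the bytes themselves.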
 
| As a recommendation, if you _have_ to _generate_ Shift-JIS web
| pages, you should avoid UDCs and instead use NCRs (with Unicode
| non-PUA[!] code points).

I'm not generating these pages, I'm trying to display them.
 
| The W3C has a page about the problems with Japanese charset
| identifiers and mapping tables.

That was a good lead. They give four different tables for Shift-JIS:

 - x-sjis-unicode-0.9
    this is the one I use already

 - x-sjis-jisx0221-1995
    this one I haven't been able to find

 - x-sjis-cp932
    the unicode.org version has the 0xFA40 - 0xFC4B range; an ICU
    version seems to cover the same range, mapping the rest to the
    PUA

 - x-sjis-jdk1.1.7
    I found a JDK 1.3 version of this, but it had nothing

I now have some more pieces of the puzzle, but I still don't have all
of them. Or do I? Is it just the 0xFA40 - 0xFC4B range that has real
characters in it?
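
To keep the pieces straight, here is a sketch that sorts a two-byte code
into the ranges mentioned in this thread. The range boundaries are the
ones quoted above; the function name and the category strings are made
up for illustration, and "unknown" marks the part that none of the
tables I have looked at accounts for.

```python
# Sketch: classify a two-byte Shift-JIS code by the ranges discussed in
# this thread. classify() and its category strings are hypothetical.


def classify(code: int) -> str:
    """Return which range of the 0xF040..0xFCFC area a code falls in."""
    trail = code & 0xFF
    if not (0x40 <= trail <= 0xFC and trail != 0x7F):
        return "invalid"   # not a legal Shift-JIS trail byte
    if 0xF040 <= code <= 0xF9FC:
        return "pua"       # windows-932 table: linear map to U+E000..U+E757
    if 0xFA40 <= code <= 0xFC4B:
        return "mapped"    # range covered by the cp932 tables
    if 0xFC4C <= code <= 0xFCFC:
        return "unknown"   # no table found so far covers this part
    return "outside"       # not in the 0xF040..0xFCFC area at all


if __name__ == "__main__":
    for code in (0xF040, 0xF9FC, 0xFA40, 0xFC4B, 0xFC4C, 0xF07F, 0x8140):
        print(f"0x{code:04X}: {classify(code)}")
```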

--Lars M.


