Multiple encodings used with Unihan database

From: <z-test_at_shiroha.jp>
Date: Tue, 17 Apr 2012 17:14:01 +0900

Good morning!

I frequently consult the Unihan database to get detailed information
about Japanese and Chinese characters, and I have noticed that at
least some pages mix two encodings: although the main encoding is
UTF-8 (as one would expect on the Unihan site), certain characters on
those pages are encoded in ISO-8859-1, which makes them unreadable
until one forces a change of the encoding.

I just looked at these pages:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=58b3
(character: 墳)
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5893
(character: 墓)

The wrongly encoded characters appear in the Hanyu Pinyin column:
the accented letters are encoded in ISO-8859-1 rather than UTF-8 and
become legible only if one changes the encoding setting to
ISO-8859-1 (which, of course, renders much of the rest of the page
unusable).

kHanyuPinyin 10485.060:fén,fèn
kHanyuPinyin 10470.090:mù
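
The effect described above can be reproduced with a minimal Python sketch. It assumes that the stray bytes really are ISO-8859-1 and that everything else on the page is valid UTF-8. The sample bytes are rebuilt from the line quoted above rather than fetched from the site, and the handler name "latin1_fallback" and the function decode_mostly_utf8 are hypothetical names chosen for illustration.

```python
import codecs


def _latin1_fallback(error):
    """Error handler: reinterpret undecodable bytes as ISO-8859-1."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad_bytes = error.object[error.start:error.end]
    # Every byte value is valid ISO-8859-1, so this decode cannot fail.
    return bad_bytes.decode("iso-8859-1"), error.end


# "latin1_fallback" is a hypothetical handler name used only in this sketch.
codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_mostly_utf8(raw: bytes) -> str:
    """Decode as UTF-8, falling back to ISO-8859-1 for invalid bytes only."""
    return raw.decode("utf-8", errors="latin1_fallback")


# Assumed reconstruction of the mixed line: the Han character is UTF-8,
# the accented pinyin letters are single ISO-8859-1 bytes.
raw_line = ("墳 kHanyuPinyin 10485.060:".encode("utf-8")
            + "fén,fèn".encode("iso-8859-1"))

# Strict UTF-8 decoding rejects the stray bytes 0xE9 and 0xE8.
try:
    raw_line.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

# Forcing ISO-8859-1 makes the pinyin legible but garbles the Han character.
as_latin1 = raw_line.decode("iso-8859-1")

# The fallback decoder keeps both parts readable.
repaired = decode_mostly_utf8(raw_line)
```

This is only a reader-side workaround under the stated assumptions. A run of ISO-8859-1 bytes that happens to form a valid UTF-8 sequence would still be read as UTF-8, so a proper fix would have to be made where the page is generated.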

I suspect that the providers of this information would like all of
it to be encoded in UTF-8 and that the current encoding scheme is
just an accident. :-)

Thank you for your time!
Received on Tue Apr 17 2012 - 09:56:06 CDT
