Re: Multiple encoding used with Unihan database

From: Jim Breen <jimbreen_at_gmail.com>
Date: Wed, 18 Apr 2012 09:30:57 +1000

z-......_at_shiroha.jp wrote:
> Subject: Multiple encoding used with Unihan database
> I frequently consult the Unihan database to get detailed information
> about Japanese and Chinese characters, and I have noticed that at
> least some pages are encoded in more than one encoding, that is to
> say, although the main encoding is in "UTF-8" (as one would expect on
> the Unihan site), certain characters on those pages are encoded in
> "ISO-8859-1", which makes them unreadable until one forces a change
> of the encoding.
>
> I just looked at these pages:
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=58b3
> (character: 墳)
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5893
> (character: 墓)
>
> The wrongly encoded characters appear here in the Hanyu Pinyin
> column: the accented letters are from the ISO-8859-1 charset and not
> from UTF-8 and will only become legible if one changes the encoding
> setting to ISO-8859-1 (which, of course, renders much of the rest of
> the page unusable).
>
> kHanyuPinyin 10485.060:fén,fèn
> kHanyuPinyin 10470.090:mù
>
> I suspect that the providers of this information would like all of
> it to be encoded in UTF-8 and that the current encoding scheme is
> just an accident. :-)

This is very odd. The Unihan data files, which can be downloaded and which
presumably drive that WWW service, have that information correctly encoded.

Quoting from Unihan_Readings.txt (Unicode 6.0):

U+58B3 kCantonese fan4
U+58B3 kDefinition grave, mound; bulge; bulging
U+58B3 kHangul 분
U+58B3 kHanyuPinlu fen2(46)
U+58B3 kHanyuPinyin 10485.060:fén,fèn
U+58B3 kJapaneseKun HAKA
U+58B3 kJapaneseOn FUN
U+58B3 kKorean PWUN
U+58B3 kMandarin FEN2
U+58B3 kTang *bhiən
U+58B3 kVietnamese phần
U+58B3 kXHC1983 0322.071:fén

My guess is that the WWW service is using a pre-release version
of the data which had some encoding errors.
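
For anyone wanting to confirm the symptom on a saved copy of a page, here is
a minimal Python sketch of the underlying byte-level difference. It only
illustrates how the same syllable looks in the two encodings; it does not
claim to show what the WWW service actually does, and the helper name
is_valid_utf8 is my own.

```python
# Illustration only: the same pinyin syllable encoded two ways.
text = "fén"

utf8_bytes = text.encode("utf-8")          # é becomes two bytes: C3 A9
latin1_bytes = text.encode("iso-8859-1")   # é becomes one byte:  E9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The UTF-8 form decodes fine; the ISO-8859-1 form is not valid UTF-8,
# which is why a browser set to UTF-8 cannot display it properly.
print(utf8_bytes, is_valid_utf8(utf8_bytes))
print(latin1_bytes, is_valid_utf8(latin1_bytes))
```

A lone E9 byte followed by an ASCII letter is an incomplete UTF-8 sequence,
so a page that mixes it into otherwise UTF-8 text will show a replacement
character at that spot.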

My advice is to download the data and search it directly.
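
A rough Python sketch of such a direct search follows. It assumes each data
line holds a code point, a field name and a value separated by whitespace
(tabs in the distributed file, as far as I know), and that lines starting
with "#" are comments. The function name lookup and the sample lines (copied
from the excerpt above) are just for illustration.

```python
from typing import Dict, Iterable


def lookup(lines: Iterable[str], codepoint: str) -> Dict[str, str]:
    """Collect every field recorded for one code point, e.g. 'U+58B3'."""
    wanted = codepoint.upper()
    fields: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Split into code point, field name, and the rest as the value.
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].upper() == wanted:
            fields[parts[1]] = parts[2]
    return fields


# Sample lines taken from the excerpt quoted in this message.
sample = [
    "# comment line",
    "U+58B3 kDefinition grave, mound; bulge; bulging",
    "U+58B3 kHanyuPinyin 10485.060:fén,fèn",
    "U+58B3 kXHC1983 0322.071:fén",
]

for name, value in lookup(sample, "U+58B3").items():
    print(name, value)
```

To run it against the real data, pass an open file instead of the sample
list, e.g. open("Unihan_Readings.txt", encoding="utf-8"), with the path
adjusted to wherever the download was unpacked.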

Jim Breen

-- 
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne
Received on Tue Apr 17 2012 - 18:37:16 CDT
