Re: Multiple encoding used with Unihan database

From: Jim Breen
Date: Wed, 18 Apr 2012 09:30:57 +1000
> Subject: Multiple encoding used with Unihan database
> I frequently consult the Unihan database to get detailed information
> about Japanese and Chinese characters, and I have noticed that at
> least some pages are encoded in more than one encoding, that is to
> say, although the main encoding is in "UTF-8" (as one would expect on
> the Unihan site), certain characters on those pages are encoded in
> "ISO-8859-1", which makes them unreadable until one forces a change
> of the encoding.
> I just looked at these pages:
> (character: 墳)
> (character: 墓)
> The wrongly encoded characters appear here in the Hanyu Pinyin
> column: the accented letters are from the ISO-8859-1 charset and not
> from UTF-8 and will only become legible if one changes the encoding
> setting to ISO-8859-1 (which renders, of course, much the rest of the
> page unusable)
> kHanyuPinyin 10485.060:fén,fèn
> kHanyuPinyin 10470.090:mù
> I suspect that the providers of this information would like to see
> all of it to be encoded in UTF-8 and that the current encoding scheme
> is just an accident. :-)

This is very odd. The Unihan data files, which can be downloaded and which
presumably drive that WWW service, have that information correctly encoded.

Quoting from Unihan_Readings.txt (Unicode 6.0):

U+58B3 kCantonese fan4
U+58B3 kDefinition grave, mound; bulge; bulging
U+58B3 kHangul 분
U+58B3 kHanyuPinlu fen2(46)
U+58B3 kHanyuPinyin 10485.060:fén,fèn
U+58B3 kJapaneseKun HAKA
U+58B3 kJapaneseOn FUN
U+58B3 kKorean PWUN
U+58B3 kMandarin FEN2
U+58B3 kTang *bhiən
U+58B3 kVietnamese phần
U+58B3 kXHC1983 0322.071:fén

My guess is that the WWW service is using a pre-release version
of the data which had some encoding errors.
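
For what it is worth, the symptom described is what one would expect if
ISO-8859-1 bytes for the accented letters were spliced into an otherwise
UTF-8 page. The following is only a sketch of that assumption, not a
diagnosis of what the service actually does; the byte string is built by
hand for illustration.

```python
# Assumption: the page is UTF-8 except for the accented pinyin letter,
# which has been written as a single ISO-8859-1 byte.
mixed = "墳 f".encode("utf-8") + "é".encode("iso-8859-1") + b"n"

# Read as UTF-8: the Han character is fine, the stray byte is not valid
# UTF-8 and is replaced by U+FFFD.
as_utf8 = mixed.decode("utf-8", errors="replace")

# Read as ISO-8859-1: the pinyin becomes legible, but the three UTF-8
# bytes of the Han character turn into three unrelated Latin-1 characters.
as_latin1 = mixed.decode("iso-8859-1")

if __name__ == "__main__":
    print(repr(as_utf8))
    print(repr(as_latin1))
```

Neither decoding gives a fully readable line, which matches the report
that forcing ISO-8859-1 fixes the pinyin and breaks the rest of the page.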

My advice is to download the data and search it directly.
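
Something along the following lines should do for a direct search. It is
a minimal sketch which assumes the file is UTF-8 with one tab-separated
"code point, field, value" record per line and "#" comment lines; the
function names and the local file name in the comment are mine, not part
of the Unihan distribution.

```python
import io


def codepoint_of(char):
    """Format a single character the way the Unihan files do, e.g. 'U+58B3'."""
    return "U+{:04X}".format(ord(char))


def lookup(lines, char):
    """Collect every field recorded for one character into a dict."""
    wanted = codepoint_of(char)
    fields = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) == 3 and parts[0] == wanted:
            fields[parts[1]] = parts[2]
    return fields


# A few records in the assumed layout, standing in for the real file.
SAMPLE = (
    "# sample records\n"
    + codepoint_of("墳") + "\tkHanyuPinyin\t10485.060:fén,fèn\n"
    + codepoint_of("墳") + "\tkMandarin\tFEN2\n"
    + codepoint_of("墓") + "\tkHanyuPinyin\t10470.090:mù\n"
)

if __name__ == "__main__":
    # With the downloaded file one would instead use something like:
    #   with open("Unihan_Readings.txt", encoding="utf-8") as f:
    #       fields = lookup(f, "墳")
    for name, value in lookup(io.StringIO(SAMPLE), "墳").items():
        print(name, value)
```

Opening the file explicitly as UTF-8 avoids depending on whatever default
encoding the local system happens to use.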

Jim Breen

Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne
Received on Tue Apr 17 2012 - 18:37:16 CDT
