I have come to the conclusion that Chinese support in Unicode is an epic failure due to the Han Unification and an apparent lack of foresight or understanding of Chinese characters.
As Asmus points out, the Unihan repertoire has been developed by native speakers of (Mandarin) Chinese, Cantonese, Japanese, Korean, and Vietnamese, including representatives from major font vendors, software companies, and universities throughout East Asia. The model used for Han unification was developed by JIS for Japanese character set standards and slightly refined by the Chinese. The issues you raise were all well-understood when the Unihan effort started over twenty years ago.
- There is no way to differentiate Japanese and Chinese characters other than trying to look at the text and guessing.
Why do you want to do this? In general, Japanese users want to see kanji
written with Japanese fonts, whether they're used to write Japanese or Chinese. The simplest way to have this happen is for the user to set their locale or manually set the font. Usually setting the locale is enough.
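For instance, here's a minimal sketch in Python of locale-driven font selection; the font names are placeholders for whatever CJK fonts you actually ship:

```python
import locale

# Placeholder font names -- substitute the CJK fonts you actually ship.
FONT_BY_LANGUAGE = {
    "ja": "MyJapaneseFont",
    "zh": "MyChineseFont",
}

locale.setlocale(locale.LC_ALL, "")        # adopt the user's locale settings
lang = locale.getlocale()[0] or "en"       # e.g. "ja_JP" or "zh_CN"
lang = lang.split("_")[0]                  # keep just the language code
font = FONT_BY_LANGUAGE.get(lang, "MyDefaultFont")
print("Rendering with:", font)
```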
- It is impossible to create a universal font because the same character code is used for two different glyphs, one Japanese and one Chinese.
Actually, it's impossible to create a universal font because TrueType and its descendants can't have more than 65,536 glyphs. Having a single font file with simultaneous support for Chinese and Japanese is relatively straightforward using TrueType Collection (.ttc) fonts. You do have to distinguish the Japanese variant and the Chinese variant in the UI, of course; but typical users will want only one and not the other anyway. And depending on character codes to distinguish Chinese and Japanese wouldn't do you a lick of good, because it's the glyph count that is the problem, not the character count.
If you really, seriously do not want to work with more than one font, or get the locale from the system, or let the user switch between fonts—well, then, yes, you're hosed.
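If you do go the .ttc route, here is a rough sketch using the fontTools library (pip install fonttools); the file name is a placeholder for whatever collection you ship. It enumerates the faces in a collection and shows how close each one runs to the glyph ceiling:

```python
from fontTools.ttLib import TTCollection

# "MyCJKFonts.ttc" is a placeholder; point this at your own collection.
collection = TTCollection("MyCJKFonts.ttc")
for font in collection.fonts:
    full_name = font["name"].getDebugName(4)   # nameID 4 = full font name
    # maxp.numGlyphs is a uint16, which is exactly where the
    # 65,536-glyph ceiling comes from.
    print(full_name, font["maxp"].numGlyphs)
```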
- Any document format that does not include a language attribute cannot reliably contain Japanese or Chinese characters. For example, if I get an email encoded in UTF-16, there is no way to determine whether it contains Japanese or Chinese unless it uses characters that are only in one of those two sets, or some kind of AI or human reads it.
You mean "differentiate" instead of "contain," of course. And again, why do you want to do this? If it's only to set the appropriate font for display, then using a font derived from the locale should be enough. I'll reiterate: Japanese users want to see Japanese glyphs, period. They don't care what the language in question is.
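That said, if you truly must guess the language from bare text, the usual heuristic is to look for kana, which only Japanese uses. A rough sketch:

```python
def looks_japanese(text: str) -> bool:
    """Rough heuristic: hiragana (U+3040-U+309F) and katakana
    (U+30A0-U+30FF) occur only in Japanese. Text written entirely
    in kanji stays ambiguous, so this can return false negatives."""
    return any("\u3040" <= ch <= "\u30ff" for ch in text)

print(looks_japanese("これは日本語です"))  # True: contains hiragana
print(looks_japanese("中文文本"))          # False: no kana, so no signal
```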
- The initial allocation failed to include commonly used Japanese characters such as those for people's names, despite them being well defined and listed by the government and several dictionaries.
The initial allocation was a superset of the major character sets then in use for simplified and traditional Chinese, Japanese, and Korean. Common Cantonese-specific characters were missing, and support for Vietnamese was fairly limited, largely because those two were terra incognita
at the time. If Unicode was missing "commonly used Japanese characters" in 1991, it was because the Japanese failed to include them in their own standards, which makes this objection rather disingenuous.
In any event, when the matter was being argued over in the early 1990s, opponents of Unicode failed to produce a single kanji which was in common use for personal names and yet not encoded. Even now, the bulk of the characters proposed for inclusion by Japan are not characters required for personal names, unlike the proposals coming from China.
I'm sorry to sound brusque, but this was a canard twenty years ago, and one would hope that we could simply let this particular horse rest in peace.
- The multiple encodings, while clever and mathematically sound, make implementing Unicode on systems with limited memory and performance very challenging. The knock-on effect of this is not only that most embedded systems don't support Unicode, but that many desktop apps don't either, because Unicode failed to displace existing, more practical options.
Conversion between the various UTFs is trivial and requires a small amount of code. Libraries and samples illustrating how to do it with complete error handling are readily available. It's also very fast. The typical way of doing it is to use one UTF internally and do the conversions at input and output.
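In Python, for instance, the whole round trip is built in, and the default errors="strict" handling raises on malformed input rather than guessing:

```python
text = "日本語と中文"

# Every UTF round-trips losslessly through every other; the default
# errors="strict" raises UnicodeDecodeError on malformed input.
utf8 = text.encode("utf-8")
utf16 = utf8.decode("utf-8").encode("utf-16-le")
utf32 = utf16.decode("utf-16-le").encode("utf-32-le")

assert utf32.decode("utf-32-le") == text
```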
Interconversion between various UTFs is a far, far simpler problem than interconversion between the dozens of East Asian character sets which antedate Unicode.
In any event, such software manages without Unicode by the simple fact that it is generally not trying to solve the same problem that Unicode is; that is, it is deliberately limiting its support to a small number of languages, perhaps as few as one. Writing software which provides simultaneous support for all the major languages of East Asia—let alone the dozens of scripts and hundreds of languages covered by Unicode 6.0—is far, far
more difficult without Unicode.
As an example, you can't allocate a buffer of N 16-bit words and expect to fit N characters into it, because some characters need two 16-bit words.
No, you can't. Would you prefer to allocate a buffer of N 8-bit bytes and be unable to fit N characters in it because some characters are single-byte and some are double-byte? That's the way the major East Asian standards worked in the pre-Unicode days. Would you prefer to allocate buffers of N 32-bit words and fit N characters in them? That's what you would have to do without Han unification. And if that's the case, just use UTF-32.
(And you can't do that anyway, because the assumption that one character == one code unit in memory == one grapheme is a gross oversimplification for languages generally. It is mostly true in East Asia and the US, which is why people keep thinking it should be true everywhere all the time.)
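To make that concrete, here's a small Python illustration (U+2000B is just an arbitrary pick from CJK Extension B):

```python
s = "\U0002000B"   # U+2000B, a CJK Extension B ideograph

print(len(s))                           # 1 code point (Python counts code points)
print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units: a surrogate pair
print(len(s.encode("utf-32-le")) // 4)  # 1 UTF-32 code unit

g = "e\u0301"      # "é" written as base letter + combining accent
print(len(g))      # 2 code points, but one user-perceived character (grapheme)
```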
- To anyone not familiar with Japanese or Chinese, these problems would not be obvious or well documented, so some apps that support Unicode still don't work properly for those languages.
Out of curiosity, do you have specific applications in mind? I'm aware of software that still doesn't handle Unicode correctly in a number of different ways, but not any that has trouble with Chinese and Japanese.
Because of all this, I am having to produce two versions of my product, one with a Chinese font and one with a Japanese font. I don't know Chinese, so I somehow have to test it blind.
And you can't ship it with both and let the system locale or the user determine which font to use because—?