The Unicode Consortium Discussion Forum

 Post subject: Required characters for CJK support in an embedded system
PostPosted: Tue Jul 12, 2011 5:21 pm 

Joined: Tue Jul 12, 2011 1:57 pm
Posts: 5
Hi.

I am working on an embedded system that needs to deal with Unicode strings. I want to support Japanese, Chinese and Korean character sets but have limited storage space available for fonts. I can fit in all of CJK Unified Ideographs and Extension A, plus everything else like Hiragana/Katakana etc. The only things outstanding are Extension B (and all the later extensions) and the Compatibility Ideographs. Basically, anything above U+FFFF isn't included.
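For checking strings against that subset, this is the sort of test I have in mind - a minimal sketch assuming UTF-8 input (function names are placeholders):

Code:
#include <stdint.h>

/* Decode one UTF-8 sequence; returns the code point and advances *s.
   Returns U+FFFD (replacement character) on malformed input. */
static uint32_t utf8_next(const uint8_t **s)
{
    const uint8_t *p = *s;
    uint32_t cp;
    int extra;

    if (p[0] < 0x80)                { cp = p[0];        extra = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else                            { (*s)++; return 0xFFFD; }

    for (int i = 1; i <= extra; i++) {
        if ((p[i] & 0xC0) != 0x80) { (*s)++; return 0xFFFD; }
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }
    *s = p + 1 + extra;
    return cp;
}

/* Returns 1 if every code point in a NUL-terminated UTF-8 string is
   on the BMP (<= U+FFFF), i.e. inside the planned font subset. */
int fits_bmp_subset(const char *utf8)
{
    const uint8_t *p = (const uint8_t *)utf8;
    while (*p)
        if (utf8_next(&p) > 0xFFFF)
            return 0;
    return 1;
}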

Is it going to be okay for general use? The strings will include things like people's names, which as I understand it have caused issues for Unicode in the past. I think I will have to ship two different firmwares, one with a Japanese-style font and one with a Chinese-style font.


Post subject: Re: Required characters for CJK support in an embedded system
PostPosted: Tue Jul 12, 2011 6:01 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
If you need to subset, you might want to start with something like the IICORE set of ideographs and build from there. A great many ideographs are extremely rarely used; the IICORE set tries to identify a reasonably complete repertoire that should cover most everyday/common-use needs.

No subset is guaranteed to cover every character that you might encounter, but starting from a curated collection, rather than just carving up the code space yourself, has the advantage that you don't accidentally exclude common characters merely because they were added to Unicode a bit later in the process.
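In code, the coverage test can then be a binary search over a table generated from whatever repertoire you settle on. A rough sketch - the ranges shown are purely illustrative:

Code:
#include <stddef.h>
#include <stdint.h>

/* Sorted, non-overlapping code point ranges the firmware font
   actually covers - generated from the chosen repertoire (IICORE
   plus kana, Hangul, punctuation, ...). Entries here are only
   illustrative. */
typedef struct { uint32_t first, last; } cp_range;

static const cp_range font_coverage[] = {
    { 0x3040, 0x309F },  /* Hiragana */
    { 0x30A0, 0x30FF },  /* Katakana */
    { 0x4E00, 0x9FFF },  /* CJK Unified Ideographs (subsetted in reality) */
};

int font_has_cp(uint32_t cp)
{
    size_t lo = 0, hi = sizeof font_coverage / sizeof font_coverage[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < font_coverage[mid].first)
            hi = mid;
        else if (cp > font_coverage[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}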


Post subject: Re: Required characters for CJK support in an embedded system
PostPosted: Wed Jul 13, 2011 12:52 pm

Joined: Tue Jul 12, 2011 1:57 pm
Posts: 5
Thanks, it makes sense to start with those. I managed to make everything fit by cutting out some Arabic and symbol pages, but my main concern is the stuff not on the BMP. That will be very tricky to support.


Post subject: Re: Required characters for CJK support in an embedded system
PostPosted: Thu Jul 14, 2011 5:19 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
There is some off-BMP stuff that's more widely used than some on-BMP stuff. That's just the way it is. But subsetting by entire CJK blocks is not the answer - they are too coarse.

If your layout engine doesn't support bidirectional layout, you can cut all right-to-left scripts from your fonts; keeping the glyphs won't do users any good. The same goes for other scripts that require complex layout - Indic scripts, for example. Having the fonts but not the layout engine for them is just as useless to users. (The argument works in reverse if your layout engine does handle those scripts.)
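If it helps, a crude filter for the common BMP right-to-left blocks can look something like the following sketch; the ranges are approximate, and a real implementation should derive them from the Bidi_Class property in the Unicode Character Database:

Code:
#include <stdint.h>

/* Rough membership test for the common BMP right-to-left blocks.
   Derive this from the UCD (Bidi_Class) for real use rather than
   hard-coded ranges. */
int cp_is_rtl(uint32_t cp)
{
    return (cp >= 0x0590 && cp <= 0x05FF)   /* Hebrew */
        || (cp >= 0x0600 && cp <= 0x074F)   /* Arabic, Syriac */
        || (cp >= 0x0780 && cp <= 0x07BF)   /* Thaana */
        || (cp >= 0xFB1D && cp <= 0xFDFF)   /* presentation forms A */
        || (cp >= 0xFE70 && cp <= 0xFEFF);  /* Arabic pres. forms B */
}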


Post subject: Re: Required characters for CJK support in an embedded system
PostPosted: Fri Jul 15, 2011 12:49 pm

Joined: Tue Jul 12, 2011 1:57 pm
Posts: 5
My engine does not do bidi rendering. I ditched everything I could find that was RTL, but I will have another look.

For Japanese, JIS X 0208 maps entirely into the BMP, so the BMP should be sufficient for Japanese writing purposes (at least in most applications, including this one).

I might have to drop Chinese support, or at least full Chinese support. The BMP seems to cover most of the older standards, but not all of Big5 or GB18030. I have noticed that a lot of apps, both desktop and embedded, make you choose between Traditional Chinese and Simplified Chinese because they cannot handle both at the same time, probably due to the choice of Big5 encoding or font...

Realistically, if there really are 50,000+ Chinese characters in use, I can't see how embedded systems could support them all. Take older mobile phones, or even older computer systems from the 80s and early 90s, before several megabytes of RAM became standard. Japanese systems from back then struggled too, and they only had 6-7,000 characters at most to deal with. The only way I can see of doing it would be to store additional characters on an SD card, but random access there is very slow, since you have to read a whole block at a time, compared to byte-level access to DataFlash EEPROMs. I can't imagine I am the only person to have this problem, though.
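A small RAM cache in front of the card might make that workable, though. A sketch - sd_read_block() and the font image layout here are hypothetical:

Code:
#include <stdint.h>
#include <string.h>

#define GLYPH_BYTES  64    /* e.g. 16x16 pixels at 2 bits/pixel */
#define CACHE_SLOTS  128
#define BLOCK_SIZE   512
#define FONT_LBA     2048  /* where the font image starts on card */

/* Hypothetical driver call: read one 512-byte block from SD. */
extern int sd_read_block(uint32_t lba, uint8_t buf[BLOCK_SIZE]);

static struct {
    uint32_t cp;                /* cached code point; 0 = empty */
    uint8_t  bits[GLYPH_BYTES];
} cache[CACHE_SLOTS];

/* Direct-mapped glyph cache over the SD card. Glyph bitmaps are
   assumed packed consecutively by code point in the font image, so
   a rare character costs one block read on a miss while common
   ones stay resident in RAM. */
const uint8_t *glyph_lookup(uint32_t cp)
{
    unsigned slot = cp % CACHE_SLOTS;
    if (cache[slot].cp != cp) {
        static uint8_t blk[BLOCK_SIZE];
        uint32_t offset = cp * GLYPH_BYTES;
        if (sd_read_block(FONT_LBA + offset / BLOCK_SIZE, blk) != 0)
            return 0;  /* read error: caller draws a tofu box */
        memcpy(cache[slot].bits, blk + offset % BLOCK_SIZE, GLYPH_BYTES);
        cache[slot].cp = cp;
    }
    return cache[slot].bits;
}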

Well, there is one other option: ditch anti-aliasing and use 1-bit bitmaps to save space. It will look rubbish, but at least it will work.


Post subject: Re: Required characters for CJK support in an embedded system
PostPosted: Sat Jul 16, 2011 3:33 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
mojo wrote:
Realistically, if there really are 50,000+ Chinese characters in use, I can't see how embedded systems could support them all.


There are not, and no, they can't.

Hence the development of a useful common subset called the International Ideographs Core (IICORE).

From the website:

Quote:
There are over 70,000 ideographs encoded in [Unicode]. A subset of these ideographs can provide conveniences for day-to-day electronic communication in Chinese on resource-limited devices, for example mobile devices. In addition, this subset can also facilitate global electronic communications in Chinese by providing a shared subset of ideographs.

... The IICORE contains 9,810 characters and it can be implemented in devices with limited memory, input/output capability, and/or applications where the use of complete ... ideographs repertoire is not feasible.


Supporting the IICORE is where you might want to start your efforts.


Post subject: Re: Required characters for CJK support in an embedded system
PostPosted: Fri Aug 05, 2011 11:49 am

Joined: Fri Apr 15, 2011 12:28 am
Posts: 16
I'm also trying to support Chinese (but on a computer where space is not limited). Is it possible for someone to answer the OP's question about what the required characters are for CJK support?

I'm having difficulty understanding what Extension A, B, C, D, half-width, etc. are... if I just want to allow a Chinese person to visit my website and type whatever they like in Chinese (including people's names), Japanese, or Korean, what are the minimum sets that I will need to have?

Thank you!


Post subject: Re: Required characters for CJK support in an embedded system
PostPosted: Fri Aug 05, 2011 4:32 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
thesun wrote:
I'm also trying to support Chinese (but on a computer where space is not limited). Is it possible for someone to answer the OP's question about what the required characters are for CJK support?

The OP's questions were based on the limitations of his embedded systems. So this is really a separate question.

thesun wrote:
I'm having difficulty understanding what Extension A, B, C, D, half-width, etc. are... if I just want to allow a Chinese person to visit my website and type whatever they like in Chinese (including people's names), Japanese, or Korean, what are the minimum sets that I will need to have?

This doesn't sound like you are interested in minimal support, but in full support instead. You would need to add at least the various CJK symbols, as well as Katakana/Hiragana/Hangul and Jamo, in addition to punctuation characters.

Many characters in the Extension blocks are rarely used - but there are exceptions, such as characters commonly used in Hong Kong that are part of Extension B. Also, some characters may be required for certain documents yet not needed outside the topics those documents cover.

If you just want a general purpose subset, the answer is similar to what I gave before, which is to use the IICORE as your starting point.

Since you have the space, you can add to that the characters (including symbols/punctuation) that occur in popular non-Unicode character sets for these countries. (For that you'd need the mapping tables.)


Post subject: Re: Required characters for CJK support in an embedded system
PostPosted: Thu Aug 11, 2011 12:35 pm

Joined: Tue Jul 12, 2011 1:57 pm
Posts: 5
I have come to the conclusion that Chinese support in Unicode is an epic failure, due to Han unification and an apparent lack of foresight or understanding of Chinese characters.

- There is no way to differentiate Japanese and Chinese characters other than trying to look at the text and guessing.

- It is impossible to create a universal font, because the same character code is used for two different glyphs, one Japanese and one Chinese.

- Any document format that does not include a language attribute cannot reliably contain Japanese or Chinese characters. For example, if I get an email encoded in UTF-16, there is no way to determine whether it contains Japanese or Chinese unless it uses characters that exist in only one of those two sets, or some kind of AI/human reads it.

- The initial allocation failed to include commonly used Japanese characters such as those for people's names, despite them being well defined and listed by the government and several dictionaries.

- The multiple encodings, while clever and mathematically sound, make implementing Unicode on systems with limited memory and performance very challenging. The knock-on effect is not only that most embedded systems don't support Unicode, but that many desktop apps don't either, because Unicode failed to displace existing, more practical options.

As an example, you can't allocate a buffer of N 16-bit words and expect to fit N characters into it, because some characters need two 16-bit code units - see the sketch after this list.

- To anyone not familiar with Japanese or Chinese these problems would not be obvious or well documented, so some apps that support Unicode still don't work properly for those languages.
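A sketch of the kind of thing I mean - even just counting the characters in a UTF-16 buffer means scanning for surrogate pairs (function name is just for illustration):

Code:
#include <stddef.h>
#include <stdint.h>

/* Count code points (not 16-bit code units) in a UTF-16 buffer:
   a high surrogate followed by a low surrogate is one character. */
size_t utf16_codepoint_count(const uint16_t *buf, size_t units)
{
    size_t count = 0;
    for (size_t i = 0; i < units; i++) {
        if (buf[i] >= 0xD800 && buf[i] <= 0xDBFF &&
            i + 1 < units &&
            buf[i + 1] >= 0xDC00 && buf[i + 1] <= 0xDFFF)
            i++;  /* skip the low half of the surrogate pair */
        count++;
    }
    return count;
}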



Because of all this I am having to produce two versions of my product, one with a Chinese font and one with a Japanese font. I don't know Chinese so I somehow have to test it blind.


Post subject: Re: Required characters for CJK support in an embedded system
PostPosted: Thu Aug 11, 2011 5:33 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
mojo wrote:
I have come to the conclusion that Chinese support in Unicode is an epic failure, due to Han unification and an apparent lack of foresight or understanding of Chinese characters..... I don't know Chinese...

In contrast, the developers of the unified Han repertoire included many people fluent in multiple East Asian languages. Their work, in turn, was based on earlier work that had been tested in practical applications. Altogether, I would estimate that several dozen highly expert specialists contributed to the Han unification work. Accusing these experts of a "lack of foresight or understanding of Chinese characters" is a bit rich coming from somebody who, by his own admission, doesn't know Chinese.
mojo wrote:
- There is no way to differentiate Japanese and Chinese characters other than trying to look at the text and guessing.

The characters are still the same, even when their customary rendering shows slight variations. Such language-based variations are not unique to Han characters; for example, the preferred shape of accents on the letter a differs between Polish and French as well.
mojo wrote:
- It is impossible to create a universal font, because the same character code is used for two different glyphs, one Japanese and one Chinese.

The number of glyphs needed is independent of the encoding. With modern font technologies like OpenType, you can tag which glyph should be used for which language environment.
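As a sketch of what that looks like in practice, here is a run shaped with an explicit language tag using the HarfBuzz shaping library (assuming the font actually carries 'locl' rules); the same code points come out as Japanese-style or Chinese-style glyphs depending on the tag:

Code:
#include <hb.h>

/* Shape a run tagged with a BCP 47 language ("ja", "zh-Hans", ...);
   a font with 'locl' rules can then substitute the regionally
   preferred glyph forms for the same code points. */
static void shape_tagged(hb_font_t *font, const char *utf8,
                         const char *bcp47)
{
    hb_buffer_t *buf = hb_buffer_create();
    hb_buffer_add_utf8(buf, utf8, -1, 0, -1);
    hb_buffer_set_script(buf, HB_SCRIPT_HAN);
    hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
    hb_buffer_set_language(buf, hb_language_from_string(bcp47, -1));
    hb_shape(font, buf, NULL, 0);
    /* ... read glyph ids back with hb_buffer_get_glyph_infos() ... */
    hb_buffer_destroy(buf);
}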
mojo wrote:
- Any document format that does not include a language attribute cannot reliably contain Japanese or Chinese characters. For example, if I get an email encoded in UTF-16, there is no way to determine whether it contains Japanese or Chinese unless it uses characters that exist in only one of those two sets, or some kind of AI/human reads it.

That is not correct. It can reliably contain such characters, and readers would recognize them as the correct characters, but they might complain about them looking "odd" if they are not rendered in their preferred type style.
mojo wrote:
- The initial allocation failed to include commonly used Japanese characters such as those for people's names, despite them being well defined and listed by the government and several dictionaries.

Unicode is growing. Additional characters are proposed, reviewed and approved on a regular basis. At this point, Unicode 6.0 is the basis on which you should implement. Whether something was present in Unicode 1.0 or not is not material.
mojo wrote:
- The multiple encodings, while clever and mathematically sound, make implementing Unicode on systems with limited memory and performance very challenging. The knock-on effect is not only that most embedded systems don't support Unicode, but that many desktop apps don't either, because Unicode failed to displace existing, more practical options.

You don't need to support all of them internally, and UTF-16 does not take more space than many legacy encodings for East Asian character sets. I don't think this argument holds.
mojo wrote:
As an example, you can't allocate a buffer of N 16-bit words and expect to fit N characters into it, because some characters need two 16-bit code units.

You can't really do that in the legacy character sets either. They alternated between 1, 2, and sometimes 3-byte sequences for characters. So much for being more "practical".
mojo wrote:
- To anyone not familiar with Japanese or Chinese these problems would not be obvious or well documented, so some apps that support Unicode still don't work properly for those languages.

High quality support for any language requires features that not every engineer knows about (although some do well for their native language). I've implemented Japanese support based on Unicode for desktop and server applications and I don't see the issues that you mention as being the most difficult.


Post subject: Re: Required characters for CJK support in an embedded system
PostPosted: Thu Aug 11, 2011 7:42 pm

Joined: Wed Feb 10, 2010 2:51 pm
Posts: 16
Location: Salt Lake City
mojo wrote:
I have come to the conclusion that Chinese support in Unicode is an epic failure, due to Han unification and an apparent lack of foresight or understanding of Chinese characters.


As Asmus points out, the Unihan repertoire has been developed by native speakers of (Mandarin) Chinese, Cantonese, Japanese, Korean, and Vietnamese, including representatives from major font vendors, software companies, and universities throughout East Asia. The model used for Han unification was developed by JIS for Japanese character set standards and slightly refined by the Chinese. The issues you raise were all well-understood when the Unihan effort started over twenty years ago.

Quote:
- There is no way to differentiate Japanese and Chinese characters other than trying to look at the text and guessing.


Why do you want to do this? In general, Japanese users want to see kanji written with Japanese fonts, whether they're used to write Japanese or Chinese. The simplest way to have this happen is for the user to set their locale or manually set the font. Usually setting the locale is enough.
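In practice that can be as simple as a locale-to-font lookup; a sketch, with made-up font file names:

Code:
#include <string.h>

/* Pick the CJK font variant from the user's configured locale
   instead of trying to guess a language from code points. */
const char *cjk_font_for_locale(const char *locale)
{
    if (strncmp(locale, "ja", 2) == 0) return "font_jp.bin";
    if (strncmp(locale, "ko", 2) == 0) return "font_kr.bin";
    if (strncmp(locale, "zh_TW", 5) == 0 ||
        strncmp(locale, "zh_HK", 5) == 0) return "font_tc.bin";
    if (strncmp(locale, "zh", 2) == 0)  return "font_sc.bin";
    return "font_jp.bin";  /* arbitrary default for this sketch */
}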

Quote:
- It is impossible to create a universal font, because the same character code is used for two different glyphs, one Japanese and one Chinese.


Actually, it's impossible to create a universal font because TrueType and its descendants can't have more than 65,536 glyphs. Having a single font file with simultaneous support for Chinese and Japanese is relatively straightforward using TrueType Collection (.ttc) fonts. You do have to distinguish the Japanese variant and the Chinese variant in the UI, of course; but typical users will want only one and not the other anyway. And depending on character codes to distinguish Chinese and Japanese wouldn't do you a lick of good, because it's the glyph count which is the problem, not the character count.

If you really, seriously do not want to work with more than one font, or get the locale from the system, or let the user switch between fonts—well, then, yes, you're hosed.

Quote:
- Any document format that does not include a language attribute cannot reliably contain Japanese or Chinese characters. For example, if I get an email encoded in UTF-16, there is no way to determine whether it contains Japanese or Chinese unless it uses characters that exist in only one of those two sets, or some kind of AI/human reads it.


You mean "differentiate" instead of "contain," of course. And again, why do you want to do this? If it's only to set the appropriate font for display, then using a font derived from the locale should be enough. I'll reiterate: Japanese users want to see Japanese glyphs, period. They don't care what the language in question is.

Quote:
- The initial allocation failed to include commonly used Japanese characters such as those for people's names, despite them being well defined and listed by the government and several dictionaries.


The initial allocation was a superset of the major character sets then in use for simplified and traditional Chinese, Japanese, and Korean. Common Cantonese-specific characters were missing, and support for Vietnamese was fairly limited, largely because those two were terra incognita at the time. If Unicode was missing "commonly used Japanese characters" in 1991, it was because the Japanese failed to include them in their own standards, which makes this objection rather disingenuous.

In any event, when the matter was being argued over in the early 1990s, opponents of Unicode failed to produce a single kanji which was in common use for personal names and yet not encoded. Even now, the bulk of the characters proposed for inclusion by Japan are not characters required for personal names, as opposed to the proposals coming from China.

I'm sorry to sound brusque, but this was a canard twenty years ago, and one would hope that we could simply let this particular horse rest in peace.

Quote:
- The multiple encodings, while clever and mathematically sound, make implementing Unicode on systems with limited memory and performance very challenging. The knock-on effect is not only that most embedded systems don't support Unicode, but that many desktop apps don't either, because Unicode failed to displace existing, more practical options.


Conversion between the various UTFs is trivial and requires a small amount of code. Libraries and samples illustrating how to do it with complete error handling are readily available. It's also very fast. The typical way of doing it is to use one UTF internally and do the conversions at input and output.
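For instance, the entire output half of such a converter - emitting UTF-8 for a decoded code point - is only a couple dozen lines. A sketch:

Code:
#include <stdint.h>

/* Encode one code point as UTF-8. Returns bytes written (1-4),
   or 0 for surrogates and out-of-range values. */
int utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;  /* lone surrogates are not valid scalar values */
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}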

Interconversion between various UTFs is a far, far simpler problem than interconversion between the dozens of East Asian character sets which antedate Unicode.

In any event, such software manages without Unicode by the simple fact that it is generally not trying to solve the same problem that Unicode is; that is, it is deliberately limiting its support to a small number of languages, perhaps as few as one. Writing software which provides simultaneous support for all the major languages of East Asia—let alone the dozens of scripts and hundreds of languages covered by Unicode 6.0—is far, far more difficult without Unicode.

Quote:
As an example, you can't allocate a buffer of N 16-bit words and expect to fit N characters into it, because some characters need two 16-bit code units.


No, you can't. Would you prefer to be unable to allocate a buffer of N 8-bit bytes and be unable to fit N characters in it because some are single-byte and some are double-byte? That's the way the major East Asian standards worked in the pre-Unicode days. Would you prefer to allocate buffers of N 32-bit words and fit N characters in them? That's what you would have to do without Han unification. And if that's the case, just use UTF-32.

(And you can't do that anyway, because the assumption that one character == one code unit in memory == one grapheme is a gross oversimplification for languages generally. It is mostly true in East Asia and the US, which is why people keep thinking it should be true everywhere all the time.)

Quote:
- To anyone not familiar with Japanese or Chinese these problems would not be obvious or well documented, so some apps that support Unicode still don't work properly for those languages.


Out of curiosity, do you have specific applications in mind? I'm aware of software that still doesn't handle Unicode correctly in a number of different ways, but not any that has trouble with Chinese and Japanese.

Quote:
Because of all this I am having to produce two versions of my product, one with a Chinese font and one with a Japanese font. I don't know Chinese so I somehow have to test it blind.


And you can't ship it with both and let the system locale or the user determine which font to use because—?


Post subject: Re: Required characters for CJK support in an embedded system
PostPosted: Wed Sep 07, 2011 12:15 pm

Joined: Tue Jul 12, 2011 1:57 pm
Posts: 5
Tseng wrote:
Quote:
- There is no way to differentiate Japanese and Chinese characters other than trying to look at the text and guessing.


Why do you want to do this? In general, Japanese users want to see kanji written with Japanese fonts, whether they're used to write Japanese or Chinese. The simplest way to have this happen is for the user to set their locale or manually set the font. Usually setting the locale is enough.


My friend, who I studied Japanese with, also speaks Chinese, as well as being fluent in Korean and English. He gets emails from people in Chinese and Japanese, but because Gmail (and almost all other webmail apps) uses Unicode, the browser has no way of knowing which font to use. He can either change the font manually every time or live with characters that look wrong.

He put me onto a couple of good Chinese musicians, so I bought some MP3s. Because there is no way of knowing whether the file names or metadata are Japanese or Chinese, all media players use the wrong font. My embedded system will interface with media players over USB, which also uses Unicode. The device name in Device Manager uses the wrong font on a Chinese OS, and since there is no way of knowing which language song metadata is in, my device also uses the wrong font on its LCD.

Fortunately I don't have any Sarah Brightman music. She does songs in both Japanese and Chinese, sometimes both in the same song. Her lyrics can't be represented unambiguously in Unicode plain text: there is no way to display them correctly in a plain text document; you would have to use character-level metadata. It gets even worse when dealing with a single string, like, say, the authors of a document written jointly by a Chinese and a Japanese person.

Quote:
And you can't ship it with both and let the system locale or the user determine which font to use because—?


In the real world, people expect this stuff to just work. The whole point of creating a unified character encoding scheme should be to make things work properly; thanks to Unicode, that won't happen. Even if I add a switch, it will still be wrong in cases where more than one language is displayed. Unless you have any suggestions for how to work around this?

