From: Addison Phillips [wM] (email@example.com)
Date: Mon Jun 28 2004 - 12:49:46 CDT
Your question is incomplete. There are several Unicode encodings to choose from and the "number of bytes" question is influenced by your choice of encoding, as well as by the data you choose.
For example, UTF-8 is a multibyte encoding of Unicode in which each character is 1, 2, 3, or 4 bytes long, depending on the character. The majority of characters written in Simplified Chinese are three bytes long in this encoding.
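To see the UTF-8 lengths concretely, here is a minimal Python sketch. The helper name `utf8_length` and the sample characters are my own choices for illustration, not anything from your data.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# Illustrative samples (assumed, not from the original question):
samples = {
    "LATIN CAPITAL LETTER A": "A",              # U+0041
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",  # U+00E9
    "CJK ideograph (zhong)": "\u4e2d",          # U+4E2D, common in Simplified Chinese
    "CJK Extension B ideograph": "\U00020000",  # U+20000, a supplementary-plane character
}

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf8_length(ch)} byte(s)")
```

The common Chinese character comes out at three bytes, and only the supplementary-plane character needs four.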
UTF-16 encodes the vast majority of characters in most sets of data using two bytes per character. Some Chinese characters are encoded on the higher (or "supplementary") planes of Unicode and require two two-byte code units (a "surrogate pair"), four bytes in total, in UTF-16. These characters are generally considered quite rare in "average" data, and it is unlikely that your data will contain more than a few of them in any event.
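If you want to check how many of these rarer characters your own data contains, something like the following Python sketch would do it. The function names and the sample string are assumptions made up for illustration.

```python
def utf16_length(ch: str) -> int:
    """Return the number of bytes one character occupies in UTF-16.

    "utf-16-le" is used so that no byte order mark is added to the count.
    """
    return len(ch.encode("utf-16-le"))

def count_supplementary(text: str) -> int:
    """Count characters above U+FFFF, i.e. those needing a surrogate pair."""
    return sum(1 for ch in text if ord(ch) > 0xFFFF)

# Assumed sample: two common characters plus one supplementary-plane character.
text = "\u4e2d\u6587\U00020000"

print(utf16_length("\u4e2d"))      # two bytes for a common character
print(utf16_length("\U00020000"))  # four bytes for a surrogate pair
print(count_supplementary(text))   # how many characters need a surrogate pair
```

Running `count_supplementary` over a representative sample of real data is a quick way to confirm whether the "two bytes per character" approximation holds for it.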
Probably, though, you are not starting your question in the right place. Why do you care about the number of bytes in a character? The reasons you give will determine whether a specific encoding is more (or less) suited for use than another encoding (or even character set, such as a legacy, non-Unicode, character set/encoding). For example, if you are trying to determine whether Unicode is more (or less) efficient than a legacy solution, then I think you'll find that the performance issues are somewhere other than the average byte count per character. If you are worried about storage (disk, database, etc.), then the specifics of your situation will determine what the "right answer" may be for you.
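If storage is the concern, measuring a representative sample is more useful than reasoning about averages. A rough Python sketch follows; GB2312 is used here only as an assumed example of a legacy encoding (you did not say which one you use), and `byte_counts` and the sample text are hypothetical.

```python
def byte_counts(text: str, encodings=("utf-8", "utf-16-le", "gb2312")) -> dict:
    """Return the encoded size in bytes of `text` for each encoding.

    Raises UnicodeEncodeError if a character cannot be represented in
    one of the encodings (possible with a legacy set such as GB2312).
    """
    return {enc: len(text.encode(enc)) for enc in encodings}

# Assumed sample: two Chinese characters followed by three ASCII letters.
sample = "\u4e2d\u6587abc"

for enc, size in byte_counts(sample).items():
    print(f"{enc}: {size} bytes")
```

The result depends heavily on the mix of ASCII and Chinese text, which is why the specifics of your data matter more than any single per-character figure.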
Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
Internationalization is an architecture.
It is not a feature.
From: firstname.lastname@example.org [mailto:email@example.com] On Behalf Of Duraivel
Sent: June 27, 2004 23:38
Subject: number of bytes for simplified chinese
I would like to know the number of bytes required for the Simplified Chinese language. Can we represent all the characters of Simplified Chinese in Unicode using just two bytes?