RE: number of bytes for simplified chinese

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Mon Jun 28 2004 - 12:49:46 CDT


    Hi Duraivel,

    Your question is incomplete. There are several Unicode encodings to choose from and the "number of bytes" question is influenced by your choice of encoding, as well as by the data you choose.

    For example, UTF-8 is a multibyte encoding of Unicode, where each character is 1, 2, 3, or 4 bytes long, depending on the character. The majority of characters written in Simplified Chinese will be three bytes long in this encoding.
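
    A minimal Python sketch of the UTF-8 lengths described above. The sample characters and the helper name utf8_length are illustrative assumptions, chosen only to show one character from each length class:

```python
# Illustrative sample characters (assumptions, not data from this thread).
SAMPLES = {
    "A": "Latin letter, U+0041",
    "\u00e9": "accented Latin letter, U+00E9",
    "\u4e2d": "common Chinese character, U+4E2D",
    "\U00020000": "supplementary-plane Chinese character, U+20000",
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes needed to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch, description in SAMPLES.items():
        print(f"U+{ord(ch):04X} ({description}): {utf8_length(ch)} byte(s)")
```

    Running the same helper over a sample of your own text gives a better estimate than any general rule, since the mix of ASCII, punctuation, and Chinese characters varies from one data set to another.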

    UTF-16 encodes characters using two bytes per character for the vast majority of characters in most sets of data. Some Chinese characters are encoded on the higher (or "supplementary") planes of Unicode and require two two-byte code units (a "surrogate pair") to represent them in UTF-16. These characters are generally considered to be quite rare in "average" data, and it is unlikely that your data will contain more than a few of them in any event.
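
    The surrogate-pair behaviour can be checked with a short Python sketch. The function names and the two sample characters are assumptions made for illustration; the "utf-16-le" codec is used so that no byte order mark is counted:

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to encode text in UTF-16."""
    # "utf-16-le" writes no byte order mark, so bytes // 2 is the unit count.
    return len(text.encode("utf-16-le")) // 2


def needs_surrogate_pair(ch: str) -> bool:
    """True if ch lies above U+FFFF, i.e. on a supplementary plane."""
    return ord(ch) > 0xFFFF


# Illustrative sample characters (assumptions, not data from this thread).
BMP_CHAR = "\u4e2d"             # U+4E2D, on the Basic Multilingual Plane
SUPPLEMENTARY_CHAR = "\U00020000"  # U+20000, CJK Extension B

if __name__ == "__main__":
    for ch in (BMP_CHAR, SUPPLEMENTARY_CHAR):
        units = utf16_code_units(ch)
        print(f"U+{ord(ch):04X}: {units} code unit(s), {units * 2} bytes, "
              f"surrogate pair: {needs_surrogate_pair(ch)}")
```

    Counting how many characters in a data sample return True from needs_surrogate_pair is one way to confirm whether supplementary characters are as rare in your data as expected.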

    Probably, though, you are not starting your question in the right place. Why do you care about the number of bytes in a character? The reasons you give will determine whether a specific encoding is more (or less) suited for use than another encoding (or even character set, such as a legacy, non-Unicode, character set/encoding). For example, if you are trying to determine whether Unicode is more (or less) efficient than a legacy solution, then I think you'll find that the performance issues are somewhere other than the average byte count per character. If you are worried about storage (disk, database, etc.), then the specifics of your situation will determine what the "right answer" may be for you.

    Best Regards,

    Addison
    Addison P. Phillips
    Director, Globalization Architecture
    webMethods | Delivering Global Business Visibility
    http://www.webMethods.com
    Chair, W3C Internationalization (I18N) Working Group
    Chair, W3C-I18N-WG, Web Services Task Force
    http://www.w3.org/International

    Internationalization is an architecture.
    It is not a feature.

      -----Original Message-----
      From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Duraivel
      Sent: June 27, 2004 23:38
      To: unicode@unicode.org
      Subject: number of bytes for simplified chinese

      hi,

      I would like to know the number of bytes required for the Simplified Chinese language. Can we represent all the characters of Simplified Chinese in Unicode using just two bytes?

      regards
      duraivel


