RE: Multibyte languages - Chinese - double bytes or more bytes in each character

From: Phillips, Addison (addison@amazon.com)
Date: Tue Dec 23 2008 - 13:43:52 CST


“It depends”

The character encoding (or encodings) you use in your project determines the number of “code units” (typically bytes) needed to encode a specific character. Depending on your programming language, runtime environment (Windows, Mac, Unix, etc.), and so forth, you might choose any of several character encodings for Chinese-language text.

The most common Unicode encodings are UTF-16 and UTF-8. UTF-16 uses 16-bit code units, so each character takes two bytes, except for supplementary characters (those outside the Basic Multilingual Plane), which take four bytes as a surrogate pair. UTF-8 is a multibyte encoding that requires between one and four bytes per character; most Chinese characters take three.
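
To make the byte counts concrete, here is a minimal sketch in Python (my choice for illustration; the question did not name a language). The function name `code_unit_sizes` and the sample characters are assumptions made for this example only.

```python
def code_unit_sizes(text):
    """Map each character in text to (UTF-8 byte count, UTF-16 byte count)."""
    sizes = {}
    for ch in text:
        utf8_len = len(ch.encode("utf-8"))
        # "utf-16-le" is used so no byte order mark is added to the count.
        utf16_len = len(ch.encode("utf-16-le"))
        sizes[ch] = (utf8_len, utf16_len)
    return sizes


# Sample characters (illustrative): ASCII "A", the common ideograph U+4E2D,
# and U+20000, an ideograph outside the Basic Multilingual Plane.
samples = "A\u4e2d\U00020000"
for ch, (u8, u16) in code_unit_sizes(samples).items():
    print(f"U+{ord(ch):04X}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

For these three samples the UTF-8 sizes are 1, 3, and 4 bytes and the UTF-16 sizes are 2, 2, and 4 bytes, so neither encoding can be treated as "always two bytes" for Chinese text.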

Most “legacy” (non-Unicode) multibyte encodings are variable width (as UTF-8 is). Depending on the encoding, the maximum width can be two, three, or occasionally four bytes per character.
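
The same kind of check works for legacy Chinese encodings. This is a rough sketch, assuming Python's built-in `gbk`, `big5`, and `gb18030` codecs as representative examples; the helper name `legacy_sizes` is hypothetical, and which encodings matter for your project is for you to determine.

```python
def legacy_sizes(ch, encodings=("gbk", "big5", "gb18030")):
    """Return the encoded byte length of ch in each encoding.

    A value of None means the encoding cannot represent the character.
    """
    result = {}
    for name in encodings:
        try:
            result[name] = len(ch.encode(name))
        except UnicodeEncodeError:
            result[name] = None
    return result


# Illustrative samples: ASCII "A", ideograph U+4E2D, and U+20000.
for ch in "A\u4e2d\U00020000":
    print(f"U+{ord(ch):04X}: {legacy_sizes(ch)}")
```

In this sketch the ASCII letter takes one byte in all three encodings and U+4E2D takes two, while U+20000 takes four bytes in GB18030 and cannot be encoded at all in GBK or Big5. That is the variable-width behavior described above.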

Since you don’t seem very sure of this, I would suggest you take a look at some of the training materials that can be found on the web. Some good places to look:

   My Unicode Conference Internationalization Tutorial http://www.inter-locale.com
   W3C International site http://www.w3.org/International
   Character Model for the WWW http://www.w3.org/TR/CharMod

And obviously the Unicode web site has some excellent materials on encodings.

Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126
Chair – W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Yiru Chen
Sent: Tuesday, December 23, 2008 11:20 AM
To: unicode@unicode.org; cldr-users@unicode.org
Subject: Multibyte languages - Chinese - double bytes or more bytes in each character

I’m working on a multi-byte language project. I know Chinese is a multibyte language, but I’m not sure whether that means each Chinese character takes exactly two bytes, or a variable number of bytes (i.e., two, three, or more).

Can someone who knows this stuff confirm?

Thanks!

Yiru


