RE: Multibyte languages - Chinese - double bytes or more bytes in each character

From: Phillips, Addison ([email protected])
Date: Tue Dec 23 2008 - 13:43:52 CST

Next message: James Kass: "Re: Multibyte languages - Chinese - double bytes or more bytes in each character"
Previous message: Yiru Chen: "Multibyte languages - Chinese - double bytes or more bytes in each character"
In reply to: Yiru Chen: "Multibyte languages - Chinese - double bytes or more bytes in each character"
Next in thread: James Kass: "Re: Multibyte languages - Chinese - double bytes or more bytes in each character"

“It depends”

The character encoding or encodings you use in your project will determine the number of “code units” (typically bytes) that you will use to encode specific characters. Depending on your programming language, runtime environment (Windows, Mac, Unix, etc.), and so forth, you might choose different character encodings to use with Chinese language text.

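The most common Unicode encodings are UTF-16 and UTF-8. UTF-16 uses 16-bit code units, so most characters take two bytes; characters outside the Basic Multilingual Plane (including some rarer Chinese characters) are encoded as a surrogate pair and take four bytes. UTF-8 is a multibyte encoding that requires between one and four bytes per character; most Chinese characters take three bytes in UTF-8.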
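As a rough illustration of the byte counts described above, here is a minimal Python sketch. The helper name encoded_len and the three sample characters are my own choices for illustration; "utf-16-le" is used so that no byte order mark is added to the count.

```python
# Hypothetical helper: number of bytes one character occupies in an encoding.
def encoded_len(ch: str, encoding: str) -> int:
    return len(ch.encode(encoding))

# Sample characters (chosen for illustration only):
#   U+0041  LATIN CAPITAL LETTER A
#   U+4E2D  a common Chinese character in the Basic Multilingual Plane
#   U+20000 a rare Chinese character outside the BMP (CJK Extension B)
samples = ["\u0041", "\u4e2d", "\U00020000"]

for ch in samples:
    utf8 = encoded_len(ch, "utf-8")
    utf16 = encoded_len(ch, "utf-16-le")  # "-le" avoids counting a BOM
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

On these three samples the counts should come out as 1/2, 3/2 and 4/4 bytes (UTF-8/UTF-16), which is the "it depends" in practice: the same Chinese text changes size with the encoding.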

Most “legacy” (non-Unicode) multibyte encodings are variable width (like UTF-8 is). Depending on the encoding, they can have a maximum width of two, three, or occasionally four bytes per character.
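To make the variable-width point concrete for legacy encodings, here is a small Python sketch using codecs that ship with the standard library. The choice of GBK, Big5 and GB18030 as examples is mine, as is the helper name byte_widths; your project may use entirely different encodings.

```python
# Hypothetical helper: byte length of a string in each listed encoding.
def byte_widths(text: str, encodings: list[str]) -> dict[str, int]:
    return {enc: len(text.encode(enc)) for enc in encodings}

# Example encodings chosen for illustration (all are Python stdlib codecs).
legacy = ["gbk", "big5", "gb18030"]

# An ASCII letter is a single byte in each of these encodings.
print(byte_widths("A", legacy))

# A common Chinese character (U+4E2D) takes two bytes in each of them.
print(byte_widths("\u4e2d", legacy))

# GB18030 can also represent characters outside the BMP, using four bytes.
print(byte_widths("\U00020000", ["gb18030"]))
```

So even inside one legacy encoding, a string that mixes ASCII and Chinese has characters of different widths, and you cannot find character boundaries by assuming a fixed size.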

Since you don’t seem very sure of this, I would suggest you take a look at some of the training materials that can be found on the web. Some good places to look:

My Unicode Conference Internationalization Tutorial http://www.inter-locale.com
W3C International site http://www.w3.org/International
Character Model for the WWW http://www.w3.org/TR/CharMod

And obviously the Unicode web site has some excellent materials on encodings.

Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126
Chair – W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

From: [email protected] [mailto:[email protected]] On Behalf Of Yiru Chen
Sent: Tuesday, December 23, 2008 11:20 AM
To: [email protected]; [email protected]
Subject: Multibyte languages - Chinese - double bytes or more bytes in each character

I’m working on a multi-byte language project. I know Chinese is a multibyte language, but I’m not sure whether that means each Chinese character takes exactly two bytes or a variable number of bytes (i.e., two, three, or more).

Can someone who knows this stuff confirm?

Thanks!

Yiru


This archive was generated by hypermail 2.1.5 : Fri Jan 02 2009 - 15:33:07 CST