From: Phillips, Addison (addison@amazon.com)
Date: Sat Jun 27 2009 - 11:30:11 CDT
Hello Venu,
Let me see if I understand what you’re asking.
The Unicode character set defines characters. One of these characters, at code point 0, is the NULL character. See [1]
UTF-16 is a character encoding of the Unicode character set. In UTF-16, each Unicode code point (“character”) is represented by one or (occasionally) two 16-bit “code units” [by comparison, a byte is an 8-bit “code unit”]. The NULL character, in this encoding, is represented by a 16-bit code unit in which all of the bits are set to 0. A UTF-16 string consists of a sequence of 16-bit code units and it is a convention of many programming languages that the character NULL marks the end of a string buffer. In these programming languages, the appearance of a 16-bit NULL will cause the string to terminate.
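To make the "one or (occasionally) two code units" point concrete, here is a minimal sketch in C of how a single code point maps to UTF-16. The function name utf16_encode_cp is hypothetical, not from any library. It assumes the caller supplies a two-element output buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit code units written (1 or 2),
 * or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode_cp(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                      /* surrogates are not characters */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;         /* one code unit; U+0000 becomes 0x0000 */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        uint32_t v = cp - 0x10000;     /* 20 bits, split into 10 + 10 */
        out[0] = (uint16_t)(0xD800 | (v >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (v & 0x3FF));  /* low surrogate  */
        return 2;
    }
    return 0;                          /* beyond U+10FFFF */
}
```

Both halves of a two-unit sequence fall in the range 0xD800 to 0xDFFF, so neither can be zero. A 16-bit zero can therefore only ever be the NULL character itself.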
If, by “character” or “value zero”, you mean the (8-bit) byte value zero, then, yes, there will be a lot of “zero” bytes in a UTF-16 encoded buffer: these do not represent the character NULL on their own. This doesn’t cause buffer termination, because one does not use an 8-bit byte to access a UTF-16 string. If you have a “uint16_t[]” for your UTF-16 string, your pointer will advance 16 bits, rather than 8 bits, at a time through the buffer. A single “code unit” in this string is always 16 bits long. Only a 16-bit “null” represents the character NULL.
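A small sketch in C of the difference between walking the buffer in 16-bit steps and looking at its raw bytes. The function names utf16_len and zero_bytes are made up for illustration. The sketch assumes the buffer really does end with a 16-bit zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length in 16-bit code units, not counting
 * the terminating 0x0000 code unit. */
size_t utf16_len(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0x0000) {   /* compares a whole 16-bit unit */
        n++;
    }
    return n;
}

/* Hypothetical helper: count zero-valued bytes in the first n code
 * units, i.e. what a byte-oriented scan would see. */
size_t zero_bytes(const uint16_t *s, size_t n)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;
    for (size_t i = 0; i < n * sizeof(uint16_t); i++) {
        if (p[i] == 0) {
            count++;
        }
    }
    return count;
}
```

For the buffer { 0x0048, 0x0069, 0x0000 } ("Hi"), utf16_len returns 2, while zero_bytes over those two code units also returns 2: each ASCII-range character carries one zero byte. A byte-oriented strlen() on the same memory would stop after at most one byte, which is why it is the wrong tool here.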
If you want to use bytes (char* in C), then you would use a different character encoding of Unicode, UTF-8. In this encoding, the null byte represents only the character NULL and is never part of a multi-byte sequence.
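To show why the null byte is unambiguous in UTF-8, here is a minimal encoder sketch in C. The name utf8_encode_cp is hypothetical and the caller is assumed to provide a four-byte buffer. Every byte of a multi-byte sequence has its high bit set, so 0x00 can only come from U+0000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is
 * not a valid Unicode scalar value. */
size_t utf8_encode_cp(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* 0xxxxxxx: only U+0000 gives 0x00 */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                      /* surrogates are not encodable */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond U+10FFFF */
}
```

U+0100, whose UTF-16 code unit 0x0100 contains a zero byte, comes out here as the two bytes 0xC4 0x80, so char*-based functions that stop at a zero byte behave as expected on UTF-8 text.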
I hope that helps explain it. You might also glance at my character encoding tutorial [2] or even order a copy of the Unicode Guide [3] to help you out.
Regards,
Addison
[1] http://www.unicode.org/charts/PDF/U0000.pdf
[2] http://www.inter-locale.com/whitepaper/Encodings-and-Unicode.pptx
[3] http://www.amazon.com/exec/obidos/tg/detail/-/1423201809
Addison Phillips
Globalization Architect -- Lab126
Internationalization is not a feature.
It is an architecture.
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Venugopalan G
Sent: Saturday, June 27, 2009 5:35 AM
To: unicode@unicode.org
Subject: Zero termination
Hi grp,
I just want to know if a valid UTF-16 string can contain the value zero (0), not the character zero but the 16-bit value zero.
Like, if I iterate through each Unicode character (16 bits), will I find zero at any time? Is zero a valid code point or a part of a code point?
Basically, can I use zero to represent termination of a UTF-16 string? Because if zero is in the middle of the string, then the program will terminate in the wrong place.
Thanks,
Venu