RE: Zero termination

From: Phillips, Addison (addison@amazon.com)
Date: Sat Jun 27 2009 - 11:30:11 CDT

Next message: Doug Ewell: "Re: Zero termination"

Previous message: John (Eljay) Love-Jensen: "RE: Zero termination"
In reply to: Venugopalan G: "Zero termination"
Next in thread: Andrew Lipscomb: "Re: Zero termination"

Hello Venu,

Let me see if I understand what you’re asking.

The Unicode character set defines characters. One of these, at code point U+0000, is the NULL character (see [1]), so zero is a valid code point.

UTF-16 is a character encoding of the Unicode character set. In UTF-16, each Unicode code point (“character”) is represented by one or (occasionally) two 16-bit “code units” (by comparison, a byte is an 8-bit “code unit”). The NULL character, in this encoding, is represented by a single 16-bit code unit in which all of the bits are set to 0. That code unit is never part of another character: the two-unit sequences (surrogate pairs) use only code units in the range 0xD800 to 0xDFFF. A UTF-16 string consists of a sequence of 16-bit code units, and it is a convention of many programming languages that the character NULL marks the end of a string. In these languages, the first 16-bit NULL code unit terminates the string.
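To make the convention concrete, here is a minimal sketch in C of a length function for a zero-terminated UTF-16 buffer. The name utf16_length is hypothetical (it is not a standard library function); it simply plays the role that strlen() plays for char strings, and it assumes the buffer really does end with a 0x0000 code unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the 16-bit code units that precede the
   terminating 0x0000 code unit. Assumes the buffer is zero-terminated. */
size_t utf16_length(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0x0000) {
        n++;
    }
    return n;
}
```

Note that the result is a count of code units, not characters: a character outside the Basic Multilingual Plane contributes two code units (a surrogate pair), neither of which is zero.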

If, by “character” or “value zero”, you mean the (8-bit) byte value zero, then, yes, there will be a lot of zero bytes in a UTF-16 encoded buffer, but these do not represent the character NULL on their own. They do not cause termination, because one does not use an 8-bit byte to access a UTF-16 string. If you have a “uint16_t[]” for your UTF-16 string, your pointer advances 16 bits, rather than 8 bits, at a time through the buffer. A single code unit in this string is always 16 bits long. Only a 16-bit zero represents the character NULL.
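The difference between zero bytes and zero code units can be sketched as follows. Both function names are hypothetical, chosen only for this illustration. The first looks at the buffer one byte at a time, the second one 16-bit code unit at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count individual zero bytes in the first
   `units` code units of a UTF-16 buffer. */
size_t count_zero_bytes(const uint16_t *s, size_t units)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t zeros = 0;
    for (size_t i = 0; i < units * sizeof(uint16_t); i++) {
        if (p[i] == 0) {
            zeros++;
        }
    }
    return zeros;
}

/* Hypothetical helper: count code units equal to 0x0000 in the
   first `units` code units of the same buffer. */
size_t count_zero_units(const uint16_t *s, size_t units)
{
    size_t zeros = 0;
    for (size_t i = 0; i < units; i++) {
        if (s[i] == 0x0000) {
            zeros++;
        }
    }
    return zeros;
}
```

For the buffer { 0x0048, 0x0069, 0x0000 } ("Hi" plus terminator), the byte-wise count is 4 on either byte order, while the code-unit count is 1: only the terminator is a NULL character.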

If you want to use bytes (char* in C), then you would use a different character encoding of Unicode: UTF-8. In this encoding, the zero byte represents only the character NULL and is never part of a multi-byte sequence.
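As an illustration of that UTF-8 property, here is a minimal encoder sketch. The function name utf8_encode and its return convention (number of bytes written, or 0 for a value that is not a Unicode scalar value) are assumptions made for this example, not an existing API. Every byte of a multi-byte sequence has its high bit set, so a zero byte can only come from U+0000 itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out[].
   Returns the number of bytes written, or 0 if cp is a surrogate
   or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x80) {
        /* Single byte; this is the only case that can yield 0x00. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For example, U+0100 is the single code unit 0x0100 in UTF-16, which contains a zero byte, but in UTF-8 it becomes the two bytes 0xC4 0x80, neither of which is zero.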

I hope that helps explain it. You might also glance at my character encoding tutorial [2] or even order a copy of the Unicode Guide [3] to help you out.

Regards,

Addison

[1] http://www.unicode.org/charts/PDF/U0000.pdf
[2] http://www.inter-locale.com/whitepaper/Encodings-and-Unicode.pptx
[3] http://www.amazon.com/exec/obidos/tg/detail/-/1423201809

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Venugopalan G
Sent: Saturday, June 27, 2009 5:35 AM
To: unicode@unicode.org
Subject: Zero termination

Hi group,

I just want to know if a valid UTF-16 string can contain the value zero (0): not the character zero, but the 16-bit value zero.
That is, if I iterate through each Unicode character (16 bits), will I find zero at any time? Is zero a valid code point or a part of a code point?
Basically, can I use zero to represent termination of a UTF-16 string? If zero can appear in the middle of a string, then the program will terminate in the wrong place.

Thanks,
Venu

This archive was generated by hypermail 2.1.5 : Sat Jun 27 2009 - 11:34:41 CDT