RE: Zero termination

From: Phillips, Addison (addison@amazon.com)
Date: Sat Jun 27 2009 - 11:30:11 CDT

    Hello Venu,

    Let me see if I understand what you’re asking.

    The Unicode character set defines characters. One of these characters, at code point 0 (U+0000), is the NULL character. See [1].

    UTF-16 is a character encoding of the Unicode character set. In UTF-16, each Unicode code point (“character”) is represented by one or (occasionally) two 16-bit “code units” [by comparison, a byte is an 8-bit “code unit”]. The NULL character, in this encoding, is represented by a single 16-bit code unit in which all of the bits are set to 0. That zero code unit is never part of a two-unit sequence, because those sequences (surrogate pairs) use only values in the range 0xD800 to 0xDFFF. A UTF-16 string consists of a sequence of 16-bit code units, and it is a convention of many programming languages that the character NULL marks the end of a string. In these programming languages, the first 16-bit NULL terminates the string, so a NULL-terminated UTF-16 string cannot contain the character NULL itself.
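
    The termination convention described above can be sketched in C roughly as follows. This is only an illustration: the function name utf16_length is hypothetical, and the string is assumed to be held as an array of uint16_t code units ending in a zero unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the 16-bit code units that come before
   the terminating 0x0000 code unit. The comparison is made on whole
   16-bit units, never on individual bytes. */
size_t utf16_length(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}
```

    For the code units 0x0041 0x0100 0xD83D 0xDE00 0x0000, this returns 4: the zero byte inside 0x0100 does not end the string, and the surrogate pair counts as two code units.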

    If, by “character” or “value zero”, you mean the (8-bit) byte value zero, then, yes, there will be a lot of “zero” bytes in a UTF-16 encoded buffer: these do not represent the character NULL on their own. This doesn’t cause buffer termination, because one does not use an 8-bit byte to access a UTF-16 string. If you have a “uint16_t[]” for your UTF-16 string, your pointer will advance 16 bits, rather than 8 bits, at a time through the buffer. A single code unit in this string is always 16 bits long. Only a 16-bit “null” represents the character NULL.
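
    To make the difference between zero bytes and zero code units concrete, here is a small C sketch. The function names are hypothetical, and the buffer is assumed to be an array of uint16_t whose length in code units is passed in explicitly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: view the buffer as raw bytes and count the
   bytes whose value is zero. */
size_t count_zero_bytes(const uint16_t *buf, size_t units)
{
    const unsigned char *bytes = (const unsigned char *)buf;
    size_t count = 0;
    for (size_t i = 0; i < units * sizeof(uint16_t); i++) {
        if (bytes[i] == 0) {
            count++;
        }
    }
    return count;
}

/* Hypothetical helper: view the same buffer as 16-bit code units and
   count the units whose value is zero. */
size_t count_zero_units(const uint16_t *buf, size_t units)
{
    size_t count = 0;
    for (size_t i = 0; i < units; i++) {
        if (buf[i] == 0) {
            count++;
        }
    }
    return count;
}
```

    For the three code units 0x0041 0x0042 0x0000 (“AB” plus the terminator), the byte view finds four zero bytes regardless of byte order, while the code unit view finds exactly one zero: the terminator.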

    If you want to use bytes (char* in C), then you would use a different character encoding of Unicode (UTF-8). In this encoding, the null byte represents only the character NULL and is never part of a multi-byte sequence.
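
    The UTF-8 property mentioned above can be checked with a short C sketch. The names utf8_encode and utf8_has_zero_byte are hypothetical, and the input is assumed to be a valid code point (no surrogates, no value above 0x10FFFF); the sketch does not validate it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
   Returns the number of bytes written to out (1 to 4). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;                 /* 0xxxxxxx */
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6)); /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12)); /* 1110xxxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));     /* 11110xxx */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical helper: returns 1 if the UTF-8 form of cp contains a
   0x00 byte, otherwise 0. */
int utf8_has_zero_byte(uint32_t cp)
{
    unsigned char out[4];
    size_t n = utf8_encode(cp, out);
    for (size_t i = 0; i < n; i++) {
        if (out[i] == 0) {
            return 1;
        }
    }
    return 0;
}
```

    Every lead byte of a multi-byte sequence has its top bits set to 11 and every trailing byte has them set to 10, so only U+0000 itself can produce a zero byte. That is why byte-oriented functions such as strlen keep working on UTF-8 text.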

    I hope that helps explain it. You might also glance at my character encoding tutorial [2] or even order a copy of the Unicode Guide [3] to help you out.

    Regards,

    Addison

    [1] http://www.unicode.org/charts/PDF/U0000.pdf
    [2] http://www.inter-locale.com/whitepaper/Encodings-and-Unicode.pptx
    [3] http://www.amazon.com/exec/obidos/tg/detail/-/1423201809

    Addison Phillips
    Globalization Architect -- Lab126

    Internationalization is not a feature.
    It is an architecture.

    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Venugopalan G
    Sent: Saturday, June 27, 2009 5:35 AM
    To: unicode@unicode.org
    Subject: Zero termination

    Hi grp,

    I just want to know if a valid UTF-16 string can contain the value zero (0): not the character zero, but the 16-bit value zero.
    For example, if I iterate through each Unicode character (16 bits), will I find zero at any time? Is zero a valid code point or a part of a code point?
    Basically, can I use zero to represent termination of a UTF-16 string? If zero can appear in the middle of a string, then the program will terminate in the wrong place.

    Thanks,
    Venu


