RE: Zero termination

From: Phillips, Addison (addison@amazon.com)
Date: Sat Jun 27 2009 - 12:21:16 CDT


    Venu,


    Thanks for the detailed description.
    The input is always readable text in some language (not necessarily English), not an arbitrary UTF-16 stream.
    Let me put the question differently.

    Is it possible for a readable/valid string in any other language to contain U+0000 in the middle?
    AP> No. It doesn’t matter what the language is. The only character in Unicode (and thus UTF-16) that uses the code unit 0x0000 is NULL.

    I understand that U+0000 represents the NULL character. But is it always NULL, irrespective of language/charset?
    AP> Yes. Always.
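
A minimal Python sketch of this point, assuming the text is well-formed UTF-16 and is scanned one 16-bit code unit at a time. The helper names `utf16_code_units` and `utf16_length_until_null` and the sample strings are made up for illustration. Note that individual bytes can still be zero (for example, the high byte of any ASCII letter), so the scan has to compare whole code units, not bytes.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints (no BOM)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def utf16_length_until_null(units):
    """Count code units before the first 0x0000, like a C-style terminator scan."""
    for index, unit in enumerate(units):
        if unit == 0x0000:
            return index
    return len(units)

# Illustrative samples: ASCII, Latin with diacritics, Devanagari,
# Chinese in the BMP, and Chinese outside the BMP (surrogate pairs).
samples = ["hello", "Gr\u00fc\u00dfe", "\u0928\u092e\u0938\u094d\u0924\u0947",
           "\u4f60\u597d", "\U00020000\U00020001"]

for sample in samples:
    units = utf16_code_units(sample)
    # No code unit is 0x0000, so the scan never stops early.
    assert 0x0000 not in units
    assert utf16_length_until_null(units) == len(units)
```

Only a string that actually contains U+0000 produces a 0x0000 code unit, and in that case the scan stops there.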


    One possibility I could think of is that some Chinese character might have
    one code point = two 16-bit code units,
    AP> Some Chinese characters (and characters from other scripts) do in fact use two 16-bit code units. Such a pair is called a “surrogate pair”, and its code units are restricted to a specific range (0xD800 through 0xDFFF), so neither of them is ever 0x0000.

    where the first 16-bit unit is something and the next 16-bit unit is U+0000. Is that possible?
    AP> No.
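
The arithmetic behind this can be sketched in Python. The function name `to_surrogate_pair` is hypothetical; the constants are the standard UTF-16 ones. The high unit always falls in 0xD800..0xDBFF and the low unit in 0xDC00..0xDFFF, so neither half can be 0x0000.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low

# The extremes of the supplementary range, plus U+20000 (a CJK ideograph).
for cp in (0x10000, 0x20000, 0x10FFFF):
    high, low = to_surrogate_pair(cp)
    assert 0xD800 <= high <= 0xDBFF
    assert 0xDC00 <= low <= 0xDFFF
```

Even when the bottom 10 bits of the offset are all zero, as for U+20000, the low surrogate is 0xDC00 and not 0x0000.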

    Is there any real-world character with such an encoded value? Does Unicode allow character sets to choose U+0000 for their code point representation?
    AP> Unicode is the character set. It encodes the various scripts used to write the world’s languages, assigning each character a unique code point. The code point U+0000 is assigned (solely, uniquely) to NULL.
    Addison

    Addison Phillips
    Globalization Architect -- Lab126

    Internationalization is not a feature.
    It is an architecture.




