Re: Parsing Unicode strings

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed May 28 2008 - 16:56:06 CDT


    On 5/28/2008 1:49 PM, Peter Johansson wrote:
    > Is the Unicode-encoded character string self-descriptive? That is, do
    > I need /a priori/ knowledge that it is encoded as, for example, UTF-8
    > rather than UTF-32? Or, by examining the first byte (or first few
    > bytes) can I determine the encoding?
    UTF-32 will have every 4th byte null (0x00), always, no matter what the
    text contains: since no code point exceeds U+10FFFF, the most significant
    byte of every 32-bit code unit is zero. LE and BE differ only in whether
    these null bytes lead or trail each group of four bytes.

    In essence, that makes UTF-32 self-describing for anything longer than a
    couple of characters. Your example didn't mention UTF-16, so if the only
    other alternative is UTF-8, the null bytes are a very definite signature.
    (UTF-16 text in the ASCII/Latin-1 range has every other byte null, so it
    would also satisfy the every-fourth-byte test.)

    For text on the BMP, every alternate pair of bytes is null in UTF-32,
    which is something you don't get in UTF-16 unless you allow the document
    to contain null-terminated strings of single characters.

    UTF-32 text that is off the BMP could look like UTF-16 in which every
    other character is a control code. That becomes progressively less likely
    as the text gets longer (and even then, currently only the byte values 01,
    02, 0E, 0F and 10 would correspond to planes with assigned or private-use
    characters, and those are not the most frequently used control bytes). So
    checking not only the MSB but also the next byte of the putative UTF-32
    text would quickly establish whether it's UTF-32 or rather UTF-16.
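    A sketch of that two-byte test (my illustration, not a spec): the leading
    byte of each 4-byte group must be 0x00 and the next byte at most 0x10, or
    the mirror image for little-endian:

        # Sketch only: in valid UTF-32, the high byte of each code unit is
        # 0x00 and the next byte is 0x00..0x10 (planes 0-16). UTF-16 text that
        # merely contains some nulls will usually fail this on longer samples.
        def looks_like_utf32(data: bytes) -> bool:
            if len(data) < 8 or len(data) % 4:
                return False
            be = all(data[i] == 0 and data[i + 1] <= 0x10
                     for i in range(0, len(data), 4))
            le = all(data[i + 3] == 0 and data[i + 2] <= 0x10
                     for i in range(0, len(data), 4))
            return be or le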

    In short, discriminating among UTFs (if other encodings are ruled out) is
    a rather definite proposition. The one exception is UTF-16 BE vs. LE,
    because it's easy to construct cases where one looks like odd, but legal,
    text in the other. Hence the use of a BOM.

    Where other encodings could be present, you get the complex issue of
    encoding recognition, and that's where adding a BOM really helps, both to
    establish the encoding as Unicode and to declare the encoding scheme.
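    For reference, the BOM signatures themselves are easy to test (a sketch;
    the byte values are the standard Unicode signatures):

        # Sketch only: identify the encoding scheme from a leading BOM.
        # The UTF-32 signatures must be tested before the UTF-16 ones, because
        # FF FE 00 00 begins with the UTF-16LE signature FF FE.
        def bom_encoding(data: bytes):
            if data.startswith(b"\x00\x00\xFE\xFF"):
                return "UTF-32BE"
            if data.startswith(b"\xFF\xFE\x00\x00"):
                return "UTF-32LE"
            if data.startswith(b"\xEF\xBB\xBF"):
                return "UTF-8"
            if data.startswith(b"\xFE\xFF"):
                return "UTF-16BE"
            if data.startswith(b"\xFF\xFE"):
                return "UTF-16LE"
            return None   # no BOM; fall back to the heuristics above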
    >
    > I didn't see anything on this topic in the FAQ.
    >
    > Regards,
    >
    > Peter Johansson
    >
    > Congruent Software, Inc.
    > 98 Colorado Avenue
    > Berkeley, CA 94707
    >
    > (510) 527-3926
    > (510) 527-3856 FAX
    >
    > PJohansson@ACM.org
    >


