Re: Unicode Conformance

From: Doug Ewell (dewell@roadrunner.com)
Date: Mon Feb 18 2008 - 13:17:10 CST


    Srikrishna Erra wrote:

    > For the UTF-16 encoding scheme, the BOM specifies how a file should be
    > serialized, i.e., if BOM=FEFF then the file should use big-endian byte
    > serialization (most significant byte first), or if BOM=FFFE then the
    > file should use little-endian byte serialization (least significant
    > byte first), and the unmarked form (no BOM) uses big-endian byte
    > serialization by default.
    >
    > So LE and BE input files are supposed to be processed on LE and BE
    > platforms respectively. When input scripts of the wrong endianness are
    > given, i.e., an LE script on a BE platform or vice versa, the
    > application should terminate with an error.
    >
    > Here, one of my applications is allowing BE scripts on LE platforms
    > and vice versa, so I need clarification.

    Conformance clause C11 directs you to definition D97, which doesn't
    really shed any additional light on this, but definition D98 directly
    below it says:

    "The UTF-16 encoding scheme may or may not begin with a BOM. However,
    when there is no BOM, and in the absence of a higher-level protocol, the
    byte order of the UTF-16 encoding scheme is big-endian."

    I think your question is governed by that "higher-level protocol"
    clause. It depends on what "input scripts" means in your context (plain
    text files?), but if you have an application that reads files in the FOO
    file format, or input streams using the FOO protocol, which specifies
    that UTF-16 text is little-endian and has no BOM, then it is OK and
    expected to read this data correctly even on a big-endian architecture.
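
    As a rough illustration of the "higher-level protocol" case, here is a
    minimal Python sketch. The FOO format and the function name
    read_foo_text are hypothetical placeholders; the only assumption is
    that FOO declares its text to be UTF-16 little-endian with no BOM. The
    byte order comes from the format, so the host architecture plays no
    part in decoding.

```python
import sys


def read_foo_text(payload: bytes) -> str:
    """Decode text from the hypothetical FOO format.

    Assumption: FOO specifies UTF-16 little-endian with no BOM, so the
    byte order is fixed by the format rather than by the machine.
    """
    return payload.decode("utf-16-le")


if __name__ == "__main__":
    payload = "script".encode("utf-16-le")  # bytes as FOO would store them
    # sys.byteorder is printed only to show that it is never consulted.
    print(sys.byteorder, read_foo_text(payload))
```

    The same bytes decode to the same string on a big-endian and a
    little-endian host, because the codec name states the byte order
    explicitly.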

    If you have a UTF-16 file beginning with a BOM, that is not declared to
    be UTF-16LE or UTF-16BE, then D98 says you can use the byte orientation
    of the BOM as a clue to interpret the file as big-endian or
    little-endian. You can choose to be robust and use this clue to read
    both types of files, instead of rejecting the one that doesn't match
    your architecture.
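
    One way to act on that clue is sketched below in Python. The function
    name decode_utf16_scheme is made up for illustration, and the sketch
    assumes the data is labeled plain "UTF-16" (not UTF-16LE or UTF-16BE)
    with no higher-level protocol dictating the byte order: a leading
    FE FF selects big-endian, a leading FF FE selects little-endian, and
    with no BOM it falls back to the big-endian default from D98.

```python
import codecs


def decode_utf16_scheme(data: bytes) -> str:
    """Decode bytes labeled plain "UTF-16", using the BOM as a hint.

    Assumes no higher-level protocol specifies the byte order.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")
    # No BOM: D98 gives big-endian as the default byte order.
    return data.decode("utf-16-be")


if __name__ == "__main__":
    big = codecs.BOM_UTF16_BE + "abc".encode("utf-16-be")
    little = codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")
    unmarked = "abc".encode("utf-16-be")
    for sample in (big, little, unmarked):
        print(decode_utf16_scheme(sample))
```

    Note that the host's own byte order never appears in the logic, which
    is why the same reader can accept both kinds of file on either
    architecture.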

    --
    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
    http://www.ewellic.org
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

