Re: Conformance

From: Hans Aberg (haberg@math.su.se)
Date: Fri Jan 21 2005 - 12:47:21 CST

    On 2005/01/21 16:33, Peter Kirk at peterkirk@qaya.org wrote:

    >> [Jill's Important Question 2]:
    >> And the second question I must ask is: if a file is labelled by some
    >> higher level protocol (for example, Unix locale, HTTP header, etc) as
    >> "UTF-8", should a conformant process interpret that as UTF-8, the
    >> Unicode Encoding FORM (which prohibits a BOM) or as UTF-8, the Unicode
    >> Encoding SCHEME (which allows one)?
    >>
    > Excellent question! And what if it is not labelled at all, but expected
    > to be UTF-8?

    Here, it seems, the higher level protocol should define what happens
    with a BOM, just as with any other character. "UTF-8" just means that the
    byte sequence is well formed according to UTF-8.
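
A minimal Python sketch of this split, assuming a hypothetical helper
`decode_labelled_utf8` whose `bom_policy` argument stands in for whatever the
higher level protocol decides. Neither name comes from any standard. Python's
`utf-8` codec only checks well-formedness and keeps a leading U+FEFF as an
ordinary character, while `utf-8-sig` drops it.

```python
import codecs


def decode_labelled_utf8(data: bytes, bom_policy: str = "keep") -> str:
    """Decode bytes labelled "UTF-8" under a protocol-chosen BOM policy.

    bom_policy is a hypothetical knob for illustration only:
      "keep"   - treat a leading U+FEFF like any other character
      "strip"  - drop a leading U+FEFF signature if present
      "reject" - refuse input that starts with a signature
    """
    if bom_policy == "keep":
        # Raises UnicodeDecodeError if the bytes are not well-formed UTF-8.
        return data.decode("utf-8")
    if bom_policy == "strip":
        return data.decode("utf-8-sig")
    if bom_policy == "reject":
        if data.startswith(codecs.BOM_UTF8):
            raise ValueError("protocol forbids a leading BOM")
        return data.decode("utf-8")
    raise ValueError(f"unknown BOM policy: {bom_policy!r}")


sample = codecs.BOM_UTF8 + "hello".encode("utf-8")
kept = decode_labelled_utf8(sample, "keep")
stripped = decode_labelled_utf8(sample, "strip")
```

The same bytes are well formed under every policy; only the protocol's choice
changes what the receiving program sees.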

    The Unicode standard, it seems, is prone to misinterpretation on this
    point, and should be rewritten. There appears to be no need for it to
    mention the BOM except as a curiosity, noting that programs and other
    protocols may treat it differently from its U+FEFF character semantics. In
    this respect, it is no different from any other valid character sequence in
    Unicode. Shell script or PS markers do not make the files that carry them
    nonconforming to Unicode. Unicode, as a character protocol, just provides
    the characters and encodings; it does not otherwise enforce any particular
    program behavior.
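
To illustrate the comparison with shell and PS markers, here is a small Python
sketch, assuming a hypothetical helper `is_well_formed_utf8` and made-up sample
byte strings. It checks well-formedness only and says nothing about what a
program should do with a leading marker.

```python
import codecs


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8, regardless of leading markers."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative inputs; the contents are invented for this example.
samples = {
    "shell script": b"#!/bin/sh\necho hi\n",
    "PostScript": b"%!PS\nshowpage\n",
    "BOM-prefixed text": codecs.BOM_UTF8 + b"hi\n",
    "truncated sequence": b"\xe2\x82",  # incomplete three-byte sequence
}

results = {name: is_well_formed_utf8(data) for name, data in samples.items()}
```

The first three samples pass the check in the same way; only the truncated
sequence fails, because it is not a valid byte sequence at all.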

      Hans Aberg


