Re: Stateful?

From: Doug Ewell (
Date: Wed May 28 2008 - 07:27:56 CDT

    Marcin ‘Qrczak’ Kowalczyk <qrczak at knm dot org dot pl> wrote:

    > A UTF-8 with a BOM is stateful: the decoder must remember whether it
    > has seen a BOM or whether it is past the beginning, and the encoder
    > must remember if it is at the beginning, to know whether to emit
    > U+FEFF twice for the case when the data begins with U+FEFF. A UTF-8
    > without any special treatment of U+FEFF at the beginning is stateless.
    > Both variants of UTF-8 are in use. It would be better to distinguish
    > them explicitly, like UTF-16 is distinguished from UTF-16BE &
    > UTF-16LE.

    Nobody has yet shown me a realistic (non-contrived) scenario of Unicode
    data beginning with ZERO WIDTH NO-BREAK SPACE. It would make no sense;
    the whole purpose of ZWNBSP as such is to be placed *between* two
    characters. Certainly it can be done, just as a diaeresis can be
    positioned after a control character, but it's not realistic.
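For readers who want to see the distinction from the quoted text in concrete terms, here is a minimal sketch. It assumes Python's standard "utf-8" and "utf-8-sig" codecs as stand-ins for the two variants; neither poster named an implementation, so the codec choice and the sample strings are illustrative only.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, used both as BOM and as ZERO WIDTH NO-BREAK SPACE

# Stateless variant: U+FEFF is an ordinary character at every position.
plain_bytes = (BOM + "abc").encode("utf-8")
plain_text = plain_bytes.decode("utf-8")

# BOM-aware variant: the encoder prepends a signature, so text that
# already begins with U+FEFF comes out with the byte sequence twice.
signed_bytes = (BOM + "abc").encode("utf-8-sig")
signed_text = signed_bytes.decode("utf-8-sig")

# The decoder state is visible with an incremental decoder: the same
# three bytes are dropped at the start and kept once past the start.
decoder = codecs.getincrementaldecoder("utf-8-sig")()
first_chunk = decoder.decode(b"\xef\xbb\xbfab")
later_chunk = decoder.decode(b"\xef\xbb\xbfcd")

print(plain_bytes, signed_bytes, repr(first_chunk), repr(later_chunk))
```

Both variants round-trip the leading U+FEFF in this sketch; the difference is that the BOM-aware pair has to track whether it is still at the beginning of the data, which is the statefulness being argued about.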

    Doug Ewell  *  Arvada, Colorado, USA  *  RFC 4645  *  UTN #14
