Re: Stateful?

From: Doug Ewell (dewell@roadrunner.com)
Date: Wed May 28 2008 - 07:27:56 CDT


    Marcin ‘Qrczak’ Kowalczyk <qrczak at knm dot org dot pl> wrote:

    > A UTF-8 with a BOM is stateful: the decoder must remember whether it
    > has seen a BOM or whether it is past the beginning, and the encoder
    > must remember if it is at the beginning, to know whether to emit
    > U+FEFF twice for the case when the data begins with U+FEFF. A UTF-8
    > without any special treatment of U+FEFF at the beginning is stateless.
    > Both variants of UTF-8 are in use. It would be better to distinguish
    > them explicitly, like UTF-16 is distinguished from UTF-16BE &
    > UTF-16LE.

    Nobody has yet shown me a realistic (non-contrived) scenario of Unicode
    data beginning with ZERO-WIDTH NO-BREAK SPACE. It would make no sense;
    the whole purpose of ZWNBSP as such is to be placed *between* two
    characters. Certainly it can be done, just as a diaeresis can be
    positioned after a control character, but it's not realistic.
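
    The two behaviours described in the quoted text can be sketched as
    follows. This is only an illustration, assuming Python's standard
    "utf-8" and "utf-8-sig" codecs stand in for the BOM-less and BOM-aware
    variants; the helper decode_chunks is a hypothetical name introduced
    here, not something from the thread.

```python
import codecs

BOM = codecs.BOM_UTF8          # the three bytes EF BB BF
text = "\ufeffabc"             # data that itself begins with U+FEFF

# Stateless variant: U+FEFF is encoded like any other character.
plain = text.encode("utf-8")

# BOM-aware variant: the encoder writes a signature first, so data that
# begins with U+FEFF ends up with the byte sequence twice.
signed = text.encode("utf-8-sig")


def decode_chunks(encoding, chunks):
    """Feed byte chunks one at a time to a single incremental decoder."""
    decoder = codecs.getincrementaldecoder(encoding)()
    return [decoder.decode(chunk) for chunk in chunks]


# The same three bytes are fed twice. The BOM-aware decoder treats them
# differently depending on whether it is still at the start of the stream.
plain_out = decode_chunks("utf-8", [BOM, BOM])
signed_out = decode_chunks("utf-8-sig", [BOM, BOM])

print(plain, signed)
print(plain_out, signed_out)
```

    The BOM-aware decoder drops the first EF BB BF and keeps the second,
    which is the "remember whether it is past the beginning" state the
    quoted text refers to; the plain decoder maps both to U+FEFF.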

    --
    Doug Ewell  *  Arvada, Colorado, USA  *  RFC 4645  *  UTN #14
    http://www.ewellic.org
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

