Re: Filtering and displaying untrusted UTF-8

From: Jukka K. Korpela (
Date: Mon Dec 28 2009 - 12:50:20 CST


    - - wrote:

    > 2) For code points in planes 0 to 2 (BMP, SMP, SIP) filter the
    > following:
    - -
    > * 0xFEFF (byte order mark, no use in UTF-8 and may be
    > potentially dangerous if converted later to UTF-16 without proper
    > filtering)

    Others have commented on the big picture, which remains somewhat obscure,
    and I have just one note on a detail: U+FEFF is, by definition, ZERO WIDTH
    NO-BREAK SPACE when it occurs anywhere except at the start of a data stream.
    In that role, it acts as invisible glue that prevents a line break where it
    might otherwise be introduced. Even though you might say that another
    character is preferred for such usage, U+FEFF is still the one that works
    most widely, in popular software like Microsoft Word and Internet Explorer.
    (Technically, they do not operate on plain text, but they do operate on
    text, and U+FEFF is the text-level weapon that one can use.)

    Therefore, treating U+FEFF as not allowed in a plain text data stream would
    be a big mistake, even though filtering it out would normally result in
    inferior typography at most.
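
    As a rough illustration of the distinction, here is a minimal Python
    sketch of a filter that drops U+FEFF only in its byte order mark role,
    at the very start of the data, and leaves later occurrences in place as
    ZERO WIDTH NO-BREAK SPACE. The function names are hypothetical and not
    taken from the original proposal; the sketch assumes the input is
    well-formed UTF-8 and does not cover any of the other filtering steps
    discussed in this thread.

```python
BOM = "\ufeff"  # U+FEFF: byte order mark at the start, ZWNBSP elsewhere


def strip_leading_bom(text: str) -> str:
    """Remove U+FEFF only when it is the first code point of the text."""
    if text.startswith(BOM):
        return text[1:]
    return text


def decode_untrusted(data: bytes) -> str:
    """Decode UTF-8 bytes, discarding one leading BOM if present.

    The standard "utf-8-sig" codec skips a BOM at the start of the data
    only; any later U+FEFF is kept in the decoded string.
    """
    return data.decode("utf-8-sig")


# A leading BOM followed by text that uses U+FEFF as invisible glue.
sample = BOM + "100" + BOM + "km"
cleaned = strip_leading_bom(sample)
print([f"U+{ord(ch):04X}" for ch in cleaned])
```

    Both helpers leave the interior U+FEFF untouched, so a no-break
    position such as the one between "100" and "km" above survives the
    filtering step.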


    This archive was generated by hypermail 2.1.5 : Mon Dec 28 2009 - 12:53:12 CST