Re: 32'nd bit & UTF-8

From: Hans Aberg
Date: Tue Jan 18 2005 - 18:09:33 CST

    On 2005/01/18 22:53, Philippe VERDY wrote:

    >>> Of course this loses the fact that UTF-8 data will never contain 0xFE or
    >>> 0xFF (and so UTF-16 with a BOM will never be confused with UTF-8, a fact
    >>> that is important to XML parsers, for one application).
    >> In , the use of a BOM is
    >> discouraged on UNIX platforms. So if endianness appears to
    >> become a problem, it might be better to use UTF-8 externally, and then
    >> convert it to UTF-32/H/L internally in the program.
    > I have not read any formal description (even informative) of a UTF-8-like
    > transformation format that uses bytes FE and FF.
    > So if you really want to use FE and FF, to extend the
    > old-deprecated-informative-RFC UTF-8 to keep the compatibility with byte order
    > marks used to autodetect UTF-16 and UTF-32, you can consider this:
    > - if FF is used, it has to be followed by FE to be recognized as a
    > (not recommended) UTF-16 or UTF-32 BOM
    > - if FE is used, it has to be followed by FF to be recognized as a
    > (not recommended) UTF-16 or UTF-32 BOM
    > So you have better options: don't use FE blindly in your extension; make sure
    > that your extension will not allow encoding an FF byte just after it. Same
    > thing for FF (it can't be followed by FE).
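
    The byte-pair rule quoted above can be sketched in a few lines of Python.
    This is only an illustration under stated assumptions: the function names
    and the returned labels are hypothetical, and the check covers only the
    FE/FF pairs discussed in this thread, not a complete encoding detector.

```python
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a (hypothetical) label for a leading FE/FF byte order mark."""
    # UTF-32 is tested first because the UTF-32LE BOM (FF FE 00 00) begins
    # with the UTF-16LE BOM (FF FE). UTF-16LE text that starts with U+0000
    # right after its BOM is therefore ambiguous; this sketch picks UTF-32LE.
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return None


def has_bom_like_pair(data: bytes) -> bool:
    """True if FE is directly followed by FF, or FF by FE, anywhere in data.

    An extended UTF-8-like format following the quoted advice would never
    emit such a pair.
    """
    return b"\xfe\xff" in data or b"\xff\xfe" in data


print(sniff_bom(b"\xff\xfeh\x00i\x00"))   # utf-16-le
print(sniff_bom(b"plain ASCII"))          # None
print(has_bom_like_pair(b"a\xfe\xffb"))   # True
```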

    These are good points, but they are really non-issues for extending Flex
    to those encodings: it is the person writing the lexer who will have to
    resolve them, not Flex itself. For example, Flex supports context switches
    in a lexer, called "start conditions", which could be used to write a lexer
    that switches between different encodings. It is then up to the lexer
    writer to work those things out, not the Flex program or its developers.
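
    To illustrate the idea of a scanner that changes encoding by changing
    state, here is a minimal Python sketch. It is not Flex and not Flex
    output: the state variable merely plays the role of a start condition,
    and the two switch markers (byte FF in the UTF-8 state, code unit FFFF in
    the UTF-16BE state) are hypothetical choices made up for this example.

```python
from typing import List, Tuple

# Hypothetical switch markers, chosen only for this sketch: byte 0xFF never
# occurs in well-formed UTF-8, and U+FFFF is a noncharacter in UTF-16.
TO_UTF16 = 0xFF
TO_UTF8 = b"\xff\xff"

CODECS = {"UTF8": "utf-8", "UTF16BE": "utf-16-be"}


def _emit(runs: List[Tuple[str, str]], state: str, chunk: bytes) -> None:
    """Decode a non-empty chunk with the codec of the current state."""
    if chunk:
        runs.append((state, chunk.decode(CODECS[state])))


def scan(data: bytes) -> List[Tuple[str, str]]:
    """Split data into (state, text) runs, switching state at the markers."""
    state = "UTF8"  # analogous to the initial start condition
    runs: List[Tuple[str, str]] = []
    start = i = 0
    while i < len(data):
        if state == "UTF8":
            if data[i] == TO_UTF16:
                _emit(runs, state, data[start:i])
                state = "UTF16BE"
                i += 1
                start = i
            else:
                i += 1
        else:
            # In the UTF-16BE state, step one 16-bit code unit at a time.
            if data[i:i + 2] == TO_UTF8:
                _emit(runs, state, data[start:i])
                state = "UTF8"
                i += 2
                start = i
            else:
                i += 2
    _emit(runs, state, data[start:])
    return runs


sample = b"abc" + b"\xff" + "d\u00e9f".encode("utf-16-be") + b"\xff\xff" + b"ghi"
print(scan(sample))
```

    In a real Flex lexer the same decisions would sit in the rules written by
    the lexer author, which is the point made above.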

      Hans Aberg
