Re: 32'nd bit & UTF-8

From: Hans Aberg
Date: Wed Jan 19 2005 - 13:56:34 CST

    On 2005/01/19 17:48, Arcane Jill wrote:

    >> A lexer generator like Flex does not process Unicode directly, it generates a
    >> lexer that processes bytes.

    > As a programmer myself, I actually followed that explanation.

    There have been discussions about that on the Flex list. People write
    Unicode expressions by spelling out the \x.. byte escapes by hand. But that
    is tedious, and that tedium is what led me to this approach.
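
    To illustrate what writing the \x.. escapes by hand amounts to, here is a
    small Python sketch. It assumes the lexer input is UTF-8; the helper name
    flex_escapes is made up for this example and is not part of Flex.

```python
def flex_escapes(text: str) -> str:
    """Spell out the UTF-8 bytes of `text` as \\xNN escapes (hypothetical helper)."""
    return "".join(f"\\x{b:02X}" for b in text.encode("utf-8"))


print(flex_escapes("\u00E9"))  # U+00E9, two bytes in UTF-8: \xC3\xA9
print(flex_escapes("\u20AC"))  # U+20AC, three bytes in UTF-8: \xE2\x82\xAC
```

    Doing this for a single character is bearable; doing it for whole
    character ranges is where it becomes tedious.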

    > But I wonder if
    > it's the right approach. Would it not be a more ... interesting ... approach,
    > to forget Flex, and instead write a brand new Unicode lexer generator which
    > generates a lexer that processes characters (not bytes)?

    Why don't you do that yourself? :-) You must think about how much work
    has already been put into developing Flex to this point. And Unicode is
    not the only issue; there are many others: better Flex/Bison handling,
    multi-language output, etc.

    If you want to write a lexer that directly processes Unicode code points,
    using the DFA approach, then the problem is that you need a transition
    table indexed by 2^21 (about two million) possible input values. Since this
    is too big for a typical static array, you get into the issue of table
    compression and the like. So rewriting the Unicode patterns as regular
    expressions over single bytes has the advantage of avoiding that issue
    altogether, promising quick Unicode support with relatively minor
    implementation work. There seem to be other benefits to this approach as
    well: one can mix different Unicode encodings in the same lexer, by the use
    of start conditions, say.
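
    As a rough sketch of such a rewrite (not the actual code under
    discussion), the following Python turns a range of Unicode code points into
    a regular expression over UTF-8 bytes. It assumes UTF-8 only and skips the
    surrogates; the names split_range and byte_regex are invented for the
    example.

```python
def split_range(lo, hi):
    """Yield UTF-8 byte-range sequences that together cover lo..hi.

    Each item is a list of (low_byte, high_byte) pairs, one per byte position.
    Surrogates (U+D800..U+DFFF) have no UTF-8 encoding and are skipped.
    """
    if lo > hi:
        return
    if lo <= 0xDFFF and hi >= 0xD800:          # cut the surrogate block out
        yield from split_range(lo, 0xD7FF)
        yield from split_range(0xE000, hi)
        return
    for last in (0x7F, 0x7FF, 0xFFFF):         # last code point of each length
        if lo <= last < hi:
            yield from split_range(lo, last)
            yield from split_range(last + 1, hi)
            return
    if hi <= 0x7F:                             # plain ASCII: one byte
        yield [(lo, hi)]
        return
    for i in range(1, 4):                      # align on 6-bit continuation units
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if lo & mask != 0:
                yield from split_range(lo, lo | mask)
                yield from split_range((lo | mask) + 1, hi)
                return
            if hi & mask != mask:
                yield from split_range(lo, (hi & ~mask) - 1)
                yield from split_range(hi & ~mask, hi)
                return
    # Now every byte position is an independent range.
    yield list(zip(chr(lo).encode("utf-8"), chr(hi).encode("utf-8")))


def byte_regex(lo, hi):
    """Return a byte-level regular expression matching code points lo..hi."""
    def atom(a, b):
        return f"\\x{a:02X}" if a == b else f"[\\x{a:02X}-\\x{b:02X}]"
    return "|".join("".join(atom(a, b) for a, b in seq)
                    for seq in split_range(lo, hi))


# Greek small letters alpha..omega, U+03B1..U+03C9
print(byte_regex(0x3B1, 0x3C9))  # \xCE[\xB1-\xBF]|\xCF[\x80-\x89]
```

    Every character class in the result ranges over single bytes, so the
    generated DFA needs only 256 input values per state, however large the
    Unicode range is.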

    This need not be the end of it, though. But if one is to make a lexer for
    Unicode code points only, then one probably needs some idea of what the
    Unicode lexers actually in use look like. So one may have to wait some
    time for that to happen.

      Hans Aberg
