Re: Unicode lexer

From: Hans Aberg
Date: Wed Apr 20 2005 - 18:55:09 CST


    At 20:27 -0400 2005/04/20, Tom Emerson wrote:
    >UTF-8 is a solution to the problem, though the depth of the automata
    >increases and you may end up having to convert your existing UTF-16/32
    >buffers to UTF-8 for lexing, then back again, dealing all the while
    >with returning correct offsets during error processing. PCRE, for
    >example, works in UTF-8, so if you want to use it on a UTF-16 buffer
    >you need to convert both ways. A RPITA.
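
    The offset bookkeeping mentioned above can be sketched as follows. This
    is only an illustration, not anything from PCRE or an existing lexer:
    the helper name utf8_to_utf16_offset is hypothetical, and it assumes the
    byte offset falls on a character boundary.

```python
def utf8_to_utf16_offset(text: str, byte_offset: int) -> int:
    """Map a byte offset in the UTF-8 form of text to a UTF-16 code-unit offset.

    Hypothetical helper for illustration; byte_offset must lie on a
    character boundary, otherwise decoding the prefix raises an error.
    """
    utf8 = text.encode("utf-8")
    # Decode only the part of the buffer that precedes the offset.
    prefix = utf8[:byte_offset].decode("utf-8")
    # utf-16-le emits no byte order mark, and each code unit is 2 bytes,
    # so the encoded length divided by 2 is the code-unit count.
    return len(prefix.encode("utf-16-le")) // 2


if __name__ == "__main__":
    sample = "a\u00e9\u20ac\U0001F600b"  # 1-, 2-, 3- and 4-byte UTF-8 sequences
    # 'b' starts at byte 10 in UTF-8 but at code unit 5 in UTF-16.
    print(utf8_to_utf16_offset(sample, 10))
```

    Re-encoding the prefix on every error is linear in the buffer size; a
    real lexer would more likely keep a running count while scanning.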

    There is no problem using UTF-16/32 directly either, since the lexer
    will merely interpret them as byte sequences. UTF-16 is quite
    irregular, though, and is harder to use because of that, so
    translating to UTF-8 or UTF-32 is probably preferable. Of those two,
    UTF-8 will probably win over UTF-32, since ASCII characters are
    encoded as single bytes using only the low 7 bits.
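
    To make the byte-sequence view concrete, here is a small sketch that
    prints how a few characters look in each encoding. The function names
    encoded_forms and ascii_byte_positions are made up for this example;
    only the standard str.encode and bytes.hex calls are used.

```python
def encoded_forms(ch: str) -> dict:
    """Return the hex byte sequence of ch in UTF-8, UTF-16 and UTF-32 (big endian)."""
    return {name: ch.encode(name).hex() for name in ("utf-8", "utf-16-be", "utf-32-be")}


def ascii_byte_positions(data: bytes) -> list:
    """Return the positions of all bytes whose high bit is clear (below 0x80)."""
    return [i for i, b in enumerate(data) if b < 0x80]


if __name__ == "__main__":
    for ch in ("A", "\u20ac", "\U0001F600"):
        print(repr(ch), encoded_forms(ch))

    # In UTF-8, a byte below 0x80 is always a complete ASCII character.
    print(ascii_byte_positions("\u20ac A".encode("utf-8")))

    # In UTF-16, the euro sign U+20AC contains the byte 0x20, which a
    # byte-oriented lexer could mistake for an ASCII space.
    print(ascii_byte_positions("\u20ac".encode("utf-16-be")))
```

    In the UTF-8 case the bytes below 0x80 are exactly the space and the
    "A"; in the UTF-16 case the low byte turns up inside a non-ASCII
    character, which is the kind of irregularity a byte-level automaton
    has to account for.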

       Hans Aberg
