Re: Unicode lexer

From: Tom Emerson
Date: Wed Apr 20 2005 - 18:27:30 CST

    Tex Texin writes:
    > Tom, you are right it is the latter, Unicoded identifiers and such. I'll
    > look at the Python docs, thanks for the tip.

    Note that Python does not allow Unicode identifiers; it has only
    Unicode string support. Java is probably your best template for
    dealing with Unicode identifiers.

    The big problem with a fully Unicode-enabled lexer (i.e., one that is
    using UTF-16 or UTF-32 internally) is the sheer size of the lookup
    tables: instead of an alphabet of fewer than 100 characters, you end up
    with one with tens of thousands of characters. Ye Olde direct index
    falls apart in the presence of these sparse tables. My two IUC
    presentations (24 and whatever number happened in Dublin) talk about
    some methods for dealing with these issues: unfortunately you end up
    trading off size for a non-trivial speed hit, unless you are very
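
    To make the size/speed trade-off concrete, here is a minimal sketch of
    one common way to shrink a sparse table: a two-stage lookup that stores
    each distinct 256-entry block only once. This is not necessarily one of
    the methods from the IUC presentations. The names char_class,
    build_two_stage and lookup, the block size, and the toy "identifier
    character" classification are all assumptions made for illustration.

```python
import unicodedata

BLOCK_SIZE = 256     # code points per block (assumed; a tunable parameter)
MAX_CP = 0x110000    # one past the last Unicode code point


def char_class(cp):
    """Hypothetical lexer class: 1 = identifier character, 0 = anything else."""
    ch = chr(cp)
    return 1 if ch == "_" or unicodedata.category(ch)[0] in ("L", "N") else 0


def build_two_stage():
    """Build (stage1, blocks): stage1 maps a block number to an index in blocks."""
    stage1 = []   # one small integer per 256-code-point block
    blocks = []   # only the distinct blocks are stored
    seen = {}     # block contents -> index in blocks
    for start in range(0, MAX_CP, BLOCK_SIZE):
        block = bytes(char_class(cp) for cp in range(start, start + BLOCK_SIZE))
        idx = seen.get(block)
        if idx is None:
            idx = seen[block] = len(blocks)
            blocks.append(block)
        stage1.append(idx)
    return stage1, blocks


def lookup(stage1, blocks, cp):
    """Two indexed reads instead of one: this is where the speed hit comes from."""
    return blocks[stage1[cp // BLOCK_SIZE]][cp % BLOCK_SIZE]


if __name__ == "__main__":
    stage1, blocks = build_two_stage()
    print(len(stage1), "blocks addressed,", len(blocks), "blocks stored")
    print(lookup(stage1, blocks, ord("a")), lookup(stage1, blocks, ord(" ")))
```

    Large unassigned or uniform ranges collapse into a single shared block,
    which is where the space saving comes from; the cost is the extra
    indirection on every character.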

    UTF-8 is a solution to the problem, though the depth of the automata
    increases and you may end up having to convert your existing UTF-16/32
    buffers to UTF-8 for lexing, then back again, dealing all the while
    with returning correct offsets during error processing. PCRE, for
    example, works in UTF-8, so if you want to use it on a UTF-16 buffer
    you need to convert both ways. A RPITA.
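
    The offset bookkeeping mentioned above can be sketched as follows. This
    is only an illustration under stated assumptions: the function name
    utf8_to_utf16_offset is made up, the input is assumed to be well-formed
    UTF-8, and the byte offset is assumed to fall on a character boundary.
    It maps a byte offset reported by a UTF-8 lexer back to a code unit
    offset in the original UTF-16 buffer.

```python
def utf8_to_utf16_offset(utf8_buf: bytes, byte_offset: int) -> int:
    """Map a UTF-8 byte offset to the matching UTF-16 code unit offset.

    Assumes utf8_buf is well-formed UTF-8 and byte_offset is on a
    character boundary (both are assumptions of this sketch).
    """
    units = 0
    for b in utf8_buf[:byte_offset]:
        if (b & 0xC0) == 0x80:
            continue  # continuation byte: already counted with its lead byte
        # A 4-byte sequence (lead byte 0xF0 and up) encodes a code point
        # above U+FFFF, which takes a surrogate pair (2 units) in UTF-16.
        units += 2 if b >= 0xF0 else 1
    return units


if __name__ == "__main__":
    # Round trip: the UTF-16 buffer is transcoded to UTF-8 for lexing.
    utf16_buf = "a\u00e9\u4e2d\U0001F600x".encode("utf-16-le")
    utf8_buf = utf16_buf.decode("utf-16-le").encode("utf-8")
    error_at = utf8_buf.index(b"x")  # pretend the lexer failed here
    print(error_at, utf8_to_utf16_offset(utf8_buf, error_at))
```

    This scan is linear in the offset, so a lexer that reports many
    positions would likely want to track both offsets incrementally
    instead of rescanning.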


    Tom Emerson                                          Basis Technology Corp.
    Software Architect                       
      "Beware the lollipop of mediocrity: lick it once and you suck forever"
