Re: Unicode lexer

From: Tex Texin (
Date: Wed Apr 20 2005 - 18:33:22 CST

    All true, which is why I am looking for an existing implementation...

    Tom Emerson wrote:
    > Tex Texin writes:
    > > Tom, you are right it is the latter, Unicoded identifiers and such. I'll
    > > look at the Python docs, thanks for the tip.
    > Note that Python does not allow Unicode identifiers: just Unicode
    > string support. Java is probably your best template for dealing with
    > Unicode identifiers.
    > The big problem with a fully-Unicode enabled lexer (i.e., one that is
    > using UTF-16 or UTF-32 internally) is the sheer size of the lookup
    > tables: instead of an alphabet of less than 100 characters, you end up
    > with one with tens of thousands of characters. Ye Olde direct index
    > falls apart in the presence of these sparse tables. My two IUC
    > presentations (24 and whatever number happened in Dublin) talk about
    > some methods for dealing with these issues: unfortunately you end up
    > trading off size for a non-trivial speed hit, unless you are very
    > careful.
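[Editor's note: a minimal sketch of the paged (two-stage) table technique discussed above, in Python: instead of one direct-index array over the whole code point range, the code point is split into a page index and an offset, and identical pages are shared. The predicate used here (ASCII letters, underscore, and the Greek letter range) is purely illustrative, not a real identifier definition.]

```python
PAGE_SIZE = 256

def build_two_stage(predicate, limit=0x10000):
    """Build a two-stage lookup table for `predicate` over code points < limit."""
    pages = {}          # distinct page contents -> page id (dedup)
    page_table = []     # (code point >> 8) -> page id
    page_data = []      # page id -> tuple of 0/1 flags, one per offset
    for base in range(0, limit, PAGE_SIZE):
        page = tuple(1 if predicate(cp) else 0
                     for cp in range(base, base + PAGE_SIZE))
        if page not in pages:         # share identical pages (e.g. all-zero)
            pages[page] = len(page_data)
            page_data.append(page)
        page_table.append(pages[page])
    return page_table, page_data

def lookup(page_table, page_data, cp):
    # Two indexed loads instead of one: high byte picks the page,
    # low byte picks the entry within it.
    return page_data[page_table[cp >> 8]][cp & 0xFF]

def is_ident(cp):
    # Toy predicate for the sketch: ASCII letters, underscore, Greek letters.
    return (0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A or cp == 0x5F
            or 0x391 <= cp <= 0x3C9)
```

Because almost all pages are identical (all-zero here), the sparse table collapses to a handful of distinct pages; this is the size/speed trade-off mentioned above, since each lookup now costs two indexed loads instead of one.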
    > UTF-8 is a solution to the problem, though the depth of the automata
    > increases and you may end up having to convert your existing UTF-16/32
    > buffers to UTF-8 for lexing, then back again, dealing all the while
    > with returning correct offsets during error processing. PCRE, for
    > example, works in UTF-8, so if you want to use it on a UTF-16 buffer
    > you need to convert both ways. A RPITA.
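[Editor's note: a rough sketch of the offset bookkeeping implied above when a UTF-16 buffer is converted to UTF-8 for a byte-oriented engine and error positions must be mapped back. The function name is illustrative; Python's codecs do the encoding itself.]

```python
def utf8_offset_map(text):
    """Encode `text` as UTF-8 and return (utf8_bytes, offsets), where
    offsets[i] is the UTF-16 code-unit offset of the character that
    UTF-8 byte i belongs to (plus a final one-past-the-end entry)."""
    out = bytearray()
    offsets = []
    u16 = 0
    for ch in text:
        b = ch.encode("utf-8")
        offsets.extend([u16] * len(b))   # every byte of ch maps to its start
        out += b
        # Characters outside the BMP take 2 UTF-16 code units (a surrogate pair).
        u16 += len(ch.encode("utf-16-le")) // 2
    offsets.append(u16)  # one-past-the-end, for end-of-match positions
    return bytes(out), offsets
```

With this table, a byte offset reported by the UTF-8 engine can be translated back to a position in the original UTF-16 buffer in constant time, which is where most of the round-trip pain sits.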
    > -tree
    > --
    > Tom Emerson Basis Technology Corp.
    > Software Architect
    > "Beware the lollipop of mediocrity: lick it once and you suck forever"

    Tex Texin   cell: +1 781 789 1898
    Xen Master
    Making e-Business Work Around the World

    This archive was generated by hypermail 2.1.5 : Wed Apr 20 2005 - 18:33:57 CST