Re: Unicode lexer

From: Tom Emerson
Date: Wed Apr 20 2005 - 05:34:21 CST


    Tex Texin writes:
    > I would be interested in pointers to any papers, case studies etc. on
    > migrating programming languages to be Unicode-enabled. (No sense
    > repeating the sins of the past.)

    I would take a look at Python and the various specifications that were
    written around its Unicode implementation. The guys who implemented it
    did a fantastic job. Indeed, the implementation is pretty easy to read
    as well, so you may just want to look at the code.

    There are, of course, a couple of levels of "Unicode-enablement"
    within a programming language. Many moons ago I was involved with
    working on the Unicode-enablement of Gwydion Dylan, though life
    intervened and I had to stop. If "all" you need to do is provide
    support for a Unicode string type, with appropriate transcoders, then
    the task is considerably easier than if you are enabling the entire
    language to allow Unicode identifiers, à la Java. Since you are asking
    for a Unicode-enabled lexer, I assume the latter.
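
    To make the "easier" level concrete, here is a minimal sketch in Python
    of transcoding at the boundary: external bytes are decoded into the
    language's Unicode string type and encoded again on the way out. The
    function name transcode is hypothetical, chosen for illustration only;
    the encoding names are ordinary Python codec names.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another via a Unicode string."""
    text = data.decode(source)   # external bytes -> internal Unicode string
    return text.encode(target)   # internal Unicode string -> external bytes


if __name__ == "__main__":
    latin1_bytes = "é".encode("latin-1")
    print(transcode(latin1_bytes, "latin-1", "utf-8"))
```

    Everything inside the language then works on the Unicode string type
    alone, and only the edges need to know about legacy encodings.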

    I thought that Flex had been modified to deal with Unicode... I guess
    that isn't the case.

    You don't mention the implementation language: whether it's C, C++,
    Java, or something else entirely. That will certainly constrain your

    It may end up being easier to develop your own lexer from scratch
    rather than using Flex or another lexer generator. But again, without
    knowing more about the problem, it's hard to say. FWIW I've taken this
    approach in
    one project, and it worked well, especially given UAX #31 as a
    starting point.
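
    As a rough illustration of the hand-written approach (not the code from
    the project mentioned above), here is a minimal sketch in Python. It
    assumes that Python's str.isidentifier(), which is based on the
    XID_Start and XID_Continue properties from UAX #31 but also accepts
    "_" as a start character, is a good enough stand-in for a real property
    lookup. The names is_id_start, is_id_continue, and tokenize are
    hypothetical, as is the choice to NFKC-normalize identifiers.

```python
import unicodedata


def is_id_start(ch: str) -> bool:
    # Approximates XID_Start; note that Python also accepts "_" here.
    return ch.isidentifier()


def is_id_continue(ch: str) -> bool:
    # Approximates XID_Continue: test ch after a known-good start character.
    return ("a" + ch).isidentifier()


def tokenize(text: str) -> list:
    """Split text into ("IDENT", name) and ("OTHER", char) tokens."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1                      # skip whitespace
        elif is_id_start(ch):
            j = i + 1
            while j < n and is_id_continue(text[j]):
                j += 1                  # extend over continue characters
            # Assumption: identifiers are compared in NFKC form.
            tokens.append(("IDENT", unicodedata.normalize("NFKC", text[i:j])))
            i = j
        else:
            tokens.append(("OTHER", ch))  # everything else, one char at a time
            i += 1
    return tokens


if __name__ == "__main__":
    print(tokenize("naïve = x1"))
```

    A real lexer would replace the "OTHER" branch with proper handling of
    numbers, strings, and operators, and would take the identifier
    properties from the Unicode Character Database for the Unicode version
    it targets.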


    Tom Emerson                                          Basis Technology Corp.
    Software Architect                       
      "Beware the lollipop of mediocrity: lick it once and you suck forever"
