Re: Unicode lexer

From: Hans Aberg (
Date: Wed Apr 20 2005 - 10:37:26 CST

    At 10:23 -0400 2005/04/20, Frank Yung-Fong Tang wrote:
    >I think one question we need to first answer is how do you define an
    >Unicode Enabled Lexer
    >I don't have a good answer. But I think it should at least include
    >the following
    >1. Have the ability to scan UTF-8 (and/or UTF-16) input file

    Any lexer generator that admits full 8-bit bytes, both in its source
    file and in the input it scans, has this property. For example, if
    you feed Flex a UTF-8 .l file, the Flex language parts must of course
    be 7-bit ASCII, as that is how the language defines it. But in string
    rules "...", if 8-bit bytes are admitted, you could put in a UTF-8
    string, and the generated lexer would match it literally, byte for
    byte. In a UTF-8 editor, you would just see the Unicode character
    string.
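
    As a rough illustration of the point above (a Python sketch, not
    Flex itself): to a byte-oriented matcher, a UTF-8 string literal is
    just a byte sequence. The helper name match_literal and the sample
    text are made up for the example.

```python
# A UTF-8 string literal, as an 8-bit-clean lexer would store it: plain bytes.
literal = "Ångström".encode("utf-8")
data = "unit: Ångström\n".encode("utf-8")


def match_literal(buf: bytes, pos: int, lit: bytes) -> int:
    """Return the position just past lit if it occurs at pos, else -1."""
    return pos + len(lit) if buf.startswith(lit, pos) else -1


start = data.find(literal)
end = match_literal(data, start, literal)

print(literal)                           # the raw bytes the matcher compares
print(data[start:end].decode("utf-8"))   # what a UTF-8 editor would display
```

    The matcher never decodes anything; it only needs the source and
    the input to use the same encoding.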

    >2. Have the ability to return token in one or more transformation
    >format of Unicode

    I am not sure what you have in mind here: the Flex-generated lexer
    typically just returns an int, if anything. Other semantic data is
    returned imperatively in some state variable. One does that by hand,
    by writing explicit rules. The default rule "." matches a single
    byte, so under UTF-8 it would not match an arbitrary Unicode
    character; some extension might be needed.
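
    To make the "." problem concrete, here is a small Python sketch
    using byte patterns. The name UTF8_CHAR is made up; the byte ranges
    follow the well-formed UTF-8 byte sequence table in the Unicode
    Standard, with newline left out to mirror what "." does in Flex.

```python
import re

# One well-formed UTF-8 encoded character other than newline, as a byte pattern.
UTF8_CHAR = re.compile(
    rb"[\x00-\x09\x0B-\x7F]"                  # 1 byte (ASCII, no newline)
    rb"|[\xC2-\xDF][\x80-\xBF]"               # 2 bytes
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"           # 3 bytes, no overlong forms
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"    # 3 bytes
    rb"|\xED[\x80-\x9F][\x80-\xBF]"           # 3 bytes, no surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"        # 4 bytes, no overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"            # 4 bytes
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"        # 4 bytes, up to U+10FFFF
)

# A byte-oriented "." : any single byte except newline.
BYTE_DOT = re.compile(rb".")

# "a", e-acute, euro sign, and an emoji: 1, 2, 3 and 4 bytes in UTF-8.
data = "a\u00e9\u20ac\U0001F600".encode("utf-8")

byte_matches = BYTE_DOT.findall(data)
char_matches = UTF8_CHAR.findall(data)
print(len(byte_matches), len(char_matches))
```

    The byte-level "." splits the multi-byte characters into pieces,
    while the longer pattern consumes one whole character per match.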

    >3. Have the ability to handle some set of Unicode regular expression features
    >4. Have the ability to support programming language specific Unicode
    >'escape' sequence. ( \uHHHH, &#ddddd; &#xxxxx; \HHHHH , etc) The
    >lexer may not support it directly, but it should be able to let the
    >Lexer caller to define a way to deal with it.

    These are the extensions I addressed for Flex, i.e., translating
    Unicode character classes into byte regular expressions that match
    the encoded characters if the lexer input is in UTF-8/32.
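
    One possible way to do such a translation for UTF-8 is sketched
    below in Python. It is not necessarily the method used in the Flex
    extension mentioned above; it is the common range-splitting
    technique, and the function names utf8_sequences and
    utf8_byte_regex are made up for the example.

```python
def utf8_sequences(lo, hi):
    """Yield lists of (low_byte, high_byte) pairs, one pair per byte
    position, that together cover exactly the UTF-8 encodings of the
    code points lo..hi (inclusive)."""
    if lo > hi:
        return
    # Surrogates (U+D800..U+DFFF) have no UTF-8 encoding: step around them.
    if lo <= 0xDFFF and hi >= 0xD800:
        yield from utf8_sequences(lo, min(hi, 0xD7FF))
        yield from utf8_sequences(max(lo, 0xE000), hi)
        return
    # Split where the encoded length changes (1, 2, 3 or 4 bytes).
    for last in (0x7F, 0x7FF, 0xFFFF):
        if lo <= last < hi:
            yield from utf8_sequences(lo, last)
            yield from utf8_sequences(last + 1, hi)
            return
    if hi <= 0x7F:
        yield [(lo, hi)]
        return
    # Split until each continuation byte position is one contiguous range.
    for i in (1, 2, 3):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if lo & mask:
                yield from utf8_sequences(lo, lo | mask)
                yield from utf8_sequences((lo | mask) + 1, hi)
                return
            if (hi & mask) != mask:
                yield from utf8_sequences(lo, (hi & ~mask) - 1)
                yield from utf8_sequences(hi & ~mask, hi)
                return
    # lo and hi now encode to the same length; pair them up byte by byte.
    yield list(zip(chr(lo).encode("utf-8"), chr(hi).encode("utf-8")))


def utf8_byte_regex(lo, hi):
    """Return a byte-level regular expression, using \\xHH escapes,
    for the code point range lo..hi."""
    def cls(a, b):
        return f"\\x{a:02X}" if a == b else f"[\\x{a:02X}-\\x{b:02X}]"
    return "|".join(
        "".join(cls(a, b) for a, b in seq) for seq in utf8_sequences(lo, hi)
    )


# Example: the Greek and Coptic block, U+0370..U+03FF.
print(utf8_byte_regex(0x370, 0x3FF))
```

    A full character class such as "letter" would be the alternation of
    the expressions for each of its code point ranges.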

    >5. Use some Unicode based String data type as primitive datatype to
    >return the result in the token.

    Again, it is unclear what you mean here, as the lexer just returns
    the int token values indicated by hand in the rule actions.
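
    For readers unfamiliar with that convention, a toy Python sketch of
    it follows: the scanner returns only an int, and the matched text
    is left in a state variable, much as yytext is in Flex. The token
    codes, the class name and the rules are all hypothetical, and the
    sketch takes the first matching rule rather than the longest match.

```python
import re

# Hypothetical token codes; single bytes are returned as their own value.
IDENT, NUMBER = 258, 259


class Lexer:
    # (pattern, token code); None means "skip, produce no token".
    RULES = [
        (re.compile(rb"[A-Za-z_][A-Za-z_0-9]*"), IDENT),
        (re.compile(rb"[0-9]+"), NUMBER),
        (re.compile(rb"[ \t\n]+"), None),
    ]

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0
        self.text = b""   # state variable holding the last matched text

    def lex(self) -> int:
        """Return the next token code, or 0 at end of input."""
        while self.pos < len(self.data):
            for pattern, token in self.RULES:
                m = pattern.match(self.data, self.pos)
                if m:
                    self.pos = m.end()
                    if token is None:
                        break             # whitespace: scan again
                    self.text = m.group()
                    return token
            else:
                # No rule matched: hand back the single byte as the token.
                self.text = self.data[self.pos:self.pos + 1]
                self.pos += 1
                return self.text[0]
        return 0


lexer = Lexer(b"x1 42 +")
tokens = []
while (tok := lexer.lex()) != 0:
    tokens.append((tok, lexer.text))
print(tokens)
```

    Whether the text in the state variable is treated as UTF-8, UTF-16
    or anything else is decided by the rule actions, not by the lexer.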

    More advanced Unicode support might involve support for recognizing
    common Unicode character classes. For example, one might want to
    recognize letters, so that one can easily admit identifiers using

       Hans Aberg
