Byte-oriented lexer generator for Unicode

From: Hans Aberg (haberg@math.su.se)
Date: Fri Jan 21 2005 - 08:05:27 CST

    At 14:16 -0500 2005/01/20, Richard T. Gillam wrote:
    >As I understand it, the basic issue here is how to make a byte-oriented lexer
    >generator (Flex) Unicode-aware.  You want to be able to generate lexers that
    >can accept Unicode input and handle all of the different characters Unicode
    >defines.
    ...
    >The essential mental shift is to go from thinking of the lexer as processing
    >streams of BYTES to thinking of it as processing streams of CHARACTERS.   This
    >doesn't seem that hard, so I must be missing some essential part of the
    >argument.  What am I missing?

    I have looked this through more carefully. The main point missing is that
    you are focusing too hard on what the lexer should do, and too little on how
    the lexer generator (Flex) should be built. The lexer generator should allow
    the user to implement a Unicode-aware lexer, and it should do so as
    transparently as possible with respect to any specific Unicode encoding,
    even though the generated lexer will of course just process bytes and have
    no knowledge of Unicode.

    So, in the input grammar, one wants as far as possible to use only Unicode
    code points rather than encoding-specific data. However, we do not know in
    general whether the lexer will process only a single Unicode encoding, nor
    do we know, from this general standpoint, what the generated lexer should do
    when encountering illegal input: only the implementer of the lexer can know
    the latter (see below).

    >   The obvious way to do this would be to generate lexers that operate on
    >streams of Unicode code points rather than on streams of bytes.  Everything
    >gets defined in terms of Unicode, and any conversion between a legacy encoding
    >and Unicode (or between some UTF and the in-memory representation) is outside
    >the scope of the lexer (well, outside the scope of the generated part,
    >anyway).  

    So here one must make a careful distinction: what data the lexer is
    operating on logically, as presented by the input, and what data it is
    operating on at the machine level. The lexer need not be Unicode-aware at
    the machine level; there, it is sufficient that it just operates on bytes.

    In order to achieve this, I just convert the Unicode code point regular
    expressions into byte regular expressions. The idea is then to feed these
    into Flex as normal.

    Think of a Unicode code point regular expression. It may contain some
    character classes. Expand these code point character classes into
    expressions a_1|...|a_k. You then get a new regular expression in which each
    character stands alone. Apply a Unicode encoding, say UTF-8, to the
    characters of this regular expression. This produces a new regular
    expression. (In reality, I use a more efficient method; the one indicated
    here is just easy to understand theoretically.) Then feed that regular
    expression into Flex. This produces a lexer that recognizes the regular
    expressions in byte form. Do this for the whole input grammar. This new,
    transformed grammar must then be augmented with some rules recognizing
    invalid byte sequences. There are various ways to do that.
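
    As a concrete illustration of the naive expansion just described (not the
    more efficient method actually used), here is a minimal Python sketch; the
    function name and the Flex-style \xNN escapes are assumptions of the
    sketch.

      # Naive translation: expand a code point class into single code points,
      # encode each one in UTF-8, and emit an alternation of escaped byte
      # sequences that a byte-oriented generator like Flex could consume.
      def utf8_byte_alternation(code_points):
          alternatives = []
          for cp in code_points:
              encoded = chr(cp).encode("utf-8")
              # Write each byte as a \xNN escape.
              alternatives.append("".join("\\x%02X" % b for b in encoded))
          return "(" + "|".join(alternatives) + ")"

      # The class for U+00C0..U+00C3 becomes (\xC3\x80|\xC3\x81|\xC3\x82|\xC3\x83).
      print(utf8_byte_alternation(range(0x00C0, 0x00C4)))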

    Now, this is the simplest model. One can in fact mix character ranges for
    different Unicode encodings in the same grammar. One reason one might want
    to do this is to generate a lexer that works with different input encodings.
    When jotting down these character ranges in the input grammar, it is most
    convenient if they refer to Unicode code points. But one then needs to know
    which Unicode encoding they should be translated into at the byte level.
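
    To make the mixing of encodings concrete, here is a hedged variant of the
    same naive expansion, parameterized over the target encoding; the codec
    names are Python's, used here only for illustration.

      # The target encoding is chosen per character class, so ranges meant
      # for different encodings can coexist in one grammar.
      def byte_alternation(code_points, encoding):
          alts = ["".join("\\x%02X" % b for b in chr(cp).encode(encoding))
                  for cp in code_points]
          return "(" + "|".join(alts) + ")"

      print(byte_alternation(range(0x0100, 0x0102), "utf-8"))      # (\xC4\x80|\xC4\x81)
      print(byte_alternation(range(0x0100, 0x0102), "utf-16-be"))  # (\x01\x00|\x01\x01)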

    >The original poster doesn't want to do this because you go from having a
    >character-category table with 256 entries to having one with 1.1 million
    >entries.  Now instead of doing a simple (and fast) array lookup, you have to
    >compress the table somehow and do some sort of slower, more complicated
    >lookup.   My main reaction is So what?

    The reason to avoid this is that it takes a lot of work: the lexer generator
    needs to be given a rather thorough overhaul. By contrast, the method above
    seems to require relatively small changes to the original lexer generator.
    Which method to choose, as I see it, will only depend on how much effort the
    developer of the lexer generator is willing to invest relative to the speed
    and space of the generated lexer. If the lexer processes shorter binary
    units (like bytes), one gets more states, which may produce a slower lexer.
    But longer binary units may require table compression techniques, with
    slower table lookup in the generated lexer.
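
    For comparison, the code-point-oriented route needs something like the
    following compressed lookup; this is only a sketch of a standard two-level
    table, not of any particular generator's internals.

      # Two-level category table: split the 0x110000 code points into
      # 256-entry blocks, share identical blocks, and pay two array
      # indexings per lookup instead of one.
      BLOCK = 256

      def build_two_level(category_of):
          index, blocks, seen = [], [], {}
          for base in range(0, 0x110000, BLOCK):
              block = tuple(category_of(cp) for cp in range(base, base + BLOCK))
              if block not in seen:
                  seen[block] = len(blocks)
                  blocks.append(block)
              index.append(seen[block])
          return index, blocks

      def lookup(index, blocks, cp):
          return blocks[index[cp // BLOCK]][cp % BLOCK]

      # Toy categories: 1 for U+0041..U+005A, 0 for everything else.
      idx, blk = build_two_level(lambda cp: 1 if 0x41 <= cp <= 0x5A else 0)
      assert lookup(idx, blk, 0x41) == 1 and lookup(idx, blk, 0x1F600) == 0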

    >  *Every* process that looks up data based on character values has to go
    >through this transformation at some point.  Techniques for doing the
    >compression and the lookup abound, they're well-researched and
    >well-understood.  Converting may be kind of a pain in the butt, but there are
    >existing solutions out there, and they're plenty good enough for this kind of
    >thing.  There's no need to invent anything.  

    The problem here is not that this is unknown, but that one will have to
    spend a lot of time and effort. Therefore, looking for other, more
    straightforward solutions is desirable, at least for some uses.

    >But let's say that's too fundamental an architectural change and we want to
    >keep the internal tables of the lexers byte-oriented.  You could do this with
    >any of the Unicode encoding schemes, but it probably works best with UTF-8. 
    >Instead of feeding the lexer Unicode code points, you feed it UTF-8 code
    >units.  You get the same level of expressiveness, and the internals of the
    >lexer are still byte-oriented.  Basically, if you care about non-ASCII
    >characters, you're accepting larger, more complicated state tables to get
    >smaller, less complicated category tables.

    So this is what people have done in the past, by writing the byte
    expressions by hand. Therefore, I wrote some functions that do this
    translation automatically. It then does not matter which specific Unicode
    encoding (UTF-8/16/32) one chooses; any of them would work equally well.
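
    For reference, this is roughly what the hand-written UTF-8 byte expressions
    look like; the pattern below restates the RFC 3629 well-formedness ranges,
    and the cross-check against Python's own UTF-8 encoder is only an
    illustration.

      import re

      # Well-formed UTF-8 for one character, written as byte ranges by hand.
      UTF8_CHAR = re.compile(
          rb"[\x00-\x7F]"                          # U+0000..U+007F
          rb"|[\xC2-\xDF][\x80-\xBF]"              # U+0080..U+07FF
          rb"|\xE0[\xA0-\xBF][\x80-\xBF]"          # U+0800..U+0FFF
          rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"   # U+1000..U+CFFF, U+E000..U+FFFF
          rb"|\xED[\x80-\x9F][\x80-\xBF]"          # U+D000..U+D7FF (no surrogates)
          rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"       # U+10000..U+3FFFF
          rb"|[\xF1-\xF3][\x80-\xBF]{3}"           # U+40000..U+FFFFF
          rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}")      # U+100000..U+10FFFF

      for cp in (0x41, 0x3B1, 0x20AC, 0xD7FF, 0x10FFFF):
          assert UTF8_CHAR.fullmatch(chr(cp).encode("utf-8"))
      assert not UTF8_CHAR.fullmatch(b"\xC0\xAF")  # overlong encoding rejected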

    >  This is a reasonable tradeoff to consider.

    Right. This is in fact the main point. There are a lot of byte-oriented
    regular expression scanners out there, and this promises to give a simple
    method for extending them to various Unicode encodings. The input will then
    be in Unicode code points, with additional information about which encoding
    to lex.

    >   The problem seems to be how do you have a byte-oriented lexer deal with
    >invalid UTF-8 byte sequences?  The problem is that you have no way to express
    >those invalid byte sequences in the regular expressions you're generating the
    >lexer from, and so we're talking about using a nonstandard extension of UTF-8
    >to express the invalid byte sequences.  

    Actually, such an extension for invalid code points is of less importance in
    UTF-8 than in UTF-32, as one will somehow need to block out illegal
    multibyte sequences, which may not have any relation to a UTF-8-style
    extension of the Unicode code point range. Instead, focus on UTF-32, just to
    illustrate the situation. Say that \U... denotes the UTF-32 encoded
    character with ... as a hexadecimal number. Then the illegal UTF-32
    combinations outside the Unicode range can be written very intuitively (with
    a lexer action added):
      [\U110000-\UFFFFFFFF] error ...
    Of course, I could add another, new symbol for this illegal Unicode range.
    It is just a question of creating a suitable user interface. Keeping \U...
    also for illegal Unicode points enables one to generate handling outside the
    Unicode range; of course, it would then not be valid Unicode data.
    Alternatively, I could restrict \U to accept only valid Unicode code points,
    and create a new name for the extended version. Again, this is just a
    question of creating a suitable user interface in the lexer generator. These
    features can easily coexist side by side.
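
    A hedged sketch of how such an extended \U translation could work for
    UTF-32BE: any 32-bit value is simply written as four big-endian bytes,
    whether or not it is a valid code point, so that rules like the one above
    can be expressed. The function name is illustrative only.

      def utf32be_bytes(value):
          # Accept any 32-bit value, valid Unicode or not, and write it as
          # four big-endian \xNN escapes.
          if not 0 <= value <= 0xFFFFFFFF:
              raise ValueError("not representable in 32 bits")
          return "".join("\\x%02X" % b for b in value.to_bytes(4, "big"))

      print(utf32be_bytes(0x000041))   # \x00\x00\x00\x41  (valid code point)
      print(utf32be_bytes(0x110000))   # \x00\x11\x00\x00  (outside the Unicode range)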

    >If I'm understanding any of this correctly, I would argue that's the wrong way
    >to look at it.  This puts knowledge of the character encoding into the lexer's
    >tables. 

    This does not put any more character encoding data into the lexer than any
    of the other translation work does. The lexer will, clearly, be designed to
    handle just specific encodings, but will otherwise, at the machine level,
    only process bytes.

    >You don't want that because it ties you to one encoding, whether it be
    >Unicode-based or not.  You want the internal tables of the lexer to be
    >encoding-independent.  They should be able to depend on getting input that is
    >well-formed in terms of whatever encoding the tables use-- if we're talking
    >Unicode, the tables should be able to depend on getting well-formed UTF-8 or a
    >sequence of valid Unicode code point values. 

    You must be careful with the terminology: "encoding-independent" would
    require (see below, though) the state tables to operate on Unicode code
    points alone. If the lexer processes, say, UTF-8, then it is already
    specific to one encoding. Note, however, that a lexer that processes a
    single Unicode encoding can easily be made to process another Unicode
    encoding, or Unicode code points, via an input code converter. Which method
    to use is then just a matter of practical concerns. This would not work if
    one wants to produce a lexer able to switch between different input
    encodings without external converters (even though one could put context
    switches on a multi-code converter and change the input conversion method
    from the lexer).
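
    A minimal sketch of such an input code converter, assuming the generated
    lexer expects UTF-8 bytes; the function name is hypothetical and the codec
    machinery is Python's, used only for illustration.

      import codecs

      def transcode_to_utf8(raw, source_encoding, chunk_size=4096):
          # Incrementally decode the source encoding and re-encode it as
          # UTF-8 chunks for a byte-oriented lexer to consume.
          decoder = codecs.getincrementaldecoder(source_encoding)()
          for start in range(0, len(raw), chunk_size):
              text = decoder.decode(raw[start:start + chunk_size])
              if text:
                  yield text.encode("utf-8")
          tail = decoder.decode(b"", final=True)
          if tail:
              yield tail.encode("utf-8")

      utf16_input = "häst".encode("utf-16-le")
      assert b"".join(transcode_to_utf8(utf16_input, "utf-16-le")) == "häst".encode("utf-8")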

    >Producing that well-formed sequence should be the responsibility of something
    >outside the lexer. 

    That is the first naive approach I thought about. But in view of the
    translation method above, I thought it might be useful to be able to use the
    lexer to detect that a sequence is well-formed. In an actual application,
    one might of course pipe lexers, where the first is just a specialized tool
    that detects whether the input encoding is correct.

    >Or, to put it another way, the grammar the lexer interprets should be
    >independent of the scheme used to encode it; it's in characters, not bytes. 

    As I noted above, this is just a question of creating a suitable user
    interface for the lexer generator. Suppose, for example, that the input
    grammar file is assumed to be in UTF-8. Then, if one scans for a UTF-8
    string "...", no change is needed, as the string is already in UTF-8. If one
    wants it to be matched in, say, UTF-32, then one must somehow indicate that
    in the grammar, say as \U"...". The lexer generator must then first convert
    the "...", which is input in UTF-8, to Unicode code points, and then convert
    those into a UTF-32 regular expression. A similar thing happens with
    character classes.
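
    A minimal sketch of that conversion step, assuming the grammar file is
    UTF-8 and the target is UTF-32BE; the helper name is hypothetical.

      def utf8_literal_to_utf32be_pattern(literal_bytes):
          code_points = literal_bytes.decode("utf-8")        # UTF-8 -> code points
          return "".join("\\x%02X" % b
                         for ch in code_points
                         for b in ch.encode("utf-32-be"))    # code points -> UTF-32BE

      # Under this scheme the grammar text \U"hä" would become the byte
      # pattern \x00\x00\x00\x68\x00\x00\x00\xE4.
      print(utf8_literal_to_utf32be_pattern("hä".encode("utf-8")))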

    > If you look at it this way, the regular expressions that the lexer generator
    >interprets can be plain old Unicode (in whatever UTF);

    With the method indicated above, one must in addition know which Unicode
    encoding to use. This need only be character-class-specific information,
    though.

    >there's no need to worry about things Unicode can't express.  

    With the method above, the lexer still has to be able to catch erroneous
    input byte sequences (unless one uses some kind of input filter). This is
    just a question of producing a suitable user interface.

    >This means that you need something outside the state-machine mechanism that
    >transforms input from whatever encoding it's in to whatever encoding the lexer
    >uses, or that at least ensures that the incoming byte sequence is indeed
    >already in the encoding the lexer uses. 

    So if the generated lexer were built up around states over Unicode code
    points, that would be necessary. But with the method indicated above, it is
    not needed in general, as the lexer itself can be made to handle that.

    >If the input is malformed at the encoding level, this layer signals an error
    >before the lexer ever sees the offending byte sequence and maybe passes
    >through some sort of valid byte sequence (e.g., U+FFFD) to the lexer.  This
    >outside layer is responsible for byte-swapping, and it's responsible for
    >skipping a leading BOM.   You can delegate all this to an outside layer
    >because these are requirements of the encoding itself, not the grammar you're
    >trying to tokenize. 

    One can build it up that way, but with the method indicated above, it is not
    necessary. As for the regular expression translation functions I wrote, I
    first made them generate errors for invalid values such as 0xFFFF, but that
    turned out just to be a problem. So the simplest way is to admit those
    erroneous numbers in the translation functions, and then add on top of that
    a way to capture erroneous Unicode values and sequences in the lexer
    generator's user interface.

    >If you purport to handle some encoding-- Unicode or something else-- that
    >implies a fixed set of well-formedness rules irrespective of the thing in that
    >encoding you're actually trying to parse.  The actual grammar shouldn't have
    >to include or restate these rules, and it shouldn't be allowed to override
    >them. 

    This depends on what kind of lexer one wants to produce. The method I
    indicated above will always be locally encoding-specific, whatever multibyte
    encoding one chooses. But such a lexer can be made encoding-independent by
    the use of suitable code converters.

    It is perhaps prudent to make a distinction (following common mathematical
    usage for coordinate systems) between an encoding-free lexer, which is one
    that lexes Unicode code points directly, and an encoding-independent lexer,
    which is one that lexes a specific encoding but uses code converters or
    other means (like, say, grammar start conditions) to become multi-encoding
    compatible. Using this terminology, it seems that you propose an
    encoding-free lexer generator and lexers, whereas the method I indicate
    would at best generate encoding-independent lexers. But that should suffice.

    >(Creating a lexer that interprets illegal UTF-8 byte sequences is, by
    >definition, not creating a lexer that interprets UTF-8-- instead it's creating
    >one that interprets some nonstandard UTF-8-like encoding.)  

    Right. But this is a question for the writer of the lexer, not a question of
    how the lexer generator should be implemented. The lexer generator should
    prudently supply a suitable user interface, making it easy for the lexer
    writer to sift out illegal UTF-8 byte sequences, if that is what is being
    processed.

      Hans Aberg


