RE: Byte-oriented lexer generator for Unicode

From: Richard T. Gillam (
Date: Fri Jan 21 2005 - 12:59:35 CST

  • Next message: Antoine Leca: "Re: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)"


    After reading this email, I'm firmly convinced that I understand what you're trying to do and why, and I stand by my assertions in my last note. Let me start with a few small points:

    >The reason to avoid this is that it takes a lot of work: The lexer generator needs to be given a rather thorough

    I don't think it's as thorough as you seem to think it is, but since I have no familiarity with the code, I could well be wrong about this.

    >If the lexer processes shorter binary units (like bytes), one gets more states, which may produce a slower lexer. But
    >longer binary units may require table compressions techniques, with slower table lookup in the generated lexer.

    I think the differences in execution speed between these two approaches will tend to come out in the wash, with one or the other being ahead depending on the exact mature of the lexer being generated. Same is generally true of memory footprint.

    >>Converting [to tables designed for 21-bit values] may be kind of a pain in the butt, but
    >>there are existing solutions out there, and they're plenty good enough
    >>for this kind of thing.  There's no need to invent anything.
    >The problem is not here that it is not known, but that one will have to spend a lot of time and effort. So therefore,
    >looking for other, more straightforward solutions is desirable, at least for some uses.

    This is flat wrong. Solutions to the problem ARE well know and are implemented in a fair amount of software. You might have to modify some stuff a bit for your particular problem domain, but the solutions are well understood. Take a look at the ICU source code, particularly the character-property-lookup code and the BreakIterator code. Take a look at Chapter 13 of my book. There is no need to break new ground here.

    As for pretty much everything else you say here, I stick to my original assertion: You have to make a fundamental shift from thinking about the lexers you generate as processing streams of BYTES to thinking of them as processing streams of CHARACTERS. You're not making that shift, and you're failing to explain to me why it's inappropriate to tall you to make that shift.

    As a rule, lexical analyzers operate on text. That's what they're for, and that's why it's appropriate to think of the generated code as operating on characters. Yeah, I can think of some situations where you might write lexers for other types of binary data, but most of the time you're working on text. Text isn't simply an arbitrary stream of bytes, but a stream of bytes that adheres to the syntactic and semantic rules of some character encoding standard. This means a particular input stream can be one of three things:

    - Well-formed
    - Malformed with respect to the character encoding
    - Malformed with respect to the grammar

    The only thing the designer of a grammar should have to worry about is whether or not the input stream is well-formed with respect to his intended grammar. He should not have to design the grammar to make sure the input is also well-formed with respect to the character encoding. This is extra unnecessary work, and it makes it much harder to adapt the same grammar to some other encoding or use the same grammar with input in any of several different encodings.

    The syntacic and semantic rules for a given character encoding are fixed for that encoding-- they apply to all character streams that use that encoding. They should not have to be restated by every grammar that passes through your lexer generator. The lexer generator should just know about them.

    What someone using your lexer generator might care about is what to do when the lexer encounters data that is malformed with respect to the character encoding. DO you stop with an error, throw out the offending bytes, or substitute something valid for the offending bytes? You probably still want to have a method for the user to control this, but you don't do that in the grammar rules. Generally you do that by having callback hooks in the generated code that the client can use to supply the appropriate behavior.

    Yes, the guts of the lexer have to operate on bytes. But it can operate on byte sequences that it already knows to be well-formed with respect to some particular encoding, and that encoding is private to the implementation. No one outside need care. The point is that neither the grammar rules nor the code generated from them should have to care about invalid byte sequences in the input encoding.

    This means another conceptual stage. The input stream has to go through a transcoding phase before being parsed. IF you're actually transcoding (i.e., handling input in more than one encoding scheme, including different flavors of Unicode), yes, this means another table lookup. If you don't want to transcode, but merely make sure the code is valid for whatever encoding standard you support, you can build the appropriate grammatical rules into the tables your lexer generator generates and still do everything in one pass. But those rules are fixed; they're not provided by the user supplying the grammar. [You can get rid of this machinery in situations where the user really wants to operate on arbitrary bytes, but I don't think that's your high-runner case.]

    The bottom line: Do not saddle the developers of the grammar you parse with the job of telling you how to handle the details of any character encodings, Unicode or otherwise. Let them concentrate on the grammar of the language they're trying to parse. You either do this by using a preexisting transcoding framework, such as the one in ICU, or you do it by mandating that the input to a generated lexer be in UTF-8, adding canned rules to every grammar that enforce UTF-8's well-formedness constraints, and require the application programmer to use a transcoding framework such as ICU. For those users who WANT to process arbitrary bytes, you can take out the canned rules, and you don't need to do anything special. What you have right now would work fine.

    Finally, as I said before, don't waste time worrying about preparing for the day when Unicode runs out of room. It won't. If you know you're dealing with text, you can exclude values above 0x10FFFF or treat them as errors.

    Characters, not bytes.

    --Rich Gillam
      Language Analysis Systems

    This archive was generated by hypermail 2.1.5 : Fri Jan 21 2005 - 13:05:24 CST