Re: Byte-oriented lexer generator for Unicode

From: Hans Aberg (haberg@math.su.se)
Date: Sat Jan 22 2005 - 12:54:17 CST


    On 2005/01/21 19:59, Richard T. Gillam at rgillam@las-inc.com wrote:

    >> The reason to avoid this is that it takes a lot of work: The lexer generator
    >> needs to be given a rather thorough
    >> overhaul.
    >
    > I don't think it's as thorough as you seem to think it is, but since I have no
    > familiarity with the code, I could well be wrong about this.

    This is quoted out of context, but the context is that the lexer generator
    already takes regular expressions as input, and it seems easy to hook the
    regular expressions obtained by translating Unicode character classes onto
    that. Besides, people have already used this method by writing such regular
    expressions by hand.
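
    For illustration, the hand-written form of that translation might look
    roughly like the following Flex-style definitions, one byte pattern per
    UTF-8 sequence length (a sketch only; the names UTF8_1 .. UTF8CHAR are made
    up for this example, and NUL is left aside):

        UTF8_1    [\x01-\x7f]
        UTF8_2    [\xc2-\xdf][\x80-\xbf]
        UTF8_3    \xe0[\xa0-\xbf][\x80-\xbf]|[\xe1-\xec\xee\xef][\x80-\xbf]{2}|\xed[\x80-\x9f][\x80-\xbf]
        UTF8_4    \xf0[\x90-\xbf][\x80-\xbf]{2}|[\xf1-\xf3][\x80-\xbf]{3}|\xf4[\x80-\x8f][\x80-\xbf]{3}
        UTF8CHAR  {UTF8_1}|{UTF8_2}|{UTF8_3}|{UTF8_4}

    A generator that does the translation itself would simply emit expressions
    of this kind from the character ranges given in the grammar.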

    >> If the lexer processes shorter binary units (like bytes), one gets more
    >> states, which may produce a slower lexer. But
    >> longer binary units may require table compression techniques, with slower
    >> table lookup in the generated lexer.
    >
    > I think the differences in execution speed between these two approaches will
    > tend to come out in the wash, with one or the other being ahead depending on
    > the exact nature of the lexer being generated. Same is generally true of
    > memory footprint.

    Right. This is a question to be determined by profiling. It will depend on
    the application.
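
    As one data point, a related trade-off can already be measured in Flex via
    its table options, which trade table size against lookup speed (a sketch;
    the option names are quoted from memory of the Flex manual, so check your
    version):

        %option full      /* uncompressed tables: larger, usually faster lookups    */
        /* %option ecs       equivalence classes: smaller tables, extra indirection */
        /* %option meta-ecs  meta-equivalence classes: smallest, slowest lookups    */
        %option noyywrap
        %%
        .|\n    ;         /* the rules are unchanged; only the table encoding differs */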

    >>> Converting [to tables designed for 21-bit values] may be kind of a pain in
    >>> the butt, but
    >>> there are existing solutions out there, and they're plenty good enough
    >>> for this kind of thing.  There's no need to invent anything.
    >>
    >> The problem here is not that it is not known, but that one will have to spend
    >> a lot of time and effort. Therefore,
    >> looking for other, more straightforward solutions is desirable, at least for
    >> some uses.
    >
    > This is flat wrong. Solutions to the problem ARE well known and are
    > implemented in a fair amount of software. You might have to modify some stuff
    > a bit for your particular problem domain, but the solutions are well
    > understood. Take a look at the ICU source code, particularly the
    > character-property-lookup code and the BreakIterator code. Take a look at
    > Chapter 13 of my book. There is no need to break new ground here.

    Perhaps you did not see what I wrote. I just don't want to stick my nose
    into that stuff if it is not needed.

    > As for pretty much everything else you say here, I stick to my original
    > assertion: You have to make a fundamental shift from thinking about the lexers
    > you generate as processing streams of BYTES to thinking of them as processing
    > streams of CHARACTERS. You're not making that shift, and you're failing to
    > explain to me why it's inappropriate to tell you to make that shift.

    The situation is the same as with all runtime information: a computer
    language like C generally throws away the type information in the source
    code, in the name of efficiency. If, however, one builds a runtime
    polymorphic (dynamic) system, then one has to carry the type information
    over into the runtime system.

    It is not necessary to build lexers in which the character information is
    carried over into the runtime system, but of course one can do so, if that
    falls within the programmer's objectives.

    > As a rule, lexical analyzers operate on text. That's what they're for, ...

    They can in fact be used to operate on binary data as well; some are. If
    your lexer generator will only ever operate on Unicode characters, then you
    are right, but not otherwise.
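
    For instance, here is a tiny Flex grammar that inspects binary input rather
    than text (a sketch; the pattern is just the well-known ELF magic number,
    and NUL bytes would need extra care):

        %option noyywrap
        %%
        \x7fELF     { printf("looks like an ELF object file\n"); return 0; }
        .|\n        ;                 /* skip any other byte */
        %%
        int main(void) { return yylex(); }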

    >...and
    > that's why it's appropriate to think of the generated code as operating on
    > characters.

    As in the case of C, the runtime model might be quite different from the
    mental image of the writer.

    > Text isn't simply an arbitrary stream of bytes, but a stream of bytes that
    > adheres to the syntactic and semantic rules of some character encoding
    > standard. This means a particular input stream can be one of three things:
    >
    > - Well-formed
    > - Malformed with respect to the character encoding
    > - Malformed with respect to the grammar

    You are focusing here on a lexer generator that will be used exclusively for
    processing Unicode data in a single encoding. A program like Flex is not
    like that; it is to be used for all sorts of applications.

    > The only thing the designer of a grammar should have to worry about is whether
    > or not the input stream is well-formed with respect to his intended grammar.

    This is an idealistic view that does not conform to actual programming
    needs, or even practice. Some programmers prefer to write by hand various
    things that could be done automatically, for various reasons: efficiency,
    or simply because that is what they are used to. One example of the latter
    from Flex is line-number tracking, which can be automated, but which some
    programmers do by writing it explicitly into the grammar, which then has to
    be modified to accommodate it. Even if it were possible to enforce that
    this is always done automatically, that would not be acceptable in a
    program like Flex, since some programmers insist on doing it by hand.
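
    Flex can track line numbers automatically with %option yylineno; the
    hand-written style referred to above looks roughly like this (a sketch;
    line_num is a name made up for the example):

        %option noyywrap
        %{
            int line_num = 1;            /* maintained by hand in the grammar */
        %}
        %%
        \n          { ++line_num; }
        [^\n]+      ;                    /* other rules must leave the count alone */
        %%
        int main(void) { yylex(); printf("%d lines\n", line_num); return 0; }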

    > He should not have to design the grammar to make sure the input is also
    > well-formed with respect to the character encoding. This is extra unnecessary
    > work, and it makes it much harder to adapt the same grammar to some other
    > encoding or use the same grammar with input in any of several different
    > encodings.

    So this is one programming style, the one you are focusing on. It would be
    too limiting for the scope of a program like Flex.

    > The syntactic and semantic rules for a given character encoding are fixed for
    > that encoding-- they apply to all character streams that use that encoding.
    > They should not have to be restated by every grammar that passes through your
    > lexer generator. The lexer generator should just know about them.

    But one might write programs that process more than one input encoding,
    and there is more than one way to handle malformed input code.

    > What someone using your lexer generator might care about is what to do when
    > the lexer encounters data that is malformed with respect to the character
    > encoding. Do you stop with an error, throw out the offending bytes, or
    > substitute something valid for the offending bytes? You probably still want
    > to have a method for the user to control this, but you don't do that in the
    > grammar rules. Generally you do that by having callback hooks in the
    > generated code that the client can use to supply the appropriate behavior.

    From the general point of view of writing a lexer generator, it is
    important to leave the choice open to the writer of the lexer, and it is
    often good to be able to handle all of this from a single input grammar
    file. What you describe here is one, high-level, use; but one cannot
    exclude low-level uses.
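
    All three of the policies you list can in fact be expressed in the grammar
    file itself, in the rule that catches bytes not matched by any well-formed
    pattern (a sketch; the UTF-8 definitions from the earlier fragment are
    assumed where the placeholder comment stands):

        /* UTF8_1 .. UTF8CHAR: the definitions from the fragment further up */
        %option noyywrap
        %%
        {UTF8CHAR}+   ECHO;
        .             {
                          /* a byte that is not part of well-formed UTF-8: pick a policy */
                          /*   stop:  fprintf(stderr, "bad byte\n"); exit(1);            */
                          /*   drop:  do nothing                                         */
                          fputs("\xef\xbf\xbd", yyout);  /* substitute U+FFFD            */
                      }
        %%
        int main(void) { return yylex(); }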

    > Yes, the guts of the lexer have to operate on bytes. But it can operate on
    > byte sequences that it already knows to be well-formed with respect to some
    > particular encoding, and that encoding is private to the implementation. No
    > one outside need care. The point is that neither the grammar rules nor the
    > code generated from them should have to care about invalid byte sequences in
    > the input encoding.

    Again, this is one possibility. But the lexer will detect any sequence not
    positively accepted by an input grammar rule as an error, so this is not
    needed.
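
    Flex even has a knob for exactly this behaviour (a sketch; again the UTF-8
    definitions from the earlier fragment are assumed):

        /* UTF8CHAR: the well-formed UTF-8 definition from the fragment above.  */
        /* With nodefault, input not matched by any rule makes the scanner stop */
        /* with an error instead of being silently echoed.                      */
        %option nodefault noyywrap
        %%
        {UTF8CHAR}+   ECHO;
        %%
        int main(void) { return yylex(); }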

    > This means another conceptual stage. The input stream has to go through a
    > transcoding phase before being parsed. If you're actually transcoding (i.e.,
    > handling input in more than one encoding scheme, including different flavors
    > of Unicode), yes, this means another table lookup. If you don't want to
    > transcode, but merely make sure the code is valid for whatever encoding
    > standard you support, you can build the appropriate grammatical rules into the
    > tables your lexer generator generates and still do everything in one pass.
    > But those rules are fixed; they're not provided by the user supplying the
    > grammar. [You can get rid of this machinery in situations where the user
    > really wants to operate on arbitrary bytes, but I don't think that's your
    > high-runner case.]

    Again, there is more than one programming approach. It will depend on the
    application and on what the programmer prefers. Note also that a stream
    might use mixed input encodings; your approach seems to assume that there
    is only one.
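
    The mixed-encoding case can itself be handled from one grammar, e.g. with
    Flex start conditions (a sketch; the in-band markers ##latin1/##utf8 and
    the state name LATIN1 are made up for the example):

        %option noyywrap
        %x LATIN1
        %%
        "##latin1"\n        BEGIN(LATIN1);
        .|\n                ECHO;            /* initial state: pass UTF-8 through */
        <LATIN1>"##utf8"\n  BEGIN(INITIAL);
        <LATIN1>[\x80-\xff] {                /* transcode one Latin-1 byte by hand */
                              unsigned char c = (unsigned char)yytext[0];
                              fputc(0xc0 | (c >> 6), yyout);
                              fputc(0x80 | (c & 0x3f), yyout);
                            }
        <LATIN1>.|\n        ECHO;
        %%
        int main(void) { return yylex(); }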

    > The bottom line: Do not saddle the developers of the grammar you parse with
    > the job of telling you how to handle the details of any character encodings,
    > Unicode or otherwise. Let them concentrate on the grammar of the language
    > they're trying to parse.

    Programmers may want to choose different approaches because of the
    application at hand or because of personal preferences.

    >You either do this by using a preexisting
    > transcoding framework, such as the one in ICU, or you do it by mandating that
    > the input to a generated lexer be in UTF-8, adding canned rules to every
    > grammar that enforce UTF-8's well-formedness constraints, and require the
    > application programmer to use a transcoding framework such as ICU. For those
    > users who WANT to process arbitrary bytes, you can take out the canned rules,
    > and you don't need to do anything special. What you have right now would work
    > fine.

    With the approach I have indicated, there is no problem in producing a user
    interface that restricts the input to any specific paradigm: one merely
    provides ready-made rule expressions that the user can restrict themselves
    to in order to achieve this effect. If one wants to process Unicode
    characters as such, then on the machine level that will in effect be UTF-32
    (in the endianness of the platform).

    > Characters, not bytes.

    You are focusing entirely on one type of application: those that process
    Unicode in a single encoding (i.e., one that does not change within the
    stream) and need no sophisticated error handling when malformed code is
    encountered. This approach is probably very good for a certain group of
    applications, where one wants to focus on what the lexer should do and not
    bother with what happens to malformed code. But it is too restrictive from
    the point of view of what a lexer generator like Flex should be able to
    handle.


