Byte-oriented lexer generator for Unicode

From: Hans Aberg (haberg@math.su.se)
Date: Fri Jan 21 2005 - 08:05:27 CST

    At 14:16 -0500 2005/01/20, Richard T. Gillam wrote:
    >As I understand it, the basic issue here is how to make a byte-oriented lexer
    >generator (Flex) Unicode-aware.  You want to be able to generate lexers that
    >can accept Unicode input and handle all of the different characters Unicode
    >defines.
    ...
    >The essential mental shift is to go from thinking of the lexer as processing
    >streams of BYTES to thinking of it as processing streams of CHARACTERS.   This
    >doesn't seem that hard, so I must be missing some essential part of the
    >argument.  What am I missing?

    I have looked this through more carefully. The main point missing is that
    you are focusing too hard on what the lexer should do, and too little on how
    the lexer generator (Flex) should be built. The lexer generator should allow
    the user to implement a Unicode-aware lexer, and it should do so as
    transparently as possible with respect to any specific Unicode encoding,
    even though the generated lexer will of course just process bytes and have
    no knowledge of Unicode.

    So, in the input grammar, one wants as far as possible to use only Unicode
    code points rather than encoding-specific data. However, we do not know in
    general whether the lexer will process only a single Unicode encoding, nor
    do we know, from this general standpoint, what the generated lexer should do
    when encountering illegal input: only the implementer of the lexer can know
    the latter (see below).

    >   The obvious way to do this would be to generate lexers that operate on
    >streams of Unicode code points rather than on streams of bytes.  Everything
    >gets defined in terms of Unicode, and any conversion between a legacy encoding
    >and Unicode (or between some UTF and the in-memory representation) is outside
    >the scope of the lexer (well, outside the scope of the generated part,
    >anyway).  

    So here one must make a careful distinction: what data the lexer is
    operating on logically, as presented by the input, and what data it is
    operating on at the machine level. The lexer need not be Unicode-aware at
    the machine level; there, it is sufficient that it just operates on bytes.

    In order to achieve this, I just convert the Unicode code point regular
    expressions into byte regular expressions. The idea is then to feed these
    into Flex as normal.

    Think of a Unicode code point regular expression. It may contain some
    character classes. Expand these code point character classes into
    expressions a_1|...|a_k. You then get a new regular expression in which each
    character stands alone. Apply a Unicode encoding, say UTF-8, to the
    characters of this regular expression. This produces a new regular
    expression. (In reality, I use a more efficient method; the one indicated
    here is just easy to understand theoretically.) Then feed that regular
    expression into Flex. This produces a lexer that recognizes the regular
    expressions in byte form. Do this for the whole input grammar. This new,
    transformed grammar must then be augmented with some rules recognizing
    invalid byte sequences. There are various ways to do that.
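
    As a concrete illustration of the naive expansion just described (not the
    more efficient method actually used), here is a minimal Python sketch; the
    function name and the Flex-style \xNN escapes are assumptions of the
    sketch.

      # Naive translation: expand a code point class into single code points,
      # encode each one in UTF-8, and emit an alternation of escaped byte
      # sequences that a byte-oriented generator like Flex could consume.
      def utf8_byte_alternation(code_points):
          alternatives = []
          for cp in code_points:
              encoded = chr(cp).encode("utf-8")
              # Write each byte as a \xNN escape.
              alternatives.append("".join("\\x%02X" % b for b in encoded))
          return "(" + "|".join(alternatives) + ")"

      # The class for U+00C0..U+00C3 becomes (\xC3\x80|\xC3\x81|\xC3\x82|\xC3\x83).
      print(utf8_byte_alternation(range(0x00C0, 0x00C4)))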

    Now, this is the simplest model. One can in fact mix character ranges for
    different Unicode encodings in the same grammar. One reason one might want
    to do this is to generate a lexer that works with different input encodings.
    When jotting down these character ranges in the input grammar, it is most
    convenient if they refer to Unicode code points. But one then needs to know
    which Unicode encoding they should be translated into at the byte level.
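
    To make the mixing of encodings concrete, here is a hedged variant of the
    same naive expansion, parameterized over the target encoding; the codec
    names are Python's, used here only for illustration.

      # The target encoding is chosen per character class, so ranges meant
      # for different encodings can coexist in one grammar.
      def byte_alternation(code_points, encoding):
          alts = ["".join("\\x%02X" % b for b in chr(cp).encode(encoding))
                  for cp in code_points]
          return "(" + "|".join(alts) + ")"

      print(byte_alternation(range(0x0100, 0x0102), "utf-8"))      # (\xC4\x80|\xC4\x81)
      print(byte_alternation(range(0x0100, 0x0102), "utf-16-be"))  # (\x01\x00|\x01\x01)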

    >The original poster doesn't want to do this because you go from having a
    >character-category table with 256 entries to having one with 1.1 million
    >entries.  Now instead of doing a simple (and fast) array lookup, you have to
    >compress the table somehow and do some sort of slower, more complicated
    >lookup.   My main reaction is So what?

    The reason to avoid this is that it takes a lot of work: the lexer generator
    needs to be given a rather thorough overhaul. By contrast, the method above
    seems to require relatively small changes to the original lexer generator.
    Which method to choose, as I see it, will only depend on how much effort the
    developer of the lexer generator is willing to invest relative to the speed
    and space of the generated lexer. If the lexer processes shorter binary
    units (like bytes), one gets more states, which may produce a slower lexer.
    But longer binary units may require table compression techniques, with
    slower table lookup in the generated lexer.
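
    For comparison, the code-point-oriented route needs something like the
    following compressed lookup; this is only a sketch of a standard two-level
    table, not of any particular generator's internals.

      # Two-level category table: split the 0x110000 code points into
      # 256-entry blocks, share identical blocks, and pay two array
      # indexings per lookup instead of one.
      BLOCK = 256

      def build_two_level(category_of):
          index, blocks, seen = [], [], {}
          for base in range(0, 0x110000, BLOCK):
              block = tuple(category_of(cp) for cp in range(base, base + BLOCK))
              if block not in seen:
                  seen[block] = len(blocks)
                  blocks.append(block)
              index.append(seen[block])
          return index, blocks

      def lookup(index, blocks, cp):
          return blocks[index[cp // BLOCK]][cp % BLOCK]

      # Toy categories: 1 for U+0041..U+005A, 0 for everything else.
      idx, blk = build_two_level(lambda cp: 1 if 0x41 <= cp <= 0x5A else 0)
      assert lookup(idx, blk, 0x41) == 1 and lookup(idx, blk, 0x1F600) == 0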

    >  *Every* process that looks up data based on character values has to go
    >through this transformation at some point.  Techniques for doing the
    >compression and the lookup abound, they're well-researched and
    >well-understood.  Converting may be kind of a pain in the butt, but there are
    >existing solutions out there, and they're plenty good enough for this kind of
    >thing.  There's no need to invent anything.  

    The problem here is not that this is unknown, but that one will have to
    spend a lot of time and effort. Therefore, looking for other, more
    straightforward solutions is desirable, at least for some uses.

    >But let's say that's too fundamental an architectural change and we want to
    >keep the internal tables of the lexers byte-oriented.  You could do this with
    >any of the Unicode encoding schemes, but it probably works best with UTF-8. 
    >Instead of feeding the lexer Unicode code points, you feed it UTF-8 code
    >units.  You get the same level of expressiveness, and the internals of the
    >lexer are still byte-oriented.  Basically, if you care about non-ASCII
    >characters, you're accepting larger, more complicated state tables to get
    >smaller, less complicated category tables.

    So this is what people have done in the past, by writing the byte
    expressions by hand. Therefore, I wrote some functions that do this
    translation automatically. It then does not matter which specific Unicode
    encoding (UTF-8/16/32) one chooses; any of them would work equally well.
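
    For reference, this is roughly what the hand-written UTF-8 byte expressions
    look like; the pattern below restates the RFC 3629 well-formedness ranges,
    and the cross-check against Python's own UTF-8 encoder is only an
    illustration.

      import re

      # Well-formed UTF-8 for one character, written as byte ranges by hand.
      UTF8_CHAR = re.compile(
          rb"[\x00-\x7F]"                          # U+0000..U+007F
          rb"|[\xC2-\xDF][\x80-\xBF]"              # U+0080..U+07FF
          rb"|\xE0[\xA0-\xBF][\x80-\xBF]"          # U+0800..U+0FFF
          rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"   # U+1000..U+CFFF, U+E000..U+FFFF
          rb"|\xED[\x80-\x9F][\x80-\xBF]"          # U+D000..U+D7FF (no surrogates)
          rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"       # U+10000..U+3FFFF
          rb"|[\xF1-\xF3][\x80-\xBF]{3}"           # U+40000..U+FFFFF
          rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}")      # U+100000..U+10FFFF

      for cp in (0x41, 0x3B1, 0x20AC, 0xD7FF, 0x10FFFF):
          assert UTF8_CHAR.fullmatch(chr(cp).encode("utf-8"))
      assert not UTF8_CHAR.fullmatch(b"\xC0\xAF")  # overlong encoding rejected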

    >  This is a reasonable tradeoff to consider.

    Right. This is in fact the main point. There are a lot of byte-oriented
    regular expression scanners out there, and this promises to give a simple
    method for extending them to various Unicode encodings. The input will then
    be in Unicode code points, with additional information about which encoding
    to lex.

    >   The problem seems to be how do you have a byte-oriented lexer deal with
    >invalid UTF-8 byte sequences?  The problem is that you have no way to express
    >those invalid byte sequences in the regular expressions you're generating the
    >lexer from, and so we're talking about using a nonstandard extension of UTF-8
    >to express the invalid byte sequences.  

    Actually, such an extension for invalid code points is of less importance in
    UTF-8 than in UTF-32, as one will somehow need to block out illegal
    multibyte sequences, which may not have any relation to a UTF-8-style
    extension of the Unicode code point range. Instead, focus on UTF-32, just to
    illustrate the situation. Say that \U... denotes the UTF-32 encoded
    character with ... as a hexadecimal number. Then the illegal UTF-32
    combinations outside the Unicode range can be written very intuitively (with
    a lexer action added):
      [\U110000-\UFFFFFFFF] error ...
    Of course, I could add another, new symbol for this illegal Unicode range.
    It is just a question of creating a suitable user interface. Keeping \U...
    also for illegal Unicode points enables one to generate handling outside the
    Unicode range; of course, it would then not be valid Unicode data.
    Alternatively, I could restrict \U to accept only valid Unicode code points,
    and create a new name for the extended version. Again, this is just a
    question of creating a suitable user interface in the lexer generator. These
    features can easily coexist side by side.
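
    A hedged sketch of how such an extended \U translation could work for
    UTF-32BE: any 32-bit value is simply written as four big-endian bytes,
    whether or not it is a valid code point, so that rules like the one above
    can be expressed. The function name is illustrative only.

      def utf32be_bytes(value):
          # Accept any 32-bit value, valid Unicode or not, and write it as
          # four big-endian \xNN escapes.
          if not 0 <= value <= 0xFFFFFFFF:
              raise ValueError("not representable in 32 bits")
          return "".join("\\x%02X" % b for b in value.to_bytes(4, "big"))

      print(utf32be_bytes(0x000041))   # \x00\x00\x00\x41  (valid code point)
      print(utf32be_bytes(0x110000))   # \x00\x11\x00\x00  (outside the Unicode range)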

    >If I'm understanding any of this correctly, I would argue that's the wrong way
    >to look at it.  This puts knowledge of the character encoding into the lexer's
    >tables. 

    This does not put any more character encoding data into the lexer than any
    of the other translation work does. The lexer will, clearly, be designed to
    handle just specific encodings, but will otherwise, at the machine level,
    only process bytes.

    >You don't want that because it ties you to one encoding, whether it be
    >Unicode-based or not.  You want the internal tables of the lexer to be
    >encoding-independent.  They should be able to depend on getting input that is
    >well-formed in terms of whatever encoding the tables use-- if we're talking
    >Unicode, the tables should be able to depend on getting well-formed UTF-8 or a
    >sequence of valid Unicode code point values. 

    You must be careful with the terminology: "encoding-independent" would
    require (see below, though) the state tables to operate on Unicode code
    points alone. If the lexer processes, say, UTF-8, then it is already
    specific to one encoding. Note, however, that a lexer that processes a
    single Unicode encoding can easily be made to process another Unicode
    encoding, or Unicode code points, via an input code converter. Which method
    to use is then just a matter of practical concerns. This would not work if
    one wants to produce a lexer able to switch between different input
    encodings without external converters (even though one could put context
    switches on a multi-code converter and change the input conversion method
    from the lexer).
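
    A minimal sketch of such an input code converter, assuming the generated
    lexer expects UTF-8 bytes; the function name is hypothetical and the codec
    machinery is Python's, used only for illustration.

      import codecs

      def transcode_to_utf8(raw, source_encoding, chunk_size=4096):
          # Incrementally decode the source encoding and re-encode it as
          # UTF-8 chunks for a byte-oriented lexer to consume.
          decoder = codecs.getincrementaldecoder(source_encoding)()
          for start in range(0, len(raw), chunk_size):
              text = decoder.decode(raw[start:start + chunk_size])
              if text:
                  yield text.encode("utf-8")
          tail = decoder.decode(b"", final=True)
          if tail:
              yield tail.encode("utf-8")

      utf16_input = "häst".encode("utf-16-le")
      assert b"".join(transcode_to_utf8(utf16_input, "utf-16-le")) == "häst".encode("utf-8")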

    >Producing that well-formed sequence should be the responsibility of something
    >outside the lexer. 

    That is the first naive approach I thought about. But in view of the
    translation method above, I thought it might be useful to be able to use the
    lexer to detect that a sequence is well-formed. In an actual application,
    one might of course pipe lexers, where the first is just a specialized tool
    that detects whether the input encoding is correct.

    >Or, to put it another way, the grammar the lexer interprets should be
    >independent of the scheme used to encode it; it's in characters, not bytes. 

    As I noted above, this is just a question of creating a suitable user
    interface for the lexer generator. Suppose, for example, that the input
    grammar file is assumed to be in UTF-8. Then, if one scans for a UTF-8
    string "...", no change is needed, as the string is already in UTF-8. If one
    wants it to be matched in, say, UTF-32, then one must somehow indicate that
    in the grammar, say as \U"...". The lexer generator must then first convert
    the "...", which is input in UTF-8, to Unicode code points, and then convert
    those into a UTF-32 regular expression. A similar thing happens with
    character classes.
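
    A minimal sketch of that conversion step, assuming the grammar file is
    UTF-8 and the target is UTF-32BE; the helper name is hypothetical.

      def utf8_literal_to_utf32be_pattern(literal_bytes):
          code_points = literal_bytes.decode("utf-8")        # UTF-8 -> code points
          return "".join("\\x%02X" % b
                         for ch in code_points
                         for b in ch.encode("utf-32-be"))    # code points -> UTF-32BE

      # Under this scheme the grammar text \U"hä" would become the byte
      # pattern \x00\x00\x00\x68\x00\x00\x00\xE4.
      print(utf8_literal_to_utf32be_pattern("hä".encode("utf-8")))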

    > If you look at it this way, the regular expressions that the lexer generator
    >interprets can be plain old Unicode (in whatever UTF);

    With the method indicated above, one must in addition know which Unicode
    encoding to use. This need only be character-class-specific information,
    though.

    >there's no need to worry about things Unicode can't express.  

    With the method above, the lexer still has to be able to catch erroneous
    input byte sequences (unless one uses some kind of input filter). This is
    just a question of producing a suitable user interface.

    >This means that you need something outside the state-machine mechanism that
    >transforms input from whatever encoding it's in to whatever encoding the lexer
    >uses, or that at least ensures that the incoming byte sequence is indeed
    >already in the encoding the lexer uses. 

    So if the generated lexer were built up around states over Unicode code
    points, that would be necessary. But with the method indicated above, it is
    not needed in general, as the lexer itself can be made to handle that.

    >If the input is malformed at the encoding level, this layer signals an error
    >before the lexer ever sees the offending byte sequence and maybe passes
    >through some sort of valid byte sequence (e.g., U+FFFD) to the lexer.  This
    >outside layer is responsible for byte-swapping, and it's responsible for
    >skipping a leading BOM.   You can delegate all this to an outside layer
    >because these are requirements of the encoding itself, not the grammar you're
    >trying to tokenize. 

    One can build it up that way, but with the method indicated above, it is not
    necessary. As for the regular expression translation functions I wrote, I
    first made them generate errors for invalid values such as 0xFFFF, but that
    turned out just to be a problem. So the simplest way is to admit those
    erroneous numbers in the translation functions, and then add on top of that
    a way to capture erroneous Unicode values and sequences in the lexer
    generator's user interface.

    >If you purport to handle some encoding-- Unicode or something else-- that
    >implies a fixed set of well-formedness rules irrespective of the thing in that
    >encoding you're actually trying to parse.  The actual grammar shouldn't have
    >to include or restate these rules, and it shouldn't be allowed to override
    >them. 

    This depends on what kind of lexer one wants to produce. The method I
    indicated above will always be locally encoding-specific, whatever multibyte
    encoding one chooses. But such a lexer can be made encoding-independent by
    the use of suitable code converters.

    It is perhaps prudent to make a distinction (following common mathematical
    usage for coordinate systems) between an encoding-free lexer, which is one
    that lexes Unicode code points directly, and an encoding-independent lexer,
    which is one that lexes a specific encoding but uses code converters or
    other means (like, say, grammar start conditions) to become multi-encoding
    compatible. Using this terminology, it seems that you propose an
    encoding-free lexer generator and lexers, whereas the method I indicate
    would at best generate encoding-independent lexers. But that should suffice.

    >(Creating a lexer that interprets illegal UTF-8 byte sequences is, by
    >definition, not creating a lexer that interprets UTF-8-- instead it's creating
    >one that interprets some nonstandard UTF-8-like encoding.)  

    Right. But this is a question for the writer of the lexer, not a question of
    how the lexer generator should be implemented. The lexer generator should
    prudently supply a suitable user interface, making it easy for the lexer
    writer to sift out illegal UTF-8 byte sequences, if that is what is being
    processed.

      Hans Aberg


