Re: Subject: Re: 32'nd bit & UTF-8

From: Richard T. Gillam ([email protected])
Date: Thu Jan 20 2005 - 13:16:38 CST

Next message: [email protected]: "RE: UTF-8 'BOM'"

Previous message: Addison Phillips [wM]: "RE: UTF-8 'BOM'"
Maybe in reply to: Arcane Jill: "Subject: Re: 32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Reply: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

Good grief. We seem to be going through another round of "night of the
living thread."

This discussion has degenerated horribly from its original roots. I
really don't think it's productive to get into "UTF-x is better than
UTF-y" battles here, and I wish we could find a way to put a stop to it.
It's like programming-language wars-- people go round and round and
there's never any resolution. Suffice it to say that there are good,
valid reasons to use each of the UTFs and good, valid reasons not to use
each of the UTFs. Depending on your particular situation, any of the
three might be the best fit. There's a reason all three exist.

Similarly, I don't think endless discussion of the BOM is productive. I
think most people would agree that out-of-band methods of specifying the
encoding scheme are preferable to using the BOM, but they're not always
available. The BOM is what it is and it's not going away. It would
have been better if the BOM hadn't been overloaded with the "zero-width
non-breaking space" semantic-- if it had just been considered a no-op--
but even with the overloaded semantic, having extra BOMs creep into a
file that passes through non-BOM-aware processes rarely causes genuine
problems in practice. It's just not that big a deal.

Comparatively few of the people who have weighed in on this thread have
really responded to what the original poster was trying to do, perhaps
because they didn't understand it. I'm not sure I do either, but I want
to try both to understand the original motivation for this thread and
respond to it directly.

As I understand it, the basic issue here is how to make a byte-oriented
lexer generator (Flex) Unicode-aware. You want to be able to generate
lexers that can accept Unicode input and handle all of the different
characters Unicode defines.

The obvious way to do this would be to generate lexers that operate on
streams of Unicode code points rather than on streams of bytes.
Everything gets defined in terms of Unicode, and any conversion between
a legacy encoding and Unicode (or between some UTF and the in-memory
representation) is outside the scope of the lexer (well, outside the
scope of the generated part, anyway).

The original poster doesn't want to do this because you go from having a
character-category table with 256 entries to having one with 1.1 million
entries. Now instead of doing a simple (and fast) array lookup, you
have to compress the table somehow and do some sort of slower, more
complicated lookup.

My main reaction is So what? *Every* process that looks up data based
on character values has to go through this transformation at some point.
Techniques for doing the compression and the lookup abound, they're
well-researched and well-understood. Converting may be kind of a pain
in the butt, but there are existing solutions out there, and they're
plenty good enough for this kind of thing. There's no need to invent
anything.

But let's say that's too fundamental an architectural change and we want
to keep the internal tables of the lexers byte-oriented. You could do
this with any of the Unicode encoding schemes, but it probably works
best with UTF-8. Instead of feeding the lexer Unicode code points, you
feed it UTF-8 code units. You get the same level of expressiveness, and
the internals of the lexer are still byte-oriented. Basically, if you
care about non-ASCII characters, you're accepting larger, more
complicated state tables to get smaller, less complicated category
tables. This is a reasonable tradeoff to consider.

The problem seems to be how do you have a byte-oriented lexer deal with
invalid UTF-8 byte sequences? The problem is that you have no way to
express those invalid byte sequences in the regular expressions you're
generating the lexer from, and so we're talking about using a
nonstandard extension of UTF-8 to express the invalid byte sequences.

If I'm understanding any of this correctly, I would argue that's the
wrong way to look at it. This puts knowledge of the character encoding
into the lexer's tables. You don't want that because it ties you to one
encoding, whether it be Unicode-based or not. You want the internal
tables of the lexer to be encoding-independent. They should be able to
depend on getting input that is well-formed in terms of whatever
encoding the tables use-- if we're talking Unicode, the tables should be
able to depend on getting well-formed UTF-8 or a sequence of valid
Unicode code point values. Producing that well-formed sequence should
be the responsibility of something outside the lexer. Or, to put it
another way, the grammar the lexer interprets should be independent of
the scheme used to encode it; it's in characters, not bytes. If you
look at it this way, the regular expressions that the lexer generator
interprets can be plain old Unicode (in whatever UTF); there's no need
to worry about things Unicode can't express.

This means that you need something outside the state-machine mechanism
that transforms input from whatever encoding it's in to whatever
encoding the lexer uses, or that at least ensures that the incoming byte
sequence is indeed already in the encoding the lexer uses. If the input
is malformed at the encoding level, this layer signals an error before
the lexer ever sees the offending byte sequence and maybe passes through
some sort of valid byte sequence (e.g., U+FFFD) to the lexer. This
outside layer is responsible for byte-swapping, and it's responsible for
skipping a leading BOM.

You can delegate all this to an outside layer because these are
requirements of the encoding itself, not the grammar you're trying to
tokenize. If you purport to handle some encoding-- Unicode or something
else-- that implies a fixed set of well-formedness rules irrespective of
the thing in that encoding you're actually trying to parse. The actual
grammar shouldn't have to include or restate these rules, and it
shouldn't be allowed to override them. (Creating a lexer that
interprets illegal UTF-8 byte sequences is, by definition, not creating
a lexer that interprets UTF-8-- instead it's creating one that
interprets some nonstandard UTF-8-like encoding.)

The essential mental shift is to go from thinking of the lexer as
processing streams of BYTES to thinking of it as processing streams of
CHARACTERS.

This doesn't seem that hard, so I must be missing some essential part of
the argument. What am I missing?

--Rich Gillam
Language Analysis Systems, Inc.

Next message: [email protected]: "RE: UTF-8 'BOM'"
Previous message: Addison Phillips [wM]: "RE: UTF-8 'BOM'"
Maybe in reply to: Arcane Jill: "Subject: Re: 32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Reply: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 13:17:15 CST