Re: Subject: Re: 32'nd bit & UTF-8

From: Richard T. Gillam (
Date: Thu Jan 20 2005 - 13:16:38 CST

  • Next message: "RE: UTF-8 'BOM'"

    Good grief. We seem to be going through another round of "night of the
    living thread."
    This discussion has degenerated horribly from its original roots. I
    really don't think it's productive to get into "UTF-x is better than
    UTF-y" battles here, and I wish we could find a way to put a stop to it.
    It's like programming-language wars-- people go round and round and
    there's never any resolution. Suffice it to say that there are good,
    valid reasons to use each of the UTFs and good, valid reasons not to use
    each of the UTFs. Depending on your particular situation, any of the
    three might be the best fit. There's a reason all three exist.
    Similarly, I don't think endless discussion of the BOM is productive. I
    think most people would agree that out-of-band methods of specifying the
    encoding scheme are preferable to using the BOM, but they're not always
    available. The BOM is what it is and it's not going away. It would
    have been better if the BOM hadn't been overloaded with the "zero-width
    non-breaking space" semantic-- if it had just been considered a no-op--
    but even with the overloaded semantic, having extra BOMs creep into a
    file that passes through non-BOM-aware processes rarely causes genuine
    problems in practice. It's just not that big a deal.
    Comparatively few of the people who have weighed in on this thread have
    really responded to what the original poster was trying to do, perhaps
    because they didn't understand it. I'm not sure I do either, but I want
    to try both to understand the original motivation for this thread and
    respond to it directly.
    As I understand it, the basic issue here is how to make a byte-oriented
    lexer generator (Flex) Unicode-aware. You want to be able to generate
    lexers that can accept Unicode input and handle all of the different
    characters Unicode defines.
    The obvious way to do this would be to generate lexers that operate on
    streams of Unicode code points rather than on streams of bytes.
    Everything gets defined in terms of Unicode, and any conversion between
    a legacy encoding and Unicode (or between some UTF and the in-memory
    representation) is outside the scope of the lexer (well, outside the
    scope of the generated part, anyway).
    The original poster doesn't want to do this because you go from having a
    character-category table with 256 entries to having one with 1.1 million
    entries. Now instead of doing a simple (and fast) array lookup, you
    have to compress the table somehow and do some sort of slower, more
    complicated lookup.
    My main reaction is So what? *Every* process that looks up data based
    on character values has to go through this transformation at some point.
    Techniques for doing the compression and the lookup abound, they're
    well-researched and well-understood. Converting may be kind of a pain
    in the butt, but there are existing solutions out there, and they're
    plenty good enough for this kind of thing. There's no need to invent
    But let's say that's too fundamental an architectural change and we want
    to keep the internal tables of the lexers byte-oriented. You could do
    this with any of the Unicode encoding schemes, but it probably works
    best with UTF-8. Instead of feeding the lexer Unicode code points, you
    feed it UTF-8 code units. You get the same level of expressiveness, and
    the internals of the lexer are still byte-oriented. Basically, if you
    care about non-ASCII characters, you're accepting larger, more
    complicated state tables to get smaller, less complicated category
    tables. This is a reasonable tradeoff to consider.
    The problem seems to be how do you have a byte-oriented lexer deal with
    invalid UTF-8 byte sequences? The problem is that you have no way to
    express those invalid byte sequences in the regular expressions you're
    generating the lexer from, and so we're talking about using a
    nonstandard extension of UTF-8 to express the invalid byte sequences.
    If I'm understanding any of this correctly, I would argue that's the
    wrong way to look at it. This puts knowledge of the character encoding
    into the lexer's tables. You don't want that because it ties you to one
    encoding, whether it be Unicode-based or not. You want the internal
    tables of the lexer to be encoding-independent. They should be able to
    depend on getting input that is well-formed in terms of whatever
    encoding the tables use-- if we're talking Unicode, the tables should be
    able to depend on getting well-formed UTF-8 or a sequence of valid
    Unicode code point values. Producing that well-formed sequence should
    be the responsibility of something outside the lexer. Or, to put it
    another way, the grammar the lexer interprets should be independent of
    the scheme used to encode it; it's in characters, not bytes. If you
    look at it this way, the regular expressions that the lexer generator
    interprets can be plain old Unicode (in whatever UTF); there's no need
    to worry about things Unicode can't express.
    This means that you need something outside the state-machine mechanism
    that transforms input from whatever encoding it's in to whatever
    encoding the lexer uses, or that at least ensures that the incoming byte
    sequence is indeed already in the encoding the lexer uses. If the input
    is malformed at the encoding level, this layer signals an error before
    the lexer ever sees the offending byte sequence and maybe passes through
    some sort of valid byte sequence (e.g., U+FFFD) to the lexer. This
    outside layer is responsible for byte-swapping, and it's responsible for
    skipping a leading BOM.
    You can delegate all this to an outside layer because these are
    requirements of the encoding itself, not the grammar you're trying to
    tokenize. If you purport to handle some encoding-- Unicode or something
    else-- that implies a fixed set of well-formedness rules irrespective of
    the thing in that encoding you're actually trying to parse. The actual
    grammar shouldn't have to include or restate these rules, and it
    shouldn't be allowed to override them. (Creating a lexer that
    interprets illegal UTF-8 byte sequences is, by definition, not creating
    a lexer that interprets UTF-8-- instead it's creating one that
    interprets some nonstandard UTF-8-like encoding.)
    The essential mental shift is to go from thinking of the lexer as
    processing streams of BYTES to thinking of it as processing streams of
    This doesn't seem that hard, so I must be missing some essential part of
    the argument. What am I missing?
    --Rich Gillam
      Language Analysis Systems, Inc.

    This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 13:17:15 CST