RE: Byte-oriented lexer generator for Unicode

From: Lars Kristan
Date: Sat Jan 22 2005 - 04:09:03 CST


    Richard T. Gillam wrote:

    > As a rule, lexical analyzers operate on text. That's what
    > they're for, and that's why it's appropriate to think of the
    > generated code as operating on characters.

    But that does not mean that you need to convert text to codepoints, or
    UTF-32 if you will. This is just one of the options. One might just as well
    decide to convert to UTF-8 and process it as bytes. Sometimes bytes will
    need to be interpreted as codepoints, but that is not needed for all
    operations. Certain operations will work well simply by processing strings
    (!) of bytes.
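
    As a minimal sketch of such an operation, assuming Python and a
    hypothetical helper named split_fields (neither is part of the original
    discussion): splitting on an ASCII separator works directly on UTF-8
    bytes, and decoding is deferred until code points are actually needed.

```python
from typing import List


def split_fields(data: bytes, sep: bytes = b",") -> List[bytes]:
    """Split a byte string on an ASCII separator without decoding it.

    Hypothetical helper for illustration only.
    """
    # In UTF-8, bytes below 0x80 never occur inside a multi-byte sequence,
    # so searching for an ASCII byte cannot cut a character in half.
    return data.split(sep)


line = "naïve,日本語,plain".encode("utf-8")
fields = split_fields(line)

# Decode only when a field really has to be interpreted as code points.
decoded = [field.decode("utf-8") for field in fields]
```

    The split itself never looks at code points; only the last line does,
    and only because the caller asked for text.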

    Such an approach may have several advantages:
    * Existing string functions can be used (on UNIX for example).
    * It can have the 'self compressing' properties required in a lexer.
    * Last but not least, it is possible to process unvalidated data.

    To explain the last one, let me compare one aspect of the two approaches:
    * The codepoint approach needs to convert any legacy encoding into
    codepoints. For that, it must know the encoding. On a mismatch, it will
    typically choke very early on, producing no useful results. Sometimes that
    is even desired, but sometimes it is not. Worse, it might not choke at all:
    sometimes it produces results which seem accurate, but can cause damage.
    * The byte string approach only needs to convert UTF-16 and UTF-32 data. It
    can process UTF-8 data as it is. More importantly, it can process any
    legacy encoded data in the same way, without any conversion. If the rules
    are in the same encoding as the data, it will work. Moreover, if the rules
    are pure ASCII, it will work with both UTF-8 and all the ASCII-compatible
    legacy encodings. Of course, not converting legacy encoded data is just an
    option. When one knows the encoding of both the data and the rules, both
    can be converted to UTF-8.
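
    The comparison above can be sketched roughly as follows. This is only an
    illustration under stated assumptions: the token rules, the tokenize
    function and the sample input are all hypothetical, and treating every
    byte >= 0x80 as an identifier byte is one possible policy, not the only
    one. The rules are pure ASCII, so the same byte-oriented lexer handles
    UTF-8 and Latin-1 input without knowing which one it was given.

```python
import re

# Hypothetical ASCII-only rules. Assumption: any byte >= 0x80 counts as part
# of an identifier, whatever character (or fragment of one) it encodes.
TOKEN = re.compile(
    rb"(?P<ident>[A-Za-z_\x80-\xff][A-Za-z0-9_\x80-\xff]*)"
    rb"|(?P<number>[0-9]+)"
    rb"|(?P<op>=)"
    rb"|(?P<space>[ \t]+)"
)


def tokenize(data: bytes):
    """Return (kind, raw_bytes) pairs, skipping whitespace; nothing is decoded."""
    tokens = []
    pos = 0
    while pos < len(data):
        match = TOKEN.match(data, pos)
        if match is None:
            raise ValueError(f"unexpected byte at offset {pos}")
        if match.lastgroup != "space":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


utf8_input = "größe = 42".encode("utf-8")
latin1_input = "größe = 42".encode("latin-1")

# Byte string approach: both inputs tokenize, with no encoding knowledge.
utf8_tokens = tokenize(utf8_input)
latin1_tokens = tokenize(latin1_input)

# Codepoint approach with a wrong guess, case 1: it chokes early.
try:
    latin1_input.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Case 2: it does not choke, but silently yields the wrong characters.
mojibake = utf8_input.decode("latin-1")
```

    The identifier bytes come back exactly as they went in, so the output
    can be written out again without any loss, whichever encoding the input
    actually used.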

    So, the byte string approach is much more versatile. Windows might not need
    that versatility, but UNIX does.

