From: Lars Kristan (email@example.com)
Date: Sat Jan 22 2005 - 04:09:03 CST
Richard T. Gillam wrote:
> As a rule, lexical analyzers operate on text. That's what
> they're for, and that's why it's appropriate to think of the
> generated code as operating on characters.
But that does not mean the text needs to be converted to codepoints, or
UTF-32 if you will. That is just one of the options. One might just as well
decide to convert to UTF-8 and process it as bytes. Sometimes the bytes will
need to be interpreted as codepoints, but that is not needed for all
operations. Certain operations work perfectly well simply by processing
strings (!) of bytes.
Such an approach may have several advantages:
* Existing string functions can be used (on UNIX for example).
* It can have the 'self-compressing' properties required in a lexer.
* Last but not least, it is possible to process unvalidated data.
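To make the first two points concrete, here is a small sketch (in Python, with `bytes` standing in for C byte strings) of an ASCII lexer rule matched directly against UTF-8 bytes. The variable names are illustrative, not from any particular lexer:

```python
# Byte-string approach: match an ASCII keyword against UTF-8 bytes directly.
# UTF-8 guarantees that bytes below 0x80 never occur inside a multi-byte
# sequence, so an ASCII rule cannot produce a false match in the middle of
# a non-ASCII character.
utf8_source = "naïve if x".encode("utf-8")   # the ï becomes the two bytes C3 AF

# An ordinary byte-string search, no conversion to codepoints needed:
pos = utf8_source.find(b"if")   # finds the keyword, never part of the ï
```

The same call would work unchanged on a C `char *` buffer with `strstr()`, which is exactly the "existing string functions" point above.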
To explain the last one, let me compare one aspect of the two approaches:
* The codepoint approach needs to convert any legacy encoding into
codepoints. For that, it must know the encoding. On any mismatch, it will
choke very early on, producing no useful results. That is sometimes even
desired, but sometimes it is not. Worse, it might not choke at all and
instead produce results which seem accurate but can cause damage.
* The byte string approach only needs to convert UTF-16 and UTF-32 data. It
can process UTF-8 data as it is. And, importantly, it can process any
legacy-encoded data in the same way, without any conversion. If the rules
are in the same encoding as the data, it will work. Even better, if the
rules are ASCII, it will work with both UTF-8 and all the legacy encodings.
Of course, not converting legacy-encoded data is just an option; when the
encoding of both the data and the rules is known, both can be converted to
UTF-8.
So, the byte string approach is much more versatile. Windows might not need
that versatility, but UNIX does.
This archive was generated by hypermail 2.1.5 : Sat Jan 22 2005 - 04:12:24 CST