From: Martin Duerst (duerst@w3.org)
Date: Mon Jan 24 2005 - 03:24:36 CST
At 05:46 05/01/21, Hans Aberg wrote:
>The problem is that we do not have a specific lexer at hand, but a lexer
>generator Flex, and want to figure out how to make it support Unicode
>encodings. Then there is no universal way to define exactly how it should
>act in the case of an error, because different lexers may choose different
>actions. So, at least in the case of UTF-32, it is convenient to enable
>regular expressions for all 2^32 numbers. The lexer writer will have to
>attune that to the Unicode standard.
I think it's a bad idea to provide a Unicode-enabled version
of flex (by itself a very good idea) but to leave error handling
in the original encoding to the programmer using flex.
What I would expect such a Unicode-enabled version of flex to do
is to have something similar to <<EOF>>, let's call it <<NONCHAR>>
for the moment. <<NONCHAR>> would match the shortest non-UTF-8 byte
sequences. The typical use would be for a grammar to have a single
rule matching <<NONCHAR>>, e.g. like so:
<<NONCHAR>> fprintf(stderr, "Illegal UTF-8 input.\n"); exit(1);
Of course, the average programmer may have a somewhat more user-
friendly way of telling the user about errors, e.g. including
the line number and byte position, or continuing after the first
such error to find others, but stopping at around 10 occurrences
so as not to produce overly long error logs. Also, if the programmer
wants to be more specific, e.g. in terms of things such as 'overlong
sequence', 'high surrogate', or whatever, this can always
be done by the programmer having a look at yytext, where they
should find the bytes matched. As a convenience to programmers,
you could even provide a function that does such analysis
and that can be called by programmers.
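As a rough sketch of what such a convenience function might look like
(the name utf8_classify_error and the enum are hypothetical, not part
of flex), the following C code assumes it is only called on bytes that
are already known to be ill-formed, e.g. yytext and yyleng inside a
<<NONCHAR>> rule, and names the most likely reason:

```c
#include <stddef.h>

/* Hypothetical helper, not part of flex. Names the most likely reason
   why the bytes in buf are not well-formed UTF-8. It assumes the caller
   already knows the sequence is ill-formed. */
enum utf8_error {
    UTF8_STRAY_CONTINUATION, /* 0x80..0xBF without a lead byte */
    UTF8_OVERLONG,           /* encoding longer than necessary */
    UTF8_SURROGATE,          /* U+D800..U+DFFF encoded as UTF-8 */
    UTF8_OUT_OF_RANGE,       /* above U+10FFFF, or bytes 0xF5..0xFF */
    UTF8_TRUNCATED,          /* continuation byte missing or malformed */
    UTF8_UNKNOWN             /* nothing wrong with the first byte */
};

enum utf8_error utf8_classify_error(const unsigned char *buf, size_t len)
{
    if (len == 0) return UTF8_UNKNOWN;

    unsigned char b0 = buf[0];
    if (b0 < 0x80) return UTF8_UNKNOWN;              /* plain ASCII */
    if (b0 <= 0xBF) return UTF8_STRAY_CONTINUATION;
    if (b0 == 0xC0 || b0 == 0xC1) return UTF8_OVERLONG;
    if (b0 >= 0xF5) return UTF8_OUT_OF_RANGE;

    /* From here on b0 is a plausible lead byte (0xC2..0xF4). */
    if (len < 2 || (buf[1] & 0xC0) != 0x80) return UTF8_TRUNCATED;

    unsigned char b1 = buf[1];
    if (b0 == 0xE0 && b1 < 0xA0) return UTF8_OVERLONG;
    if (b0 == 0xED && b1 >= 0xA0) return UTF8_SURROGATE;
    if (b0 == 0xF0 && b1 < 0x90) return UTF8_OVERLONG;
    if (b0 == 0xF4 && b1 >= 0x90) return UTF8_OUT_OF_RANGE;

    /* Remaining case: a later continuation byte is missing or malformed. */
    return UTF8_TRUNCATED;
}
```

A <<NONCHAR>> action could then call something like
utf8_classify_error((const unsigned char *)yytext, yyleng) and pick
its error message from the result.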
<<EOF>> is just an acknowledgement that the input is not a sequence
of bytes in the range 0x00-0xFF, but also includes end-of-file.
Likewise, <<NONCHAR>> is an acknowledgement that the input is
not a sequence of Unicode characters, but may include some illegal,
non-Unicode stuff.
Requiring the flex programmer to do anything more than something
like the above is doing completely the wrong thing; rather than
abstracting Unicode knowledge inside flex, you are exposing it
to a programmer. The chance that programmers will do the right
thing with this is very low, in particular because what you
seem to want to do is to abstract the normal case, but expose
error conditions.
Indeed, I would go even a step further, and make sure that flex
has a default action for <<NONCHAR>>, which would be to stop
further processing and exit with an error.
Those programmers that really want to mess around with UTF-8
can always hack stuff into the <<NONCHAR>> rule, or can write
their lexical rules in a byte-oriented tool (e.g. like the current
flex).
In addition to these 'usability' issues, your idea to extend
the mapping from a subset of integers (those corresponding
to Unicode characters) to a larger subset (those representable
by a 32-bit integer) has other problems. One specific problem is
that UTF-8 doesn't allow overlong sequences, i.e. things like
0xC0 0xAF. But your mapping might just map that to a '/',
which would be a serious security issue.
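To make the overlong case concrete, here is a small C sketch (both
function names are made up for illustration). The naive decoder only
strips the UTF-8 marker bits, so it turns 0xC0 0xAF into 0x2F, which
is '/'. The strict one refuses lead bytes below 0xC2, which is what
rules out overlong two-byte forms:

```c
/* Naive two-byte decoder (illustration only): strips the marker bits
   and checks nothing, so overlong forms slip through. */
unsigned long naive_decode2(unsigned char b0, unsigned char b1)
{
    return ((unsigned long)(b0 & 0x1F) << 6) | (unsigned long)(b1 & 0x3F);
}

/* Strict two-byte decoder (illustration only): returns the code point,
   or -1 if b0 b1 is not a well-formed two-byte UTF-8 sequence. */
long strict_decode2(unsigned char b0, unsigned char b1)
{
    if (b0 < 0xC2 || b0 > 0xDF) return -1;  /* 0xC0, 0xC1 are overlong */
    if ((b1 & 0xC0) != 0x80) return -1;     /* not a continuation byte */
    return (long)(((unsigned long)(b0 & 0x1F) << 6) |
                  (unsigned long)(b1 & 0x3F));
}
```

A path check that looks for the byte 0x2F before decoding would never
see the '/' that the naive decoder produces afterwards.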
You may want to look at
http://dev.w3.org/cvsweb/validator/httpd/cgi-bin/check?rev=1.376&content-type=text/x-cvsweb-markup
(search for "sub check_utf8") and
http://dev.w3.org/cvsweb/charlint/charlint.pl?rev=1.27&content-type=text/x-cvsweb-markup
(search for sub CheckUTF8) for legal and illegal UTF-8 expressed
as byte-oriented regular expressions.
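For readers who do not want to dig through those scripts, the same
byte ranges can be sketched in C. This is not taken from either
script; it follows the well-formed byte sequence ranges of the Unicode
Standard, and the function name is hypothetical:

```c
#include <stddef.h>

static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Hypothetical helper: returns the length (1..4) of the well-formed
   UTF-8 sequence at the start of s, or 0 if there is none. */
size_t utf8_sequence_length(const unsigned char *s, size_t len)
{
    if (len == 0) return 0;

    unsigned char b0 = s[0];
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for 2nd byte */
    size_t n;

    if (b0 <= 0x7F) {
        return 1;                        /* ASCII */
    } else if (in_range(b0, 0xC2, 0xDF)) {
        n = 2;                           /* 0xC0, 0xC1 would be overlong */
    } else if (in_range(b0, 0xE0, 0xEF)) {
        n = 3;
        if (b0 == 0xE0) lo = 0xA0;       /* excludes overlong forms */
        if (b0 == 0xED) hi = 0x9F;       /* excludes surrogates */
    } else if (in_range(b0, 0xF0, 0xF4)) {
        n = 4;
        if (b0 == 0xF0) lo = 0x90;       /* excludes overlong forms */
        if (b0 == 0xF4) hi = 0x8F;       /* excludes > U+10FFFF */
    } else {
        return 0;                        /* 0x80..0xC1 or 0xF5..0xFF */
    }

    if (len < n) return 0;               /* truncated input */
    if (!in_range(s[1], lo, hi)) return 0;
    for (size_t i = 2; i < n; i++)
        if (!in_range(s[i], 0x80, 0xBF)) return 0;
    return n;
}
```

Anything for which this returns 0 is the kind of input that a
<<NONCHAR>> rule would have to catch.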
And if you are afraid that some time in the future, Unicode
will have to go beyond U+10FFFF, I think there is no problem
if you wait for that to happen to update flex. In my view, even
if you are very young now, the chance that you will still be alive
when that happens is rather small. Also, please note that such
an update, if ever necessary, will be much easier if users don't
have to change their own clunky rules and code that deals with
illegal stuff.
Regards, Martin.