Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 13:56:34 CST

Next message: Kenneth Whistler: "Re: 32'nd bit & UTF-8"

Previous message: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
In reply to: Arcane Jill: "RE: 32'nd bit & UTF-8"
Next in thread: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"
Maybe reply: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"

On 2005/01/19 17:48, Arcane Jill at arcanejill@ramonsky.com wrote:

>> A lexer generator like Flex does not process Unicode directly, it generates a
>> lexer that processes bytes.

> As a programmer myself, I actually followed that explanation.

There have been discussions about that on the Flex list. People write Unicode
expressions by writing the bytes as \x.. escapes by hand, but that is tedious.
This led me to that approach.
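
To illustrate what the hand-written form looks like, here is a minimal
sketch in Python. It assumes the lexer input is UTF-8; the helper name
`utf8_flex_escapes` is made up for this example and is not part of Flex.

```python
def utf8_flex_escapes(text: str) -> str:
    """Spell out the UTF-8 bytes of `text` as \\xHH escapes,
    the form one would otherwise type by hand into a byte-level pattern."""
    return "".join(f"\\x{byte:02X}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    print(utf8_flex_escapes("é"))  # U+00E9 -> \xC3\xA9
    print(utf8_flex_escapes("€"))  # U+20AC -> \xE2\x82\xAC
```

A single character is manageable this way; a character class covering a
range of code points is where doing it by hand becomes tedious.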

> But I wonder if
> it's the right approach. Would it not be a more ... interesting ... approach,
> to forget Flex, and instead write a brand new Unicode lexer generator which
> generates a lexer that processes characters (not bytes)?

Why don't you do that yourself? :-) -- You must think about how much work
has already been put into developing Flex to this point. And Unicode is not
the only issue; there are many others: better Flex/Bison handling,
multilanguage output, etc.

If you want to write a lexer that directly processes Unicode code points,
using the DFA approach, then the problem is that you need a table with 2^21
index values. Since this is too big for a typical static array, you get into
the issue of table compression and the like. So the rewrite as 1-byte regular
expressions has the advantage of avoiding that issue altogether, promising
quick Unicode support with relatively minor implementation work. There seem
to be other benefits to this approach as well: one can mix different Unicode
encodings in the same lexer, by the use of start conditions, say.
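
The rewrite into 1-byte regular expressions can be sketched roughly as
follows, in Python. This is only an illustration under stated assumptions,
not the code discussed on the Flex list: it assumes UTF-8, skips the
surrogate range U+D800..U+DFFF, and all function names are invented for
the example. The splitting rule is the commonly used one: cut a code point
range until each piece is a fixed prefix, one varying byte, and then full
continuation bytes.

```python
def _same_length_pieces(lo, hi):
    """Split [lo, hi] at UTF-8 length boundaries, dropping surrogates."""
    pieces, start = [], lo
    for limit in (0x7F, 0x7FF, 0xD7FF, 0xDFFF, 0xFFFF, 0x10FFFF):
        if start > hi:
            break
        if start <= limit:
            end = min(hi, limit)
            if limit != 0xDFFF:  # 0xD800..0xDFFF are surrogates: skip them
                pieces.append((start, end))
            start = end + 1
    return pieces


def _byte_sequences(lo, hi):
    """Split a same-length range into sequences of (low, high) byte pairs."""
    n = len(chr(lo).encode("utf-8"))
    for i in range(1, n):
        m = (1 << (6 * i)) - 1  # mask for the payload of the last i bytes
        if (lo & ~m) != (hi & ~m):
            if (lo & m) != 0:  # lo does not start on a block boundary
                return (_byte_sequences(lo, lo | m)
                        + _byte_sequences((lo | m) + 1, hi))
            if (hi & m) != m:  # hi does not end on a block boundary
                return (_byte_sequences(lo, (hi & ~m) - 1)
                        + _byte_sequences(hi & ~m, hi))
    # Every byte position is now an independent range.
    return [list(zip(chr(lo).encode("utf-8"), chr(hi).encode("utf-8")))]


def _fmt(x, y):
    """Format one byte range as a \\xHH escape or a bracketed class."""
    return f"\\x{x:02X}" if x == y else f"[\\x{x:02X}-\\x{y:02X}]"


def utf8_pattern(lo, hi):
    """Byte-level alternation matching the UTF-8 form of U+lo..U+hi."""
    alternatives = []
    for a, b in _same_length_pieces(lo, hi):
        for seq in _byte_sequences(a, b):
            alternatives.append("".join(_fmt(x, y) for x, y in seq))
    return "|".join(alternatives)


if __name__ == "__main__":
    # Greek and Coptic block, U+0370..U+03FF:
    print(utf8_pattern(0x370, 0x3FF))
    # \xCD[\xB0-\xBF]|[\xCE-\xCF][\x80-\xBF]
```

Every alternative uses only byte values and byte ranges, so a lexer built
from it needs the ordinary 256-entry table per state rather than one
indexed by code point.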

Then this need not be the end of it. But if one should make a lexer for only
Unicode points, then perhaps one needs to have some idea of what actual
Unicode lexers in use look like. So it may be the case one will have to wait
some time into the future for that to happen.

Hans Aberg

This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 13:58:32 CST