Unicode Lookahead in Parsers?

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Aug 30 1996 - 18:32:44 EDT


Dan Oscarsson writes in response to Arnt:

>> Putting combining characters before the non-combining character would
>> make such speculative rendering impossible.
>>
>Agreed, for GUI representation it would be ok. But not when having a
>command/program parser. Generally, when writing parsers, you want to
>avoid as much look ahead as possible. Unicode forces you to always
>read at least one character ahead.

Nope, not even for parsers is this a problem. With a correct extension
of the identifier syntax, there is no additional cost over what
a parser (or more accurately, the lexer portion of the parser)
currently has to do. Once you transition to the state
which is accumulating a token for an identifier, you sit in a loop
of the form:

        /* copy characters for as long as they can continue an identifier */
        while ( isIdentifierPart (*s) )
            *tk++ = *s++;

The entire trick is in specifying the identifier correctly. The
implementation guidelines published in the Unicode Standard 2.0
include a section which spells out a complete suggested BNF syntax
for identifiers which can be used to generate an efficient one-step
table lookup underneath an isIdentifierPart() implementation.
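
To make the point concrete, here is a minimal sketch of what such a
table-driven isIdentifierPart() might look like. Everything in it is an
assumption made for illustration: the UChar type, the ID_START/ID_PART
flags, the initIdentifierTable() and scanIdentifier() helpers, and above
all the handful of character ranges, which are a toy subset and not the
data or the BNF from the Standard. A real table would be generated from
the Unicode character database.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UChar;            /* hypothetical 16-bit code unit type */

enum { ID_START = 1, ID_PART = 2 };  /* hypothetical class flags */

/* One byte of class flags per 16-bit code unit: a lookup is one index. */
static uint8_t id_class[0x10000];

static void mark(unsigned lo, unsigned hi, uint8_t flags)
{
    for (unsigned c = lo; c <= hi; c++)
        id_class[c] |= flags;
}

/* Fill the table from a toy subset of ranges (illustrative only). */
void initIdentifierTable(void)
{
    mark('A', 'Z', ID_START | ID_PART);
    mark('a', 'z', ID_START | ID_PART);
    mark('_', '_', ID_START | ID_PART);
    mark('0', '9', ID_PART);
    mark(0x00C0, 0x00D6, ID_START | ID_PART);  /* Latin-1 letters */
    mark(0x00D8, 0x00F6, ID_START | ID_PART);
    mark(0x00F8, 0x00FF, ID_START | ID_PART);
    mark(0x0300, 0x036F, ID_PART);  /* combining diacritical marks block */
}

int isIdentifierStart(UChar c) { return (id_class[c] & ID_START) != 0; }
int isIdentifierPart(UChar c)  { return (id_class[c] & ID_PART) != 0; }

/*
 * Copy one identifier from the NUL-terminated input s into tk, which has
 * room for cap code units including the terminator. Returns the number
 * of code units copied, or 0 if s does not begin with an identifier.
 */
size_t scanIdentifier(const UChar *s, UChar *tk, size_t cap)
{
    size_t n = 0;

    assert(cap > 0);
    if (isIdentifierStart(s[0])) {
        /* Same loop as above: the current character alone decides. */
        while (isIdentifierPart(s[n]) && n + 1 < cap) {
            tk[n] = s[n];
            n++;
        }
    }
    tk[n] = 0;
    return n;
}
```

After calling initIdentifierTable(), scanning the input "cafe" followed
by U+0301 and a space copies five code units: the combining acute is
accepted as an identifier part in the same loop iteration style as any
letter, and the loop stops at the space. Because the combining mark
follows its base character, the lexer never inspects anything beyond the
character that ends the token, which it has to read in any case.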

Check with the Java implementers. They're not complaining about
combining characters causing inefficiencies in the lexer.

--Ken Whistler


