Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 18 2005 - 18:09:33 CST


    On 2005/01/18 23:06, Philippe VERDY at verdy_p@wanadoo.fr wrote:

    >> Philippe VERDY has already pointed out that the 31-bit version is called
    >> UTF-BSS. UTF-8 is restricted to at or below 0x10FFFF.
    >
    > I have not stated that. I have also replied to you about this name, and I
    > pointed you to the appropriate RFC that describes it. I said I was not sure
    > about the name (in fact it doesn't matter much to me how it is named; I don't
    > use it and don't need it...)

    There is an email time-delay problem here. But the transformation
    format should have a name of its own; otherwise the confusion seen on
    this list will be repeated.
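
    To make the distinction in the quoted text concrete, here is a minimal
    Python sketch of the 31-bit format next to UTF-8 as restricted to
    0x10FFFF. It assumes the one- to six-byte layout of RFC 2279; the names
    encode_31bit and is_valid_utf8 are hypothetical, chosen for illustration.

```python
def encode_31bit(cp: int) -> bytes:
    """Encode a value below 2**31 with the one- to six-byte layout."""
    if not 0 <= cp < 2**31:
        raise ValueError("value outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, lead-byte marker, number of continuation bytes)
    for limit, lead, ncont in (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ):
        if cp < limit:
            # Continuation bytes carry six payload bits each, high bits first.
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F)
                    for i in range(ncont - 1, -1, -1)]
            return bytes([lead | (cp >> (6 * ncont))] + tail)
    raise AssertionError("unreachable")


def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder (limit 0x10FFFF) accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

    At or below 0x10FFFF the two formats produce the same bytes for the
    values tried here; encode_31bit(0x110000) yields a sequence that a strict
    UTF-8 decoder rejects.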

    > For your problem of handling invalid code sequences in a lexer, the best you
    > can do is to add a meta-character for regular expressions that will match any
    > invalid input sequence, and use it for all cases where an input sequence
    > cannot be recognized but can be handled as a whole.

    Right. This is how I was led to the discussions on this list. It will not be
    sufficient to have a single metacharacter, though, because one wants to leave
    it open for the lexer to choose how to act. For example, it might decide to
    resynchronize, or switch to another encoding, or simply vary diagnostics.
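
    The point about leaving the choice to the lexer can be sketched as a
    decoder that takes its error policy as a parameter. This is only an
    illustration in Python, not the code under discussion; the names
    decode_with_policy, resynchronize and fall_back_to_latin1 are
    hypothetical, and Latin-1 is an arbitrary choice of "another encoding".

```python
from typing import Callable, Tuple

# A policy gets the input and the offset of the first undecodable byte and
# returns (replacement text, offset at which decoding resumes).
Policy = Callable[[bytes, int], Tuple[str, int]]


def resynchronize(data: bytes, pos: int) -> Tuple[str, int]:
    """Drop the bad byte and any continuation bytes (10xxxxxx) after it."""
    end = pos + 1
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return "", end


def fall_back_to_latin1(data: bytes, pos: int) -> Tuple[str, int]:
    """Reinterpret the single offending byte in another encoding."""
    return data[pos:pos + 1].decode("latin-1"), pos + 1


def decode_with_policy(data: bytes, policy: Policy) -> str:
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice that was being decoded.
            bad = pos + err.start
            out.append(data[pos:bad].decode("utf-8"))
            text, pos = policy(data, bad)
            out.append(text)
    return "".join(out)
```

    With the input b"a\xc0\x80b", the resynchronize policy yields "ab", while
    fall_back_to_latin1 keeps both bytes as U+00C0 and U+0080. A policy that
    only records a diagnostic would fit the same signature.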

    > For example, C0.80 is invalid, and you may convert it internally to an invalid
    > codepoint followed by an indicator of the value of this byte: if the internal
    > representation for your lexer needs to keep these values but handle them as
    > errors that can be matched in a regular expression, you may convert each byte
    > to codepoint FFFF, followed by a PUA codepoint containing the value of the
    > invalid byte: this scheme will work independently of the specified input
    > charset (it could also be used to match unpaired surrogates, for example by
    > converting an unpaired surrogate D8xx to FFFF,EED8,FFFF,EExx).
    > Of course this will be invalid Unicode, but you keep internally the value of
    > the invalid bytes, and you can then display these values in error messages
    > generated by your lexer...
    >
    > And you'll be able to match these invalid bytes in regular expressions...
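
    The scheme quoted above can be sketched in Python as follows. The
    marker FFFF and the EExx range come from the quoted text; the function
    names, and the use of Python's own decoder to locate the invalid bytes,
    are assumptions made for illustration.

```python
import re
from typing import List

INVALID_MARK = 0xFFFF   # noncharacter used as the "invalid" marker
PUA_BASE = 0xEE00       # U+EExx carries the value xx of the invalid byte


def escape_invalid_bytes(data: bytes) -> List[int]:
    """Decode UTF-8, turning each invalid byte xx into FFFF, EExx."""
    out: List[int] = []
    pos = 0
    while pos < len(data):
        try:
            out.extend(ord(ch) for ch in data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the decoded slice.
            good = data[pos:pos + err.start].decode("utf-8")
            out.extend(ord(ch) for ch in good)
            for byte in data[pos + err.start:pos + err.end]:
                out.extend((INVALID_MARK, PUA_BASE + byte))
            pos += err.end
    return out


def escape_unpaired_surrogate(unit: int) -> List[int]:
    """Turn an unpaired surrogate such as D8xx into FFFF, EED8, FFFF, EExx."""
    return [INVALID_MARK, PUA_BASE + (unit >> 8),
            INVALID_MARK, PUA_BASE + (unit & 0xFF)]


# A regular expression that matches one escaped invalid byte.
ESCAPED_BYTE = re.compile("\uffff[\uee00-\ueeff]")
```

    For the input C0.80 this produces FFFF, EEC0, FFFF, EE80, so the original
    byte values stay available for error messages, and ESCAPED_BYTE matches
    each escaped byte in the internal text.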

    I just provide functions that allow one to produce the overloaded
    multibytes (and the corresponding regular expressions), and sequences of
    those. Then it is easy to put them into a lexer rule. The lexer writer can
    decide what the lexer should do.
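
    As a rough illustration of such helper functions, the Python sketch
    below assumes that "overloaded multibytes" refers to overlong encodings,
    i.e. a value encoded in more bytes than its shortest form needs. That
    reading, and every name in the block, is an assumption; it is not the
    code referred to above.

```python
import re

CONT = rb"[\x80-\xBF]"  # any continuation byte

# Byte-level regular expressions for overlong forms, keyed by length.
OVERLONG = {
    2: rb"[\xC0-\xC1]" + CONT,
    3: rb"\xE0[\x80-\x9F]" + CONT,
    4: rb"\xF0[\x80-\x8F]" + CONT + rb"{2}",
}

# Smallest value whose shortest encoding really needs that many bytes.
SHORTEST_FORM_MIN = {2: 0x80, 3: 0x800, 4: 0x10000}
LEAD = {2: 0xC0, 3: 0xE0, 4: 0xF0}


def overlong_encode(cp: int, length: int) -> bytes:
    """Encode cp in `length` bytes although a shorter form exists."""
    if not 0 <= cp < SHORTEST_FORM_MIN[length]:
        raise ValueError("value has no overlong form of this length")
    tail = [0x80 | ((cp >> (6 * i)) & 0x3F)
            for i in range(length - 2, -1, -1)]
    return bytes([LEAD[length] | (cp >> (6 * (length - 1)))] + tail)


def overlong_pattern(length: int) -> bytes:
    """Regular expression for one overlong sequence of the given length."""
    return OVERLONG[length]


def overlong_run_pattern() -> bytes:
    """Regular expression for one or more overlong sequences in a row."""
    alternatives = b"|".join(OVERLONG[n] for n in sorted(OVERLONG))
    return b"(?:" + alternatives + b")+"
```

    A lexer rule could then use overlong_run_pattern() as the pattern and
    leave the action (resynchronize, switch encoding, report) to the lexer
    writer.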

      Hans Aberg


