Re: Re: 32'nd bit & UTF-8

From: Philippe VERDY (verdy_p@wanadoo.fr)
Date: Tue Jan 18 2005 - 16:06:06 CST

    > Philippe VERDY has already pointed out that the 31-bit version is called
    > UTF-BSS. UTF-8 is restricted to at or below 0x10FFFF.

    I did not state that. I have also already replied to you about this name, and I pointed you to the appropriate RFC that describes it. I said I was not sure about the name (in fact I don't much care what it is called: I don't use it and don't need it...)

    For your problem of handling invalid code sequences in a lexer, the best you can do is to add a meta-character for regular expressions that will match any invalid input sequence, and use it in all cases where an input sequence cannot be recognized but still has to be handled as a whole.

    For example, C0.80 is invalid, and you may convert it internally to an invalid code point followed by an indicator of the byte value. If your lexer's internal representation needs to keep these values, yet handle them as errors that a regular expression can match, you may convert each invalid byte to the code point FFFF followed by a PUA code point containing the value of the invalid byte. This scheme works independently of the specified input charset. It could also be used to match unpaired surrogates, for example by converting an unpaired surrogate D8xx to FFFF,EED8,FFFF,EExx; following the same pattern, C0.80 would become FFFF,EEC0,FFFF,EE80.
    Of course this will be invalid Unicode, but you keep the values of the invalid bytes internally, and you can then display them in the error messages generated by your lexer...
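A minimal sketch of this conversion, assuming the lexer is written in Python and its input is declared as UTF-8. The handler name "ffff-pua" and the function names are hypothetical, and the PUA base EE00 is taken from the surrogate example above (EE00 plus the byte value). It relies on the standard codecs.register_error hook, which lets a decoder substitute arbitrary text for bytes it rejects:

```python
import codecs

MARKER = "\uFFFF"   # noncharacter used as the "invalid input" flag
PUA_BASE = 0xEE00   # assumed base: invalid byte 0xNN becomes U+EENN


def escape_invalid_bytes(exc):
    """Error handler: replace each rejected byte with U+FFFF, U+EE00+byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(MARKER + chr(PUA_BASE + b) for b in bad)
    # Resume decoding right after the rejected bytes.
    return replacement, exc.end


# "ffff-pua" is a made-up handler name for this sketch.
codecs.register_error("ffff-pua", escape_invalid_bytes)


def decode_for_lexer(data: bytes) -> str:
    """Decode UTF-8, keeping invalid bytes as FFFF,EExx pairs."""
    return data.decode("utf-8", errors="ffff-pua")


if __name__ == "__main__":
    text = decode_for_lexer(b"a\xC0\x80b")
    print([f"{ord(c):04X}" for c in text])
```

The resulting string is deliberately not valid Unicode for interchange; it is only meant as an internal form from which the original byte values can be recovered for error messages.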

    And you'll be able to match these invalid bytes in regular expressions...
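As a rough illustration of the matching side, here is a sketch assuming the internal text already uses the FFFF,EExx convention described above. The pattern and the function names are hypothetical; only the standard re module is used:

```python
import re

PUA_BASE = 0xEE00  # assumed base: U+EENN stands for invalid byte 0xNN

# One escaped invalid byte: U+FFFF followed by a code point in U+EE00..U+EEFF.
INVALID_BYTE = re.compile("\uFFFF([\uEE00-\uEEFF])")


def find_invalid_bytes(text):
    """Return (position, byte value) for every escaped invalid byte."""
    return [(m.start(), ord(m.group(1)) - PUA_BASE)
            for m in INVALID_BYTE.finditer(text)]


def describe_errors(text):
    """Build lexer-style error messages showing the original byte values."""
    return [f"invalid byte 0x{value:02X} at offset {pos}"
            for pos, value in find_invalid_bytes(text)]


if __name__ == "__main__":
    # C0.80 escaped as FFFF,EEC0,FFFF,EE80 between two valid characters.
    sample = "a\uFFFF\uEEC0\uFFFF\uEE80b"
    for message in describe_errors(sample):
        print(message)
```

Note that the offsets reported here are positions in the escaped internal string, not in the original byte stream; a real lexer would probably want to track source offsets separately.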


