Re: RE: RE: Roundtripping in Unicode

From: Philippe VERDY (verdy_p@wanadoo.fr)
Date: Mon Dec 13 2004 - 12:24:57 CST


    > From : "Lars Kristan"
    > Philippe VERDY wrote:
    > > If a source sequence is invalid, and you want to preserve it,
    > > then this sequence must remain invalid if you change its encoding.
    > > So there's no need for Unicode to assign valid code points
    > > for invalid source data.
    > Using invalid UTF-16 sequences to represent invalid UTF-8 sequences is a known approach (UTF-8B, if I remember correctly). But this is then not UTF-16 data so you don't gain much. The data is at risk of being rejected or filtered out at any time. And that misses the whole point.

    I don't think I miss the point. My suggested approach, which performs roundtrip conversions between UTFs while keeping all invalid sequences invalid (with respect to the standard UTFs), is much less risky than converting them to valid code points (and consequently to valid code units, because every valid code point has a valid code unit sequence in every UTF encoding form).

    An application doing that just preserves the original byte sequences for its internal needs. It should not expose such invalid sequences to other applications or modules, because doing so carries the same risks: these other modules need their own strategy, and that strategy could simply be to reject invalid sequences and assume that all remaining valid sequences encode valid code points. This is the risk you take with your proposal to assign valid code points to invalid byte sequences in a UTF-8 stream: a module that implemented your proposal would remove important security features.

    Note also that once your proposal is implemented, the new code points, being valid, become convertible across all UTFs without notice (the very principle of the UTFs is that they allow transparent conversion between one another).

    Suppose that your proposal is accepted, and that invalid bytes 0xnn in UTF-8 sources (these bytes are necessarily between 0x80 and 0xFF) get mapped to some valid code points U+mmmnn (in a new range U+mmm80 to U+mmmFF). They then become immediately and transparently convertible to valid UTF-16 or even valid UTF-8. Your assumption that the byte sequence will be preserved will be wrong, because each escaped byte will become a valid sequence of 3 or 4 UTF-8 bytes (one lead byte in 0xE0..0xEF if the code points are in the BMP, or in 0xF0..0xF4 if they are in a supplementary plane, followed by 2 or 3 trail bytes in 0x80..0xBF).
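
    To make the expansion concrete, here is a small Python sketch. The base values 0xF700 (BMP) and 0x10FF00 (plane 16) are purely hypothetical stand-ins for the unspecified "U+mmm" range; nothing in the proposal fixes them. The sketch only shows that one escaped byte, once it is a valid code point, serializes to 3 or 4 bytes of valid UTF-8.

```python
# Hypothetical bases standing in for the unspecified "U+mmm" range.
BMP_BASE = 0xF700      # assumption: somewhere in the BMP
SUPP_BASE = 0x10FF00   # assumption: somewhere in a supplementary plane


def expand(byte: int, base: int = BMP_BASE) -> bytes:
    """Return the valid UTF-8 form of the code point assigned to one invalid byte."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF can be invalid in UTF-8")
    return chr(base + byte).encode("utf-8")


for base in (BMP_BASE, SUPP_BASE):
    for byte in (0x80, 0xE9, 0xFF):
        encoded = expand(byte, base)
        # One original byte has become len(encoded) bytes of valid UTF-8.
        print(f"0x{byte:02X} -> U+{base + byte:04X} -> {encoded.hex()} "
              f"({len(encoded)} bytes)")
```

    Any standard converter will happily produce and accept these longer sequences, so the original single byte is no longer what travels between modules.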

    How do you think other applications will treat these sequences? They won't notice that the new valid sequences originally stood for invalid bytes, and the byte sequence itself will be transmitted across modules without any warning (applications most often don't check whether code points are assigned, just that they are valid and properly encoded).

    Which application will take the responsibility of converting these valid 3-4 byte sequences back to invalid 1-byte sequences, given that your data will already be treated by the others as valid, and already encoded with valid UTF code units or encoding schemes?

    Come back to your filesystem problem. Suppose that there ARE filenames that already contain these valid 3-4 byte sequences. This hypothetical application will blindly convert the valid 3-4 byte sequences to invalid 1-byte sequences, and then won't be able to access these files, even though they were already correctly UTF-8 encoded. So your proposal breaks valid UTF-8 encoding of filenames. In addition, it creates dangerous aliases that will redirect accesses from one filename to another (so yes, it is also a security problem).
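
    The aliasing can be shown with a short Python sketch. Everything here is hypothetical: the base 0xF700, the handler name "escape_to_valid", and the helpers decode_name/encode_name are my own illustration of how such a scheme might be implemented, not anything taken from the proposal.

```python
import codecs

BASE = 0xF700  # hypothetical base standing in for the unspecified "U+mmm" range


def _escape_to_valid(err):
    """Decode error handler: map each invalid byte 0xnn to the valid code point BASE+nn."""
    if isinstance(err, UnicodeDecodeError):
        bad = err.object[err.start:err.end]
        return "".join(chr(BASE + b) for b in bad), err.end
    raise err


codecs.register_error("escape_to_valid", _escape_to_valid)


def decode_name(raw: bytes) -> str:
    """Filename bytes -> text, under the hypothetical scheme."""
    return raw.decode("utf-8", "escape_to_valid")


def encode_name(text: str) -> bytes:
    """Text -> filename bytes, blindly turning escape code points back into raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if BASE + 0x80 <= cp <= BASE + 0xFF:
            out.append(cp - BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


broken = b"caf\xe9"                              # invalid UTF-8 (stray Latin-1 byte)
legit = ("caf" + chr(BASE + 0xE9)).encode("utf-8")  # valid UTF-8, 3-byte sequence

# Two distinct filenames on disk collapse to the same text...
print(decode_name(broken) == decode_name(legit))
# ...and the valid one can no longer be reached after the round trip.
print(encode_name(decode_name(legit)) == legit)
```

    The two names on disk are different byte strings, yet after decoding they are indistinguishable, and encoding always yields the invalid one: the correctly encoded file has become unreachable and its name now aliases the other.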

    My opinion is then that we must not allow the conversion of any invalid byte sequences to valid code points. All your application can do is convert them to invalid sequences of code units, to preserve their invalid status. Then it's up to that application to make this conversion privately and to restore the original byte sequence before communicating again with the external system. Another process or module can do the same if it wishes to, but none of them will communicate directly with each other using their private code unit sequences. The decision to accept invalid byte sequences must remain local to each module and is not transmissible.
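
    As an illustration of this "keep it invalid, keep it private" approach, Python's built-in "surrogateescape" error handler is one existing mechanism of this kind: it maps each undecodable byte to a lone surrogate, which is not valid in any UTF. This is only a sketch of the idea using that handler, not a claim about how any particular application should implement it.

```python
raw = b"caf\xe9.txt"  # byte 0xE9 is not valid UTF-8 in this position

# Private, internal form: the bad byte becomes a lone surrogate (U+DCE9),
# so the string is deliberately NOT valid Unicode text.
internal = raw.decode("utf-8", errors="surrogateescape")

# Restoring the original bytes at the boundary with the external system.
restored = internal.encode("utf-8", errors="surrogateescape")
print(restored == raw)

# A strict converter refuses the private form instead of passing it on.
try:
    internal.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```

    Because the internal form stays invalid, any module that has not opted in rejects it rather than silently forwarding it as if it were ordinary text.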

    This means that permanent files must not be converted to another UTF and replaced as long as they contain an invalid byte sequence. Such a file converter should fail, and warn the user about file contents or filenames that could not be converted. Then it's up to the user to decide whether to:
    - drop these files
    - use a filter to remove invalid sequences (if it's a filename, the filter may need to append some indexing string to keep filenames unique in a directory)
    - use a filter to replace some invalid sequences by a user-specified valid substitution string
    - use a filter that will automatically generate valid substitution strings
    - use other programs that will accept and will be able to process invalid files as opaque sequences of bytes instead of as a stream of Unicode characters.
    - change the meta-data file-type so that it will no longer be considered as plain-text
    - change the meta-data encoding label, so that it will be treated as ISO-8859-1 or some other complete 8-bit charset with 256 valid positions (like CP850, CP437, ISO-8859-2, MacRoman...).
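
    A few of these choices can be sketched in Python. The function name decode_with_policy and the policy names are hypothetical, chosen only for this illustration; the default is to fail, and every other behaviour has to be requested explicitly by the user.

```python
def decode_with_policy(raw: bytes, policy: str = "fail", substitute: str = "_") -> str:
    """Decode UTF-8 bytes, handling invalid sequences according to a user-chosen policy."""
    if policy == "fail":
        # Strict decoding: raises UnicodeDecodeError on the first invalid sequence.
        return raw.decode("utf-8")
    if policy == "relabel-latin1":
        # Treat the data as ISO-8859-1, where all 256 byte values are valid.
        return raw.decode("iso-8859-1")
    if policy not in ("remove", "substitute"):
        raise ValueError(f"unknown policy: {policy}")

    pieces = []
    pos = 0
    while pos < len(raw):
        try:
            pieces.append(raw[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start/err.end are relative to the slice raw[pos:].
            pieces.append(raw[pos:pos + err.start].decode("utf-8"))
            if policy == "substitute":
                pieces.append(substitute)
            pos += err.end
    return "".join(pieces)


raw = b"caf\xe9.txt"
for policy in ("remove", "substitute", "relabel-latin1", "fail"):
    try:
        print(policy, "->", decode_with_policy(raw, policy))
    except UnicodeDecodeError as err:
        print(policy, "-> refused:", err.reason)
```

    For filenames, the "remove" and "substitute" filters can map two different names to the same result, so a real tool would also have to keep names unique within a directory, as noted in the list above.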



    This archive was generated by hypermail 2.1.5 : Mon Dec 13 2004 - 12:52:05 CST