mapping invalid bytes to invalid code units in deserializers for internal processing

From: Philippe VERDY (
Date: Sun Jan 23 2005 - 10:54:18 CST

  • Next message: Antoine Leca: "Re: wchar_t (was RE: 32'nd bit & UTF-8)"

    Jon Hanna <> wrote:
    > > > And if, speculatively, Windows were to support UTF-8 as a
    > > > code page, the
    > > > situation would be unchanged. Byte sequences which are
    > > > illegal UTF-8 are
    > > > garbage in that code page and so would correctly be replaced
    > > > by U+FFFD.
    > >
    > > Which is exactly what needs to be changed. 128 codepoints,
    > remember?
    > 128 flavours of garbage.

    If an application really needs a way to keep internally the value of garbage bytes that could not be safely converted to Unicode in one of its standard encoding forms, or as codepoints, I see only one solution for this internal handling: making sure that the invalid bytes will be converted to codepoints (or encoded forms) that are also invalid for Unicode, so that there can be no confusion.

    For example, with internal handling as UTF-32 code units, the application can map these invalid bytes to invalid codepoints like 0x8000nn, which are invalid in Unicode and easy to keep distinct from valid code units during internal string processing; with UTF-16 code units, these bytes could be mapped to the sequence <0xFFFE, 0xDCnn>, where <0xFFFE> is an invalid code unit acting as a special high surrogate, whose base codepoint will be 0, followed by a trailing low surrogate whose least significant bits contain the value of the invalid byte.
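    The UTF-16 mapping above can be sketched as follows (a Python sketch; the function name is illustrative and not part of any standard API):

```python
def deserialize_utf8(data: bytes) -> list[int]:
    """Decode UTF-8 into a list of UTF-16 code units, mapping each
    invalid byte nn to the placeholder pair <0xFFFE, 0xDCnn>."""
    units = []
    i = 0
    while i < len(data):
        # Try to decode a single character from the longest possible
        # UTF-8 sequence starting at position i.
        for length in (4, 3, 2, 1):
            chunk = data[i:i + length]
            try:
                ch = chunk.decode('utf-8')
            except UnicodeDecodeError:
                continue
            if len(ch) != 1:
                continue  # chunk spans several characters; try shorter
            cp = ord(ch)
            if cp > 0xFFFF:
                # Supplementary code point: emit a valid surrogate pair.
                cp -= 0x10000
                units.append(0xD800 + (cp >> 10))
                units.append(0xDC00 + (cp & 0x3FF))
            else:
                units.append(cp)
            i += length
            break
        else:
            # Invalid byte: keep its value inside an invalid code unit pair.
            units.append(0xFFFE)
            units.append(0xDC00 + data[i])
            i += 1
    return units

# The broken stream <0x41 0xC2> from the example below becomes:
# deserialize_utf8(b'\x41\xC2') -> [0x0041, 0xFFFE, 0xDCC2]
```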

    This seems trivial during deserialization of broken streams of bytes, but it still adds complexity to the serialization back to an encoding scheme, because the serializer needs to know that the special invalid internal code unit is a placeholder whose interpretation must be synchronized with the behavior of the deserializer. For example, a serializer that attempts to encode into an encoding scheme will need to create invalid sequences of bytes according to the values of the invalid code units; this is certainly completely outside the Unicode standard itself.
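    Such a non-standard serializer could be sketched like this (again with an illustrative name; it assumes the <0xFFFE, 0xDCnn> convention above and otherwise well-formed UTF-16 input):

```python
def serialize_utf8(units: list[int]) -> bytes:
    """Encode UTF-16 code units back to UTF-8 bytes, blindly restoring
    each placeholder pair <0xFFFE, 0xDCnn> as the raw byte nn."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if (u == 0xFFFE and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDCFF):
            # Placeholder pair: re-emit the original invalid byte.
            out.append(units[i + 1] & 0xFF)
            i += 2
        elif (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Valid surrogate pair: reassemble the supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out += chr(cp).encode('utf-8')
            i += 2
        else:
            out += chr(u).encode('utf-8')
            i += 1
    return bytes(out)
```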

    Whether this will be useful or not depends on the origin of the invalid code units: if the invalid code units came effectively from a deserializer, this forced non-standard behavior, such as converting the code units <0xFFFE, 0xDCnn> back to the byte <0xnn>, will possibly reproduce the original invalid bytes, but possibly within a context where these bytes may magically become valid. In many cases, invalid bytes in a stream (encoded with an encoding scheme or a legacy MBCS charset) are only invalid within limited contexts.

    However, if the application creates valid Unicode text "around" these invalid sequences, it may happen that this produces valid text once it is serialized. When it is deserialized again, the internal invalid sequences may have disappeared and the text been modified. This can be dangerous, and so such a serializer is not guaranteed to generate invalid text that can be deserialized back to the original, intended encoding form.

    One example:
    given the following broken UTF-8 text encoded on a stream of bytes:
    <0x41 0xC2>
    it is interpreted by the deserializer as:
    <U+0041, invalidbyte(0xC2)>
    which then creates this broken UTF-16 encoding form:
    <0x0041, 0xFFFE, 0xDCC2>

    Then the application decides to mix several such invalid strings, for example with the following code units:
    <0xFFFE, 0xDC80>
    coming from the deserialization of the following invalid UTF-8 encoding scheme: <0x80>

    So it attempts to append those two invalid strings, and creates a third invalid string of UTF-16 code units:
    <0x0041, 0xFFFE, 0xDCC2, 0xFFFE, 0xDC80>
    Now comes the serializer, which generates for UTF-8:
    <0x41, 0xC2, 0x80>
    Unfortunately, this UTF-8 sequence of bytes is valid, and it will now be deserialized and interpreted as the following codepoints:
    <U+0041, U+0080>
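    This failure is easy to check with a standard decoder (a Python sketch):

```python
# The concatenation of the two "invalid" byte sequences is valid UTF-8:
# 0xC2 0x80 now decodes as the single character U+0080, and the invalid
# bytes the application meant to preserve have silently disappeared.
data = bytes([0x41, 0xC2, 0x80])
decoded = data.decode('utf-8')  # no error is raised
print([hex(ord(c)) for c in decoded])  # prints ['0x41', '0x80']
```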

    Mapping invalid bytes to invalid code units in the deserializer seems innocuous (the deserializer performs it so that, internally, invalid code units can still be processed as if they were valid, using custom, private rules). But clearly, the danger then lies in the corresponding serializer (from code units in an encoding form to bytes in an encoding scheme). Should it blindly encode source strings containing invalid sequences of code units?

    Although a process using such a pair of custom serializer/deserializer, which accepts invalid bytes in a special internal private encoding form, seems to still be able to process all valid Unicode strings, it is not safe, mainly within the serializer, unless there are strict rules for the operations allowed on strings in the internal encoding form, rules that forbid producing valid encoding schemes on output. The problem is that the internal encoding form does not specify which final encoding scheme will be used. These encapsulated bytes traverse the application in a context that is clearly invalid if the output encoding scheme is the same as the encoding scheme used on input.

    Things become more tricky if those invalid strings in internal encoding forms are finally serialized into an encoding scheme other than the original one. Suppose that these invalid code units are serialized to UTF-16 (i.e. the pair <0xFFFE, 0xDCnn> that was created internally) instead of the original UTF-8.
    Should the invalid code units be serialized identically, including the pair <0xFFFE, 0xDCnn> in the output, or should the serializer emit the associated mapped byte <0xnn>?

    The latter option will clearly desynchronize the output (in the case of a UTF-16 encoding scheme), so maybe it should be a pair of bytes <0x00, 0xnn>? But then how can applications using this output distinguish it from valid Latin-1 characters?
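    A quick illustration of that ambiguity (a Python sketch, assuming a big-endian UTF-16 stream):

```python
# Emitting the preserved byte 0xC2 as the pair <0x00, 0xC2> in a UTF-16BE
# stream is indistinguishable from the perfectly valid character U+00C2
# ('Â'): the receiver cannot tell preserved garbage from Latin-1 text.
stream = bytes([0x00, 0x41, 0x00, 0xC2])
print(stream.decode('utf-16-be'))  # prints 'AÂ'
```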

    This archive was generated by hypermail 2.1.5 : Mon Jan 24 2005 - 12:39:46 CST