Re: folding UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Aug 25 2006 - 20:13:41 CDT


    > > Further, what about combining character sequences? Inserting a CRLF
    > > between a base character and a combining character or between one of
    > > the combining characters would not produce an ill-formed
    > > byte-sequence. Would you agree/disagree?
    >
    > I would agree, but I have the feeling this was intended to be relevant
    > to the "mangled text" question above and I don't see the connection.

    To further clarify Doug Ewell's response on this particular question:

    If you have a Unicode string expressed in well-formed UTF-8,
    say:

    écho <U+0065, U+0301, U+0063, U+0068, U+006F>

    UTF-8: <0x65 0xCC 0x81 0x63 0x68 0x6F>

    and some process inserts a CRLF between the "e" and the
    combining acute accent, you would get:

           <U+0065, U+000D, U+000A, U+0301, U+0063, U+0068, U+006F>

    UTF-8: <0x65 0x0D 0x0A 0xCC 0x81 0x63 0x68 0x6F>

    The resulting UTF-8 sequence is still well-formed UTF-8.
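
    A minimal Python sketch of this case, assuming the built-in
    strict "utf-8" codec stands in for a conformant converter
    (the variable names are illustrative only):

```python
# "écho" spelled with a combining acute accent: e + U+0301 + c + h + o
text = "\u0065\u0301\u0063\u0068\u006F"
utf8 = text.encode("utf-8")

# Insert CR LF as *characters*, between the base "e" and the combining mark.
folded = text[:1] + "\r\n" + text[1:]
folded_utf8 = folded.encode("utf-8")

# Strict decoding raises on ill-formed input; here it succeeds, so the
# byte sequence is still well-formed UTF-8.
roundtrip = folded_utf8.decode("utf-8", errors="strict")

print(utf8.hex(" "))
print(folded_utf8.hex(" "))
```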

    What has happened here, however, is that the inappropriate
    insertion of a CRLF in the middle of a combining character
    sequence has resulted in the isolated U+0301 constituting a
    "defective combining character sequence" -- because it does
    not directly follow a base character.

    That isn't a good thing to do, for a couple of reasons.
    First, it creates formatting problems -- you end up splitting
    apart things which should be displayed together, forcing
    a rendering system to display a combining mark separated
    from its base. Second, this is contrary to the linebreaking
    rules expressed in UAX #14 for how hard line breaks should
    be handled and how combining character sequences should be
    kept together.
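
    One rough way to spot such a defective sequence, sketched in
    Python with the standard unicodedata module. The helper names
    are hypothetical, and the test is a simplification: it treats
    any character in categories L, N, P, S or Zs as a base and
    ignores the special status of ZWJ and ZWNJ.

```python
import unicodedata

def is_mark(ch):
    # General Category Mn, Mc or Me
    return unicodedata.category(ch).startswith("M")

def is_base(ch):
    # Simplified: a graphic character that is not a combining mark
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"

def defective_marks(text):
    """Return indices of combining marks that begin a defective sequence."""
    hits = []
    prev = None  # previous character, or None at the start of the text
    for i, ch in enumerate(text):
        # A mark is fine if it follows a base or continues a run of marks.
        if is_mark(ch) and (prev is None or not (is_base(prev) or is_mark(prev))):
            hits.append(i)
        prev = ch
    return hits

print(defective_marks("e\u0301cho"))
print(defective_marks("e\r\n\u0301cho"))
```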

    So what you have here is a process which is doing bad
    linebreaking and which is violating the integrity of
    combining character sequences, but, on the other hand,
    it isn't creating mangled, ill-formed UTF-8, as referred to
    in C12a of TUS 4.0.

    The kind of mangling referred to in C12a would happen if the
    CRLF were inserted not as characters, honoring character
    boundaries, but instead were inserted by some rogue process
    merely as bytes in a UTF-8 sequence, ignoring character
    boundaries altogether. If the CRLF were inserted between the 0xCC and
    the 0x81, you would get

    *NOT* UTF-8: <0x65 0xCC 0x0D 0x0A 0x81 0x63 0x68 0x6F>

    Any conformant UTF-8 converter would be obliged to spit out
    an error on attempting to interpret that byte sequence, because
    it is just plain ill-formed UTF-8, and doesn't follow the
    rules for UTF-8 at all.
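
    The byte-level case can be sketched the same way in Python,
    again assuming the built-in strict "utf-8" codec as the
    converter:

```python
utf8 = bytes([0x65, 0xCC, 0x81, 0x63, 0x68, 0x6F])

# Insert CR LF as raw *bytes* between the lead byte 0xCC and its
# trail byte 0x81, ignoring character boundaries.
mangled = utf8[:2] + b"\r\n" + utf8[2:]

try:
    mangled.decode("utf-8")  # strict error handling is the default
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print(f"ill-formed at byte offset {exc.start}: {exc.reason}")
```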

    The exceptions allowed for by C12a are where a higher-level
    protocol has enough information to hand to know *exactly*
    how that "0x0D 0x0A" got to be there, and so would be able
    to subtract it back out safely, thereby recovering content
    that would then be known to be valid, well-formed UTF-8.
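
    As an illustration only, here is a Python sketch of a
    hypothetical higher-level protocol that folds after every
    `width` bytes and therefore knows exactly where each CRLF
    was put. The functions fold() and unfold() and the fixed-width
    rule are assumptions made for the example, not any particular
    standard's folding scheme.

```python
CRLF = b"\r\n"

def fold(data: bytes, width: int) -> bytes:
    """Insert CRLF after every `width` bytes, ignoring character boundaries."""
    chunks = [data[i:i + width] for i in range(0, len(data), width)]
    return CRLF.join(chunks)

def unfold(folded: bytes, width: int) -> bytes:
    """Remove exactly the CRLFs that fold() inserted, located by position."""
    out = bytearray()
    pos = 0
    while pos < len(folded):
        out += folded[pos:pos + width]
        pos += width
        if pos < len(folded):
            if folded[pos:pos + 2] != CRLF:
                raise ValueError(f"expected fold marker at offset {pos}")
            pos += 2
    return bytes(out)

original = "e\u0301cho".encode("utf-8")
on_the_wire = fold(original, 2)      # ill-formed as UTF-8 while folded
recovered = unfold(on_the_wire, 2)   # well-formed again
print(recovered.decode("utf-8") == "e\u0301cho")
```

    Removing the markers by position, rather than deleting every
    0x0D 0x0A found, keeps any CRLF that was part of the original
    content.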

    --Ken


