Re: folding UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Aug 25 2006 - 20:13:41 CDT

Next message: Shariqul Islam Azad - Omi: "RE: Unicode FAQ pages updated"

Previous message: Rick McGowan: "Unicode FAQ pages updated"
Maybe in reply to: Oliver Block: "folding UTF-8"

> > Further, what about combining character sequences? Inserting a CRLF
> > between a base character and a combining character or between one of
> > the combining characters would not produce an ill-formed
> > byte-sequence. Would you agree/disagree?
>
> I would agree, but I have the feeling this was intended to be relevant
> to the "mangled text" question above and I don't see the connection.

To further clarify Doug Ewell's response on this particular question:

If you have a Unicode string expressed in well-formed UTF-8,
say:

écho <U+0065, U+0301, U+0063, U+0068, U+006F>

UTF-8: <0x65 0xCC 0x81 0x63 0x68 0x6F>

and some process inserts a CRLF between the "e" and the
combining acute accent, you would get:

<U+0065, U+000D, U+000A, U+0301, U+0063, U+0068, U+006F>

UTF-8: <0x65 0x0D 0x0A 0xCC 0x81 0x63 0x68 0x6F>

The resulting UTF-8 sequence is still well-formed UTF-8.
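
As a quick check of the example above, here is a minimal Python sketch. The variable names are mine, and Python's built-in "utf-8" codec stands in for "a conformant converter"; nothing here comes from the original thread.

```python
# "écho" spelled with a combining acute accent: e + U+0301 + c + h + o
text = "e\u0301cho"
encoded = text.encode("utf-8")

# Insert CRLF as *characters*, between the base "e" and the combining mark.
folded = text[:1] + "\r\n" + text[1:]
folded_bytes = folded.encode("utf-8")

# Strict decoding succeeds: the byte sequence is still well-formed UTF-8,
# even though the combining mark is now stranded after the line break.
round_tripped = folded_bytes.decode("utf-8")

print(encoded.hex(" "))       # bytes of the original string
print(folded_bytes.hex(" "))  # bytes after the character-level insertion
```

The point is only that encoding and decoding both succeed; whether the result is sensible text is a separate question.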

What has happened here, however, is that the inappropriate
insertion of a CRLF in the middle of a combining character
sequence has resulted in the isolated U+0301 constituting a
"defective combining character sequence" -- because it does
not directly follow a base character.
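
A rough way to spot this condition in Python is sketched below. It is a simplification under stated assumptions: "base character" is approximated as any character whose General Category is L, N, P, S or Zs, "combining mark" as any category starting with M, and ZWJ/ZWNJ are ignored. The function names are hypothetical.

```python
import unicodedata

def is_combining_mark(ch):
    # General Category Mn, Mc or Me.
    return unicodedata.category(ch).startswith("M")

def is_base(ch):
    # Approximation: a graphic character that is not a combining mark.
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"

def find_defective_marks(text):
    """Return the indexes of combining marks that begin a defective
    combining character sequence (no base character in front of them)."""
    offsets = []
    prev = None
    for i, ch in enumerate(text):
        if is_combining_mark(ch):
            follows_sequence = prev is not None and (
                is_base(prev) or is_combining_mark(prev)
            )
            if not follows_sequence:
                offsets.append(i)
        prev = ch
    return offsets

intact = find_defective_marks("e\u0301cho")      # accent follows "e"
broken = find_defective_marks("e\r\n\u0301cho")  # accent follows LF
```

In the second call the U+0301 follows U+000A, a control character, so it is reported as the start of a defective sequence.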

That isn't a good thing to do, for a couple of reasons.
First, it creates formatting problems -- you end up splitting
apart things which should be displayed together, forcing
a rendering system to display a combining mark separated
from its base. Second, this is contrary to the linebreaking
rules expressed in UAX #14 for how hard line breaks should
be handled and how combining character sequences should be
kept together.
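
To illustrate just the "keep combining character sequences together" part, here is a small Python sketch of a folder that never breaks immediately before a combining mark. It is far short of a full line-breaking algorithm: the function name, the width parameter (counted in code points) and the fallback of letting a line run long are all my own assumptions.

```python
import unicodedata

def is_mark(ch):
    return unicodedata.category(ch).startswith("M")

def fold_text(text, width):
    """Insert CRLF roughly every `width` code points, but never directly
    before a combining mark."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width
        # Back up while the break would strand a combining mark.
        while end > start and is_mark(text[end]):
            end -= 1
        if end == start:
            # The whole window is one sequence: let the line run long
            # rather than split it.
            end = start + width
            while end < len(text) and is_mark(text[end]):
                end += 1
        lines.append(text[start:end])
        start = end
    if start < len(text) or not lines:
        lines.append(text[start:])
    return "\r\n".join(lines)

folded_text = fold_text("ae\u0301cho", 2)
```

With a width of 2, the break that would have fallen between "e" and U+0301 is moved back to fall before the "e" instead.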

So what you have here is a process which is doing bad
linebreaking and which is violating the integrity of
combining character sequences, but, on the other hand,
it isn't creating mangled, ill-formed UTF-8, as referred to
in C12a of TUS 4.0.

The kind of mangling referred to in C12a would happen if the
CRLF were inserted not as characters, honoring character
boundaries, but instead were inserted by some rogue process
merely as bytes in a UTF-8 sequence, ignoring character
boundaries altogether. If the CRLF were inserted between the 0xCC and
the 0x81, you would get

*NOT* UTF-8: <0x65 0xCC 0x0D 0x0A 0x81 0x63 0x68 0x6F>

Any conformant UTF-8 convertor would be obliged to spit out
an error on attempting to interpret that byte sequence, because
it is just plain ill-formed UTF-8, and doesn't follow the
rules for UTF-8 at all.
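
The same failure can be reproduced in Python, again using the built-in strict "utf-8" codec as the stand-in for a conformant converter. The helper name is hypothetical.

```python
def is_well_formed_utf8(data):
    """Return True if `data` decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

good = "e\u0301cho".encode("utf-8")  # 65 CC 81 63 68 6F

# Insert CRLF as raw *bytes* between 0xCC and 0x81, ignoring the
# character boundary.
mangled = good[:2] + b"\r\n" + good[2:]

good_ok = is_well_formed_utf8(good)
mangled_ok = is_well_formed_utf8(mangled)
```

Here the lead byte 0xCC is followed by 0x0D, which is not a continuation byte, so the strict decoder raises an error rather than guessing at a character.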

The exceptions allowed for by C12a are where a higher-level
protocol has enough information to hand to know *exactly*
how that "0x0D 0x0A" got to be there, and so would be able
to subtract it back out safely, thereby recovering content
that would then be known to be valid, well-formed UTF-8.
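
As a sketch of what such a higher-level protocol might look like, assume a purely hypothetical scheme that folds a byte stream by inserting CRLF after every `width` bytes and nowhere else. Because the receiver knows that rule, it can remove exactly those bytes before handing the content to a UTF-8 decoder. Neither function below corresponds to any real protocol's API.

```python
CRLF = b"\r\n"

def fold_bytes(data, width):
    # Hypothetical sender: break after every `width` bytes, ignoring
    # character boundaries entirely.
    chunks = [data[i:i + width] for i in range(0, len(data), width)]
    return CRLF.join(chunks)

def unfold_bytes(folded, width):
    # Hypothetical receiver: it knows where each CRLF must be, so it
    # removes only those bytes and rejects anything else.
    out = bytearray()
    step = width + len(CRLF)
    for i in range(0, len(folded), step):
        piece = folded[i:i + step]
        if len(piece) > width:
            if piece[width:] != CRLF:
                raise ValueError("unexpected bytes where a fold was expected")
            piece = piece[:width]
        out += piece
    return bytes(out)

original = "e\u0301cho".encode("utf-8")
on_the_wire = fold_bytes(original, 2)      # splits 0xCC from 0x81
recovered = unfold_bytes(on_the_wire, 2)
recovered_text = recovered.decode("utf-8")  # well-formed again
```

The folded form on its own is not well-formed UTF-8; only after the protocol layer has subtracted its own insertions is the content safe to interpret.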

--Ken


This archive was generated by hypermail 2.1.5 : Fri Aug 25 2006 - 20:18:28 CDT