Re: folding UTF-8

From: Doug Ewell (
Date: Fri Aug 25 2006 - 00:49:38 CDT

    Oliver Block <lists at block dash online dot eu> wrote:

    > definition C12a of Unicode Standard Version 4.0 mentions so-called "mangled"
    > text caused by folding (last paragraph of C12a).
    > Having the definition in mind (italic text at the top of C12a) I
    > understand mangled text as ill-formed text, that is not according to
    > table 3-6. Would you agree/disagree?

    It is ill-formed text of a special type: it would have been well-formed
    if not for an easily recognized, external process or layer -- the
    example mentions inserting a CR/LF pair every 80 bytes -- that can
    easily and unequivocally be reversed.
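
    As a rough illustration of that kind of reversible mangling, here is a
    minimal Python sketch. The helper names (fold, unfold,
    is_well_formed_utf8) are made up for this example and do not come from
    the standard; it assumes the folding layer inserts CR/LF after exactly
    every 80 bytes, without regard to character boundaries.

```python
CRLF = b"\r\n"

def fold(data: bytes, width: int = 80) -> bytes:
    """Insert CRLF after every `width` bytes, ignoring character boundaries."""
    chunks = [data[i:i + width] for i in range(0, len(data), width)]
    return CRLF.join(chunks)

def unfold(data: bytes, width: int = 80) -> bytes:
    """Reverse fold() by skipping the CRLF expected after every `width` bytes."""
    out = bytearray()
    step = width + len(CRLF)
    for i in range(0, len(data), step):
        out += data[i:i + width]
    return bytes(out)

def is_well_formed_utf8(data: bytes) -> bool:
    """Use Python's strict UTF-8 decoder as the well-formedness check."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 79 ASCII bytes, then U+00E9 (bytes C3 A9): the fold lands between C3 and A9.
original = ("a" * 79 + "\u00e9" + "tail").encode("utf-8")
folded = fold(original)

print(is_well_formed_utf8(original))   # well-formed before folding
print(is_well_formed_utf8(folded))     # ill-formed: C3 is followed by CR
print(unfold(folded) == original)      # the mangling is exactly reversible
```

    The unfold step here is positional rather than "delete every CR/LF",
    so line breaks that were already part of the text are left alone.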

    Definition C12a states that a process may interpret such data, but goes
    on to say, "However, such repair of mangled data is a special case, and
    it must not be used in circumstances where it would cause security
    problems." I think it is clear that the intent of C12a is not to allow
    a conformant process to interpret just any old random junk as if it were
    well-formed UTF-8.

    > Further, what about combining character sequences? Inserting a CRLF
    > between a base character and a combining character or between one of
    > the combining characters would not produce an ill-formed
    > byte-sequence. Would you agree/disagree?

    I would agree, but I have the feeling this was intended to be relevant
    to the "mangled text" question above and I don't see the connection.
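
    For what it is worth, the byte-level claim is easy to check. The
    following is only a sketch, using "e" plus U+0301 COMBINING ACUTE
    ACCENT as an arbitrary example sequence: the folded bytes still decode
    as UTF-8, although the text they represent is no longer the same.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT (an arbitrary example pair).
sequence = "e\u0301"
encoded = sequence.encode("utf-8")          # b"e\xcc\x81"

# Insert CRLF on the character boundary between base and combining mark.
split_at = len("e".encode("utf-8"))
folded = encoded[:split_at] + b"\r\n" + encoded[split_at:]

# Strict decoding succeeds, so the byte sequence is still well-formed UTF-8.
decoded = folded.decode("utf-8")
print(decoded == "e\r\n\u0301")

# The combining mark now follows LF instead of "e", so the text differs
# even after normalization.
print(unicodedata.normalize("NFC", sequence) ==
      unicodedata.normalize("NFC", decoded))
```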

    > (As every specification that requires folding does also require
    > unfolding, this would probably be more a semantic issue.)

    I do not agree that every specification that requires folding also
    requires unfolding.

    Doug Ewell
    Fullerton, California, USA
