Re: Filtering and displaying untrusted UTF-8

From: Asmus Freytag
Date: Mon Dec 28 2009 - 21:16:16 CST

    Welcome back, Jason.

    I continue to believe that filtering merely "reserved" character
    codes is a fool's game, and one based on a pretty fundamental
    misunderstanding of how the Unicode Standard is designed: it allows
    forward-compatible implementations that have a reasonable, though not
    perfect, chance of needing no modification as the repertoire of
    characters is extended.

    Philippe covered it first in his reply, but he's mixed the good with the
    bad, the true with the false or merely irrelevant, so I shall parse his
    reply here:

    On 12/28/2009 5:10 PM, verdy_p wrote:
    > "Jason Schauberger" wrote:
    >> I tend to disagree. Of course it's likely that now unassigned code
    >> points are assigned a character in future Unicode versions. However,
    >> it's also possible that some of them will be assigned non-characters.
    > There's absolutely NO sign in TUS (or in ISO/IEC 10646), in the roadmap, or in the allocation policy document that new non-characters will be allocated for a very long time.
    ... if ever.

    So far, I can only agree with Philippe.

    What follows in the next few paragraphs is best characterized as a
    "flight of fancy" and, in my view, doesn't contribute anything useful to
    the discussion. Who knows what the future will be "post-Unicode"?
    > Maybe by that time it will only be to allow the transition to a newer standard replacing both TUS and ISO/IEC 10646 (and defining new encoding transformation formats for a newer universal set, which would only be justified by the adoption of very different allocation policies, along with the adoption of a new character model).
    > But there's still NO sign that such a newer standard will be needed and developed (even in some far, undefined
    > future). Given current software development and deployment practices, nothing should occur for at least two decades.
    > Even if this ever happens, there will be a long transition period where it will coexist with the current UCS and its UTFs, without any fundamental change in the UCS allocation policies for non-special planes, before the UCS itself gets frozen when all the new standardization work occurs within the new standard. And long before it happens, you will have seen various experiments and competing proposals for the new standard, and long discussions that could take another one or two decades (because it will be more complicated than when the UCS was adopted).
    > The octet encoding unit (for UTFs) will certainly be deprecated long before then for data interchange (and some form of UTF-32 will have become more universal), without needing to abandon the UCS itself, which is now critical for lots of other computing standards (that would be really difficult and extremely costly to adapt to a completely different character encoding model).
    ... skipped to here.

    > All existing non-characters have been allocated for supporting the conversion of legacy SBCS and MBCS charsets
    > (inherited up to the end of the last millennium) to the UCS and to support the adoption of standard UTFs; and no new allocations of non-characters have occurred for a very long time.
    The above statement is completely counterfactual. NONE of what
    Unicode defines as "noncharacters" was allocated for such legacy
    purposes. Control codes and the Private Use Area, however, were.
    > I think it's much better for the immediate future to use the already assigned default properties for unassigned character blocks: these default properties (and the layout of existing planes) already guide the existing roadmaps for new allocations. So treat unassigned positions according to their default properties, which are definitely not those of non-characters.
    That recommendation is useful. Unicode has been designed to allow
    forward-looking implementations to be as compatible with future
    character assignments as possible.
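
    To illustrate why "is it assigned?" is a poor filtering criterion, here
    is a minimal sketch, assuming Python and its standard unicodedata
    module. The helper name is hypothetical. The module's data is tied to
    the Unicode version bundled with the interpreter, so the answer for a
    given code point can change when the runtime is upgraded.

```python
import unicodedata


def is_unassigned_here(cp: int) -> bool:
    """Return True if this interpreter's Unicode data reports Cn for cp.

    Hypothetical helper: the result depends on unicodedata.unidata_version,
    so a code point that is "reserved" today may be a character tomorrow.
    """
    return unicodedata.category(chr(cp)) == "Cn"


# Which Unicode version is this answer based on?
print("Unicode data version:", unicodedata.unidata_version)

# U+50000 lies in Plane 5, which has no assignments at present.
print("U+50000 reports Cn:", is_unassigned_here(0x50000))

# A noncharacter also reports Cn, so General_Category alone cannot
# separate "reserved for a future character" from "noncharacter".
print("U+FFFF reports Cn:", is_unassigned_here(0xFFFF))
```

    Because both kinds of code point share the same General_Category, a
    filter that wants to keep reserved code points but drop noncharacters
    has to test for the noncharacter ranges explicitly.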
    > (Maybe, if new non-characters really need to be assigned in some far future, they should not occur in any of the first planes, but only in special plane 14).
    Plane 14 is loosely designated for any characters with unusual or
    "special" properties.
    > So consider all unassigned positions in the existing empty planes 4 to
    > 13 as being used for characters (possibly combining, but possibly not normalized), not for non-characters (with the only exception of U+xxFFFE and U+xxFFFF, long since assigned to non-characters, where xx can be any supplementary plane from hex 01 to 10).
    What he is saying is: allow all character codes from planes 1 through 16
    (except for the two non-characters at the end of each), but exclude
    anything on Plane 14 that doesn't fit your plain-text model. Currently,
    this would mean you allow the variation selectors, but not the tags.
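
    The rule above could be sketched roughly as follows, in Python. The
    function names are hypothetical. Several choices here are assumptions
    that go beyond what was said: the sketch also rejects surrogates and
    the BMP noncharacters (U+FDD0..U+FDEF, U+FFFE, U+FFFF), it silently
    drops rejected code points rather than replacing them, and it does not
    address control characters at all.

```python
def allowed_in_plain_text(cp: int) -> bool:
    """Range-based check that does not depend on any Unicode version."""
    if not 0 <= cp <= 0x10FFFF:
        return False  # outside the Unicode code space
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogate code points (assumption: reject)
    if (cp & 0xFFFE) == 0xFFFE:
        return False  # U+xxFFFE and U+xxFFFF, the last two of every plane
    if 0xFDD0 <= cp <= 0xFDEF:
        return False  # the 32 BMP noncharacters (assumption: reject)
    if 0xE0000 <= cp <= 0xE007F:
        return False  # Plane 14 Tags block
    # Everything else passes, including currently unassigned code points
    # and the Plane 14 variation selectors (U+E0100..U+E01EF).
    return True


def filter_text(text: str) -> str:
    """Drop every code point the check rejects (assumption: drop, not replace)."""
    return "".join(ch for ch in text if allowed_in_plain_text(ord(ch)))


if __name__ == "__main__":
    sample = "a\U000E0041b\U0001FFFFc\U000E0100"
    print([f"U+{ord(ch):04X}" for ch in filter_text(sample)])
```

    Since the check is a handful of fixed ranges, it needs no update when
    new characters are assigned in planes that are empty today.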

    The remainder seems to be another "flight of fancy" - skipping...
    > In fact, the tentative justifications seen in various places for encoding linguistic grapheme clusters (those used in collation, for example) or for encoding glyphs (and their attributes) do not need to change the UCS principles and do not require any new allocations (this is not even needed for compatibility or transition). It will remain best to develop a new higher-level standard that uses the UCS for the effective encoding of its plain-text characters. This is what already happens with the standardization of various XML schemas or several computing languages (and their support libraries), when they want to encode something other than just plain text...
    > And there is still plenty of room for evolution in font file formats (OpenType is now an incredible mess, still locked in various intellectual property problems that should end when the existing patents are freed to allow better alternate representations) and in their interchange (separately or embedded in documents, or through "web services"), or in the standardization of text-rendering engines (and of CSS, SVG, ECMAScript, HTML...), or in input methods and user customizations: all of them can work better with the UCS and should support better i18n features with much better interoperability.
    ... to here.


