Re: Filtering and displaying untrusted UTF-8

From: verdy_p (
Date: Mon Dec 28 2009 - 19:10:30 CST


    "Jason Schauberger" wrote:
    > I tend to disagree. Of course it's likely that now unassigned code
    > points are assigned a character in future Unicode versions. However,
    > it's also possible that some of them will be assigned non-characters.

    There's absolutely NO sign in TUS (or in ISO/IEC 10646), nor in the roadmap or in the allocation policy
    document, that new non-characters will be allocated for a very long time.

    Maybe by then, it will only be to allow the transition to a newer standard replacing both TUS and ISO/IEC
    10646 (and defining new encoding transformation formats for a newer universal set, which will only be justified by
    the adoption of very different allocation policies, along with the adoption of a new character model).

    But there's still NO sign that such a newer standard will be needed and developed (even in some distant, undefined
    future). Given the current software development and deployment practices, nothing should occur for at least two
    decades.

    Even if this ever happens, there will be a long transition period where it will coexist with the current UCS and its
    UTFs, without any fundamental change in the UCS allocation policies for non-special planes, before the UCS itself
    gets frozen once all the new standardization work takes place within the new standard. And long before it happens,
    you will have seen various experiments and competing proposals for the new standard, and long discussions that
    could take another one or two decades (because it will be more complicated than when the UCS was adopted).

    The octet as an encoding unit (for UTFs) will certainly be deprecated for data interchange long before that (and
    some form of UTF-32 will have become more universal), without any need to abandon the UCS itself, which is now
    critical for lots of other computing standards (which would be really difficult and extremely costly to adapt to a
    completely different character encoding model).

    All existing non-characters were allocated to support the conversion of legacy SBCS and MBCS charsets
    (inherited up to the end of the last millennium) to the UCS and to support the adoption of the standard UTFs; and
    no new allocations of non-characters have occurred for a very long time now. I think it's much better, for the
    foreseeable future, to use the default properties already assigned to unassigned character blocks: these default
    properties (and the layout of existing planes) are already guiding the existing roadmaps for new allocations. So
    treat unassigned positions according to their default properties, which are definitely not those of non-characters.

    (Maybe, if new non-characters really need to be assigned in some distant future, they should not occur in any of
    the first planes, but only in special plane 14.) So consider all unassigned positions in the currently empty planes
    4 to 13 as being reserved for characters (possibly combining, but possibly not normalized), not for non-characters
    (with the only exception of U+xxFFFE and U+xxFFFF, long since assigned to non-characters, where xx can be any
    supplementary plane from hex 01 to 10).
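
    To make the distinction concrete, here is a minimal sketch in Python of how a filter could separate the fixed
    non-characters from merely unassigned positions. The names is_noncharacter and classify are hypothetical,
    introduced only for illustration; the sketch also covers the BMP non-characters (U+FDD0..U+FDEF, U+FFFE and
    U+FFFF), which are outside the planes discussed above. What counts as "unassigned" depends on the Unicode version
    bundled with the Python interpreter in use.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 code points reserved as non-characters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # U+FDD0..U+FDEF in the BMP, plus the last two code points of each of
    # the 17 planes (U+xxFFFE and U+xxFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    """Hypothetical helper: label a code point for filtering purposes."""
    if is_noncharacter(cp):
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Cs":
        return "surrogate"
    if category == "Cn":
        # Unassigned in this Unicode version: handle it as a possible
        # future character, not as a non-character.
        return "unassigned"
    return "assigned"


if __name__ == "__main__":
    for cp in (0x0041, 0x40000, 0x4FFFE, 0xFDD0):
        print(f"U+{cp:04X}: {classify(cp)}")
```

    Under this sketch, a filter for untrusted input would reject or replace only the "noncharacter" (and lone
    "surrogate") cases, and would let "unassigned" positions through to be handled by their default properties.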

    In fact, the attempted justifications seen in various places for encoding linguistic grapheme clusters (those used
    in collation, for example) or for encoding glyphs (and their attributes) do not need to change the UCS principles
    and do not require any new allocations for them (this is not even needed for compatibility or transition). It will
    remain best to develop a new higher-level standard using the UCS for the actual encoding of its plain-text
    characters. This is what already happens with the standardization of various XML schemas or several computing
    languages (and their support libraries), when they want to encode something other than just plain text...

    And there is still plenty of room for evolution in font file formats (OpenType is now an incredible mess, still
    locked in various intellectual property problems that should come to an end when the existing patents are freed
    to allow better alternate representations) and in their interchange (separately or embedded in documents, or
    through "web services"), or in the standardization of text-rendering engines (and of CSS, SVG, ECMAScript, HTML...),
    or in input methods and user customizations: all of them can work better with the UCS and should support better
    i18n features with much better interoperability.


    This archive was generated by hypermail 2.1.5 : Mon Dec 28 2009 - 19:12:17 CST