Re: Filtering and displaying untrusted UTF-8

From: verdy_p (verdy_p@wanadoo.fr)
Date: Mon Dec 28 2009 - 19:10:30 CST

Next message: Asmus Freytag: "Re: Filtering and displaying untrusted UTF-8"

Previous message: verdy_p: "Re: Filtering and displaying untrusted UTF-8"
In reply to: Jason Schauberger: "Re: Filtering and displaying untrusted UTF-8"
Next in thread: Asmus Freytag: "Re: Filtering and displaying untrusted UTF-8"
Reply: Asmus Freytag: "Re: Filtering and displaying untrusted UTF-8"
Reply: Doug Ewell: "Re: Filtering and displaying untrusted UTF-8"

"Jason Schauberger" wrote:
> I tend to disagree. Of course it's likely that now unassigned code
> points are assigned a character in future Unicode versions. However,
> it's also possible that some of them will be assigned non-characters.

There is absolutely NO sign in TUS (or in ISO/IEC 10646), in the roadmap, or in the allocation policy document
that new non-characters will be allocated any time soon.

If it ever happens, it will probably be only to allow the transition to a newer standard replacing both TUS and
ISO/IEC 10646 (and defining new encoding transformation formats for a newer universal set, which would only be
justified by the adoption of very different allocation policies, along with a new character model).

But there is still NO sign that such a newer standard will be needed or developed (even in some far, undefined
future). Given current software development and deployment practices, nothing should happen for at least two
decades.

Even if this ever happens, there will be a long transition period during which the new standard coexists with the
current UCS and its UTFs, without any fundamental change in the UCS allocation policies for non-special planes. Only
once all new standardization work has moved to the new standard will the UCS itself be frozen. And long before that,
you will have seen various experiments and competing proposals for the new standard, and long discussions that
could take another one or two decades (because it will be more complicated than when the UCS was adopted).

The octet encoding unit (for UTFs) will certainly be deprecated for data interchange long before that (and some form
of UTF-32 will have become more universal), without requiring the UCS itself to be abandoned: the UCS is now critical
to lots of other computing standards (which would be really difficult and extremely costly to adapt to a completely
different character encoding model).

All existing non-characters were allocated to support the conversion of legacy SBCS and MBCS charsets
(inherited up to the end of the last millennium) to the UCS and to support the adoption of the standard UTFs; and no
new non-characters have been allocated for a very long time. I think it is much better, for any immediate future, to
use the default properties already assigned to unassigned character blocks: these default properties (and the
layout of existing planes) are already guiding the existing roadmaps for new allocations. So treat unassigned
positions according to their default properties, which are definitely not those of non-characters.

(Maybe, if new non-characters really need to be assigned in some far future, they should not go in any of the
first planes, but only in special plane 14.) So consider all unassigned positions in the existing empty planes 4 to
13 as reserved for characters (possibly combining, and possibly not normalized), not for non-characters (with the
only exception of U+xxFFFE and U+xxFFFF, long since assigned to non-characters, where xx can be any
supplementary plane from hex 01 to 10).
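
As a rough illustration of that policy for a filter on untrusted text, here is a minimal Python sketch. The
function names (is_noncharacter, classify, sanitize) are hypothetical and only for illustration. Beyond the
U+xxFFFE/U+xxFFFF pairs named above, it also assumes the two BMP code points U+FFFE and U+FFFF and the range
U+FDD0..U+FDEF, which the Unicode Standard lists as non-characters but which this message does not mention. What
counts as "unassigned" depends on the Unicode version bundled with the Python runtime.

```python
import unicodedata

REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def is_noncharacter(cp: int) -> bool:
    """Return True for the 66 code points reserved as non-characters."""
    # U+FDD0..U+FDEF (assumption: taken from the Unicode Standard, not from
    # the message), plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    """Coarse classification of a code point for filtering purposes."""
    if is_noncharacter(cp):
        return "noncharacter"
    # Python reports general category "Cn" for unassigned code points
    # (and for non-characters, which were already handled above).
    if unicodedata.category(chr(cp)) == "Cn":
        return "unassigned"
    return "assigned"


def sanitize(text: str) -> str:
    """Replace non-characters; pass unassigned code points through untouched."""
    return "".join(
        REPLACEMENT if is_noncharacter(ord(ch)) else ch for ch in text
    )


if __name__ == "__main__":
    for cp in (0x41, 0x50000, 0xFFFE, 0x10FFFF):
        print(f"U+{cp:04X}: {classify(cp)}")
    print(sum(is_noncharacter(cp) for cp in range(0x110000)))
```

The design choice mirrors the argument above: only the fixed set of non-characters is rejected, while an
unassigned position in, say, plane 5 is kept as if it were an ordinary character that a later version may define.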

In fact, the tentative justifications seen in various places for encoding linguistic grapheme clusters (those used
in collation, for example) or for encoding glyphs (and their attributes) do not need any change to the UCS
principles and do not require any new allocations (not even for compatibility or transition). It will
remain best to develop a new higher-level standard that uses the UCS for the actual encoding of its plain-text
characters. This is what already happens with the standardization of various XML schemas or several computing
languages (and their support libraries), when they want to encode something other than just plain text...

And there is still plenty of room for evolution in font file formats (OpenType is now an incredible mess, still
locked in various intellectual property problems that should come to an end when the existing patents are freed,
allowing better alternate representations) and in their interchange (separately or embedded in documents, or
through "web services"), or in the standardization of text-rendering engines (and of CSS, SVG, ECMAScript, HTML...),
or in input methods and user customizations: all of them can work better with the UCS and should support better i18n
features with much better interoperability.

Philippe.
