Re: Code pages and Unicode

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 25 Aug 2011 03:00:10 +0200

2011/8/25 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
> It will only happen when the need becomes obvious, which may be never,
> or may be 30 years hence.  It's even conceivable that UTF-16 will
> drop out of use.
"Conceivable", but extremely unlikely: UTF-16 will remain in use in a
great many common cases, even if it can only represent a subset of
the new encoding.

[begin side note]
This situation is similar to that of the UCS-2 subset and of the
ISO 10646 "implementation levels", which have been withdrawn and are
no longer meaningful as a condition for conformance. Conforming
applications today *must* behave in a way that respects the
unbreakability and unreorderability of surrogate pairs. Supporting
isolated surrogates, or custom encodings that depend on pairing rules
other than the standard one (a high surrogate followed by a low
surrogate), is not conforming.
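As a minimal sketch of the pairing rule just described (my own
illustration, not text from any standard): the function below scans a
list of 16-bit code units and reports whether every surrogate is part
of a high-then-low pair. The function names are hypothetical.

```python
def is_high_surrogate(unit):
    """True for code units in the high (lead) surrogate range."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    """True for code units in the low (trail) surrogate range."""
    return 0xDC00 <= unit <= 0xDFFF


def has_wellformed_surrogates(units):
    """Return True if every surrogate in `units` belongs to a pair made
    of a high surrogate immediately followed by a low surrogate."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if is_high_surrogate(unit):
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= n or not is_low_surrogate(units[i + 1]):
                return False
            i += 2
        elif is_low_surrogate(unit):
            # A low surrogate with no preceding high surrogate is isolated.
            return False
        else:
            i += 1
    return True
```

For instance, the sequence 0x0041, 0xD83D, 0xDE00 passes this check,
while the same two surrogates in the opposite order do not.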

This does not mean that applications have to assign distinctive
semantics to surrogates, or have to "support" non-BMP characters by
recognizing their distinctive properties. As long as runs of
surrogates are handled in such a way that they are never reordered
or recombined into arbitrary sequences, these applications can satisfy
the conformance requirement without having to claim a higher
"implementation level".
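To illustrate what "never reordered" can mean in practice, here is a
rough sketch (an assumption of mine about one possible approach, with
hypothetical function names): an operation that reorders code units,
such as a reversal, works on chunks in which each run of surrogates is
kept as one atomic unit, without knowing anything else about them.

```python
from itertools import groupby


def is_surrogate(unit):
    """True for any code unit in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


def atomic_chunks(units):
    """Split code units into chunks: each run of surrogates is one
    chunk, every other code unit is a chunk of its own."""
    chunks = []
    for in_surrogate_run, group in groupby(units, key=is_surrogate):
        group = list(group)
        if in_surrogate_run:
            chunks.append(group)
        else:
            chunks.extend([unit] for unit in group)
    return chunks


def reverse_units(units):
    """Reverse a sequence of code units while keeping the internal
    order of every surrogate run unchanged."""
    result = []
    for chunk in reversed(atomic_chunks(units)):
        result.extend(chunk)
    return result
```

Reversing 0x0041, 0xD83D, 0xDE00, 0x0042 this way yields
0x0042, 0xD83D, 0xDE00, 0x0041: the surrogate run moves as a block.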

So a UCS-2-only application can continue to blindly treat surrogates
*as if* they were unbreakable strings of symbols with strong LTR
directionality and unknown glyphs (or just the same ".notdef" glyph),
or *as if* they were unassigned (but valid) code points in the BMP,
all with the same default property values, provided that the values of
the individual code units are all preserved.

Alternatively, a UCS-2-only application may still replace all of those
surrogate code units uniformly with the same non-ignorable character,
such as U+FFFD or U+003F; or it may suppress all of them, knowing that
this destroys information; or it may throw a fatal exception for any
of them. These are some of the worst cases in which UCS-2-only
behavior is still conforming.
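The options listed above can be sketched as follows. This is only an
illustration under my own assumptions: the function name and the
policy names ("preserve", "replace", "drop", "raise") are hypothetical
and are applied uniformly to every surrogate code unit.

```python
REPLACEMENT_CHARACTER = 0xFFFD  # U+FFFD; 0x003F ("?") is the other option named above


def is_surrogate(unit):
    """True for any code unit in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


def handle_surrogates(units, policy="preserve"):
    """Apply one uniform policy to all surrogate code units."""
    if policy == "preserve":
        # Pass every code unit through unchanged, in its original order.
        return list(units)
    if policy == "replace":
        # Lossy: every surrogate becomes the same replacement value.
        return [REPLACEMENT_CHARACTER if is_surrogate(u) else u for u in units]
    if policy == "drop":
        # Lossy: every surrogate is suppressed.
        return [u for u in units if not is_surrogate(u)]
    if policy == "raise":
        # Refuse any input that contains a surrogate.
        for u in units:
            if is_surrogate(u):
                raise ValueError("surrogate code unit 0x%04X not supported" % u)
        return list(units)
    raise ValueError("unknown policy: %r" % (policy,))
```

Only "preserve" keeps the information intact; the other three are the
destructive or fatal cases described in the paragraph above.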
[end side note]

This does not mean that the existing UTFs will be the favored
encodings in the future (we can't say that even about UTF-8 or
UTF-32). It is simply impossible to predict now which of the three
standard UTFs (or their standard byte-order variants) will fall out
of use, or whether any of them will: for now there is absolutely no
sign that this will ever happen. Instead, we still see a very large
(and still accelerating) adoption rate for these UTFs, notably UTF-8.
Received on Wed Aug 24 2011 - 20:03:53 CDT
