Re: CLDR and ICU

From: Mark Davis ☕ <mark_at_macchiato.com>
Date: Fri, 27 Jul 2012 09:01:13 -0700

The key term is 'open interchange'.

"In effect, noncharacters can be thought of as application-internal
private-use code points. Unlike the private-use characters discussed in
Section 16.5, Private-Use Characters, which are assigned characters and
which are intended for use in open interchange, subject to interpretation
by private agreement, noncharacters are permanently reserved (unassigned)
and have no interpretation whatsoever outside of their possible
application-internal private uses."

For CLDR collation data - *not open interchange, but specific to use in
CLDR collation data* - these characters have a specified use as sentinel
characters, marking the boundaries of CJK 'buckets' for use in indexes.
This is described in http://unicode.org/reports/tr35/#Collation_Elements.
The noncharacters were chosen specifically so that they do not overlap with
publicly interchanged private-use characters. Of course, implementations of
LDML can tailor the collations to remove them, or replace them with other
mechanisms.
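
To make the "do not overlap" point concrete, here is a minimal sketch in
Python. The function names are mine, for illustration only, and are not taken
from CLDR or ICU; the ranges are the 66 noncharacters and the three
private-use ranges as defined by the Unicode Standard.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # The contiguous block U+FDD0..U+FDEF (32 code points).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each of the 17 planes:
    # U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF.
    return (cp & 0xFFFE) == 0xFFFE


def is_private_use(cp: int) -> bool:
    """Return True if cp is an assigned private-use character."""
    return (
        0xE000 <= cp <= 0xF8FF          # BMP Private Use Area
        or 0xF0000 <= cp <= 0xFFFFD     # Supplementary Private Use Area-A
        or 0x100000 <= cp <= 0x10FFFD   # Supplementary Private Use Area-B
    )


def overlapping_code_points() -> list[int]:
    """List code points that are both noncharacters and private use."""
    return [
        cp for cp in range(0x110000)
        if is_noncharacter(cp) and is_private_use(cp)
    ]
```

Calling overlapping_code_points() returns an empty list, so a sentinel taken
from the noncharacter set cannot collide with a private-use character that
someone is interchanging by private agreement.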

> NULL and the two noncharacters U+FFFE and U+FFFF are banned from XML

It is not just NULL, but most control characters.

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]

Unfortunately, some restrictions that were perfectly reasonable for use in
document interchange become annoying flaws in a general structured data
interchange format. The inability to interchange all Unicode scalar values
is one.
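
As a rough illustration, the Char production quoted above can be transcribed
directly into Python and used to list the Unicode scalar values it excludes.
This is only a sketch of the production as quoted; the function names are
hypothetical, and a real XML parser applies further rules beyond this
character class.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if cp matches the Char production quoted above."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_scalar_value(cp: int) -> bool:
    """Return True if cp is a Unicode scalar value (not a surrogate)."""
    return 0 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def excluded_scalar_values() -> list[int]:
    """List the scalar values that the Char production does not allow."""
    return [
        cp for cp in range(0x110000)
        if is_scalar_value(cp) and not is_xml_char(cp)
    ]
```

Under this transcription, excluded_scalar_values() contains the C0 controls
other than tab, line feed and carriage return, plus U+FFFE and U+FFFF. The
other noncharacters, such as U+FDD0 or U+1FFFE, pass is_xml_char, which
matches Richard's remark quoted below.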

Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*

On Fri, Jul 27, 2012 at 12:17 AM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> On Thu, 26 Jul 2012 22:52:54 -0700
> "Steven R. Loomis" <srl_at_icu-project.org> wrote:
>
> > On Thu, Jul 26, 2012 at 6:19 PM, Richard Wordingham <
> > richard.wordingham_at_ntlworld.com> wrote:
> >
> > > On Thu, 26 Jul 2012 17:01:53 -0700
> > > "Steven R. Loomis" <srl_at_icu-project.org> wrote:
>
> > I suspect it was simply an oversight and not indicative of any
> > systemic issue. UTS#35 gives the example of <cp hex="0"> for
> > representing NULL as an example of a character not to be used in XML.
> > Note that there's nothing wrong with processing non-characters in
> > memory- I have to deal with non-characters all the time. Thanks for
> > filing the bug.
>
> NULL and the two noncharacters U+FFFE and U+FFFF are banned from XML;
> the other noncharacters are allowed. It's the Unicode Standard that
> bans them from *open interchange*.
>
> Richard.
>
>
Received on Fri Jul 27 2012 - 11:07:09 CDT
