Re: NamesList.txt as data source

From: Mark Davis ☕️ <mark_at_macchiato.com>
Date: Mon, 28 Mar 2016 13:59:42 +0200

> I'm very curious about where CLDR data depends on these subheaders or
> other annotations in NamesList.txt

You're right. CLDR data doesn't.

I think there is a misunderstanding because of the online utilities which
have been, for convenience, hosted on the same server as the CLDR survey
tool. So one sees "cldr" in the following URL, but that doesn't imply any
particular association with CLDR.

Example:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{sc=grek}

This just filters characters to those with script = Greek.

The listing shows both the block name and the NamesList subhead label for
each character. One can also filter on the subhead labels, e.g.

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{subhead=Archaic%20letters}

But subheads are *not* Unicode Character Properties. And repeating the
caveats expressed earlier, the NamesList data is designed for chart
production, not as a reliable source of machine-readable data. While it may
be useful to look at in some cases, the subheads are not designed to be a
consistent source of data. For instance, one couldn't use them effectively
to find non-modern-use characters, because different terms are used for
that, and the groupings mix in other characters. For example:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{subhead=/(?i)historic|archaic|obsolete/}
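
The query URLs in this message carry a UnicodeSet expression in the "a"
parameter. If you want to build such URLs from a script, a minimal sketch
could look like the following. The helper name unicodeset_url is made up
for illustration, and the choice to leave "=" unescaped is only an
assumption about what reads best, not a requirement of the utility.

```python
from urllib.parse import quote

# Base address copied from the examples in this message.
BASE = "http://unicode.org/cldr/utility/list-unicodeset.jsp"

def unicodeset_url(expression: str) -> str:
    """Return a list-unicodeset URL with the expression percent-encoded.

    Hypothetical helper: only "=" is left unescaped so the property
    name and value stay readable; everything else outside the
    unreserved set (backslash, braces, spaces, slashes) is encoded.
    """
    return f"{BASE}?a={quote(expression, safe='=')}"

# The three expressions used in this message.
print(unicodeset_url(r"\p{sc=grek}"))
print(unicodeset_url(r"\p{subhead=Archaic letters}"))
print(unicodeset_url(r"\p{subhead=/(?i)historic|archaic|obsolete/}"))
```

This only builds the address; what the utility returns for a given
subhead expression still carries all the caveats above.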

Other examples: the NamesList data doesn't include all the case mappings,
nor all the normative name aliases. It also lists the decomposition
mapping, not the canonical and/or compatibility decompositions (which are
*not* the same). And so on.
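
As a rough illustration of that last point, here is a sketch using
Python's standard unicodedata module (which reflects whatever UCD version
the interpreter ships with, so treat it as an approximation of the current
UCD). The decomposition mapping field is a single step, while the full
canonical and compatibility decompositions apply it recursively. The helper
names hexes and compare are invented for this example.

```python
import unicodedata

def hexes(s: str) -> str:
    """Format a string as space-separated code points, UCD style."""
    return " ".join(f"{ord(c):04X}" for c in s)

def compare(ch: str) -> dict:
    """Collect the raw mapping field and both full decompositions."""
    return {
        # Raw Decomposition_Mapping field: one step, with a <tag> if compat.
        "mapping": unicodedata.decomposition(ch),
        # Full canonical decomposition (NFD of the single character).
        "canonical": hexes(unicodedata.normalize("NFD", ch)),
        # Full compatibility decomposition (NFKD of the single character).
        "compatibility": hexes(unicodedata.normalize("NFKD", ch)),
    }

# U+01D5 has a two-level canonical decomposition; U+FB01 is compat-only.
for ch in ("\u01D5", "\uFB01"):
    print(f"U+{ord(ch):04X}", compare(ch))
```

For U+01D5 the mapping field is "00DC 0304" while the full canonical
decomposition is "0055 0308 0304". For U+FB01 the mapping is
"<compat> 0066 0069", the canonical decomposition leaves the character
unchanged, and only the compatibility decomposition yields "0066 0069".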

One needs to use the UCD instead of trying to dig this information out of
the NamesList.txt file — because such information will be wrong and
incomplete.

Mark

On Sun, Mar 27, 2016 at 11:04 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

>
> On 27 March 2016 at 20:47, "Doug Ewell" <doug_at_ewellic.org> wrote:
> >
> > Asmus Freytag wrote:
> >
> >> Nobody disputes that subheaders are informative. However, subheaders
> >> do not define a character property.
> >
> >
> > Janusz was making a point that the CLDR data sometimes treats them as
> > such, or at least as a kind of supplementary property.
>
> I'm very curious about where CLDR data depends on these subheaders or
> other annotations in NamesList.txt...
>
> Subheaders can at most be used as named anchors splitting a normative
> block into several subparts (sometimes with several parts under the same
> heading), but these subblocks are not normative, notably because they
> are not correlated with other subblocks in additional blocks. And there's
> not even any guarantee that characters in these subblocks share some basic
> property, not even a script type or a general category. These are just
> anchors for speaking about subblocks, and related to the discussions that
> occurred before these characters were encoded.
> If new characters are added later, these existing subblocks won't be
> sufficient. But the new characters will be added at any convenient range
> available or in a new block. If needed, even these subblocks may be
> subdivided and thus renamed. None of them are stable.
>
> For CLDR algorithms and data, these headings are not necessary and not
> used. Instead, character ranges or sets are used, specifying the characters
> directly, or one or more of their properties in combination, but not this
> one.
>
> I just hope that there's no algorithm depending on them and treating them
> as properties (for example in regular expressions with a custom property).
> If an algorithm must be created, it should define its own named subsets to
> define its own properties (many UAX algorithms do that constantly, e.g.
> for text breakers or Bidi or text transforms).
>
Received on Mon Mar 28 2016 - 07:01:17 CDT
