Re: NamesList.txt as data source

From: Asmus Freytag (t) <asmus-inc_at_ix.netcom.com>
Date: Sat, 26 Mar 2016 21:38:42 -0700
On 3/26/2016 2:10 AM, Janusz S. "Bień" wrote:
> On Thu, Mar 10 2016 at 22:40 CET, kenwhistler@att.net writes:
>
> [...]
>
>> The *reason* that NamesList.txt exists at all is to drive the tool, unibook,
>> that formats the full Unicode code charts for posting. It is only
>> posted in the Unicode Character Database at all as a matter of
>> convenience, to give people access to a text only version of the
>> names list that appears in the fully formatted pdf versions of the
>> code charts that contain all the representative glyphs.
>>
>> NamesList.txt should *not* be data mined.
>
> I've just noticed that NamesList.txt is in a sense data mined by the
> Unicode consortium itself. I mean the "Unicode Utilities: Character
> Properties", which e.g. for LATIN SMALL LETTER P WITH FLOURISH
> (http://unicode.org/cldr/utility/character.jsp?a=A753) display in
> particular
>
> subhead: Medievalist addition
>
> Am I right that this information is available only in NamesList.txt?

You are correct: subheaders are specific to the code charts, just as chapter headers, section headers and the like are specific to the text of the core specification.

Their purpose is to group a range of related characters - where possible.

> In my opinion this is important information and should be officially
> available for character data mining engines.

Nobody disputes that subheaders are informative. However, subheaders do not define a character property. There are several good reasons:

  1. They do not "classify" characters in a uniform way: For some ranges they give the purpose for which the character was encoded (as in your example), for others, they give the type of character (vowel, consonant), and in some cases they are free of information ("Miscellaneous addition").
  2. Even where they give the purpose for which the character was encoded, they do not necessarily attest that the characters in that range are never used for other purposes.
  3. The information is purely editorial, and as such, changed by the editors as needed, not assigned as the result of a vote in the Unicode Technical Committee.
  4. They appear to be more "formal" than they are, simply because they are presented with semantic markup in the input file to the code chart layout tool. That file is rather structured only because it describes a tabular presentation of data. See points (1) through (3) on why this superficial appearance of formality is misleading.

There's an additional reason why we discourage the kind of data mining that treats these as if they were character properties: just because they are easy to lift out of the file doesn't mean that they represent information that is more useful than, for example, information contained in the discussion of the script or character block in the text of the core specification.
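
To make the "easy to lift" point concrete, here is a minimal sketch of such scraping, in Python. It assumes the tab-delimited layout in which a block header line starts with "@@", a subheader line starts with "@" followed by two tabs, and a character entry starts with a hexadecimal code point followed by a tab and the name. The sample text is an illustrative excerpt, not a verbatim copy of the file, and the function name subheads_by_code_point is made up for this example.

```python
import re

# Illustrative excerpt in the assumed NamesList.txt layout (not verbatim).
SAMPLE = (
    "; comment lines start with a semicolon\n"
    "@@\tA720\tLatin Extended-D\tA7FF\n"
    "@\t\tMedievalist addition\n"
    "A752\tLATIN CAPITAL LETTER P WITH FLOURISH\n"
    "A753\tLATIN SMALL LETTER P WITH FLOURISH\n"
    "\t* annotation lines start with a tab\n"
)

# A character entry: 4 to 6 hex digits, a tab, then the character name.
CHAR_LINE = re.compile(r"^([0-9A-F]{4,6})\t(.+)$")


def subheads_by_code_point(text):
    """Map each code point to the subheader it is listed under (or None)."""
    current = None
    result = {}
    for line in text.splitlines():
        if line.startswith("@@"):
            # New block or title header: the previous subheader ends here.
            current = None
        elif line.startswith("@\t"):
            # Subheader line: the label is the last tab-separated field.
            current = line.split("\t")[-1].strip()
        else:
            match = CHAR_LINE.match(line)
            if match:
                result[int(match.group(1), 16)] = current
    return result


subheads = subheads_by_code_point(SAMPLE)
print(subheads[0xA753])
```

A couple of dozen lines suffice, which is exactly the temptation. What comes out is a free-text editorial label, subject to points (1) through (3) above, and not a character property.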

If you seriously wanted to present "all that is known about a character" you would need to excerpt all mentions of it in the core specification, as well as (potentially) any additional details presented in the version of the proposal document that was approved by the UTC as part of encoding the character. (That is in addition to every explicit or implicit mention in the text of a UAX that is not already covered by a formal character property.)

The reason nobody provides such a comprehensive summary, although perhaps they should, is that the information in the core specification, while equally useful(!), is simply not formatted in a way that makes data mining easy.

If you take a shortcut, and only present the information that's easy to scrape, you are not necessarily doing your users any service.

It's not quite garbage-in/garbage-out, because the subheaders were selected with some care and, in some cases, will provide the users with a necessary or useful hint. But this comes at the cost of misleading the same users into assuming that these hints are supplied consistently and uniformly, which they are not. And by ignoring the discussion in the core specification, a lot of more useful and often more important information is left out.

A./




> Best regards
>
> Janusz


Received on Sat Mar 26 2016 - 23:39:56 CDT
