Re: NamesList.txt as data source (was: Re: Gaps in Mathematical Alphanumeric Symbols) from Asmus Freytag (t) on 2016-03-10 (Unicode Mail List Archive)

From: Asmus Freytag (t) <asmus-inc_at_ix.netcom.com>
Date: Thu, 10 Mar 2016 17:05:43 -0800

On 3/10/2016 2:14 PM, Doug Ewell wrote:

Ken Whistler wrote:

NamesList.txt should *not* be data mined.

And yet it was the only Unicode data file utilized by MSKLC.

There are many possible reasons for this approach, which we will
probably never know.

Extracting information from NamesList.txt that was added to that file based on information from the UCD is plain folly - not least because it uses a secondary source instead of a primary source. What may not have come across from Ken's description is that the process for incorporating this data is under editorial control - and some values or entries may be suppressed for readability. There is explicitly no guarantee of completeness.
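To make the "primary source" point concrete, here is a minimal sketch of reading a property straight from a UnicodeData.txt record, which is semicolon-delimited. The function name `parse_ucd_line` and the choice of fields to keep are assumptions made for illustration only.

```python
# One record from UnicodeData.txt; fields are separated by semicolons.
LINE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

def parse_ucd_line(line):
    """Hypothetical helper: pick a few fields out of one UnicodeData.txt record."""
    fields = line.split(";")
    return {
        "code": int(fields[0], 16),               # field 0: code point (hex)
        "name": fields[1],                        # field 1: character name
        "general_category": fields[2],            # field 2: General_Category
        "simple_lowercase": fields[13] or None,   # field 13: simple lowercase mapping
    }

record = parse_ucd_line(LINE)
print(record)
```

Nothing here depends on editorial presentation choices: every record has the same fields in the same positions.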

There is some information that *only* exists in the NamesList.txt file. This includes informal aliases for character names, cross references, etc. The problem with extracting this information blindly (that is, not mediated by a human) is, again, that the level of consistency of presentation is that appropriate for a human reader, not for an extraction algorithm.

For example, to reduce clutter, cross references are not symmetric or transitive, even though the relationship that gave rise to the cross reference in the first place (e.g. similarity) would normally be one that is symmetric and transitive. The human reader can be trusted to determine that, for example, "<" is the "main" entry and that from there all the other, same or similar characters are referenced, but by not listing the reverse direction everywhere, the level of clutter in the rest of the NamesList is reduced, making additional cross references stand out more.
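A small sketch of what a blind extractor runs into, assuming cross-reference lines take the form of a tab, `x`, a space, and then either a bare code point or `(name - XXXX)`. The sample text is made up for illustration and is NOT copied from the real file; `parse_xrefs` and `missing_reverse` are hypothetical helper names.

```python
import re

# Illustrative sample only: lines shaped like NamesList.txt entries,
# not taken from the actual file.
SAMPLE = (
    "003C\tLESS-THAN SIGN\n"
    "\tx (single left-pointing angle quotation mark - 2039)\n"
    "\tx (left-pointing angle bracket - 2329)\n"
    "2039\tSINGLE LEFT-POINTING ANGLE QUOTATION MARK\n"
    "\tx (less-than sign - 003C)\n"
    "2329\tLEFT-POINTING ANGLE BRACKET\n"
)

ENTRY = re.compile(r"^([0-9A-F]{4,6})\t")                  # "XXXX<TAB>NAME"
XREF = re.compile(r"^\tx .*?([0-9A-F]{4,6})\)?\s*$")       # "<TAB>x ... XXXX[)]"

def parse_xrefs(text):
    """Return a set of (source, target) code point pairs."""
    pairs, current = set(), None
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            current = int(m.group(1), 16)
            continue
        m = XREF.match(line)
        if m and current is not None:
            pairs.add((current, int(m.group(1), 16)))
    return pairs

def missing_reverse(pairs):
    """Pairs whose reverse direction is not listed."""
    return sorted((a, b) for a, b in pairs if (b, a) not in pairs)

for a, b in missing_reverse(parse_xrefs(SAMPLE)):
    print(f"U+{a:04X} -> U+{b:04X} has no reverse cross reference")
```

On this sample the only one-way pair is U+003C to U+2329. An algorithm that treats a missing reverse entry as "not related" would draw the wrong conclusion, while a human reader simply follows the reference from the main entry.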

Those are just the intentional inconsistencies.

There is a historical development in the annotations - over time, more characters get annotated. However, annotations are not always backported, so the level of annotation can be inconsistent simply as a result of incremental development.

Now, for the x-refs on gaps, a human reader could extract and verify the set, but relying blindly on an algorithm to extract the data is fraught with peril. (Other gaps may have a slightly different origin and status, yet also carry an annotation.)
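For the gaps themselves, a candidate set can be derived from a UCD-based property rather than from annotations. The sketch below uses Python's standard `unicodedata` module and assumes the block range 1D400..1D7FF for Mathematical Alphanumeric Symbols; the result depends on the Unicode version bundled with the interpreter, and `unassigned_in_range` is a hypothetical helper name.

```python
import unicodedata

def unassigned_in_range(first, last):
    """Code points in [first, last] whose General_Category is Cn (unassigned)."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) == "Cn"]

# Assumed block range for Mathematical Alphanumeric Symbols.
gaps = unassigned_in_range(0x1D400, 0x1D7FF)
print(unicodedata.unidata_version, [f"{cp:04X}" for cp in gaps])
```

This only finds the holes. It says nothing about why a given code point is unassigned or which character elsewhere fills the role, and that is the part that still needs a human to verify.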

Using the mathematical data files for this is a step up, because the data there is focused on a single use case. The downside is that the information is in a comment field.

A./
Received on Thu Mar 10 2016 - 19:06:33 CST
