NamesList.txt as data source (was: Re: Gaps in Mathematical Alphanumeric Symbols)

From: Ken Whistler <kenwhistler_at_att.net>
Date: Thu, 10 Mar 2016 13:40:47 -0800

On 3/10/2016 1:00 PM, Andrew West wrote:
> It (http://www.unicode.org/Public/UNIDATA/NamesList.txt) is
> machine-readable, although the file specifically warns that "this file
> should not be parsed for machine-readable information".
>

NamesList.txt is just a structured text file, so of course it is
"machine-readable". The problem is that, because it is machine-readable,
people tend to jump to the conclusion that all the information they need
can be reliably parsed out of that file.

It can't be.

The reason is that NamesList.txt is itself the result of a complicated merge
of code point, name, and decomposition mapping information from
UnicodeData.txt, of listings of standardized variation sequences from
StandardizedVariants.txt, and then a very long list of annotational
material, including names list subhead material, etc., maintained in
other sources.

If people actually want to get reliably parsed data on code points, names,
and decomposition mappings, they should get that directly from
UnicodeData.txt. Likewise for information about standardized variation
sequences, from StandardizedVariants.txt.
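
As a minimal sketch of what reading those two source files directly could
look like, the Python below parses one line of each. It assumes the
published layouts: 15 semicolon-separated fields per UnicodeData.txt line,
with the decomposition in field 5, and "sequence; description; ..." followed
by a "#" comment in StandardizedVariants.txt. The names CharRecord,
parse_unicode_data_line, and parse_variant_line are hypothetical helpers
made up for illustration, and the sketch does not handle the
"<..., First>" / "<..., Last>" range entries in UnicodeData.txt.

```python
from typing import NamedTuple, Optional, Tuple


class CharRecord(NamedTuple):
    """Hypothetical record type: the three items named in the text above."""
    code_point: int
    name: str
    decomp_type: Optional[str]       # e.g. "noBreak"; None if canonical or absent
    decomp_mapping: Tuple[int, ...]  # empty tuple if there is no decomposition


def parse_unicode_data_line(line: str) -> CharRecord:
    """Parse one UnicodeData.txt line (assumes 15 semicolon-separated fields)."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    parts = fields[5].split()
    decomp_type = None
    # A compatibility decomposition starts with a tag in angle brackets.
    if parts and parts[0].startswith("<"):
        decomp_type = parts[0].strip("<>")
        parts = parts[1:]
    mapping = tuple(int(p, 16) for p in parts)
    return CharRecord(int(fields[0], 16), fields[1], decomp_type, mapping)


def parse_variant_line(line: str) -> Optional[Tuple[Tuple[int, ...], str]]:
    """Parse one StandardizedVariants.txt line into (sequence, description).

    Returns None for blank lines and comment-only lines.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    sequence = tuple(int(cp, 16) for cp in fields[0].split())
    return sequence, fields[1]
```

Calling parse_unicode_data_line on each line of a local copy of
UnicodeData.txt gives code point, name, and decomposition mapping without
touching any of the annotational material that NamesList.txt mixes in.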

The *reason* that NamesList.txt exists at all is to drive the tool, unibook,
that formats the full Unicode code charts for posting. It is posted in
the Unicode Character Database only as a matter of convenience, to give
people access to a text-only version of the names list that appears in
the fully formatted PDF versions of the code charts, which contain all
the representative glyphs.

NamesList.txt should *not* be data mined. Well, nobody can stop
people from attempting to do so, of course, but they tend to end
up confused and disappointed, because their assumptions going in
don't match the editorial realities that affect the development of
the annotational content added to the names list and the actual
use for which NamesList.txt was created in the first place.

--Ken
Received on Thu Mar 10 2016 - 15:41:50 CST
