Re: annotations from Asmus Freytag (t) on 2016-03-13 (Unicode Mail List Archive)

From: Asmus Freytag (t) <asmus-inc_at_ix.netcom.com>
Date: Sun, 13 Mar 2016 20:32:03 -0700

On 3/12/2016 10:55 PM, Janusz S. "Bień" wrote:

> In fact, the possibility of reuse in this context is probably among the
> unstated rationales for making the information and syntax available in
> the first place.

I understand there is no intention to make an official XML version of
the file as it would require changes in Unibook?

In principle, the tooling that the editorial committee maintains could be modified to
write out some XML version of the information. It's only software. By the same token,
someone could write a new parser for Unibook that can read the XML.
Both would consume a significant amount of resources, for absolutely no gain when
it comes to the core purpose: the production of the code charts.

In fact, the work would not end there, because the code chart process requires
the use of some namelist-aware tools for draft preparation. All of these would have
to be adapted to the new format as well.

Finally, Unibook relies on auxiliary files that provide font selection and configuration
data. Logically, the smart thing would be to convert all of them to XML, or JSON, or
whatever the structured data description format du jour is.

Looked at from a practical perspective, by those involved in doing the work of
creating the code charts and issuing new versions of the Standard, it's a non-starter.

There are explanations about character use that are only maintained in
the PDF of the core specification, where this information is packaged
in a way that can be understood by a human reader, but is not amenable
to extraction by machine.

While the annotations, comments, cross references etc. in NamesList.txt
appear, formally, to be machine extractable, the way they are created
and managed makes them just as much "human-accessible" only as the core
specification.

I'm afraid it's not clear to me. Let's take an example. Some time ago I
inquired about a controversial alias for U+018D:

        http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0014.html

Can I really find anything about "reversed Polish-hook o" in the core
specification which is not a literal copy of the information from
NamesList.txt?

This comment referred to the facts that (a) the names list is not exhaustive and
that (b) it is perfectly OK to have information that's not intended for
machine-parsing.

Information intended for machine-parsing has a certain amount of structure and
consistency, so that when a data table is built from it, the consuming program
can rely on the fact that it will cover some aspect of character identity or
behavior in a systematic way.

Well, not all possible information is systematized that way. Some information
requires being interpreted by a human reader; the fact that the information is
not buried in running text, but shows up in "fields" in a list, doesn't make it
systematized in the same way as case mapping, decomposition or other property
data.
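For contrast, here is a minimal sketch of what "systematized" means in practice, using Python's standard unicodedata module, which exposes a subset of the formal character properties. The helper name formal_properties and the choice of U+00E9 as the sample character are assumptions made for illustration only.

```python
import unicodedata


def formal_properties(ch):
    """Return a few formally defined properties for a single character.

    Each value comes from a systematic data table, so a program can rely
    on every assigned character having an answer in the same shape.
    """
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "decomposition": unicodedata.decomposition(ch),
        "uppercase": ch.upper(),  # simple case mapping via str.upper()
    }


# U+00E9 LATIN SMALL LETTER E WITH ACUTE, chosen only as an example.
props = formal_properties("\u00e9")
print(props)
```

The point of the sketch is that the same four questions can be asked of any assigned character and the answers can be consumed without a human interpreting them.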

You might as well have a tool that extracts snippets from the core specification.
All fine, if your goal is, for example, to present all bits of text mentioning a certain
code point (search engines will do some of that extraction for you).
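A rough sketch of such an extraction tool, applied to names-list style text rather than the core specification, might look like the following. It assumes the usual layout of a code point, a tab and a name on the entry line, followed by tab-indented annotation lines that start with a marker character. The SAMPLE text is an abbreviated illustration, not a copy of the real file, and the names KINDS and extract_annotations are hypothetical.

```python
import re

# Abbreviated, illustrative sample in names-list style (not the real file).
SAMPLE = (
    "018D\tLATIN SMALL LETTER TURNED DELTA\n"
    "\t= reversed Polish-hook o\n"
    "018E\tLATIN CAPITAL LETTER REVERSED E\n"
)

# Assumed mapping from marker character to annotation kind.
KINDS = {"=": "alias", "*": "comment", "x": "cross_reference"}

# An entry line: hexadecimal code point, a tab, then the character name.
ENTRY = re.compile(r"^([0-9A-F]{4,6})\t(.+)$")


def extract_annotations(text):
    """Collect the annotation lines that follow each entry line."""
    entries = {}
    current = None
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            current = {"name": match.group(2), "annotations": []}
            entries[int(match.group(1), 16)] = current
        elif current is not None and line.startswith("\t") and len(line) > 2:
            marker, body = line[1], line[2:].strip()
            current["annotations"].append((KINDS.get(marker, "other"), body))
    return entries


entries = extract_annotations(SAMPLE)
print(entries[0x018D]["annotations"])
```

What comes out for U+018D is a labelled string of free text. The label says which kind of line it was, but nothing about what the text means.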

However, even after extraction, the data is still just as unstructured as before,
and, while useful to a human reader, doesn't constitute a formal character
property. That's the whole reason why we go to the trouble of defining so
clearly what is and isn't a character property (see UAX #44).

A./

Received on Sun Mar 13 2016 - 22:33:13 CDT
