Re: annotations (was: NamesList.txt as data source) from Philippe Verdy on 2016-03-14 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 14 Mar 2016 08:23:18 +0100

Is the term "exponentially" really appropriate? The NamesList file is not
so large, and the growth would remain linear.

Anyway, this file (in its current CSV format or in an XML format) does not
need to be part of the core UCD files; it can be a separate download for
people who need it.

One benefit I would see is that converting it to XML with an automated
tool could ensure that it is properly formatted. But I believe that Unibook
is already parsing it to produce consistent code charts, so its format is
already checked, and this advantage is not really significant.

But the main benefit would be that the file could be edited and updated
using standard tools. XML is not the only choice available: JSON today is
simpler to parse and easier for humans to read (and even edit), and it can
contain indentation whitespace (outside quoted strings) that is not
considered part of the data (unlike XML, where it "pollutes" the DOM with
extra text nodes).
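The whitespace point can be checked with a small sketch using only the
Python standard library. The record below is hypothetical: the key and
element names ("cp", "name", "char") are illustrative assumptions, not an
existing UCD schema.

```python
import json
from xml.dom import minidom

# Hypothetical record; the key and element names are illustrative only.
json_text = """
{
    "cp": "0041",
    "name": "LATIN CAPITAL LETTER A"
}
"""
xml_text = """<char>
    <cp>0041</cp>
    <name>LATIN CAPITAL LETTER A</name>
</char>"""

# Indentation outside the quoted strings leaves no trace in the parsed data.
record = json.loads(json_text)
print(record)

# The DOM keeps the same indentation as whitespace-only text nodes
# sitting between the element nodes.
root = minidom.parseString(xml_text).documentElement
text_nodes = [n for n in root.childNodes if n.nodeType == n.TEXT_NODE]
element_nodes = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]
print(len(element_nodes), "elements,", len(text_nodes), "whitespace text nodes")
```

An XML consumer therefore has to filter or normalize these nodes itself,
while the JSON consumer gets only the data.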

In fact I believe that the old CSV formats used in the original UCD could be
deprecated in favor of JSON (the old formats could be automatically
generated for applications that want them). This could unify all formats
behind a single parser in all tools. Files in the older CSV or tabulated
formats would be in a separate derived collection. Users would then choose
which format they prefer (legacy, now derived; JSON; or XML if people
really want it).
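As a rough sketch of how a legacy file could be derived from a JSON master
copy: the field names ("cp", "name", "gc") and the three-column layout are
assumptions made for illustration, not the layout of any actual UCD file.

```python
import json

# Hypothetical JSON master data; the field names are assumptions.
master = json.loads("""
[
    {"cp": "0041", "name": "LATIN CAPITAL LETTER A", "gc": "Lu"},
    {"cp": "0061", "name": "LATIN SMALL LETTER A", "gc": "Ll"}
]
""")

# Fixed column order of the derived (legacy-style) file.
LEGACY_COLUMNS = ["cp", "name", "gc"]

def to_legacy_line(record):
    """Render one JSON record as a semicolon-separated legacy-style line."""
    # Fields absent from the record become empty columns.
    return ";".join(record.get(col, "") for col in LEGACY_COLUMNS)

legacy_lines = [to_legacy_line(r) for r in master]
print("\n".join(legacy_lines))
```

Tools that only understand the tabulated format would keep reading the
derived lines, while only the JSON copy would be edited by hand.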

The advantage of XML, however, is its stability under later updates that
may need to insert additional data or annotations (with JSON or
CSV/tabulated formats, the number of columns is fixed and every column must
be filled, at least with an empty value, even if it is not significant).
Note that the legacy formats also have comments after hash signs, but many
comments found at the end of data lines also have some parsable meaning, so
they are structured (and may be followed by an extra hash sign introducing
a real comment).
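To illustrate the two kinds of trailing comment, here is a minimal sketch
that splits a line into its data fields, a structured comment, and a free
comment. The sample line and the function name are made up for the example;
real files differ in their details.

```python
def split_legacy_line(line):
    """Split a line into (fields, structured_comment, free_comment)."""
    # Everything before the first hash sign is semicolon-separated data.
    data, _, rest = line.partition("#")
    # A second hash sign, if present, starts the free-form comment.
    structured, _, free = rest.partition("#")
    fields = [f.strip() for f in data.split(";")]
    return fields, structured.strip(), free.strip()

# Hypothetical sample line, not copied from any actual UCD file.
sample = "0041..005A ; Lu # [26] A..Z # made-up free remark"
fields, structured, free = split_legacy_line(sample)
print(fields)
print(structured)
print(free)
```

A parser that simply drops everything after the first hash sign would lose
the structured part, which is the difficulty with treating these comments
as unparsable.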

The advantage of the existing XSV/tabulated formats is that they are
extremely easy to import into a spreadsheet for easier use by a human (I
won't request the UTC to provide these files in XLS/XLSX or ODS format...).
But JSON and XML could be imported as well, provided that each data file
remains structured as a 2D grid without substructures within cells
(otherwise you need to provide an explicit schema).

But note that some columns are frequently structured: the column containing
the code point key frequently specifies a code range using an additional
separator, and the columns whose value is an ordered list of code points
use a space separator and possibly a leading subtag (such as decomposition
data). In XML you would translate them into separate subelements or into
additional attributes, and in JSON you would need to represent these
structured cells using subarrays. So the data is *already* not strictly 2D
(converting it to a pure 2D format, for relational use, would require
adding additional key or referencing "ID" columns, and the converted files
would be much less easy for humans to read or edit in *any* format:
CSV/tabular, JSON or XML).
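A minimal sketch of what "structured cells" means in practice: it assumes a
".." range separator and an angle-bracketed leading tag, in the style of the
existing tabulated files, and the function and key names are hypothetical.

```python
import json

def parse_code_points(field):
    """Turn '0041' or '0041..005A' into a [first, last] pair of integers."""
    first, _, last = field.partition("..")
    # A single code point is treated as a range of length one.
    return [int(first, 16), int(last or first, 16)]

def parse_decomposition(field):
    """Split an optional leading <tag> from a space-separated code point list."""
    parts = field.split()
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts.pop(0).strip("<>")
    return {"tag": tag, "cps": [int(p, 16) for p in parts]}

# Sample cell values; the surrounding record layout is an assumption.
row = {
    "range": parse_code_points("0041..005A"),
    "decomposition": parse_decomposition("<compat> 0020 0308"),
}
print(json.dumps(row))
```

The nested arrays and objects in the output are exactly what a flat 2D grid
cannot hold without extra key columns.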

Other candidate formats include Turtle (generally used for RDF data such
as OWL, replacing the XML envelope format with a tabulated "2.5D" format
that is much easier to read and edit than XML, much more compact than
XML-based formats, and easier to parse)...

2016-03-14 3:14 GMT+01:00 Marcel Schneider <charupdate_at_orange.fr>:

> On Sun, 13 Mar 2016 13:03:20 -0600, Doug Ewell wrote:
>
> > My point is that of J.S. Choi and Janusz Bień: the problem with
> > declaring NamesList off-limits is that it does contain information that
> > is either:
> >
> > • not available in any other UCD file, or
> > • available, but only in comments (like the MAS mappings), which aren't
> > supposed to be parsed either.
> >
> > Ken wrote:
> >
> > > [ .. ] NamesList.txt is itself the result of a complicated merge
> > > of code point, name, and decomposition mapping information from
> > > UnicodeData.txt, of listings of standardized variation sequences from
> > > StandardizedVariants.txt, and then a very long list of annotational
> > > material, including names list subhead material, etc., maintained in
> > > other sources.
> >
> > But sometimes an implementer really does need a piece of information
> > that exists only in those "other sources." When that happens, sometimes
> > the only choices are to resort to NamesList or to create one's own data
> > file, as Ken did by parsing the comment lines from the math file. Both
> > of these are equally distasteful when trying to be conformant.
>
>
> If so, then extending the XML UCD with all the information that is
> actually missing in it while available in the Code Charts and
> NamesList.txt, ends up being a good idea. But it still remains that such a
> step would exponentially increase the amount of data, because items that
> were not meant to be systematically provided, must be.
>
> Further I see that once this is completed, other requirements could need
> to tackle the same job on the core specs.
>
> The point would be to know whether in Unicode implementation and i18n,
> those needs are frequent. E.g. the last Apostrophe thread showed that full
> automatization is sometimes impossible anyway.
>
> Marcel
>
>
Received on Mon Mar 14 2016 - 02:25:01 CDT
