Re: Character Index

From: Asmus Freytag (c)
Date: Tue, 29 Mar 2016 11:56:34 -0700
On 3/29/2016 11:24 AM, Ken Whistler wrote:

On 3/29/2016 12:16 AM, Janusz S. Bień wrote:
What about a simpler and more technical approach, like a character index
with links to the relevant proposals? Doesn't such a thing already exist
for internal use?

No, and it is exceedingly *non*-trivial to produce such an index.

This is an area where anybody can play. Not to ignore the issues and complexities that Ken rightly points out, but the archive is accessible and anyone could try their hand at data mining it. They would quickly discover, though, that they need actual manpower to curate the results before those become useful.

If done well, it would be a valuable resource to people who do need to know more about a character because they need to address details of character behavior and use in specific implementations.

Doing it well would be non-trivial (or exceedingly so; here I fully agree with Ken), and as a result it is not something that we could expect the Consortium to take on, not if we want it to maintain its focus on encoding characters.

As I pointed out, the results of scouring the document register for specific information on character background and behavior, esp. if well focused, could be re-published by the Consortium, for example as a Unicode Technical Note (but would not be maintained by it).

Such an effort would be most useful not as a generalized index, but in describing certain collections of characters (for example Latin Medievalist characters), because then it would be possible to collate the information with other sources of character usage conventions into a one-stop solution for anyone in the field.


There are now thousands of documents, extending over 27 years
of history (and actually more when you go back to earlier work
on 10646). Much of the early half of that document trail is
paper only, in material that most of the participants have long ago

The status of what a "character" even is can change during the
development of proposals, as they morph over time. This is
also exceedingly non-trivial in some cases, where argumentation
about cases of unification and/or disunification of different
source attestations might proceed over an extended period.
That makes it pretty difficult to just willy-nilly produce a
magical character index that points to exactly the right place.

In recent years we have had some individuals who have tracked the
specific documents associated with repertoire new to particular
releases much more thoroughly than in prior years -- but truth
to tell, the *majority* of people involved in maintenance of
the Unicode Standard and ISO/IEC 10646 care little about the
details of that history. Instead, they are basically focused on
whatever happens to be the next thing to argue about. It is
all about shinies -- not about piecing together dusty old artifacts. ;-)


Received on Tue Mar 29 2016 - 13:57:28 CDT
