L2/04-146

From: Mark Davis
Date: 2004-05-02 16:00:23 -0700
Subject: A Script Model

Please add this to the document registry and agenda. (If you respond to this
message, please take "UTC Agenda Item" off the title and drop all but
unicore@unicode.org from the addressees.)

================

I'm putting on a Whistler hat here for a moment: we need a "Script Model".

Part of the problem I see in the debate over Phoenician is that we don't
*present* a clear enough picture of when two bodies of characters should be
encoded as separate scripts. I think we've been able to get along
pretty well up until now without one simply because we were dealing with the
relatively simple cases. We all have no problem separating, say, Katakana and
Cyrillic. But as people start to propose ancient scripts, the picture gets a bit
fuzzier. Where, if at all, along a historical continuum that leads from set of
glyphs A to set of glyphs B, do we want to draw the lines? The situation is
further muddied when B has a cousin C, who is also a scion of A.

To quote Dr. Whistler: "For the Aramaic script continuum there are two potential
easy answers:

1. Hebrew is already encoded, so just use Hebrew letters for everything and
change fonts for every historical variety.

2. Encode a separate repertoire for each stylistically distinct abjad ever
recorded in the history of Aramaic studies, from Proto-Canaanite to modern
Hebrew (and toss in cursive Hebrew, for that matter), starting with Tables 5.1,
5.3, 5.4, and 5.5 of Daniels and Bright and adding whatever you wish to that."


Now, I am not saying *at all* that we can have a cookie-cutter view of the
picture; we could never have a rule book where if you answer, say, 50% of the
questions as true then you get your own script. There will always be a strong
element of judgment involved.

But what we should have is at least an articulated list of *factors* that play
into the decision of whether to encode a new script X, or treat it as a variant
of an existing Y. That way, as we are faced with trying to assess more and more
obscure sets of characters, the committee can at least be presented with an
organized set of background information. This will also give more guidance to
the proposers, so that they know what the factors are, and can prepare the
information that is needed. The model could be referenced from
http://www.unicode.org/pending/proposals.html.

Here is a quick list off the top of my head. I'm sure there are many others, and
that these could be stated more clearly, but just to get things rolling.... (The
"Overridden examples" are cases where the factor in question was overridden by
other factors.)

1. Independence: there is no shared ancestor (or if there is one, it is not
known to the average person).
Examples: Katakana and Cyrillic
Overridden: Devanagari and Bengali

2. Intelligibility: the average person using X would not recognize text if it
were written in Y, and vice versa.
Examples: Katakana and Cyrillic, Devanagari and Bengali
Overridden: Fraktur or Uncial and Latin

3. Architecture: the behavior of many characters in X is substantially different
from that of characters in Y, in terms of shaping, combining, ordering, etc.

4. Semantics: even where they have roughly similar shapes, the bulk of the
characters in X have substantially different semantics than in Y. This is very
much a matter of judgment: the difference in pronunciation between 'u' in
French and English, or 'j' in Spanish vs. German, does not qualify.
Example: Cherokee vs. Latin

....

At the same time, there are faux criteria that should be pointed out so that
proposers don't go down the wrong path.

1. Superset: X has some characters that are not in Y (or vice versa).
We don't use that argument to de-unify any script that added characters over
time, nor to de-unify modern-day sets of characters that differ in size.
Examples (not de-unified on this basis): Polish and Hawaiian; Arabic and Urdu;
Chinese and Japanese Han.

2. Religion, language, or national origin
We don't de-unify scripts for these reasons. (Nor do we, of course, change the
placement of existing characters, as N. Korea requested.)
Examples (not de-unified on this basis): Polish and Hawaiian; Arabic and Urdu;
Chinese and Japanese Han.
....