L2/04-146
From: Mark Davis
Date: 2004-05-02 16:00:23 -0700
Subject: A Script Model

Please add this to the document registry and agenda. (If you respond to this message, please take "UTC Agenda Item" off the title and drop all but unicore@unicode.org from the addressees.)

================

I'm putting on a Whistler hat here for a moment: we need a "Script Model".

Part of the problem I see in the debate over Phoenician is that we don't *present* quite clear enough a picture of when two bodies of characters should be encoded as separate scripts or not. I think we've been able to get along pretty well up until now without one simply because we were dealing with the relatively simple cases. We all have no problem separating, say, Katakana and Cyrillic. But as people start to propose ancient scripts, the picture gets a bit fuzzier. Where, if at all, along a historical continuum that leads from set of glyphs A to set of glyphs B, do we want to draw the lines? The situation is further muddied when B has a cousin C, who is also a scion of A.

To quote Dr. Whistler:

"For the Aramaic script continuum there are two potential easy answers:

1. Hebrew is already encoded, so just use Hebrew letters for everything and change fonts for every historical variety.

2. Encode a separate repertoire for each stylistically distinct abjad ever recorded in the history of Aramaic studies, from Proto-Canaanite to modern Hebrew (and toss in cursive Hebrew, for that matter), starting with Tables 5.1, 5.3, 5.4, and 5.5 of Daniels and Bright and adding whatever you wish to that."

Now, I am not saying *at all* that we can have a cookie-cutter view of the picture; we could never have a rule book where if you answer, say, 50% of the questions as true then you get your own script. There will always be a strong element of judgment involved.
But what we should have is at least an articulated list of *factors* that play into the decision of whether to encode a new script X, or treat it as a variant of an existing Y. That way, as we are faced with trying to assess more and more obscure sets of characters, the committee can at least be presented with an organized set of background information. This will also give more guidance to the proposers, so that they know what the factors are, and can prepare the information that is needed. The model could be referenced from http://www.unicode.org/pending/proposals.html.

Here is a quick list off the top of my head. I'm sure there are many others, and that these could be stated more clearly, but just to get things rolling.... (The "Overridden examples" are cases where the factor in question was overridden by other factors.)

1. Independence: there is no shared ancestor (or if there is one, it is not known to the average person).
   Examples: Katakana and Cyrillic
   Overridden: Devanagari and Bengali

2. Intelligibility: the average person using X would not recognize text if it were written in Y, and vice versa.
   Examples: Katakana and Cyrillic, Devanagari and Bengali
   Overridden: Fraktur or Uncial and Latin

3. Architecture: the behavior of many characters in X is substantially different than Y. This is in terms of shaping, combining, order, etc.

4. Semantics: even where they have roughly similar shapes, the bulk of the characters in X have substantially different semantics than in Y. This is very much a matter of judgement: the difference in pronunciation between 'u' in French and English, or 'j' in Spanish vs German does not qualify.
   Example: Cherokee vs. Latin

....

At the same time, there are faux criteria, that should be pointed out so that proposers don't go down the wrong path.

1. Superset: X has some characters that are not in Y (or vice versa).
   We don't use that argument to de-unify any script that added characters over time, nor to de-unify modern-day sets of characters that differ in size.
   Examples: Polish and Hawaiian, nor Arabic and Urdu, nor Chinese and Japanese Han.

2. Religion, language, or national origin
   We don't de-unify scripts for these reasons. (Nor do we, of course, change the placement of existing characters, as N. Korea requested.)
   Examples: Polish and Hawaiian, nor Arabic and Urdu, nor Chinese and Japanese Han.

....
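
Editorial illustration, not part of the original message: one way a proposer might organize the "background information" asked for above is as a per-factor record. The sketch below is only an assumption about how that could look. The names `Factor`, `FactorNote`, and `ScriptAssessment` are hypothetical, and the code deliberately computes no score or verdict, since the message says the decision always involves judgment.

```python
from dataclasses import dataclass, field
from enum import Enum


class Factor(Enum):
    """The four factors listed in the message (wording paraphrased)."""
    INDEPENDENCE = "no shared ancestor known to the average person"
    INTELLIGIBILITY = "users of X would not recognize text written in Y"
    ARCHITECTURE = "shaping, combining, and ordering behave differently"
    SEMANTICS = "similar shapes carry substantially different semantics"


@dataclass
class FactorNote:
    factor: Factor
    holds: bool               # True if the factor favors a separate script
    evidence: str             # free-text justification from the proposer
    overridden: bool = False  # True if other factors outweighed this one


@dataclass
class ScriptAssessment:
    candidate: str            # script X in the message's terms
    existing: str             # script Y in the message's terms
    notes: list = field(default_factory=list)

    def add(self, note):
        self.notes.append(note)

    def missing(self):
        """Factors the proposer has not yet addressed."""
        seen = {n.factor for n in self.notes}
        return [f for f in Factor if f not in seen]

    def summary(self):
        """One line per recorded factor; no score, no verdict."""
        lines = [f"{self.candidate} vs. {self.existing}"]
        for n in self.notes:
            mark = "yes" if n.holds else "no"
            if n.overridden:
                mark += " (overridden)"
            lines.append(f"  {n.factor.name.lower()}: {mark} - {n.evidence}")
        return "\n".join(lines)


# Example drawn from the message: Devanagari and Bengali appear as an
# "overridden" case for independence and as an example for intelligibility.
example = ScriptAssessment("Bengali", "Devanagari")
example.add(FactorNote(Factor.INDEPENDENCE, False,
                       "shared ancestor", overridden=True))
example.add(FactorNote(Factor.INTELLIGIBILITY, True,
                       "listed as an example in the message"))
print(example.summary())
print("not yet addressed:", [f.name for f in example.missing()])
```

Keeping `missing()` separate from `summary()` lets a proposer see which factors still need supporting information without implying that a complete record settles the question.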