Document L2/03-416  for UTC and X3L2 meeting, 4 November, 2003

The Cuneiform Encoding Proposal -- a View of its Current Status  

The following is written by Lloyd Anderson, 3 November 2003.
It refers to the proposal posted at
http://www.evertype.com/standards/iso10646/pdf/n2664-cuneiform.pdf
Much supporting information on various points is now posted, or will be
posted in December, on the web site

http://www.CuneiformSigns.org

***

Our work is an encoding for one of the most important bodies of information 
in human history, and if our encoding fits the writing system like a glove, it 
will make both data handling and additional discoveries much easier.  

To draw a parallel:  when one assyriologist was discussing a hypothesis about 
the earliest records of lunar cycles, the question of a 19-year cycle came 
up, and I suggested simply arranging the data in 19-year columns so we could see 
visually any repeating patterns.  The result was almost immediate:  we could 
see that the cycles were strongly established almost 200 years before we had 
previously thought they were (before 700 BC instead of at 524 BC).

I believe encoding Cuneiform so it works smoothly will have similar large 
effects on future discovery and understanding.

In support of this effort, I am contributing results from a concordance to 
all major sign lists which I am generating in the process of producing an 
etymological dictionary of the origins and development of Cuneiform signs.  

***

It is probably most useful if I summarize my views of where we are now, and 
what remains to be done.

The proposal which is before the UTC has been drafted with input from quite a 
number of people, and has benefited greatly from the efforts of Dean Snyder 
here at Johns Hopkins, in bringing people together so that more professional 
assyriologists have chosen to get involved.  That proposal is progressing 
relatively well.  We have a substantial repertoire under discussion, and will be 
refining that as to both sign inclusion / exclusion and sign names.  Here follow 
some issues which are still being worked on or for which more information will 
be gathered.

***

Cuneiform is unlike Han characters because Han uses blocks of constant size, 
so there is never any doubt where one character ends and the next begins. 
Cuneiform signs are by contrast of varying width.  It is unlike Latin, because any 
combining elements are not singled out as different from base characters when 
standing alone.  We must use more subtle methods in the inevitable borderline 
cases of various kinds, to identify just which are the independently 
functioning distinctive characters of the writing system.

We are devising an encoding for a large historical range, of thousands of 
years, during which time of course some changes occurred.  Our principle, fully 
agreed to, is that a distinctive contrast occurring anywhere in the time we 
cover must be provided for in the character set, even if not all users will need 
it at all times.  Just as with extended-Latin or any other script.  We are 
increasingly conscious that we run into a few odd cases when doing this, and that 
we cannot have identical text content of all eras encoded the same way if the 
set of distinctive elements has changed across those eras and we want them 
each to be encoded true to their own system of distinctions.  In some cases we 
may make compromises among our several goals.

Specifics:

***

1.  Number signs
Cuneiform number signs in general do not share glyphic appearances with signs 
used for non-numerical text.  This is especially clear when we consider the 
historical range of Cuneiform, since signs which look identical in one era are 
distinct in another era.  No problem here.
    There are a very few signs where identity of form is complete or nearly 
complete between a number sign and a non-number sign.  We will be working 
further on these to see what will best serve the community of users of Cuneiform 
texts.
     An illustration of some of the oldest number signs will accompany this 
document at the UTC meeting of 4 November, 2003, to demonstrate the 
combining-diacritic pattern among early Cuneiform number signs (those overlaid marks 
which signaled what kind of thing was being counted).

***

2.  So-called "compound signs".  
We have progressed beyond a blanket rule that anything *called* a compound 
sign is to be encoded as glyphs which we define as its parts and into which we 
fragment it.  A few such signs may be encoded as single characters we treat as 
atomic, even if some have sometimes treated them as a sequence. This can be 
for different reasons: 

(a) because in fact they both appear differently and are in functional 
contrast with the mere sequence of two other signs which look similar to the parts 
we claim to see, even if they are not distinguished under all circumstances; or

(b) because the political repercussions of not doing so would be a widespread 
rejection of the encoding by those for whom it is intended, with some very 
high-profile and/or high-frequency signs.  With the advice of Ken Whistler, our 
active participants on Nov. 3rd agreed to treat these two spellings as NOT 
canonically equivalent, and accept that those who do not understand the texts 
would be prone to a few spelling errors.  Since only professional assyriologists 
are likely to be inputting significant amounts of text in any case, this was 
regarded as not a significant problem, but the issue will be discussed further.

Some similarities may exist here to the historical debates over the <ae> digraph,
which is a ligature for users of English, but is a single atomic 
character for users of Danish.  So it was encoded as a distinct character, and is not 
in Danish usage a structural ligature at all, no more than is the ampersand 
"&".

***

3.  For both Compound signs (just above) and Container-Infixed signs (next), 
we are increasingly recognizing that these are not simple or straightforward 
categories, that there are several groups of signs under each blanket term, and 
that we *may* choose to distinguish such groups in the final proposal. 
Additional data and patterns of signs are constantly being accumulated, so we will 
have gradually increasing support for our choices.

***

4. Container-infixed signs 
These can be encoded either as atoms or as code sequences.  Our group has so far 
chosen to encode primarily as atoms.  
Either is workable and extendable to additional signs as they are discovered, 
but under particular conditions.

Chief advantages of atomic coding:
   The parts of A-with-infixed-B may develop in the combination the same way 
they do when independent signs, or they may develop in a special way in the 
combination.  We can handle certain changes in fused components over time and treat 
the sign as still the "same" sign, so texts retain their identity in encoded 
form across at least a substantial span of historical change, as when an 
original component NA is replaced later by the rather similar KI, yet the sign as a 
whole retains its identity.  This is an especially obvious solution for 
irregularities in signs which mostly behave as fused, so that the "parts" cease 
to be recognizable.  In other words, deep etymological origins can be 
disregarded in such cases:  we are not *forced* to encode the sign two different ways 
simply because we know it underwent some change and is not a direct inheritance.

Chief advantages of code sequences (SIGN <infix> SIGN or more inclusively 
SIGN <merge> SIGN):
     Certain of the "container" signs are highly productive, permitting many 
infixed signs.  The vast majority of signs we have found which may be added to 
the repertoire are of the form container-sign-with-infixed-sign.  One of the 
container signs (GA2) takes the widest range of infixed signs.  It may have 
conveyed the meaning content "basket of ____",  so utterly transparent that it 
is like an independent phrase in a sentence.  That will be more conveniently 
encoded as a sequence of codes, CONTAINER SIGN <infix> INFIXED SIGN.  There are 
other complexes container-with-infix which are at the opposite extreme, fused 
and not productive.
     Glyphic representation:  All systems which can handle the code points 
for Cuneiform will also be able to handle fonts in which a sequence of codes is 
represented by a single glyph.  So font makers can add support for new 
sequences and thus new container-infixed signs without needing any change in the 
standard, if encoding is as a sequence SIGN <infix> SIGN.  The default binary sort 
order will continue to work for new signs encoded that way.
     There are some fluctuations either at one time or across eras:  what 
appears as A with B infixed at one time may appear as A followed by B at another 
time.  Recognition of equivalences between infixed and extraposed versions of 
the "same" sign is much easier if the container-infixed signs are encoded as 
sequences. Yet there are not very many examples of this type.  
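
As a minimal sketch of why sequence encoding eases this recognition, the 
Python fragment below compares texts after dropping the operator, so that 
A <infix> B matches A followed by B.  Texts are modelled as lists of sign 
names; the token "<infix>" and the sign names used are placeholders for 
illustration, not code points or names from the proposal.

```python
# Hypothetical operator token; no code point has been assigned in the proposal.
INFIX = "<infix>"


def loose_form(signs):
    """Drop the operator so that A <infix> B compares equal to A B."""
    return [s for s in signs if s != INFIX]


def loose_find(haystack, needle):
    """Return start positions, in the loose form of haystack, where needle occurs."""
    h, n = loose_form(haystack), loose_form(needle)
    return [i for i in range(len(h) - len(n) + 1) if h[i:i + len(n)] == n]


# An infixed spelling in the text is found by a search for the extraposed one.
text = ["KI", "GA2", INFIX, "AN", "NA"]
hits = loose_find(text, ["GA2", "AN"])
```

With atomic encoding the same search would need a table listing each atom 
and its parts.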

Neutral or nearly so:
     Sort order can keep all signs with the same "Container" component 
together either by binary sort order or by table-driven sort, under either method of 
encoding.  A difference is that new signs with known components will not be 
automatically sorted correctly if container-infixed signs are encoded as atoms, 
not in binary sorts, and not in table-driven sorts until the table is 
modified.  In the case of encoding as atoms, the table can specify sorts *as if* 
sequences SIGN <infix> SIGN, so new container-infix signs can be added; all signs 
known distinct should however be included from the beginning, since only the 
container-infix ones can be added later and sorted correctly by default.  A 
particular encoded character will be needed which we can refer to as INFX, 
<infix> or better <merge>, both to support the table sort instructions and to 
permit the addition of new container-infixed signs.
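
The table-driven case can be sketched in Python as follows.  This is an 
illustration under stated assumptions, not a proposed implementation:  the 
sign names, their relative order, and the atom names "GA2xAN" and "GA2xNA" 
are invented for the example, and "<infix>" stands for the operator character 
discussed above.  Atoms are expanded to their parts for sorting only, so 
they interleave correctly with a new sign entered as a sequence.

```python
# Hypothetical operator token and sign inventory, for illustration only.
INFIX = "<infix>"

# Stand-in for the traditional sign order; real weights would come from a table.
BASE_ORDER = ["AN", "GA2", "KI", "NA"]
WEIGHT = {name: i + 1 for i, name in enumerate(BASE_ORDER)}
WEIGHT[INFIX] = 0  # the operator sorts before any sign

# Atomically encoded container-infixed signs, decomposed for sorting only.
ATOM_DECOMPOSITION = {
    "GA2xAN": ["GA2", INFIX, "AN"],
    "GA2xNA": ["GA2", INFIX, "NA"],
}


def sort_key(text):
    """Return a tuple of weights, expanding atoms as if they were sequences."""
    key = []
    for sign in text:
        for part in ATOM_DECOMPOSITION.get(sign, [sign]):
            key.append(WEIGHT[part])
    return tuple(key)


texts = [
    ["GA2xNA"],            # atom in the table
    ["GA2", INFIX, "KI"],  # new sign, entered as a sequence
    ["GA2"],
    ["GA2xAN"],            # atom in the table
    ["KI"],
]
sorted_texts = sorted(texts, key=sort_key)
# Result: GA2, GA2xAN, GA2 <infix> KI, GA2xNA, KI
```

The new sign GA2 <infix> KI falls between the two atoms without any change 
to the table, which is the behavior described above.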
     
A minor problem with encoding as sequences:
We NEED CODE POINT <infix> or better <merge> under EITHER encoding method.  
Problems of hierarchy of the following types (nested infixation, or infixation 
of a sign sequence and not merely of a single sign) are very rare:
(A B) <infix> (C D) or 
A <infix> (C <infix> D)
so we can probably manage without support for hierarchies, and handle only the 
simple containers with simple infixed signs.
Having a large number of atomic encoded signs decreases need for such 
hierarchical structures.
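
A flat treatment of this kind might be sketched in Python as below.  The 
function name, the "<infix>" token and the sign names are assumptions made 
for the example.  The sketch groups a flat token list into signs and rejects 
chained operators rather than trying to guess a hierarchy.

```python
# Hypothetical operator token; the grouping rules below are an assumption.
INFIX = "<infix>"


def group_signs(tokens):
    """Group a flat token list into signs: 'A', or ('A', 'B') for A <infix> B."""
    signs = []
    i = 0
    while i < len(tokens):
        if tokens[i] == INFIX:
            raise ValueError(f"operator at position {i} has no container sign")
        if i + 1 < len(tokens) and tokens[i + 1] == INFIX:
            if i + 2 >= len(tokens) or tokens[i + 2] == INFIX:
                raise ValueError(f"operator at position {i + 1} has no infixed sign")
            if i + 3 < len(tokens) and tokens[i + 3] == INFIX:
                # A <infix> B <infix> C would need a hierarchy to interpret.
                raise ValueError("nested or chained infixation is not supported")
            signs.append((tokens[i], tokens[i + 2]))
            i += 3
        else:
            signs.append(tokens[i])
            i += 1
    return signs


grouped = group_signs(["KI", "GA2", INFIX, "AN", "NA"])
```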

***

5.  Sort order:
     I can see no justification yet for any binary sort order other than the 
de facto standard.  And the same for the default table sort order.  The 
traditional sort order reflects the dominant practice across generations of 
scholarship.  It is based on the types of wedges in the signs, and their arrangement.  
This has been most often done for later forms of the signs, specifically for 
Neo-Assyrian.  When done for other scribal traditions, the sign order may be 
slightly different because the actual sign forms are used, or it may be the 
same as the most common standard because the forms of equivalent signs of the 
standard Neo-Assyrian may be used in determining the sort order.
     Concerns about where to interleave additional signs which do not have a 
later sign representative are now to me very small.  Ellermeier's list has 
already done this for the matches between Neo-Assyrian and the quite old Gudea 
(Lagash) signs, and most of the rest are equally straightforward.
     The proposal currently before you uses an ordering based on a selection 
from among the many possible ways in which each particular sign is named.  
Some of this will change since some of the name choices are under discussion and 
are expected to change.  
     It is my experience that it is almost *NEVER* a good idea to change a de 
facto standard, unless there is an overwhelming preponderance of evidence that 
the newer way will be overall *substantially* better.  The reason is that the 
change itself has such a high cost.  Proponents of new ways almost always 
overrate the virtues of a new way, and underrate the virtues of stability.  
Encoding standards emphasize stability, and I hope we can in this matter also.

***

6.  Sort order and sign identity implementations:
There are a host of particular decisions that cuneiform specialists will need 
to make about sorting and searching particular sets of signs, and on treating 
them as same vs. distinct.  Also about what changes across historical eras 
are such as to force us to encode signs differently, and what changes we can 
treat as purely glyphic, not affecting text content encoding.  I am compiling 
lists of the most difficult borderline cases to highlight for specialists to 
consider decisions on them.  Most of them have not been offered yet for any 
discussion at all.

***

7.  Inclusion of older signs (Fara, Uruk, etc.).  
     Assyriologist specialists are most reluctant to consider encoding older 
signs because they feel that knowledge of the older eras is not complete.  Yet 
inclusion of all signs on which we have secure knowledge is especially 
important now before an encoding standard reaches a final proposal.  What is known 
about the older signs can make encoding even of later stages more useful, so 
the encoding fits the long cuneiform tradition more "like a glove".  
     The wording of the proposal before the UTC is consistent with the 
approach I am urging, where it says that we "will take into account factors arising 
from the earliest stages of cuneiform to the extent that these are already 
known and understood".  It is inconsistent where it says there must be a first 
stage which "will *not* include Archaic Cuneiform".  That is a large 
generalization.  If we add to the previous statement that we "will take into account both 
signs and factors arising from the earliest stages of cuneiform to the extent 
that these are already known and understood" I think we would have full 
agreement in principle.  
     There is still the question of what happens in practice.  There is a great reluctance 
concerning the earlier stages.  I believe this reluctance is in part 
perfectionism, in part it is simply the avoidance of what can be a lot of work ("no 
special effort has been made to go back farther than Ur III"), and partly it is 
a consequence of using blanket terms to cover a complex and multi-faceted 
situation.  Instead of using a word like "archaic", I believe our principle should 
simply be the one stated in the proposal, that we take into account all 
secure knowledge of all eras from the beginning of the encoding process.  That 
allows us to use late "dialects" of cuneiform and early attestations of cuneiform 
(including its earliest forms produced not using wedge tools) whenever we have 
solid information, and to disregard them when we do not.  No one is compelled 
to work on the earliest levels, or on the latest for that matter, but we 
should not throw away information which is easily available.  Nor is absolute 
comprehensiveness a goal -- "comprehensiveness at the level of pre-Old-Akkadian 
periods is not appropriate given the current state of paleographic research" is 
certainly a common-sense statement.  On the other hand, a large proportion of 
pre-Old-Akkadian signs are known and understood, including some which function 
differently from later signs.
     To support this inclusion of more solid knowledge, I have been pressing 
forward more rapidly with a complete concordance to all eras of cuneiform, and 
expect a nearly publishable version will be done by the end of December.  
Here follow, as illustrations, comments about what is known of two features of 
older signs, to clarify why they are *not* a problem.

***

8. "Turned" signs
In the Uruk IV stage, a number of signs occur turned 90 or 180 degrees, or 45 
degrees.  The vast majority of such turned signs do not occur later; they 
occur only in that oldest layer with substantial writing, Uruk IV. (R. Englund 
statement).  The vast majority of turned signs appear never to have expressed 
any distinction in content.  They were merely random variation or adjustment of 
their glyphic shape to better fit their context.  One possible method to 
represent this is to have a few COMBINING CHARACTERS (turned 90 degrees, 45 
degrees, 180 degrees) or the existing variant-selection characters, which can be used 
as needed, and which can be disregarded if, as appears to be the case, they do 
not convey differences of content.  We can quite easily list the few cases 
where turning does create a separate sign, as $E (vegetation, originally upright) 
vs. harvested vegetation (originally horizontal); or a reversed hand with a 
meaning referring to the left-hand, or the like.  The number of unclear cases, 
where we can't tell whether we have any substantive evidence for a significant 
distinction, is tiny.  This paragraph is intended to implement what I agree 
with Englund is an approach placing high value on security of analysis, and the 
provision of a mechanism to make additional distinctions which *may* be 
significant although we do not yet want to attribute status as fully independent 
signs.
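
A minimal Python sketch of how such marks could be disregarded in comparison 
follows.  The three marker tokens are placeholders invented for the example 
(no such characters are defined in the proposal), and the sign names are 
illustrative.  Signs for which turning is distinctive are assumed to be 
separate signs in their own right and so are not affected.

```python
# Hypothetical rotation markers; names are placeholders, not proposed characters.
ROTATION_MARKS = {"<turn45>", "<turn90>", "<turn180>"}


def content_form(signs):
    """Drop rotation markers, keeping only the signs that carry content."""
    return [s for s in signs if s not in ROTATION_MARKS]


def same_content(a, b):
    """True if two texts differ at most in the turning of their signs."""
    return content_form(a) == content_form(b)


# A turned AN and an upright AN compare as the same text content.
turned = ["AN", "<turn90>", "KI"]
upright = ["AN", "KI"]
match = same_content(turned, upright)
```

The markers remain in the stored text, so a later analysis which finds a 
turning to be significant loses no information.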

9.  "Fused" signs in the archaic typography of Uruk.
It is possible to distinguish even in archaic cuneiform between freely 
recombining signs or sign components, and those combinations which are fused in some 
way (which together essentially constitute a single sign), as  NAMESHDA, NAM 
x SHE, etc.  Citations will be provided to the specialist literature.

Similarly, other questions in archaic signs are not difficult to handle.  
Some comments from Englund (1998) on what he considers conservative choices in 
these matters, and what is known, will be available by noon 4 November on the 
web page http://www.CuneiformSigns.org/Strategy.htm

***

10.  Implementation specifications are needed.
The major lack I see currently in preparation of the encoding proposal is 
the lack of a draft of implementation guidelines. The task of preparing such 
guidelines will bring to the attention of all involved a number of technical and 
structural questions which it is otherwise much too easy to pass by unnoticed, 
and questions about individual characters.

Here are a couple of sentences from the proposal which have implications for 
implementation, but whose implications I think have not yet been discussed, at 
least not publicly.  The two following sentences both apply to splits of what 
was one character into two significantly distinct characters, but they 
recommend different implementations.  There is nothing wrong with having these two 
different implementation strategies available, because there are situations 
appropriate to each.  But the decision will have to be made for each split or 
merger, and specialists will need to consider each of these.  Perhaps 
generalizations can be discovered which work, but it can be dangerous if we let the 
generalizations themselves become the goal, rather than the goal being the best 
implementation for each split and each merger on its own terms.  Facing such 
implementation questions is simply one of the consequences of attempting an 
encoding across historical changes such as splits and mergers.  

(a)  "[mergers and splits] must be encoded at the point of maximum 
differentiation, with reduplication of glyphs as necessary in other periods"
(b)  "Glyph variants such as TA*, a Middle Assyrian form of the sign TA which 
in Neo-Assyrian usage has its own logographic interpretation, will be 
assigned their own code positions, to be used only when the new interpretation 
applies."

For sentence (a), if signs S1 and S2 are distinct at one period, then in 
periods where they are not distinct, the single glyphic rendering shall be 
duplicated and used for both characters.  Users would of course have to choose which 
of the two characters to input based on their knowledge of the distinction in 
texts other than the one they are inputting.

For sentence (b), the character code for S2 (TA*) shall not be used except 
for those texts where S1 (TA) and S2 (TA*) are significantly distinct.
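
One consequence for searching can be sketched in Python.  The table below is 
an assumption made for illustration:  it records, for a split-off character, 
its base character and the periods in which the two are taken to be distinct. 
The single entry follows the TA / TA* example in sentence (b); the period 
labels are illustrative strings, not a proposed classification.  Outside 
those periods the sketch folds S2 to S1, so that a text entered under 
strategy (a) and one entered under strategy (b) compare alike.

```python
# Hypothetical split table: split-off sign -> (base sign, periods where distinct).
SPLITS = {
    "TA*": ("TA", {"Neo-Assyrian"}),
}


def fold_for_period(signs, period):
    """Replace a split-off sign by its base where the period does not distinguish them."""
    folded = []
    for sign in signs:
        base, distinct_periods = SPLITS.get(sign, (sign, set()))
        folded.append(sign if period in distinct_periods else base)
    return folded


middle = fold_for_period(["TA*", "KI"], "Middle Assyrian")
neo = fold_for_period(["TA*", "KI"], "Neo-Assyrian")
```

Whether such folding is wanted is itself one of the decisions to be made for 
each split or merger.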

***

Best wishes,
Lloyd Anderson
Ecological Linguistics
PO Box 15156
Washington DC 20003
(202) 547-7683
ecoling@aol.com