Document L2/03-416  for UTC and X3L2 meeting, 4 November, 2003



The Cuneiform Encoding Proposal -- a View of its Current Status  



The following is written by Lloyd Anderson, 3 November 2003.

It refers to the proposal posted at 

http://www.evertype.com/standards/iso10646/pdf/n2664-cuneiform.pdf

Much supporting information on various points is now or will in December be 

posted on the web site



http://www.CuneiformSigns.org



***



Our work is an encoding for one of the most important bodies of information 

in human history, and if our encoding fits the writing system like a glove, it 

will make both data handling and additional discoveries much easier.  



To draw a parallel:  when one assyriologist was discussing a hypothesis about 

the earliest records of lunar cycles, the question of a 19-year cycle came 

up, and I suggested simply arranging the data in 19-year columns so we could see 

visually any repeating patterns.  The result was almost immediate: we could 

see that the cycles were strongly established almost 200 years before we had 

previously thought they were (before 700 BC instead of at 524 BC).



I believe encoding Cuneiform so it works smoothly will have similar large 

effects on future discovery and understanding.



In support of this effort, I am contributing results from a concordance to 

all major sign lists which I am generating in the process of producing an 

etymological dictionary of the origins and development of Cuneiform signs.  



***



It is probably most useful if I summarize my views of where we are now, and 

what remains to be done.



The proposal which is before the UTC has been drafted with input from quite a 

number of people, and has benefited greatly from the efforts of Dean Snyder 

here at Johns Hopkins, in bringing people together so that more professional 

assyriologists have chosen to get involved.  That proposal is progressing 

relatively well.  We have a substantial repertoire under discussion, and will be 

refining that as to both sign inclusion / exclusion and sign names.  Here follow 

some issues which are still being worked on or for which more information will 

be gathered.



***



Cuneiform is unlike Han characters because Han uses blocks of constant size, 

so there is never any doubt where one character ends and the next begins. 

Cuneiform signs are by contrast of varying width.  It is unlike Latin, because any 

combining elements are not singled out as different from base characters when 

standing alone.  We must use more subtle methods in the inevitable borderline 

cases of various kinds, to identify just which are the independently 

functioning distinctive characters of the writing system.



We are devising an encoding for a large historical range, of thousands of 

years, during which time of course some changes occurred.  Our principle, fully 

agreed to, is that a distinctive contrast occurring anywhere in the time we 

cover must be provided for in the character set, even if not all users will need 

it at all times, just as with extended-Latin or any other script.  We are 

increasingly conscious that we run into a few odd cases when doing this, and that 

we cannot have identical text content of all eras encoded the same way if the 

set of distinctive elements has changed across those eras and we want them 

each to be encoded true to their own system of distinctions.  In some cases we 

may make compromises among our several goals.



Specifics:



***



1.  Number signs

Cuneiform number signs in general do not share glyphic appearances with signs 

used for non-numerical text.  This is especially clear when we consider the 

historical range of Cuneiform, since signs which look identical in one era are 

distinct in another era.  No problem here.

    There are a very few signs where identity of form is complete or nearly 

complete between a number sign and a non-number sign.  We will be working 

further on these to see what will best serve the community of users of Cuneiform 

texts.

     An illustration of some of the oldest number signs will accompany this 

document at the UTC meeting of 4 November, 2003, to demonstrate the 

combining-diacritic pattern among early Cuneiform number signs (those overlaid marks 

which signaled what kind of thing was being counted).



***



2.  So-called "compound signs".  

We have progressed beyond a blanket rule that anything *called* a compound 

sign is to be encoded as glyphs which we define as its parts and into which we 

fragment it.  A few such signs may be encoded as single characters we treat as 

atomic, even if some have sometimes treated them as a sequence. This can be 

for different reasons: 



(a) because in fact they both appear differently and are in functional 

contrast with the mere sequence of two other signs which look similar to the parts 

we claim to see, even if they are not distinguished under all circumstances; or



(b) because the political repercussions of not doing so would be a widespread 

rejection of the encoding by those for whom it is intended, with some very 

high-profile and/or high-frequency signs.  With the advice of Ken Whistler, our 

active participants on Nov. 3rd agreed to treat these two spellings as NOT 

canonically equivalent, and accept that those who do not understand the texts 

would be prone to a few spelling errors.  Since only professional assyriologists 

are likely to be inputting significant amounts of text in any case, this was 

regarded as not a significant problem, but the issue will be discussed further.
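
The consequence of treating the two spellings as not canonically equivalent can be sketched in Python. This is only an illustration under stated assumptions: the code points are hypothetical Private Use Area values, and SIGN_X, SIGN_Y, ATOM_XY and the folding table are invented names, not part of the proposal. The point shown is that normalization leaves the two spellings distinct, so any tolerance for spelling errors would have to come from a separate, tool-level folding table.

```python
import unicodedata

# Hypothetical code points (Private Use Area), for illustration only.
SIGN_X, SIGN_Y = "\uE080", "\uE081"
ATOM_XY = "\uE180"   # a "compound" sign encoded as one atomic character

# Assumed tool-level table: NOT a canonical equivalence, just a search aid.
SEARCH_FOLD = {ATOM_XY: SIGN_X + SIGN_Y}

def fold_for_search(text):
    """Replace each atom by the look-alike sequence, for loose matching only."""
    return "".join(SEARCH_FOLD.get(ch, ch) for ch in text)

def exact_match(a, b):
    """Comparison after canonical normalization; the spellings stay distinct."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def loose_match(a, b):
    """Comparison that tolerates the atom/sequence spelling confusion."""
    return fold_for_search(a) == fold_for_search(b)

atom_spelling = ATOM_XY
sequence_spelling = SIGN_X + SIGN_Y
```

With this arrangement `exact_match(atom_spelling, sequence_spelling)` is false while `loose_match` is true, so stored text keeps the distinction and only the search layer relaxes it.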



Some similarities may exist here to the historical debates over the <ae> digraph,

which is a ligature for users of English, but is a single atomic 

character for users of Danish.  So it was encoded as a distinct character, and is not 

in Danish usage a structural ligature at all, no more than is the ampersand 

"&".



***



3.  For both Compound signs (just above) and Container-Infixed signs (next), 

we are increasingly recognizing that these are not simple or straightforward 

categories, that there are several groups of signs under each blanket term, and 

that we *may* choose to distinguish such groups in the final proposal. 

Additional data and patterns of signs are constantly being accumulated, so we will 

have gradually increasing support for our choices.



***



4. Container-infixed signs 

These can be encoded either as atoms or as code sequences.  Our group has so far 

chosen to encode primarily as atoms.  

Either is workable and extendable to additional signs as they are discovered, 

but under particular conditions.



Chief advantages of atomic coding:

   The parts of A-with-infixed-B may develop in the combination the same way 

they do when independent signs, or they may develop in a special way in the 

combination.  We can handle certain changes in fused components over time, treat 

the sign as still the "same" sign so texts retain their identity in encoded 

form across at least a substantial span of historical change, as when an 

original component NA is replaced later by the rather similar KI, yet the sign as a 

whole retains its identity.  This is an especially obvious solution for 

irregularities in signs which mostly behave as fused, so that the "parts" cease 

to be recognizable.  In other words, deep etymological origins can be 

disregarded in such cases; we are not *forced* to encode the sign two different ways 

simply because we know it underwent some change and is not a direct inheritance.



Chief advantages of code sequences (SIGN <infix> SIGN or more inclusively 

SIGN <merge> SIGN):

     Certain of the "container" signs are highly productive, permitting many 

infixed signs.  The vast majority of signs we have found which may be added to 

the repertoire are of the form container-sign-with-infixed-sign.  One of the 

container signs (GA2) takes the widest range of infixed signs.  It may have 

conveyed the meaning content "basket of ____",  so utterly transparent that it 

is like an independent phrase in a sentence.  That will be more conveniently 

encoded as a sequence of codes, CONTAINER SIGN <infix> INFIXED SIGN.  There are 

other complexes container-with-infix which are at the opposite extreme, fused 

and not productive.

     Glyphic representation:  All systems which can handle the code points 

for Cuneiform will also be able to handle fonts in which a sequence of codes is 

represented by a single glyph.  So font makers can add support for new 

sequences and thus new container-infixed signs without needing any change in the 

standard, if encoding is as a sequence SIGN <infix> SIGN.  The default binary sort 

order will continue to work for new signs encoded that way.

     There are some fluctuations either at one time or across eras: what 

appears as A with B infixed at one time may appear as A followed by B at another 

time.  Recognition of equivalences between infixed and extraposed versions of 

the "same" sign is much easier if the container-infixed signs are encoded as 

sequences. Yet there are not very many examples of this type.  
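
The points above about sequences (new combinations without new code points, default binary sort, and easy comparison of infixed and extraposed versions) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the code points are hypothetical Private Use Area values, the sign names are placeholders, and INFIX is deliberately given a lower code point than any sign so that infixed forms sort directly after the bare container.

```python
# Hypothetical code points (Private Use Area), for illustration only.
INFIX = "\uE001"  # stand-in for the proposed <infix> (or <merge>) character
GA2 = "\uE010"    # a productive container sign
KA = "\uE020"     # another container sign
GAR = "\uE030"    # signs that may appear infixed
PA = "\uE040"
SHE = "\uE050"

def infixed(container, inner):
    """Spell container-with-infixed-inner as CONTAINER <infix> INFIXED."""
    return container + INFIX + inner

def extraposed(text):
    """Drop <infix>, giving the 'A followed by B' version of the same text."""
    return text.replace(INFIX, "")

texts = [
    infixed(KA, GAR),
    infixed(GA2, SHE),
    KA,
    infixed(GA2, PA),   # a "new" combination: no change to the repertoire needed
    GA2,
    infixed(GA2, GAR),
]

# Python compares strings code point by code point, i.e. a binary sort.
ordered = sorted(texts)
```

In `ordered`, GA2 comes first, followed by its three infixed combinations, then KA and its combination, so each container's family stays together without any table.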



Neutral or nearly so:

     Sort order can keep all signs with the same "Container" component 

together either by binary sort order or by table-driven sort, under either method of 

encoding.  A difference is that new signs with known components will not be 

automatically sorted correctly if container-infixed signs are encoded as atoms, 

not in binary sorts, and not in table-driven sorts until the table is 

modified.  In the case of encoding as atoms, the table can specify sorts *as if* 

sequences SIGN <infix> SIGN, so new container-infix signs can be added; all signs 

known distinct should however be included from the beginning, since only the 

container-infix ones can be added later and sorted correctly by default.  A 

particular encoded character will be needed which we can refer to as INFX, 

<infix> or better <merge> both to support the table sort instructions and to 

permit the addition of new container-infixed signs.
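
A minimal Python sketch of the table-driven alternative for atomic encoding follows. It is an illustration only: the code points are hypothetical Private Use Area values, the sign names and table entries are placeholders, and the atoms are deliberately placed away from their containers so that only the table, not the binary order, keeps each family together.

```python
# Hypothetical code points (Private Use Area), for illustration only.
INFIX = "\uE001"       # stand-in for the <infix> / <merge> character
GA2, KA = "\uE010", "\uE020"
GAR, PA, SHE = "\uE030", "\uE040", "\uE050"
GA2_X_GAR = "\uE100"   # atomically encoded GA2-with-infixed-GAR
GA2_X_SHE = "\uE101"   # atomically encoded GA2-with-infixed-SHE
KA_X_GAR = "\uE102"    # atomically encoded KA-with-infixed-GAR

# Sort table: each atom sorts *as if* it were CONTAINER <infix> INFIXED.
SORT_AS = {
    GA2_X_GAR: GA2 + INFIX + GAR,
    GA2_X_SHE: GA2 + INFIX + SHE,
    KA_X_GAR: KA + INFIX + GAR,
}

def sort_key(text):
    """Expand atoms through the table; unlisted characters sort as themselves."""
    return "".join(SORT_AS.get(ch, ch) for ch in text)

signs = [KA_X_GAR, GA2_X_SHE, KA, GA2_X_GAR, GA2]
by_table = sorted(signs, key=sort_key)
by_binary = sorted(signs)   # atoms end up after all the plain signs

# A newly found sign, spelled as a sequence, needs no table change.
NEW_SIGN = GA2 + INFIX + PA
with_new = sorted(signs + [NEW_SIGN], key=sort_key)
```

Here `by_table` groups each container with its atoms, `by_binary` does not, and `NEW_SIGN` falls between the GAR and SHE combinations of GA2 without the table being modified.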

     

A minor problem with encoding as sequences.

We NEED CODE POINT <infix> or better <merge> under EITHER encoding method.  

Problems of hierarchy of the following types (nested infixation, or infixation 

of a sign sequence, not merely of a single sign) are very rare: 

(A B) <infix> (C D) or 

A <infix> (C <infix> D)

so we can probably manage without support for hierarchies, and handle only the 

simple containers with simple infixed signs.

Having a large number of atomic encoded signs decreases the need for such 

hierarchical structures.



***



5.  Sort order:

     I can see no justification yet for any binary sort order other than the 

de facto standard.  And the same for the default table sort order.  The 

traditional sort order reflects the dominant practice across generations of 

scholarship.  It is based on the types of wedges in the signs, and their arrangement.  

This has been most often done for later forms of the signs, specifically for 

Neo-Assyrian.  When done for other scribal traditions, the sign order may be 

slightly different because the actual sign forms are used, or it may be the 

same as the most common standard because the forms of equivalent signs of the 

standard Neo-Assyrian may be used in determining the sort order.

     Concerns about where to interleave additional signs which do not have a 

later sign representative are now to me very small.  Ellermeier's list has 

already done this for the matches between Neo-Assyrian and the quite old Gudea 

(Lagash) signs, and most of the rest are equally straightforward.

     The proposal currently before you uses an ordering based on a selection 

from among the many possible ways in which each particular sign is named.  

Some of this will change since some of the name choices are under discussion and 

are expected to change.  

     It is my experience that it is almost *NEVER* a good idea to change a de 

facto standard, unless there is an overwhelming preponderance of evidence that 

the newer way will be overall *substantially* better.  The reason is that the 

change itself has such a high cost.  Proponents of new ways almost always 

overrate the virtues of a new way, and underrate the virtues of stability.  

Encoding standards emphasize stability, and I hope we can in this matter also.



***



6.  Sort order and sign identity implementations:

There are a host of particular decisions that cuneiform specialists will need 

to make about sorting and searching particular sets of signs, and on treating 

them as same vs. distinct.  Also about what changes across historical eras 

are such as to force us to encode signs differently, and what changes we can 

treat as purely glyphic, not affecting text content encoding.  I am compiling 

lists of the most difficult borderline cases to highlight for specialists to 

consider decisions on them.  Most of them have not been offered yet for any 

discussion at all.



***



7.  Inclusion of older signs (Fara, Uruk, etc.).  

     Specialist assyriologists are most reluctant to consider encoding older 

signs because they feel that knowledge of the older eras is not complete.  Yet 

inclusion of all signs on which we have secure knowledge is especially 

important now before an encoding standard reaches a final proposal.  What is known 

about the older signs can make encoding even of later stages more useful, so 

the encoding fits the long cuneiform tradition more "like a glove".  

     The wording of the proposal before the UTC is consistent with the 

approach I am urging, where it says that we "will take into account factors arising 

from the earliest stages of cuneiform to the extent that these are already 

known and understood".  It is inconsistent where it says there must be a first 

stage which "will *not* include Archaic Cuneiform".  That is a large 

generalization.  If we add to the previous statement that we "will take into account both 

signs and factors arising from the earliest stages of cuneiform to the extent 

that these are already known and understood" I think we would have full 

agreement in principle.  

     There is still the question of practice.  There is a great reluctance 

concerning the earlier stages.  I believe this reluctance is in part 

perfectionism, in part it is simply the avoidance of what can be a lot of work ("no 

special effort has been made to go back farther than Ur III"), and partly it is 

a consequence of using blanket terms to cover a complex and multi-faceted 

situation.  Instead of using a word like "archaic", I believe our principle should 

simply be the one stated in the proposal, that we take into account all 

secure knowledge of all eras from the beginning of the encoding process.  That 

allows us to use late "dialects" of cuneiform and early attestations of cuneiform 

(including its earliest forms produced not using wedge tools) whenever we have 

solid information, and to disregard them when we do not.  No one is compelled 

to work on the earliest levels, or on the latest for that matter, but we 

should not throw away information which is easily available.  Nor is absolute 

comprehensiveness a goal -- "comprehensiveness at the level of pre-Old-Akkadian 

periods is not appropriate given the current state of paleographic research" is 

certainly a common-sense statement.  On the other hand, a large proportion of 

pre-Old-Akkadian signs are known and understood, including some which function 

differently from later signs.

     To support this inclusion of more solid knowledge, I have been pressing 

forward more rapidly with a complete concordance to all eras of cuneiform, and 

expect a nearly publishable version will be done by the end of December.  

Here follow, as illustrations, comments about what is known of two features of 

older signs, to clarify why they are *not* a problem.



***



8. "Turned" signs

In the Uruk IV stage, a number of signs occur turned 90 or 180 degrees, or 45 

degrees.  The vast majority of such turned signs do not occur later; they 

occur only in that oldest layer with substantial writing, Uruk IV. (R. Englund 

statement).  The vast majority of turned signs appear never to have expressed 

any distinction in content.  They were merely random variation or adjustment of 

their glyphic shape to better fit their context.  One possible method to 

represent this is to have a few COMBINING CHARACTERS (turned 90 degrees, 45 

degrees, 180 degrees) or the existing variant-selection characters, which can be used 

as needed, and which can be disregarded if as appears to be the case they do 

not convey differences of content.  We can quite easily list the few cases 

where turning does create a separate sign, as $E (vegetation, originally upright) 

vs. harvested vegetation (originally horizontal); or a reversed hand with a 

meaning referring to the left-hand, or the like.  The number of unclear cases, 

where we can't tell whether we have any substantive evidence for a significant 

distinction, is tiny.  This paragraph is intended to implement what I agree 

with Englund is an approach placing high value on security of analysis, and the 

provision of a mechanism to make additional distinctions which *may* be 

significant although we do not yet want to attribute status as fully independent 

signs.
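
The idea of turning marks that "can be disregarded" may be sketched as follows in Python. The code points are hypothetical Private Use Area values and the names TURN_45, TURN_90, TURN_180, SIGN_A and SIGN_B are placeholders; whether the mechanism would be combining characters or variant selectors is left open here, as it is above. The few cases where turning creates a separate sign are assumed to have their own code points and so are unaffected.

```python
# Hypothetical code points (Private Use Area), for illustration only.
TURN_45, TURN_90, TURN_180 = "\uE002", "\uE003", "\uE004"
TURN_MARKS = {TURN_45, TURN_90, TURN_180}
SIGN_A, SIGN_B = "\uE060", "\uE061"

def ignore_turning(text):
    """Remove turning marks, keeping only the signs they were attached to."""
    return "".join(ch for ch in text if ch not in TURN_MARKS)

def same_content(a, b):
    """Compare two encoded texts while disregarding how signs are turned."""
    return ignore_turning(a) == ignore_turning(b)

# The same two signs, once with SIGN_A turned 90 degrees and once upright.
as_written = SIGN_A + TURN_90 + SIGN_B
upright = SIGN_A + SIGN_B
```

A transcription can thus record the turning faithfully, while searching and comparison treat `as_written` and `upright` as the same content.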



9.  "Fused" signs in the archaic typography of Uruk.

It is possible to distinguish even in archaic cuneiform between freely 

recombining signs or sign components, and those combinations which are fused in some 

way (which together essentially constitute a single sign), as  NAMESHDA, NAM 

x SHE, etc.  Citations will be provided to the specialist literature.



Similarly, other questions in archaic signs are not difficult to handle.  

Some comments from Englund (1998) on what he considers conservative choices in 

these matters, and what is known, will be available by noon 4 November on the 

web page http://www.CuneiformSigns.org/Strategy.htm



***



10.  Implementation specifications are needed.

The major lack I see currently in preparation of the encoding proposal is 

the lack of a draft of implementation guidelines. The task of preparing such 

guidelines will bring to the attention of all involved a number of technical and 

structural questions which it is otherwise much too easy to pass by unnoticed, 

and questions about individual characters.



Here are a couple of sentences from the proposal which have implications for 

implementation, but whose implications I think have not yet been discussed, at 

least not publicly.  The two following sentences both apply to splits of what 

was one character into two significantly distinct characters, but they 

recommend different implementations.  There is nothing wrong with having these two 

different implementation strategies available, because there are situations 

appropriate to each.  But the decision will have to be made for each split or 

merger, and specialists will need to consider each of these.  Perhaps 

generalizations can be discovered which work, but it can be dangerous if we let the 

generalizations themselves become the goal, rather than the goal being the best 

implementation for each split and each merger on its own terms.  Facing such 

implementation questions is simply one of the consequences of attempting an 

encoding across historical changes such as splits and mergers.  



(a)  "[mergers and splits] must be encoded at the point of maximum 

differentiation, with reduplication of glyphs as necessary in other periods"

(b)  "Glyph variants such as TA*, a Middle Assyrian form of the sign TA which 

in Neo-Assyrian usage has its own logographic interpretation, will be 

assigned their own code positions, to be used only when the new interpretation 

applies."



For sentence (a), if signs S1 and S2 are distinct at one period, then in 

periods where they are not distinct, the single glyphic rendering shall be 

duplicated and used for both characters.  Users would of course have to choose which 

of the two characters to input based on their knowledge of the distinction in 

texts other than the one they are inputting.



For sentence (b), the character code for S2 (TA*) shall not be used except 

for those texts where S1 (TA) and S2 (TA*) are significantly distinct.
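
The difference between the two strategies can be made concrete with a small Python sketch. It is a sketch under stated assumptions rather than a proposed implementation: the code points are hypothetical Private Use Area values, the period labels and glyph names are invented, and the check for strategy (b) is only one way such a guideline might be enforced.

```python
# Hypothetical code points (Private Use Area), for illustration only.
S1, S2 = "\uE070", "\uE071"   # e.g. TA and TA*

# Strategy (a): both characters are always available; in a period where they
# are not distinct, the font simply duplicates one glyph for both.
FONT = {
    "period_merged": {S1: "glyph_ta", S2: "glyph_ta"},
    "period_split": {S1: "glyph_ta", S2: "glyph_ta_star"},
}

def render(text, period):
    """Map each character to the glyph used for it in the given period."""
    return [FONT[period][ch] for ch in text]

# Strategy (b): S2 may be used only in periods where it is distinct from S1.
DISTINCT_IN = {S2: {"period_split"}}

def misused(text, period):
    """List characters that strategy (b) would not allow in this period."""
    return [ch for ch in text
            if ch in DISTINCT_IN and period not in DISTINCT_IN[ch]]

text = S1 + S2
```

Under (a), `render(text, "period_merged")` shows the same glyph twice and the choice between S1 and S2 rests on the user's knowledge; under (b), `misused(text, "period_merged")` reports S2 as out of place.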



***



Best wishes,

Lloyd Anderson

Ecological Linguistics

PO Box 15156

Washington DC 20003

(202) 547-7683

ecoling@aol.com