Short verbal IDs for UCS characters

From: John Clews (10646er@sesame.demon.co.uk)
Date: Tue Oct 05 1999 - 17:45:58 EDT


Via the Unicode List <unicode@unicode.org> on Monday, 4 October 1999
Marco Cimarosti <Marco.Cimarosti@icl.com> suggested that
"a sort of UDS = Universal Description Sequence" might be of use, and
Gregg Reynolds <greynolds@datalogics.com> asked similarly:
"Has this been done before? Would anybody other than me find this useful?"

I think this sort of thing is useful, and it's worth flinging a few
ideas around, as identifiers of the form U+hhhh (hex identifiers) are
not always meaningful out of context.

However, I think it's also worth trying to relate any short verbal IDs
for UCS characters to the UCS character names themselves.

I think the scheme below manages to be brief, generally meaningful in
context, and predictable. It also relates to UCS character names, can
be machine generated, and is reversible to full character names.

Other views, pro or con, will be welcome.

I am not suggesting that

I'd be interested in any reactions as to whether the following, for
instance, carries any obvious meaning (or not) and whether it has any
general utility. The ideas below were only worked out for a small
proportion of the UCS repertoire: they may or may not be extensible to
other parts of it.

I look forward to any reactions with interest: example follows ---- .
You'll need to view it with a mono-spaced font that comfortably
handles 80-character lines, ideally.

Best regards

John Clews

---------------------------------------------------------------------
Cyrillic transliteration tables: practical examples of short IDs.

This table shows lower-case letters in ISO 9:1995(E), Table 1, in
a Pan-Cyrillic order. In the published standard, the columns Cyrillic ID
and Latin ID are replaced by specific Cyrillic or Latin characters.

+--------------------------------------------------------------------------+
| No.   Source  Target  Cyrillic        Latin       Examples/Comments      |
|       ID      ID      ID              ID                                 |
+--------------------------------------------------------------------------+

        +0430   +0061   Cy_a            a
        +0431   +0062   Cy_be           b
        +0432   +0076   Cy_ve           v
        +0433   +0067   Cy_ghe          g
        +0434   +0064   Cy_de           d
        +0452   +0111   Cy_dje          d_stro
        +0453   +01F5   Cy_gje          g_acut
        +0435   +0065   Cy_ie           e
        +0451   +00EB   Cy_io           e_diae
        +0454   +00EA   Cy_uk-ie        e_circ
        +0436   +017E   Cy_zhe          z_caro
        +0437   +007A   Cy_ze           z
        +0455   +1E91   Cy_dze          z_circ
        +0438   +0069   Cy_i            i
        +0456   +00EC   Cy_be-uk-i      i_grav
        +0457   +00EF   Cy_yi           i_diae
        +0458   +01F0 * Cy_je           j_caro
        +0439   +006A   Cy_short_i      j
        +043A   +006B   Cy_ka           k
        +043B   +006C   Cy_el           l
        +0459   +XX     Cy_lje          l_circ
        +043C   +006D   Cy_em           m
        +043D   +006E   Cy_en           n
        +045A   +XX     Cy_nje          n_circ
        +043E   +006F   Cy_o            o
        +043F   +0070   Cy_pe           p
        +0440   +0072   Cy_er           r
        +0441   +0073   Cy_es           s
        +0442   +0074   Cy_te           t
        +045B   +0107   Cy_tshe         c_acut
        +045C   +1E31   Cy_kje          k_acut
        +0443   +0075   Cy_u            u
        +045E   +01D4   Cy_shor_u       u_caro
        +0444   +0066   Cy_ef           f
        +0445   +0068   Cy_ha           h
        +0446   +0063   Cy_tse          c
        +0447   +010D   Cy_che          c_caro
        +045F   +XX     Cy_dzhe         d_circ
        +0448   +0161   Cy_sha          s_caro
        +0449   +015D   Cy_shcha        s_circ
        +044A   +0022   Cy_hard_sign    quot_mark
        +044B   +0079   Cy_yeru         y
        +044C   +0027   Cy_soft_sign    apos
        +044D   +00E8   Cy_e            e_grav
        +044E   +00FB   Cy_yu           u_circ
        +044F   +00E2   Cy_ya           a_circ

+XX = Not in ISO/IEC 10646 as precomposed characters, although they
      are available as level 3 characters of ISO/IEC 10646 through
      use of combining sequences.

* = Capital J_caro only available as level 3 characters of ISO/IEC 10646
      as [J] + [caro]
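
As a rough illustration of how the Source ID and Target ID columns
might be used, here is a minimal Python sketch that transliterates a
string using a handful of rows copied from the table. The names
ISO9_SUBSET and transliterate are hypothetical, and the combining
sequence used for the +XX row (l followed by U+0302 COMBINING
CIRCUMFLEX ACCENT) is an assumption about how the level 3 form would
be spelled, not something the table states.

```python
# Hypothetical subset of the table: Cyrillic code point -> Latin target.
# Only a few rows are copied; the rest would be filled in the same way.
ISO9_SUBSET = {
    0x0430: "\u0061",        # Cy_a   -> a
    0x0431: "\u0062",        # Cy_be  -> b
    0x0436: "\u017E",        # Cy_zhe -> z_caro
    0x043A: "\u006B",        # Cy_ka  -> k
    0x0443: "\u0075",        # Cy_u   -> u
    0x0447: "\u010D",        # Cy_che -> c_caro
    0x0459: "\u006C\u0302",  # Cy_lje -> l_circ (+XX: assumed l + U+0302)
}


def transliterate(text):
    """Replace each character found in the subset; pass the rest through."""
    return "".join(ISO9_SUBSET.get(ord(ch), ch) for ch in text)


# Cy_zhe, Cy_u, Cy_ka -> z_caro, u, k
print(transliterate("\u0436\u0443\u043A"))
```

Keying the mapping on code points rather than on the short IDs keeps
the lookup independent of whichever naming scheme is finally chosen.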

---------------------------------------------------------------------
Method for deriving short IDs from the letter names in ISO/IEC 10646:

CAPITAL LETTER retains its letter element in capitals and the term
        CAPITAL LETTER is dropped;

SMALL LETTER changes its letter element to small equivalents and the term
        SMALL LETTER is dropped.

In all other occurrences the words SMALL, CAPITAL, LETTER, ACCENT, WITH, AND
and BY are dropped.

All elements except the actual letter element (e.g. AE above) are in small
        letters.

An underline character ( _ ) is used to separate elements in place of spaces;
it may be possible to drop this in databases etc. (but with less
readability).

4 letters is the normal length for elements in IDs, except for 1-, 2- and
3-letter words. 1-, 2- and 3-letter abbreviations are also used: these use a
hyphen.
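
The rules above can be sketched in Python roughly as follows. This is
only one reading of them, not a definitive implementation. It handles
only names of the form "<SCRIPT> CAPITAL|SMALL LETTER ...", treats
everything before the first WITH as the letter element, and abbreviates
other words to their first four letters (an assumption that agrees with
the abbreviations listed in note 4). The script, language and -a/-b
codes are those given in notes 1 and 2 further down; all function and
table names are hypothetical.

```python
# Script prefixes from note 2(a); Latin is left blank.
SCRIPT_CODES = {"LATIN": "", "GREEK": "Gr_", "CYRILLIC": "Cy_",
                "ARMENIAN": "Am_", "GEORGIAN": "Ge_", "HEBREW": "He_"}
# Language codes from note 2(b).
LANGUAGE_CODES = {"BYELORUSSIAN": "be", "UKRAINIAN": "uk"}
# Words dropped wherever they occur after the letter element.
DROPPED = {"SMALL", "CAPITAL", "LETTER", "ACCENT", "WITH", "AND", "BY"}
# Positional suffixes from note 1 (prefix forms such as m- are not handled).
POSITION_SUFFIX = {"ABOVE": "-a", "BELOW": "-b"}


def short_id(name):
    """Derive a short ID from a '<SCRIPT> CAPITAL|SMALL LETTER ...' name."""
    script, case, keyword, *rest = name.split()
    if keyword != "LETTER" or case not in ("CAPITAL", "SMALL"):
        raise ValueError("only CAPITAL LETTER / SMALL LETTER names are handled")

    # Assumption: everything before the first WITH is the letter element.
    cut = rest.index("WITH") if "WITH" in rest else len(rest)
    letter_words, modifiers = rest[:cut], rest[cut:]

    language_prefix = ""
    elements = []
    for word in letter_words:
        parts = word.split("-")
        if all(part in LANGUAGE_CODES for part in parts):
            # e.g. BYELORUSSIAN-UKRAINIAN -> "be-uk-"
            language_prefix += "".join(LANGUAGE_CODES[p] + "-" for p in parts)
        else:
            # The letter element keeps capitals only for CAPITAL LETTER.
            elements.append(word if case == "CAPITAL" else word.lower())
    if elements:
        elements[0] = language_prefix + elements[0]

    for word in modifiers:
        if word in DROPPED:
            continue
        if word in POSITION_SUFFIX and elements:
            elements[-1] += POSITION_SUFFIX[word]  # DOT ABOVE -> dot-a
        else:
            elements.append(word.lower()[:4])      # CIRCUMFLEX -> circ
    return SCRIPT_CODES[script] + "_".join(elements)


for name in ("LATIN SMALL LETTER E WITH CIRCUMFLEX",
             "CYRILLIC SMALL LETTER ZHE",
             "CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I"):
    print(short_id(name))
```

The three names in the loop come out as e_circ, Cy_zhe and Cy_be-uk-I.
Multi-word letter names are kept whole in this sketch, so SHORT U
would give Cy_short_u rather than the Cy_shor_u form in the table.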

Notes:

1. Single-letter abbreviations (with hyphen) are mainly positional.

       -a for above (e.g. dot-a for DOT ABOVE)
       -b for below
        m- for middle (e.g. m-dot for MIDDLE DOT)
        v- for vertical
        i- for inverted
        l- for left
        r- for right
        s- for small

   This is the complete list of single-letter abbreviations.

2. Two-letter codes are only used for Script codes or Language codes
    (e.g. 'Cy_be-uk-I' for "CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I").

(a) Script codes (start of string; always 1 capital, 1 small, 1 underline)

Gr_ for Greek; Cy_ for Cyrillic; Am_ for Armenian;
Ge_ for Georgian; He_ for Hebrew; etc. Latin: left blank

(following usage in ISO/TC46/SC2 email survey, June/July 1996)

(b) Language codes (always middle of string; always 2 smalls, 1 hyphen)

Language codes are taken from ISO 639: e.g.

be- Byelorussian
uk- Ukrainian

3. Three-letter codes (without hyphen) are for 3-letter words, e.g.

   dot for DOT, leg for LEG, bar for BAR or eth for ETH.

   Three-letter codes (with hyphen) are for 3-letter abbreviations (mainly
   phonetic descriptions), which most users will rarely use, e.g.

        den- DENTAL
        lat- LATERAL
        alv- ALVEOLAR
        ret- RETROFLEX
        glo- GLOTTAL
        bil- BILABIAL
        pha- PHARYNGEAL
        voi- VOICED
        fri- FRICATIVE
        pal- PALATAL

   None of these are used in this table.

4. Four-letter codes are for 4-letter words

   e.g. left, half, ring, stop, curl, tail, sign, open, baby, long

   and abbreviations of 5-letter words

   e.g.

     lowe for LOWER
     brev for BREVE
     fina for FINAL
     acut for ACUTE
     grav for GRAVE
        
   and abbreviations of larger words

   e.g.

     desc for DESCENDER
     dotl for DOTLESS
     digr for DIGRAPH
     liga for LIGATURE
     reve for REVERSED
     apos for APOSTROPHE
     scri for SCRIPT
     clos for CLOSED
     diae for DIAERESIS
     stro for STROKE
     symb for SYMBOL
     circ for CIRCUMFLEX
     cedi for CEDILLA
     macr for MACRON
     modi for MODIFIER
     ogon for OGONEK
     prec for PRECEDED
     ques for QUESTION
     excl for EXCLAMATION
     abbr for ABBREVIATION
     punc for PUNCTUATION
     turn for TURNED
     cros for CROSSED

---------------------------------------------------------------------

--
John Clews, SESAME Computer Projects, 8 Avenue Rd, Harrogate, HG2 7PG
Email: 10646er@sesame.demon.co.uk; tel: 0171 272 8397 & 01423 888 432
Committee Chair of  ISO/TC46/SC2: Conversion of Written Languages;
Committee Member of ISO/IEC/JTC1/SC22/WG20: Internationalization;
Committee Member of CEN/TC304: Information and Communications
 Technologies: European Localization Requirements
Committee Member of the Foundation for Endangered Languages;
Committee Member of ISO/IEC/JTC1/SC2: Coded Character Sets


