RE: Short verbal IDs for UCS characters

From: Marco.Cimarosti@icl.com
Date: Fri Oct 08 1999 - 04:32:17 EDT


Gregg Reynolds actually suggested this, not I.
I liked his idea and just added some comments.

The "Universal Description Sequences" definition is mine, but it was
intended just as a *joke*, modeled on the "Ideographic Description
Sequences" (IDS) that the list has been discussing lately. Sorry if the
humor was obscure.

Regards. Marco

> -----Original Message-----
> From: 10646er@sesame.demon.co.uk [SMTP:10646er@sesame.demon.co.uk]
> Sent: 1999 October 06, Wednesday 00.45
> To: Unicode List
> Cc: 10646er@sesame.demon.co.uk
> Subject: Short verbal IDs for UCS characters
>
> Via the Unicode List <unicode@unicode.org> on Monday, 4 October 1999
> Marco Cimarosti <Marco.Cimarosti@icl.com> suggested that
> "a sort of UDS = Universal Description Sequence" might be of use, and
> Gregg Reynolds <greynolds@datalogics.com> asked similarly:
> "Has this been done before? Would anybody other than me find this
> useful?"
>
> I think this sort of thing is useful, and it's worth flinging a few
> ideas around, as identifiers of the form U+hhhh (hex identifiers) are
> not always meaningful out of context.
>
> However, I think it's also worth trying to relate any short verbal IDs
> for UCS characters to the UCS character names themselves.
>
> I think the scheme below manages to be brief, predictable and generally
> meaningful in context; it relates to UCS character names, can be machine
> generated, and is reversible to full character names.
>
> Other views, pro or con, will be welcome.
>
> I am not suggesting that
>
> I'd be interested in any reactions as to whether the following, for
> instance, carries any obvious meaning (or not) and whether it has any
> general utility. The ideas below cover only a small proportion of
> the UCS repertoire: they may or may not be extensible to other parts
> of the UCS repertoire.
>
> I look forward to any reactions with interest: example follows ---- .
> You'll need to view it with a mono-spaced font that comfortably
> handles 80-character lines, ideally.
>
> Best regards
>
> John Clews
>
> ---------------------------------------------------------------------
> Cyrillic transliteration tables: practical examples of short IDs.
>
> This table shows lower-case letters in ISO 9:1995(E), Table 1, in
> a Pan-Cyrillic order. In the published standard, the columns Cyrillic ID
> and Latin ID are replaced by specific Cyrillic or Latin characters.
>
> +--------------------------------------------------------------------+
> | No.  Source  Target  Cyrillic      Latin       Examples/Comments   |
> |      ID      ID      ID            ID                              |
> +--------------------------------------------------------------------+
>
>        +0430   +0061   Cy_a          a
>        +0431   +0062   Cy_be         b
>        +0432   +0076   Cy_ve         v
>        +0433   +0067   Cy_ghe        g
>        +0434   +0064   Cy_de         d
>        +0452   +0111   Cy_dje        d_stro
>        +0453   +01F5   Cy_gje        g_acut
>        +0435   +0065   Cy_ie         e
>        +0451   +00EB   Cy_io         e_diae
>        +0454   +00EA   Cy_uk-ie      e_circ
>        +0436   +017E   Cy_zhe        z_caro
>        +0437   +007A   Cy_ze         z
>        +0455   +1E91   Cy_dze        z_circ
>        +0438   +0069   Cy_i          i
>        +0456   +00EC   Cy_be-uk-i    i_grav
>        +0457   +00EF   Cy_yi         i_diae
>        +0458   +01F0 * Cy_je         j_caro
>        +0439   +006A   Cy_short_i    j
>        +043A   +006B   Cy_ka         k
>        +043B   +006C   Cy_el         l
>        +0459   +XX     Cy_lje        l_circ
>        +043C   +006D   Cy_em         m
>        +043D   +006E   Cy_en         n
>        +045A   +XX     Cy_nje        n_circ
>        +043E   +006F   Cy_o          o
>        +043F   +0070   Cy_pe         p
>        +0440   +0072   Cy_er         r
>        +0441   +0073   Cy_es         s
>        +0442   +0074   Cy_te         t
>        +045B   +0107   Cy_tshe       c_acut
>        +045C   +1E31   Cy_kje        k_acut
>        +0443   +0075   Cy_u          u
>        +045E   +01D4   Cy_shor_u     u_caro
>        +0444   +0066   Cy_ef         f
>        +0445   +0068   Cy_ha         h
>        +0446   +0063   Cy_tse        c
>        +0447   +010D   Cy_che        c_caro
>        +045F   +XX     Cy_dzhe       d_circ
>        +0448   +0161   Cy_sha        s_caro
>        +0449   +015D   Cy_shcha      s_circ
>        +044A   +0022   Cy_hard_sign  quot_mark
>        +044B   +0079   Cy_yeru       y
>        +044C   +0027   Cy_soft_sign  apos
>        +044D   +00E8   Cy_e          e_grav
>        +044E   +00FB   Cy_yu         u_circ
>        +044F   +00E2   Cy_ya         a_circ
>
>
> +XX = Not in ISO/IEC 10646 as precomposed characters, although they
> are available as level 3 characters of ISO/IEC 10646 through
> use of combining sequences.
>
> * = Capital J_caro is only available as a level 3 character of
> ISO/IEC 10646, as the combining sequence [J] + [caro]
>
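As an illustration of how the Source ID and Target ID columns could drive an
actual conversion, here is a minimal Python sketch. It copies only a handful
of rows from the table above. The names ROWS, SEQUENCES and transliterate are
hypothetical, and writing the +XX targets as a base letter followed by
U+0302 COMBINING CIRCUMFLEX ACCENT is an assumption based on the "combining
sequences" footnote, not something the table itself spells out.

```python
# (source, target) code points copied from a few rows of the table.
ROWS = [
    (0x0430, 0x0061),  # Cy_a   -> a
    (0x0431, 0x0062),  # Cy_be  -> b
    (0x0432, 0x0076),  # Cy_ve  -> v
    (0x0443, 0x0075),  # Cy_u   -> u
    (0x0436, 0x017E),  # Cy_zhe -> z_caro
    (0x0448, 0x0161),  # Cy_sha -> s_caro
]

COMBINING_CIRCUMFLEX = "\u0302"

# Assumption: the +XX rows are rendered as base letter + combining circumflex.
SEQUENCES = {
    0x0459: "l" + COMBINING_CIRCUMFLEX,  # Cy_lje  -> l_circ
    0x045A: "n" + COMBINING_CIRCUMFLEX,  # Cy_nje  -> n_circ
    0x045F: "d" + COMBINING_CIRCUMFLEX,  # Cy_dzhe -> d_circ
}

# str.translate accepts a dict keyed by code point with string values.
TABLE = {src: chr(dst) for src, dst in ROWS}
TABLE.update(SEQUENCES)


def transliterate(text):
    """Replace each Cyrillic letter found in TABLE by its Latin target."""
    return text.translate(TABLE)


# Cy_lje Cy_u Cy_be Cy_a Cy_ve
print(transliterate("\u0459\u0443\u0431\u0430\u0432"))
```

Characters missing from this partial table pass through unchanged, so a
complete converter would need every row of the table.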
> ---------------------------------------------------------------------
> Method for deriving short IDs from the letter names in ISO/IEC 10646:
>
>
> CAPITAL LETTER retains its letter element in capitals and the term
> CAPITAL LETTER is dropped;
>
> SMALL LETTER changes its letter element to small equivalents and the term
> SMALL LETTER is dropped.
>
> In all other occurrences the words SMALL, CAPITAL, LETTER, ACCENT, WITH,
> AND and BY are dropped.
>
> All elements except the actual letter element (e.g. AE above) are in small
> letters.
>
> An underline character ( _ ) is used to separate elements in place of
> spaces; it may be possible to drop this in databases etc. (but with less
> readability).
>
> 4 letters is the normal length for elements in IDs, except for 1-, 2- and
> 3-letter words. 1-, 2- and 3-letter abbreviations are also used: these use
> a hyphen.
>
>
> Notes:
>
> 1. Single-letter abbreviations (with hyphen) are mainly positional.
>
> -a for above (e.g. dot-a for DOT ABOVE)
> -b for below
> m- for middle (e.g. m-dot for MIDDLE DOT)
> v- for vertical
> i- inverted
> l- left
> r- right
> s- small
>
> This is the complete list of single-letter abbreviations.
>
> 2. Two-letter codes are only used for Script codes or Language codes
> (e.g. 'Cy_be-uk-I' for "CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I").
>
> (a) Script codes (start of string; always 1 capital, 1 small, 1 underline)
>
> Gr_ for Greek; Cy_ for Cyrillic; Am_ for Armenian;
> Ge_ for Georgian; He_ for Hebrew; etc. Latin: left blank
>
> (following usage in ISO/TC46/SC2 email survey, June/July 1996)
>
> (b) Language codes (always middle of string; always 2 smalls, 1 hyphen)
>
> Language codes are taken from ISO 639: e.g.
>
> be- Byelorussian
> uk- Ukrainian
>
>
> 3. Three-letter codes (without hyphen) are for 3-letter words, e.g.
>
> dot for DOT, leg for LEG, bar for BAR or eth for ETH.
>
> Three-letter codes (with hyphen) are for 3-letter abbreviations (mainly
> phonetic descriptions), which most users will rarely use, e.g.
>
> den- DENTAL
> lat- LATERAL
> alv- ALVEOLAR
> ret- RETROFLEX
> glo- GLOTTAL
> bil- BILABIAL
> pha- PHARYNGEAL
> voi- VOICED
> fri- FRICATIVE
> pal- PALATAL
>
> None of these are used in this table.
>
> 4. Four-letter codes are for 4-letter words
>
> e.g. left, half, ring, stop, curl, tail, sign, open, baby, long
>
> and abbreviations of 5-letter words
>
> e.g.
>
> lowe lower
> brev breve
> fina final
> acut acute
> grav grave
>
> and abbreviations of larger words
>
> e.g.
>
> desc for DESCENDER
> dotl for DOTLESS
> digr for DIGRAPH
> liga for LIGATURE
> reve for REVERSED
> apos for APOSTROPHE
> scri for SCRIPT
> clos for CLOSED
> diae for DIAERESIS
> stro for STROKE
> symb for SYMBOL
> circ for CIRCUMFLEX
> cedi for CEDILLA
> macr for MACRON
> modi for MODIFIER
> ogon for OGONEK
> prec for PRECEDED
> ques for QUESTION
> excl for EXCLAMATION
> abbr for ABBREVIATION
> punc for PUNCTUATION
> turn for TURNED
> cros for CROSSED
>
> ---------------------------------------------------------------------
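The derivation rules above can be sketched in Python roughly as follows. This
is only one reading of the method, under stated assumptions: names have the
shape SCRIPT (CAPITAL|SMALL) LETTER words [WITH modifiers]; the last word
before WITH is taken as the letter element and is not abbreviated; every
other word is cut to its first four letters; and only the -a and -b
positional suffixes are handled. The function name short_id and the lookup
tables are hypothetical.

```python
# Script prefixes and language codes listed in the notes (Latin is left blank).
SCRIPTS = {"LATIN": "", "GREEK": "Gr_", "CYRILLIC": "Cy_",
           "ARMENIAN": "Am_", "GEORGIAN": "Ge_", "HEBREW": "He_"}
LANGUAGES = {"BYELORUSSIAN": "be", "UKRAINIAN": "uk"}
DROPPED = {"SMALL", "CAPITAL", "LETTER", "ACCENT", "WITH", "AND", "BY"}
SUFFIXES = {"ABOVE": "-a", "BELOW": "-b"}  # positional, hyphen-attached


def abbreviate(word):
    """Lower-case an element and cut it to at most four letters."""
    return word[:4].lower()


def short_id(name):
    """Derive a short ID from a letter name (assumed shape, see lead-in)."""
    words = name.split()
    if (len(words) < 4 or words[0] not in SCRIPTS
            or words[1] not in ("CAPITAL", "SMALL") or words[2] != "LETTER"):
        raise ValueError("unsupported name shape: " + name)
    is_capital = words[1] == "CAPITAL"
    rest = words[3:]
    if "WITH" in rest:
        cut = rest.index("WITH")
        head, tail = rest[:cut], rest[cut + 1:]
    else:
        head, tail = rest, []
    if not head:
        raise ValueError("no letter element in: " + name)

    elements = []
    pending = ""  # language codes waiting to be attached to the next element
    for word in head[:-1]:
        parts = word.split("-")
        if all(part in LANGUAGES for part in parts):
            pending += "".join(LANGUAGES[part] + "-" for part in parts)
        else:
            elements.append(pending + abbreviate(word))
            pending = ""
    # Assumption: the letter element keeps its full length; only case changes.
    letter = head[-1] if is_capital else head[-1].lower()
    elements.append(pending + letter)

    for word in tail:
        if word in DROPPED:
            continue
        if word in SUFFIXES:
            elements[-1] += SUFFIXES[word]
        else:
            elements.append(abbreviate(word))
    return SCRIPTS[words[0]] + "_".join(elements)


for example in ("CYRILLIC SMALL LETTER BE",
                "LATIN SMALL LETTER E WITH DIAERESIS",
                "CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I"):
    print(example, "->", short_id(example))
```

Under these assumptions the sketch reproduces IDs such as Cy_be, e_diae and
Cy_be-uk-I from the message. It yields Cy_shor_u for SHORT U but would give
Cy_shor_i rather than the table's Cy_short_i, so the table and the four-letter
rule do not fully agree on that entry.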
>
> --
> John Clews, SESAME Computer Projects, 8 Avenue Rd, Harrogate, HG2 7PG
> Email: 10646er@sesame.demon.co.uk; tel: 0171 272 8397 & 01423 888 432
> Committee Chair of ISO/TC46/SC2: Conversion of Written Languages;
> Committee Member of ISO/IEC/JTC1/SC22/WG20: Internationalization;
> Committee Member of CEN/TC304: Information and Communications
> Technologies: European Localization Requirements
> Committee Member of the Foundation for Endangered Languages;
> Committee Member of ISO/IEC/JTC1/SC2: Coded Character Sets


