RE: short Unicode names?

From: Nobuo(SOTECH Nobuo) (nobuo@sotech.com.cn)
Date: Sun Jan 04 1998 - 20:50:11 EST


Dear Sir,

I don't know what's happening, but we have received more than
180 emails (all with the same content) from you in the past 4
days. We don't know how to stop this.

> ----------
> From: 10646er@sesame.demon.co.uk[SMTP:10646er@sesame.demon.co.uk]
> Reply To: unicode@unicode.org
> Sent: Sunday, January 4, 1998 1:04 AM
> To: Multiple Recipients of
> Subject: Re: short Unicode names?
>
> In message <9801011949.AA07010@unicode.org> Werner Lemberg writes via
> unicode@unicode.org:
>
> > Is there an algorithm to convert long Unicode names like 'LATIN
> > CAPITAL LETTER A WITH ACUTE' into short Adobe-ish names like
> > 'Aacute'?
> >
> > By `short' I mean a name not longer than about 32 characters and
> > with no spaces in it.
> >
> > Or are there already short Unicode names defined? U+00C1 is not very
> > descriptive...
>
> The following short IDs, which match your specifications, have been
> used in email documents by those active in ISO/TC46/SC2 and its
> working groups to provide readable text that uses only 7-bit
> characters. They are shorter than 32 characters and can survive any
> distortion that may arise when they pass through 7-bit character
> mechanisms along the way.
>
> Because they are directly related to character names in UCS (ISO/IEC
> 10646 and Unicode) it is possible to generate these by algorithm, and
> also to produce short IDs that can be reversed to their authentic
> character name in ISO/IEC 10646 and Unicode.
>
> Examples in more detail are given below.
>
> John Clews
>
> * * * * * * * *
>
> Cyrillic transliteration tables: practical examples of short IDs.
>
> This table shows lower-case letters in ISO 9:1995(E), Table 1, in
> a Pan-Cyrillic order. In the published standard, the columns Cyrillic
> ID and Latin ID will be replaced by specific Cyrillic or Latin
> characters.
>
> +--------------------------------------------------------------------------+
> | No.  Source  Target    Cyrillic        Latin        Examples/Comments    |
> |      ID      ID        ID              ID                                |
> +--------------------------------------------------------------------------+
>
>        +0430   +0061     Cy_a            a
>        +0431   +0062     Cy_be           b
>        +0432   +0076     Cy_ve           v
>        +0433   +0067     Cy_ghe          g
>        +0434   +0064     Cy_de           d
>        +0452   +0111     Cy_dje          d_stro
>        +0453   +01F5     Cy_gje          g_acut
>        +0435   +0065     Cy_ie           e
>        +0451   +00EB     Cy_io           e_diae
>        +0454   +00EA     Cy_uk-ie        e_circ
>        +0436   +017E     Cy_zhe          z_caro
>        +0437   +007A     Cy_ze           z
>        +0455   +1E91     Cy_dze          z_circ
>        +0438   +0069     Cy_i            i
>        +0456   +00EC     Cy_be-uk-i      i_grav
>        +0457   +00EF     Cy_yi           i_diae
>        +0458   +01F0 *   Cy_je           j_caro
>        +0439   +006A     Cy_short_i      j
>        +043A   +006B     Cy_ka           k
>        +043B   +006C     Cy_el           l
>        +0459   +XX       Cy_lje          l_circ
>        +043C   +006D     Cy_em           m
>        +043D   +006E     Cy_en           n
>        +045A   +XX       Cy_nje          n_circ
>        +043E   +006F     Cy_o            o
>        +043F   +0070     Cy_pe           p
>        +0440   +0072     Cy_er           r
>        +0441   +0073     Cy_es           s
>        +0442   +0074     Cy_te           t
>        +045B   +0107     Cy_tshe         c_acut
>        +045C   +1E31     Cy_kje          k_acut
>        +0443   +0075     Cy_u            u
>        +045E   +01D4     Cy_shor_u       u_caro
>        +0444   +0066     Cy_ef           f
>        +0445   +0068     Cy_ha           h
>        +0446   +0063     Cy_tse          c
>        +0447   +010D     Cy_che          c_caro
>        +045F   +XX       Cy_dzhe         d_circ
>        +0448   +0161     Cy_sha          s_caro
>        +0449   +015D     Cy_shcha        s_circ
>        +044A   +0022     Cy_hard_sign    quot_mark
>        +044B   +0079     Cy_yeru         y
>        +044C   +0027     Cy_soft_sign    apos
>        +044D   +00E8     Cy_e            e_grav
>        +044E   +00FB     Cy_yu           u_circ
>        +044F   +00E2     Cy_ya           a_circ
>
>
> +XX = Not in ISO/IEC 10646
>
> * Capital J_caro is only available as level 3 characters of ISO/IEC
>   10646, as [J] + [caro]
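A minimal Python sketch of how a table like the one above might be applied. It assumes only a handful of source/target pairs copied from the rows above; the names ISO9_SAMPLE and transliterate are illustrative and not part of ISO 9 or of the posting.

```python
# Cyrillic -> Latin code point pairs copied from a few rows of the table.
# This is a sample, not the full ISO 9 mapping.
ISO9_SAMPLE = {
    0x0430: 0x0061,  # Cy_a   -> a
    0x0432: 0x0076,  # Cy_ve  -> v
    0x043A: 0x006B,  # Cy_ka  -> k
    0x043C: 0x006D,  # Cy_em  -> m
    0x043E: 0x006F,  # Cy_o   -> o
    0x0441: 0x0073,  # Cy_es  -> s
    0x0447: 0x010D,  # Cy_che -> c_caro
    0x0448: 0x0161,  # Cy_sha -> s_caro
}


def transliterate(text: str, table: dict = ISO9_SAMPLE) -> str:
    """Replace each mapped character; unmapped characters pass through."""
    return text.translate(table)
```

Because the mapping is one code point to one code point, it can in principle be inverted by swapping keys and values, which is the property the short IDs are meant to preserve.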
>
>
>
> Method for deriving short IDs from the letter names in ISO/IEC 10646:
>
>
> CAPITAL LETTER retains its letter element in capitals and the term
> CAPITAL LETTER is dropped;
>
> SMALL LETTER changes its letter element to small equivalents and the
> term SMALL LETTER is dropped.
>
> In all other occurrences the words SMALL, CAPITAL, LETTER, ACCENT,
> WITH, AND and BY are dropped.
>
> All elements except the actual letter element (e.g. AE above) are in
> small letters.
>
> An underline character ( _ ) is used to separate elements in place of
> spaces; it may be possible to drop this in databases etc. (but with
> less readability).
>
> 4 letters is the normal length for elements in IDs, except for 1-, 2-
> and 3-letter words. 1-, 2- and 3-letter abbreviations are also used:
> these use a hyphen.
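The method above could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the procedure actually used in ISO/TC46/SC2: the function name short_id is hypothetical, a leading LATIN is simply stripped (the posting leaves the Latin script code blank), and hyphenated abbreviations and other script codes are not handled.

```python
# Words the posting says are dropped when they are not part of
# "CAPITAL LETTER x" or "SMALL LETTER x".
DROPPED = {"SMALL", "CAPITAL", "LETTER", "ACCENT", "WITH", "AND", "BY"}


def short_id(name: str) -> str:
    """Derive a short ID from a UCS character name (simplified sketch)."""
    words = name.split()
    if words and words[0] == "LATIN":  # assumption: Latin prefix left blank
        words = words[1:]
    parts = []
    i = 0
    while i < len(words):
        word = words[i]
        is_case = word in ("CAPITAL", "SMALL")
        if is_case and i + 2 < len(words) and words[i + 1] == "LETTER":
            letter = words[i + 2]
            # Capital keeps the letter element in capitals, small lowers it.
            parts.append(letter if word == "CAPITAL" else letter.lower())
            i += 3
        elif word in DROPPED:
            i += 1
        else:
            # Other elements: small letters, at most four of them.
            parts.append(word[:4].lower())
            i += 1
    return "_".join(parts)
```

With these rules, short_id("LATIN CAPITAL LETTER A WITH ACUTE ACCENT") gives "A_acut" and short_id("POUND SIGN") gives "poun_sign".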
>
>
> Notes:
>
> 1. Single-letter abbreviations (with hyphen) are mainly positional.
>
> -a for above (e.g. dot-a for DOT ABOVE)
> -b for below
> m- for middle (e.g. m-dot for MIDDLE DOT)
> v- for vertical
> i- inverted
> l- left
> r- right
> s- small
>
> This is the complete list of single-letter abbreviations.
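As a rough Python illustration of note 1, the positional words can be treated as either a suffix (-a, -b) attached to the preceding element or a prefix (m-, v-, i-, l-, r-, s-) attached to the following one. The function name positional is hypothetical, and the sketch assumes the input contains only positional words and ordinary name elements.

```python
# Single-letter abbreviations listed in note 1.
SUFFIX = {"ABOVE": "a", "BELOW": "b"}
PREFIX = {"MIDDLE": "m", "VERTICAL": "v", "INVERTED": "i",
          "LEFT": "l", "RIGHT": "r", "SMALL": "s"}


def positional(name: str) -> str:
    """Apply the single-letter positional abbreviations to a name."""
    parts = []
    pending = ""  # prefix waiting to be attached to the next element
    for word in name.split():
        if word in PREFIX:
            pending = PREFIX[word] + "-"
        elif word in SUFFIX and parts:
            parts[-1] += "-" + SUFFIX[word]
        else:
            parts.append(pending + word[:4].lower())
            pending = ""
    return "_".join(parts)
```

For example, positional("DOT ABOVE") gives "dot-a" and positional("INVERTED EXCLAMATION MARK") gives "i-excl_mark".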
>
> 2. Two-letter codes are only used for Script codes or Language codes
>    (e.g. 'Cy_be-uk-I' for "CYRILLIC CAPITAL LETTER
>    BYELORUSSIAN-UKRAINIAN I").
>
> (a) Script codes (start of string; always 1 capital, 1 small, 1
>     underline)
>
> Gr_ for Greek; Cy_ for Cyrillic; Am_ for Armenian;
> Ge_ for Georgian; He_ for Hebrew; etc. Latin: left blank
>
> (following usage in ISO/TC46/SC2 email survey, June/July 1996)
>
> (b) Language codes (always middle of string; always 2 smalls, 1 hyphen)
>
> Language codes are taken from ISO 639: e.g.
>
> be- Byelorussian
> uk- Ukrainian
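A small Python sketch of how the script and language codes in note 2 might combine for simple letter names. The function name script_letter_id is hypothetical; only the script and language codes quoted above are included, and names with other qualifiers (such as SHORT) are deliberately not handled and raise KeyError.

```python
# Script and language codes quoted in note 2; Latin is left blank.
SCRIPT = {"GREEK": "Gr", "CYRILLIC": "Cy", "ARMENIAN": "Am",
          "GEORGIAN": "Ge", "HEBREW": "He", "LATIN": ""}
LANGUAGE = {"BYELORUSSIAN": "be", "UKRAINIAN": "uk"}


def script_letter_id(name: str) -> str:
    """Handle names of the form '<SCRIPT> CAPITAL|SMALL LETTER [LANG] X'."""
    script, case, keyword, *rest = name.split()
    if case not in ("CAPITAL", "SMALL") or keyword != "LETTER" or not rest:
        raise ValueError(f"unsupported name: {name!r}")
    *qualifiers, letter = rest
    letter = letter if case == "CAPITAL" else letter.lower()
    # BYELORUSSIAN-UKRAINIAN becomes the hyphenated codes 'be-uk-'.
    langs = [LANGUAGE[q] for group in qualifiers for q in group.split("-")]
    body = "-".join(langs + [letter])
    prefix = SCRIPT[script]
    return f"{prefix}_{body}" if prefix else body
```

This reproduces the example given in the note: "CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I" becomes "Cy_be-uk-I".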
>
>
> 3. Three-letter codes (without hyphen) are for 3-letter words, e.g.
>
> dot for DOT, leg for LEG, bar for BAR or eth for ETH.
>
> Three-letter codes (with hyphen) are for 3-letter abbreviations
> (mainly phonetic descriptions), which most users will rarely use, e.g.
>
> den- DENTAL
> lat- LATERAL
> alv- ALVEOLAR
> ret- RETROFLEX
> glo- GLOTTAL
> bil- BILABIAL
> pha- PHARYNGEAL
> voi- VOICED
> fri- FRICATIVE
> pal- PALATAL
>
> None of these are used in this table.
>
> 4. Four-letter codes are for 4-letter words
>
> e.g. left, half, ring, stop, curl, tail, sign, open, baby, long
>
> and abbreviations of 5-letter words
>
> e.g.
>
> lowe lower
> brev breve
> fina final
> acut acute
> grav grave
>
> and abbreviations of larger words
>
> e.g.
>
> desc for DESCENDER
> dotl for DOTLESS
> digr for DIGRAPH
> liga for LIGATURE
> reve for REVERSED
> apos for APOSTROPHE
> scri for SCRIPT
> clos for CLOSED
> diae for DIAERESIS
> stro for STROKE
> symb for SYMBOL
> circ for CIRCUMFLEX
> cedi for CEDILLA
> macr for MACRON
> modi for MODIFIER
> ogon for OGONEK
> prec for PRECEDED
> ques for QUESTION
> excl for EXCLAMATION
> abbr for ABBREVIATION
> punc for PUNCTUATION
> turn for TURNED
> cros for CROSSED
>
>
>
> * * * * * * * *
>
> UCS: UNOFFICIAL SHORT IDS (used in ISO/TC46/SC2 draft documents)
>
> This section lists the most commonly used accented and modified
> letters, as used in the ISO 8859-1 character set standard, from
> hexadecimal A0 through hex FF, showing Hex value, Decimal value,
> Short ID * (as used in earlier postings of the tc46sc2@elot.gr list)
> and the Name in ISO/IEC 10646-1:1993.
>
> * Note: the short IDs used in most transliteration tables tend to use a
> much more simply-named repertoire than many of the characters in this
> table, and so any transliteration tables using these conventions will
> be much simpler to read than this table of ISO 8859.
>
> Short IDs are readable, and mostly systematically constructed from
> the full name in ISO/IEC 10646:
> - a Script Code such as Cy for Cyrillic replaces CYRILLIC CAPITAL
> LETTER, etc., (omitted for Latin letters);
> - the letter name is changed to A or a etc, accordingly;
> - WITH is omitted;
> - other name elements use only the first four letters;
> - RING ABOVE or DOT BELOW become ring-a or dot-b etc;
> - spaces are changed to _ (LOW LINE)
>
> +--------+------------------+-------------------------------------
> | UCS ID | Short ID | Name in ISO/IEC 10646-1:1993(E)
> +--------+------------------+-------------------------------------
> | | |
> | 00A0 | nbsp | NO-BREAK SPACE
> | 00A1 | i-excl_mark | INVERTED EXCLAMATION MARK
> | 00A2 | cent_sign | CENT SIGN
> | 00A3 | poun_sign | POUND SIGN
> | 00A4 | curr_sign | CURRENCY SIGN
> | 00A5 | yen_sign | YEN SIGN
> | 00A6 | brok_bar | BROKEN BAR
> | 00A7 | sect_sign | SECTION SIGN
> | 00A8 | diae | DIAERESIS
> | 00A9 | copy_sign | COPYRIGHT SIGN
> | 00AA | femi_ordi_indi | FEMININE ORDINAL INDICATOR
> | 00AB | << | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
> | 00AC | not_sign | NOT SIGN
> | 00AD | soft_hyph | SOFT HYPHEN
> | 00AE | regi_sign | REGISTERED SIGN
> | 00AF | macr | MACRON
> | | |
> | 00B0 | degr_sign | DEGREE SIGN
> | 00B1 | plus_minu_sign | PLUS-MINUS SIGN
> | 00B2 | supe_2 | SUPERSCRIPT TWO
> | 00B3 | supe_3 | SUPERSCRIPT THREE
> | 00B4 | acut | ACUTE ACCENT
> | 00B5 | micr_sign | MICRO SIGN
> | 00B6 | pilc_sign | PILCROW SIGN
> | 00B7 | m-dot | MIDDLE DOT
> | 00B8 | cedi | CEDILLA
> | 00B9 | supe_1 | SUPERSCRIPT ONE
> | 00BA | masc_ordi_indi | MASCULINE ORDINAL INDICATOR
> | 00BB | >> | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
> | 00BC | one_quar | VULGAR FRACTION ONE QUARTER
> | 00BD | one_half | VULGAR FRACTION ONE HALF
> | 00BE | thre_quar | VULGAR FRACTION THREE QUARTERS
> | 00BF | i-ques_mark | INVERTED QUESTION MARK
> | | |
> | 00C0 | A_grav | LATIN CAPITAL LETTER A WITH GRAVE ACCENT
> | 00C1 | A_acut | LATIN CAPITAL LETTER A WITH ACUTE ACCENT
> | 00C2 | A_circ | LATIN CAPITAL LETTER A WITH CIRCUMFLEX ACCENT
> | 00C3 | A_tild | LATIN CAPITAL LETTER A WITH TILDE
> | 00C4 | A_diae | LATIN CAPITAL LETTER A WITH DIAERESIS
> | 00C5 | A_ring-a | LATIN CAPITAL LETTER A WITH RING ABOVE
> | 00C6 | AE | LATIN CAPITAL LIGATURE AE
> | 00C7 | C_cedi | LATIN CAPITAL LETTER C WITH CEDILLA
> | 00C8 | E_grav | LATIN CAPITAL LETTER E WITH GRAVE ACCENT
> | 00C9 | E_acut | LATIN CAPITAL LETTER E WITH ACUTE ACCENT
> | 00CA | E_circ | LATIN CAPITAL LETTER E WITH CIRCUMFLEX ACCENT
> | 00CB | E_diae | LATIN CAPITAL LETTER E WITH DIAERESIS
> | 00CC | I_grav | LATIN CAPITAL LETTER I WITH GRAVE ACCENT
> | 00CD | I_acut | LATIN CAPITAL LETTER I WITH ACUTE ACCENT
> | 00CE | I_circ | LATIN CAPITAL LETTER I WITH CIRCUMFLEX ACCENT
> | 00CF | I_diae | LATIN CAPITAL LETTER I WITH DIAERESIS
> | | |
> | 00D0 | ETH | LATIN CAPITAL LETTER ETH
> | 00D1 | N_tild | LATIN CAPITAL LETTER N WITH TILDE
> | 00D2 | O_grav | LATIN CAPITAL LETTER O WITH GRAVE ACCENT
> | 00D3 | O_acut | LATIN CAPITAL LETTER O WITH ACUTE ACCENT
> | 00D4 | O_circ | LATIN CAPITAL LETTER O WITH CIRCUMFLEX ACCENT
> | 00D5 | O_tild | LATIN CAPITAL LETTER O WITH TILDE
> | 00D6 | O_diae | LATIN CAPITAL LETTER O WITH DIAERESIS
> | 00D7 | mult_sign | MULTIPLICATION SIGN
> | 00D8 | O_stro | LATIN CAPITAL LETTER O WITH STROKE
> | 00D9 | U_grav | LATIN CAPITAL LETTER U WITH GRAVE ACCENT
> | 00DA | U_acut | LATIN CAPITAL LETTER U WITH ACUTE ACCENT
> | 00DB | U_circ | LATIN CAPITAL LETTER U WITH CIRCUMFLEX ACCENT
> | 00DC | U_diae | LATIN CAPITAL LETTER U WITH DIAERESIS
> | 00DD | Y_acut | LATIN CAPITAL LETTER Y WITH ACUTE ACCENT
> | 00DE | THORN | LATIN CAPITAL LETTER THORN
> | 00DF | sharp_s | LATIN SMALL LETTER SHARP S
> | | |
> | 00E0 | a_grav | LATIN SMALL LETTER A WITH GRAVE ACCENT
> | 00E1 | a_acut | LATIN SMALL LETTER A WITH ACUTE ACCENT
> | 00E2 | a_circ | LATIN SMALL LETTER A WITH CIRCUMFLEX ACCENT
> | 00E3 | a_tild | LATIN SMALL LETTER A WITH TILDE
> | 00E4 | a_diae | LATIN SMALL LETTER A WITH DIAERESIS
> | 00E5 | a_ring-a | LATIN SMALL LETTER A WITH RING ABOVE
> | 00E6 | ae | LATIN SMALL LIGATURE AE
> | 00E7 | c_cedi | LATIN SMALL LETTER C WITH CEDILLA
> | 00E8 | e_grav | LATIN SMALL LETTER E WITH GRAVE ACCENT
> | 00E9 | e_acut | LATIN SMALL LETTER E WITH ACUTE ACCENT
> | 00EA | e_circ | LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT
> | 00EB | e_diae | LATIN SMALL LETTER E WITH DIAERESIS
> | 00EC | i_grav | LATIN SMALL LETTER I WITH GRAVE ACCENT
> | 00ED | i_acut | LATIN SMALL LETTER I WITH ACUTE ACCENT
> | 00EE | i_circ | LATIN SMALL LETTER I WITH CIRCUMFLEX ACCENT
> | 00EF | i_diae | LATIN SMALL LETTER I WITH DIAERESIS
> | | |
> | 00F0 | eth | LATIN SMALL LETTER ETH
> | 00F1 | n_tild | LATIN SMALL LETTER N WITH TILDE
> | 00F2 | o_grav | LATIN SMALL LETTER O WITH GRAVE ACCENT
> | 00F3 | o_acut | LATIN SMALL LETTER O WITH ACUTE ACCENT
> | 00F4 | o_circ | LATIN SMALL LETTER O WITH CIRCUMFLEX ACCENT
> | 00F5 | o_tild | LATIN SMALL LETTER O WITH TILDE
> | 00F6 | o_diae | LATIN SMALL LETTER O WITH DIAERESIS
> | 00F7 | divi_sign | DIVISION SIGN
> | 00F8 | o_obli_bar | LATIN SMALL LETTER O WITH OBLIQUE BAR
> | 00F9 | u_grav | LATIN SMALL LETTER U WITH GRAVE ACCENT
> | 00FA | u_acut | LATIN SMALL LETTER U WITH ACUTE ACCENT
> | 00FB | u_circ | LATIN SMALL LETTER U WITH CIRCUMFLEX ACCENT
> | 00FC | u_diae | LATIN SMALL LETTER U WITH DIAERESIS
> | 00FD | y_acut | LATIN SMALL LETTER Y WITH ACUTE ACCENT
> | 00FE | thorn | LATIN SMALL LETTER THORN
> | 00FF | y_diae | LATIN SMALL LETTER Y WITH DIAERESIS
> +--------+------------------+-------------------------------------
>
> Yours sincerely
>
>
> John Clews
> --
> John Clews (Chair of ISO/TC46/SC2: Conversion of Written Languages)
>
> SESAME Computer Projects, 8 Avenue Road, Harrogate, HG2 7PG, England
> Email: Converse@sesame.demon.co.uk; tel: +44 (0) 1423 888 432
>



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:38 EDT