Re: Short verbal IDs for UCS characters

From: Gregg Reynolds
Date: Tue Oct 05 1999 - 22:41:26 EDT


Thanks for the feedback and the ideas. I'm afraid my Cyrillic is rather weak,
but I'll take a good look at your ideas and try them out in the coming weeks.
They look promising for a short form.

Do think about hierarchical structuring notations as well. What got me started
on this was trying to encode the text of an English grammar of Classical Arabic
that uses 6 or 7 different languages (Arabic, Hebrew, Syriac, Latin, Greek,
German, French, English, and transliterated Arabic, at least.) I started out
using XML/SGML character entity notations in various ways, but it quickly became
clear that such a notation is quite impractical for hand-coded texts (I'm pretty
fanatical about emacs). Even if the data entry were not such a pain, the result
is unreadable, which is a major problem, since I need to proofread several times.

After a fair amount of experimentation I've found that the simplest way to do
this, both from a data entry and from a proofreading perspective, is to settle on
an ascii-based transliteration for each language, along with a set of 2- or
3-character delimiters used to demarcate language regions of the text.

An example (Wright's Arabic Grammar, p. 11):

 REM. b. <<y>> at the end of a word after a [(fatha)] is pronounced B
 like <<A>>, e.g. <[fatay]>, <[ramay]>, <[`Ailay]>*, and is called, like <<A>>
 in the same position (e.g. <[bahnasA]>, <[gazA]>), <<al-`Aalifu _l-maqSUra#u>>
 _the [(elif)] that can be abbreviated_, in contradistinction to _the lengthened
 [(elif)]_, <<al-`Aalifu _l-mamdUda#u>> (see $ 22 and $ 23, rem. _a_), which is
 protected by [(hemza)]. It receives this name because, when it comes in contact
 with a [<[(hemza)] conjunctionis>] (see $ 19, rem. _f_), it is shortened in
 pronunciation before the following consonant, as are the <<w>> and <<y>> in
 <<`AabU>> and <<`AabI>> before <<_l-wazIri>> (see $ 20, _b_)+.

Here <<foo>> denotes a string of Arabic glyphs, <[foo]> denotes Arabic followed by
its transliteration, [(foo)] means the transliteration of 'foo', <<Y>> means
dotted ya, <<y>> means dotless ya, [<foo>] means Latin in italics, _foo_ means
italics, <{Hebrew here}>, <(Greek here)>, etc. Of course this depends on knowing
that strings like '<[' do not occur in the text and so can be interpreted as
metatext without problems. I'm still tweaking it and encoding text, but
eventually I expect that a simple perl script should suffice to translate it into
TeX or HTML or whatever, with appropriate encoding.
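
To illustrate the region-splitting step that such a translation script would
need, here is a minimal sketch in Python rather than perl. It is not the script
anticipated above: the region labels and the function name split_regions are
hypothetical, only the two-character delimiters listed in the message are
handled, and _foo_ italics are deliberately left alone because underscores also
occur inside the transliterated Arabic. Nested regions are returned raw.

```python
# Delimiter pairs as listed in the message; the labels are illustrative names.
DELIMITERS = [
    ("<<", ">>", "arabic"),
    ("<[", "]>", "arabic+translit"),
    ("[(", ")]", "translit"),
    ("[<", ">]", "latin-italic"),
    ("<{", "}>", "hebrew"),
    ("<(", ")>", "greek"),
]


def split_regions(text):
    """Split text into (label, content) pairs, left to right.

    Text outside any delimiter pair is labelled "plain". An opener with no
    matching closer is treated as plain text.
    """
    regions = []
    pos = 0          # current scan position
    plain_start = 0  # start of the pending plain-text run
    while pos < len(text):
        for opener, closer, label in DELIMITERS:
            if text.startswith(opener, pos):
                end = text.find(closer, pos + len(opener))
                if end != -1:
                    if plain_start < pos:
                        regions.append(("plain", text[plain_start:pos]))
                    regions.append((label, text[pos + len(opener):end]))
                    pos = end + len(closer)
                    plain_start = pos
                    break
        else:
            # No delimited region starts here; keep scanning.
            pos += 1
    if plain_start < len(text):
        regions.append(("plain", text[plain_start:]))
    return regions
```

Each region could then be handed to a per-language converter. A nested case
such as [<[(hemza)] conjunctionis>] comes back as a single latin-italic region
whose content still contains [(hemza)], so a fuller version would call
split_regions again on the content.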

The transliteration is my own scheme, which uses a single ascii character for
each Arabic sign; I'd provide an account of it now, but that would take a few
hours and it's past my bedtime. With a little creative thinking ascii can be
remarkably expressive. For example, last night I realized that what I had been
treating as an error - OCR software translating e+acute into '6' - is actually a
rather clever encoding. Bonne id6e! Digits in the text can then be encoded in
a digit region (e.g. <+999+> or the like). Once I finish transcribing (goal:
first 30 or so pages of the grammar, which covers the writing system) I'll put it
up on a webpage.
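
The '6' trick and the digit region can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function name
decode_digits is hypothetical, the <+...+> delimiter is the one floated above
"or the like", and only '6' is reinterpreted, since no other digit is assigned
a meaning in the message.

```python
import re

# A digit region as suggested in the message, e.g. <+999+>.
DIGIT_REGION = re.compile(r"<\+(\d+)\+>")

E_ACUTE = "\u00e9"  # the character that '6' stands for outside digit regions


def decode_digits(text):
    """Read '6' as e-acute outside digit regions; keep digits inside them."""
    out = []
    pos = 0
    for match in DIGIT_REGION.finditer(text):
        # Text before the region: '6' is a letter, not a number.
        out.append(text[pos:match.start()].replace("6", E_ACUTE))
        # Inside the region: emit the digits literally, without delimiters.
        out.append(match.group(1))
        pos = match.end()
    out.append(text[pos:].replace("6", E_ACUTE))
    return "".join(out)
```

For instance, decode_digits("Bonne id6e! See page <+16+>.") gives
"Bonne idée! See page 16.".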

In any case, I've found this works quite well for Arabic, and have begun thinking
about encoding bilingual Sanskrit grammars, for which I think a similar scheme
will work. I'd be interested in knowing whether it would work for, e.g.,
Russian.


John Clews wrote:

> Via the Unicode List on Monday, 4 October 1999
> Marco Cimarosti suggested that
> "a sort of UDS = Universal Description Sequence" might be of use, and
> Gregg Reynolds asked similarly:
> "Has this been done before? Would anybody other than me find this useful?"
> I think this sort of thing is useful, and it's worth flinging a few
> ideas around, as identifiers of the form U+hhhh (hex identifiers) are
> not always meaningful out of context.
> However, I think it's also worth trying to relate any short verbal IDs
> for UCS characters to the UCS character names themselves.
> I think the scheme below manages to be brief, generally meaningful in
> context, and predictable, and relates to UCS character names, and can
> also be machine generated, and also reversible to full character names.
> Other views, pro or con, will be welcome.
