RE: metalanguage (was RE: Why is Unicode inconsistant?)

From: Marco.Cimarosti@icl.com
Date: Mon Oct 04 1999 - 10:16:12 EDT


A sort of UDS = Universal Description Sequences?

Just a few remarks:

1: In your example, you use GUJ as an abbreviation of Gujarati. Note
that there is a standard for 2-letter abbreviations of languages (ISO 639, I
think; there should be a link to it on the Unicode site too). It could also
be used in your proposal to identify scripts (where each script is
identified by the language it took its name from: Latin, Arabic, Greek,
etc.). A handful of non-conflicting 2-letter abbreviations would still have
to be added for scripts that didn't take their name from a language
(Cyrillic, Devanagari, Hangul, Kana, etc.).

2: The "character entities" used in HTML (like &Egrave; for E with grave
accent) could provide some inspiration, at least for Western scripts.

3: There is a similar notation used for IPA on linguistics newsgroups,
and it could be exploited for the IPA block.

4: (This will bring lightning down on my head :-)
CJK ideographs currently have uninformative names like
"CJK UNIFIED IDEOGRAPH-4E00". They could be named more meaningfully using
IDSs (Ideographic Description Sequences) with only Kangxi and Supplement
radicals as DCs. E.g.:
        4E00 = CJK IDEOGRAPH ONE
        4E01 = CJK IDEOGRAPH ATB ONE HOOK
        4E0A = CJK IDEOGRAPH ATB DIVINATION ONE
        4E0B = CJK IDEOGRAPH ATB ONE DIVINATION
        4E62 = CJK IDEOGRAPH LTR MOUNTAIN SECOND-TWO
        4EC0 = CJK IDEOGRAPH LTR STANDING-PERSON TEN
and these could then be shortened to fit your proposed short notation:
        4E00 = [one]
        4E01 = [Z one hook]
        4E0A = [Z divinat one]
        4E0B = [Z one divinat]
        4E62 = [N mount two2]
        4EC0 = [N standp ten]
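
To make the idea concrete, here is a minimal Python sketch that derives both
the long name and the short notation from one decomposition table. The table
itself, the function names, and the reading of ATB/LTR as layout markers
(presumably "above to below" and "left to right") are assumptions made for
illustration; the abbreviations are simply the ones from the examples above,
not anything defined by Unicode.

```python
# Abbreviations copied from the examples above; none of this is an
# official Unicode name or a real IDS operator.
LAYOUT_SHORT = {"ATB": "Z", "LTR": "N"}
COMPONENT_SHORT = {
    "ONE": "one",
    "HOOK": "hook",
    "DIVINATION": "divinat",
    "MOUNTAIN": "mount",
    "SECOND-TWO": "two2",
    "STANDING-PERSON": "standp",
    "TEN": "ten",
}

# Hypothetical decomposition table: code point -> (layout, components).
DECOMPOSITIONS = {
    0x4E00: (None, ["ONE"]),
    0x4E01: ("ATB", ["ONE", "HOOK"]),
    0x4E0A: ("ATB", ["DIVINATION", "ONE"]),
    0x4E0B: ("ATB", ["ONE", "DIVINATION"]),
    0x4E62: ("LTR", ["MOUNTAIN", "SECOND-TWO"]),
    0x4EC0: ("LTR", ["STANDING-PERSON", "TEN"]),
}


def long_name(cp):
    """Build the descriptive long name, e.g. 'CJK IDEOGRAPH ATB ONE HOOK'."""
    layout, parts = DECOMPOSITIONS[cp]
    words = ["CJK IDEOGRAPH"] + ([layout] if layout else []) + parts
    return " ".join(words)


def short_name(cp):
    """Build the bracketed short form, e.g. '[Z one hook]'."""
    layout, parts = DECOMPOSITIONS[cp]
    words = [LAYOUT_SHORT[layout]] if layout else []
    words += [COMPONENT_SHORT[part] for part in parts]
    return "[" + " ".join(words) + "]"


for cp in sorted(DECOMPOSITIONS):
    print(f"{cp:04X} = {long_name(cp)} = {short_name(cp)}")
```

Keeping a single table means the long and short forms cannot drift apart,
which would matter if such names were ever maintained by hand.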

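On remark 2, a quick way to see how far the HTML entity names would get you
is Python's standard html.entities tables (the HTML 4 entity set). The helper
name below is made up for this sketch; the tables themselves are part of the
standard library.

```python
import unicodedata
from html.entities import codepoint2name, name2codepoint


def entity_short_name(ch):
    """Return the HTML 4 entity name for ch, or None if it has none.

    Hypothetical helper: it only illustrates reusing entity names as
    short names, it is not part of any proposal.
    """
    return codepoint2name.get(ord(ch))


for ch in ("\u00c8", "\u00d8", "\u0416"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", entity_short_name(ch))

# The reverse direction: entity name to character.
e_grave = chr(name2codepoint["Egrave"])
```

The Cyrillic letter in the loop comes back as None, which matches the caveat
that this source of names only helps for Western scripts.
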
Regards. Marco

> -----Original Message-----
> From: Reynolds, Gregg [SMTP:greynolds@datalogics.com]
> Sent: 1999 October 04, Monday 15.14
> To: Unicode List
> Subject: metalanguage (was RE: Why is Unicode inconsistant?)
>
> > -----Original Message-----
> > From: Michael Everson [mailto:everson@indigo.ie]
> > Sent: Monday, October 04, 1999 6:44 AM
> > To: Unicode List
> > Subject: Re: Why is Unicode inconsistant?
> >
> > ...
> > >If you look at letter: 0xD8 it cannot be decomposed,
> >
> > (that's LATIN CAPITAL LETTER O WITH SLASH)
> >
>
> One thing I and no doubt many others would find very useful is a standard
> short name for the repertoire. Also a standardized abstract syntax
> notation. This would be quite useful especially in cases where writers use
> ascii as a metalanguage to talk about ascii, as has occurred frequently in
> the terminal discussion. A good example is one writer's reference to 'the
> character n~'; with short names and an abstract syntax, he could have
> written something like 'the character {n,+~}' or {n,~} or {n~}, using ascii
> to denote ascii, comma to delimit codepoints, '+' to mean 'combining', and
> adjacency to mean 'with'. And curly braces to delimit a unicode abstract
> syntactic phrase.
>
> For example, 0xD8 could be {O/}, and a 'word' using it might be written
> '{c,O/,t}'.
>
> One could also combine the number and short name: {c,0xD8:O/,t} or the like.
>
> "Above" could be written '/ /', below '\ \'. The motivation for this is
> from the Z language, which uses 0x2197:NE-arrow paired with 0x2199:SW-arrow
> to denote superscripted expressions, and 0x2198:SE-arrow and 0x2196:NW-arrow
> for subscripts. Since we don't have arrows in ascii, we can use the slashes.
> (Z uses /^ for 0x2197 and v/ for 0x2199.) So for example, everybody's fave
> composed letter would be written {a/ring/} = {a,+/ring/} or {0xE5:a/ring/}
> = {0x61:a,0x030A:+/ring/}. Etc.
>
> For letters outside of ascii, we could use a two or three character language
> prefix: 0x0A95 = GUJ-ka (instead of GUJARATI LETTER KA).
>
> To a certain extent this would give us the ability to concisely encode (for
> discussion purposes) characters that are not in Unicode. For example,
> {a,+\AR-hamza\} = latin a followed by combining arabic hamza below. It
> would give us a kind of metalanguage for encoding the visual grammar of
> letter forms.
>
> More generally, '+' = affix; '/ /' = surfix; '\ \' = subfix; '- -' = infix,
> and so on.
>
> Has this been done before? Would anybody other than me find this useful?
>
> -gregg
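
For what it is worth, the combined "number:short name" form quoted above is
easy to process mechanically. The following Python sketch is one possible
reading of it, not a definition: the function names are invented here, a comma
is assumed never to appear inside a short name, and an item without a number
is only resolved when it is a single ASCII character ("ascii to denote ascii").

```python
import unicodedata


def parse_phrase(phrase):
    """Split '{c,0xD8:O/,t}' into (code point or None, short name) pairs."""
    if not (phrase.startswith("{") and phrase.endswith("}")):
        raise ValueError("phrase must be wrapped in curly braces")
    items = []
    for item in phrase[1:-1].split(","):
        number, sep, name = item.partition(":")
        if sep and number.lower().startswith("0x"):
            # Explicit '0xNNNN:name' form.
            items.append((int(number, 16), name))
        else:
            # Short name only; no code point given.
            items.append((None, item))
    return items


def resolve(items):
    """Turn parsed items into a string, composing combining sequences."""
    chars = []
    for cp, name in items:
        if cp is not None:
            chars.append(chr(cp))
        elif len(name) == 1 and name.isascii():
            chars.append(name)
        else:
            # A real tool would need an agreed table of short names here.
            raise LookupError(f"no code point known for short name {name!r}")
    return unicodedata.normalize("NFC", "".join(chars))


print(resolve(parse_phrase("{c,0xD8:O/,t}")))
print(resolve(parse_phrase("{0x61:a,0x030A:+/ring/}")))
```

The NFC normalization step is what makes the decomposed spelling
{0x61:a,0x030A:+/ring/} come out as the same single character as 0xE5.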


