E0000 Language Tags for Archaic Greek Alphabets

From: UList@dfa-mail.com
Date: Sun Feb 27 2005 - 15:15:15 CST

    I've been informed a certain language is *not* to be mentioned!

    I apologize. As you can see from the subject line of my post, I thought I was
    dealing with an "obscure" subject there.

    So let me change the subject to what I'm *actually* interested in: archaic
    Greek alphabets.

    > I'm not sure if you are suggesting a language tag at the start of a
    > string of UNMENTIONABLE text or before each UNMENTIONABLE letter.

    In that post I was backing off from my love of "single-codepoint tagging" and
    trying out tagging the entire section of text instead.

    > If the former, this simply doesn't work with current font technologies,

    From what I've been told about OpenType, it should actually currently be
    possible to detect any arbitrary string of codepoints, like

       <DORIC>

    and then detect another arbitrary string of codepoints, like

       </DORIC>

    and then do something (glyph swapping) to the codepoints in between them.

    This is "context" detection (something is next to something else), not
    "state" detection.
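
    A minimal sketch of the idea, written in Python rather than as an OpenType
    lookup (whether OpenType itself can do this is exactly the open question).
    The function name swap_between, the marker strings and the glyph map are all
    hypothetical, chosen only for illustration. It finds a start marker, finds
    the next end marker after it, and swaps only the characters in between.
    Text with no closing marker is left untouched.

```python
def swap_between(text, start, end, glyph_map):
    """Replace characters found between `start` and `end` markers.

    Illustration only: `glyph_map` maps a character to its stand-in glyph,
    and anything outside a complete start...end pair is copied unchanged.
    """
    out = []
    pos = 0
    while True:
        s = text.find(start, pos)
        if s == -1:
            out.append(text[pos:])       # no more start markers
            break
        e = text.find(end, s + len(start))
        if e == -1:
            out.append(text[pos:])       # unterminated span: leave as-is
            break
        out.append(text[pos:s + len(start)])           # text before + marker
        inner = text[s + len(start):e]
        out.append("".join(glyph_map.get(ch, ch) for ch in inner))
        out.append(end)
        pos = e + len(end)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical markers and mapping, for demonstration only.
    print(swap_between("ab<D>abc</D>ab", "<D>", "</D>", {"a": "A"}))
```

    Note that the sketch has to scan ahead for the end marker, which is the
    part a font engine may or may not be able to do over a long run of text.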

    The person who told me that is someone very much in the know about such
    things, but he may have misunderstood what I was asking. So I don't know for
    sure that it can be done.

    Given the choice of using absolutely any arbitrary string of codepoints for my
    "start" and "end" markers, I thought it would be best received by Unicode if I
    used the E0000 codepoints. There's actually a specific way defined for using
    the E0000 tags as "custom language tags", where you start with the Tag
    [Language], and then (I think) the Tag [x] and then the Tag [-] and then the
    Tags that spell out your custom language name. Then there's another Tag that
    means [End Language].

    That's what I mean by my shorthand:

       <DORIC> ---> [LANGUAGE][x][-][D][O][R][I][C]
       </DORIC> ---> [END LANGUAGE]
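
    As a rough sketch of what that shorthand expands to: the Plane 14 tag
    characters U+E0020 through U+E007E mirror printable ASCII at an offset of
    0xE0000, U+E0001 is LANGUAGE TAG and U+E007F is CANCEL TAG. The Python
    below assumes that [END LANGUAGE] corresponds to LANGUAGE TAG followed by
    CANCEL TAG; the helper names and the tag "x-DORIC" are made up for
    illustration, and the letter case simply follows the shorthand above.

```python
TAG_OFFSET = 0xE0000            # tag characters mirror ASCII at this offset
LANGUAGE_TAG = "\U000E0001"     # U+E0001 LANGUAGE TAG
CANCEL_TAG = "\U000E007F"       # U+E007F CANCEL TAG


def to_tag_chars(ascii_text):
    """Map printable ASCII (U+0020..U+007E) onto U+E0020..U+E007E."""
    out = []
    for ch in ascii_text:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
        out.append(chr(TAG_OFFSET + cp))
    return "".join(out)


def open_language(tag):
    """[LANGUAGE] followed by the tag spelled out in tag characters."""
    return LANGUAGE_TAG + to_tag_chars(tag)


def close_language():
    """Assumed encoding of [END LANGUAGE]: LANGUAGE TAG + CANCEL TAG."""
    return LANGUAGE_TAG + CANCEL_TAG


if __name__ == "__main__":
    marker = open_language("x-DORIC")   # hypothetical private-use tag
    print(" ".join(f"U+{ord(c):05X}" for c in marker))
```

    A font would then see these invisible codepoints as ordinary members of
    the text run, which is all the "context" approach needs.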

    I'm using the specific approved E0000 way of doing "custom language tags"
    currently, *solely* to go along with how Unicode says to do things, even
    though any arbitrary string of codepoints could be used the same way by OT to
    detect "context". OT cannot currently detect these E0000 "custom language
    tags" *as* "language tags", but it seems conceivable it might be able to do so
    in the future. So there is also the benefit of possible forward-compatibility
    for documents (as well as keeping Unicode happy).

    On a different subject, the OpenType font technology *can* (again I am told)
    deal with real, normal language tags such as you would use in XML. The
    language tags OT will respond to are created by Microsoft and maintained in a
    list on the MS site. You can program the OT font to carry out a set of
    instructions when it detects a particular XML language tag.

    The Microsoft site says the list of recognized languages is being expanded.
    For Ancient Greek though, I don't think this official list is a good way to go
    -- even if Microsoft would accept something like "Old Cretan Doric".

    The little secret I should let out is that I am really using the E0000
    "language" tags for "scripts", although it can be argued that each Ancient
    Greek dialect is different enough to be called a "language". As most people
    on the list know, there is no scientific line between "dialect" and
    "language"; the joke is that "a language is a dialect with an army". And of
    course each Greek city-state had its own army :)

    But I want to go even a little further, in order to implement *all* the
    possible variations of archaic Greek scripts in one smart font. I want to define

       OLD_CRETAN_DORIC_LTR (left to right)
       OLD_CRETAN_DORIC_RTL (right to left)
       OLD_CRETAN_DORIC_ALT_LTR (alternate, left to right)
       OLD_CRETAN_DORIC_ALT_RTL (alternate, right to left)

    and this, or more, for perhaps 10 different archaic Greek scripts... err, I
    mean dialects.
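
    To give a feel for how many tags that is, here is a small Python sketch that
    builds the four names per dialect from the pattern above. The function name
    variant_names is made up, and the suffixes are simply copied from the list;
    with about 10 dialects this comes to roughly 40 tags.

```python
from itertools import product


def variant_names(dialect):
    """Build the LTR/RTL and plain/alternate tag names for one dialect."""
    forms = ["", "_ALT"]            # plain and alternate letterforms
    directions = ["_LTR", "_RTL"]   # left to right, right to left
    return [dialect + form + direction
            for form, direction in product(forms, directions)]


if __name__ == "__main__":
    for name in variant_names("OLD_CRETAN_DORIC"):
        print(name)
```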

    That probably isn't something that could/should end up on Microsoft's official
    language list.

    And perhaps I should actually ask Unicode for an E0000 Tag [Script] and Tag
    [End Script] rather than continuing my fairly poor attempt to pass scripts
    off as dialects. The E0000 documentation says more Tags like [Language] will
    be added, and [Script] sounds like a good one.


    Peter Kirk wrote:
    > On 27/02/2005 17:04, UList@dfa-mail.com wrote:
    > >...
    > >UNMENTIONABLE: use Hebrew transliteration text plus a smart font to swap in
    > >UNMENTIONABLE glyphs when the E0000 "UNMENTIONABLE" language tags are encountered.
    > >
    > >
    > >
    > Doug, I am infamous on this list for having suggested several different
    > alternatives for representing UNMENTIONABLE in Unicode. See the list
    > archives, especially for May 2004 - and please don't try to reopen those
    > discussions! But this is one suggestion which I did not consider. Why
    > not? Simply because it has nothing at all to commend it.
    > I'm not sure if you are suggesting a language tag at the start of a
    > string of UNMENTIONABLE text or before each UNMENTIONABLE letter.
    > If the former, this simply doesn't work with current font technologies,
    > which are not stateful in the way necessary to support this.
    > If the latter, I suppose in principle current font technologies could
    > support this if the language tag and the letter were treated as a
    > multi-character ligature. But it would surely be ruled out by its
    > extreme inefficiency, and because the rather similar alternative of
    > using a variation selector after each UNMENTIONABLE letter is much more
    > efficient but was ruled out for various reasons, including its lesser
    > inefficiency, which apply all the more to your solution.
    > If what you really mean is that you want to use higher level markup to
    > distinguish UNMENTIONABLE from other languages by a change of font perhaps
    > indicated by a different markup style, with language tags as one
    > specific way of doing this markup: Well, that might work, but the UTC
    > and WG2 have already rejected the argument that UNMENTIONABLE should be
    > distinguished only by a font change signalled by markup.
    > --
    > Peter Kirk
    > peter@qaya.org (personal)
    > peterkirk@qaya.org (work)
    > http://www.qaya.org/
