Dialects and orthographies in BCP 47 (was: Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

From: Doug Ewell (doug@ewellic.org)
Date: Wed Aug 04 2010 - 14:29:39 CDT

    verdy_p <verdy underscore p at wanadoo dot fr> wrote:

    > Really, "Hans", "Hant", "Latf", "Latg" could have been avoided in ISO 15924, if orthographic variants of the same
    > languages had been encoded in the IANA database for BCP 47, independently of the effective font style.

    Actually it was the opposite; the ability to use standardized ISO 15924
    code elements to express concepts like "Simplified Han" was one of the
    driving forces behind RFC 4646 and its shift in focus from whole tags to
    subtags.

    In any case, the bibliographers and others who use ISO 15924 but not BCP
    47 might need to make these distinctions as well.

    > But for now there's still no formal model for encoding language dialects, so BCP 47 language tags still need to use
    > tags for ISO 3166-1 region codes and for the script variant, when it should just qualify the generic script code (or
    > it could even drop this ISO 15924 code if there was a formal code for the dialect written in a specific orthography:
    > we would also deprecate "Jpan", "Hrkt" in ISO 15924).

    There is no "formal model" in the sense of a standard N-letter subtag
    for dialects, because the concept of a dialect is too open-ended and
    unsystematic. The word means different things to different people.
    What may be a dialect to one person might be a full-blown National
    Language to another, or just a funny accent to a third.

    BCP 47 tags never *need* to use either the region subtag or the script
    subtag, unless they are necessary to convey the intended meaning. A tag
    like "ja-Jpan-JP" is almost never needed, because almost all written
    Japanese is "using the Japanese writing system" ('Jpan') and "as used in
    Japan" ('JP').
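
    As a rough illustration of the script half of that point, here is a
    minimal Python sketch that drops a script subtag when it matches the
    language's Suppress-Script value. The table and the function name are
    hypothetical; the three entries are hand-copied and should be checked
    against the IANA Language Subtag Registry. Whether the region subtag
    adds meaning is a judgment call that no table can settle, so the
    sketch leaves it alone.

```python
# Hypothetical, hand-copied excerpt of Suppress-Script values; the
# authoritative data is the IANA Language Subtag Registry.
SUPPRESS_SCRIPT = {"ja": "Jpan", "en": "Latn", "fr": "Latn"}


def drop_redundant_script(tag: str) -> str:
    """Remove the script subtag if it equals the language's Suppress-Script.

    Assumes a simple tag of the form language[-script][-region][-variant...]
    with no extlang, extension, or private-use subtags.
    """
    parts = tag.split("-")
    lang = parts[0].lower()
    # A script subtag, if present, is the second subtag: exactly 4 letters.
    if len(parts) > 1 and len(parts[1]) == 4 and parts[1].isalpha():
        if SUPPRESS_SCRIPT.get(lang, "").lower() == parts[1].lower():
            del parts[1]
    return "-".join(parts)
```

    Under these assumptions, "ja-Jpan-JP" reduces to "ja-JP", while
    "ja-Latn" is left untouched because 'Latn' does carry information
    for Japanese.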

    I'm not sure what dialect is being posited here that would make the
    difference between having to specify a script subtag and not having to.

    > Orthographic variants would include also:
    > - the various romanization systems (for example Pinyin) and phonetic transcriptions (IPA phonetic, simplified IPA
    > phonology),

    'pinyin', 'fonipa'

    > - the simplified orthographies (e.g. orthographic reforms in French and German),

    '1606nict', '1694acad', '1901', '1996'

    > - and some other minor variants (like the vertical presentation for East-Asian scripts, or Boustrophedon
    > presentation for Ancient Greek, if this alters the orientation of characters that had to be encoded differently, and
    > the default mirroring properties are not applicable to the encoded characters in the basic language).
    > For now these dialectal/orthographic variants of written languages can be registered in the IANA database for BCP
    > 47, using codes with at least 5 letters (or with at least 4 letters or digits if there's at least one digit),

    A 4-character variant subtag must *begin* with a digit.
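
    For reference, the variant production in the RFC 5646 ABNF is
    "5*8alphanum / (DIGIT 3alphanum)". A small Python sketch of that
    syntax check follows; the function name is made up for illustration,
    and the check covers well-formedness only, not whether the subtag is
    actually registered.

```python
import re

# variant = 5*8alphanum / (DIGIT 3alphanum), per the RFC 5646 ABNF.
VARIANT_RE = re.compile(r"(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})")


def is_wellformed_variant(subtag: str) -> bool:
    """Return True if subtag matches the variant syntax (not registration)."""
    return VARIANT_RE.fullmatch(subtag) is not None
```

    All of 'pinyin', 'fonipa', '1606nict', '1694acad', '1901' and '1996'
    pass this check; something like 'a901' does not, because a
    4-character variant has to start with a digit.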

    > but
    > ideally the dialectal variant should be encoded as a tag BEFORE the orthographic variant.

    Why is this important?

    Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
    RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s