L2/01-209

From: Ordering@sesame.demon.co.uk
Sent: Friday, May 11, 2001 6:40 AM

Language codes - information from ISO/TC37/SC2


1. Update since the WG20 meeting, November 2000

Since the meeting of ISO/TC37 (Terminology) in London in August 2000,
and since ISO/IEC JTC1/SC22/WG20's last meeting, Gerhard Budin
(Austria) has taken over as Chair of ISO/TC37/SC2 from Aat Vervoorn
(Netherlands).

The 2-letter language code standard ISO 639-1: Codes for
representation of names of languages, has still not been published.
It is designed to replace ISO 639: Codes for representation of
names of languages, by removing some errors and adding codes for more
languages.

The 3-letter language code standard ISO 639-2: Codes for
representation of names of languages, remains in force. ISO 639 (and
its imminent replacement, ISO 639-1) continues to cover a subset of
the languages coded in ISO 639-2.

In the area of Internet specifications, RFC 1766 (Language tags) has
been superseded by RFC 3066 (Language tags). RFC 3066 sets a
precedence order: 2-letter codes are used where they exist, in
preference to 3-letter codes. It is also planned to freeze ISO 639-1
so that no new 2-letter codes are added, to avoid duplicate codes
being in use.

This provides an unambiguous coding mechanism for over 400
languages.


2. Additional practices to consider

However, there are areas where confusion is possible, and care
should be taken where non-normative use may be encountered:
1. for the same repertoire of languages, variant 3-letter codes
   are sometimes used (a) in libraries and (b) in some linguistic
   organizations, which use SIL codes for the same languages;
2. there is frequently a user demand for additional codes beyond the
   400-plus codes in ISO 639-2. In particular, SIL provides
   3-letter codes for around 7,000 languages.

Just as RFC 3066 provides that
(a) 2-letter codes from ISO 639 are used where they exist;
(b) 3-letter codes from ISO 639-2 are used where 2-letter ISO 639
    codes do not exist; so
(c) some users also tend to use 3-letter SIL codes where there are no
    codes for languages in ISO 639, ISO 639-1 or ISO 639-2. This use
    is unregulated, and while there are generally no collisions in
    practice, collisions remain possible.
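
As an illustration only, the precedence in (a) to (c) can be sketched
in Python. The tables below are tiny hand-picked extracts, the function
name preferred_code is hypothetical, and the SIL table is a
caller-supplied placeholder (no real SIL codes are shown). None of this
is taken from the text of RFC 3066 itself.

```python
from typing import Dict, Optional, Tuple

# Small illustrative extracts; the real code lists are far larger.
ISO_639_1: Dict[str, str] = {"English": "en", "Kurdish": "ku"}
ISO_639_2: Dict[str, str] = {
    "English": "eng",
    "Kurdish": "kur",
    "Sign languages": "sgn",  # has no 2-letter code
}


def preferred_code(
    language: str, sil_codes: Optional[Dict[str, str]] = None
) -> Optional[Tuple[str, str]]:
    """Return (code, source) following the precedence described above.

    sil_codes is an optional, caller-supplied mapping (an assumption of
    this sketch) consulted only when neither ISO table has an entry.
    """
    if language in ISO_639_1:          # (a) 2-letter code wins
        return ISO_639_1[language], "ISO 639-1"
    if language in ISO_639_2:          # (b) otherwise the 3-letter code
        return ISO_639_2[language], "ISO 639-2"
    if sil_codes and language in sil_codes:  # (c) unregulated fallback
        return sil_codes[language], "SIL (unregulated)"
    return None
```

Returning the source alongside the code lets a caller flag values that
came from the unregulated fallback in (c), which is where collisions
could arise.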

The UK national member body intends to provide comprehensive
information on point (c) above, aligning it with information from
the Linguasphere Register, which documents (but does not provide
codes for) around 70,000 languages. This would be a national member
body contribution to the upcoming ISO/TC37/SC2 meeting in Toronto,
August 2001, to assist in the development of the now approved
New Work Item ISO 639-3 "Coding systems".


3. Future plans by ISO/TC37/SC2/WG1

ISO/TC37/SC2/WG1 N69 "Coding systems" (2001-01-31) by Haavard
Hjulstad (convenor of ISO/TC37/SC2/WG1) describes this NWI, which has
now been approved by ISO CS.

Currently, three (closely interlinked) projects are planned.
1. Development and maintenance of a database of language coding,
   (extracts of) which should be freely available on the web.
2. Adding to this those languages that are currently not included
   in ISO 639-1 or ISO 639-2, without assigning standardized
   identifiers.
3. Development of an International Standard for coding mechanisms for
   language variation, including variation through time, geographically
   determined dialectal variation, writing system, etc.

        Comment on 1 and 2: the UK is concerned that the proposed
        database would hold insufficient information. Currently,
        ISO 639-1 contains 180 codes, and ISO 639-2 contains 438
        entries. As at 2001-01-31, the database contained 493 entries.

        This compares with SIL (around 7,000 codes) and the
        Linguasphere Register (around 70,000 entries). Subsetting
        information from either or both of these sources would be a
        better basis.

        Comment on 3: this aims to regulate the possible
        combinations of ISO 639 codes with codes from other sources,
        e.g. from ISO 3166: Codes for representation of names of
        countries, from the draft standard ISO 15924: Codes for
        representation of names of scripts, and potentially from
        other standards too, to provide codes such as
        "en US" = "English in the USA", "en CA" =
        "English in Canada", "en US-CA" = "English in the state of
        California"; or "ku Cyrl" = "Kurdish in Cyrillic script", and
        "ku RU Cyrl" = "Kurdish in Russia in Cyrillic script". The
        paper also suggests that standardized mechanisms should be
        developed to specify, e.g. "English in North America" or
        "English in southern California", and possibly to identify
        dialects, and a mechanism to link the ISO 639-2
        code "sgn" = "Sign languages" with other elements in order to
        identify specific sign languages.
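
For illustration, the space-separated notation used in the examples
above can be split into its components by shape alone. This is a
minimal Python sketch under stated assumptions: the notation is taken
only from the examples in this note and is not a standardized syntax,
the names LanguageVariant and parse_variant are hypothetical, and no
element is checked against the actual ISO 639, ISO 3166 or ISO 15924
code lists.

```python
import re
from typing import NamedTuple, Optional


class LanguageVariant(NamedTuple):
    language: str                      # ISO 639 style, e.g. "en", "sgn"
    country: Optional[str] = None      # ISO 3166 style, e.g. "US"
    subdivision: Optional[str] = None  # e.g. "CA" in "US-CA"
    script: Optional[str] = None       # ISO 15924 style, e.g. "Cyrl"


# Shapes assumed from the examples only; not validated against code lists.
_LANG = re.compile(r"[a-z]{2,3}")
_REGION = re.compile(r"([A-Z]{2})(?:-([A-Z0-9]{1,3}))?")
_SCRIPT = re.compile(r"[A-Z][a-z]{3}")


def parse_variant(text: str) -> LanguageVariant:
    """Split e.g. 'ku RU Cyrl' into language, country and script."""
    parts = text.split()
    if not parts or not _LANG.fullmatch(parts[0]):
        raise ValueError(f"no language code at start of {text!r}")
    country = subdivision = script = None
    for part in parts[1:]:
        region = _REGION.fullmatch(part)
        if region and country is None:
            country, subdivision = region.group(1), region.group(2)
        elif _SCRIPT.fullmatch(part) and script is None:
            script = part
        else:
            raise ValueError(f"unrecognized element {part!r} in {text!r}")
    return LanguageVariant(parts[0], country, subdivision, script)
```

Under these assumptions, parse_variant("en US-CA") yields language
"en", country "US" and subdivision "CA", while parse_variant("ku Cyrl")
yields language "ku" and script "Cyrl" with no country.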

        The possibility of adding codes for groups of languages
        would also be investigated: currently ISO 639-2 covers such
        groups only partially and not systematically.

        NB: in discussion, Canadian and US members of SC22/WG20
        expressed considerable opposition to Item 3 above,
        particularly as N69 announces the intention that this is the
        only part that will be an International Standard. Their
        concerns relate to conformance issues, as the scope of
        ISO 639 is very general in practice, and affects other areas
        besides terminology.

        John Clews agreed to pass information about North American
        contacts in ISO/TC37/SC2/WG1 to the US delegates to WG20, so
        that their concerns could be expressed.



Best regards

John Clews

--
John Clews, SESAME Computer Projects, 8 Avenue Rd, Harrogate, HG2 7PG
tel: +44 1423 888 432; fax: +44 1423 889061;
Email: Ordering@sesame.demon.co.uk

Committee Chair of ISO/TC46/SC2: Conversion of Written Languages;
Committee Member of ISO/IEC/JTC1/SC22/WG20: Internationalization;
Committee Member of ISO/TC37: Terminology