Re: what is Latn?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 16 2005 - 15:22:57 CDT


    From: "David Starner" <prosfilaes@gmail.com>
    > On 5/14/05, Jonathan Coxhead <jonathan@doves.demon.co.uk> wrote:
    >> >>Could someone tell me where to find the list of the characters
    >> >>belonging to ISO 15924 "Latn" script?
    >>
    >> Ha ha, you are wrong! chris.jacobs@hetnet.nl has already done it.
    >
    > Then it's incomplete. I can find a dozen rare characters that Unicode does
    > not include (due to Unicode or the nature of the characters) that belong
    > to
    > Latn.

    Are you talking about the "Common" characters that don't belong to a
    specific script? There are tons!

    1021355 exactly (today, with Unicode 4.1), if we count the 878924 unassigned
    code points and the (6400+65534*2) PUA code points; the rest are spaces,
    symbols, numbers, punctuation and most formatting controls, plus a few
    letters (like the MICRO SIGN, or more significantly DOUBLE-STRUCK ITALIC
    CAPITAL D..DOUBLE-STRUCK ITALIC SMALL J) and modifier letters (like PRIME
    and GLOTTAL STOP). This number will decrease over time, as characters are
    assigned a non-"Common" script property.
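
    Taking the figures above at face value, the breakdown can be checked with a
    little arithmetic. The sketch below is only illustrative: the variable and
    function names are made up here, and the counts are the ones quoted in this
    message, not re-derived from the UCD.

```python
# Figures as quoted in the message (Unicode 4.1); not re-derived from the UCD.
total_common = 1_021_355          # code points counted as "Common"
unassigned = 878_924              # unassigned code points
private_use = 6_400 + 65_534 * 2  # BMP private use area plus planes 15 and 16


def remaining_common(total, unassigned_count, pua_count):
    """Code points left for spaces, symbols, numbers, punctuation, etc."""
    return total - unassigned_count - pua_count


print("private use:", private_use)
print("other Common:", remaining_common(total_common, unassigned, private_use))
```

    The second figure is what remains for the spaces, symbols, numbers,
    punctuation, format controls and the few letters mentioned above.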

    Or about the "Inherited" characters, which adopt the script of the
    character they modify (combining characters and some format controls like
    ZWJ and ZWNJ)? There are 448 today, including the 256 variation selectors,
    the 22 MUSICAL SYMBOL COMBINING characters, the 11 ARABIC combining accents,
    the combining ARABIC LETTER SUPERSCRIPT ALEF, and the 2 COMBINING
    HIRAGANA-KATAKANA (SEMI-)VOICED SOUND MARKs.

    > It's not a closed set, either.

    Yes, but in a given version of Unicode/ISO/IEC 10646, it is closed within
    the set of assigned characters.

    I would define ISO 15924 "Latn" simply as what Unicode classifies as "Latin",
    "Common" or "Inherited" in the "Scripts.txt" file of the UCD.

    Likewise, I would define ISO 15924 "Arab" as Unicode "Arabic", "Common" or
    "Inherited".

    If this is wrong, then something is left undocumented in ISO 15924
    (the list of characters that are considered part of each script).
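
    The definition proposed above can be sketched as a small lookup built from
    "Scripts.txt"-style lines. This is a rough illustration, not a reference
    implementation: the sample lines only imitate the file's format, the names
    parse_scripts, in_iso15924 and ISO_TO_UCD are invented for this example,
    and treating unlisted code points as "Common" is an assumption that
    follows the counting used earlier in this message.

```python
import re

# Illustrative lines in the Scripts.txt layout ("range ; Script # comment").
# They imitate the format only; they are not copied from the real file.
SAMPLE = """\
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0020          ; Common # Zs       SPACE
00B5          ; Common # L&       MICRO SIGN
0301          ; Inherited # Mn    COMBINING ACUTE ACCENT
03B1..03C9    ; Greek # L&  [25] GREEK SMALL LETTER ALPHA..GREEK SMALL LETTER OMEGA
0627          ; Arabic # Lo       ARABIC LETTER ALEF
"""

# Hypothetical mapping from ISO 15924 codes to UCD script names.
ISO_TO_UCD = {"Latn": "Latin", "Arab": "Arabic", "Grek": "Greek"}

LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_scripts(text):
    """Return a dict mapping each listed code point to its script name."""
    scripts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        match = LINE_RE.match(line)
        if not match:
            continue  # blank or comment-only line
        first = int(match.group(1), 16)
        last = int(match.group(2) or match.group(1), 16)
        for code_point in range(first, last + 1):
            scripts[code_point] = match.group(3)
    return scripts


def in_iso15924(char, iso_code, scripts):
    """True if char is in the named script, or is Common or Inherited."""
    # Assumption: code points not listed are treated as "Common".
    script = scripts.get(ord(char), "Common")
    return script in {ISO_TO_UCD[iso_code], "Common", "Inherited"}


scripts = parse_scripts(SAMPLE)
for ch in ["A", "\u00b5", "\u0301", "\u03b1", "\u0627"]:
    print(f"U+{ord(ch):04X}", in_iso15924(ch, "Latn", scripts))
```

    With this definition the MICRO SIGN and the combining acute accent count
    as part of both "Latn" and "Arab", while a Greek alpha belongs to neither.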

    However, written languages may still use characters from scripts other than
    their default one (it's common to see Greek letters in many languages
    written with scripts other than Greek, and even in dictionary entries that
    reference chemical names with mixed scripts in the same compound word).

    So having a strict definition of characters that make a script is probably
    impossible without being at least a bit arbitrary.

    What is more useful is to know the list of letters used in words or phrases
    of actual languages in their normal, complete (unabbreviated?)
    orthographies. For that, I will consult the informative CLDR database, once
    it is complete and hopefully debugged (without the omissions of "rare"
    letters used in toponyms and their derived words, and in local personal
    names).

    This is still an issue for example in US English, because there are people
    with names imported with their European accents that can't be composed on
    normal US keyboards. (How do Americans compose the classic 'é' or 'ñ' for
    example? Not many have international keyboards.)


