From: Philippe Verdy (email@example.com)
Date: Mon May 16 2005 - 15:22:57 CDT
From: "David Starner" <firstname.lastname@example.org>
> On 5/14/05, Jonathan Coxhead <email@example.com> wrote:
>> >>Could someone tell me where to find the list of the characters
>> >>belonging to ISO 15924 "Latn" script?
>> Ha ha, you are wrong! firstname.lastname@example.org has already done it.
> Then it's incomplete. I can find a dozen rare characters that Unicode does
> not include (due to Unicode or the nature of the characters) that belong
Are you talking about "Common" characters, which don't belong to any specific
script? There are tons!
1021355 exactly (today, with Unicode 4.1), if we count the 878924 unassigned
code points and the (6400+65534*2) PUA code points; the rest are spaces,
symbols, numbers, punctuation and most formatting controls, plus a few
letters (like the MICRO SIGN, or more significantly DOUBLE-STRUCK ITALIC
CAPITAL D..DOUBLE-STRUCK ITALIC SMALL J) and modifier letters (like PRIME
and GLOTTAL STOP). This number will decrease over time, as characters are
assigned a non-"Common" script property.
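As a quick check of the arithmetic above, here is a minimal sketch that takes
the figures quoted in this message as given (they are not recomputed from the
UCD) and derives how many "Common" code points are actually assigned,
non-private-use characters. The function name is only illustrative.

```python
# Figures quoted in this message for Unicode 4.1 (taken as given, not re-derived).
TOTAL_COMMON = 1_021_355          # all code points counted as "Common"
UNASSIGNED = 878_924              # unassigned code points
PRIVATE_USE = 6_400 + 65_534 * 2  # BMP PUA plus the two supplementary PUA planes


def assigned_common(total, unassigned, private_use):
    """Return the count left after removing unassigned and private-use code points."""
    return total - unassigned - private_use


remaining = assigned_common(TOTAL_COMMON, UNASSIGNED, PRIVATE_USE)
print(PRIVATE_USE)  # 137468
print(remaining)    # 4963
```

So, on these figures, roughly five thousand assigned characters (spaces,
symbols, digits, punctuation and so on) make up the rest of "Common".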
Or about the "Inherited" characters, which adopt the script of the preceding
character they modify (combining characters and some format controls like
ZWJ and ZWNJ)? There are 448 today, including the 256 variation selectors, the
22 MUSICAL SYMBOL COMBINING characters, the 11 ARABIC combining accents, the
combining ARABIC LETTER SUPERSCRIPT ALEF, and the 2 COMBINING
KATAKANA-HIRAGANA (SEMI-)VOICED SOUND MARKs.
> It's not a closed set, either.
Yes, but in a given version of Unicode/ISO/IEC 10646, it is closed within
the set of assigned characters.
I would define ISO 15924 "Latn" simply as what Unicode classifies as "Latin",
"Common" or "Inherited" in the "Scripts.txt" file of the UCD.
Likewise, I would define ISO 15924 "Arab" as Unicode "Arabic", "Common" or
"Inherited".
If this is wrong, then something is missing from the ISO 15924 documentation:
the list of characters that are considered part of each script.
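The definition proposed above (a script's own characters plus "Common" and
"Inherited") can be sketched as follows. This is only an illustration: the
SAMPLE lines are a small, simplified excerpt in the Scripts.txt line format,
not the real file, and the names parse_scripts, in_script_set and
ISO15924_TO_UNICODE are hypothetical helpers, not part of any standard API.

```python
import re

# Abridged, simplified sample in the Scripts.txt line format (not the full file;
# the real file groups some of these code points into different ranges).
SAMPLE = """\
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0020          ; Common # Zs       SPACE
0030..0039    ; Common # Nd  [10] DIGIT ZERO..DIGIT NINE
0300..036F    ; Inherited # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
0391..03A1    ; Greek # L&  [17] GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LETTER RHO
0627          ; Arabic # Lo       ARABIC LETTER ALEF
"""

# "XXXX ; Script" or "XXXX..YYYY ; Script"; anything after the script name is ignored.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

# Hypothetical mapping from ISO 15924 codes to Unicode script property values.
ISO15924_TO_UNICODE = {"Latn": "Latin", "Arab": "Arabic"}


def parse_scripts(text):
    """Return a list of (first, last, script) tuples from Scripts.txt-style text."""
    ranges = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            first = int(match.group(1), 16)
            last = int(match.group(2) or match.group(1), 16)
            ranges.append((first, last, match.group(3)))
    return ranges


def in_script_set(code_point, ranges, iso_code):
    """True if code_point is in the script itself, or is Common or Inherited."""
    accepted = {ISO15924_TO_UNICODE[iso_code], "Common", "Inherited"}
    return any(first <= code_point <= last and script in accepted
               for first, last, script in ranges)


ranges = parse_scripts(SAMPLE)
print(in_script_set(ord("A"), ranges, "Latn"))  # True: Latin letter
print(in_script_set(0x0301, ranges, "Arab"))    # True: Inherited combining mark
print(in_script_set(0x0391, ranges, "Latn"))    # False: Greek letter
```

Note that under this definition the "Latn" and "Arab" sets overlap on every
"Common" and "Inherited" character, which is the intended behaviour here.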
However, written languages may still use characters from scripts other than
their default one (it's common to see Greek letters in many languages
written with scripts other than Greek, and even in dictionary entries that
reference chemical names with mixed scripts in the same compound word).
So having a strict definition of the characters that make up a script is
probably impossible without being at least a bit arbitrary.
What is more useful is to know the list of letters used in words or phrases
of actual languages for their normal, complete (unabbreviated?)
orthographies. For that, I will consult the informative CLDR database, once
it is completed and hopefully debugged (without the omissions of "rare"
letters used in toponyms and their derived words, and in local personal names).
This is still an issue for example in US English, because there are people
with names imported with their European accents that can't be composed on
normal US keyboards. (How do Americans compose the classic 'é' or 'ñ' for
example? Not many have international keyboards.)