L2/01-191
Dotting the i's
Kent Karlsson and Vladas Tumasonis
2001-05-02
This is a proposal to update the SpecialCasing.txt data file in the Unicode Character Database. The current handling of dots above lowercase i's and j's in SpecialCasing.txt for case mapping is not sufficient, in particular for Lithuanian, where an explicit dot above sometimes needs to be introduced. This proposal also attempts a somewhat more systematic treatment of dots above lowercase i's and j's for other languages.
to upper and to title
Normal
Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot, then uppercase.
Turkish
Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: keep the extra dot, but don't add another one (for the cases below), then uppercase.
i | I-dot
Lithuanian
Any lowercase variant of i or j with an unblocked extra dot above, even if there are more accents above on that base letter: remove the extra dot, then uppercase.
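The three uppercasing rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the proposed data: the function name to_upper and the constant names are hypothetical, the input is decomposed to NFD first, "accent above" is taken to mean canonical combining class 230, and the result is recomposed to NFC.

```python
import unicodedata

DOT_ABOVE = "\u0307"
# Small i/j variants (assumption: the same letters this proposal lists for AFTER_i).
SMALL_I_OR_J = {"\u0069", "\u012F", "\u0268", "\u1E2D", "\u1ECB", "\u006A", "\u029D"}


def to_upper(text, lang=""):
    """Uppercase text, treating an explicit dot above i/j as described above."""
    s = unicodedata.normalize("NFD", text)
    out = []
    pos = 0
    while pos < len(s):
        # Collect one combining sequence: a base character plus its marks.
        end = pos + 1
        while end < len(s) and unicodedata.combining(s[end]) != 0:
            end += 1
        base, marks = s[pos], list(s[pos + 1:end])
        pos = end
        # Positions of marks that sit above the base (combining class 230).
        above = [k for k, m in enumerate(marks) if unicodedata.combining(m) == 230]
        # The dot is "unblocked" when it is the first class-230 mark.
        has_dot = bool(above) and marks[above[0]] == DOT_ABOVE
        more_above = len(above) > 1
        if base in SMALL_I_OR_J:
            if lang in ("tr", "az"):
                # Turkish/Azeri: keep an existing dot; plain i gains one (i -> I-dot).
                if base == "i" and not has_dot:
                    marks.insert(0, DOT_ABOVE)
            elif has_dot and (lang == "lt" or not more_above):
                # Normal: drop the dot only if nothing else is above.
                # Lithuanian: drop it even if more accents follow.
                del marks[above[0]]
        out.append(base.upper() + "".join(marks))
    return unicodedata.normalize("NFC", "".join(out))
```

For example, to_upper("i\u0307") gives "I", while to_upper("i", "tr") gives U+0130 (I-dot).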
to lower
Normal
Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot.
I-dot (if more accents above) | i -dot
I -dot (if no more accents above) | i
J -dot (if no more accents above) | j
Turkish
Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot.
I | i-dotless
Lithuanian
Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot.
I (if more accents above) | i -dot
J (if more accents above) | j -dot
I-ogonek (if more accents above) | i-ogonek -dot
I-grave | i -dot -grave
I-acute | i -dot -acute
I-tilde | i -dot -tilde
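The lowercasing tables above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a definitive implementation: the name to_lower is hypothetical, the input is decomposed to NFD so that I-grave, I-ogonek and I-dot are seen as I plus marks, "accent above" is taken to mean canonical combining class 230, and the output is left decomposed.

```python
import unicodedata

DOT_ABOVE = "\u0307"
DOTLESS_I = "\u0131"


def to_lower(text, lang=""):
    """Lowercase text, adding or removing the explicit dot above i/j."""
    s = unicodedata.normalize("NFD", text)
    out = []
    pos = 0
    while pos < len(s):
        # Collect one combining sequence: a base character plus its marks.
        end = pos + 1
        while end < len(s) and unicodedata.combining(s[end]) != 0:
            end += 1
        base, marks = s[pos], list(s[pos + 1:end])
        pos = end
        above = [k for k, m in enumerate(marks) if unicodedata.combining(m) == 230]
        has_dot = bool(above) and marks[above[0]] == DOT_ABOVE
        lowered = base.lower()
        if base in ("I", "J"):
            if has_dot and len(above) == 1:
                # Explicit dot and nothing else above: plain i or j.
                del marks[above[0]]
            elif has_dot:
                # Explicit dot with more accents above: keep the dot.
                pass
            elif lang in ("tr", "az") and base == "I":
                # Turkish/Azeri: undotted capital I maps to dotless i.
                lowered = DOTLESS_I
            elif lang == "lt" and above:
                # Lithuanian: introduce an explicit dot below the other accents.
                marks.insert(above[0], DOT_ABOVE)
        out.append(lowered + "".join(marks))
    return "".join(out)
```

For example, to_lower("\u00CC", "lt") yields i, COMBINING DOT ABOVE, COMBINING GRAVE ACCENT, matching the I-grave row of the Lithuanian table.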
Suggested changes to SpecialCasing.txt regarding dotting i's and j's
The following are edits to capture the short informal descriptions above.
Old lines (to remove)
1st-------------------
# characters where they are 1-1, and does not have locale-specific mappings.)
2nd-------------------
# The <condition_list> is optional. Where present, it consists of one or more locales or contexts,
# separated by spaces.
3rd-------------------
# A locale is defined as:
# <locale> := <ISO_639_code> ( "_" <ISO_3166_code> ( "_" <variant> )? )?
# <ISO_3166_code> := 2-letter ISO country code,
# <ISO_639_code> := 2-letter ISO language code
4th-------------------
# A context is one of the following choices:
5th-------------------
# AFTER_i: The last base character was "i" 0069
6th-------------------
7th-------------------
# ================================================================================
# Locale-sensitive mappings
# ================================================================================
# Lithuanian
0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after "i" with upper or titlecase
# Turkish, Azeri
0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
end-------------------
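The entries listed above follow the semicolon-separated layout of SpecialCasing.txt: code point, lowercase, titlecase and uppercase mappings, an optional condition list, then a comment. As a reading aid, here is a minimal Python sketch of a parser for one such line. The names Entry and parse_entry are hypothetical, and the sketch assumes that a "#" only ever starts a comment.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    code: str               # the character being mapped
    lower: str              # lowercase mapping (may be empty)
    title: str              # titlecase mapping (may be empty)
    upper: str              # uppercase mapping (may be empty)
    conditions: List[str]   # language codes and/or contexts
    comment: str


def _decode(field: str) -> str:
    """Turn space-separated hex code points into a string."""
    return "".join(chr(int(h, 16)) for h in field.split())


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one data line; return None for blank or comment-only lines."""
    data, _, comment = line.partition("#")
    if not data.strip():
        return None
    fields = [f.strip() for f in data.split(";")]
    # The condition list (5th field) is optional; pad so unpacking works.
    while len(fields) < 5:
        fields.append("")
    code, lower, title, upper, cond = fields[:5]
    return Entry(_decode(code), _decode(lower), _decode(title),
                 _decode(upper), cond.split(), comment.strip())
```

Applied to the old Lithuanian line, parse_entry returns an entry whose conditions are ["lt", "AFTER_i"] and whose titlecase and uppercase mappings are empty, meaning the dot is removed.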
New lines (to insert, replacing the old ones listed above)
1st-------------------
# characters where they are 1-1, and does not have language-specific mappings.)
2nd-------------------
# The <condition_list> is optional. Where present, it consists of one or more
# contexts, one of which may be a language code, separated by spaces.
3rd-------------------
# A _subset_ of RFC 3066 conforming language codes, _sufficient for this file_,
# can be described as:
# <langcode> := two-letter ISO 639-1 language code
4th-------------------
# A context is a <langcode> or one of the following choices (test on original string):
5th-------------------
# AFTER_i: The last preceding base character was "i" (0069), i-ogonek (012F),
# i-stroke (0268), i-tilde-below (1E2D), i-dot-below (1ECB), "j" (006A),
# or j-crosstailed (029D), and no combining character class 230 has intervened.
# AFTER_CAP_I: The last preceding base character was "I" (0049), I-ogonek (012E),
# I-stroke (0197), I-tilde-below (1E2C), I-dot-below (1ECA), or "J" (004A) and
# no combining character class 230 has intervened.
# MORE_ACCENTS_ABOVE: The current combining sequence has at least one class 230
# combining character after the currently considered character.
6th-------------------[no old text]
# Normal dotting/undotting of i's and j's (capital and small):
# Remove explicit dot above capital i or j when lowercasing, if no more accents above:
0307; ; 0307; 0307; AFTER_CAP_I NON_MORE_ACCENTS_ABOVE; # COMBINING DOT ABOVE
# Remove explicit dot above small i or j when case mapping, if no more accents above:
0307; ; ; ; AFTER_i NON_MORE_ACCENTS_ABOVE; # COMBINING DOT ABOVE
0130; 0069 0307; 0130; 0130; MORE_ACCENTS_ABOVE; # LATIN CAPITAL LETTER I WITH DOT ABOVE
# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
7th-------------------
# ================================================================================
# Language-sensitive mappings
# ================================================================================
# Lithuanian
# Remove dot above small i's or j's when uppercasing, even if there are more accents above:
0307; 0307; ; ; lt AFTER_i; # COMBINING DOT ABOVE
# Introduce an explicit dot above when lowercasing capital I's and J's
# if there are more accents above:
0049; 0069 0307; 0049; 0049; lt MORE_ACCENTS_ABOVE; # LATIN CAPITAL LETTER I
004A; 006A 0307; 004A; 004A; lt MORE_ACCENTS_ABOVE; # LATIN CAPITAL LETTER J
012E; 012F 0307; 012E; 012E; lt MORE_ACCENTS_ABOVE; # LATIN CAPITAL LETTER I WITH OGONEK
# Other precomposed capital i's without accents above are skipped here since they do not
# occur in Lithuanian (this creates a case mapping difference between NFD and NFC strings).
00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE
0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE
# Other precomposed capital i's and j's with accents above are skipped here since they do not
# occur in Lithuanian (this creates a case mapping difference between NFD and NFC strings).
# Turkish, Azeri
# Remove dot above small i's when lowercasing, if no more accents above:
0307; ; 0307; 0307; tr AFTER_i NON_MORE_ACCENTS_ABOVE; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az AFTER_i NON_MORE_ACCENTS_ABOVE; # COMBINING DOT ABOVE
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
end-------------
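The three contexts defined in the 5th block above (AFTER_i, AFTER_CAP_I, MORE_ACCENTS_ABOVE) can be read as predicates over the original string and the position of the character currently being mapped. A minimal Python sketch follows. The function names are hypothetical, and "base character" is assumed here to mean a character with canonical combining class 0.

```python
import unicodedata

# Base letters named in the AFTER_i and AFTER_CAP_I definitions.
SMALL_BASES = {"\u0069", "\u012F", "\u0268", "\u1E2D", "\u1ECB", "\u006A", "\u029D"}
CAPITAL_BASES = {"\u0049", "\u012E", "\u0197", "\u1E2C", "\u1ECA", "\u004A"}


def _after(s, pos, bases):
    """True if the last base before pos is in bases with no class-230 mark between."""
    k = pos - 1
    while k >= 0:
        ccc = unicodedata.combining(s[k])
        if ccc == 0:
            return s[k] in bases   # reached the preceding base character
        if ccc == 230:
            return False           # an accent above has intervened
        k -= 1
    return False


def after_i(s, pos):
    return _after(s, pos, SMALL_BASES)


def after_cap_i(s, pos):
    return _after(s, pos, CAPITAL_BASES)


def more_accents_above(s, pos):
    """True if a class-230 mark follows s[pos] within the same combining sequence."""
    k = pos + 1
    while k < len(s) and unicodedata.combining(s[k]) != 0:
        if unicodedata.combining(s[k]) == 230:
            return True
        k += 1
    return False
```

For instance, for i followed by an ogonek and then a dot above, after_i holds at the dot, because the ogonek is not of class 230. For i followed by an acute and then a dot above it does not, because the acute blocks the dot.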