L2/01-191
Dotting the i’s
Kent Karlsson and Vladas Tumasonis
2001-05-02
This is a proposal to update the SpecialCasing.txt data file in the Unicode Character Database. The current handling of dots above lowercase i’s and j’s in the SpecialCasing.txt case mappings is not sufficient, in particular for Lithuanian, where an explicit dot above sometimes needs to be introduced. This proposal also attempts a somewhat more systematic treatment of dots above lowercase i’s and j’s for other languages.
to upper and to title
Normal
Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot, then uppercase.
Turkish
Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: keep the extra dot, but don’t add another one (for the cases below), then uppercase.
| i | I-dot |
Lithuanian
Any lowercase variant of i or j with an unblocked extra dot above, even if there are more accents above on that base letter: remove the extra dot, then uppercase.
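The Normal and Lithuanian uppercasing rules above can be sketched in Python. This is only an illustration under stated assumptions, not part of the proposed data: the function names `upper_with_dot_rules` and `_drop_dot` are hypothetical, the input is normalized to NFD first so that precomposed letters such as i-ogonek are seen as a base letter plus marks, "unblocked" is read as "no class 230 mark precedes the dot on that base", and the Turkish rule is left out.

```python
import unicodedata

DOT_ABOVE = "\u0307"
# Assumption: after NFD, these bases cover the small i/j variants discussed here.
SMALL_I_J_BASES = {"i", "j", "\u0268", "\u029D"}


def _drop_dot(marks, lang):
    """Remove an unblocked dot above from a list of combining marks."""
    for k, mark in enumerate(marks):
        if unicodedata.combining(mark) != 230:
            continue  # marks below or attached do not block the dot
        if mark != DOT_ABOVE:
            return marks  # another accent above comes first: dot is blocked
        more_above = any(unicodedata.combining(m) == 230 for m in marks[k + 1:])
        # Normal: remove only if no more accents above. Lithuanian: always.
        if lang == "lt" or not more_above:
            return marks[:k] + marks[k + 1:]
        return marks
    return marks


def upper_with_dot_rules(text, lang=""):
    """Uppercase text, applying the Normal or Lithuanian dot rule."""
    s = unicodedata.normalize("NFD", text)
    out = []
    i, n = 0, len(s)
    while i < n:
        base = s[i]
        j = i + 1
        while j < n and unicodedata.combining(s[j]) != 0:
            j += 1
        marks = list(s[i + 1:j])
        if base in SMALL_I_J_BASES:
            marks = _drop_dot(marks, lang)
        out.append(base.upper())
        out.extend(marks)
        i = j
    return unicodedata.normalize("NFC", "".join(out))
```

With this sketch, i + dot above + acute keeps its dot under the Normal rule but loses it when `lang` is `"lt"`, which is the difference between the two descriptions.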
to lower
Normal
Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot.
| I-dot (if more accents above) | i -dot |
| I -dot (if no more accents above) | i |
| J -dot (if no more accents above) | j |
Turkish
Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot.
| I | i-dotless |
Lithuanian
Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot.
| I (if more accents above) | i -dot |
| J (if more accents above) | j -dot |
| I-ogonek (if more accents above) | i-ogonek -dot |
| I-grave | i -dot -grave |
| I-acute | i -dot -acute |
| I-tilde | i -dot -tilde |
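The lowercasing rules above can be sketched the same way. Again this is a hedged illustration rather than the proposed data: `lower_with_dot_rules` is a hypothetical name, the input is normalized to NFD (so I-dot, I-grave and I-ogonek are all seen as the base I plus marks), and only the plain bases I, J, i and j are handled.

```python
import unicodedata

DOT_ABOVE = "\u0307"
ABOVE = 230  # canonical combining class of accents placed above


def lower_with_dot_rules(text, lang=""):
    """Lowercase text with the Normal, Turkish/Azeri or Lithuanian dot rule."""
    s = unicodedata.normalize("NFD", text)
    out = []
    i, n = 0, len(s)
    while i < n:
        base = s[i]
        j = i + 1
        while j < n and unicodedata.combining(s[j]) != 0:
            j += 1
        marks = list(s[i + 1:j])
        lowered = base.lower()
        if base in ("I", "J", "i", "j"):
            above = [k for k, m in enumerate(marks)
                     if unicodedata.combining(m) == ABOVE]
            # The dot is unblocked only if it is the first accent above.
            has_dot = bool(above) and marks[above[0]] == DOT_ABOVE
            if has_dot and len(above) == 1:
                # Explicit dot and no more accents above: remove the dot.
                del marks[above[0]]
            elif lang == "lt" and above and not has_dot and base in ("I", "J"):
                # Lithuanian: introduce a dot before the other accents above.
                marks.insert(above[0], DOT_ABOVE)
            elif lang in ("tr", "az") and base == "I" and not has_dot:
                # Turkish, Azeri: undotted capital I maps to dotless i.
                lowered = "\u0131"
        out.append(lowered)
        out.extend(marks)
        i = j
    return unicodedata.normalize("NFC", "".join(out))
```

For example, `lower_with_dot_rules("\u00CD", "lt")` yields i + dot above + acute, matching the I-acute row of the Lithuanian table, while the same call without a language code yields plain i-acute.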
Suggested changes to SpecialCasing.txt regarding dotting i’s and j’s
The following are edits to capture the short informal descriptions above.
Old lines (to remove)
1st-------------------
# characters where they are 1-1, and does not have locale-specific mappings.)
2nd-------------------
# The <condition_list> is optional. Where present, it consists of one or more locales or contexts,
# separated by spaces.
3rd-------------------
# A locale is defined as:
# <locale> := <ISO_639_code> ( "_" <ISO_3166_code> ( "_" <variant> )? )?
# <ISO_3166_code> := 2-letter ISO country code,
# <ISO_639_code> := 2-letter ISO language code
4th-------------------
# A context is one of the following choices:
5th-------------------
# AFTER_i: The last base character was "i" 0069
6th-------------------
7th-------------------
# ================================================================================
# Locale-sensitive mappings
# ================================================================================
# Lithuanian
0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after "i" with upper or titlecase
# Turkish, Azeri
0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
end-------------------
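To make the record layout of the lines above concrete, here is a minimal Python sketch of how one such line could be read. It assumes the usual SpecialCasing.txt field order (code; lower; title; upper; optional condition list; comment); `parse_record` and the dictionary keys are hypothetical names chosen for this illustration.

```python
def parse_record(line):
    """Split one SpecialCasing-style line into its fields.

    Returns None for comment-only, separator, or malformed lines.
    """
    data, _, comment = line.partition("#")
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 4 or not fields[0]:
        return None

    def to_text(field):
        # A field holds zero or more space-separated hexadecimal code points.
        return "".join(chr(int(cp, 16)) for cp in field.split())

    # The fifth field, when present, is the space-separated condition list.
    conditions = fields[4].split() if len(fields) > 4 else []
    return {
        "code": to_text(fields[0]),
        "lower": to_text(fields[1]),
        "title": to_text(fields[2]),
        "upper": to_text(fields[3]),
        "conditions": conditions,
        "comment": comment.strip(),
    }
```

Applied to the old Lithuanian line, this gives an empty title and upper mapping for U+0307 with the conditions `lt` and `AFTER_i`, that is, the dot is deleted when upper- or titlecasing after "i".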
New lines (to insert, replacing the old ones listed above)
1st-------------------
# characters where they are 1-1, and does not have language-specific mappings.)
2nd-------------------
# The <condition_list> is optional. Where present, it consists of one or more
# contexts, one of which may be a language code, separated by spaces.
3rd-------------------
# A _subset_ of RFC 3066 conforming language codes, _sufficient for this file_,
# can be described as:
# <langcode> := two-letter ISO 639-1 language code
4th-------------------
# A context is a <langcode> or one of the following choices (test on original string):
5th-------------------
# AFTER_i: The last preceding base character was "i" (0069), i-ogonek (012F),
# i-stroke (0268), i-tilde-below (1E2D), i-dot-below (1ECB), "j" (006A),
# or j-crosstailed (029D), and no combining character class 230 has intervened.
# AFTER_CAP_I: The last preceding base character was "I" (0049), I-ogonek (012E),
# I-stroke (0197), I-tilde-below (1E2C), I-dot-below (1ECA), or "J" (004A) and
# no combining character class 230 has intervened.
# MORE_ACCENTS_ABOVE: The current combining sequence has at least one class 230
# combining character after the currently considered character.
6th-------------------[no old text]
# Normal dotting/undotting of i's and j's (capital and small):
# Remove explicit dot above capital i or j when lowercasing, if no more accents above:
0307; ; 0307; 0307; AFTER_CAP_I NON_MORE_ACCENTS_ABOVE; # COMBINING DOT ABOVE
# Remove explicit dot above small i or j when case mapping, if no more accents above:
0307; ; ; ; AFTER_i NON_MORE_ACCENTS_ABOVE; # COMBINING DOT ABOVE
0130; 0069 0307; 0130; 0130; MORE_ACCENTS_ABOVE; # LATIN CAPITAL LETTER I WITH DOT ABOVE
# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
7th-------------------
# ================================================================================
# Language-sensitive mappings
# ================================================================================
# Lithuanian
# Remove dot above small i's or j's when uppercasing, even if there are more accents above:
0307; 0307; ; ; lt AFTER_i; # COMBINING DOT ABOVE
# Introduce an explicit dot above when lowercasing capital I's and J's
# if there are more accents above:
0049; 0069 0307; 0049; 0049; lt MORE_ACCENTS_ABOVE; # LATIN CAPITAL LETTER I
004A; 006A 0307; 004A; 004A; lt MORE_ACCENTS_ABOVE; # LATIN CAPITAL LETTER J
012E; 012F 0307; 012E; 012E; lt MORE_ACCENTS_ABOVE; # LATIN CAPITAL LETTER I WITH OGONEK
# Other precomposed capital i's without accents above are skipped here since they do not
# occur in Lithuanian (this creates a case mapping difference between NFD and NFC strings).
00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE
0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE
# Other precomposed capital i's and j's with accents above are skipped here since they do not
# occur in Lithuanian (this creates a case mapping difference between NFD and NFC strings).
# Turkish, Azeri
# Remove dot above small i's when lowercasing, if no more accents above:
0307; ; 0307; 0307; tr AFTER_i NON_MORE_ACCENTS_ABOVE; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az AFTER_i NON_MORE_ACCENTS_ABOVE; # COMBINING DOT ABOVE
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
end-------------
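The three context tests defined in the new lines (AFTER_i, AFTER_CAP_I and MORE_ACCENTS_ABOVE) can be sketched as predicates over a string and a position. This is one possible reading of the definitions, not a normative implementation: the function names are hypothetical, `pos` is assumed to index the character currently being mapped in the original string, and the combining class comes from Python's `unicodedata.combining`.

```python
import unicodedata

# Base characters listed in the AFTER_i and AFTER_CAP_I definitions.
AFTER_I_BASES = {0x0069, 0x012F, 0x0268, 0x1E2D, 0x1ECB, 0x006A, 0x029D}
AFTER_CAP_I_BASES = {0x0049, 0x012E, 0x0197, 0x1E2C, 0x1ECA, 0x004A}


def _after_base(s, pos, bases):
    """True if the last preceding base is in `bases` with no class 230 between."""
    k = pos - 1
    while k >= 0:
        ccc = unicodedata.combining(s[k])
        if ccc == 0:
            return ord(s[k]) in bases  # reached the last preceding base
        if ccc == 230:
            return False  # a combining character of class 230 has intervened
        k -= 1
    return False  # no preceding base character at all


def after_i(s, pos):
    return _after_base(s, pos, AFTER_I_BASES)


def after_cap_i(s, pos):
    return _after_base(s, pos, AFTER_CAP_I_BASES)


def more_accents_above(s, pos):
    """True if a class 230 mark follows s[pos] in the same combining sequence."""
    k = pos + 1
    while k < len(s) and unicodedata.combining(s[k]) != 0:
        if unicodedata.combining(s[k]) == 230:
            return True
        k += 1
    return False
```

Note that an ogonek (class 202) or a dot below (class 220) between the base and the dot does not make `after_i` fail, whereas an acute (class 230) does.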