L2/01-191

Dotting the i's

Kent Karlsson and Vladas Tumasonis

2001-05-02

This is a proposal to update the SpecialCasing.txt data file in the Unicode Character Database. The current handling in SpecialCasing.txt of dots above lowercase i's and j's during case mapping is not sufficient, in particular for Lithuanian, where an explicit dot above sometimes needs to be introduced. This proposal also attempts a somewhat more systematic treatment of dots above lowercase i's and j's for other languages.

to upper and to title

Normal

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot, then uppercase.

Turkish

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: keep the extra dot, but don't add another one (for the cases below), then uppercase.

 

i -> I-dot

Lithuanian

Any lowercase variant of i or j with an unblocked extra dot above, even if there are more accents above on that base letter: remove the extra dot, then uppercase.
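
The Normal and Lithuanian rules for "to upper and to title" can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the input is taken to be decomposed (NFD-like), "accent above" is taken to mean combining class 230, the base-letter set is borrowed from the AFTER_i definition proposed later in this document, and the Turkish rule is not covered. The names DOTTED_BASES and remove_dot_then_upper are hypothetical, not part of any data file or library.

```python
import unicodedata

DOT_ABOVE = "\u0307"
# Assumption: lowercase bases named under AFTER_i in the proposed text:
# i, j, i-ogonek, i-stroke, i-tilde-below, i-dot-below, j-crosstailed.
DOTTED_BASES = set("ij\u012f\u0268\u1e2d\u1ecb\u029d")


def is_accent_above(ch):
    """Treat combining class 230 marks as 'accents above'."""
    return unicodedata.combining(ch) == 230


def remove_dot_then_upper(text, lithuanian=False):
    """Drop an unblocked extra dot above i/j variants, then uppercase.

    Normal: drop the dot only if no more accents above follow it.
    Lithuanian: drop the dot even if more accents above follow it.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        base = text[i]
        out.append(base)
        i += 1
        if base not in DOTTED_BASES:
            continue
        # Gather the combining marks that follow this base letter.
        j = i
        while j < n and unicodedata.combining(text[j]) != 0:
            j += 1
        marks = text[i:j]
        seen_above = False
        for k, mark in enumerate(marks):
            # A dot is unblocked if no class 230 mark precedes it.
            unblocked_dot = mark == DOT_ABOVE and not seen_above
            if is_accent_above(mark):
                seen_above = True
            if unblocked_dot:
                more_above = any(is_accent_above(m) for m in marks[k + 1:])
                if lithuanian or not more_above:
                    continue  # drop the extra dot
            out.append(mark)
        i = j
    return "".join(out).upper()
```

For example, remove_dot_then_upper("i\u0307\u0301") keeps the dot under the Normal rule because an acute follows it, while passing lithuanian=True drops the dot and leaves only the acute on the capital I.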

to lower

Normal

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot.

 

I-dot (if more accents above) -> i -dot
I -dot (if no more accents above) -> i
J -dot (if no more accents above) -> j

Turkish

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot.

 

I -> i-dotless

Lithuanian

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot.

 

I (if more accents above) -> i -dot
J (if more accents above) -> j -dot
I-ogonek (if more accents above) -> i-ogonek -dot
I-grave -> i -dot -grave
I-acute -> i -dot -acute
I-tilde -> i -dot -tilde
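
The Lithuanian "to lower" mappings listed above can be sketched in Python as follows. This is a rough illustration, not a conformant implementation: it assumes "more accents above" means that a combining class 230 mark follows within the same combining sequence, it handles only the six mappings listed, and it falls back to Python's built-in per-character lowercasing for everything else. The names LT_BASES, LT_PRECOMPOSED and lt_lower are hypothetical.

```python
import unicodedata

DOT_ABOVE = "\u0307"
# Capital letters that gain an explicit dot when more accents above follow.
LT_BASES = {"I": "i", "J": "j", "\u012e": "\u012f"}  # I, J, I-ogonek
# Precomposed capitals: I-grave, I-acute, I-tilde.
LT_PRECOMPOSED = {
    "\u00cc": "i\u0307\u0300",
    "\u00cd": "i\u0307\u0301",
    "\u0128": "i\u0307\u0303",
}


def more_accents_above(text, pos):
    """True if a class 230 mark follows text[pos] in its combining sequence."""
    for ch in text[pos + 1:]:
        cc = unicodedata.combining(ch)
        if cc == 0:
            return False  # reached the next base character
        if cc == 230:
            return True
    return False


def lt_lower(text):
    """Lowercase text, introducing an explicit dot above where listed."""
    out = []
    for pos, ch in enumerate(text):
        if ch in LT_PRECOMPOSED:
            out.append(LT_PRECOMPOSED[ch])
        elif ch in LT_BASES and more_accents_above(text, pos):
            out.append(LT_BASES[ch] + DOT_ABOVE)
        else:
            out.append(ch.lower())
    return "".join(out)
```

With this sketch, a capital I followed by a combining acute lowercases to i, dot above, acute, while a plain capital I, or one followed only by an ogonek, lowercases without an added dot.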

 

Suggested changes to SpecialCasing.txt regarding dotting i's and j's

 

The following are edits to capture the short informal descriptions above.

Old lines (to remove)

1st-------------------
# characters where they are 1-1, and does not have locale-specific mappings.)
2nd-------------------
# The <condition_list> is optional. Where present, it consists of one or more locales or contexts,
# separated by spaces.
3rd-------------------
# A locale is defined as:
# <locale> := <ISO_639_code> ( "_" <ISO_3166_code> ( "_" <variant> )? )?
# <ISO_3166_code> := 2-letter ISO country code,
# <ISO_639_code> := 2-letter ISO language code
4th-------------------
# A context is one of the following choices:
5th-------------------
# AFTER_i: The last base character was "i" 0069
6th-------------------
7th-------------------
# ================================================================================
# Locale-sensitive mappings
# ================================================================================
# Lithuanian
0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after "i" with upper or titlecase
# Turkish, Azeri
0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
end-------------------
 

 

New lines (to insert, replacing the old ones listed above)

1st-------------------
# characters where they are 1-1, and does not have language-specific mappings.)
2nd-------------------
# The <condition_list> is optional. Where present, it consists of one or more
# contexts, one of which may be a language code, separated by spaces.
3rd-------------------
# A _subset_ of RFC 3066 conforming language codes, _sufficient for this file_,
# can be described as:
# <langcode> := two-letter ISO 639-1 language code
4th-------------------
# A context is a <langcode> or one of the following choices (test on original string):
5th-------------------
# AFTER_i: The last preceding base character was "i" (0069), i-ogonek (012F),
# i-stroke (0268), i-tilde-below (1E2D), i-dot-below (1ECB), "j" (006A),
# or j-crosstailed (029D), and no combining character class 230 has intervened.
# AFTER_CAP_I: The last preceding base character was "I" (0049), I-ogonek (012E),
# I-stroke (0197), I-tilde-below (1E2C), I-dot-below (1ECA), or "J" (004A) and
# no combining character class 230 has intervened.
# MORE_ACCENTS_ABOVE: The current combining sequence has at least one class 230
# combining character after the currently considered character.
6th-------------------[no old text]
# Normal dotting/undotting of i's and j's (capital and small):
# Remove explicit dot above capital i or j when lowercasing, if no more accents above:
0307; ; 0307; 0307; AFTER_CAP_I NON_MORE_ACCENTS_ABOVE; # COMBINING DOT ABOVE
# Remove explicit dot above small i or j when case mapping, if no more accents above:
0307; ; ; ; AFTER_i NON_MORE_ACCENTS_ABOVE; # COMBINING DOT ABOVE
0130; 0069 0307; 0130; 0130; MORE_ACCENTS_ABOVE; # LATIN CAPITAL LETTER I WITH DOT ABOVE
# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
7th-------------------
# ================================================================================
# Language-sensitive mappings
# ================================================================================
# Lithuanian
# Remove dot above small i's or j's when uppercasing, even if there are more accents above:
0307; 0307; ; ; lt AFTER_i; # COMBINING DOT ABOVE
# Introduce an explicit dot above when lowercasing capital I's and J's
# if there are more accents above:
0049; 0069 0307; 0049; 0049; lt MORE_ACCENTS_ABOVE; # LATIN CAPITAL LETTER I
004A; 006A 0307; 004A; 004A; lt MORE_ACCENTS_ABOVE; # LATIN CAPITAL LETTER J
012E; 012F 0307; 012E; 012E; lt MORE_ACCENTS_ABOVE; # LATIN CAPITAL LETTER I WITH OGONEK
# Other precomposed capital i's without accents above are skipped here since they do not
# occur in Lithuanian (this creates a case mapping difference between NFD and NFC strings).
00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE
0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE
# Other precomposed capital i's and j's with accents above are skipped here since they do not
# occur in Lithuanian (this creates a case mapping difference between NFD and NFC strings).
 
# Turkish, Azeri
# Remove dot above small i's when lowercasing, if no more accents above:
0307; ; 0307; 0307; tr AFTER_i NON_MORE_ACCENTS_ABOVE; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az AFTER_i NON_MORE_ACCENTS_ABOVE; # COMBINING DOT ABOVE
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
end-------------
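
As an illustration of how the proposed entries and the AFTER_i context might be consumed, the following Python sketch parses one data line into its fields and tests the AFTER_i condition on a string. It is a hedged sketch only: the field layout (code; lower; title; upper; condition list; # comment) is inferred from the lines quoted above, the parser accepts a condition list with or without a terminating semicolon, and the names parse_line, after_i and AFTER_I_BASES are hypothetical.

```python
import unicodedata

# Assumption: the base characters named in the proposed AFTER_i definition.
AFTER_I_BASES = set("ij\u012f\u0268\u1e2d\u1ecb\u029d")


def hex_to_text(field):
    """Convert a field such as '0069 0307' to the string it denotes."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_line(line):
    """Parse one data line; return None for comments and separators."""
    data, _, comment = line.partition("#")
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 4 or not fields[0]:
        return None
    # The fifth field, if present, holds space-separated contexts.
    conditions = fields[4].split() if len(fields) > 4 else []
    return {
        "code": hex_to_text(fields[0]),
        "lower": hex_to_text(fields[1]),
        "title": hex_to_text(fields[2]),
        "upper": hex_to_text(fields[3]),
        "conditions": conditions,
        "comment": comment.strip(),
    }


def after_i(text, pos):
    """AFTER_i: the last preceding base character is one of the listed
    i/j variants and no class 230 mark has intervened."""
    for ch in reversed(text[:pos]):
        cc = unicodedata.combining(ch)
        if cc == 230:
            return False  # an accent above intervenes
        if cc == 0:
            return ch in AFTER_I_BASES
    return False
```

For instance, parsing the proposed Lithuanian line for LATIN CAPITAL LETTER I yields a lowercase mapping of i followed by U+0307 with the conditions lt and MORE_ACCENTS_ABOVE, and after_i("i\u0307", 1) holds because the dot directly follows a plain i.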