L2/01-191R

Dotting the i's

Kent Karlsson and Vladas Tumasonis

2001-05-05

This is a proposal to update the SpecialCasing.txt data file in the Unicode Character Database. The current handling of dots above lowercase i's and j's in SpecialCasing.txt is not sufficient for case mapping, in particular for Lithuanian, where an explicit dot above sometimes needs to be introduced. This proposal also attempts a somewhat more systematic treatment of dots above lowercase i's and j's for other languages.

The dots above lowercase i and lowercase j are 'soft' in the sense that they usually disappear upon uppercasing, as well as when accents are placed above the i or j. There are, however, exceptions to this. For these exceptions, where the dot is not 'soft', a 'hard dot above' (U+0307) is the best way to deal with the matter. For Turkish, the soft dot must be hardened for uppercasing (when there are no accents above; otherwise the soft dot is already gone). For Lithuanian, it must be hardened before accenting above, but not for uppercasing.

The tables in the exposition are not complete. The formal tables in the update to SpecialCasing.txt are, however, intended to be complete.


to upper and to title

Normal

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot, then uppercase. This removes any spurious dot above, a dot that is not recommended to be there in the first place.

 

i+dot (no more accents above)         →  I
i-ogonek+dot (no more accents above)  →  I-ogonek [etc.]
j+dot (no more accents above)         →  J
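The normal uppercasing rule above can be sketched in Python as follows. This is only an illustrative sketch, not part of the proposed data file: the function name `upper_normal` is hypothetical, the input is assumed to be processed in NFD, and the result is returned in NFC for convenience.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, combining class 230


def upper_normal(text: str) -> str:
    """Uppercase text, dropping a dot above i or j when it is the only accent above."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    base = ""            # most recent base character
    above_seen = False   # True once a class-230 mark has followed that base
    for pos, ch in enumerate(decomposed):
        ccc = unicodedata.combining(ch)
        if ccc == 0:
            base, above_seen = ch, False
        elif ch == DOT_ABOVE and base in ("i", "j") and not above_seen:
            # Look ahead within the same combining sequence for more accents above.
            more_above = False
            for nxt in decomposed[pos + 1:]:
                nxt_ccc = unicodedata.combining(nxt)
                if nxt_ccc == 0:
                    break
                if nxt_ccc == 230:
                    more_above = True
                    break
            if not more_above:
                continue  # spurious dot: drop it
            above_seen = True
        elif ccc == 230:
            above_seen = True
        kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept).upper())
```

When further accents above follow the dot, the sketch leaves the dot in place, since the rule only applies when there are no more accents above.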

Lithuanian

Any lowercase variant of i or j with an unblocked extra dot above, even if there are more accents above on that base letter: remove the extra dot, then uppercase. 

 

i+dot  →  I
j+dot  →  J
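For comparison, the Lithuanian uppercasing rule drops the unblocked dot regardless of what follows it. A minimal Python sketch, assuming NFD processing and using the hypothetical name `upper_lt`:

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, combining class 230


def upper_lt(text: str) -> str:
    """Uppercase text, always dropping an unblocked dot above i or j."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    base = ""            # most recent base character
    above_seen = False   # True once a class-230 mark has followed that base
    for ch in decomposed:
        ccc = unicodedata.combining(ch)
        if ccc == 0:
            base, above_seen = ch, False
        elif ch == DOT_ABOVE and base in ("i", "j") and not above_seen:
            continue  # drop the dot even if more accents above follow
        elif ccc == 230:
            above_seen = True
        kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept).upper())
```

For example, i + dot above + acute becomes I-acute (U+00CD) under this sketch.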

Turkish

An i with an unblocked extra dot above, if there are no more accents above on that base letter: keep the extra dot, but don't add another one (for the cases below), then uppercase. This, again, takes care of the spurious case where

 

i (no more accents above)      →  I-dot
i+dot (no more accents above)  →  I-dot
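The Turkish uppercasing rows above can be sketched as below. The function name `upper_tr` is hypothetical, and the sketch assumes that text is handled one combining sequence at a time in NFD; when other accents above are present, it falls back to ordinary uppercasing.

```python
import unicodedata

DOT_ABOVE = "\u0307"           # COMBINING DOT ABOVE, combining class 230
CAPITAL_I_WITH_DOT = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE


def upper_tr(text: str) -> str:
    """Uppercase text with i mapping to I-dot when no other accents are above."""
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    start = 0
    while start < len(decomposed):
        # Collect one combining sequence: a base plus its trailing marks.
        end = start + 1
        while end < len(decomposed) and unicodedata.combining(decomposed[end]) != 0:
            end += 1
        base, marks = decomposed[start], decomposed[start + 1:end]
        above = [m for m in marks if unicodedata.combining(m) == 230]
        if base == "i" and above in ([], [DOT_ABOVE]):
            # Keep exactly one dot: the one that is part of U+0130.
            out.append(CAPITAL_I_WITH_DOT + marks.replace(DOT_ABOVE, ""))
        else:
            out.append((base + marks).upper())
        start = end
    return unicodedata.normalize("NFC", "".join(out))
```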

 


to lower

Normal

Any lowercase or uppercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot.

 

i+dot (no more accents above)         →  i
i-ogonek+dot (no more accents above)  →  i-ogonek
...                                   →  ...
j+dot (no more accents above)         →  j
I-dot (if more accents above)         →  i -dot
I-dot (if no more accents above)      →  i (already in UnicodeData.txt)
I -dot (if more accents above)        →  i -dot (for NFD-NFC consistency; already in UnicodeData.txt)
I -dot (if no more accents above)     →  i (for NFD-NFC consistency)
J -dot (if no more accents above)     →  j (some degree of systematic...)
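A minimal Python sketch of the normal lowercasing rule, under the assumption that text is processed in NFD so that I-dot (U+0130) and I followed by a combining dot above are treated alike. The name `lower_normal` is hypothetical.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, combining class 230


def lower_normal(text: str) -> str:
    """Lowercase text, dropping a dot above i/j/I/J when it is the only accent above."""
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    start = 0
    while start < len(decomposed):
        # Collect one combining sequence: a base plus its trailing marks.
        end = start + 1
        while end < len(decomposed) and unicodedata.combining(decomposed[end]) != 0:
            end += 1
        base, marks = decomposed[start], decomposed[start + 1:end]
        above = [m for m in marks if unicodedata.combining(m) == 230]
        if base in ("i", "j", "I", "J") and above == [DOT_ABOVE]:
            marks = marks.replace(DOT_ABOVE, "")
        out.append(base + marks)
        start = end
    return unicodedata.normalize("NFC", "".join(out).lower())
```

If more accents above follow the dot, the dot is kept, so I-dot plus acute lowercases to i, dot above, acute.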

 

Lithuanian

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot. Uppercase I's and J's that have extra accents above must get an extra dot above inserted.

 

I (if more accents above)         →  i -dot
J (if more accents above)         →  j -dot
I-ogonek (if more accents above)  →  i-ogonek -dot
I-grave                           →  i -dot -grave
I-acute                           →  i -dot -acute
I-tilde                           →  i -dot -tilde

For NFD-NFC consistency, a number of I-letters that are not used in Lithuanian must be handled too.
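The dot insertion for Lithuanian lowercasing can be sketched as follows. This is an illustrative sketch only: `lower_lt` is a hypothetical name, the text is assumed to be handled in NFD, and the dot is inserted just before the first class-230 mark so that marks below (such as ogonek) stay next to the base.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, combining class 230


def lower_lt(text: str) -> str:
    """Lowercase text, inserting a dot above after I/J that carry accents above."""
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    start = 0
    while start < len(decomposed):
        # Collect one combining sequence: a base plus its trailing marks.
        end = start + 1
        while end < len(decomposed) and unicodedata.combining(decomposed[end]) != 0:
            end += 1
        base, marks = decomposed[start], decomposed[start + 1:end]
        if base in ("I", "J"):
            first_above = next(
                (p for p, m in enumerate(marks) if unicodedata.combining(m) == 230),
                None,
            )
            # Insert a hard dot unless one already leads the accents above.
            if first_above is not None and marks[first_above] != DOT_ABOVE:
                marks = marks[:first_above] + DOT_ABOVE + marks[first_above:]
        out.append(base + marks)
        start = end
    return unicodedata.normalize("NFC", "".join(out).lower())
```

Because the decomposition is generic, the same code path covers the I-letters that do not occur in Lithuanian.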

 

Turkish

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot. Turkish and Azeri (at least) use a dotless i as the lowercase of I. It should not be used if there are more accents above (then use an ordinary i, which then loses the dot...).

 

I (no more accents above)  →  i-dotless
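A sketch of Turkish and Azeri lowercasing along these lines is given below. The name `lower_tr` is hypothetical and NFD processing is assumed; a capital I with other accents above falls through to ordinary lowercasing.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, combining class 230
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I


def lower_tr(text: str) -> str:
    """Lowercase text with I mapping to dotless i when no accents are above."""
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    start = 0
    while start < len(decomposed):
        # Collect one combining sequence: a base plus its trailing marks.
        end = start + 1
        while end < len(decomposed) and unicodedata.combining(decomposed[end]) != 0:
            end += 1
        base, marks = decomposed[start], decomposed[start + 1:end]
        above = [m for m in marks if unicodedata.combining(m) == 230]
        if base == "I" and not above:
            base = DOTLESS_I
        elif base in ("I", "i") and above == [DOT_ABOVE]:
            # I-dot (or a spuriously dotted i) becomes a plain soft-dotted i.
            base, marks = "i", marks.replace(DOT_ABOVE, "")
        out.append(base + marks)
        start = end
    return unicodedata.normalize("NFC", "".join(out).lower())
```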

 


Suggested changes to SpecialCasing.txt regarding dotting i's and j's

The exposition tables above were not intended to be complete. The formal tables below are intended to be complete enough to cover the orthographic requirements and also be such that NFD and NFC are handled consistently. Cases like barred i or j-crosstail are not covered. Review and comments are welcome. The intent is for these modifications to be included in Unicode 3.2, or if possible, in an update to Unicode 3.1.

Old lines (to remove)

1st-------------------
# characters where they are 1-1, and does not have locale-specific mappings.)
2nd-------------------
# The <condition_list> is optional. Where present, it consists of one or more locales or contexts,
# separated by spaces.
3rd-------------------
# A locale is defined as:
# <locale> := <ISO_639_code> ( "_" <ISO_3166_code> ( "_" <variant> )? )?
# <ISO_3166_code> := 2-letter ISO country code,
# <ISO_639_code> := 2-letter ISO language code
4th-------------------
# A context is one of the following choices:
5th-------------------
# AFTER_i: The last base character was "i" 0069
6th-------------------
7th-------------------
# ================================================================================
# Locale-sensitive mappings
# ================================================================================
# Lithuanian
0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after "i" with upper or titlecase
# Turkish, Azeri
0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
end-------------------
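To make the field layout of these entries concrete, here is a small Python sketch that splits one SpecialCasing.txt-style line into its parts (code; lower; title; upper; optional condition list; comment). The names `CasingRule` and `parse_special_casing_line` are hypothetical, and the sketch tolerates a condition list with or without a trailing semicolon.

```python
from typing import List, NamedTuple, Optional


class CasingRule(NamedTuple):
    code: str
    lower: str
    title: str
    upper: str
    conditions: List[str]
    comment: str


def _chars(field: str) -> str:
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_special_casing_line(line: str) -> Optional[CasingRule]:
    """Parse one data line; return None for comment, marker, or blank lines."""
    body, _, comment = line.partition("#")
    fields = [f.strip() for f in body.split(";")]
    if len(fields) < 4 or not fields[0]:
        return None
    conditions = fields[4].split() if len(fields) > 4 else []
    return CasingRule(
        _chars(fields[0]),
        _chars(fields[1]),
        _chars(fields[2]),
        _chars(fields[3]),
        conditions,
        comment.strip(),
    )
```

An empty mapping field, as in the Lithuanian entry, parses to an empty string, which stands for deletion of the character in that case form.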

New lines (to insert, replacing the old ones listed above)

1st-------------------
# characters where they are 1-1, and does not have language-specific mappings.)
#
# Note that when case mapping a string in a normal form,
# the result need not be in any normal form.
#
2nd-------------------
# The <condition_list> is optional. Where present, it consists of one or more
# contexts, one of which may be a language code, separated by spaces.
3rd-------------------
# A _subset_ of RFC 3066 conforming language codes, _sufficient for this file_,
# can be described as:
# <langcode> := two-letter ISO 639-1 language code
4th-------------------
# A context is a <langcode> or one of the following choices (test on original string):
5th-------------------
# AFTER_i: The last preceding base character was "i" (0069), "j" (006A),
# or has a canonical decomposition that begins with an "i" or "j" but has no
# combining characters above (i.e., i-ogonek (012F), i-tilde-below (1E2D),
# or i-dot-below (1ECB)); AND no combining character class 230 (above) has
# intervened. (Neither i-stroke (0268) nor j-crosstailed (029D) needs to be
# specially handled below, although they also have a soft dot above that
# is lost on normal uppercasing or accenting above.)
#
# AFTER_CAP_I: The last preceding base character was "I" (0049), "J" (004A),
# or has a canonical decomposition that begins with an "I" or "J" but has no
# combining characters above (i.e., I-ogonek (012E), I-tilde-below (1E2C),
# or I-dot-below (1ECA)); AND no combining character class 230 (above) has
# intervened. (I-stroke (0197) need not be specially handled below, although
# it also has a soft dot above in lowercase form.)
#
# MORE_ACCENTS_ABOVE: The current combining sequence has at least one class 230
# (above) combining character after the currently considered character.
#
# NON_MORE_ACCENTS_ABOVE: The negation of MORE_ACCENTS_ABOVE.
6th-------------------[no old text]
#-----
# Normal dotting/undotting of i's and j's (capital and small):
#-----
# Remove spurious explicit dot above small i or j when case mapping,
# if no more accents above:
0307; ; ; ; AFTER_i NON_MORE_ACCENTS_ABOVE # COMBINING DOT ABOVE
# Remove explicit dot above capital i or j when lowercasing,
# if no more accents above (mainly for NFC-NFD consistency for i--I-dot):
0307; ; 0307; 0307; AFTER_CAP_I NON_MORE_ACCENTS_ABOVE # COMBINING DOT ABOVE
# For NFC-NFD consistency for I-dot--i:
0130; 0069 0307; 0130; 0130; MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I WITH DOT
# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; [NON_MORE_ACCENTS_ABOVE] # LATIN CAPITAL LETTER I WITH DOT ABOVE
7th-------------------
# ================================================================================
# Language-sensitive mappings
# ================================================================================
#
# Lithuanian:
#
# Remove dot above small i's or j's when uppercasing,
# even if there are more accents above:
0307; 0307; ; ; lt AFTER_i # COMBINING DOT ABOVE
# Introduce an explicit dot above when lowercasing capital I's and J's
# if there are more accents above (grave, acute, tilde above, and ogonek
# occur in Lithuanian; the rest are just for consistency between NFC and NFD):
0049; 0069 0307; 0049; 0049; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I
004A; 006A 0307; 004A; 004A; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER J
012E; 012F 0307; 012E; 012E; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I WITH OGONEK
00CC; 0069 0307 0300; 00CC; 00CC; lt # LATIN CAPITAL LETTER I WITH GRAVE
00CD; 0069 0307 0301; 00CD; 00CD; lt # LATIN CAPITAL LETTER I WITH ACUTE
0128; 0069 0307 0303; 0128; 0128; lt # LATIN CAPITAL LETTER I WITH TILDE
1E2C; 1E2D 0307; 1E2C; 1E2C; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I WITH TILDE BELOW
1ECA; 1ECB 0307; 1ECA; 1ECA; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I WITH DOT BELOW
00CE; 0069 0307 0302; 00CE; 00CE; lt # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
0134; 006A 0307 0302; 0134; 0134; lt # LATIN CAPITAL LETTER J WITH CIRCUMFLEX
012A; 0069 0307 0304; 012A; 012A; lt # LATIN CAPITAL LETTER I WITH MACRON
012C; 0069 0307 0306; 012C; 012C; lt # LATIN CAPITAL LETTER I WITH BREVE
01CF; 0069 0307 030C; 01CF; 01CF; lt # LATIN CAPITAL LETTER I WITH CARON
0208; 0069 0307 030F; 0208; 0208; lt # LATIN CAPITAL LETTER I WITH DOUBLE GRAVE
020A; 0069 0307 0311; 020A; 020A; lt # LATIN CAPITAL LETTER I WITH INVERTED BREVE
1E2E; 0069 0307 0308 0301; 1E2E; 1E2E; lt # LATIN CAPITAL LETTER I WITH DIAERESIS AND ACUTE
1EC8; 0069 0307 0309; 1EC8; 1EC8; lt # LATIN CAPITAL LETTER I WITH HOOK ABOVE
#
# Turkish, Azeri:
#
# Remove spurious dot above small i's when lowercasing, if no more accents above:
0307; ; 0307; 0307; tr AFTER_i NON_MORE_ACCENTS_ABOVE # COMBINING DOT ABOVE
0307; ; 0307; 0307; az AFTER_i NON_MORE_ACCENTS_ABOVE # COMBINING DOT ABOVE
# I--i-dotless and I-dot--i-with-soft-dot are case pairs in Turkish and Azeri,
# when there are no more accents above (otherwise use the ordinary casing rules):
0069; 0069; 0130; 0130; tr NON_MORE_ACCENTS_ABOVE # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az NON_MORE_ACCENTS_ABOVE # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; tr NON_MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az NON_MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I
end-------------
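The context conditions used in the entries above (AFTER_i, AFTER_CAP_I, MORE_ACCENTS_ABOVE) can be read as predicates on a position in the original string. The following Python sketch is one possible reading, not a normative definition: the function names are hypothetical, and NON_MORE_ACCENTS_ABOVE is assumed to be the plain negation of MORE_ACCENTS_ABOVE.

```python
import unicodedata


def _is_soft_base(ch: str, letters) -> bool:
    """True if ch decomposes to one of letters with no class-230 marks."""
    decomp = unicodedata.normalize("NFD", ch)
    return decomp[0] in letters and all(
        unicodedata.combining(m) != 230 for m in decomp[1:]
    )


def _after_base(s: str, pos: int, letters) -> bool:
    """Scan backwards from pos for the last base, failing on any class-230 mark."""
    for k in range(pos - 1, -1, -1):
        ccc = unicodedata.combining(s[k])
        if ccc == 230:
            return False  # an accent above has intervened
        if ccc == 0:
            return _is_soft_base(s[k], letters)
    return False


def after_i(s: str, pos: int) -> bool:
    return _after_base(s, pos, ("i", "j"))


def after_cap_i(s: str, pos: int) -> bool:
    return _after_base(s, pos, ("I", "J"))


def more_accents_above(s: str, pos: int) -> bool:
    """True if a class-230 mark follows pos within the same combining sequence."""
    for ch in s[pos + 1:]:
        ccc = unicodedata.combining(ch)
        if ccc == 0:
            return False
        if ccc == 230:
            return True
    return False
```

Here `pos` is the index of the character currently being mapped, for example the combining dot above in the sequence i, dot above, acute.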