Re: Differences between UnicodeData.txt and SpecialCasing.txt Case Mappings

From: Addison Phillips (addison@yahoo-inc.com)
Date: Thu Oct 19 2006 - 18:05:32 CST

Next message: Philippe Verdy: "Re: Differences between UnicodeData.txt and SpecialCasing.txt Case Mappings"

Previous message: Richard Wordingham: "Re: Differences between UnicodeData.txt and SpecialCasing.txt Case Mappings"
In reply to: Andrew Miller: "Differences between UnicodeData.txt and SpecialCasing.txt Case Mappings"

Hi Andrew,

Andrew Miller wrote:
> There appear to be a number of differences in the case mappings defined
> in UnicodeData.txt and SpecialCasing.txt

This is as it should be. Right at the top of SpecialCasing.txt it says:

# This file is a supplement to the UnicodeData file.
# It contains additional information about the casing of Unicode characters.
# (For compatibility, the UnicodeData.txt file only contains case mappings for
# characters where they are 1-1, and does not have locale-specific mappings.)
# For more information, see the discussion of Case Mappings in the Unicode Standard.

In other words, SpecialCasing.txt is where you will find the case mappings
that are not 1-1, i.e. where the result has more code points than the
source character, plus the context-sensitive and locale-specific mappings.
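
To see the 1-1 versus expanding distinction concretely, here is a minimal
sketch in Python. It assumes CPython 3.3 or later, where (as far as I know)
str.lower() and str.upper() apply the full, unconditional mappings rather
than the 1-1 mappings from UnicodeData.txt. The helper name code_points is
made up for illustration.

```python
# Sketch only: relies on Python's built-in full case mappings (assumed
# CPython 3.3+). It does not read UnicodeData.txt or SpecialCasing.txt.

def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE: the full lowercase
# mapping is two code points, U+0069 U+0307.
lower_dotted_i = "\u0130".lower()

# U+00DF LATIN SMALL LETTER SHARP S: the full uppercase mapping is "SS".
upper_sharp_s = "\u00DF".upper()

print(code_points(lower_dotted_i))
print(code_points(upper_sharp_s))
```

A 1-1 table cannot represent either result, because each one is longer
than the single source character.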

>
> For example, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) has a
> lowercase mapping of U+0069 in UnicodeData.txt and a mapping of U+0069
> U+0307 in SpecialCasing.txt.
>
> All of the greek YPOGEGRAMMENI letters in SpecialCasing.txt have
> different uppercase mappings to those specified in UnicodeData.txt
>
> Can I just ignore the UnicodeData.txt mappings for these characters, and
> just use the ones defined in SpecialCasing ones instead?
>

Not entirely. The bottom part of SpecialCasing.txt contains
locale-specific mappings. These should be used only for specific
languages/locales and not elsewhere. For example:

# When uppercasing, i turns into a dotted capital I

0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

You wouldn't want the letter "i" to become İ (U+0130) under "normal"
(i.e. non-Turkish/non-Azerbaijani) circumstances.
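
One way to keep the two kinds of entries apart is sketched below in Python.
This is only an illustration under stated assumptions: the names SAMPLE,
parse_special_casing and to_upper are made up, the sample holds just three
entries instead of the real file, entries with context conditions (such as
Final_Sigma) are skipped, and Python's built-in str.upper() stands in for a
real UnicodeData.txt lookup.

```python
# Illustrative sketch: a tiny subset of SpecialCasing.txt-style entries.
# Field layout: code; lower; title; upper; (condition_list;)? # comment
SAMPLE = """\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
"""

def decode(field):
    """Turn a field like '0053 0053' into the string 'SS'."""
    return "".join(chr(int(cp, 16)) for cp in field.split())

def parse_special_casing(text):
    """Return (unconditional, by_language) uppercase tables."""
    unconditional = {}
    by_language = {}  # language id -> {source char: uppercase string}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        source, upper = decode(fields[0]), decode(fields[3])
        conditions = fields[4].split() if len(fields) > 4 else []
        if not conditions:
            unconditional[source] = upper
        elif len(conditions) == 1 and conditions[0].islower():
            # A lone lowercase token is treated as a language id.
            by_language.setdefault(conditions[0], {})[source] = upper
        # Anything else is context-sensitive and ignored in this sketch.
    return unconditional, by_language

def to_upper(text, lang=None):
    """Uppercase text, using language entries only when lang matches."""
    unconditional, by_language = parse_special_casing(SAMPLE)
    language_table = by_language.get(lang, {})
    out = []
    for ch in text:
        if ch in language_table:
            out.append(language_table[ch])
        elif ch in unconditional:
            out.append(unconditional[ch])
        else:
            out.append(ch.upper())  # stand-in for a UnicodeData.txt lookup
    return "".join(out)

print(to_upper("i"))        # no language given
print(to_upper("i", "tr"))  # Turkish
```

With no language given, "i" takes the ordinary mapping to "I"; only when
the caller asks for tr or az does the U+0130 entry apply.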

Hope that helps.

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Internationalization is an architecture.
It is not a feature.


This archive was generated by hypermail 2.1.5 : Thu Oct 19 2006 - 18:07:08 CST