Re: Question about “Uppercase” in DerivedCoreProperties.txt from Philippe Verdy on 2014-11-08 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 9 Nov 2014 00:50:07 +0100

Do not try to get consistent results with only a character-to-character
mapping; it does not work for all letters, because sometimes you need 1->2
or 2->1 mappings (not all composable characters exist in precomposed form,
and sometimes the combination must be split into its canonical
decomposition before the base character is mapped) or other mappings.
toupper() and tolower() should not be used for anything other than
mapping number-like sequences (e.g. to convert hexadecimal numbers).
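
As a rough illustration of the 1->2 case, here is a small Python sketch
(Python is used only because its str.upper() applies the full Unicode case
mappings; upper_one_to_one() is a hypothetical helper written for this
example, standing in for a toupper()-style table that can only return one
character per input character):

```python
def upper_one_to_one(text: str) -> str:
    """Uppercase character by character, refusing any mapping that is not 1 -> 1.

    Hypothetical stand-in for a toupper()-style table: when the full
    uppercase form needs more than one code point, the character is
    left unchanged.
    """
    out = []
    for ch in text:
        mapped = ch.upper()
        out.append(mapped if len(mapped) == 1 else ch)
    return "".join(out)


word = "stra\u00dfe"                # "straße", 6 code points
full = word.upper()                 # full-string mapping: "STRASSE", 7 code points
narrow = upper_one_to_one(word)     # the sharp s survives: "STRAßE"

# U+01F0 (j with caron) has no precomposed capital, so its uppercase
# is the base letter J followed by U+030C COMBINING CARON.
j_caron = "\u01f0"
j_upper = j_caron.upper()

# The number-like case, where a 1 -> 1 mapping is sufficient.
hex_upper = "deadbeef".upper()

print(full, narrow, len(j_upper), hex_upper)
```

The string grows by one code point in the first case and by one in the
second, which is exactly what a fixed-length, per-character API cannot
express.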

Use strupper() and strlower() (or equivalent functions that do not allocate
memory but write to a given buffer or stream, and similar functions in
languages other than C/C++) to perform mappings on full strings, so that the
string length can safely change.
- This is needed, for example, to convert city names or people's names to
capitals in a postal address, or to style a book title or chapter heading.
- It is needed as well to perform case-insensitive searches (using "case
folding", which is different from converting to lowercase or to uppercase)
to match input, or to implement an input-completion UI that locates possible
matches within a known dictionary or input history.
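
The two uses above can be sketched in Python as follows. This is only a
minimal sketch: complete() and the city list are made up for the example,
str.casefold() is taken as the case-folding operation, and normalization
of combining sequences is deliberately left out.

```python
def complete(prefix: str, candidates: list[str]) -> list[str]:
    """Return the candidates that start with prefix, ignoring case.

    Hypothetical completion helper: both sides are case-folded, which is
    not the same operation as lowercasing.
    """
    key = prefix.casefold()
    return [c for c in candidates if c.casefold().startswith(key)]


# Illustrative data only.
cities = ["Stra\u00dfburg", "STRASSBURG", "Stuttgart", "Gie\u00dfen"]

# Case folding matches the sharp s against "ss" in both directions.
folded_hits = complete("strass", cities)

# Lowercasing alone keeps the sharp s, so "Straßburg" is missed.
lowered_hits = [c for c in cities if c.lower().startswith("strass")]

# Capitals for a postal address: the string gets longer.
address_line = "Gie\u00dfen".upper()

print(folded_hits, lowered_hits, address_line)
```

With lowercasing only, the search finds "STRASSBURG" but not the spelling
with the sharp s; with folding it finds both.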

2014-11-08 10:22 GMT+01:00 Mike FABIAN <mfabian_at_redhat.com>:

> Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>
> > Note that tolower() and toupper() can only work at the 1-character
> > level; they are not recommended for changing the case of plain text.
> >
> > For correct handling of locales, tolower and toupper should be replaced
> > by strtolower and strtoupper (or their aliases), which will be able to
> > process character clusters and the contextual casing rules needed for a
> > language or orthographic style.
>
> Yes, thank you for explaining this.
>
> But these details of upper and lower casing cannot be expressed in the
> “i18n” file of glibc:
>
> https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/i18n
>
> For toupper and tolower, this file just has character -> character
> mapping tables, for example the “tolower” table contains only
>
> (<U03A3>,<U03C3>)
>
> (i.e. mapping Σ U+03A3 -> σ U+03C3, never to the final sigma ς
> U+03C2).
>
> More correct, detailed information about upper and lower case must come
> from elsewhere, not from this “i18n” file in glibc. Using only the
> information from this “i18n” file, not even the Greek sigma can be
> handled correctly.
>
> Pravin and I want to update this “i18n” file to the latest
> data from Unicode 7.0.0, doing it as correctly as possible within
> the limitations imposed by this file and the ISO C standard.
>
> --
> Mike FABIAN <mfabian_at_redhat.com>
> ☏ Office: +49-69-365051027, internal 8875027
> Lack of sleep is the enemy of good work.
>
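
The final sigma limitation described in the quoted message can be
reproduced with a short Python sketch. It assumes only that Python's
str.lower() applies the contextual final-sigma rule when it sees the whole
string; the per-character loop plays the role of a character -> character
table like the one in the “i18n” file.

```python
# "ΟΔΟΣ" (Greek for "street"), written with escapes to avoid any ambiguity.
word = "\u039f\u0394\u039f\u03a3"

# Character -> character mapping: every capital sigma becomes U+03C3,
# because a single character carries no context.
per_char = "".join(ch.lower() for ch in word)

# Whole-string mapping: the word-final sigma becomes U+03C2.
whole = word.lower()

print(per_char, whole)
print(hex(ord(per_char[-1])), hex(ord(whole[-1])))  # 0x3c3 0x3c2
```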

Received on Sat Nov 08 2014 - 17:51:45 CST