From: Mark Davis (firstname.lastname@example.org)
Date: Tue May 25 2004 - 19:10:31 CDT
I don't think the "fold to base" is as useful as some other information. For
those characters with a canonical decomposition, the decomposition carries
more information, since you can combine it with a "remove combining marks"
folding to get the folding to base.
For my part, what would be more interesting would be a "full" decomposition of
the characters that don't have a canonical decomposition, e.g.
LATIN CAPITAL LETTER O WITH STROKE => O + /
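
To illustrate the point, here is a minimal sketch in Python (not ICU, and not taken from the TR30 data) of getting "fold to base" by chaining canonical decomposition with a "remove combining marks" step. The function names are made up for illustration, and treating every character in general category M* as a combining mark is an assumption; a real folding might define the set differently.

```python
import unicodedata

def remove_combining_marks(text):
    # Assumption: "combining mark" means general category Mn, Mc or Me.
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("M")
    )

def fold_to_base(text):
    # Step 1: canonical decomposition (NFD) splits base letter and marks.
    # Step 2: dropping the marks leaves only the base letters.
    return remove_combining_marks(unicodedata.normalize("NFD", text))

# U+00E9 has a canonical decomposition, so it folds to plain "e".
print(fold_to_base("\u00e9"))
# U+00D8 (O WITH STROKE) has no canonical decomposition, so it is
# returned unchanged; this is the gap a "full" decomposition would fill.
print(fold_to_base("\u00d8") == "\u00d8")
```

Characters such as U+00D8 pass through untouched, which is why a separate table for them would be the more interesting addition.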
BTW, I had posted some commentary on TR30, which I will repeat here.
... I found these files almost
impossible to assess in code point form, so I ran them through a quick ICU
transform to add comments with the real characters and names. I also NFC'd the
forms, just for consistency. These files generated from Asmus's are in
I had suggested posting them in this form for public review of the TR, since
others will have the same difficulty in assessing the quality of the data.
Here are some quick comments.
Adding digraph expansions seems quite odd.
When in NFC, whole batches of these mappings are NOPs. Don't know why they are
there; they are also not consistent in the use of composed vs. decomposed forms.
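
One way to spot such mappings, sketched in Python under the assumption that each entry is a (source, target) pair of strings: normalize both sides to NFC and see whether they become identical. The helper name and the sample pairs below are hypothetical and are not taken from the data files.

```python
import unicodedata

def is_nop_under_nfc(source, target):
    # A mapping does nothing for NFC text if both sides normalize
    # to the same string.
    nfc_source = unicodedata.normalize("NFC", source)
    nfc_target = unicodedata.normalize("NFC", target)
    return nfc_source == nfc_target

# Hypothetical entries, for illustration only.
mappings = [
    ("e\u0301", "\u00e9"),  # decomposed e + acute -> composed form
    ("\u00e9", "e"),        # composed form -> base letter
]

for source, target in mappings:
    print(repr(source), repr(target), is_nop_under_nfc(source, target))
```

Entries for which the check is true could be dropped from the file without changing the result on NFC input.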
This file combines half-width katakana folding. I think it is much more useful
if that is separated out. Someone can apply a sequence of two transforms if they
This feels like a real potpourri of stuff. Why superscripts and not subscripts?
Why annotation characters? Why modifier letters -- those are not really
This file would be MUCH more useful if in two separate files.
Full-width to half-width
Half-width to full-width
Again, remove the NFC mappings.
27E6; 301A # ⟦ → 〚 MATHEMATICAL LEFT WHITE SQUARE BRACKET → LEFT WHITE SQUARE
These don't appear to be a width issue.
Note that I have not checked these new data tables for completeness; these were
just some quick observations.
► शिष्यादिच्छेत्पराजयम् ◄
----- Original Message -----
Sent: Tue, 2004 May 25 14:57
Subject: Re: New Public Review Issue posted
> Rick McGowan scripsit:
> > The Unicode Technical Committee has posted a new issue for public
> > review and comment. Details are on the following web page:
> > http://www.unicode.org/review/
> I have prepared a draft DiacriticFolding.txt file for this issue; it is
> temporarily available at http://www.ccil.org/~cowan/DiacriticFolding.txt .
> This was prepared by looking for lines in UnicodeData that matched
> the regex '(GREEK|LATIN|CYRILLIC|HEBREW).*WITH'. (I added Hebrew to the
> set of scripts specified by the current draft of #30.)
> Characters with decompositions were mapped into the base character of the
> decomposition; characters without decompositions were mapped by name.
> The file http://www.ccil.org/~cowan/DiacriticFoldingExceptions.txt contains
> a list of 32 characters matching the pattern which did not seem to me
> to be suitable for diacritic folding.
> I have posted a short version of this note to the Unicode comment form.
> A rabbi whose congregation doesn't want     John Cowan
> to drive him out of town isn't a rabbi,     http://www.ccil.org/~cowan
> and a rabbi who lets them do it             email@example.com
> isn't a man.    --Jewish saying             http://www.reutershealth.com