L2/00-261

Unicode Technical Report #25

CHARACTER FOLDINGS

Revision 0d1
Authors Asmus Freytag (asmus@unicode.org)
Date 2000-07-12
This Version http://www.unicode.org/unicode/reports/tr23-0.0d0.html
Previous Version none
Latest Version http://www.unicode.org/unicode/reports/tr25/draft

Summary

This report identifies a set of character foldings, in other words, operations that ignore certain distinctions between similar characters.

Status of this document

<Use formal language for proposed draft>

Contents

<Make real>

1 Section1
2 Section2 
2.1 Subsection 2.1
2.2 Subsection 2.2
3 Definitions
4 Conformance
6 References
Acknowledgements
Revisions

1 Scope

 

!= Categorization (broader categories)

1.3 Definitions:

Folding operation – a folding operation removes a distinction between related characters. For example, case folding removes the case distinction by replacing upper and title case variants of a character with the lower case variant.

Compatibility mappings – compatibility mappings substitute characters with their compatibility decomposition. Many compatibility mappings are foldings; some are multigraph expansions.

Multigraph expansion – a multigraph expansion replaces a multigraph, such as DOUBLE PRIME, by its expansion into an equivalent series of single characters, in this case two single primes. Multigraph expansions are a subset of compatibility mappings.
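The definitions above can be illustrated with a short sketch. It assumes Python's standard unicodedata module and the built-in str.casefold method as stand-ins for the data files this report refers to; the variable names are illustrative only.

```python
import unicodedata

# Case folding: upper and title case variants fold to the lower case variant.
upper = "\u01C4"  # LATIN CAPITAL LETTER DZ WITH CARON
title = "\u01C5"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
lower = "\u01C6"  # LATIN SMALL LETTER DZ WITH CARON
folded = [ch.casefold() for ch in (upper, title, lower)]

# Multigraph expansion: DOUBLE PRIME expands into two single PRIME characters
# through its compatibility decomposition.
double_prime = "\u2033"
expanded = unicodedata.normalize("NFKD", double_prime)

print(folded == [lower, lower, lower])
print([unicodedata.name(ch) for ch in expanded])
```

All three case variants fold to the same character, while the multigraph becomes a sequence of two characters rather than a single replacement.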

This technical report defines a consistent set of folding operations useful for fuzzy searches, among other things. They are:

The canonical duplicates are noted separately to distinguish them from the case of normalizing series of combining accents or decomposing accented characters, which are the bulk of canonical mappings. For certain loose matches, explicitly allowing accent removal as a form of separate folding is required.
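As a rough sketch of accent removal as a separate folding, the following decomposes canonically and then drops nonspacing marks. The function name remove_accents is hypothetical, and the approach is an approximation: it removes every nonspacing mark regardless of script, which a real folding would likely restrict.

```python
import unicodedata

def remove_accents(text: str) -> str:
    """Approximate accent removal: canonical decomposition, then drop Mn marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(remove_accents("r\u00E9sum\u00E9"))
# Canonical decomposition also folds canonical duplicates such as OHM SIGN,
# which is why this report notes them separately.
print(remove_accents("\u2126") == "\u03A9")
```

Note that the sketch cannot perform accent removal without also performing canonical duplicates folding, since both come from the same decomposition step.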

Where not noted otherwise, the folding description is a subset of the compatibility mappings defined in Normalization form NFKC, and can be derived by inspection from the data file. (I'll post a sorted extract of the Unicode data file to allow easier comments).

1.2 Rationale

Consider trying to use Normalization Form NFKC for the purpose of general text searching (which may be different in some respects from identifier matching). Some of the compatibility mappings seem absolutely necessary for sensible searching, for example the <wide> and <narrow> mappings. Others cannot be used. One particular example is:  There are other examples of this in NFKC. Most involve digits and a loss of clear term separation.

The simple answer is that form NFKC does not achieve what people 'naively' assume it does, i.e. provide a set of sensible _folding_ operations for fuzzy equivalence.

Commonly requested folding operations are

Native digit folding (not provided by NFKC)

Hyphen/Dash folding (not provided by NFKC) 
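The two operations listed above can be sketched with the General Category property. This is an illustration only, assuming Python's unicodedata module; the function name fold_digits_and_dashes is hypothetical, and folding every Nd character to ASCII is one possible reading of "native digit folding".

```python
import unicodedata

def fold_digits_and_dashes(text: str) -> str:
    """Fold decimal digits (Nd) to ASCII digits and dashes (Pd) to U+002D."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Nd":
            out.append(str(unicodedata.decimal(ch)))
        elif category == "Pd":
            out.append("\u002D")
        else:
            out.append(ch)
    return "".join(out)

# ARABIC-INDIC DIGIT ONE, TWO, then EN DASH, then ARABIC-INDIC DIGIT THREE
sample = "\u0661\u0662\u2013\u0663"
nfkc = unicodedata.normalize("NFKC", sample)

print(nfkc == sample)                  # NFKC leaves digits and dash untouched
print(fold_digits_and_dashes(sample))
```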

2.0 Existing sources for folding and expansion data

2.1 Case folding

The [CaseFolding] data file provides pure case folding information.

2.1 Collation

The weight tables from the Unicode Collation Algorithm provide the source information for a number of foldings. These are not provided explicitly, but fall out when some of the sort weight differences are selectively ignored.

The Collation data tables provide intrinsic multigraph expansions which cannot be separated from the foldings.

2.2 Normalization forms C and D

Normalization forms C and D use canonical decomposition. They differ only in whether the result is (partially) recomposed (Normalization form C). The only folding they provide is canonical duplicates folding.
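A small sketch of canonical duplicates folding, assuming Python's unicodedata module as the implementation of the normalization forms:

```python
import unicodedata

ohm_sign = "\u2126"       # OHM SIGN
omega = "\u03A9"          # GREEK CAPITAL LETTER OMEGA
angstrom_sign = "\u212B"  # ANGSTROM SIGN
a_ring = "\u00C5"         # LATIN CAPITAL LETTER A WITH RING ABOVE

# Both forms fold the duplicate; they differ in whether the result is recomposed.
nfc_ohm = unicodedata.normalize("NFC", ohm_sign)
nfd_ohm = unicodedata.normalize("NFD", ohm_sign)
nfc_angstrom = unicodedata.normalize("NFC", angstrom_sign)
nfd_angstrom = unicodedata.normalize("NFD", angstrom_sign)

print(nfc_ohm == omega, nfd_ohm == omega)
print(nfc_angstrom == a_ring, nfd_angstrom == "A\u030A")
```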

2.3 Normalization forms KC or KD

NFKC and NFKD add compatibility decomposition to the canonical decomposition. Compatibility decomposition provides several foldings and expansions so that NFKC and NFKD provide these

Notes:

(*) The foldings noted with a (*) could be considered 'near canonical' mappings, since these distinctions are required solely for interoperation with some legacy environments but are never needed for ordinary plain text; these folding operations are therefore lossless.

(**) Han duplicates are folded by NFKC only if they occur across the boundary of the set of unified ideographs. There are many duplicates inside the unified set itself that might need folding for some forms of loose matching.

(***) Certain Arabic ligature expansions are provided as part of the data for positional forms foldings.

Each of the folding operations from the preceding list has well understood properties, and is appropriate in specific contexts. Not all of these folding operations may be appropriate in the same contexts. See the description for some of the more problematic folding or expansion operations. Also, the list of folding and expansion operations leaves out many common and useful folding operations, and only partially provides others.

There are four problems with using form NFKC as 'default folding':

  1. Normalization Form NFKC bundles many of the folding operations together 
  2. NFKC does not provide some common and needed folding operations 
  3. NFKC provides some inappropriate foldings and side effects 
  4. As defined today, NFKC is frozen with respect to some forms of updates to Unicode, while search operations would not need to be so restricted
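The bundling problem can be made concrete with a sketch that applies only selected compatibility mappings, chosen by their tag. It assumes Python's unicodedata.decomposition, which returns the decomposition field of the Unicode data file as a string; the function name fold_selected is hypothetical, and the sketch applies a single decomposition step rather than a full recursive one.

```python
import unicodedata

def fold_selected(text: str, tags: set) -> str:
    """Apply the compatibility decomposition only for the given mapping tags."""
    out = []
    for ch in text:
        decomposition = unicodedata.decomposition(ch)
        if decomposition.startswith("<"):
            tag, _, code_points = decomposition.partition("> ")
            if tag + ">" in tags:
                out.append("".join(chr(int(cp, 16)) for cp in code_points.split()))
                continue
        out.append(ch)
    return "".join(out)

# FULLWIDTH DIGIT FIVE followed by VULGAR FRACTION ONE QUARTER
text = "\uFF15\u00BC"
width_only = fold_selected(text, {"<wide>", "<narrow>"})
bundled = unicodedata.normalize("NFKC", text)

print(width_only == "5\u00BC")    # width folded, fraction preserved
print(bundled == "51\u20444")     # NFKC also expands the fraction
```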

2.4 Categorization of properties 

(doesn't give target character)

2.5 Which ones are not provided, or spread across forms.

 

3.0 Folding and expansion operations:

3.1 General notes

3.1.1 Accent folding

need to add barred, slashed etc, hook descender not yet in canon...

 unidata.txt in collation dir has annotation 

3.1.2 Letter forms

Letterforms need to have 'final sigma' and 'final Hebrew' added to them. 

Greek letter forms should be folded for Greek text. They should not be folded for mathematical and scientific usage, as doing so would conflate very distinct concepts (e.g. angle (THETA) and temperature (THETA SYMBOL), to give an example of common usage in physics).
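The following sketch shows how the compatibility mappings treat these letter forms today, assuming Python's unicodedata module. It illustrates both points: NFKC folds THETA SYMBOL unconditionally, and it does not fold final sigma at all.

```python
import unicodedata

theta = "\u03B8"         # GREEK SMALL LETTER THETA
theta_symbol = "\u03D1"  # GREEK THETA SYMBOL
sigma = "\u03C3"         # GREEK SMALL LETTER SIGMA
final_sigma = "\u03C2"   # GREEK SMALL LETTER FINAL SIGMA

folded_theta = unicodedata.normalize("NFKC", theta_symbol)
folded_sigma = unicodedata.normalize("NFKC", final_sigma)

print(folded_theta == theta)        # the two thetas are conflated
print(folded_sigma == final_sigma)  # final sigma stays distinct from sigma
```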

3.1.3 Multigraphs

Some multigraphs that are EAW ambiguous should potentially be treated differently when resolved to EAW wide than when resolved to narrow.

3.2 Problematic foldings or expansions

3.2.1 Fraction expansion

Fraction expansion as defined in the compatibility decompositions can drastically change the semantics of a string and can lead to term boundary issues for searching. For example, expanding the fraction in the string DIGIT 5 + VULGAR FRACTION ONE QUARTER turns it into DIGIT 5 + DIGIT 1 + FRACTION SLASH + DIGIT 4. The result will now be found by a search for "51". Because of the semantics of FRACTION SLASH, the expansion changes the numeric value from "5 and a quarter" into "51 over 4". Fraction expansion is therefore best avoided altogether.
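The example above can be reproduced with a few lines, assuming Python's unicodedata module for the NFKC mapping:

```python
import unicodedata

original = "5\u00BC"  # DIGIT 5 + VULGAR FRACTION ONE QUARTER
expanded = unicodedata.normalize("NFKC", original)

print([unicodedata.name(ch) for ch in expanded])
print("51" in original)  # the original string does not match a search for "51"
print("51" in expanded)  # the expanded string does
```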

3.2.2 Bullet expansions

If a circled bullet character is simply replaced by its contents, e.g. CIRCLED DIGIT 5 is replaced by DIGIT 5, the separation from the surrounding text is lost, and the DIGIT 5 could run together with adjacent numbers. For bullet characters using parenthesized or dotted letters or digits, this issue is somewhat mitigated by the fact that the bullet itself contains punctuation.
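A brief sketch of the three bullet styles next to an adjacent digit, assuming Python's unicodedata module for the compatibility mappings:

```python
import unicodedata

circled = "3\u2464"        # DIGIT 3 + CIRCLED DIGIT FIVE
parenthesized = "3\u2478"  # DIGIT 3 + PARENTHESIZED DIGIT FIVE
dotted = "3\u248C"         # DIGIT 3 + DIGIT FIVE FULL STOP

results = {
    "circled": unicodedata.normalize("NFKC", circled),
    "parenthesized": unicodedata.normalize("NFKC", parenthesized),
    "dotted": unicodedata.normalize("NFKC", dotted),
}

for style, text in results.items():
    print(style, repr(text))
```

Only the parenthesized form keeps punctuation between the bullet's digit and the preceding digit; the circled form runs the two digits together.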

3.2.3 Spacing accents substitution

Spacing accents are mapped by compatibility decomposition to SPACE + non-spacing accent. This inappropriately introduces a space character into the term, as well as introducing non-spacing marks where none were in the data before.
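The effect can be observed directly, assuming Python's unicodedata module for the compatibility decomposition:

```python
import unicodedata

term = "caf\u00B4"  # "caf" followed by ACUTE ACCENT (a spacing accent)
mapped = unicodedata.normalize("NFKC", term)

# Inspect the General Category of each resulting character.
categories = [unicodedata.category(ch) for ch in mapped]

print(repr(mapped))
print(categories)
```

The mapped string contains a space separator (Zs) and a nonspacing mark (Mn), neither of which was present in the original term.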

3.2.4 Math folding

Form NFKC provides an aggressive folding of letterlike mathematical symbols to their nearest ASCII or Hebrew equivalent. In particular, the Hebrew characters used as letterlike symbols do not have RIGHT TO LEFT directionality, and the set of such letters in mathematical usage is sufficiently restricted that such folding makes little sense, except in pure 'looks like' style searches.
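The change in directionality can be checked with a short sketch, assuming Python's unicodedata module for both the NFKC mapping and the bidirectional class:

```python
import unicodedata

alef_symbol = "\u2135"  # ALEF SYMBOL, a letterlike symbol
folded = unicodedata.normalize("NFKC", alef_symbol)

symbol_bidi = unicodedata.bidirectional(alef_symbol)
letter_bidi = unicodedata.bidirectional(folded)

print(unicodedata.name(folded))
print(symbol_bidi, letter_bidi)
```

The symbol has left-to-right directionality while the Hebrew letter it folds to is right-to-left, so the folding changes how the surrounding text is laid out.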

3.2.5 Various "cluster" expansions

Unicode contains many clusters, e.g. square symbols and some of the letterlike characters, that are made up of several characters. 'Decomposing' these may or may not be the right thing for search equivalence. Parenthesized characters and numbers would probably be immune to the term boundary issues raised earlier, but the story is less clear for others.
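Two such clusters are shown below as an illustration, assuming Python's unicodedata module for the NFKC mapping:

```python
import unicodedata

telephone_sign = "\u2121"    # TELEPHONE SIGN
square_m_squared = "\u33A1"  # SQUARE M SQUARED

tel = unicodedata.normalize("NFKC", telephone_sign)
area = unicodedata.normalize("NFKC", square_m_squared)

print(repr(tel))
print(repr(area))
```

In the second case the superscript is folded as well, so the expansion introduces a plain digit that can match digit searches.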

4.0 Specifications

The following table summarizes the definition of the folding operations. The source column identifies the set of characters subject to the folding operation by referencing a set of code points, a set of general categories, or a compatibility mapping tag. All characters matching the source condition are subject to the given folding. Note that this column does not indicate the set of characters with which the source characters are equivalenced by the folding. The target column indicates the result of the folding, either by reference to an operation, or, in some cases, by providing the single Unicode character to which a whole set of source characters is folded. The data file column indicates which data file carries the character by character information to implement the operation referred to in the target column. 

Description | Source characters | Target characters | Data file specifying the mapping
Accent removal | Latin/Greek/Cyrillic characters with canonical decomposition | base characters of canonical decomposition | CanonMapsAnnotated.txt (*)
Accent folding (includes stroke, hook, descender) | Latin/Greek/Cyrillic characters with accents | related base character | AccentFolding.txt [TBD]
Case folding | Lu and Lt | Lower case | [CaseFolding]
Canonical duplicates folding (Ohm - Omega) | | canonical decomposition | CanonMapsAnnotated.txt (*)
Dashes folding | Pd | U+002D |
Fraction expansion | <fraction> | compatibility decomposition | [UnicodeData]
Han Radical folding | 2F00..2FD5, 2EF3, 2E9F | compatibility decomposition | [UnicodeData]
Hangzhou Numbers folding | 3038..303A | compatibility decomposition | [UnicodeData]
Hebrew Alternates folding | FB20..FB28 | compatibility decomposition | [UnicodeData]
Jamo folding | 3131..3183 | compatibility decomposition | [UnicodeData]
Kana folding | Hiragana | Katakana | [KanaFolding]
Ligature expansion | Misc. 0587, FB00..FB06, FB13..FB17, FB4F | compatibility decomposition | [UnicodeData]
Letterforms folding | Variants of letter forms | Related base form | [TBD]
Multigraph expansion | | |
 - circled | <circled> | compatibility expansion | [UnicodeData]
 - parenthesized | | compatibility expansion | [UnicodeData]
 - dotted | | compatibility expansion | [UnicodeData]
 - Ellipsis expansion | 2024..2026 | compatibility decomposition | [UnicodeData]
 - Integral expansion | 222C..222D, 222F..2230 | compatibility decomposition | [UnicodeData]
 - Prime expansion | 2033..2034, 2036..2037 | compatibility decomposition | [UnicodeData]
 - Roman numerals | 2160..2183 | compatibility decomposition | [UnicodeData]
 - other, e.g. c/o, TEL | | |
Non-break folding | <no-break> | compatibility decomposition | [UnicodeData]
Overline folding | FE49..FE4B | 203E |
Positional Forms folding (includes Arabic ligatures) | <initial>, <medial>, <final>, <isolate> | compatibility decomposition | [UnicodeData]
Small forms folding | <small> | compatibility decomposition | [UnicodeData]
Space folding | Zs | U+0020 |
Spacing Accents <+> | 00AF, 00B4, 00B8, 02D8..02DD, 037A, 0384, 1FBD, 1FBE..1FC0, 1FFE, 2017, 203E, 309B..309C | |
Square multigraphs expansion | <square> | compatibility decomposition | [UnicodeData]
Subscript folding | <sub> | compatibility decomposition | [UnicodeData]
Superscript folding | <super> | compatibility decomposition | [UnicodeData]
Symbol folding <+> | 2107, 2135..2138 | compatibility decomposition | [UnicodeData]
Underline folding | FE4D..FE4F | 005F |
Vertical forms folding | <vertical> | compatibility decomposition | [UnicodeData]
Width folding | <wide>, <narrow> | compatibility decomposition | [UnicodeData]
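For the rows that name a single target character, the table can be read as a list of rules, each pairing a source condition with a target. The sketch below is one hypothetical encoding of two such rows (Space folding, selected by general category, and Overline folding, selected by code point range); the names RULES and apply_rules are illustrative, and Python's unicodedata module is assumed as the source of the General Category property.

```python
import unicodedata

# Each rule: (description, source condition, single target character).
RULES = [
    ("Space folding", lambda ch: unicodedata.category(ch) == "Zs", "\u0020"),
    ("Overline folding", lambda ch: 0xFE49 <= ord(ch) <= 0xFE4B, "\u203E"),
]

def apply_rules(text: str, rules=RULES) -> str:
    """Replace each character by the target of the first rule it matches."""
    out = []
    for ch in text:
        for _description, matches, target in rules:
            if matches(ch):
                out.append(target)
                break
        else:
            out.append(ch)  # no rule matched: keep the character unchanged
    return "".join(out)

# NO-BREAK SPACE, IDEOGRAPHIC SPACE, and CENTRELINE OVERLINE
print(repr(apply_rules("a\u00A0b\u3000c\uFE4A")))
```

Keeping each folding as a separate rule lets a search implementation enable only the operations appropriate to its context.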

Notation: 

  • .. indicates an inclusive range
  • , indicates an alternative
  • <xxxx> refers to a compatibility mapping tag as defined in [CompatibilityTags]
  • Xx refers to a particular value for the General Category property defined in [UnicodeData]
  • <+> means a folding is contained in the Unicode data files, but is not recommended

5.0 References

    Delete unneeded ones

    [AccentFolding]
    Data file <ftp://ftp.unicode.org/Public/UNIDATA/AccentFolding.txt>
    [CaseFolding]
    Data file <ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt>
    [Case Mapping] 
    Mark Davis, Unicode Technical Report #21: Case Mapping, <http://www.unicode.org/unicode/reports/tr21>
    [Character Mapping Tables]
    Mark Davis, Unicode Technical Report #22: Character Mapping Tables, <http://www.unicode.org/unicode/reports/tr22>
    [Collation]
    Mark Davis, Unicode Technical Report #10: Collation, <http://www.unicode.org/unicode/reports/tr10>
    [CompatibilityTags]
    [EastAsianWidth]
    Data file <ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt>
    [East Asian Width]
    Asmus Freytag, Unicode Standard Annex #11: East Asian Width, <http://www.unicode.org/unicode/reports/tr11>
    [KanaFolding]
    Data file <ftp://ftp.unicode.org/Public/UNIDATA/KanaFolding.txt>
    [SpecialCasing]
    Data file <ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt>
    [Unicode]
    The Unicode Standard, Version 3.0, Addison Wesley Longman, 2000.
    [UnicodeCharacterDatabase]
    Readme file, <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html>
    [UnicodeData]
    Data file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>
    [UnicodeData-Format]
    Readme file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html>
    These may not be needed.
    [Bidirectional Algorithm]
    Mark Davis, Unicode Standard Annex #9: The Bidirectional Algorithm, <http://www.unicode.org/unicode/reports/tr9>
    [LineBreak]
    Data file <ftp://ftp.unicode.org/Public/UNIDATA/LineBreak.txt>
    [Line Breaking]
    Asmus Freytag, Unicode Standard Annex #14: Line Breaking Properties, <http://www.unicode.org/unicode/reports/tr14>
    [NamesList]
    Data file <ftp://ftp.unicode.org/Public/UNIDATA/NamesList.txt>
    [NamesList-Format]
    Readme file <ftp://ftp.unicode.org/Public/UNIDATA/NamesList.html>

     

    Changes from previous drafts

    Initial draft 


    Copyright © 2000-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

    Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.