L2/00-261

Unicode Technical Report #25

CHARACTER FOLDINGS

Revision	0d1
Authors	Asmus Freytag (asmus@unicode.org)
Date	2000-07-12
This Version	http://www.unicode.org/unicode/reports/tr23-0.0d0.html
Previous Version	none
Latest Version	http://www.unicode.org/unicode/reports/tr25/draft

Summary

This report identifies a set of character foldings, in other words, operations that ignore certain distinctions between similar characters.

Status of this document

Contents

1 Scope

2.0 Existing sources for folding and expansion data

3.0 Folding and expansion operations

4.0 Specifications

5.0 References

1 Scope

!= Categorization (broader categories)

1.1 Definitions

Folding operation – a folding operation removes a distinction between related characters. For example, case folding removes the case distinction, by replacing upper and title case variants of a character with the lower case.

Compatibility mappings – compatibility mappings substitute characters with their compatibility decomposition. Many compatibility mappings are foldings, some are multigraph expansions.

Multigraph expansion – a multigraph expansion replaces a multigraph, such as double prime, by its expansion into an equivalent series of single characters, in this case two single primes. Multigraph expansions are a subset of compatibility mappings.
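As a small illustration of these definitions, the following sketch uses Python's standard `unicodedata` module and the built-in `str.casefold()`. Both are assumptions of this example; the report itself does not prescribe an implementation.

```python
import unicodedata

# Folding operation: case folding removes the case distinction.
folded = "Unicode".casefold()

# Multigraph expansion: DOUBLE PRIME (U+2033) has a compatibility
# decomposition into two PRIME (U+2032) characters.
expanded = unicodedata.normalize("NFKD", "\u2033")

print(folded)
print([f"U+{ord(c):04X}" for c in expanded])
```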

This technical report defines a consistent set of folding operations useful for fuzzy searches, among other things. They are:

Accent removal (not provided)
Case folding (not provided)
Canonical duplicates folding (Ohm - Omega)
Dashes folding (not provided)
Fraction Expansion
Han folding (partially provided)
Hebrew alternates folding
Jamo folding
Kana folding (not provided)
Ligature expansion Misc.
Multigraph expansion
- circled bullets
- dotted bullets
- ellipses, integrals, primes
- parenthesized
- roman numerals
- square clusters
- symbol, e.g. c/o, TEL
Native digit folding
Non-break folding
Other punctuation folding (not provided)
Overline folding
Positional Forms folding
- - includes Arabic ligatures
Small forms folding
Space folding
Spacing Accents
Subscript folding
Superscript folding
Symbol folding
Underline folding
Vertical forms folding
Width folding

The canonical duplicates are noted separately to distinguish them from the case of normalizing series of combining accents or decomposing accented characters, which are the bulk of canonical mappings. For certain loose matches, explicitly allowing accent removal as a form of separate folding is required.
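The difference between the two cases can be sketched as follows. This is only an illustration using Python's `unicodedata` module; the function names are hypothetical, and the accent removal shown here strips all nonspacing marks regardless of script, which is broader than the folding this report intends.

```python
import unicodedata

def fold_canonical_duplicates(text: str) -> str:
    # NFC replaces singleton canonical duplicates (such as OHM SIGN) by
    # their canonical equivalent and keeps accented letters composed.
    return unicodedata.normalize("NFC", text)

def remove_accents(text: str) -> str:
    # A separate, lossy folding: decompose, drop nonspacing marks (Mn),
    # then recompose whatever is left.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", stripped)

ohm = "\u2126"                 # OHM SIGN
word = "r\u00e9sum\u00e9"      # contains LATIN SMALL LETTER E WITH ACUTE

print(fold_canonical_duplicates(ohm) == "\u03a9")   # GREEK CAPITAL LETTER OMEGA
print(fold_canonical_duplicates(word) == word)      # accents are kept
print(remove_accents(word))
```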

Where not noted otherwise, the folding description is a subset of the compatibility mappings defined in Normalization form NFKC, and can be derived by inspection from the data file. (I'll post a sorted extract of the Unicode data file to allow easier comments).

1.2 Rationale

Consider using Normalization Form NFKC for general text searching (which may differ in some respects from identifier matching). Some of the compatibility mappings seem absolutely necessary for sensible searching, for example the <wide> and <narrow> mappings. Others cannot be used. One particular example is: There are other examples of this in NFKC. Most involve digits and a loss of clear term separation.

The simple answer is that form NFKC does not achieve what people 'naively' assume it does, i.e. provide a set of sensible _folding_ operations for fuzzy equivalence.

Commonly requested folding operations are

Native digit folding (not provided by NFKC)

Hyphen/Dash folding (not provided by NFKC)
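A minimal sketch of these two foldings, assuming Python's `unicodedata` module as the source of the decimal digit value and the General Category. The function names are hypothetical. The final lines show that NFKC leaves both sample strings unchanged.

```python
import unicodedata

def fold_native_digits(text: str) -> str:
    # Replace every character that has a decimal digit value by the
    # corresponding ASCII digit; leave everything else alone.
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)

def fold_dashes(text: str) -> str:
    # Replace every character of General Category Pd by HYPHEN-MINUS.
    return "".join("-" if unicodedata.category(ch) == "Pd" else ch for ch in text)

arabic_indic = "\u0661\u0662\u0663"   # ARABIC-INDIC DIGITS ONE, TWO, THREE
en_dash_range = "10\u201320"          # contains EN DASH

print(fold_native_digits(arabic_indic))
print(fold_dashes(en_dash_range))
print(unicodedata.normalize("NFKC", arabic_indic) == arabic_indic)
print(unicodedata.normalize("NFKC", en_dash_range) == en_dash_range)
```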

2.0 Existing sources for folding and expansion data

2.1 Case folding

The [CaseFolding] data file provides pure case folding information.

2.1 Collation

The weight tables from the Unicode Collation Algorithm provide the source information for a number of foldings. These are not provided explicitly, but fall out when some of the sort weight differences are selectively ignored.

Accent folding
Case folding
Final forms folding
Kana folding
Width folding

The Collation data tables provide intrinsic multigraph expansions which cannot be separated from the foldings.

2.2 Normalization forms C and D

Normalization forms C and D use canonical decomposition. They differ only in whether the result is (partially) recomposed (Normalization form C). The only folding they provide is canonical duplicates folding.
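The behavior described above can be observed with Python's `unicodedata.normalize` (an assumption of this sketch, not part of the report). The sample labels are purely illustrative.

```python
import unicodedata

def code_points(text: str) -> list:
    return [f"U+{ord(c):04X}" for c in text]

samples = {
    "e with acute": "\u00e9",
    "angstrom sign": "\u212b",   # canonical duplicate of U+00C5
    "fullwidth A": "\uff21",     # compatibility character, untouched by C and D
}

for label, ch in samples.items():
    nfc = unicodedata.normalize("NFC", ch)
    nfd = unicodedata.normalize("NFD", ch)
    print(label, "NFC:", code_points(nfc), "NFD:", code_points(nfd))
```

Both forms fold the ANGSTROM SIGN; they differ only in whether the result is left decomposed or recomposed.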

2.3 Normalization forms KC or KD

NFKC and NFKD add compatibility decomposition to the canonical decomposition. Compatibility decomposition provides several foldings and expansions, so that NFKC and NFKD provide the following:

Canonical duplicates folding
Digraph folding
Fraction expansion
Han duplicates folding (partially) (**)
Multigraph expansions for:
- Bullets
- Ligatures (***)
- Roman numerals
- Squared clusters
- Symbols
Non-break folding
Overline folding
Positional forms folding (*)
Space folding
Spacing accents expansion
Subscript folding
Superscript folding
Underline folding
Vertical forms folding (*)
Width folding

Notes:

(*) The foldings noted with a (*) could be considered 'near canonical' mappings: these distinctions are required solely for interoperation with some legacy environments and are never needed for ordinary plain text, so these folding operations are lossless.

(**) Han duplicates are folded by NFKC only if they occur across the boundary of the set of unified ideographs. There are many duplicates inside the unified set itself that might need folding for some forms of loose matching.

(***) Certain Arabic ligature expansions are provided as part of the data for positional forms foldings.

Each of the folding operations from the preceding list has well understood properties, and is appropriate in specific contexts. Not all of these folding operations may be appropriate in the same contexts. See the description for some of the more problematic folding or expansion operations. Also, the list of folding and expansion operations leaves out many common and useful folding operations, and only partially provides others.

There are four problems with using form NFKC as 'default folding':

NFKC bundles many folding operations together
NFKC does not provide some common and needed folding operations
NFKC provides some inappropriate foldings and side effects
As defined today, NFKC is frozen with respect to some kinds of updates to Unicode, while search operations would not need to be restricted in this way

2.4 Categorization of properties

(doesn't give target character)

2.5 Which ones are not provided, or spread across forms.

3.0 Folding and expansion operations:

3.1 General notes

3.1.1 Accent folding

need to add barred, slashed etc, hook descender not yet in canon...

unidata.txt in collation dir has annotation

3.1.2 Letter forms

Letterforms need to have 'final sigma' and 'final Hebrew' added to them.

Greek letter forms should be folded for Greek text. They should not be folded for mathematical and scientific usage, as doing so would conflate very distinct concepts (e.g. angle (THETA) and temperature (THETA SYMBOL), to give an example of common usage in physics).
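One way to make this context dependence explicit is a folding that the caller can switch off for scientific text. The sketch below is only an illustration: the mapping table, the function name and the `scientific` flag are all hypothetical, and the table covers just a few letter forms.

```python
# Hypothetical letterform table for this sketch; not a data file of this report.
GREEK_LETTERFORMS = {
    "\u03c2": "\u03c3",  # FINAL SIGMA  -> SIGMA
    "\u03d0": "\u03b2",  # BETA SYMBOL  -> BETA
    "\u03d1": "\u03b8",  # THETA SYMBOL -> THETA
    "\u03d5": "\u03c6",  # PHI SYMBOL   -> PHI
}

def fold_greek_letterforms(text: str, scientific: bool = False) -> str:
    # In mathematical or scientific text the variants carry distinct
    # meanings, so the folding is skipped entirely.
    if scientific:
        return text
    return "".join(GREEK_LETTERFORMS.get(ch, ch) for ch in text)

print(fold_greek_letterforms("\u03d1") == "\u03b8")
print(fold_greek_letterforms("\u03d1", scientific=True) == "\u03d1")
```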

3.1.3 Multigraphs

Some multigraphs that are EAW ambiguous should potentially be treated differently when resolved to EAW wide than when resolved to narrow.

3.2 Problematic foldings or expansions

3.2.1 Fraction expansion

Fraction expansion as defined in the compatibility decompositions can drastically change the semantics of a string and can lead to term boundary issues for searching. For example, expanding the fraction in the string DIGIT 5 + VULGAR FRACTION ONE QUARTER turns it into DIGIT 5 + DIGIT 1 + FRACTION SLASH + DIGIT 4. The string will now be found by a search for "51". Because of the semantics of FRACTION SLASH, the expansion also changes the numeric value from "5 and a quarter" into "51 over 4". Fraction expansion is therefore best avoided altogether.
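The effect can be reproduced with Python's `unicodedata.normalize`, used here only as a convenient way to apply the compatibility decomposition.

```python
import unicodedata

text = "5\u00bc"   # DIGIT FIVE + VULGAR FRACTION ONE QUARTER
expanded = unicodedata.normalize("NFKC", text)

print([unicodedata.name(c) for c in expanded])
print("51" in text)       # the original string does not match "51"
print("51" in expanded)   # the expanded string does
```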

3.2.2 Bullet expansions

If a circled bullet character is simply replaced by its contents, e.g. CIRCLED DIGIT 5 is replaced by DIGIT 5, the separation from the surrounding text is lost, and the DIGIT 5 could run together with adjacent numbers. For bullet characters using parenthesized or dotted letters or digits, this issue is somewhat mitigated by the fact that the bullet itself contains punctuation.
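A short sketch of the three bullet styles, again using Python's `unicodedata.normalize` as a stand-in for the compatibility decomposition. The trailing "12" is an arbitrary adjacent number chosen for illustration.

```python
import unicodedata

def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)

circled = "\u246412"        # CIRCLED DIGIT FIVE followed by "12"
parenthesized = "\u247812"  # PARENTHESIZED DIGIT FIVE followed by "12"
dotted = "\u248c12"         # DIGIT FIVE FULL STOP followed by "12"

print(nfkc(circled))        # bullet digit runs together with the number
print(nfkc(parenthesized))  # punctuation from the bullet keeps a boundary
print(nfkc(dotted))
```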

3.2.3 Spacing accents substitution

Spacing accents are mapped by compatibility decomposition to SPACE + non-spacing accent. This inappropriately introduces a space character into the term, as well as introducing non-spacing marks where none were in the data before.
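For instance, with Python's `unicodedata.normalize` (used here purely for illustration), a single term containing ACUTE ACCENT is split into two whitespace-separated terms:

```python
import unicodedata

term = "a\u00b4b"   # one term containing ACUTE ACCENT (U+00B4)
expanded = unicodedata.normalize("NFKC", term)

print([f"U+{ord(c):04X}" for c in expanded])
print(len(term.split()), len(expanded.split()))   # term count before and after
```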

3.2.4 Math folding

Form NFKC provides an aggressive folding of letterlike mathematical symbols to their nearest ASCII or Hebrew equivalent. In particular, the Hebrew characters used as letterlike symbols do not have RIGHT TO LEFT directionality, and the set of such letters in mathematical usage is sufficiently restricted that such folding makes little sense, except in pure 'looks like' style searches.
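The change in directionality can be inspected with Python's `unicodedata.bidirectional`, assumed here as a convenient view of the bidirectional category in the character database.

```python
import unicodedata

alef_symbol = "\u2135"   # ALEF SYMBOL, a letterlike symbol
folded = unicodedata.normalize("NFKC", alef_symbol)

# Print name and bidirectional category before and after the folding.
for ch in (alef_symbol, folded):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
```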

3.2.5 Various "cluster" expansions

Unicode contains many clusters, e.g. square symbols and some of the letterlike characters, that are made up of several characters. 'Decomposing' these may or may not be the right thing for search equivalence. Parenthesized characters and numbers would probably be immune to the term boundary issues raised earlier, but the story is less clear for others.

4.0 Specifications

The following table summarizes the definition of the folding operations. The source column identifies the set of characters subject to the folding operation by referencing a set of code points, a set of general categories, or a compatibility mapping tag. All characters matching the source condition are subject to the given folding. Note that this column does not indicate the set of characters with which the source characters are equivalenced by the folding. The target column indicates the result of the folding, either by reference to an operation, or, in some cases, by providing the single Unicode character to which a whole set of source characters is folded. The data file column indicates which data file carries the character by character information to implement the operation referred to in the target column.

Description	Source characters	Target Characters	Data file specifying the mapping
Accent removal	Latin/Greek/Cyrillic characters with canonical decomposition	base characters of canonical decomposition	CanonMapsAnnotated.txt (*)
Accent folding (includes stroke, hook, descender)	Latin/Greek/Cyrillic characters with accents	related base character	AccentFolding.txt [TBD]
Case folding	Lu and Lt	Lower case	[CaseFolding]
Canonical duplicates folding (Ohm - Omega)		canonical decomposition	CanonMapsAnnotated.txt (*)
Dashes folding	Pd	U+002D
Fraction expansion	<fraction>	compatibility decomposition	[UnicodeData]
Han Radical folding	2F00..2FD5, 2EF3, 2E9F	compatibility decomposition	[UnicodeData]
Hangzhou Numbers folding	3038..303A	compatibility decomposition	[UnicodeData]
Hebrew Alternates folding	FB20..FB28	compatibility decomposition	[UnicodeData]
Jamo folding	3131..3183	compatibility decomposition	[UnicodeData]
Kana folding	Hiragana	Katakana	[KanaFolding]
Ligature expansion Misc.	0587,FB00..FB06, FB13..FB17, FB4F	compatibility decomposition	[UnicodeData]
Letterforms folding	Variants of letter forms	Related base form	[TBD]
Multigraph expansion
- circled	<circled>	compatibility expansion	[UnicodeData]
- parenthesized		compatibility expansion	[UnicodeData]
- dotted		compatibility expansion	[UnicodeData]
- Ellipsis expansion	2024..2026	compatibility decomposition	[UnicodeData]
- Integral expansion	222C..222D,222F..2230	compatibility decomposition	[UnicodeData]
- Prime expansion	2033..2034,2036..2037	compatibility decomposition	[UnicodeData]
- Roman numerals	2160..2183	compatibility decomposition	[UnicodeData]
- other, e.g. c/o, TEL
Non-break folding	<noBreak>	compatibility decomposition	[UnicodeData]
Overline folding	FE49..FE4B	203E
Positional Forms folding - includes Arabic ligatures	<initial>, <medial>, <final>, <isolated>	compatibility decomposition	[UnicodeData]
Small forms folding	<small>	compatibility decomposition	[UnicodeData]
Space folding	Zs	U+0020
Spacing Accents <+>	00AF,00B4,00B8,02D8..02DD, 037A,0384,1FBD,1FBE..1FC0,1FFE,2017,203E,309B..309C
Square multigraphs expansion	<square>	compatibility decomposition	[UnicodeData]
Subscript folding	<sub>	compatibility decomposition	[UnicodeData]
Superscript folding	<super>	compatibility decomposition	[UnicodeData]
Symbol folding <+>	2107,2135..2138	compatibility decomposition	[UnicodeData]
Underline folding	FE4D..FE4F	005F
Vertical forms folding	<vertical>	compatibility decomposition	[UnicodeData]
Width folding	<wide>, <narrow>	compatibility decomposition	[UnicodeData]

Notation:

.. indicates an inclusive range

, indicates an alternative

<xxxx> refers to a compatibility mapping tag as defined in [CompatibilityTags]

Xx refers to a particular value for the General Category property defined in [UnicodeData]

<+> means a folding is contained in the Unicode data files, but is not recommended
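The table lends itself to a data-driven implementation in which each folding is selected either by compatibility tag or by General Category. The following is a minimal sketch under stated assumptions: it reads the character data through Python's `unicodedata` module rather than from the data files directly, the function names are hypothetical, and the expansion is applied one level deep only (no recursive decomposition).

```python
import unicodedata

def decomposition_tag(ch: str):
    # Returns the compatibility tag such as "<wide>", "" for a canonical
    # decomposition, or None if the character has no decomposition.
    fields = unicodedata.decomposition(ch).split()
    if not fields:
        return None
    return fields[0] if fields[0].startswith("<") else ""

def expand(ch: str) -> str:
    # Builds the decomposition string from the hexadecimal code points
    # that follow the tag. Only called for tagged decompositions.
    fields = unicodedata.decomposition(ch).split()[1:]
    return "".join(chr(int(f, 16)) for f in fields)

def fold(text: str, tags=(), categories=None) -> str:
    # tags: compatibility tags whose characters are decomposed.
    # categories: General Category value -> single target character.
    categories = categories or {}
    out = []
    for ch in text:
        if decomposition_tag(ch) in tags:
            out.append(expand(ch))
        else:
            out.append(categories.get(unicodedata.category(ch), ch))
    return "".join(out)

# Width folding, space folding and dashes folding as separate operations.
print(fold("\uff21\uff42\uff43", tags=("<wide>", "<narrow>")))
print(fold("a\u2003b", categories={"Zs": " "}))
print(fold("a\u2013b", categories={"Pd": "-"}))
```

Keeping each folding as its own selection is the design point: unlike NFKC, the caller chooses which rows of the table to apply.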

5.0 References

Delete unneeded ones

[AccentFolding]: Data file <ftp://ftp.unicode.org/Public/UNIDATA/AccentFolding.txt>
[CaseFolding]: Data file <ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt>
[Case Mapping]: Mark Davis, Unicode Technical Report #21: Case Mapping, <http://www.unicode.org/unicode/reports/tr21>
[Character Mapping Tables]: Mark Davis, Unicode Technical Report #22: Character Mapping Tables, <http://www.unicode.org/unicode/reports/tr22>
[Collation]: Mark Davis, Unicode Technical Report #10: Collation, <http://www.unicode.org/unicode/reports/tr10>
[CompatibilityTags]
[EastAsianWidth]: Data file <ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt>
[East Asian Width]: Asmus Freytag, Unicode Standard Annex #11, East Asian Width, <http://www.unicode.org/unicode/reports/tr11>
[KanaFolding]: Data file <ftp://ftp.unicode.org/Public/UNIDATA/KanaFolding.txt>
[SpecialCasing]: Data file <ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt>
[Unicode]: The Unicode Standard, Version 3.0, Addison Wesley Longman, 2000.
[UnicodeCharacterDatabase]: Readme file, <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html>
[UnicodeData]: Data file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>
[UnicodeData-Format]: Readme file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html>

These may not be needed.
[Bidirectional Algorithm]: Mark Davis, Unicode Standard Annex #9: The Bidirectional Algorithm, <http://www.unicode.org/unicode/reports/tr9>
[LineBreak]: Data file <ftp://ftp.unicode.org/Public/UNIDATA/LineBreak.txt>
[Line Breaking]: Asmus Freytag, Unicode Standard Annex #14: Line Breaking Properties, <http://www.unicode.org/unicode/reports/tr14>
[NamesList]: Data file <ftp://ftp.unicode.org/Public/UNIDATA/NamesList.txt>
[NamesList-Format]: Readme file <ftp://ftp.unicode.org/Public/UNIDATA/NamesList.html>

Changes from previous drafts

Initial draft

Copyright © 2000-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.