|Authors||Asmus Freytag (firstname.lastname@example.org)|
Status of this document
<Use formal language for proposed draft>
<Make real>1 Section1
!= Categorization (broader categories)
Folding operation – a folding operation removes a distinction between related characters. For example, case folding removes the case distinction, by replacing upper and title case variants of a character with the lower case.
Compatibility mappings – compatibility mappings substitute characters with their compatibility decomposition. Many compatibility mappings are foldings; some are multigraph expansions.
Multigraph expansion – a multigraph expansion replaces a multigraph, such as double prime, by its expansion into an equivalent series of single characters, in this case two single primes. Multigraph expansions are a subset of compatibility mappings.
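As an illustration only, the double prime expansion can be observed with Python's standard unicodedata module, which exposes the compatibility decompositions of the Unicode Character Database. The variable names are chosen for this sketch and are not part of the report.

```python
import unicodedata

DOUBLE_PRIME = "\u2033"  # DOUBLE PRIME, a multigraph
PRIME = "\u2032"         # PRIME

# NFKD applies the compatibility decomposition without recomposing.
expanded = unicodedata.normalize("NFKD", DOUBLE_PRIME)
names = [unicodedata.name(ch) for ch in expanded]
print(names)  # ['PRIME', 'PRIME']
```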
This technical report defines a consistent set of folding operations useful for fuzzy searches, among other things. They are:
The canonical duplicates are noted separately to distinguish them from the case of normalizing series of combining accents or decomposing accented characters, which are the bulk of canonical mappings. For certain loose matches, explicitly allowing accent removal as a form of separate folding is required.
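A minimal sketch of accent removal as a separate folding, assuming Python's unicodedata module. The function name remove_accents is hypothetical. The sketch drops every non-spacing mark (general category Mn) after canonical decomposition, which is broader than a folding restricted to particular scripts. Because it goes through canonical decomposition, it also folds canonical duplicates as a side effect.

```python
import unicodedata

def remove_accents(text: str) -> str:
    """Decompose canonically, drop non-spacing marks, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)

print(remove_accents("\u00C5ngstr\u00F6m"))  # Angstrom
```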
Where not noted otherwise, the folding description is a subset of the compatibility mappings defined in Normalization form NFKC, and can be derived by inspection from the data file. (I'll post a sorted extract of the Unicode data file to allow easier comments).
Consider trying to use Normalization form NFKC for the purpose of general text searching (which may be different in some respects from identifier matching). Some of the compatibility mappings seem absolutely necessary for sensible searching, for example the <wide> and <narrow> mappings. Others cannot be used. One particular example is: There are other examples of this in NFKC. Most involve digits and a loss of clear term separation.
The simple answer is that form NFKC does not achieve what people 'naively' assume it does, i.e. provide a set of sensible _folding_ operations for fuzzy equivalence.
Commonly requested folding operations are:
Native digit folding (not provided by NFKC)
Hyphen/Dash folding (not provided by NFKC)
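The two foldings above could be sketched as follows. This is an assumption-laden illustration, not a definition: it takes "native digits" to mean all characters of general category Nd and "hyphens and dashes" to mean all characters of general category Pd, and the function names fold_digits and fold_dashes are hypothetical.

```python
import unicodedata

def fold_digits(text: str) -> str:
    """Replace every decimal digit (category Nd) by its ASCII digit."""
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

def fold_dashes(text: str) -> str:
    """Replace every dash punctuation character (category Pd) by HYPHEN-MINUS."""
    return "".join(
        "-" if unicodedata.category(ch) == "Pd" else ch for ch in text
    )

arabic_indic = "\u0663\u0664"   # ARABIC-INDIC DIGIT THREE, FOUR
dashed = "a\u2013b\u2014c"      # EN DASH, EM DASH

print(fold_digits(arabic_indic))  # 34
print(fold_dashes(dashed))        # a-b-c

# NFKC leaves both strings unchanged, as stated above.
print(unicodedata.normalize("NFKC", arabic_indic) == arabic_indic)  # True
print(unicodedata.normalize("NFKC", dashed) == dashed)              # True
```

Under this assumption MINUS SIGN (U+2212, category Sm) is not folded; whether it should be is a design decision for the folding definition.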
The [CaseFolding] data file provides pure case folding information.
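For illustration, Python's built-in str.casefold() applies Unicode case folding, so it can be used to observe upper and title case variants being mapped to the same lower case form. This shows the effect only; it is not a statement about how the data file must be consumed.

```python
upper = "\u01C4"  # LATIN CAPITAL LETTER DZ WITH CARON
title = "\u01C5"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
lower = "\u01C6"  # LATIN SMALL LETTER DZ WITH CARON

# All three case variants fold to the same string.
folded = {s.casefold() for s in (upper, title, lower)}
print(folded == {lower})  # True
```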
The weight tables from the Unicode Collation Algorithm provide the source information for a number of foldings. These are not provided explicitly, but fall out when some of the sort weight differences are selectively ignored.
The Collation data tables provide intrinsic multigraph expansions which cannot be separated from the foldings.
Normalization forms C and D use canonical decomposition. They differ only in whether the result is (partially) recomposed (Normalization form C). The only folding they provide is canonical duplicates folding.
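The Ohm and Omega case of canonical duplicates folding can be checked with Python's unicodedata module. The snippet is a sketch for illustration and also shows that Normalization form C leaves a compatibility variant, here a fullwidth letter, untouched.

```python
import unicodedata

OHM_SIGN = "\u2126"      # OHM SIGN
OMEGA = "\u03A9"         # GREEK CAPITAL LETTER OMEGA
FULLWIDTH_A = "\uFF21"   # FULLWIDTH LATIN CAPITAL LETTER A

nfc_ohm = unicodedata.normalize("NFC", OHM_SIGN)
nfd_ohm = unicodedata.normalize("NFD", OHM_SIGN)
nfc_wide = unicodedata.normalize("NFC", FULLWIDTH_A)

print(nfc_ohm == OMEGA, nfd_ohm == OMEGA)  # True True
print(nfc_wide == FULLWIDTH_A)             # True
```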
NFKC and NFKD add compatibility decomposition to the canonical decomposition. Compatibility decomposition provides several foldings and expansions, so that NFKC and NFKD provide these:
(*) The foldings noted with a (*) could be considered 'near canonical' mappings, since these distinctions are required solely for interoperation with some legacy environments but are never needed for ordinary plain text; these folding operations are therefore lossless.
(**) Han duplicates are folded by NFKC only if they occur across the boundary of the set of unified ideographs. There are many duplicates inside the unified set itself that might need folding for some forms of loose matching.
(***) Certain Arabic ligature expansions are provided as part of the data for positional forms foldings.
Each of the folding operations from the preceding list has well understood properties, and is appropriate in specific contexts. Not all of these folding operations may be appropriate in the same contexts. See the description for some of the more problematic folding or expansion operations. Also, the list of folding and expansion operations leaves out many common and useful folding operations, and only partially provides others.
There are four problems with using form NFKC as 'default folding':
(doesn't give target character)
need to add barred, slashed etc, hook descender not yet in canon...
unidata.txt in collation dir has annotation
Letterforms need to have 'final sigma' and 'final Hebrew' added to them.
Greek letter forms should be folded for Greek text. They should not be folded for mathematical and scientific usage, as doing so would conflate very distinct concepts (e.g. angle (THETA) and temperature (THETA SYMBOL), to give an example of common usage in physics).
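The conflation described above can be seen by applying NFKC to the two theta characters. This is a sketch using Python's unicodedata module; the sample strings are invented for illustration.

```python
import unicodedata

THETA = "\u03B8"         # GREEK SMALL LETTER THETA
THETA_SYMBOL = "\u03D1"  # GREEK THETA SYMBOL

angle = "sin " + THETA
temperature = "sin " + THETA_SYMBOL

# The strings differ before folding but are identical afterwards.
before = angle == temperature
after = (unicodedata.normalize("NFKC", angle)
         == unicodedata.normalize("NFKC", temperature))
print(before, after)  # False True
```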
Some multigraphs that are EAW ambiguous should potentially be treated differently when resolved to EAW wide than when resolved to narrow.
Fraction expansion as defined in the compatibility decompositions can lead to a drastic change of the semantics of a string and can lead to term boundary issues for searching. For example: Expanding the fraction in this string: DIGIT 5 + VULGAR FRACTION ONE QUARTER turns it into DIGIT 5 + DIGIT 1 + FRACTION SLASH + DIGIT 4. This now will be found by a search for "51". Because of the semantics of FRACTION SLASH the expansion changed the numeric value from "5 and a quarter" into "51 over 4". Fraction expansion is therefore best avoided altogether.
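The example above can be reproduced with Python's unicodedata module. This is a sketch for illustration; the naive substring test stands in for whatever search an application actually performs.

```python
import unicodedata

text = "5\u00BC"  # DIGIT FIVE + VULGAR FRACTION ONE QUARTER
expanded = unicodedata.normalize("NFKC", text)

print([unicodedata.name(ch) for ch in expanded])
# ['DIGIT FIVE', 'DIGIT ONE', 'FRACTION SLASH', 'DIGIT FOUR']

# A naive substring search for "51" matches only after the expansion.
print("51" in text, "51" in expanded)  # False True
```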
If a circled bullet character is simply replaced by its contents, e.g. CIRCLED DIGIT 5 is replaced by DIGIT 5, the separation from the surrounding text is lost, and the DIGIT 5 could run together with adjacent numbers. For bullet characters using parenthesized or dotted letters or digits, this issue is somewhat mitigated by the fact that the bullet itself contains punctuation.
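A short sketch of the difference between circled and parenthesized bullets, using NFKC from Python's unicodedata module. The surrounding digits are invented to make the loss of separation visible.

```python
import unicodedata

circled = "1" + "\u2464" + "2"        # CIRCLED DIGIT FIVE between two digits
parenthesized = "1" + "\u2478" + "2"  # PARENTHESIZED DIGIT FIVE between two digits

circled_folded = unicodedata.normalize("NFKC", circled)
parenthesized_folded = unicodedata.normalize("NFKC", parenthesized)

print(circled_folded)        # 152
print(parenthesized_folded)  # 1(5)2
```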
Spacing accents are mapped by compatibility decomposition to SPACE + non-spacing accent. This inappropriately introduces a space character into the term, as well as introducing non-spacing marks where none were in the data before.
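For illustration, the effect on a single spacing accent can be inspected with Python's unicodedata module:

```python
import unicodedata

ACUTE = "\u00B4"  # ACUTE ACCENT, a spacing accent
expanded = unicodedata.normalize("NFKC", ACUTE)

print([unicodedata.name(ch) for ch in expanded])
# ['SPACE', 'COMBINING ACUTE ACCENT']
print([unicodedata.category(ch) for ch in expanded])
# ['Zs', 'Mn']
```

A term that contained the spacing accent now contains a word separator, which a tokenizer that splits on spaces would treat as a term boundary.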
Form NFKC provides an aggressive folding of letterlike mathematical symbols to their nearest ASCII or Hebrew equivalent. In particular, the Hebrew characters used as letterlike symbols do not have RIGHT TO LEFT directionality, and the set of such letters in mathematical usage is sufficiently restricted that such folding makes little sense, except in pure 'looks like' style searches.
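The change in directionality can be checked with Python's unicodedata module, as a sketch: the letterlike ALEF SYMBOL has bidirectional class L, while the Hebrew letter it is folded to has class R.

```python
import unicodedata

ALEF_SYMBOL = "\u2135"  # ALEF SYMBOL (letterlike, mathematical usage)
folded = unicodedata.normalize("NFKC", ALEF_SYMBOL)

print(unicodedata.name(folded))                   # HEBREW LETTER ALEF
print(unicodedata.bidirectional(ALEF_SYMBOL))     # L
print(unicodedata.bidirectional(folded))          # R
```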
Unicode contains many clusters, e.g. square symbols and some of the letterlike characters, that are made up of several characters. 'Decomposing' these may or may not be the right thing for search equivalence. Parenthesized characters and numbers would probably be immune to the term boundary issues raised earlier, but the story is less clear for others.
The following table summarizes the definition of the folding operations. The source column identifies the set of characters subject to the folding operation by referencing a set of code points, a set of general categories, or a compatibility mapping tag. All characters matching the source condition are subject to the given folding. Note that this column does not indicate the set of characters with which the source characters are equivalenced by the folding. The target column indicates the result of the folding, either by reference to an operation, or, in some cases, by providing the single Unicode character to which a whole set of source characters is folded. The data file column indicates which data file carries the character by character information to implement the operation referred to in the target column.
|Description||Source characters||Target Characters||Data file specifying the mapping|
|Accent removal||Latin/Greek/Cyrillic characters with canonical decomposition||base characters of canonical decomposition||CanonMapsAnnotated.txt (*)|
|Accent folding (includes stroke, hook, descender)||Latin/Greek/Cyrillic characters with accents||related base character||AccentFolding.txt [TBD]|
|Case folding||Lu and Lt||Lower case||[CaseFolding]|
|Canonical duplicates folding (Ohm - Omega)||canonical decomposition||CanonMapsAnnotated.txt (*)|
|Fraction expansion||<fraction>||compatibility decomposition||[UnicodeData]|
|Han Radical folding||2F00..2FD5, 2EF3, 2E9F||compatibility decomposition||[UnicodeData]|
|Hangzhou Numbers folding||3038..303A||compatibility decomposition||[UnicodeData]|
|Hebrew Alternates folding||FB20..FB28||compatibility decomposition||[UnicodeData]|
|Jamo folding||3131..3183||compatibility decomposition||[UnicodeData]|
|Ligature expansion Misc.||0587,FB00..FB06, FB13..FB17, FB4F||compatibility decomposition||[UnicodeData]|
|Letterforms folding||Variants of letter forms||Related base form||[TBD]|
|- circled||<circled>||compatibility expansion||[UnicodeData]|
|- parenthesized||compatibility expansion||[UnicodeData]|
|- dotted||compatibility expansion||[UnicodeData]|
|- Ellipsis expansion||2024..2026||compatibility decomposition||[UnicodeData]|
|- Integral expansion||222C..222D,222F..2230||compatibility decomposition||[UnicodeData]|
|- Prime expansion||2033..2034,2036..2037||compatibility decomposition||[UnicodeData]|
|- Roman numerals||2160..2183||compatibility decomposition||[UnicodeData]|
|- other, e.g. c/o, TEL|
|Non-break folding||<no-break>||compatibility decomposition||[UnicodeData]|
|Positional Forms folding (includes Arabic ligatures)||<initial>, <medial>, <final>, <isolated>||compatibility decomposition||[UnicodeData]|
|Small forms folding||<small>||compatibility decomposition||[UnicodeData]|
|Spacing Accents <+>||00AF,00B4,00B8,02D8..02DD, 037A,0384,1FBD,1FBE..1FC0,1FFE,2017,203E,309B..309C|
|Square multigraphs expansion||<square>||compatibility decomposition||[UnicodeData]|
|Subscript folding||<sub>||compatibility decomposition||[UnicodeData]|
|Superscript folding||<super>||compatibility decomposition||[UnicodeData]|
|Symbol folding <+>||2107,2135..2138||compatibility decomposition||[UnicodeData]|
|Vertical forms folding||<vertical>||compatibility decomposition||[UnicodeData]|
|Width folding||<wide>, <narrow>||compatibility decomposition||[UnicodeData]|
Delete unneeded ones
Copyright © 2000-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.