Proposed Draft Unicode Technical Report #XX

Character Foldings

Version	0d6
Authors	Asmus Freytag ([email protected]
Date	2002-04-29
This Version	This document
Previous Version	none
Latest Version	(UTC review only)
Tracking Number	0

NOTE TO UTC:

In response to the review at UTC#88 I have revised the text and present it again as an L2 document for internal review. I will collect all feedback addressed to me and generate a new version for UTC#89.

Summary

This report identifies a set of character foldings, in other words, operations that ignore certain distinctions between similar characters.

Status

This document has been approved by the Unicode Technical Committee for membership-only review as a Proposed Draft Unicode Technical Report. Making this document available for internal review does not imply endorsement by the Unicode Consortium. This is an internal draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

Please send comments to the authors. A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.

The References provide related information that is useful in understanding this document. Please mail corrigenda and other comments to the author(s).

Scope
Rationale
2.1 Relation to Normalization
Defintions
Specifications
4.1 Folding Algorithm
4.2 Specification of Foldings
Notes and Guidelines
References
Modifications

A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.

1.0 Scope

This technical reports defines a consistent set of folding operations useful for fuzzy searches, among other things. Each of the folding operations specified in this report has well understood properties, and is appropriate in specific contexts. For example some identifiers needs case folding, some do not. Some text searches needs to preserve <super> form (trademark symbol), while others do not. Not all of these folding operations may be appropriate in the same contexts. See the description for some of the more problematic folding or expansion operations.

This report defines a single search term folding specification (STF) which combines canonical normalization with optional folding operations.

e.g. STF = NFx+<case folding>+<font>+<super>+<sup>

This allows implementers to decide which folding option is useful for a particular use.

String foldings for identifier matches are a related to search term foldings, but with the important difference that the set of allowable characters is restricted. Where required, identifier matches are addressed specifically below.

2.0 Rationale

For the purpose of fuzzy text matching, including both defined identifier matching and general text searching it is often necessary to selectively ignore otherwise meaningful distinctions between related characters, for example case, wide vs. narrow, etc. This process can be called search term folding. Depending on the operation, different foldings need to be applied, and possible interactions must be carefully managed. False positive matches may be acceptable more often than false negative matches.

2.1 Relation to Normalization

Search term folding is somewhat similar to normalization in that it also reduces several alternative representations into one, but there are important differences. Normalization is often intended for permanent transformation of data, but search term folding is by nature transient. Where the normalization forms offer two different levels of distinctions they preserve (canonical and compatibility), the choice of what type of distinction can or even must be ignored for search term folding needs to be more specific and depends on the nature of the operation. One size does not fit all. Applying some of the compatibility mappings may lead to unintended false positive or negative matches, preventing their use in general text searches. (See section 5.2 Problematical foldings).

Furthermore, normalization and case folding are defined as separate and independent operations, but case folding often occurs together with other foldings in search term folding. In order to avoid inconsistencies, search term folding needs to address their interaction.

Search term folding is best expressed as applied to text after canonical normalization, however, whether the normalized text is in the composed or decomposed form is of secondary importance in terms of defining the foldings. It will be relevant to distinguish between NFC and NFD whenever pre-folded search terms are to be interchanged. Otherwise, due to the transient nature of search term folding, the distinction is immaterial.

3.0 Definitions

Folding operation – a folding operation removes a distinction between related characters. For example, case folding removes the case distinction, by replacing upper and title case variants of a character with the lower case.

Compatibility mappings – compatibility mappings substitute characters with their compatibility decomposition. Many compatibility mappings are foldings, some are multigraph expansions.

Multigraph expansion – a multigraph expansion replaces a digraph or higher multigraph, such as e.g. double or triple prime, by its expansion into an equivalent series of single characters, in this case, two or three single primes. Multigraph expansions are a subset of expanding compatibility mappings.

4.0 Specifications

4.1 Basic algorithm

The basic algorithm for search term folding can be stated as

a. Apply canonical decomposition
b. Apply optional folding operations
c. Recurse 1 & 2 until stable (*)
d. Apply composition if necessary

(*) Step (c) might be removed by by enforcing certain foldings before decomposition.

For identifier folding, it is important to account for prohibited characters. The basic algorithm then becomes:

a. Apply canonical decomposition
b. Apply optional folding operations (*)
c. Repeat a & b until stable (**)
d. Apply composition if necessary
e. Eliminate or flag forbidden characters

(*) Foldings can disallow certain characters by mapping them to forbidden characters.
(**) Step (c) might be removed by by enforcing certain foldings before decomposition.

4.2 Specification of folding operations

[Ed note: Files containing tables for Letterform folding and Kana folding are needed]

The following table summarizes the definition of the folding operations. Foldings that are multigraph expansions have been collected at the end of the table.

The description column identifies the folding.
The source column identifies the set of characters subject to the folding operation by referencing a set of code points, a set of general categories, or a compatibility mapping tag. All characters matching the source condition are subject to the given folding. Note that this column does not indicate the set of characters with which the source characters are equivalenced by the folding.
The target column indicates the result of the folding, either by reference to an operation, or, in some cases, by providing the single Unicode character to which a whole set of source characters is folded.
The data file column indicates which data file carries the character by character information to implement the operation referred to in the target column.

Descriptive Name	Source Characters	Target Characters	Data file specifying the mapping
Accent removal	Latin/Greek/Cyrillic characters with canonical decomposition	base characters of canonical decomposition	[UnicodeData]
Diacritic folding (includes stroke, hook, descender)	Latin/Greek/Cyrillic characters with accents	related base character	AccentFolding.txt [TBD]
Case folding	any	case fold according to CaseFolding.txt	[CaseFolding]
Canonical duplicates folding (e.g. Ohm - Omega)	0374, 037E, 0387, 1FBE, 1FEF, 1FFD, 2000, 2001, 2126, 212A, 212B, 2329..232A	canonical decomposition	[UnicodeData]
Dashes folding	Pd	U+002D
Greek letterforms folding	03D0..03D2, 03D5..03D6, 03F0..03F2, 03F4..03F5	compatibility decomposition	[UnicodeData]
Han Radical folding	2F00..2F5D, 2EF3, 2E9F	compatibility decomposition	[UnicodeData]
Hangzhou Numbers folding	3038..303A	compatibility decomposition	[UnicodeData]
Hebrew Alternates folding	FB20..FB28	compatibility decomposition	[UnicodeData]
Jamo folding	3131..3183	compatibility decomposition	[UnicodeData]
Kana folding	Hiragana	Katakana	KanaFolding.txt [TBD]
Ligature expansion Misc.	0587, 0675..0678, 0E33, 0EB3, 0EDC..0EDD, 0F77, 0F79, FB00..FB06, FB13..FB17, FB4F	compatibility decomposition	[UnicodeData]
Letterforms folding	Variants of letter forms 017F (long s)	related base form 0073	LetterFolding.txt [TBD]
Math symbol folding	<font>	compatibility decomposition	[UnicodeData]
Native digit folding	Nd	substitute ASCII digit of same numeric property	[UnicodeData]
Non-break folding	<no-break>	compatibility decomposition	[UnicodeData]
Overline folding	FE49..FE4B	203E (*)
Positional Forms folding - includes Arabic ligatures	<initial>, <medial>, <final>, <isolate>	compatibility decomposition	[UnicodeData]
Small forms folding	<small>	compatibility decomposition	[UnicodeData]
Space folding	Zs	U+0020
Spacing Accents <+>	00AF,00B4,00B8,02D8..02DD, 037A,0384,1FBD,1FBE..1FC0, 1FFE,2017,203E,309B..309C	compatibility decomposition	[UnicodeData]
Subscript folding	<sub>	compatibility decomposition	[UnicodeData]
Superscript folding	<super>	compatibility decomposition	[UnicodeData]
Symbol folding <+>	00B5, 2107,2135..2138	compatibility decomposition	[UnicodeData]
Underline folding	2017, FE4D..FE4F	005E
Vertical forms folding	<vertical>	compatibility decomposition	[UnicodeData]
Width folding	<wide>, <narrow>	compatibility decomposition	[UnicodeData] **
Multigraph expansions
- Fraction expansion	<fraction>	compatibility decomposition	[UnicodeData]
- Circled	<circled>	compatibility decomposition	[UnicodeData]
- Parenthesized	2474..2487,249C..24B5, 3200..3243	compatibility decomposition	[UnicodeData]
- Dotted	2488..249B	compatibility decomposition	[UnicodeData]
- Ellipsis expansion	2024..2026	compatibility decomposition	[UnicodeData]
- Integral expansion	222D..222C,222F..2230	compatibility decomposition	[UnicodeData]
- Prime expansion	2033..2034,2036..2037	compatibility decomposition	[UnicodeData]
- Roman numerals	2160..2183	compatibility decomposition	[UnicodeData]
- Squared	<square>	compatibility decomposition	[UnicodeData]
- Squared (unmarked)	3358..3370, 33E0..33FE, 32C0..32CB	compatibility decomposition	[UnicodeData]
- Digraphs	0132..0133, 013F..0140. 0149, 01C4..01CC, 01F1..01F3, 1E9A	compatibility decomposition	[UnicodeData]
-Other multigraph, e.g. c/o, TEL	203C, 2047..2049, 20A8, 2100..2101, 2103, 2105..2106, 2109, 2116, 2121	compatibility decomposition	[UnicodeData]

Notation:

.. indicates an inclusive range

, indicates an alternative

<xxxx> refers to a compatibility mapping tag as defined in [CompatibilityTags]

Xx refers to a particular value for the General Category property defined in [UnicodeData]

<+> means a folding is contained in the Unicode data files, but is not recommended

(*) indicates a target that is subject to another folding, other than case folding

[Ed. Note: Additional foldings that have been suggested

Slash folding (eg. 2044 FRACTION SLASH to 002F SOLIDUS)
Additional look-alike foldings (visually confusable)

would need tables for the latter to add them]

(**) These can be found at "http//oss.software.ibm.com/cvs/icu/~checkout~/icu4j/src/com/ibm/text/resources/Transliterator_Fullwidth_Halfwidth.txt?cvsroot=ICU4J"

5.0 Notes and Guidelines

5.1 General notes

5.1.1 Case folding

The [CaseFolding] data file provides case folding information.

5.1.2 Diacritic folding

Diacritic folding goes beyond the decomposition and removal of accents, umlauts, cedillas etc. that is provided by the canonical decompositions, but also includes barred, slashed forms etc, as well as hooks, descenders, etc.

5.1.3 Letter forms

Greek letter forms should be folded for Greek text. They should not be folded for mathematical and scientific usage as doing so would conflate very distinct concepts (e.g. angle (THETA) and temperature (THETA SYMBOL) to give an examples of common usage in physics).

[Ed. note: Letterforms need to have 'final sigma' and 'final Hebrew' added to them. ]

5.1.4 Semantically neutral foldings

Text that is subjected to a standard Unicode rendering process should display the same whether or not the semantically neutral foldings have been applied (in fact, in practice it is far more likely that un-folded data will display poorly on such a system).

Therefore these foldings are candidates for permanent data transformations.

The following foldings are semantically neutral

Positional forms folding
Vertical forms folding
Canonical decomposition and composition, in particular Normalization Forms NFC and NFD

5.1.5 Locale based foldings

Locale based foldings would fold characters that are 'nearly equivalent' in a particular langauage. For example, a locale based folding for Swedish could match

oe - o umlaut o slash - o umlaut ss to sharp s y to u and v to w

etc. In other words, locale based foldings would be different for some user groups using the same script (in this case Latin). The recommended way to implement locale based searching based on sorting tables is found in [Collation].

5.1.6 Compatibility decompositions

Compatibility decomposition provides a fixed combination of several foldings and expansions. It is in fact the source of most of the foldings in the table in section 4.0. There are two ways to subdivide compatibility decompositions

by compatibility decomposition type (the value in <> in the date file, e.g. <super>)
by explicitly limiting the range of source characters

The specifications in section 4.0 uses the first method, whenever the compatibility tag is well defined and meaningful. Where it is too broad, e.g., for the <compat> tag, foldings are further subdivided by defining specific ranges of source characters.

Using compatibility decomposition is convenient, since existing algorithms for Normalization may provide them. However, the full decomposition includes several foldings that may o not be appropriate for the given purpose. By selectively not applying the decomposition to certain character ranges given in Section 4.0, one can in effect limit the compatibility decomposition to only the desired foldings.

5.2 Problematic foldings or expansions

Some foldings can have unintended consequences, including inadvertent changes in the semantics of the text. In most cases, it is best to be conservative and avoid problematic foldings altogether. There are two general exceptions to this rule. The first is the case of identifier matching. If a folding has a prohibited character as input or as one of the output characters, it will not match any legal identifier. Therefore, for properly restricted inputs, one may safely use fixed combinations of foldings, such as NFKC. The other exception is the case of more extensive string pre-processing, discussed below.

5.2.1 Fraction expansion

Fraction expansion as defined in the compatibility decompositions can lead to a drastic change of the semantics of a string and can lead to term boundary issues for searching. For example: Expanding the fraction in this string: DIGIT 5 + VULGAR FRACTION ONE QUARTER turns it into DIGIT 5 + DIGIT 1 + FRACTION SLASH + DIGIT 4. This now will be found by a search for "51". Because of the semantics of FRACTION SLASH the expansion changed the numeric value from "5 and a quarter" into "51 over 4". Fraction expansion is therefore best avoided altogether.

By modifying the fraction expansion from the standard compatibility decomposition and inserting a THIN SPACE before the fraction, it is possible to prevent the expanded fraction from coalescing with preceding digits. However, if there are no preceding digits, no THIN SPACE must be added, or strings containing the expanded fraction would no longer match strings with already expanded fractions which presumeably would not contain THIN SPACE characters.

5.2.2 Bullet expansions

If a circled bullet character is simply replaced by its contents, e.g. CIRCLED DIGIT 5 is replaced by DIGIT 5, the separation from the surrounding text is lost, and the DIGIT 5 could run together with adjacent numbers. For bullet characters using parenthesized or dotted letters or digit, this issue is somewhat mitigated by fact that the bullet itself contains punctuation. Bullet characters are commonly used like footnote marks to refer to other text, in other words, they do not just occur at the beginning of bulleted lines.

5.2.3 Spacing accents substitution

Spacing accents are mapped by compatibility decomposition to SPACE + non-spacing accent. This inappropriately introduces a space character into the term, as well as introducing non-spacing marks where none were in the data before. This former is especially problematic where the matching operation is bounded by spaces.

5.2.4 Math folding

The set of compatibility decompositions include the folding of letterlike mathematical symbols to their nearest ASCII or Hebrew equivalent. In particular the Hebrew characters used as letterlike symbols do not have RIGHT TO LEFT directionality and the set of such letters in mathematical usage is sufficiently restricted that such folding makes little sense, except in pure 'looks like' style searches.

5.2.5 Various "cluster" expansions

Unicode contains many clusters, e.g. square symbols, some of the letterlike characters that are made up of several characters. 'Decomposing' these may or may not be the right thing for search equivalence. Parenthesized characters and numbers would probably be immune to the term boundaries issues raised earlier, but the story is less clear for others.

5.2.5 Jamo expansion

For more information on Jamo expansion see section 10.4 Hangul (addition) in [Unicode 3.2].

5.2.6 Preserving semantics

To a limited extent, the problems surrounding bullet expansion can be mitigated by inserting a THIN SPACE around the expansion to set the expanded text off from the surrounding text. This must be applied after any space folding, as otherwise the THIN SPACE will become SPACE and might be considered a search term delimiter.

References

[Ed. note: These may need to be updated. Remove unneeded ones before publication]

References

[CaseFolding]	Data file ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt
[Case Mapping]	Mark Davis, Unicode Technical Report #21: Case Mapping, http://www.unicode.org/unicode/reports/tr21
[Character Mapping Tables]	Mark Davis, Unicode Technical Report #22: Character Mapping Tables, http://www.unicode.org/unicode/reports/tr22
[Collation]	Mark Davis, Unicode Technical Report #10: Collation, http://www.unicode.org/unicode/reports/tr10>
[CompatibilityTags]
[Charts]	The online code charts can be found at http://www.unicode.org/charts/ An index to characters names with links to the corresponding chart is found at http://www.unicode.org/charts/charindex.html
[EAW]	Unicode Standard Annex #11, East Asian Width. http://www.unicode.org/unicode/reports/tr11 For a definition of East Asian Width
[FAQ]	Unicode Frequently Asked Questions http://www.unicode.org/unicode/faq/ For answers to common questions on technical issues.
[Glossary]	Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.
[Normalization]	Unicode Standard Annex #15: Unicode Normalization Forms http://www.unicode.org/unicode/reports/tr15/
[Reports]	Unicode Technical Reports http://www.unicode.org/unicode/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[TUS]	The Unicode Standard, Version 3.0, (Reading, Massachusetts: Addison-Wesley Developers Press 2000) or online as http://www.unicode.org/unicode/uni2book/u2.html
[U3.1]	Unicode Standard Annex #27: Unicode 3.1 http://www.unicode.org/unicode/reports/tr27/
[U3.2]	Unicode Standard Annex #28: Unicode 3.2 http://www.unicode.org/unicode/reports/tr28/
[UCD]	Unicode Character Database. http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html For and overview of the Unicode Character Database and a list of its associated files
[UXML]	Unicode Technical Report #20: Unicode in XML and other Markup Languages http://www.unicode.org/unicode/reports/tr20/
[Versions]	Versions of the Unicode Standard http://www.unicode.org/unicode/standard/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them.
[XML]	Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, Eds., Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation 6-October-2000, <http://www.w3.org/TR/REC-xml>

Modifications

Changes from Tracking Number 0

Modifications are not tracked yet

Copyright © 2001-2002 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

Proposed Draft Unicode Technical Report #XX

Character Foldings

NOTE TO UTC:

Contents

1.0 Scope

5.1 General notes

5.1.1 Case folding

5.1.2 Diacritic folding

5.1.3 Letter forms

5.1.4 Semantically neutral foldings

5.1.5 Locale based foldings

5.1.6 Compatibility decompositions

5.2 Problematic foldings or expansions

5.2.1 Fraction expansion

5.2.2 Bullet expansions

5.2.3 Spacing accents substitution

5.2.4 Math folding

5.2.5 Various "cluster" expansions

5.2.5 Jamo expansion

5.2.6 Preserving semantics

References