Unicode Technical Report #11
Unicode Character Property "East Asian Width"

Revision	3.0
Authors	Asmus Freytag
Date	Jun 11, 1999
This Version	http://www.unicode.org/unicode/reports/tr11-3
Previous Version	http://www.unicode.org/unicode/reports/tr11-2
Latest Version	http://www.unicode.org/unicode/reports/tr11

Summary

This report presents the specifications of an informative property for Unicode characters that is useful when interoperating with East Asian legacy character sets.

Status of this document

This document has been considered and approved by the Unicode Technical Committee for publication as a Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative or as normative specification. Please mail corrigenda and other comments to [email protected].

East Asian Width Property

Overview

In mixed-width East Asian legacy encodings, there is a concept of an inherent width of a character. For a fixed-pitch font, this width translates to a display width of either one half or a whole unit width. A common name for this unit width is "Em". It is customarily the height of the letter 'M', but since in East Asian fonts the standard character cell is square, it is the same as the unit width.

NOTE: the character width for a fixed pitch Latin font like Courier is 3/5 of an em.

Layout and line breaking (to cite only two examples) in an East Asian context show systematic variations depending on the value of the East Asian Width property, even for fonts that are not fixed-pitch. Further, the same information is useful in creating correct transcoding tables for East Asian character sets.

Scope

The East Asian Width property provides a useful concept for implementations that

have to interwork with East Asian legacy character encodings
support both East Asian and Western typography and line layout
need to associate fonts with unmarked text runs containing East Asian characters

This Unicode Technical Report does not provide rules or specifications of how this property might be used in font design or line layout, since, while a useful property for this purpose, it is only one of several character properties that would need to be considered.

Description

By convention, 1/2 Em wide characters of East Asian legacy encodings are called "half-width" (or hankaku characters in Japanese), the others are called correspondingly "full-width" (or zenkaku) characters. Legacy encodings often use a single byte for the half-width characters and two bytes for the full-width characters. In the Unicode Standard, no such distinction is made, but understanding the distinction is often necessary when interchanging data with legacy systems, especially when fixed size buffers are involved.
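The fixed-size buffer concern can be illustrated with a minimal sketch. It assumes Python's built-in `shift_jis` codec as a stand-in for a mixed-width legacy encoding, and the helper name `fit_to_buffer` is hypothetical. Encoding one character at a time is only valid for stateless encodings such as this one.

```python
def fit_to_buffer(text: str, size: int, encoding: str = "shift_jis") -> bytes:
    """Encode as many whole characters of `text` as fit into `size` bytes."""
    out = bytearray()
    for ch in text:
        # Half-width characters yield 1 byte, full-width characters 2 bytes.
        encoded = ch.encode(encoding)
        if len(out) + len(encoded) > size:
            break  # stop before a double-byte character would be split
        out += encoded
    return bytes(out)


# "AB" plus two ideographs needs 1 + 1 + 2 + 2 = 6 bytes.
# A 5-byte buffer therefore holds only "AB" and the first ideograph.
data = fit_to_buffer("AB\u6F22\u5B57", 5)
print(len(data))  # 4
```

Counting characters instead of bytes would overrun such a buffer, which is why the width distinction matters when exchanging data with legacy systems.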

Some character blocks in the compatibility zone contain characters that are explicitly marked "half-width" and "full-width" in their character name, but for all other characters the width property must be implicitly derived. Some characters behave differently in an East Asian context than in a non-East Asian context. Their default width property is considered ambiguous and needs to be resolved into an actual width property based on context.

This technical report assigns to each Unicode character one of six values as its default width property: Ambiguous, Full Width, Half Width, Narrow, Wide, or Not East Asian (Neutral), each defined below. For any given operation, these six default properties resolve into only two property values, narrow and wide, depending on context.
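The resolution from six default values to two can be sketched as follows. This is a minimal sketch, assuming Python's standard `unicodedata` module, whose data comes from a later version of the Unicode Character Database and which uses the abbreviations `F` and `H` where this report writes FW and HW. The function name `resolve_width` and the boolean `east_asian_context` are hypothetical.

```python
import unicodedata

# Values returned by unicodedata.east_asian_width():
# "A" Ambiguous, "F" Full Width, "H" Half Width,
# "Na" Narrow, "W" Wide, "N" Not East Asian (Neutral).
ALWAYS_WIDE = {"W", "F"}
ALWAYS_NARROW = {"Na", "H", "N"}


def resolve_width(ch: str, east_asian_context: bool) -> str:
    """Resolve the default width property of `ch` to 'narrow' or 'wide'."""
    prop = unicodedata.east_asian_width(ch)
    if prop in ALWAYS_WIDE:
        return "wide"
    if prop in ALWAYS_NARROW:
        return "narrow"
    # Only Ambiguous remains: the context decides.
    return "wide" if east_asian_context else "narrow"


print(resolve_width("\u4E00", east_asian_context=False))  # wide
print(resolve_width("\u03A9", east_asian_context=True))   # wide
print(resolve_width("\u03A9", east_asian_context=False))  # narrow
```

How the context is determined (language tag, font, data source, markup) is left to the caller in this sketch.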

Definitions

All terms not defined here shall be as defined in the Unicode Standard.

East Asian Width - In the context of interoperating with East Asian legacy character encodings and implementing East Asian typography, character width is an abstract concept. It can take on two values, narrow and wide. The actual display width of a glyph is given by the font. An important class of fixed-width legacy fonts contains glyphs of just two widths, with the wider glyphs twice as wide as the narrower glyphs.

East Asian Wide (W) - There are wide characters that are defined as full-width and also wide characters that are implicitly wide (such as the Unified Han Ideographs or Squared Katakana Symbols) because they occur only in the context of East Asian typography where they are wide characters.

East Asian FullWidth (FW) - East Asian Wide characters that are defined as full width and therefore are compatibility equivalents of implicitly narrow but unmarked characters elsewhere in the Unicode Standard. FW characters form a proper subset of W characters.

East Asian Narrow (Na) - There are narrow characters that are defined as half-width and also characters that are half-width by implication because they have full-width clones (all of ASCII is an example).

East Asian Half-width (HW) - Narrow characters that are defined as half-width and therefore are compatibility equivalents of implicitly wide but unmarked characters elsewhere in the Unicode Standard. HW characters form a proper subset of Na characters.

Note: Because half-width punctuation behaves in some important ways like ideographic punctuation, it is useful to distinguish characters defined as half-width from characters that are narrow by implication. Since this information cannot be trivially derived from the block names, it is provided explicitly below.

East Asian Ambiguous (A) - Characters that occur in East Asian legacy character sets as wide characters, and as narrow characters in their own local or non-East Asian usage. Examples are the Greek and Cyrillic alphabets found in East Asian character sets, as well as some of the mathematical symbols. Ambiguous characters require context to resolve their width.

Note: Because East Asian legacy character sets do not always include complete case pairs of Latin characters, two members of a pair may have different EA Width properties:
Ambiguous: 	01D4    LATIN SMALL LETTER U WITH CARON
NEA Neutral:	01D3    LATIN CAPITAL LETTER U WITH CARON
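The case pair in the note above can be checked with a short sketch. It assumes Python's standard `unicodedata` module, which reflects a later version of the Unicode Character Database than this report and reports Neutral as `N`. The helper name `case_pair_widths` is hypothetical.

```python
import unicodedata


def case_pair_widths(lower: str) -> dict:
    """Return the East Asian Width value for a lowercase letter and its uppercase partner."""
    upper = lower.upper()
    return {
        unicodedata.name(lower): unicodedata.east_asian_width(lower),
        unicodedata.name(upper): unicodedata.east_asian_width(upper),
    }


for name, width in case_pair_widths("\u01D4").items():
    print(f"{width:2} {name}")
```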

Not East Asian (Neutral) - All characters that do not occur in legacy East Asian character sets. By extension, they also do not occur in East Asian typography. (There is no traditional Japanese way of typesetting Devanagari, for example.) Narrow and Neutral characters are treated the same under the recommendations below, so their distinction is a matter of convenience.

Figure 1 (informative): Venn diagram showing the set relations for five of the six categories.

Relation to "full-width" and "half-width"

When converting a DBCS mixed-width encoding to and from Unicode, the full-width characters in such a mixed-width encoding are mapped to the full-width compatibility characters in the FFxx block, whereas the corresponding half-width characters are mapped to ordinary Unicode characters (e.g. ASCII in U+0021..U+007E, plus a few other scattered characters).

In the context of interoperability with DBCS character encodings, that restricted set of Unicode characters in the General Scripts area can be construed as half-width, rather than full-width. (This applies only to the restricted set of characters which can be paired with the full-width compatibility characters.)

In the context of interoperability with DBCS character encodings, all other Unicode characters which are not explicitly marked as half-width can be construed as full-width.

In any other context, Unicode characters not explicitly marked as being either full-width or half-width compatibility forms should be construed as unmarked as to half-width versus full-width status.

Seen in this light, the "half-width" and "full-width" properties are not unitary character properties in the same sense as "space" or "combining" or "alphabetic". They are, instead, relational properties of a pair of characters, one of which is explicitly encoded as a half-width or full-width form for compatibility in mapping to DBCS mixed-width character encodings.
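The pairing described above can be sketched in code. This is a minimal sketch, assuming the fixed offset between U+0021..U+007E and the full-width forms in U+FF01..U+FF5E, and assuming Python's standard `unicodedata` module, whose compatibility decompositions carry `<wide>` and `<narrow>` tags. The names `to_fullwidth`, `to_unmarked`, and `width_partner` are hypothetical.

```python
import unicodedata

# U+FF01 pairs with U+0021, U+FF02 with U+0022, and so on up to U+FF5E / U+007E.
FULLWIDTH_OFFSET = 0xFF01 - 0x0021


def to_fullwidth(text: str) -> str:
    """Map ASCII U+0021..U+007E to the full-width compatibility forms."""
    return "".join(
        chr(ord(c) + FULLWIDTH_OFFSET) if 0x21 <= ord(c) <= 0x7E else c
        for c in text
    )


def to_unmarked(text: str) -> str:
    """Map full-width forms U+FF01..U+FF5E back to their ASCII partners."""
    return "".join(
        chr(ord(c) - FULLWIDTH_OFFSET) if 0xFF01 <= ord(c) <= 0xFF5E else c
        for c in text
    )


def width_partner(ch: str):
    """Return the unmarked partner of an explicitly marked form, or None."""
    decomposition = unicodedata.decomposition(ch)
    if decomposition.startswith(("<wide>", "<narrow>")):
        _tag, *codes = decomposition.split()
        return "".join(chr(int(code, 16)) for code in codes)
    return None


print(to_fullwidth("A1!"))            # full-width A, 1, and exclamation mark
print(width_partner("\uFF21"))        # A
print(width_partner("A"))             # None
```

`width_partner` returns `None` for an unmarked character such as "A", which reflects the relational nature of the property: only the explicitly encoded compatibility form records the pairing.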

What is "full-width" by default today could in theory become "half-width" tomorrow by the introduction of another character on the SBCS part of a mixed-width code page somewhere, requiring the introduction of another full-width compatibility character to complete the mapping. Since the single-byte part of mixed-width character sets is limited, there are not going to be many candidates, and neither UTC nor WG2 has any intention to add additional compatibility characters for this purpose.

Conformance

East Asian Width is an informative character property.

Recommendation (informative)

When interchanging data

Wide characters always map to full-width characters in the mixed-width set
Wide characters never map to non-East Asian legacy character encodings
Narrow (and neutral) characters always map to half-width characters in the mixed-width set
Half-width characters always map to half-width characters in the mixed-width set
Ambiguous characters always map to full-width characters in East Asian legacy character encodings
Ambiguous characters always map to regular (narrow) characters in non-East Asian legacy character encodings
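The interchange recommendations above can be observed with real codecs. The following minimal sketch assumes Python's built-in `shift_jis` codec as an example of an East Asian mixed-width encoding and `latin-1` as an example of a non-East Asian legacy encoding. The helper name `encoded_length` is hypothetical, and the property values come from Python's `unicodedata` module, which reflects a later version of the Unicode Character Database.

```python
import unicodedata


def encoded_length(ch: str, encoding: str):
    """Return the byte length of `ch` in `encoding`, or None if it cannot be mapped."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


samples = [
    "\u6F22",  # CJK ideograph (Wide)
    "A",       # ASCII letter (Narrow)
    "\uFF71",  # HALFWIDTH KATAKANA LETTER A (Half-width)
    "\u00B1",  # PLUS-MINUS SIGN (Ambiguous)
]
for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.east_asian_width(ch),
        "shift_jis:", encoded_length(ch, "shift_jis"),
        "latin-1:", encoded_length(ch, "latin-1"),
    )
```

In this sketch a length of 2 corresponds to a full-width mapping, a length of 1 to a half-width or regular mapping, and `None` to a character that has no mapping in the target set.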

When processing or displaying data

Wide characters behave like ideographs in important ways. In fixed-pitch fonts, they take up one Em of space.
Half-width characters behave like ideographs in some ways. In fixed-pitch fonts, they take up 1/2 Em of space.
Narrow characters behave like Western characters in important ways. In fixed-pitch East Asian fonts, they take up 1/2 Em of space.
Ambiguous characters behave like wide or narrow characters depending on context (language tag, associated font, source of data, or explicit markup can all provide the context).
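For fixed-pitch display, the rules above amount to counting half-Em cells. The following is a minimal sketch, assuming Python's standard `unicodedata` module (a later data version, with `F` and `H` for the explicitly marked forms). The function name `display_columns` and the flag `ambiguous_is_wide` are hypothetical, and the sketch deliberately ignores combining marks and control characters, which a real implementation would have to treat separately.

```python
import unicodedata


def display_columns(text: str, ambiguous_is_wide: bool = False) -> int:
    """Count half-Em cells needed to show `text` in a fixed-pitch East Asian font."""
    columns = 0
    for ch in text:
        prop = unicodedata.east_asian_width(ch)
        if prop in ("W", "F") or (prop == "A" and ambiguous_is_wide):
            columns += 2  # one Em
        else:
            columns += 1  # 1/2 Em
    return columns


print(display_columns("AB\u6F22\u5B57"))                   # 6
print(display_columns("\u03A9"))                           # 1
print(display_columns("\u03A9", ambiguous_is_wide=True))   # 2
```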

Classifications (informative)

The classifications presented here are based on the most widely used mixed-width legacy character sets in use in East Asia as of this writing. In particular, the assignment of the neutral or ambiguous categories depends on the contents of these character sets. For example, an implementation that knows a priori that it only needs to interchange data with the Japanese Shift-JIS character set, but not with other East Asian character sets, could reduce the number of characters in the ambiguous classification to those actually encoded in Shift-JIS. Alternatively, such a reduction could be done implicitly at runtime in the context of interoperating with Shift-JIS fonts or data sources. Conversely, if additional character sets are created and widely adopted for legacy purposes, more characters would need to be classified as ambiguous.
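The reduction described above can be sketched by testing whether an Ambiguous character is actually present in the one legacy set an implementation cares about. This is a minimal sketch, assuming Python's `unicodedata` module for the default classification and Python's built-in `shift_jis` and `gb2312` codecs as approximations of the legacy repertoires. The function name `is_ambiguous_for` is hypothetical.

```python
import unicodedata


def is_ambiguous_for(ch: str, legacy_encoding: str) -> bool:
    """True if `ch` is Ambiguous by default and encoded in `legacy_encoding`."""
    if unicodedata.east_asian_width(ch) != "A":
        return False
    try:
        ch.encode(legacy_encoding)
    except UnicodeEncodeError:
        # Not in this legacy set, so it can be treated as narrow here.
        return False
    return True


# CYRILLIC CAPITAL LETTER YA is part of the Shift-JIS repertoire.
print(is_ambiguous_for("\u042F", "shift_jis"))  # True
# LATIN SMALL LETTER U WITH CARON is not in Shift-JIS but is in GB 2312.
print(is_ambiguous_for("\u01D4", "shift_jis"))  # False
print(is_ambiguous_for("\u01D4", "gb2312"))     # True
```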

All characters not listed here are by default classified as non-East Asian neutral.

East Asian Width classification of characters of the Unicode Standard, Version 3.0

This information is available as the file EastAsianWidth.txt in the Unicode Character Database.
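A data file of this kind can be loaded with a few lines of code. The following minimal sketch assumes the semicolon-separated layout used by later releases of EastAsianWidth.txt (a code point or `first..last` range, a semicolon, the property value, and an optional `#` comment), which may differ from the Version 3.0 file. The inline sample is an illustrative excerpt using the later abbreviations `F` and `H`, not a copy of the file, and the names `parse_east_asian_width` and `lookup` are hypothetical. The default of Neutral for unlisted characters follows the statement above.

```python
SAMPLE = """\
# Illustrative excerpt in the style of later UCD releases
0020;Na
00A1;A
4E00..9FFF;W
FF01..FF60;F
FF61..FFBE;H
"""


def parse_east_asian_width(lines):
    """Parse lines into a list of (first, last, property) tuples."""
    ranges = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_field, prop = (field.strip() for field in line.split(";")[:2])
        first, _, last = code_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        ranges.append((start, end, prop))
    return ranges


def lookup(ranges, ch, default="N"):
    """Return the property for `ch`; unlisted characters default to Neutral."""
    code_point = ord(ch)
    for start, end, prop in ranges:
        if start <= code_point <= end:
            return prop
    return default


ranges = parse_east_asian_width(SAMPLE.splitlines())
print(lookup(ranges, "\u6F22"))  # W
print(lookup(ranges, "\u0905"))  # N
```

A linear scan is enough for a sketch; a production table would more likely use a sorted array with binary search.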

Background

What ISO/IEC 10646:1993 says

ISO 10646 is silent on the terms "half-width" and "full-width" except to say that the characters so named are provided for compatibility.

What the Unicode Standard, Version 2.1 says

The Unicode Standard states (p. 6-130):

In the context of conversion to and from such mixed-width encodings, all characters in the General Scripts area [i.e. 0000-1FFF] should be construed as half-width (hankaku) characters.

This sentence, as it stands, is misleading in that it implies that everything in the range U+0000..U+1FFF is half-width.

All characters in the CJK Phonetics and Symbols area [i.e. 3000-33FF] and the Unified CJK Ideograph area [i.e. 4E00-9FFF], along with the characters in the CJK Compatibility Ideographs [i.e. F900-FAFF], CJK Compatibility Forms [i.e. FE30-FE4F], and Small Form Variants blocks [i.e. FE50-FE6F], should be construed as full-width (zenkaku) characters.

This is correct, with one exception: U+303F IDEOGRAPHIC HALF FILL SPACE.

Other Compatibility Area [i.e. F900-FFFF] characters outside of the current block should be construed as half-width characters. The characters of the Symbols Area are neutral regarding their width semantics.

Like the first, this sentence is misleading in that it fails to account for the ambiguous width property of many symbols.

It should be clearly noted that statements made in Chapter 6 (Character Block Descriptions) of the Unicode Standard do not have normative status. Chapters 3, 4, and 7 (Charts) have normative status. The rest of the book, including Chapter 6, is provided to give as much information as possible to help people understand and implement the characters correctly. But it is dangerous to make legalistic arguments based on the text of Chapter 6, since there is rather large leeway for the editors of the Unicode Standard to modify and augment such explanatory text as new issues arise or old ones require more clarification.

The intent of the existing paragraph is not to create a property but to account for the fact that there are full-width forms encoded in the ranges U+FF01..U+FF5E and U+FFE0..U+FFE6.

What the Unicode Standard, Version 3.0 says

Unicode 3.0 formally introduces East Asian Width as an informative character property. The discussion of this issue has moved to section 10.3 Katakana, and defers to this Technical Report for details. The data file with the classifications is now one of the contributory files of the Unicode Character Database. See http://www.unicode.org/unicode/standard/versions for more information on versions of this standard, and UnicodeCharacterDatabase.html for more information on the Unicode Character Database.

Acknowledgments

Michel Suignard provided extensive input into the analysis and source material for the detail assignments of these properties.

Changes from previous revisions:

First draft technical report version. Extensive formatting to fit the template. Split Wide into Wide and FullWidth to capture the characters with explicit FullWidth characteristics.

First Technical Report Version. Remove list of 'unassigned' characters. Add some informative text and make other editorial changes requested at UTC meeting #78.

Second Technical Report Version. Added UTF-8 and names annotations to the table. Minor wording changes. HTML fixes.

Third Technical Report Version: Added the classifications for new characters in Unicode 3.0. Moved the classifications into EastAsianWidth.txt in the Unicode Character Database. Minor wording changes.

Copyright

Copyright © 1998-1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

Unicode Home Page: http://www.unicode.org

Unicode Technical Reports: http://www.unicode.org/unicode/reports/techreports.html
