Fullwidth and Halfwidth

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Sep 19 1997 - 13:57:03 EDT


In the context of the misdirected discussion about the W3C
DOM Core Level 1 Draft which has been showing up on this list,
John Cowan made a number of observations regarding the status
of halfwidth and fullwidth characters, as documented in the
Unicode Standard.

I will try to clarify the intent of the discussion of halfwidth
and fullwidth forms on page 6-130 of the standard.

First, though, it should clearly be noted that statements
made in the Unicode Standard in Chapter 6 (Character Block
Descriptions) do not have normative status. Chapters 3, 4,
and 7 (Charts) have normative status. The rest of the book,
including Chapter 6, is provided basically to give as much
information as possible to help people understand and
implement the characters correctly. But it is dangerous to
make legalistic arguments based on the text of Chapter 6,
since there is rather large leeway for the editors of the
Unicode Standard to modify and augment such explanatory
text as new issues arise or old ones require more clarification.

>
> ISUNG.US.ORACLE.COM wrote:
>
> > John Cowan wrote:
> >
> > >The status of U+1100-11FF and U+AC00-D7A3 is doubtful. Officially,
> > >the first block (Hangul Jamo) is halfwidth and the second block
> > >(Hangul Syllables) is neither, but they both look fullwidth to me.
> >
> > Both Hangul Jamo and syllables at Row 11 and Row AC ~ D7 are all
> > fullwidth. There are halfwidth Hangul Jamo at Row FF.
>
> Yes, that is what I think too, as it seems reasonable. Unfortunately,
> it contradicts the letter of the Unicode Standard (p. 6-130):
>
> # In the context of conversion to and from such mixed-width encodings,
> # all characters in the General Scripts area [i.e. 0000-1FFF]
> # should be construed as halfwidth (*hankaku*) characters.

In my opinion, this sentence, as it stands, is misleading in that
it implies that everything in the range U+0000..U+1FFF is halfwidth--
an implication that John has clearly drawn.

The intent, however, is different. The issue basically arises because
there are fullwidth forms encoded in the ranges U+FF01..U+FF5E and
U+FFE0..U+FFE6. When converting a DBCS mixed-width encoding to and
from Unicode, the fullwidth characters in such a mixed-width encoding
are mapped to the fullwidth compatibility characters in the FFxx
block, whereas the corresponding halfwidth characters are mapped to
ordinary Unicode characters (e.g. ASCII in U+0021..U+007E, plus a
few other scattered characters).
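
As a rough illustration of the pairing described above, here is a minimal
Python sketch. The function name ordinary_counterpart is hypothetical, and
the table for U+FFE0..U+FFE6 is taken from the compatibility mappings in the
Unicode character database rather than from this message, so treat it as an
assumption to check against the charts.

```python
# Sketch only: pair each fullwidth compatibility form with the ordinary
# Unicode character that a DBCS converter would treat as its halfwidth partner.

# U+FF01..U+FF5E line up with ASCII U+0021..U+007E at a fixed offset.
FULLWIDTH_ASCII_OFFSET = 0xFF01 - 0x0021  # 0xFEE0

# U+FFE0..U+FFE6 pair with scattered characters (assumed from the
# compatibility mappings in the Unicode character database).
FULLWIDTH_SYMBOLS = {
    0xFFE0: 0x00A2,  # CENT SIGN
    0xFFE1: 0x00A3,  # POUND SIGN
    0xFFE2: 0x00AC,  # NOT SIGN
    0xFFE3: 0x00AF,  # MACRON
    0xFFE4: 0x00A6,  # BROKEN BAR
    0xFFE5: 0x00A5,  # YEN SIGN
    0xFFE6: 0x20A9,  # WON SIGN
}


def ordinary_counterpart(ch):
    """Return the ordinary character paired with a fullwidth form, or None."""
    cp = ord(ch)
    if 0xFF01 <= cp <= 0xFF5E:
        return chr(cp - FULLWIDTH_ASCII_OFFSET)
    if cp in FULLWIDTH_SYMBOLS:
        return chr(FULLWIDTH_SYMBOLS[cp])
    return None  # not a fullwidth compatibility form


if __name__ == "__main__":
    for ch in "\uff21\uff01\uffe6A":
        print(f"U+{ord(ch):04X} ->", ordinary_counterpart(ch))
```

A real converter would carry one such table per source code page; the point
here is only that the pairing is a finite, explicit list.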

In the context of interoperability with DBCS character encodings,
that restricted set of Unicode characters in the
General Scripts area can be construed as halfwidth, rather than
fullwidth. (This applies only to the restricted set of characters
which can be paired with the fullwidth compatibility characters.)

In the context of interoperability with DBCS character encodings,
all other Unicode characters which are not explicitly marked as
halfwidth can be construed as fullwidth.

In any other context, Unicode characters not explicitly marked as
being either fullwidth or halfwidth compatibility forms should
be construed as unmarked as to halfwidth versus fullwidth status.
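
The three cases above can be sketched in Python roughly as follows. This is
an illustration under stated assumptions, not a recommended API: the names
compat_kind and construe_width are hypothetical, and the sketch relies on
the <wide> and <narrow> compatibility tags exposed by Python's unicodedata
module, which reflects whatever Unicode version ships with the interpreter
rather than the edition under discussion.

```python
import unicodedata

# The Halfwidth and Fullwidth Forms block (the FFxx compatibility forms).
FORMS_BLOCK = range(0xFF00, 0xFFF0)


def compat_kind(ch):
    """Return 'fullwidth' or 'halfwidth' for an explicitly marked form, else None."""
    if ord(ch) not in FORMS_BLOCK:
        return None
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<wide>"):
        return "fullwidth"
    if decomp.startswith("<narrow>"):
        return "halfwidth"
    return None


# Ordinary characters that have a fullwidth compatibility partner in FFxx.
PAIRED_WITH_FULLWIDTH = {
    chr(int(unicodedata.decomposition(chr(cp)).split()[1], 16))
    for cp in FORMS_BLOCK
    if compat_kind(chr(cp)) == "fullwidth"
}


def construe_width(ch, dbcs_interop):
    """Construe the width of ch; None means unmarked as to width."""
    marked = compat_kind(ch)
    if marked is not None:
        return marked  # explicitly encoded compatibility form
    if not dbcs_interop:
        return None  # outside DBCS interoperability: no width status
    # DBCS context: partners of fullwidth forms are halfwidth, the rest fullwidth.
    return "halfwidth" if ch in PAIRED_WITH_FULLWIDTH else "fullwidth"
```

Note that the second argument is doing the real work: the same character
gets a different answer, or no answer, depending on whether DBCS
interoperability is in play.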

Please note that "halfwidth" and "fullwidth" are not unitary
character properties in the same sense as "space" or "combining"
or "alphabetic". They are, instead, relational properties of
a pair of characters, one of which is explicitly encoded as
a halfwidth or fullwidth form for compatibility in mapping to
DBCS mixed-width character encodings. I consider it a mistake
to promulgate APIs such as isFullwidth or isHalfwidth defined
on Unicode characters; what is "fullwidth" by default today
could become "halfwidth" tomorrow by the introduction of another
character on the SBCS part of a mixed-width code page somewhere,
requiring the introduction of another fullwidth compatibility
character to complete the mapping. Hopefully, with the existence
of Unicode, we won't see more extensions of the mixed-width
character sets we have to map to, but in any case, treating
relational properties that are contingent on mixed-width
character set encodings the same as universal character
properties is mixing apples and oranges.
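
To make the relational point concrete, here is a small Python sketch. The
function dbcs_width and both pairing tables are hypothetical; in
particular, the private-use code point U+E000 merely stands in for a
fullwidth compatibility character that does not exist, and the Cyrillic
example is invented for illustration.

```python
def dbcs_width(ch, pairs):
    """Construe width in a DBCS-interop context.

    pairs maps each fullwidth compatibility form to its ordinary partner.
    """
    if ch in pairs:
        return "fullwidth"   # explicitly encoded fullwidth form
    if ch in pairs.values():
        return "halfwidth"   # ordinary partner of a fullwidth form
    return "fullwidth"       # default for everything else


# Today: only FULLWIDTH LATIN CAPITAL LETTER A is paired (toy table).
pairs_today = {"\uff21": "A"}

# Tomorrow (hypothetical): some code page puts CYRILLIC CAPITAL LETTER ZHE
# in its SBCS part, forcing a new fullwidth compatibility character.
# U+E000 is a private-use stand-in for that nonexistent character.
pairs_tomorrow = dict(pairs_today)
pairs_tomorrow["\ue000"] = "\u0416"

if __name__ == "__main__":
    zhe = "\u0416"
    print(dbcs_width(zhe, pairs_today))     # default status today
    print(dbcs_width(zhe, pairs_tomorrow))  # status after the new pairing
```

The character itself has not changed between the two calls; only the
pairing table has, which is why a fixed isFullwidth answer per character
would be unstable.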

>
> That purports to include the combining jamo at 1100-11FF. The rest of
> the paragraph says:
>
> # All characters in the CJK Phonetics and Symbols area [i.e. 3000-33FF]
> # and the Unified CJK Ideograph area [i.e. 4E00-9FFF], along with
> # the characters in the CJK Compatibility Ideographs [i.e. F900-FAFF],
> # CJK Compatibility Forms [i.e. FE30-FE4F], and Small Form Variants
> # blocks [i.e. FE50-FE6F], should be construed as fullwidth (*zenkaku*)
> # characters. Other Compatibility Area [i.e. F900-FFFF] characters
> # outside of the current block should be construed as halfwidth
> # characters. The characters of the Symbols Area are neutral regarding
> # their width semantics.

This is clearly a case of an attempt to add explanatory text which
ended up overspecifying and thereby missed the mark. This text should
be changed in the next edition of the standard to avoid such
misunderstandings.

>
> Note that the Standard is silent on the halfwidth/fullwidth status of the
> Hangul Syllables area.
>
> As far as I can tell, ISO 10646 is silent on the terms "halfwidth" and
> "fullwidth" except to say that the characters so named are provided
> for compatibility.

That is correct. ISO/IEC 10646 does not consider character properties
(other than combining and mirroring) to be part of its charter.
The developers of the Unicode Standard, on the other hand, consider
character properties to be an integral part of the full specification
of the universal character encoding.

--Ken Whistler

>
> --
> John Cowan http://www.ccil.org/~cowan cowan@ccil.org
> e'osai ko sarji la lojban
>


