An abstract characterization of an AbstractCharacter class

From: John Cowan (john_cowan@hotmail.com)
Date: Tue Jun 17 1997 - 11:36:26 EDT


Due to network problems, I can read mail at cowan@ccil.org, but
can't post/reply/send from there. Please direct all replies to
cowan@ccil.org, not the HotMail address. Thanks.

The following document is a rather long explanation of the Java class
AbstractCharacter, which I am creating. I hope that Unicoders
who are programmers, and even those who aren't, will send me
whatever criticisms they may have in the interest of making this
class better. (I have no connection with Sun or Javasoft, so
there is nothing "official" about AbstractCharacter, but I will
be providing it free of charge to whoever wants it.)

For those not familiar with Java, I should mention that "char"
refers to a 16-bit unsigned value (suitable for holding UCS-2 or
Unicode or UTF-16 character codes), "int" refers to a 32-bit
signed value (suitable for holding a UCS-4 character code), and
"String" refers to an immutable sequence of chars.

==start here==

The purpose of AbstractCharacter is to encapsulate all Unicode
issues around surrogate pairs, non-spacing marks, and conjoining
characters, so that programmers can process streams of rationalized
objects rather than raw 16-bit values. Two AbstractCharacters
are considered equal if they represent "the same thing". Thus
an AbstractCharacter built from a LATIN CAPITAL LETTER A (U+0041)
followed by a COMBINING CIRCUMFLEX (U+0302), and one built from a
LATIN CAPITAL LETTER A WITH CIRCUMFLEX (U+00C2) would be equal.
AbstractCharacter objects are immutable: their values cannot be
changed once they are created.
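
A minimal sketch of this notion of equality, assuming the JDK class
java.text.Normalizer as a stand-in for AbstractCharacter's own
decomposition tables. The names EqualitySketch and sameThing are
hypothetical and exist only for illustration:

```java
import java.text.Normalizer;

// Hypothetical stand-in, not the AbstractCharacter class itself:
// two char sequences count as "the same thing" when their canonical
// decompositions are identical.
class EqualitySketch {
    static boolean sameThing(String a, String b) {
        String da = Normalizer.normalize(a, Normalizer.Form.NFD);
        String db = Normalizer.normalize(b, Normalizer.Form.NFD);
        return da.equals(db);
    }

    public static void main(String[] args) {
        // U+0041 followed by U+0302, versus the precomposed U+00C2
        System.out.println(sameThing("A\u0302", "\u00C2")); // prints true
        // A bare U+0041 is a different thing
        System.out.println(sameThing("A", "\u00C2"));       // prints false
    }
}
```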

For convenience, AbstractCharacters can be created directly from
single chars or from Strings: in the latter case, only the first
AbstractCharacter in the String is used. More typically, a stream
of AbstractCharacters is created from a stream of chars, which
may come from a file, a String, an array of chars, or any
Java Reader class. AbstractCharacter objects may then
be retrieved in sequence from the stream; each one will contain
one or more chars either directly from the char stream or derived
from the chars in the stream by decomposition.

Streams of AbstractCharacters have two main properties which
determine how the chars are packaged up into AbstractCharacters.
The "mode" property, which is retrieved and specified with the
"getMode" and "setMode" methods, may have one of three values:
REVERSIBLE, CANONICAL, or COMPATIBILITY. The default is
COMPATIBILITY. The "syllableMode" property, retrieved and specified
with "getSyllableMode" and "setSyllableMode", may have one of two
values: LETTER or SYLLABLE. The default is LETTER.

The "mode" property controls what decompositions are applied to
chars in the char stream before grouping chars into AbstractCharacter
objects. REVERSIBLE mode means that only reversible canonical
decompositions are done; these are all the canonical decompositions
specified in the Unicode Standard except those which produce more
than one spacing character. In addition, the algorithmic
decompositions of Hangul syllables into Hangul jamo are considered
reversible canonical. CANONICAL mode means that all canonical
decompositions are applied. COMPATIBILITY mode means that all
decompositions are applied. In any case, all surrogate pairs are
recognized and processed as single entities.
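
To make the difference between the modes concrete, here is a rough
sketch that maps CANONICAL and COMPATIBILITY onto the JDK's NFD and
NFKD normalization forms. That mapping is my assumption for
illustration only; REVERSIBLE has no direct JDK counterpart and is
left out. The names ModeSketch, Mode, and decompose are hypothetical:

```java
import java.text.Normalizer;

class ModeSketch {
    // Hypothetical enum mirroring two of the three mode values.
    enum Mode { CANONICAL, COMPATIBILITY }

    // Assumption: NFD approximates CANONICAL, NFKD approximates COMPATIBILITY.
    static String decompose(String chars, Mode mode) {
        Normalizer.Form form = (mode == Mode.CANONICAL)
                ? Normalizer.Form.NFD
                : Normalizer.Form.NFKD;
        return Normalizer.normalize(chars, form);
    }

    public static void main(String[] args) {
        // LATIN SMALL LIGATURE FI has only a compatibility decomposition.
        String ligature = "\uFB01";
        System.out.println(decompose(ligature, Mode.CANONICAL).equals(ligature)); // prints true
        System.out.println(decompose(ligature, Mode.COMPATIBILITY));              // prints fi
    }
}
```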

The "syllableMode" property controls which groups of chars are placed
into a single AbstractCharacter. In LETTER mode, an AbstractCharacter
contains a single base character plus all applicable non-spacing
marks that follow it. (If non-spacing marks appear in the char
stream before any base characters, they are treated as if applied
to ZWNBSP (U+FEFF).) In SYLLABLE mode, an AbstractCharacter may
contain multiple base characters if they are conjoining jamo;
this type of AbstractCharacter is called "syllabic", and the
"isSyllabic" method returns true for such an AbstractCharacter.
Syllabic AbstractCharacters contain all the jamo
corresponding to a single Hangul syllable. Even in SYLLABLE mode,
it is possible to receive AbstractCharacters corresponding to
individual jamo if they do not form proper syllables.
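
The LETTER-mode grouping rule can be sketched as follows. This is
only an approximation under stated assumptions: it skips the
decomposition step, it tests for non-spacing marks with the JDK's
Character.getType rather than the Unicode 2.0 tables, and the names
LetterGroupingSketch and group are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

class LetterGroupingSketch {
    static final int ZWNBSP = 0xFEFF;

    static boolean isNonSpacing(int codePoint) {
        return Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    // Splits chars into groups of one base character plus the
    // non-spacing marks that follow it.
    static List<String> group(String chars) {
        List<String> groups = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < chars.length(); ) {
            int cp = chars.codePointAt(i);      // handles surrogate pairs
            i += Character.charCount(cp);
            if (!isNonSpacing(cp)) {
                // A new base character closes the previous group.
                if (current.length() > 0) {
                    groups.add(current.toString());
                    current.setLength(0);
                }
            } else if (current.length() == 0) {
                // Marks with no preceding base are applied to U+FEFF.
                current.appendCodePoint(ZWNBSP);
            }
            current.appendCodePoint(cp);
        }
        if (current.length() > 0) {
            groups.add(current.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(group("a\u0301b").size()); // prints 2
        System.out.println(group("\u0301a").size());  // prints 2
    }
}
```

SYLLABLE mode would additionally keep a run of conjoining jamo in one
group; that part is not shown here.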

Once an AbstractCharacter has been created, several values may
be extracted from it. The "value" method returns a fully
decomposed String corresponding to the chars grouped into an
AbstractCharacter. All non-spacing marks have been reordered
in accordance with the Canonical Reordering Algorithm. Equality
of AbstractCharacters is equivalent to equality of their values.
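
The effect of canonical reordering on equality can be seen in this
small sketch, which again uses java.text.Normalizer as an assumed
stand-in for the "value" method (ValueSketch and value are
hypothetical names):

```java
import java.text.Normalizer;

class ValueSketch {
    // Assumption: NFD approximates the fully decomposed, reordered value.
    static String value(String chars) {
        return Normalizer.normalize(chars, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // COMBINING ACUTE ACCENT (U+0301) has combining class 230,
        // COMBINING DOT BELOW (U+0323) has combining class 220,
        // so reordering puts the dot below first.
        String acuteThenDot = "a\u0301\u0323";
        String dotThenAcute = "a\u0323\u0301";
        System.out.println(value(acuteThenDot).equals(dotThenAcute));        // prints true
        System.out.println(value(acuteThenDot).equals(value(dotThenAcute))); // prints true
    }
}
```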

The "contents" method returns an array of ints corresponding to
the UCS-4 codes of the characters grouped into the AbstractCharacter.
It returns the same information as the "value" method, but in
a different format; all surrogate pairs will have been translated
to their UCS-4 equivalents.
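
As an illustration of the relationship between the two formats, the
following sketch converts a String of chars into an array of ints,
folding each surrogate pair into one code. It relies on
String.codePoints() from later JDKs; ContentsSketch and contents are
hypothetical names:

```java
class ContentsSketch {
    // Converts 16-bit chars to 32-bit codes, one int per character.
    static int[] contents(String value) {
        return value.codePoints().toArray();
    }

    public static void main(String[] args) {
        // U+0041 followed by the surrogate pair D800 DC00 (U+10000)
        int[] codes = contents("A\uD800\uDC00");
        System.out.println(codes.length);                 // prints 2
        System.out.println(Integer.toHexString(codes[1])); // prints 10000
    }
}
```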

The "base" method returns an int corresponding to the UCS-4
code of the base character in the AbstractCharacter. In the
case of syllabic AbstractCharacters, the base is the
UCS-4 code of the corresponding character from the Hangul Syllables
block, or -1 if there is no such character (indicating an archaic
Hangul syllable). The same int returned by "base" is also returned
by "hashCode", so that AbstractCharacters are hashed according to
their bases.
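
For syllabic AbstractCharacters, the base could be computed with the
standard arithmetic mapping between conjoining jamo and the Hangul
Syllables block. The sketch below is one way to do that, not
necessarily how AbstractCharacter does it; HangulBaseSketch and its
method are hypothetical names:

```java
class HangulBaseSketch {
    static final int S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7;
    static final int L_COUNT = 19, V_COUNT = 21, T_COUNT = 28;

    // Takes a leading consonant, a vowel, and an optional trailing
    // consonant; returns the precomposed syllable code, or -1 if the
    // jamo have no counterpart in the Hangul Syllables block.
    static int base(int[] jamo) {
        if (jamo.length < 2 || jamo.length > 3) {
            return -1;
        }
        int l = jamo[0] - L_BASE;
        int v = jamo[1] - V_BASE;
        int t = 0;
        if (jamo.length == 3) {
            t = jamo[2] - T_BASE;
            if (t <= 0) {
                return -1; // not a trailing consonant
            }
        }
        if (l < 0 || l >= L_COUNT || v < 0 || v >= V_COUNT || t >= T_COUNT) {
            return -1;
        }
        return S_BASE + (l * V_COUNT + v) * T_COUNT + t;
    }

    public static void main(String[] args) {
        // HIEUH + A + NIEUN gives U+D55C
        System.out.println(Integer.toHexString(base(new int[] {0x1112, 0x1161, 0x11AB}))); // prints d55c
        // PANSIOS (U+1140) is an archaic leading consonant
        System.out.println(base(new int[] {0x1140, 0x1161})); // prints -1
    }
}
```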

The "composedValue" and "composedContents" methods are similar
to the "value" and "contents" methods, except that all reversible
canonical decompositions are first recomposed. It does not matter
whether the original chars were decomposable or not; these methods
attempt to apply every possible reversible decomposition (in
reverse) until the String or array of ints, respectively, is as
small as possible.
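
A rough approximation of these two methods, assuming the JDK's NFC
normalization form as a stand-in. NFC applies its own composition
exclusions, so it is not guaranteed to match the "reversible" set
described above. ComposedSketch and its method bodies are
hypothetical:

```java
import java.text.Normalizer;

class ComposedSketch {
    // Assumption: NFC approximates recomposing all reversible decompositions.
    static String composedValue(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFC);
    }

    static int[] composedContents(String value) {
        return composedValue(value).codePoints().toArray();
    }

    public static void main(String[] args) {
        System.out.println(composedValue("A\u0302").equals("\u00C2"));      // prints true
        // Conjoining jamo recompose into a single Hangul syllable.
        System.out.println(composedContents("\u1112\u1161\u11AB").length); // prints 1
    }
}
```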

AbstractCharacter is aware of the decompositions and canonical
ordering priorities of all the characters in the Unicode Standard
2.0, in accordance with version 2.0.14 of the character properties
table. It assumes that all undefined and private-use chars
are spacing characters with no decomposition or jamo-like conjoining
behavior. However, the methods "setOrderingPriority",
"setDecomposition", and "setConjoiningType" allow the internal tables
to be augmented for private-use chars only. These methods are
global in effect, altering the behavior of all AbstractCharacter
streams. Calling them has no effect on AbstractCharacter
objects already read from streams.
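
The restriction to private-use chars might look like the sketch
below. The parameter list, the lookup method decompositionOf, and the
class name PrivateUseTableSketch are all my assumptions for
illustration; only the method name "setDecomposition" comes from the
description above:

```java
import java.util.HashMap;
import java.util.Map;

class PrivateUseTableSketch {
    // Global table, shared by everything that consults it.
    private static final Map<Integer, String> decompositions = new HashMap<>();

    // Hypothetical signature: registers a decomposition for one
    // private-use code and rejects every other kind of character.
    static void setDecomposition(int codePoint, String decomposition) {
        if (Character.getType(codePoint) != Character.PRIVATE_USE) {
            throw new IllegalArgumentException("not a private-use character");
        }
        decompositions.put(codePoint, decomposition);
    }

    // Hypothetical lookup: characters with no entry stand for themselves.
    static String decompositionOf(int codePoint) {
        String d = decompositions.get(codePoint);
        return (d != null) ? d : new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        setDecomposition(0xE000, "a\u0301");
        System.out.println(decompositionOf(0xE000).length()); // prints 2
        System.out.println(decompositionOf(0xE001).length()); // prints 1
    }
}
```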

Finally, the constant AbstractCharacter ZIGAMORPH is provided.
It has a base of 0xFFFF and a value of "\uFFFF", and represents the
non-character U+FFFF.

== end here ==

Comments?



