An Absurdly Brief Introduction to Unicode (was Re: Perception ...)

From: Tom Lord (lord@emf.net)
Date: Thu Feb 22 2001 - 00:39:58 EST


We've seen several posts about the perception that Unicode is a
16-bit character set encoding. Among those, we've heard anecdotes
about the problems people have introducing newcomers to Unicode.

Here is a chapter of a reference manual I've been working on.
The original manual can be found at http://www.regexps.com, along
with some useful Unicode software (a fast regular expression matcher,
a database for C, and some handy data structures).

The manual as a whole is covered by the GNU Free Documentation License
(http://www.gnu.org), but the plain-text version in this message
may be reproduced unconditionally.

Thomas Lord
regexps.com

                Absurdly Brief Introduction to Unicode

        copyright 2001, Thomas Lord, regexps.com, Pittsburgh PA
        Permission is granted to reproduce this text verbatim, without
        further restrictions, except that this copyright notice and
        permission statement must be included. Permission is
        granted to reproduce this text with modifications, provided
        that this copyright notice and permission statement are
        included, and the copy is clearly marked as "modified from
        the original".

This chapter is a very succinct introduction to the Unicode character
set. It may be useful when trying to read this manual, but it is not
intended to be a thorough introduction. One place to learn more about
Unicode is the web site of the Unicode Consortium:
http://www.unicode.org. The current definition of Unicode is published
as The Unicode Standard Version 3.0 by the Unicode Consortium.

                              Characters

Unicode defines a set of _abstract_characters_. Roughly speaking,
abstract characters represent indivisible marks that people use in
writing systems to convey information. In western alphabets, for
example, "latin small letter A" is the name of an abstract
character. That name doesn't refer to "a" in a particular font, but
rather to the idea of small A in general.

Unicode includes a number of abstract characters which are formatting
marks: they give an indication of how adjacent characters should be
rendered but do not themselves correspond to what one might ordinarily
think of as a "written character".

Unicode includes a number of abstract characters which are control
characters: they have traditional (and sometimes standard) meaning in
computing, but do not correspond to any feature of human writing.

Unicode includes a number of abstract characters which are usually
combined with other characters (such as diacritical marks and vowel
marks).

The goal of Unicode is to encode the complete set of abstract
characters used in human writing, sufficient to describe all written
text.

The situation is complicated by three factors: the necessarily large
size of a global character set; the occasionally arbitrary decisions
that must be made about what counts as an abstract character and what
does not; and the generally acknowledged desirability of supporting
bijective mappings between a variety of older character sets and
subsets of Unicode.

                             Code Points

A _code_point_ is an integer value which is assigned to an abstract
character. Each character receives a unique code point.

By convention, code points are always written in hexadecimal notation,
prefixed by the string U+. Usually, no fewer than four hexadecimal
digits are written.
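
As a small illustration of the notation, here is a sketch in Python.
The function name format_code_point is made up for this example; it
is not part of any library.

```python
def format_code_point(cp: int) -> str:
    """Write a code point in U+ notation with at least four hex digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # 04X pads to four digits but allows more when the value needs them.
    return f"U+{cp:04X}"


print(format_code_point(ord("a")))   # U+0061
print(format_code_point(0x10FFFF))   # U+10FFFF
```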

Unicode code points are in the closed range U+0000..U+10FFFF. Thus,
it requires at least 21 bits to hold a Unicode code point. Sometimes
people say that "Unicode is a 16-bit character set"; that is an
error.
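
The arithmetic behind the 21-bit figure can be checked directly. This
is only a sketch; the names MAX_CODE_POINT and bits are illustrative.

```python
# The largest Unicode code point, U+10FFFF.
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to hold the largest code point.
bits = MAX_CODE_POINT.bit_length()
print(bits)                      # 21

# The largest value a 16-bit unit can hold is 0xFFFF, which is too small.
print(MAX_CODE_POINT > 0xFFFF)   # True
```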

There are (now and for the foreseeable future) many more code points
than abstract characters. Revisions to Unicode add new characters and,
sometimes, recommend against using some old characters, but once a
code point has been "assigned", that assignment never changes.

                       Some Special Code Points

Unicode code points U+0000..U+007F are essentially the same as ASCII
code points.

Unicode code points U+0000..U+00FF are essentially the same as ISO
8859-1 code points ("Latin 1").
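
One way to see this correspondence is to decode every possible byte
with Python's "ascii" and "latin-1" codecs and compare the result to
the code point of the same numeric value. This is a sketch; the helper
name matches_unicode is invented for the example.

```python
def matches_unicode(codec: str, limit: int) -> bool:
    """True if each byte value below limit decodes to the same-numbered code point."""
    return all(bytes([b]).decode(codec) == chr(b) for b in range(limit))


print(matches_unicode("ascii", 0x80))      # True
print(matches_unicode("latin-1", 0x100))   # True
```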

Two code points represent non-characters. These are U+FFFE and
U+FFFF. Programs are free to give these values special meaning
internally.

The code point U+FEFF is assigned to the formatting character
"zero-width no-break space". This character has a special significance
when it occurs in certain serialized representations of Unicode
text. This is described in the next section.

Code points in the range U+D800..U+DFFF are called _surrogates_. They
are not assigned to abstract characters. Instead, they are used in
pairs as one way to represent a code point in the range
U+10000..U+10FFFF. This is also described in the next section.
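
The special ranges in this section can be summarized as a handful of
range tests. The sketch below assumes only the ranges named above; the
function name classify and its labels are hypothetical.

```python
def classify(cp: int) -> str:
    """Label a code point according to the special ranges in this section."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if cp in (0xFFFE, 0xFFFF):
        return "non-character"
    if cp == 0xFEFF:
        return "zero-width no-break space"
    if cp <= 0x7F:
        return "ASCII"
    if cp <= 0xFF:
        return "Latin 1"
    return "other"


for cp in (0x41, 0xE9, 0xD800, 0xFEFF, 0xFFFE, 0x4E2D):
    print(f"U+{cp:04X}", classify(cp))
```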

                            Encoding Forms

If Unicode code points occupy 21 bits of storage, how is a string of
Unicode text represented? There are two recommended alternatives
called UTF-8 and UTF-16. Collectively, systems of representing
strings are known as _encoding_forms_.

The definition of an encoding form consists of a _code_unit_ (an
unsigned integer type with a fixed number of bits, usually fewer than
21) and a rule describing a bijective mapping between code points and
sequences of code units. UTF-8 uses 8-bit code units. UTF-16 uses
16-bit code units.

In UTF-8, code points in the range U+0000..U+007F are stored in a
single code unit (one byte). Other code points are represented by a
sequence of two or more code units, each byte in the range 80..FF. The
details of these multi-byte sequences are available in countless
Unicode reference materials.
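
For readers who want to see the shape of those multi-byte sequences,
here is a minimal sketch of a UTF-8 encoder for a single code point.
It assumes the one-to-four-byte layout used for the range
U+0000..U+10FFFF; the name encode_utf8 is made up for this example,
and real programs would normally use a library routine instead.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point as a sequence of 8-bit code units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:                      # one byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # four bytes: 11110xxx then three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(encode_utf8(0x41).hex())     # 41
print(encode_utf8(0x20AC).hex())   # e282ac
```

Every byte after the first in a multi-byte sequence falls in the range
80..BF, and the first byte tells a decoder how many bytes follow.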

In UTF-16, code points in the range U+0000..U+FFFF are stored in a
single 16-bit code unit. Other code points are represented by a pair
of surrogates, each stored in one code unit. Again, the details of
multi-code-unit sequences are readily available elsewhere.
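
The surrogate-pair rule is short enough to sketch here. The function
name encode_utf16_units is hypothetical; the arithmetic (subtract
0x10000, then split the remaining 20 bits into two halves of 10 bits)
is the standard UTF-16 rule.

```python
def encode_utf16_units(cp: int) -> list[int]:
    """Return the 16-bit code units that represent one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0xFFFF:
        return [cp]                      # a single code unit
    v = cp - 0x10000                     # 20 bits remain
    high = 0xD800 | (v >> 10)            # top 10 bits: D800..DBFF
    low = 0xDC00 | (v & 0x3FF)           # bottom 10 bits: DC00..DFFF
    return [high, low]


print([f"{u:04X}" for u in encode_utf16_units(0x0041)])    # ['0041']
print([f"{u:04X}" for u in encode_utf16_units(0x1F600)])   # ['D83D', 'DE00']
```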

Not every sequence of 8-bit values is a valid UTF-8 string. Not every
sequence of 16-bit values is a valid UTF-16 string. Strings that are
not valid are called "ill-formed".
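
A simple way to test for ill-formed input is to attempt a strict
decode and see whether it fails. The sketch below relies on Python's
built-in codecs; the helper name is_well_formed is an invention for
this example.

```python
def is_well_formed(data: bytes, encoding: str) -> bool:
    """True if data decodes without error under the given encoding."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_well_formed(b"\xc3\xa9", "utf-8"))              # True
print(is_well_formed(b"\xc3", "utf-8"))                  # False: truncated sequence
print(is_well_formed(b"\xd8\x3d\xde\x00", "utf-16-be"))  # True: a surrogate pair
print(is_well_formed(b"\xd8\x3d\x00\x41", "utf-16-be"))  # False: unpaired surrogate
```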

When stored in the memory of a running program, UTF-16 code units are
almost certainly stored in the native byte order of the machine. In
files and when transmitted, two byte orders are possible. When byte
order distinctions are important, the names UTF-16BE (big-endian) and
UTF-16LE (little-endian) are used.

When a stream of text has a UTF-16 encoding form, and when its byte
order is not known in advance, it is marked with a byte order mark. A
byte order mark is the formatting character "zero-width no-break
space" (U+FEFF) occurring as the first character in the stream. By
examining the first two bytes of such a stream, and assuming that
those bytes are a byte order mark, programs can determine the
byte-order of code units within the stream. When a byte order mark is
present, it is not considered to be part of the text which it marks.
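
The byte order check can be sketched as follows. The function name
detect_utf16_byte_order is hypothetical, and the sketch assumes, as
the text does, that the stream is already known to be UTF-16.

```python
def detect_utf16_byte_order(data: bytes):
    """Return (encoding name or None, data with any byte order mark removed)."""
    if data[:2] == b"\xFE\xFF":      # U+FEFF stored high byte first
        return "utf-16-be", data[2:]
    if data[:2] == b"\xFF\xFE":      # U+FEFF stored low byte first
        return "utf-16-le", data[2:]
    return None, data                # no mark: byte order must be known otherwise


stream = b"\xFF\xFE" + "hi".encode("utf-16-le")
encoding, body = detect_utf16_byte_order(stream)
print(encoding, body.decode(encoding))   # utf-16-le hi
```

The reversed value, FFFE, is one of the non-characters mentioned
earlier, so a byte order mark read in the wrong order cannot be
mistaken for a real character.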

Another encoding form has been standardized that may become popular in
the future: UTF-32. In UTF-32, code units are 32 bits and each code
point is stored in a single code unit.
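
To compare the three encoding forms, the sketch below counts how many
code units each one needs for the same short text. It uses Python's
built-in codecs; the function name code_unit_counts is made up for the
example.

```python
def code_unit_counts(text: str) -> dict[str, int]:
    """Count the code units needed for text in each encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit units
        "UTF-16": len(text.encode("utf-16-be")) // 2,  # 16-bit units
        "UTF-32": len(text.encode("utf-32-be")) // 4,  # 32-bit units
    }


# Four code points: U+0061, U+00E9, U+20AC, U+1F600.
sample = "a\u00e9\u20ac\U0001F600"
print(code_unit_counts(sample))   # {'UTF-8': 10, 'UTF-16': 5, 'UTF-32': 4}
```

Only in UTF-32 does the number of code units equal the number of code
points.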

                         Character Properties

In addition to naming a set of abstract characters, and assigning
those characters to code points, the definition of Unicode assigns
each character a collection of _character_properties_.

The possible properties a character may have and their meanings are
too numerous to list here. Three examples are:

general category -- such as "lowercase letter", "uppercase letter",
"decimal digit", etc.

decimal digit value -- if the character is used as a decimal digit,
this property is its numeric value.

case mappings -- the default lowercase character corresponding to an
uppercase character, and so forth.
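
These three properties can be looked up with Python's standard
unicodedata module, as sketched below. Note that the module reflects
whatever version of the Unicode data ships with the interpreter, which
may be newer than the version described here; the helper name describe
is invented for the example.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few character properties for a single character."""
    return {
        "general category": unicodedata.category(ch),         # e.g. "Lu", "Ll", "Nd"
        "decimal digit value": unicodedata.decimal(ch, None),  # None if not a digit
        "lowercase": ch.lower(),
        "uppercase": ch.upper(),
    }


print(describe("A"))
print(describe("7"))
print(describe("\u0663"))   # ARABIC-INDIC DIGIT THREE
```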

The Unicode Consortium publishes definitions of various character
properties and distributes text files listing those properties for
each code point. For more information, visit http://www.unicode.org.


