Coping with Change
Q: I don't see why I should update to Unicode
5.1. Are there any important new characters?
A. If your implementation is still at Unicode 3.0 or 4.0, then quite a few important characters have
been added in more recent versions. For example, a significant number of characters
important for support of languages in India and Southeast Asia have been added. For East Asia,
characters have been added to fill out compatibility with important
standards such as JIS X 0213, GB 18030, and HKSCS. All of the characters are important to some user community.
[MD, KW]
Q. Which characters exactly were added?
A. If you look at
http://www.unicode.org/Public/UNIDATA/DerivedAge.txt you
can see which characters have been added in each successive version of the standard.
[MD]
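As a rough illustration of how that file can be consumed, here is a minimal Python sketch. It assumes each data line has the layout "code point or range ; version # comment". The SAMPLE text is a hand-written stand-in in that layout, not a copy of the real file, and the helper names parse_derived_age and age_of are hypothetical.

```python
import re

# Hand-written sample in the assumed layout of DerivedAge.txt:
# "<code point or range> ; <version> # <comment>".
# This is an illustration, not an excerpt to rely on.
SAMPLE = """\
# Lines starting with '#' are comments.
0000..001F    ; 1.1 #  [32] <control>..<control>
20AC          ; 2.1 #       EURO SIGN
"""

# One code point, optionally followed by "..<last code point>", then "; version".
LINE_RE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\d+\.\d+)"
)

def parse_derived_age(text):
    """Return a list of (first, last, version) tuples, one per data line."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        match = LINE_RE.match(data)
        if match is None:
            continue  # skip anything that does not fit the assumed layout
        first = int(match.group(1), 16)
        last = int(match.group(2), 16) if match.group(2) else first
        ranges.append((first, last, match.group(3)))
    return ranges

def age_of(code_point, ranges):
    """Return the version string for a code point, or None if not listed."""
    for first, last, version in ranges:
        if first <= code_point <= last:
            return version
    return None

ranges = parse_derived_age(SAMPLE)
print(age_of(0x20AC, ranges))
```

To work with real data, the same functions could be applied to the text of a downloaded copy of the file instead of SAMPLE.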
Q: Fonts and input methods or keyboards are really expensive to
produce. Do I have to support all the new characters for them?
A. Supporting the latest version of Unicode does not require that you have fonts or
keyboards for all the characters. You always have a choice of what
repertoire of Unicode characters you want to support in your product.
Fonts and keyboards can be added incrementally.
[MD]
Q: But what else would I want to support in
the latest version of the standard?
A. Even if you are not supplying keyboards and fonts, you will
probably need your software to handle the properties of the new
characters correctly.
[MD]
Q: Why should I support Unicode properties?
A. Unicode properties are widely used under the
covers. Text parsers use them to separate letters from
punctuation and symbols. Anything that uses regular expressions, such as XML Schema, will use them. They are used in uppercase/lowercase
conversions, and in case-insensitive matching. They also coordinate with
the latest versions of the Unicode Collation Algorithm, for sorting.
In globalization coding guidelines, we strongly recommend that
hard-coded expressions like
if ('a' <= x && x <= 'z' || 'A' <= x && x <= 'Z') doSomething();
should normally change to use appropriate Unicode properties,
something like the following (depending on what was originally meant):
if (getCategory(x) == LETTER) doSomething(); or
if (getCategory(x) == LETTER && getScript(x) == LATIN) doSomething();
Using an old version of Unicode will mean that new characters will be
ignored in such processing, or included where they are not meant to be.
Importantly, fixes in properties—even for old characters—are made
over time, and using the latest version of the properties ensures that
you have the most accurate data you can.
[MD]
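The contrast between the hard-coded range test and a property-based test can be sketched in Python with the standard unicodedata module. The function names are hypothetical. Note one loud assumption: the standard library does not expose the Script property, so is_latin_letter approximates it by checking that the character name starts with "LATIN ", which is not equivalent to a real script lookup.

```python
import unicodedata

def is_ascii_letter(ch):
    """The hard-coded test: only the 52 basic Latin letters pass."""
    return "a" <= ch <= "z" or "A" <= ch <= "Z"

def is_letter(ch):
    """Property-based test: General_Category is Lu, Ll, Lt, Lm, or Lo."""
    return unicodedata.category(ch).startswith("L")

def is_latin_letter(ch):
    """Letter that is (approximately) in the Latin script.

    ASSUMPTION: uses the character name prefix as a stand-in for the
    Script property, which the standard library does not provide.
    """
    return is_letter(ch) and unicodedata.name(ch, "").startswith("LATIN ")

# The results depend on the Unicode version bundled with the runtime.
print("Unicode data version:", unicodedata.unidata_version)

for ch in ["a", "\u00e9", "\u044f", "1"]:
    print(repr(ch), is_ascii_letter(ch), is_letter(ch), is_latin_letter(ch))
```

Because is_letter reads its answer from the property tables, a character added in a newer version of the standard is recognized only when the runtime's tables have been updated to that version.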
Q. Is it cost-effective to update the Unicode character
properties in my product?
A. There are good reasons to always update the Unicode
character properties to the latest version when you can, since the cost
is rather small (typically just updating data tables) compared to the
benefits. For servers and middleware, the support for new Unicode characters will typically amount to just
updating the property tables appropriately.
[MD]
Q: How do I find out about all the different versions of Unicode?
A. Documentation of the contents of each version of the Unicode
Standard is found on the
Enumerated Versions page.
[MD]
Q: How do I cite the Unicode Standard in my references?
A. See Versions
of the Unicode Standard.
[MD]
Q: How much does the Unicode Standard change between different
versions?
A. Characters can be added in each major or minor version of the
standard. Properties and other specifications can be added or changed.
However, all changes are subject to the Unicode stability policy. See
the Character Encoding Stability Policy for more information.
[MD]
Q. What are Unicode Named Character Sequences?
A. Named Character Sequences were added to the standard starting
with Version 4.1. A character sequence is simply any
sequence of Unicode characters, typically indicated between
angle brackets: <U+0063, U+0061, U+0074>. Occasionally, however,
there is a need to give a particular character sequence an official
name in the standard, for example for use in mapping
to other standards. Such sequences are called Named Character Sequences.
See UAX #34, Unicode Named Character Sequences, for a
complete explanation. [KW]
Q. Are Unicode Named Character Sequences guaranteed to be stable?
A. Yes. Once a particular Unicode Named Character Sequence has been finally approved, it will not be removed or changed. In order to allow sufficient time for review of Named Character Sequences, a two-step process is used. First a Named Character Sequence is provisionally approved and is listed in
NamedSequencesProv.txt in the
Unicode Character Database. Only later, after any feedback and any required corrections, is a Named Character Sequence listed in
NamedSequences.txt. Such entries are then stable. [KW]
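To make the notation and the name-to-sequence mapping concrete, here is a small Python sketch. It assumes that each data line in NamedSequences.txt has the layout "name;code points", with the code points in hexadecimal separated by spaces. The SAMPLE entry is written in that layout for illustration and should be checked against the actual file. The function names are hypothetical.

```python
# Sample entry in the assumed "name;code points" layout.
# Treat both the layout and the entry as assumptions to verify.
SAMPLE = """\
# Lines starting with '#' are comments.
LATIN SMALL LETTER R WITH TILDE;0072 0303
"""

def parse_named_sequences(text):
    """Return a dict mapping each sequence name to its string of characters."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        name, _, code_points = data.partition(";")
        chars = "".join(chr(int(cp, 16)) for cp in code_points.split())
        table[name.strip()] = chars
    return table

def format_sequence(chars):
    """Render a string in angle-bracket notation, e.g. <U+0063, U+0061>."""
    return "<" + ", ".join(f"U+{ord(c):04X}" for c in chars) + ">"

table = parse_named_sequences(SAMPLE)
for name, chars in table.items():
    print(name, format_sequence(chars))

print(format_sequence("cat"))
```

Since entries in NamedSequences.txt are stable once listed, a table built this way only ever needs additions when moving to a later version.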