Unicode Frequently Asked Questions

Coping with Change

Q: I don't see why I should update to Unicode 6.3. Are there any important new characters?

A: If your implementation is still at Unicode 4.0 or 5.0, then quite a few important characters have been added in more recent versions. For example, a significant number of characters important for support of languages in India and Southeast Asia have been added. For East Asia, characters have been added to fill out compatibility with important standards such as JIS X 0213, GB 18030, and HKSCS. Additionally, many symbols have been added that are important for interoperability with the Japanese television standard and Japanese mobile phones. All of the characters are important to some user community.

Q: Which characters exactly were added?

A: The file http://www.unicode.org/Public/UCD/latest/ucd/DerivedAge.txt lists, for each successive version of the standard, the characters that were added in that version.
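
As an illustration, the following is a minimal sketch of how such a file could be read to count the characters added in a given version. It assumes the usual layout of the data files ("range ; version # comment"). The helper names parse_derived_age and count_added are hypothetical, and a short excerpt is embedded in place of the full file so that the sketch runs offline.

```python
from collections import defaultdict

# A short excerpt in the "range ; version # comment" layout.
# Real use would read the full DerivedAge.txt file instead.
SAMPLE = """\
# DerivedAge excerpt (illustrative subset)
0000..001F    ; 1.1 #  [32] <control-0000>..<control-001F>
20AC          ; 2.1 #       EURO SIGN
061C          ; 6.3 #       ARABIC LETTER MARK
2066..2069    ; 6.3 #   [4] LEFT-TO-RIGHT ISOLATE..POP DIRECTIONAL ISOLATE
"""


def parse_derived_age(text):
    """Map each version string to a list of (first, last) code point ranges."""
    by_version = defaultdict(list)
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        range_part, version = (field.strip() for field in data.split(";"))
        first, _, last = range_part.partition("..")
        # A single code point has no "..", so it is its own range end.
        by_version[version].append((int(first, 16), int(last or first, 16)))
    return dict(by_version)


def count_added(by_version, version):
    """Count the code points first assigned in the given version."""
    return sum(last - first + 1 for first, last in by_version.get(version, []))


ages = parse_derived_age(SAMPLE)
print(count_added(ages, "6.3"))
```

With the full file as input, the same two functions would report the additions for any version listed in it.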

Q: Fonts and input methods or keyboards are really expensive to produce. Do I have to support all the new characters for them?

A: Supporting the latest version of Unicode does not require that you have fonts or keyboards for all the characters. You always have a choice of what repertoire of Unicode characters you want to support in your product. Fonts and keyboards can be added incrementally.

Q: But what else would I want to support in the latest version of the standard?

A: Even if you are not supplying keyboards and fonts, you will probably need your software to handle the properties of the new characters correctly. There is also a major update to the handling of bidirectional text in Unicode 6.3.

Q: Why should I support Unicode properties?

A: Unicode properties are widely used under the covers. Text parsers use them to separate letters from punctuation and symbols. Anything that uses regular expressions, such as XML Schema, uses them. They are used in uppercase/lowercase conversions and in case-insensitive matching. They also coordinate with the latest versions of the Unicode Collation Algorithm for sorting.

In globalization coding guidelines, we strongly recommend that hard-coded expressions like

if ('a' <= x && x <= 'z' || 'A' <= x && x <= 'Z') doSomething();

should normally change to use appropriate Unicode properties, something like the following (depending on what was originally meant):

if (getCategory(x) == LETTER) doSomething();

or

if (getCategory(x) == LETTER && getScript(x) == LATIN) doSomething();

Using an old version of Unicode will mean that new characters will be ignored in such processing, or included where they are not meant to be. Importantly, fixes in properties—even for old characters—are made over time, and using the latest version of the properties ensures that you have the most accurate data you can.
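
A property-based check of this kind might look as follows in Python. This is only a sketch: the function names is_letter and is_latin_letter are hypothetical, and because the standard unicodedata module does not expose the Script property, the Latin test is approximated by looking at the character name, which is an assumption that will miss some Latin-script characters. The last line prints the version of the Unicode Character Database bundled with the running interpreter, which determines which characters the checks know about.

```python
import unicodedata


def is_letter(ch):
    """True if the General_Category is a letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def is_latin_letter(ch):
    """Approximate 'letter in the Latin script'.

    Assumption: the standard library has no Script property lookup, so this
    checks whether the character name starts with "LATIN ". That is only an
    approximation of the real Script property.
    """
    return is_letter(ch) and unicodedata.name(ch, "").startswith("LATIN ")


for ch in ["a", "Z", "\u00e9", "\u044f", "1", "!"]:
    print(repr(ch), is_letter(ch), is_latin_letter(ch))

# Version of the Unicode data tables this interpreter was built with.
print(unicodedata.unidata_version)
```

Unlike the hard-coded range test, the first function accepts letters outside A-Z, and its answers follow whatever version of the property data the runtime ships with.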

Q: Is it cost-effective to update the Unicode character properties in my product?

A: There are good reasons to update the Unicode character properties to the latest version whenever you can, since the cost is rather small (typically just updating data tables) compared to the benefits. For servers and middleware, supporting new Unicode characters will typically amount to just updating the property tables appropriately.

Q: How do I find out about all the different versions of Unicode?

A: Documentation of the contents of each version of the Unicode Standard is found on the Enumerated Versions page.

Q: How do I cite the Unicode Standard in my references?

A: See Versions of the Unicode Standard.

Q: How much does the Unicode Standard change between different versions?

A: Characters can be added in each major or minor version of the standard. Properties and other specifications can be added or changed. However, all changes are subject to the Unicode stability policy. See the Character Encoding Stability Policy for more information.

Q: What are Unicode Named Character Sequences?

A: These were added to the standard starting with Version 4.1. A character sequence is simply any sequence of Unicode characters, typically indicated between angle brackets: <U+0063, U+0061, U+0074>. Occasionally, however, there is a need to give a particular character sequence an official name in the standard, for example for use in mapping to other standards. These are called Named Character Sequences. See UAX #34, Unicode Named Character Sequences, for a complete explanation.
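
The angle-bracket notation can be produced and read back mechanically. The sketch below is illustrative only. The helper names format_sequence and parse_sequence are hypothetical and are not part of any standard API.

```python
def format_sequence(text):
    """Render a string as a character sequence, e.g. <U+0063, U+0061, U+0074>."""
    return "<" + ", ".join(f"U+{ord(ch):04X}" for ch in text) + ">"


def parse_sequence(notation):
    """Convert the angle-bracket notation back into a string."""
    body = notation.strip().strip("<>")
    # Each item looks like "U+0063"; drop the "U+" prefix and read the hex digits.
    return "".join(chr(int(item.strip()[2:], 16)) for item in body.split(","))


notation = format_sequence("cat")
print(notation)
print(parse_sequence(notation))
```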

Q: Are Unicode Named Character Sequences guaranteed to be stable?

A: Yes. Once a particular Unicode Named Character Sequence has been finally approved, it will not be removed or changed. In order to allow sufficient time for review of Named Character Sequences, a two-step process is used. First a Named Character Sequence is provisionally approved and is listed in NamedSequencesProv.txt in the Unicode Character Database. Only later, after any feedback and any required corrections, is a Named Character Sequence listed in NamedSequences.txt. Such entries are then stable.