Bidirectional Category Proposal #1

Authors: John I. McConnell JohnMcCo@microsoft.com,
F. Avery Bishop AveryB@microsoft.com, David Brown DBrown@microsoft.com,
Majd Abbar MAbbar@microsoft.com, Ronen Yacobi A-RonenY@microsoft.com

18-Nov-1997

Proposal

This memo describes a proposal to change the value of four entries in the Unicode 2.0 character database. The latest version of this database is available from the Unicode Consortium web site at http://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt . Specifically, the proposal changes the bidirectional category of four characters. Table 1 lists these changes.

Table 1 Proposed Changes

Code Value  Glyph  Unicode 2.0 Character Name  Current Bidirectional Category  Proposed Bidirectional Category
U+002E      .      FULL STOP                   European Separator              Common Separator
U+2007             FIGURE SPACE                European Separator              Common Separator
U+0026      &      AMPERSAND                   Strong Left-to-Right            Other Neutral
U+0040      @      COMMERCIAL AT               Strong Left-to-Right            Other Neutral

The effect of these changes is to alter the visual order of text containing these characters and text from right-to-left writing systems such as Arabic and Hebrew. The overall intent of the proposal is to better match such behavior with user expectations and existing practice.

The change to FULL STOP improves the display of decimal numbers. The changes to AMPERSAND and COMMERCIAL AT improve the display of certain email addresses and URLs. The change to FIGURE SPACE is for consistency with other characters and has no significant effect on existing data. All four changes affect only the resolution of weak neutrals (steps P0 through P3). This limits the changes in behavior to cases where these characters are adjacent to numbers.
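The four changes in Table 1 can be expressed as a small override table layered on top of the existing database. The following is a minimal sketch rather than a definitive implementation: the two-letter codes are the abbreviations used in the character database, while the names PROPOSED_CHANGES, bidi_category, and base_lookup are hypothetical and introduced here only for illustration.

```python
# Table 1 as data: code point -> (character name, current category, proposed category).
# Codes: L = Strong Left-to-Right, ES = European Separator,
#        CS = Common Separator, ON = Other Neutral.
PROPOSED_CHANGES = {
    0x002E: ("FULL STOP", "ES", "CS"),
    0x2007: ("FIGURE SPACE", "ES", "CS"),
    0x0026: ("AMPERSAND", "L", "ON"),
    0x0040: ("COMMERCIAL AT", "L", "ON"),
}


def bidi_category(code_point, base_lookup, proposed=True):
    """Return the bidirectional category of a code point.

    base_lookup is a hypothetical callable standing in for a lookup into
    the Unicode 2.0 character database; it is not a real library function.
    For the four characters in Table 1, the proposed (or current) category
    from the table is returned instead.
    """
    if code_point in PROPOSED_CHANGES:
        _name, current, new = PROPOSED_CHANGES[code_point]
        return new if proposed else current
    return base_lookup(code_point)
```

Because the changes are confined to database entries, an implementation structured this way would need no change to its algorithm code, only to the data it consults.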

Accepting the proposal would also require changing a few entries in Table 3-5 on page 3-17 and Table 4-4 on page 4-11 of the Unicode Standard. No changes are required to the Unicode bidirectional algorithm itself.

Rationale

One of the implicit design criteria for the Unicode Bidirectional Algorithm is that most text should not require explicit directional formatting codes. The current bidirectional category assignments reflect the designers’ best attempt to meet that criterion. However, since the publication of Unicode, there have been two developments that alter the basis for those initial assignments.

The first development is that the first Unicode-based software, for example Microsoft Office, has reached the regions that use right-to-left scripts, and customers have had a chance to voice their opinions. In particular, users have now seen the results of converting existing documents to Unicode. Although the conversion has generally gone smoothly, users have complained about some specific behaviors and the corruption of their data.

The second development is the growth of the PC and the Internet. These phenomena are worldwide and have affected popular culture in many regions. In particular, they have introduced important new uses for some characters that were formerly rare or non-existent in the local writing system. For example, the COMMERCIAL AT, while originally a Latin-specific character, has become common worldwide because of its use within email addresses. Right-to-left text within a URL is still relatively rare, and we do not claim that our proposal is a general solution. But our proposal would improve the display of the most common occurrences today, namely right-to-left user names and text queries.

Although it is always a serious matter to change a standard, the Consortium has a small window of opportunity to do so now with minimum detriment. The amount of affected software is still small. There are currently few Java implementations with bidirectional support on the market, although several are in late stages of development. There are proposals to extend URLs to UTF-8, but few commercial implementations exist yet. In a few months it may well be too late.

Test Cases

This section shows the effect of the proposed changes on several important cases. In each test case we follow the same conventions as the Unicode 2.0 book: uppercase letters correspond to strong right-to-left characters, whereas lowercase letters correspond to strong left-to-right characters. We have also included examples using Arabic and Hebrew text. In all the examples, except as noted, the embedding level is right-to-left. Results that differ from the current values in Unicode 2.0 are shaded.

Decimals

Customer feedback has shown that both the COMMA and the FULL STOP are used as decimal points in Arabic. Currently, the COMMA has the bidirectional category COMMON SEPARATOR whereas the FULL STOP has the category EUROPEAN SEPARATOR. The proposal would give both characters the category COMMON SEPARATOR.

Table 2 Decimal Numbers

Logical Order          Current Visual Order  Proposed Visual Order
ADD 0.5 CUPS (Arabic)  SPUC 5.0 DDA          SPUC 0.5 DDA
ADD 0.5 CUPS (Hebrew)  SPUC 0.5 DDA          SPUC 0.5 DDA
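The difference between the Arabic and Hebrew rows of Table 2 comes from how a separator between two digits is resolved. The sketch below is illustrative only and rests on two stated assumptions: digits in Arabic text are treated as Arabic numbers (AN) while digits in Hebrew text are European numbers (EN), and the separator rule is the one found in later published versions of the bidirectional algorithm (a European Separator between two European numbers becomes a European number; a Common Separator between two numbers of the same type becomes that type; any other separator becomes neutral). The function name resolve_separators is hypothetical.

```python
def resolve_separators(types):
    """Resolve ES/CS separators in a list of bidirectional category codes.

    types is a list such as ["AN", "ES", "AN"]. A separator that is not
    absorbed into the surrounding numbers is returned as "ON" (neutral).
    Simplified sketch: only a single separator between two numbers is
    absorbed, and neighbours are read from the unmodified input.
    """
    resolved = list(types)
    for i, category in enumerate(types):
        if category not in ("ES", "CS"):
            continue
        before = types[i - 1] if i > 0 else None
        after = types[i + 1] if i + 1 < len(types) else None
        if category == "ES" and before == "EN" and after == "EN":
            # European Separator joins two European numbers only.
            resolved[i] = "EN"
        elif category == "CS" and before == after and before in ("EN", "AN"):
            # Common Separator joins two numbers of the same type.
            resolved[i] = before
        else:
            resolved[i] = "ON"
    return resolved


# "0.5" in Arabic text: digits are AN.
arabic_current = resolve_separators(["AN", "ES", "AN"])   # FULL STOP as ES
arabic_proposed = resolve_separators(["AN", "CS", "AN"])  # FULL STOP as CS

# "0.5" in Hebrew text: digits are EN, so both categories join the number.
hebrew_current = resolve_separators(["EN", "ES", "EN"])
hebrew_proposed = resolve_separators(["EN", "CS", "EN"])
```

Under these assumptions, the current category leaves the FULL STOP neutral in the Arabic case, so the two digits are reordered as separate numbers in right-to-left text (5.0). With the proposed category, the three characters form a single number and keep their order (0.5).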

Email Addresses

Of the proposed changes, only the COMMERCIAL AT sign has a significant effect on the layout of email addresses. Although the use of right-to-left characters is non-standard, there is growing use of Arabic characters in the Middle East for the username portion of the address, especially on intranets. The domain names remain left-to-right.

Table 3 Email Addresses

Logical Order    Current Visual Order  Proposed Visual Order
ALI@unicode.org  @unicode.orgILA       unicode.org@ILA
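The two visual orders in Table 3 can be reproduced with a toy reordering routine. This is a simplified sketch, not the full bidirectional algorithm: it assumes a right-to-left paragraph, follows the memo's convention that uppercase letters are strong right-to-left and lowercase letters are strong left-to-right, treats every other character as neutral unless told otherwise, and ignores digits, explicit formatting codes, and mirroring. The name visual_order and its categories parameter are hypothetical.

```python
def visual_order(text, categories):
    """Reorder text for display in a right-to-left paragraph (toy model).

    categories maps a non-letter character to "L" (strong left-to-right)
    or "ON" (neutral); characters not listed are treated as neutral.
    """
    def char_type(ch):
        if ch.isupper():
            return "R"
        if ch.islower():
            return "L"
        return categories.get(ch, "ON")

    types = [char_type(ch) for ch in text]

    # Resolve neutrals: a run of neutrals between two strong characters of
    # the same direction takes that direction; otherwise it takes the
    # paragraph direction (R). The paragraph edges count as R.
    i = 0
    while i < len(types):
        if types[i] != "ON":
            i += 1
            continue
        j = i
        while j < len(types) and types[j] == "ON":
            j += 1
        before = types[i - 1] if i > 0 else "R"
        after = types[j] if j < len(types) else "R"
        fill = before if before == after else "R"
        types[i:j] = [fill] * (j - i)
        i = j

    # Reverse the whole line for the right-to-left paragraph, then put
    # each left-to-right run back into its reading order.
    chars = list(reversed(text))
    rtypes = list(reversed(types))
    i = 0
    while i < len(chars):
        if rtypes[i] != "L":
            i += 1
            continue
        j = i
        while j < len(chars) and rtypes[j] == "L":
            j += 1
        chars[i:j] = reversed(chars[i:j])
        i = j
    return "".join(chars)


current = visual_order("ALI@unicode.org", {"@": "L"})    # COMMERCIAL AT as Strong Left-to-Right
proposed = visual_order("ALI@unicode.org", {"@": "ON"})  # COMMERCIAL AT as Other Neutral
```

With the current category, the COMMERCIAL AT joins the left-to-right domain name and ends up at the far left. As a neutral between a right-to-left user name and a left-to-right domain, it takes the paragraph direction and stays between the two parts.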

Uniform Resource Locators

According to RFC 1738 section 2.2, the seven characters listed in Table 4 have special meaning within a URL. These characters are reserved exclusively for use by schemes. There are at least two proposals to extend the URL syntax to the entire Unicode repertoire using UTF-8. Commercial implementations are likely to appear within a year. With the advent of right-to-left characters within URLs, proper display would require changes to the bidirectional category. This proposal would have the effect of making all of these special characters neutrals. This would reduce the need for explicit formatting characters in URLs.

Table 4 URL Reserved Characters

Code Value  Glyph  Unicode 2.0 Character Name  Current Bidirectional Category  Proposed Bidirectional Category
U+0026      &      AMPERSAND                   Strong Left-to-Right            Other Neutral
U+002E      .      FULL STOP                   European Separator              Common Separator
U+002F      /      SOLIDUS                     European Separator              Common Separator
U+003A      :      COLON                       Common Separator                Common Separator
U+003D      =      EQUALS SIGN                 Other Neutral                   Other Neutral
U+003F      ?      QUESTION MARK               Other Neutral                   Other Neutral
U+0040      @      COMMERCIAL AT               Strong Left-to-Right            Other Neutral

There are many schemes that use these characters, so it is impossible to list all test cases. However, a typical use would be to separate parameters. For example, a query to a search engine might use the ‘&’ to separate the search parameter from other parameters.

Table 5 URL Scheme

Logical Order   Current Visual Order  Proposed Visual Order
…&query=ALI&…   …&ILA=&query          …&ILA=query&…

Windows® Shortcuts

Microsoft Windows assigns a special role to the ampersand in resources such as menu items and dialogs: the character following the ampersand is the keyboard shortcut for that item. Unfortunately, the strong left-to-right attribute of the ampersand causes such resource files to print improperly when localizers are editing the resources. Although this is platform specific, Windows software accounts for an enormous percentage of localized software in right-to-left scripts, and many third-party products and tools depend on this behavior. Changing the category of the ampersand would remove a considerable obstacle to localization of software for the Middle East.

Table 6 Resource Shortcuts

Logical Order  Current Visual Order  Proposed Visual Order
&PRINT         &TNIRP                TNIRP&

Conclusion

The authors believe that this proposal would more closely match user expectations for visual order of right-to-left text and expedite the development of software for regions that use such text. This improvement would promote the acceptance of Unicode for an important emerging software market.