The Unicode Consortium Discussion Forum

All times are UTC - 6 hours [ DST ]




 Post subject: Compatibility Characters
Posted: Sun Sep 23, 2012 7:46 pm

Joined: Sat Aug 06, 2011 9:02 am
Posts: 43
Isn't there some contradiction between this paragraph on page 19

"Interoperability does not require that all external characters can be mapped to single Unicode characters; encoding a compatibility character is not necessary when a character in another standard can be represented as a sequence of existing Unicode characters. For example the Shift-JIS encoding 0x839E for JIS X 0213 katakana letter ainu to can simply be mapped to the Unicode character sequence <U+30C8, U+309A>. However, in cases where no appropriate mapping is available, the requirement for interoperability and convertibility may be met by encoding a compatibility character for one-to-one mapping to another standard."

and this paragraph on page 20

"Allocation. The Compatibility and Specials Area contains a large number of compatibility characters, but the Unicode Standard also contains many compatibility characters that do not appear in that area. These include examples such as U+2163 “IV” roman numeral four, ... "

if we consider that the character U+2163 can be mapped to the sequence <U+2160, U+2164>?
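To make the two cases in the question concrete, here is a minimal Python sketch using the standard `unicodedata` module. It assumes the Unicode data bundled with the reader's interpreter; the helper name `code_points` is made up for this illustration. It checks that the Ainu "to" sequence stays a two-code-point sequence under NFC, and prints what U+2163 decomposes to under NFKD (the recorded mapping uses the Latin letters I and V, not other Roman numeral characters).

```python
import unicodedata

def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# KATAKANA LETTER TO followed by the combining semi-voiced sound mark.
ainu_to = "\u30C8\u309A"
# NFC composes only where a precomposed character exists; none exists here.
ainu_to_nfc = code_points(unicodedata.normalize("NFC", ainu_to))

# ROMAN NUMERAL FOUR under compatibility decomposition (NFKD).
roman_four_nfkd = code_points(unicodedata.normalize("NFKD", "\u2163"))

print(ainu_to_nfc)      # ['U+30C8', 'U+309A']
print(roman_four_nfkd)  # ['U+0049', 'U+0056']
```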

By the way, I found this file, which seems to define the Unicode character blocks. Could someone tell me where this Compatibility and Specials Area starts?

Thanks in advance for the answer.


 Post subject: Re: Compatibility Characters
Posted: Sun Sep 23, 2012 8:51 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
The general descriptions in Chapters 1 and 2 of the Unicode Standard are, in a sense, an overview. They are not intended to be read with the same rigor as the rules of an algorithm. (For an example of an algorithm, read UAX #9.)

A very fundamental fact about the Unicode Standard is that there is no algorithm for deciding on the encoding of a character. There are general rules and principles to follow, as well as an evolving set of precedents, but the actual decision is ultimately left to judgment.

This applies even more to the encoding decisions made to arrive at the initial repertoire of the Unicode Standard. Certain things that are now firmly outlawed, like the encoding of precomposed characters, were standard practice then. Yet, in presenting the overview, it's impossible to delineate all the historic, current and future aspects of the character encoding work - nobody would be able to follow it. Instead, you have what you might call a slightly idealized presentation that focuses on important general principles, while purposely not attempting to account for all the pesky exceptions.

"Area" isn't a well-defined term, but one that's convenient to use at times. There is no data file, but some of the discussion uses the term and defines specific areas by example, by graphic, or some other means as a shorthand for the discussion at hand.

When Unicode was mostly the BMP, the use of this term was more common (you can see it surviving in some block names, like "Private Use Area"). The area you are interested in would have started following the Private Use Area and more or less extended to the end of the BMP.


 Post subject: Re: Compatibility Characters
Posted: Mon Sep 24, 2012 6:54 am

Joined: Sat Aug 06, 2011 9:02 am
Posts: 43
Thanks for your reply.

"A very fundamental fact about the Unicode Standard is that there is no algorithm for deciding on the encoding of a character."

But I'm assuming that there is a way of identifying the compatibility characters in UnicodeData.txt, or am I wrong again?


 Post subject: Re: Compatibility Characters
Posted: Mon Sep 24, 2012 3:17 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
Belloc wrote:
I'm assuming that there is a way of identifying the compatibility characters in UnicodeData.txt, or am I wrong again?


If you had finished reading the Standard as closely as you've been reading the sections you cite here, you would have found somewhere (I can't recall where, but I remember helping draft it) a statement that there are several kinds of compatibility characters. There are those that have a "compatibility decomposition", and those that, while added "for compatibility", do not.

There's an argument to be made that not all characters with compatibility decompositions could ever be considered "non-essential" in their own right (my term here). There are usage contexts where many of them cannot easily be replaced by mapping or a combination of style and mapping without losing something essential about the encoded text.

That gives the whole concept of compatibility character something of an "in the eye of the beholder" quality - there's less black and white out there than you'd at first expect.

If you look for the character "<" in UnicodeData.txt you might learn something to your advantage - read Section 17.1 as well.
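As a rough illustration of that hint: in UnicodeData.txt, a decomposition field that begins with a tag in angle brackets (such as `<compat>` or `<final>`) marks a compatibility decomposition, while an untagged one is canonical. Python's standard `unicodedata.decomposition()` returns that field as a string, so a minimal sketch of the check could look like this. The function name `decomposition_kind` is invented for the example, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def decomposition_kind(ch):
    """Classify a character by its decomposition field.

    Returns (kind, tag): kind is "none", "canonical" or "compatibility";
    tag is the bracketed label such as "<compat>", or None.
    """
    field = unicodedata.decomposition(ch)
    if not field:
        return ("none", None)
    first = field.split()[0]
    if first.startswith("<"):
        return ("compatibility", first)
    return ("canonical", None)

# A precomposed letter, a Roman numeral, an Arabic positional form, plain "A".
for ch in ["\u00C5", "\u2163", "\uFE8E", "A"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), decomposition_kind(ch))
```

Note that this only finds characters that carry a tagged decomposition; characters added "for compatibility" without one are invisible to it.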


 Post subject: Re: Compatibility Characters
Posted: Tue Sep 25, 2012 8:28 am

Joined: Sat Aug 06, 2011 9:02 am
Posts: 43
From this paragraph on page 20 of Chapter 2, "General Structure":

"Usage. The fact that a particular character is considered a compatibility character does not mean that that character is deprecated in the standard. The use of most compatibility characters in general text interchange is unproblematic. Some, however, such as the Arabic positional forms or other compatibility characters which assume information about particular layout conventions, such as presentation forms for vertical text, can lead to problems when used in general interchange. Caution is advised for their use."

Therefore, it seems to me indispensable to have a formal way (I mean something that could be used in a computer program) to identify any compatibility character in the Unicode Character Database.

But how can one identify the character U+2502 BOX DRAWINGS LIGHT VERTICAL as a compatibility character? Note that the decomposition mapping field for this character is empty in UnicodeData.txt.


 Post subject: Re: Compatibility Characters
Posted: Tue Sep 25, 2012 9:54 am
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
The "caution" refers to: "...the Arabic positional forms or other compatibility characters which assume information about particular layout conventions..."
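If an implementation only wants to flag the characters that caution is about, one possible sketch is to test the decomposition tag against a small set of layout-related tags. The set `LAYOUT_TAGS` and the function `assumes_layout` are assumptions made for this example, not a list taken from the Standard, and the data comes from Python's bundled `unicodedata`.

```python
import unicodedata

# Tags picked for this sketch: Arabic positional forms and vertical forms.
LAYOUT_TAGS = {"<initial>", "<medial>", "<final>", "<isolated>", "<vertical>"}

def assumes_layout(ch):
    """True if ch has a compatibility decomposition with a layout-related tag."""
    field = unicodedata.decomposition(ch)
    return bool(field) and field.split()[0] in LAYOUT_TAGS

samples = [
    "\uFE8E",  # ARABIC LETTER ALEF FINAL FORM
    "\uFE35",  # PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS
    "\u2163",  # ROMAN NUMERAL FOUR (tagged <compat>, not layout-related)
    "\u2502",  # BOX DRAWINGS LIGHT VERTICAL (no decomposition at all)
]
flags = [assumes_layout(ch) for ch in samples]
print(flags)  # [True, True, False, False]
```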

When and for what purpose do you need to "identify the character U+2502 BOX DRAWINGS LIGHT VERTICAL as a compatibility character?"

Why would your implementation care?


 Post subject: Re: Compatibility Characters
Posted: Tue Sep 25, 2012 11:59 am

Joined: Sat Aug 06, 2011 9:02 am
Posts: 43
Thanks

