Re: Unicode 7.0 goals and ++

From: Jukka K. Korpela <jkorpela_at_cs.tut.fi>
Date: Tue, 12 Jul 2011 08:59:17 +0300

2011-07-11 21:57, Ken Whistler wrote:

> On 7/10/2011 4:58 PM, Ernest van den Boogaard wrote:
>> For the long term, I suggest Unicode should aim for this:
>>
>> Unicode 6.5 should claim: There will be a *Unicode dictionary*,
>> limiting and reducing ambiguous semantics within Unicode
>> (Background: e.g. the word "character" will have one single crisp
>> definition, /or/ can be specified to & at any special point).
>
> That kind of terminological purity isn't going to occur.

That's possible, even probable, if people who could do the clarification
don't want to do it.

> The word "character" has been
> used ambiguously for decades in the IT industry, and has other general
> language usage as well.

So do many other words, too. Terminology isn't about changing the
meanings of words in everyday language. It's about defining terms,
perhaps using common-language words but assigning technical meanings to
them.

> The Unicode Consortium has a glossary of terms:
>
> http://www.unicode.org/glossary/

Yes, and it's mostly useful and well-written. But the "definition" for
character is really a mess. For example, "(1) The smallest component of
written language that has semantic value" doesn't make sense. What is
the semantic value of the letter "e"? Does that definition answer the
question whether "" is one character or two?
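To make the "one character or two" question concrete, here is a minimal Python sketch. The choice of "ä" is an assumption made purely for illustration, as are the variable names and the code_points helper. It shows that the same visible letter can be stored as one code point or as two, so any count depends on which notion of "character" is meant.

```python
import unicodedata

# Assumed example letter: "ä" in its two canonically equivalent encodings.
precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS


def code_points(s):
    """Return the code points of s in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in s]


# Python's len() counts code points, not user-perceived characters.
print(code_points(precomposed), len(precomposed))
print(code_points(decomposed), len(decomposed))

# Normalizing to NFC composes the two-code-point sequence into the single one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Both strings normally render identically, yet they differ in length, which is exactly the kind of case the glossary definition leaves open.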

"Abstract character" is even worse. "A unit of information used for the
organization, control, or representation of textual data." So a bit is a
character, isn't it?

> But it is basically hopeless to try to legislate away linguistic
> ambiguity in a term like "character".

You're referring to "character" not as a term but as a word in English.

I think part of the problem is that Unicode has widely been
misrepresented as providing a unique number (code point) for every
character (see e.g. http://www.unicode.org/standard/WhatIsUnicode.html
), and it is difficult to take back such statements - which are an
important part of Unicode evangelism. We can keep saying it only if the
word "character" is used loosely enough. The statement is effectively a
truism: Unicode has a unique number for every code point designated as a
character code point (and for other code points, too, of course).
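As a rough illustration of the "other code points, too" remark, the following Python sketch looks up three code points with the standard unicodedata module. The describe helper and the sample code points are assumptions chosen for illustration; the point is only that every code point has a number, whether or not it is assigned to a character.

```python
import unicodedata


def describe(cp):
    """Return (notation, general category, name) for a code point (hypothetical helper)."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch, "<no name>"))


# An ordinary letter, a surrogate code point, and a noncharacter.
for cp in (0x0065, 0xD800, 0xFFFF):
    print(describe(cp))
```

Here U+0065 is an assigned letter (category Ll), while U+D800 (Cs, surrogate) and U+FFFF (Cn, a noncharacter) are numbered code points that are not characters in any ordinary sense.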

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/
Received on Tue Jul 12 2011 - 01:04:20 CDT
