From: Asmus Freytag (email@example.com)
Date: Mon Dec 29 2008 - 16:52:34 CST
On 12/29/2008 11:04 AM, Jukka K. Korpela wrote:
> Asmus Freytag wrote:
>> The problem stems from the fact that in this kind of scenario 8) is no
>> longer unique in the encoding sense.
> Pardon? Now I'm _really_ confused. Why should it be "unique"? Is any
> sequence of Ascii characters "unique" in the sense that it has only
> one possible semantic interpretation? Or even any single character?
The string "abc" should encode the sequence of letters "abc". That
relation should be unique. What "abc" stands for, is up to the context
of the document, just as the meaning of many words depends on context.
Originally, emoticons started as a punny way of using punctuation. If
things had remained there, there would not be a need for this
discussion. However, nowadays most users of these things pick them from
a list of symbols and don't care what the fallback string is that the
software inserts into the text buffer. What used to be a punny way of
using punctuation has become de facto markup for text elements. Just
like the TeX markup for mathematical symbols, or &gt; in HTML. That's
why we are having this discussion, and that's why having a dedicated
encoding could provide benefits.
>> In order to determine whether
>> text containing 8) intends to encode the digit eight followed by the
>> close paren or in fact intends to encode an emoticon you now need out of
>> band information.
> Yes, or a good guess. But how is this different from interpreting
> sequences of characters in general? The string "88" might be intended
> to mean an integer in decimal notation, or in hexadecimal notation, or
> it could be just a string of digits used as a code or a label, or it
> could be someone's emoticon.
Right - and if you need to machine parse such texts, you might need
disambiguating information. However, the use case here is that the
display is fixed, and it's up to the user to make the distinction. If I
display (7 - [X], where [X] stands for the symbol that some convention
has associated with 8), then most ordinary users cannot override this by
knowing that [X] really looks like 8) in the text buffer and that (7 - 8) was
meant. (Only a small percentage of users of emoticons know more than a
few ASCII strings for them and few users in general know enough to
abstract from what they see to an underlying code stream).
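To make the (7 - 8) case concrete, here is a minimal sketch of the kind of
display-side substitution described above. Everything in it is an assumption
for illustration: the fallback table, the function names, the "[X]" stand-in
for the rendered symbol, and the use of the private-use code point U+E000 as
a placeholder for a hypothetical dedicated emoticon character.

```python
# Hypothetical fallback table: ASCII string -> displayed symbol.
# "[X]" stands in for whatever glyph the software would draw.
FALLBACKS = {"8)": "[X]"}

# Hypothetical dedicated character (a private-use code point used
# here only as a placeholder, not a real emoticon assignment).
DEDICATED = "\uE000"


def render_fallback(buffer: str) -> str:
    """Replace every fallback string with its symbol, regardless of
    what the author meant by those characters."""
    for ascii_form, symbol in FALLBACKS.items():
        buffer = buffer.replace(ascii_form, symbol)
    return buffer


def render_dedicated(buffer: str) -> str:
    """Replace only the dedicated character; ordinary digits and
    punctuation are displayed as they were typed."""
    return buffer.replace(DEDICATED, "[X]")


print(render_fallback("(7 - 8)"))             # arithmetic is mangled
print(render_dedicated("(7 - 8)"))            # arithmetic survives
print(render_dedicated("cool " + DEDICATED))  # emoticon still shown
```

With the fallback convention, the reader has no way to tell from the display
which of the two buffers was intended; with a dedicated code, the buffer
itself carries the distinction.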
>> Requiring out of band information for text content
>> is certainly not ideal. Therefore, if there were dedicated character
>> codes for emoticons (especially those using short, and therefore
>> commonly occurring, strings of punctuation marks as fallbacks) the
>> ability to use them as a unique way to encode common emoticons would
>> be a definite benefit.
> You could say the same, with _much_ higher practical motivation, about
> many other strings. For example, should we also have characters
> corresponding to commonly used strings with special meanings, such as
> "***", "1234" (commonly used generically to denote a 4-digit string),
> "---" or longer (to denote a horizontal line), or "//" or "°C" or
> "km"? Oops, some of these already exist as Unicode characters - do I
> need to say more?
Yes, because in none of these cases is there a display variation, except
where these symbols were coded as composites (historically, all of these
character codes derive from the East Asian practice of condensing them
into a single display cell - without the need for compatibility with
existing sets, they would not have been encoded).
> If you are saying that characters should be interpretable without out
> of band information, shouldn't you start worrying about individual
> characters, like "." (which could be a decimal point, a full stop, a
> separator of fields, or something else) and "I" (which could be a
> "normal" letter, or the uppercase equivalent of Turkish dotless i, or
> the roman numeral one)?
No, I'm not arguing for unlimited semantic encoding. Unicode's design
point is that the display on the receiving end can unambiguously convey
the intent of the author in terms of the identity and ordering of the
written symbols. In all your cases, the ambiguity is not in which signs
or symbols are conveyed, but in how to read the message on a higher level.
PS: Minor exceptions have been made for mathematical notation, to the
extent that it is possible to denote the intended meaning of writing
symbols adjacent to each other, by using invisible operators. I
supported that exception, because of the special nature and usage for
mathematical (near) plain text, but I also supported Unicode's very
early and very firm rejection of a "decimal point" character.