Re: Unicode in source

From: G. Adam Stanislav ([email protected])
Date: Thu Jul 22 1999 - 22:07:17 EDT

Next message: John Cowan: "Re: German penny symbol"
Previous message: John Cowan: "Re: Unicode in source"
Maybe in reply to: G. Adam Stanislav: "Re: Unicode in source"
Next in thread: G. Adam Stanislav: "Re: Unicode in source"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

On Thu, Jul 22, 1999 at 05:04:25PM -0700, Kenneth Whistler wrote:
> Adam,
>
> > What's a BOM? I tried to look it up on AltaVista and got a lot of religious
> > references in what seemed like Portuguese ("Bom Jesus"). I suspect that is
> > not what you are talking about. :-)
>
> Byte Order Mark (one of the functions of U+FEFF).

Thank you (and everyone else who replied). I am quite familiar with the
function; I just did not recognize the acronym.

> I know people have taken to expecting a search on AltaVista to provide the
> answer to anything they might want to know, but users of the Unicode Standard
> might want to consider consulting the standard itself. "BOM" is defined
> in the glossary of the Unicode Standard, Version 2.0, on page G-1.

No, I did not expect much from AltaVista in this case (and my Unicode Standard
book was about two miles from me at the time I was asking), but I thought
I'd give it a try before asking.

As for the UTF-8 topic, the original discussion was about the use of Unicode
in programming languages, whether it should be used, and whether it was even
realistic. I think it would be great if programming languages were extended
to use Unicode as an OPTION. It is hard, perhaps impossible, for many existing
languages to decree that from now on all code be written in Unicode.

It is possible to add Unicode support without making such drastic changes
by making compilers recognize UTF-8. Because plain ASCII text is already
valid UTF-8, such a compiler can process code both from programmers who opt
not to use Unicode and from those who would welcome it.

That other encodings may be preferable in other uses of Unicode, I have never
disputed. It is completely irrelevant that I misunderstood your use of the
term TCHAR or mistook UTF-16 for UCS-2 (and I am not aiming this at you, Ken;
I have received so many other replies in the last hour or so, some of them
bordering on hostile).

All I am trying to say is that adding Unicode support to existing programming
languages is doable, and that UTF-8 seems (to me) the path of least resistance:
it does not break existing code, and it does not force anyone to use Unicode
who does not want to.

It would also seem (again, to me; I am just expressing my personal opinion,
not trying to start a Holy War, and again, not aiming this at you, Ken) that
the best solution for a compiler adding such UTF-8 support is to convert the
UTF-8 source code into 32-bit values (possibly 16-bit under Windows, to take
advantage of existing system libraries) before it even reaches the lexer.
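
To illustrate the idea of converting the source before the lexer sees it, here
is a minimal sketch in Python. The function name decode_utf8 is made up for
this example, and it assumes the present-day definition of UTF-8 (sequences of
at most four bytes, nothing above U+10FFFF); it is not taken from any actual
compiler.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of integer code points.

    Hypothetical pre-lexer step: the lexer would then work on fixed-width
    values instead of variable-length byte sequences.
    """
    # Smallest code point that legitimately needs N continuation bytes;
    # anything smaller is an overlong encoding and is rejected.
    min_value = {1: 0x80, 2: 0x800, 3: 0x10000}
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # Plain ASCII passes through unchanged, one byte per character.
            out.append(lead)
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:
            cp, extra = lead & 0x1F, 1
        elif 0xE0 <= lead <= 0xEF:
            cp, extra = lead & 0x0F, 2
        elif 0xF0 <= lead <= 0xF7:
            cp, extra = lead & 0x07, 3
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        tail = data[i + 1:i + 1 + extra]
        if len(tail) != extra or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("truncated or malformed sequence at offset %d" % i)
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp < min_value[extra] or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("invalid code point at offset %d" % i)
        out.append(cp)
        i += 1 + extra
    return out


if __name__ == "__main__":
    source = "na\u00efve = 1".encode("utf-8")
    print(decode_utf8(source))
```

An ASCII-only source file goes through this step untouched, which is why
existing code would keep compiling.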

As for the problem of the same character having different possible Unicode
representations, it seems (TO ME) that using decomposition of such characters
would be the path of least resistance (as opposed to trying to find the
composed character). This would make the resultant identifiers longer, but
it can be done considerably faster than the alternative.
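
As a rough illustration of the decomposition approach, the sketch below uses
Python's standard unicodedata module. It assumes that "decomposition" means
canonical decomposition (normalization form NFD), which is my reading rather
than something stated above, and canonical_identifier is a hypothetical helper
name.

```python
import unicodedata


def canonical_identifier(name):
    """Return the identifier with every character canonically decomposed.

    Assumption: NFD is the decomposition meant here. Two spellings of the
    same identifier then compare equal, at the cost of a longer string.
    """
    return unicodedata.normalize("NFD", name)


if __name__ == "__main__":
    composed = "caf\u00e9"      # e-acute as one precomposed character
    decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent
    print(composed == decomposed)
    print(canonical_identifier(composed) == canonical_identifier(decomposed))
    print(len(composed), len(canonical_identifier(composed)))
```

The two spellings differ as raw strings but are identical after
decomposition, and the decomposed identifier is one code point longer.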

At no time was I suggesting that the databases of the Shaolin Temple,
or those of secret government agencies, should all be in UTF-8, or, for
that matter, that if programmers in Japan create a new language that
is all in Kanji and/or Kana, it must be in UTF-8.

We have different encodings because we have different needs. My discussion
was about a very specific need - that of adding Unicode support to
existing programming languages, such as C, Pascal, NASM. I believe
that you and I (and Markus and John Cowan) are all on the same
wavelength here, even if we may each be emphasizing slightly different
aspects of the question.

Two of my current projects involve the possibility of developing a
new language (one for each project). If and when it comes to that,
I shall make them both UTF-8 aware at the source code level and will
make them decompose accented characters inside the lexer, provided
the linkers can handle that (of course, if they cannot, I can always
have the compiler change the identifiers to seven bits before
producing the assembly output or the object code).
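
One way the seven-bit fallback could look is sketched below in Python. The
escape format "_uXXXX_" and the name mangle_identifier are invented for this
example; no particular mangling scheme is implied by the text above, and a
real one would also have to guard against ASCII identifiers that already
contain such an escape.

```python
import unicodedata


def mangle_identifier(name):
    """Rewrite an identifier using only seven-bit characters.

    Hypothetical scheme: decompose first (NFD, an assumption), then replace
    every non-ASCII code point with '_u' + hex digits + '_'.
    """
    parts = []
    for ch in unicodedata.normalize("NFD", name):
        if ord(ch) < 0x80:
            parts.append(ch)                      # ASCII is kept as-is
        else:
            parts.append("_u%04X_" % ord(ch))     # escape everything else
    return "".join(parts)


if __name__ == "__main__":
    print(mangle_identifier("count"))
    print(mangle_identifier("caf\u00e9"))
```

Because the name is decomposed before escaping, the precomposed and the
decomposed spelling of an identifier end up as the same linker symbol.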

I enjoyed my discussion with you,

Adam


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:48 EDT