From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Sun Mar 13 2005 - 05:51:06 CST
On Sun, 13 Mar 2005 suzume@mx82.tiki.ne.jp wrote:
> I apologize for the level of the questions. If the place is not right
> I'd appreciate to get pointers to lists where I can get information.
As far as I have understood, it is OK to discuss issues of practical use
of Unicode as well, not just the development of the Unicode standard.
The charter at http://www.unicode.org/consortium/distlist.html
looks rather liberal to me.
> I have had issues with both since I realized that, contrary to unicode
> supporting OSX apps (TextEdit to give a simple example, but also most
> text editors on OSX) the above apps translate all the Japanese (and
> French non ascii characters) to non human readable entities that make
> direct editing of output files almost impossible.
As you explained in your reply to my question, "entities" refers to HTML
entities such as &eacute; and character references like &#1234;.
These are actually quite distinct concepts, though commonly confused with
each other in HTML tutorials - and even specifications! But what's
common is that they relate to markup languages - basically, SGML, XML, or
anything based on them, such as HTML. They are not at the character level
but "higher". They make sense in markup only, though you might sometimes
see them as erroneously generated in other contexts as well.
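
To make the distinction concrete, here is a minimal sketch in Python (my choice of language for illustration; it uses only the standard library's html module). An entity reference goes through a table of named entities, whereas a character reference states the code point number directly. The variable names are just illustrative.

```python
import html
from html.entities import name2codepoint

# An entity reference names a character through a declared entity.
entity_ref = "&eacute;"
# A character reference gives the Unicode code point number directly,
# in decimal or in hexadecimal.
decimal_ref = "&#233;"
hex_ref = "&#xE9;"

# All three notations resolve to the same character, U+00E9.
resolved = [html.unescape(s) for s in (entity_ref, decimal_ref, hex_ref)]

# The entity name is looked up in a table of known names;
# a character reference needs no such table.
eacute_codepoint = name2codepoint["eacute"]
```

Both notations are resolved only by something that parses markup; a plain text editor shows them literally.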
If you use a "Save as HTML" or "Save as Web page" or something similar in
a text processing program, it is quite possible that your Unicode
characters get stored as entity references or character references.
There is nothing wrong with this per se (in most cases; some programs
generate incorrect character references, though). But if you wish to edit
the document later, e.g. using Notepad, you will see the entity or
character references, and things really get awkward. (Occasional
occurrences of such references are not problematic, but ordinary French
text, not to mention Japanese text, is difficult to handle that way.)
> I seem to not understand the reality of what unicode is and I thus am
> stuck with files and no way to convert them to human readable output.
Probably the problem is not your understanding of Unicode but the issue of
entity and character references, which are completely external to Unicode,
though often used in markup to present Unicode characters. And perhaps the
most difficult part is how the various programs work. Some programs may
have options that control whether and how Unicode characters are replaced
by entity or character references. Moreover, they may have options for
setting the encoding of the HTML document, and this may affect the
situation.
I just tried OpenOffice on Windows, and created a document with e acute
and a kanji character, then used File/Save as with default settings. The
e acute was saved as &eacute; and the kanji character as a character
reference. The latter part is understandable, since the default encoding
in HTML documents created with OpenOffice is windows-1252, which contains
no kanji characters, so a character reference is really the only way.
There's no good explanation for &eacute; as far as I can see. If I change
the encoding to utf-8 (via Tools/Options/Load and save/HTML compatibility,
or something like that), then the kanji character is saved as such, utf-8
encoded. The &eacute; entity reference still appears!
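
The dependence on the document encoding can be sketched in Python. This is not how OpenOffice works internally (I do not know that); it only shows the principle, using Python's built-in "xmlcharrefreplace" error handler, which emits a character reference for any character the target encoding lacks. The sample kanji (U+6F22) is my own choice.

```python
# e acute followed by a kanji character (U+6F22)
text = "\u00e9\u6f22"

# windows-1252 contains e acute but no kanji,
# so only the kanji has to become a character reference.
as_cp1252 = text.encode("cp1252", errors="xmlcharrefreplace")

# utf-8 can represent both characters directly,
# so no references are needed at all.
as_utf8 = text.encode("utf-8", errors="xmlcharrefreplace")
```

In neither case is there any need to write e acute as an entity reference.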
> Why do those tools favor a non-human readable output form ? Is there a
> valid technical reason to do so ?
There might be. One of the reasons is that the document's encoding might
not allow all characters to be represented as such, and character
references offer a universal way to overcome such limitations.
But programs might also use such output form for no good reason.
On the other hand, if you represent all non-ASCII characters using
entity references or character references, you can use ASCII for any HTML
document, and in any case, your data will be "7-bit safe", i.e. it can
even be sent over a connection that does something nasty to octets with
the most significant bit set.
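
As a rough illustration of the "7-bit safe" point, the following Python sketch writes every non-ASCII character as a character reference and checks that no octet has its most significant bit set. The sample string is an assumption of mine.

```python
import html

text = "caf\u00e9 \u65e5\u672c"

# Every non-ASCII character becomes a decimal character reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

# No octet has the most significant bit set.
seven_bit_safe = all(b < 0x80 for b in ascii_bytes)

# Resolving the references recovers the original text.
round_trip = html.unescape(ascii_bytes.decode("ascii"))
```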
> Are there easy ways to convert from one to the other ?
There's Free Recode, http://recode.progiciels-bpi.ca/
which can perform an impressive amount of code conversions.
It can deal with HTML as well, see
http://recode.progiciels-bpi.ca/manual/HTML.html#HTML
but beware that it uses rather odd terminology: it refers to "HTML
charsets" when it actually means HTML format. (Normally "charset" means
character encoding, at the character level, without any notion of
entity references or character references.)
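
If installing Recode is not an option, a conversion in the readable direction can be sketched with Python's standard library instead. This is not Recode and makes no claim to match its behavior; make_readable is a hypothetical helper name. It resolves entity and character references but deliberately leaves alone those that stand for markup-significant characters, since turning &lt; into a literal "<" would change the markup.

```python
import html
import re

# Characters whose references must stay as they are,
# because resolving them would alter the markup itself.
_MARKUP_CHARS = {"<", ">", "&", '"', "'"}

# Decimal and hexadecimal character references, and named entity references.
_REF = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def make_readable(markup: str) -> str:
    """Replace references by the characters they denote,
    except for markup-significant ones such as &lt; and &amp;."""

    def resolve(match: re.Match) -> str:
        ref = match.group(0)
        char = html.unescape(ref)
        # Unknown names come back unchanged; keep those as written too.
        if char == ref or char in _MARKUP_CHARS:
            return ref
        return char

    return _REF.sub(resolve, markup)
```

The result must then be saved in an encoding that contains all the characters, such as utf-8, and the encoding declared in the document must agree with it.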
> Are there other forms a unicode character can take ?
Sure, in the same sense: in addition to encodings such as utf-8 and utf-7,
there are coding systems at another level, but they all depend on the data
format. I have presented some samples of such notations at
http://www.cs.tut.fi/~jkorpela/chars.html#esc
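
A few such notations for one and the same character, e acute, can be produced with Python's standard library. The selection below is only a sample that I picked for illustration; each notation is meaningful only within its own data format.

```python
import json
from urllib.parse import quote

ch = "\u00e9"  # e acute

notations = {
    # HTML or XML character reference
    "html_char_ref": ch.encode("ascii", "xmlcharrefreplace").decode("ascii"),
    # JSON string escape (the surrounding quotes are part of JSON syntax)
    "json_escape": json.dumps(ch),
    # Python string literal escape
    "python_escape": ch.encode("unicode_escape").decode("ascii"),
    # URL percent-encoding of the utf-8 octets
    "url_percent": quote(ch),
}
```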
> When I started as a html writer, about 10 years ago, I used to convert
> my French accented letters to html entities to "make sure" that they'd
> be displayed properly.
That's a widespread idea, but it never had a very good basis, and nowadays
even less so. It used to be relevant (and might still occasionally be
relevant) if you author, say, in a Mac environment and upload your
documents onto a server that runs Unix. It might then happen that the
software you use for uploading performs a wrong character encoding
conversion (or doesn't do a conversion when it should). But if you have
used only ASCII characters (and e.g. written accented letters using entity
references), then no such conversion is needed, and no conceivable
conversion will harm you either, since conversions would leave ASCII
characters intact.
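
The point about ASCII surviving conversions can be checked with a small Python sketch. The encodings listed are just examples I chose; the second half shows what goes wrong when an e acute written directly in a Mac encoding is read as ISO Latin 1 without conversion.

```python
# Pure ASCII data, with the accented letter written as an entity reference.
ascii_text = "caf&eacute;"
raw = ascii_text.encode("ascii")

# The same octets mean the same characters in all of these encodings.
decodings = {
    enc: raw.decode(enc)
    for enc in ("mac_roman", "latin-1", "cp1252", "utf-8")
}

# By contrast, a directly written e acute in Mac Roman is octet 0x8E,
# which ISO Latin 1 interprets as a control character, not as e acute.
mac_bytes = "caf\u00e9".encode("mac_roman")
misread = mac_bytes.decode("latin-1")
```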
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/