Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again) from Philippe Verdy on 2012-07-10 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 11 Jul 2012 06:32:33 +0200

2012/7/10 Naena Guru <naenaguru_at_gmail.com>

> I wanted to see how hard it is to edit a page in Notepad. So I made a copy
> of my LIYANNA page and replaced the character entities I used for Unicode
> Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced
> me to save the file in UTF-8 format. I ran it through W3C Validator. It
> passed HTML5 test with the following warning:
>
> [image: Warning] Byte-Order Mark found in UTF-8 File.
>
> The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause
> problems for some text editors and older browsers. You may want to consider
> avoiding its use until it is better supported.
>
> The BOM is the first character of the file. There are myriad hoops that
> non-Latin users go through to do things that we routinely do. This problem
> I saw right at the inception. I already know why romanizing is so good.
> Don't you?
>

You should probably ignore this non-critical warning now; it only matters
for extremely strict compatibility with obsolete software that should have
been updated long ago, for obvious security and performance reasons.

Those old browsers are disappearing fast, because of the massive and rapid
spread of security attacks and because automatic security updates now close
issues completely (instead of relying only on preventive virus detection
based on code behavior or code patterns, which will never be complete or
fast enough to react to these extremely frequent attacks).

Older editors do not have the comfort that newer editors have. The memory
usage of these newer editors is no longer a problem (notably for web
developers, whose systems are well above what their average users have),
and systems capable of running them have never been so cheap. In addition,
memory and storage costs have decreased dramatically.

We are more concerned about bandwidth usage, so your web editing
platform should include an optimisation process and converters that will
automatically use a compact representation (numeric character references,
for example, can be sent by your server as raw UTF-8). In addition, the
server can now support on-the-fly data compression over HTTP sessions;
there also exist frontend proxies that will do that for you without
requiring you to change the development/editing methods you use.
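
As a rough sketch of such an optimisation step (not any particular tool),
the Python snippet below replaces numeric character references with the raw
characters, leaves the markup-significant ones (<, >, &, quotes) escaped,
and then gzip-compresses the resulting UTF-8 bytes. The function name
ncr_to_raw and the sample string are invented for illustration.

```python
import gzip
import re

# Characters that must stay escaped so the markup keeps its meaning.
_KEEP_ESCAPED = set('<>&"\'')

# Matches decimal (&#3517;) and hexadecimal (&#x0DBD;) character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def ncr_to_raw(html_text):
    """Replace numeric character references with the characters they denote."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave invalid code points and surrogates untouched.
        if code == 0 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        char = chr(code)
        return match.group(0) if char in _KEEP_ESCAPED else char
    return _NCR.sub(repl, html_text)


# Hypothetical page fragment written with character references.
sample = "<p>&#3517;&#3538;&#3514;&#3505;&#3530;&#3505; &#x26; &#60;b&#62;</p>"
raw = ncr_to_raw(sample)
utf8_bytes = raw.encode("utf-8")        # what the server would send
compressed = gzip.compress(utf8_bytes)  # on-the-fly compression, simulated
```

For this sample the raw UTF-8 form is shorter than the entity form; on a
string this small gzip adds more overhead than it saves, so compression
only pays off on real pages.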

Most text editors, even on Linux, can now successfully open UTF-8 files
starting with a BOM without complaining, just as Notepad has long done.
And they let you change this edit mode before saving.

Most text processors will silently discard the U+FEFF character (it should
be safe to do that everywhere, given that U+FEFF should no longer be used
for anything other than a BOM).
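
To illustrate, here is a minimal Python sketch of discarding the BOM when
reading a file saved the way Notepad saves it. The helper names and the
sample bytes are made up for the example; the "utf-8-sig" codec and
codecs.BOM_UTF8 are part of the standard library.

```python
import codecs

BOM = "\ufeff"


def strip_leading_bom(text):
    """Drop a single U+FEFF at the very start of the text, if present."""
    return text[1:] if text.startswith(BOM) else text


def discard_all_bom(text):
    """Drop every U+FEFF, assuming it is no longer used for anything else."""
    return text.replace(BOM, "")


# Bytes as an editor that writes a BOM would save them (sample content).
data = codecs.BOM_UTF8 + "<!DOCTYPE html>".encode("utf-8")

decoded_plain = data.decode("utf-8")    # the BOM survives as U+FEFF
decoded_sig = data.decode("utf-8-sig")  # this codec drops a leading BOM
```

Decoding with "utf-8-sig" is usually the simplest choice, since it also
accepts files that have no BOM at all.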

[side note]
But Notepad has long had another problem: it cannot successfully open a
text file whose lines are terminated by LF only; it absolutely wants them
to be converted to CR+LF sequences. This problem is much more severe
than the use of a leading BOM.
Likewise, Excel cannot successfully decode a UTF-8 encoded CSV file when
opening it directly, but it can automatically recognize the encoding if you
use the "import data" function instead. This is inconsistent. (Also, it
still does not allow specifying how to convert numbers that use dots
instead of commas when running in a non-English user locale, so you need
to use a search/replace function manually; it does not allow selecting the
date format for CSV file imports, and search/replace operations are not
trivial on date fields; no question is asked of the user, and it only uses
implicit defaults even when they are wrong, which is most of the time for
actual CSV files.)
[/side note]

But it has nothing to do with your problem of romanization or behavior with
Latin. BOMs are absent only from old 8-bit character sets, which are no
longer recommended in any modern Internet protocol, and from 7-bit ASCII,
which is used only for internal technical data and not for any text
intended to be read and translated.

Only UTF-8 support is mandatory now. And that's fine. HTTP headers or URLs
require a specific encoding, but web servers and design tools can take
care of that.

Everything else is optional and will require explicit metadata (the
exceptions being UTF-16 and UTF-32, which are not well suited for
interchange across heterogeneous networks and independent realms, but are
used mostly for internal processes, for which you absolutely don't need any
byte order change, and so for which you don't even need a BOM. If there is
one, you can safely discard it from the input strings, adjusting the length
and offset positions in the source if that source is randomly seekable; you
don't need to adjust these lengths and/or positions if the source is a
serial input stream that is neither seekable in the backward direction nor
randomly seekable in the forward direction in a fast, direct manner without
reading all intermediate positions).
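
As a sketch of that last point, and only under the assumption that the
input is available as a buffered byte stream, the Python function below
consumes a leading BOM and reports its length so that a caller working on a
seekable source can adjust its lengths and offsets. The name skip_bom and
the sample payload are hypothetical; the BOM constants come from the
standard codecs module.

```python
import codecs
import io

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def skip_bom(stream):
    """Consume a leading BOM, if any; return (encoding or None, bom_length)."""
    head = stream.peek(4)[:4]  # look ahead without advancing the stream
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            stream.read(len(bom))  # discard the BOM bytes
            return encoding, len(bom)
    return None, 0


# Sample input: UTF-16 LE text preceded by its BOM.
payload = codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")
reader = io.BufferedReader(io.BytesIO(payload))

encoding, bom_len = skip_bom(reader)
text = reader.read().decode(encoding)

# For a seekable source, content offsets are shifted by the BOM length.
content_length = len(payload) - bom_len
```

A purely serial consumer can ignore bom_len and simply keep reading from
the stream after the call.
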
Received on Tue Jul 10 2012 - 23:38:32 CDT
