Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again) from Naena Guru on 2012-07-15 (Unicode Mail List Archive)

From: Naena Guru <naenaguru_at_gmail.com>
Date: Sun, 15 Jul 2012 14:50:00 -0500

Hey, Philippe,

Your input is much appreciated. So, in a nutshell, I don't have to worry.
One of these days I need to crunch down (minify) the CSS and JavaScript
pages. I left them readily readable so that techs like you could easily
read them in place in any browser without having to pretty print. The pages
are not big by any standard and they download pretty fast. Your earlier
point about WOFF is what I am going to try and tackle today (Sunday).

In the meantime, thanks again.

On Tue, Jul 10, 2012 at 11:32 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2012/7/10 Naena Guru <naenaguru_at_gmail.com>
>
>> I wanted to see how hard it is to edit a page in Notepad. So I made a
>> copy of my LIYANNA page and replaced the character entities I used for
>> Unicode Sinhala, accented Pali and Sanskrit with their raw letters. Notepad
>> forced me to save the file in UTF-8 format. I ran it through W3C Validator.
>> It passed HTML5 test with the following warning:
>>
>> Warning: Byte-Order Mark found in UTF-8 File.
>>
>> The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to
>> cause problems for some text editors and older browsers. You may want to
>> consider avoiding its use until it is better supported.
>>
>> The BOM is the first character of the file. There are myriad hoops that
>> non-Latin users go through to do things that we routinely do. This problem
>> I saw right at the inception. I already know why romanizing is so good.
>> Don't you?
>>
>
> You should probably ignore this non-critical warning now; it only matters
> for extremely strict compatibility with deprecated software that should
> have been updated long ago for obvious security and performance reasons.
>
> Those old browsers are being deprecated fast, due to the massive and fast
> spread of security attacks and to automatic security updates that close
> issues completely (instead of relying only on preventive virus detection
> based on code behavior or code patterns, which will never be complete or
> fast enough to react to these extremely frequent attacks).
>
> Older editors do not have the comfort that newer editors have. The memory
> usage of these newer editors is no longer a problem (notably for web
> developers, whose systems are largely above what their average users
> have), and systems capable of running them have never been so cheap. In
> addition, memory and storage costs have dramatically decreased.
>
> We are more concerned about bandwidth usage, so your web editing
> platform should include an optimisation process and converters that will
> automatically use a compact representation (numeric character references,
> for example, can be sent by your server as raw UTF-8). In addition, the
> server can now support on-the-fly data compression over HTTP sessions;
> there also exist frontend proxies that will do that for you without
> requiring you to change the development/editing methods you use.
>
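The conversion Philippe describes (numeric character references sent as raw UTF-8, then compressed in transit) might look roughly like the Python sketch below. The function name `expand_ncrs` and the sample markup are hypothetical, made up for illustration; they are not taken from the LIYANNA page or from any particular server. ASCII references such as `&#60;` are deliberately left alone so the markup stays valid.

```python
import gzip
import re

# Matches decimal (&#3523;) and hexadecimal (&#x0DC3;) character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def expand_ncrs(markup: str) -> str:
    """Replace non-ASCII numeric character references with raw characters."""
    def repl(match: re.Match) -> str:
        cp = int(match.group(1), 16) if match.group(1) else int(match.group(2))
        # Keep ASCII references (e.g. &#60; for "<") and invalid code points
        # untouched so the HTML structure is not changed.
        if cp < 0x80 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return match.group(0)
        return chr(cp)
    return _NCR.sub(repl, markup)

# Hypothetical sample: five Sinhala letters written as references.
sample = "<p>&#3523;&#3538;&#3458;&#3524;&#3517; &#60;b&#62;</p>"
expanded = expand_ncrs(sample)
raw = expanded.encode("utf-8")

# Each Sinhala letter is 7 bytes as a reference but 3 bytes as raw UTF-8.
print(len(sample), "->", len(raw), "bytes")

# What on-the-fly HTTP compression would do to the body. A sample this
# short will not shrink; real pages usually do.
compressed = gzip.compress(raw)
```

In practice a web server or frontend proxy would normally handle the compression step itself; the `gzip` call is only there to show the idea.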
> Most text editors, even on Linux, can now successfully open UTF-8 files
> starting with a BOM without complaining, just as Notepad has long done.
> And they allow you to change this edit mode before saving.
>
> Most text processors will silently discard the U+FEFF character (it should
> be safe to do that everywhere, given that U+FEFF should no longer be used
> for anything other than a BOM).
>
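As a minimal sketch of what "silently discarding" a leading U+FEFF can mean for a UTF-8 file, assuming Python and its standard `codecs` module (the helper name `strip_utf8_bom` is made up for illustration):

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A file as Notepad would save it: BOM first, then the text.
word = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd"
with_bom = codecs.BOM_UTF8 + word.encode("utf-8")

stripped = strip_utf8_bom(with_bom)

# The standard "utf-8-sig" codec drops the BOM while decoding, and
# also accepts input that has no BOM at all.
decoded = with_bom.decode("utf-8-sig")
```

To write a file without the BOM, encoding with plain `"utf-8"` is enough; `"utf-8-sig"` would add it back.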
> [side note]
> But Notepad has long had another problem: it cannot successfully open
> a text file whose lines are terminated by LF only; it absolutely wants
> them converted to CR+LF sequences. This problem is much more severe
> than the use of a leading BOM.
> As well, Excel cannot successfully decode a UTF-8 encoded CSV file,
> but it can automatically recognize it if you use the "import data"
> function instead. This is inconsistent (also, it still does not allow
> specifying how to convert numbers using dots instead of commas when
> running it in a non-English user locale, so you need to manually use a
> search/replace function; it does not allow selecting the date format for
> CSV file imports, and search/replace operations are not trivial on date
> fields; no question is asked to the user, and it only uses implicit
> defaults even when they are wrong, which is most of the time for actual
> CSV files).
> [/side note]
>
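For the LF-only problem in the side note, the usual workaround is to convert line endings before opening the file in Notepad. A minimal sketch in Python, with `to_crlf` as a hypothetical helper name:

```python
import re

def to_crlf(text: str) -> str:
    """Convert bare LF line endings to CR+LF, leaving existing CR+LF alone."""
    # The lookbehind skips any LF that is already preceded by CR.
    return re.sub(r"(?<!\r)\n", "\r\n", text)

mixed = "first\nsecond\r\nthird\n"
converted = to_crlf(mixed)

# When writing the result to disk, open the file with newline="" so that
# Python does not translate the line endings a second time on Windows:
#     open(path, "w", encoding="utf-8", newline="").write(converted)
```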
> But it has nothing to do with your problem of romanization or behavior
> with Latin. BOMs are only absent from old 8-bit character sets that are no
> longer recommended in any modern Internet protocol, and from 7-bit ASCII,
> used only for internal technical data but not for any text intended to be
> read and translated.
>
> Only UTF-8 support is mandatory now. And that's fine. HTTP headers or URLs
> require a specific encoding, but web servers and design tools can take
> care of that.
>
> Everything else is optional and will require explicit metadata (the
> exceptions being UTF-16 and UTF-32, which are not well suited for
> interchange across heterogeneous networks and independent realms, but are
> used mostly for internal processes, for which you absolutely don't need
> any byte order change, and so don't even need a BOM. If there is one, you
> can safely discard it from the input strings, adjusting the length and
> offset positions in the source if that source is randomly seekable; you
> don't need to adjust these lengths and/or positions if the source is a
> serial input stream which is not seekable in the backward direction or
> randomly seekable in the forward direction in a fast direct manner without
> reading all intermediate positions).
>
>
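The last point, discarding a BOM from a forward-only input stream, could be sketched as follows in Python. The helper `skip_bom` is hypothetical; it returns the number of bytes skipped, which is also the amount by which offsets into a seekable source would need adjusting.

```python
import codecs
import io

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE). UTF-16 LE text whose first
# character is U+0000 would be misread; that case is ignored here.
_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)

def skip_bom(stream: io.BufferedReader) -> int:
    """Consume a leading BOM, if any, and return how many bytes were skipped."""
    # peek() looks ahead without consuming, so no backward seek is needed.
    # It may return fewer than 4 bytes on some sources (e.g. pipes), in
    # which case a BOM could be missed; this sketch does not retry.
    head = stream.peek(4)[:4]
    for bom in _BOMS:
        if head.startswith(bom):
            stream.read(len(bom))
            return len(bom)
    return 0

source = io.BufferedReader(
    io.BytesIO(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le"))
)
skipped = skip_bom(source)
rest = source.read()
```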
Received on Sun Jul 15 2012 - 18:56:55 CDT
