Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again) from Naena Guru on 2012-07-10 (Unicode Mail List Archive)

From: Naena Guru <naenaguru_at_gmail.com>
Date: Tue, 10 Jul 2012 01:40:19 -0500

Thank you Otto.

Sorry for the delay in replying. I spent the entire Sunday replying to the
Jaques twins.

You are absolutely right about the choice between ISO-8859-1 and UTF-8. I
shouldn't have said 'using ISO-8859-1 is advantageous over UTF-8'. It is
efficient if your pages are written in a language whose characters each fit
in a single byte. When you mix in multi-byte characters, like you said, the
ideal is to have them in their raw form. But in practice, this is not as
easy as we think.

Actually, the trade-off is not great for me because I use only a few
non-SBCS characters. Each 2-byte character would end up as six to eight
bytes in a hex character entity. If you want to control the look of your web
site, then you probably have to have expensive software to do it. As for
poor me, I use CSS, JavaScript and HTML inside HTML-Kit.
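
To put rough numbers on this trade-off, here is a minimal Python sketch that
counts the bytes needed for one character written raw in UTF-8 versus as a
numeric character reference. The function name `sizes` and the two sample
characters are chosen only for illustration, and the hex form is assumed to
be written without leading zeros.

```python
# Compare the storage cost of one character written raw in UTF-8
# versus as a numeric character reference (NCR) in the HTML source.

def sizes(ch: str) -> dict:
    """Return byte counts for three ways of writing a single character."""
    code_point = ord(ch)
    hex_ncr = f"&#x{code_point:X};"   # shortest hex form, e.g. &#xD85;
    dec_ncr = f"&#{code_point};"      # decimal form, e.g. &#3461;
    return {
        "utf8": len(ch.encode("utf-8")),
        "hex_ncr": len(hex_ncr.encode("ascii")),
        "dec_ncr": len(dec_ncr.encode("ascii")),
    }

ayanna = sizes("\u0d85")    # Sinhala letter ayanna, U+0D85
e_acute = sizes("\u00e9")   # Latin-1 letter e with acute, U+00E9

print("ayanna :", ayanna)
print("e-acute:", e_acute)
```

With only a handful of such characters on a page, the extra bytes of the
entity form stay small in absolute terms.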

HTML5 assumes UTF-8 as the character set if you do not declare one
explicitly. My current pages are in HTML 4.

As I said, I use HTML-Kit (and Tools). If I have raw Unicode Sinhala in the
HTML or JavaScript, it garbles the letters and you get character-not-found
for them on the web page. I must have character entities if I need the
comfort of HTML-Kit. There are web sites that help you process your mixed
SBCS and multi-byte text to make character entities for non-Latin-1
characters. I used them when making my only page that has them (Liyanna).
Stop and think why there are such websites. (Search for 'text to unicode'.)
The world outside Latin-1 is a harsh one.
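
The kind of conversion those web sites perform can be sketched in a few
lines of Python. This is only an illustration under stated assumptions, not
how any particular site works: the helper name `to_entities` is made up, and
it emits decimal rather than hex references because that is what Python's
built-in `xmlcharrefreplace` error handler produces.

```python
def to_entities(text: str) -> str:
    """Replace every character outside Latin-1 with a decimal NCR.

    Characters that Latin-1 can represent are left untouched.
    """
    encoded = text.encode("latin-1", errors="xmlcharrefreplace")
    return encoded.decode("latin-1")

sample = "caf\u00e9 \u0d85"       # 'e with acute' stays, ayanna is replaced
converted = to_entities(sample)
print(converted)
```

The result contains only Latin-1 characters, so an editor limited to
single-byte text can open and save it safely.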

If I want to have raw Unicode Sinhala, PTS Pali or IAST Sanskrit, I have to
use Notepad instead of HTML-Kit. It is hard to code without color-coded
text.

I wanted to see how hard it is to edit a page in Notepad. So I made a copy
of my LIYANNA page and replaced the character entities I used for Unicode
Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced
me to save the file in UTF-8 format. I ran it through the W3C Validator. It
passed the HTML5 test with the following warning:

Warning: Byte-Order Mark found in UTF-8 File.

The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause
problems for some text editors and older browsers. You may want to consider
avoiding its use until it is better supported.

The BOM (the bytes EF BB BF in UTF-8) is the first character of the file.
There are myriad hoops that non-Latin users go through to do things that we
routinely do. I saw this problem right at the start. I already know why
romanizing is so good. Don't you?
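
For anyone who hits the same validator warning, here is a minimal Python
sketch of detecting and removing a leading UTF-8 BOM. The helper name
`strip_utf8_bom` and the sample bytes are hypothetical. `codecs.BOM_UTF8`
and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Simulate a file saved by an editor that prepends a BOM.
with_bom = codecs.BOM_UTF8 + "<!DOCTYPE html>".encode("utf-8")
clean = strip_utf8_bom(with_bom)

# Alternative: the 'utf-8-sig' codec drops the BOM while decoding.
decoded = with_bom.decode("utf-8-sig")

print(with_bom[:3].hex(), "->", clean[:3].hex())
print(decoded)
```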

The UTF-8 encoding is specified in this RFC (since obsoleted by RFC 3629,
which limits UTF-8 to four bytes per character):
http://www.ietf.org/rfc/rfc2279.txt

This is the table it gives on the way UTF-8 encoding works:

UCS-4 range (hex)     UTF-8 octet sequence (binary)
0000 0000-0000 007F   0xxxxxxx                             <== ASCII
0000 0080-0000 07FF   110xxxxx 10xxxxxx                    <== Latin-1 plus higher
0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx           <== Unicode Sinhala
0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx

Observe that Latin 'a' goes from two bytes in UCS-2 to a single byte in
UTF-8, while Unicode Sinhala ayanna goes from two to three.
Unicode Sinhala: 0D80 - 0DFF
a = Hex 61 = Bin 0110 0001 ->
UTF-8 Template: 0xxxxxxx
UTF-8 Encoding: 01100001 = Hex 61

ayanna = Hex 0D85 = Bin 0000 1101 1000 0101 ->
UTF-8 Template: 1110xxxx 10xxxxxx 10xxxxxx
UTF-8 Encoding: 11100000 10110110 10000101 = Hex E0 B6 85
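
The bit templates can be checked mechanically. The following is a minimal
Python sketch, assuming only the first three rows of the table matter here.
The function name `utf8_encode` is made up for illustration, and the sketch
does not reject surrogate code points as a real encoder would. It fills the
templates by hand and compares the result with Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the UTF-8 bit templates by hand."""
    if code_point < 0x80:
        # 0xxxxxxx: seven payload bits, one byte
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx: five bits, then six bits
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: four bits, then six, then six
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("only the first three rows of the table are handled")

for cp in (0x61, 0xE9, 0x0D85):   # 'a', e with acute, Sinhala ayanna
    by_hand = utf8_encode(cp)
    built_in = chr(cp).encode("utf-8")
    print(f"U+{cp:04X}: {by_hand.hex(' ').upper()}  matches built-in: {by_hand == built_in}")
```

Under these assumptions, U+0061 falls in the first row and U+0D85 in the
third, which is why the first takes one byte and the second takes three.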

Thanks for your input. It is appreciated.

On Wed, Jul 4, 2012 at 2:25 PM, Otto Stolz <Otto.Stolz_at_uni-konstanz.de> wrote:

> Hello Naena Guru,
>
> on 2012-07-04, you wrote:
>
>> The purpose of
>> declaring the character set as iso-8859-1 than utf-8 is to avoid doubling
>> and trebling the size of the page by utf-8. I think, if you have
>> characters outside iso-8859-1 and declare the page as such, you get
>> Character-not-found for those locations. (I may be wrong).
>>
>
> You are wrong, indeed.
>
> If you declare your page as ISO-8859-1, every octet
> (aka byte) in your page will be understood as a Latin-1
> character; hence you cannot have any other character
> in your page. So, your notion of “characters outside
> iso-8859-1” is completely meaningless.
>
> If you declare your page as UTF-8, you can have
> any Unicode character (even PUA characters) in
> your page.
>
> Regardless of the charset declaration of your page,
> you can include both Numeric Character References
> and Character Entity References in your HTML source,
> cf., e.g., <http://www.w3.org/TR/html401/charset.html#h-5.3>.
> These may refer to any Unicode character, whatsoever.
> However, they will take considerably more storage space
> (and transmission bandwidth) than the UTF-8 encoded
> characters would take.
>
> Good luck,
> Otto Stolz
>
>
>
Received on Tue Jul 10 2012 - 01:41:52 CDT
