Re: UTF-8 BOM (Re: Charset declaration in HTML)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 12 Jul 2012 16:14:50 +0200

2012/7/12 Steven Atreju <snatreju_at_googlemail.com>:
> UTF-8 is a bytestream, not multioctet(/multisequence).
Not even. UTF-8 is a text stream, not made of arbitrary sequences of
bytes. It has a lot of internal semantics and constraints. Some things
are very meaningful, some play absolutely no role at all and could
even be discarded from digital signature schemes (this includes
ignoring BOMs wherever they are, and ignoring the encoding effectively
used in checksum algorithms, whose first step will be to unify
and canonicalize the encoding into a single internal form before
processing).
The effective binary encoding of text streams should NOT play any
semantic role: all UTFs should be completely equivalent at the text
interface. The low-level bytestream is definitely not suitable for
handling text and should not play any role in any text parser or
collator.
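
To illustrate the point about checksums, here is a minimal sketch in
Python of a digest computed on the text rather than on the bytes. The
function name text_digest, the choice of SHA-256, and the choice of
BOM-less UTF-8 as the single internal form are all assumptions made for
this example, not part of any standard scheme. It also assumes the
caller already knows the source encoding, and it does not apply Unicode
normalization.

```python
import hashlib

BOM = "\ufeff"  # U+FEFF, the code point used as byte order mark


def text_digest(data: bytes, encoding: str) -> str:
    """Hash the text carried by `data`, not its byte serialization.

    Hypothetical helper: decode to code points, drop every U+FEFF
    (wherever it appears), then re-encode into one canonical form
    (BOM-less UTF-8, an arbitrary choice) before hashing.
    """
    text = data.decode(encoding)
    text = text.replace(BOM, "")
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


sample = "na\u00efve text \U0001F600"

# Three different byte serializations of the same text.
as_utf8_bom = sample.encode("utf-8-sig")   # UTF-8 with a leading BOM
as_utf16 = sample.encode("utf-16")         # UTF-16 with a leading BOM
as_utf32_be = sample.encode("utf-32-be")   # UTF-32 big-endian, no BOM

digests = {
    text_digest(as_utf8_bom, "utf-8"),
    text_digest(as_utf16, "utf-16"),
    text_digest(as_utf32_be, "utf-32-be"),
}
print(len(digests) == 1)
```

The three byte sequences differ, but once each is decoded and brought
to the same internal form the digests agree, which is the sense in
which the encoding plays no role at the text level.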
Received on Thu Jul 12 2012 - 09:16:43 CDT
