Re: (Informational only: UTF-8 BOM and the real life)

From: Steven Atreju <snatreju_at_googlemail.com>
Date: Fri, 27 Jul 2012 23:22:20 +0200

"Doug Ewell" <doug_at_ewellic.org> wrote:

 |As a programmer, I can attest that we are no more receptive to being
 |called "duds" than any other professionals. Constructive suggestions
 |focused on the end product, instead of the competence of the person,
 |might get a response.

You're of course right. The tone was rude.

 |Steven Atreju wrote:
 |
 |> Well, i still see a bug in the Unicode Standard here.
 |> Whereas for the multioctet UTFs there is «The BOM is not
 |> considered part of the content of the text» (Conformance, 3.10,
 |> D98, D101), i cannot find any such clarifying text for it's usage
 |> as a signature.
 |
 |There really isn't as much difference between using U+FEFF "as a byte
 |order mark" and using it "as a signature" as this makes it seem. The
 |definitions you quote have to do with whether U+FEFF is treated as a
 |BOM/signature or as a zero-width no-break space.

I don't understand what you are saying here.
And i do think that more people are uncertain about whether this
has been left out intentionally (which i personally would assume,
given the assumed caliber of the people involved and the length of
time that the standard has existed and been reviewed).

I really think that a clarification in equal spirit to those of
D98 and D101 (but maybe with different content :) would be an
improvement of the Unicode Standard.

Once more i want to point out that on Unix/POSIX systems the file
content can be seen as a whole, and i hope and think that this
will not change. This situation is completely different from
Windows, which had text files with appended (separated by ^Z or so)
meta information that was invisible in normal text editors already
in the nineties (or even earlier, but i don't know).

I.e., this is why we do have this messy text OR binary file I/O
distinction like O_BINARY (for open(2)), "b" (for fopen(3)) or
binmode (perl(1)): without those, reading a text file will see
End-Of-File at the ^Z, not at the real end of the file. (Which
raises the immediate question why the Microsoft programmers did not
embed the meta information in this section at the end of the file.
But i don't really want to know.)
Anyway. On Unix a UTF-8 file *will* show the BOM, because it is
file content. I.e.:

  |?0%0[tmp]$ hexdump -C text
  |00000000 ef bb bf 49 20 64 6f 6e 27 74 20 77 61 6e 74 20 |...I don't want |
  |00000010 74 6f 20 73 65 65 20 79 6f 75 2c 20 65 76 65 72 |to see you, ever|
  |00000020 21 0a 53 68 65 20 70 75 74 20 6f 6e 20 68 65 72 |!.She put on her|
  |00000030 20 63 6f 61 74 20 61 6e 64 20 6c 65 66 74 2e 0a | coat and left..|
  |00000040

is shown (because even bad English is displayed) as

  |?0%0[tmp]$ v text
  |<U+FEFF>I don't want to see you, ever!
  |She put on her coat and left.

in a UTF-8 locale and

  |?0%0[tmp]$ LESSCHARSET=ascii v text
  |<EF><BB><BF>I don't want to see you, ever!
  |She put on her coat and left.

otherwise. And i like that, because it is the truth. But it of
course implies that it will show up exactly like this wherever the
signature occurs.
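
The same observation can be reproduced programmatically. What follows
is only a minimal sketch, assuming Python 3 and its standard codecs;
the function names are made up for illustration, and DATA is simply
the 64 bytes from the hexdump above. A plain "utf-8" decode keeps the
signature as file content (a leading U+FEFF), whereas the separate
"utf-8-sig" codec is one way a reader can choose to drop it.

```python
import codecs

# The 64 bytes from the hexdump above: the UTF-8 signature (EF BB BF)
# followed by the two lines of text.
DATA = (codecs.BOM_UTF8
        + b"I don't want to see you, ever!\n"
        + b"She put on her coat and left.\n")


def has_utf8_signature(raw: bytes) -> bool:
    """Return True if the byte string starts with EF BB BF."""
    return raw.startswith(codecs.BOM_UTF8)


def decode_keeping_signature(raw: bytes) -> str:
    """Plain UTF-8 decode: the signature survives as U+FEFF."""
    return raw.decode("utf-8")


def decode_dropping_signature(raw: bytes) -> str:
    """Decode with 'utf-8-sig', which skips one leading signature."""
    return raw.decode("utf-8-sig")


if __name__ == "__main__":
    print(has_utf8_signature(DATA))
    print(repr(decode_keeping_signature(DATA)[:8]))
    print(repr(decode_dropping_signature(DATA)[:8]))
```

Whether a reader should drop the signature is exactly the open
question; the sketch only shows that the bytes are there unless a
program decides to hide them.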

 |> No, the real issue is that the programmers are duds.
 |> Or they were unsure about it all...
 |> Anyway, i've told them they were duds, and as i didn't get any
 |> response sofar, i was right.
 |
 |As a programmer, I can attest that we are no more receptive to being
 |called "duds" than any other professionals. Constructive suggestions
 |focused on the end product, instead of the competence of the person,
 |might get a response.

So i apologize again. I want to state however that the company
in question is heavily automated and full of robots. People
have to face Modern Times. At least in manufacturing. (Why
do i own one of their bicycles? Because people get jobs there, which
they would not have otherwise, *there*. But real craftsmanship
products, like those from http://www.manufactum.de, or old Rolls
Royce or whatever, are of course preferable.)
So do the programmers have to face the same conditions? I don't really
think so. They prefer driving plain text readers up the wall.
Successfully.

 |--
 |Doug Ewell | Thornton, Colorado, USA
 |http://www.ewellic.org | @DougEwell

  Steven
Received on Fri Jul 27 2012 - 16:23:52 CDT