Re: Is the binaryness/textness of a data format a property?

From: Richard Wordingham via Unicode <unicode_at_unicode.org>
Date: Fri, 20 Mar 2020 14:49:24 +0000

On Fri, 20 Mar 2020 13:46:25 +0100
Adam Borowski via Unicode <unicode_at_unicode.org> wrote:

> On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via
> Unicode wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> > something.
> >
> > JPEG is a binary data format.
> > CSV is a text data format.
> >
> > Question #1: Is the binaryness/textness of a data format a
> > property?
> >
> > Question #2: If the answer to Question #1 is yes, then what is the
> > name of this binaryness/textness property?

I'd suggest 'texthood' as the correct English term.

> I'm afraid this question is too fuzzy to have a proper answer.
>
> For example, most Unix-heads will tell you that UTF16LE is a binary
> rather than text format. Microsoft employees and some members of
> this list will disagree.

Some files change type on changing operating system. Digital's old RMS
formats included, as basic text files, ones in which each record
(roughly a line) started with a binary 2-byte length field. Text
records on magnetic tape typically started with an ASCII length count!
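
To make the two layouts concrete, here is a rough Python sketch of
reading each kind of record. The details are assumptions for
illustration only and are not stated above: a little-endian 16-bit
count, padding of odd-length records to an even boundary, and a
4-digit decimal ASCII count that includes its own four bytes. The
function names are hypothetical.

```python
import struct


def read_binary_prefixed(data: bytes) -> list[bytes]:
    """Split data into records that each start with a 2-byte binary length.

    Assumed layout: little-endian 16-bit count of data bytes, then the
    data, then one pad byte if the count is odd.
    """
    records = []
    pos = 0
    while pos + 2 <= len(data):
        (length,) = struct.unpack_from("<H", data, pos)
        pos += 2
        records.append(data[pos:pos + length])
        pos += length + (length & 1)  # skip the assumed pad byte
    return records


def read_ascii_prefixed(data: bytes) -> list[bytes]:
    """Split data into records that each start with a 4-digit ASCII length.

    Assumed layout: the decimal count covers the whole record,
    including the four digit bytes themselves.
    """
    records = []
    pos = 0
    while pos + 4 <= len(data):
        total = int(data[pos:pos + 4].decode("ascii"))
        if total < 4:
            raise ValueError(f"implausible record length {total} at offset {pos}")
        records.append(data[pos + 4:pos + total])
        pos += total
    return records
```

Under these assumptions the same two lines of text are stored once with
control bytes in every record header and once as nothing but printable
ASCII, which is the point: the "textness" depends on the record
layout, not on the content.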

> Then you have Postscript -- nothing but basic ASCII, yet utterly
> unreadable for a (sane) human.

No worse than a hex dump - in fact, a lot more readable. Indeed, are
you not aware of the concept of a write-only programming language?

> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
> nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's
> not UTF-8 Unicode, you're doing it wrong)
> * no invalid characters

Unassigned characters are perfectly reasonable in a text file. Surely
you aren't saying that a text file using the characters new to Unicode
13.0 should, at present, usually be regarded as a binary file?
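
For what it's worth, the three quoted rules can be sketched in Python
roughly as follows. This is only one reading of them: taking "invalid
characters" to mean Unicode noncharacters, letting CR count as part of
a newline, and the allow_unassigned switch are all my assumptions, and
the function name is hypothetical.

```python
import unicodedata

# Tab, LF and CR; treating CR as part of a newline is an assumption.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def is_technically_text(data: bytes, allow_unassigned: bool = True) -> bool:
    """Apply the three quoted rules to a byte string, assuming UTF-8."""
    # Rule 2: correctly encoded for the expected charset (UTF-8 here).
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False

    for ch in text:
        cp = ord(ch)
        # Rule 1: no bytes 0..31 other than newlines and tabs.
        if cp < 32 and cp not in ALLOWED_CONTROLS:
            return False
        # Rule 3, read narrowly: reject noncharacters
        # (U+FDD0..U+FDEF and the last two code points of each plane).
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return False
        # Rule 3, read broadly: also reject code points that the local
        # Unicode database does not know (general category Cn).
        if not allow_unassigned and unicodedata.category(ch) == "Cn":
            return False
    return True
```

With allow_unassigned=False the answer depends on which Unicode version
the running Python's unicodedata module was built against, so the same
file could flip from "binary" to "text" on an interpreter upgrade.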

> But besides this narrow technical meaning -- is a Word document
> "text"? And if it is, why not Powerpoint? This all falls apart.

Well, a .docx file isn't text - it's a variety of ZIP file, which is
binary. Indeed, as Word files naturally include pictures, it very much
isn't a text file. A .doc file is more like an image dump of a file
system. A .rtf file, on the other hand, probably is a text file -
though I've a feeling there are variants that aren't *A*SCII.
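
The distinction shows up in the first few bytes of each file. A minimal
Python sketch, assuming the usual signatures (ZIP local file header for
.docx, the OLE2 compound file header for .doc, and the "{\rtf" opener
for RTF); the function name and return labels are hypothetical:

```python
def sniff_container(head: bytes) -> str:
    """Guess the container type from the leading bytes of a file."""
    if head.startswith(b"PK\x03\x04"):
        return "zip"    # ZIP local file header, as used by .docx
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "ole2"   # compound file, as used by legacy .doc
    if head.startswith(b"{\\rtf"):
        return "rtf"    # RTF starts with printable ASCII
    return "unknown"
```

Only the RTF signature is itself printable text; the other two fail a
"no control bytes" test within the first eight bytes.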

Richard.
Received on Fri Mar 20 2020 - 09:49:50 CDT