Re: Is the binaryness/textness of a data format a property? from Adam Borowski via Unicode on 2020-03-20 (Unicode Mail List Archive)

From: Adam Borowski via Unicode <unicode_at_unicode.org>
Date: Fri, 20 Mar 2020 15:41:23 +0100

On Fri, Mar 20, 2020 at 07:22:45AM -0700, J Decker via Unicode wrote:
> On Fri, Mar 20, 2020 at 5:48 AM Adam Borowski via Unicode <
> > For example, most Unix-heads will tell you that UTF16LE is a binary rather
> > than text format. Microsoft employees and some members of this list will
> > disagree.
[...]
> > If you want _my_ definition of a file being _technically_ text, it's:
> > * no bytes 0..31 other than newlines and tabs (even form feeds are out
> > nowadays)
> > * correctly encoded for the expected charset (and nowadays, if that's not
> > UTF-8 Unicode, you're doing it wrong)
> > * no invalid characters
>
> Just a minor note...
> In the case of UTF8, this means no bytes 0xF8-0xFF will ever be used; every
> valid utf8 codeunit has at least 1 bit off.

Yeah, but I allowed for ancient encodings, some of which do use these bytes.
(I do discriminate against UTF16 and shift-state ones, they're too broken.)
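
For the record, the three checks from my definition can be sketched in
Python roughly like this. The function name and parameters are made up for
illustration, and "no invalid characters" is read here as "no Unicode
noncharacters", which is only one possible interpretation. Whether \r counts
as part of a newline is left to the allowed_controls parameter.

```python
def _is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_technically_text(data: bytes, charset: str = "utf-8",
                        allowed_controls: bytes = b"\t\n") -> bool:
    """Return True if data passes the three checks (hypothetical helper)."""
    # Check 1: no bytes 0..31 other than the allowed ones (tab and newline).
    if any(b < 0x20 and b not in allowed_controls for b in data):
        return False
    # Check 2: the bytes must decode cleanly in the expected charset.
    try:
        text = data.decode(charset)
    except UnicodeDecodeError:
        return False
    # Check 3 (assumed meaning): no Unicode noncharacters.
    return not any(_is_noncharacter(ord(ch)) for ch in text)
```

With the defaults, UTF-16LE text fails on its NUL bytes, a form feed fails
check 1, and a legacy charset such as Latin-1 may still use bytes 0xF8-0xFF
as long as it is the one you asked for.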

Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF or
U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has its uses
but is not well-formed Unicode.
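
To make that concrete, here is a minimal sketch of the original 1-to-6 byte
UTF-8 bit layout, which reaches 0x7FFFFFFF. The function name is
hypothetical, and the longer 2³⁶/2⁴² extensions are not covered. A strict
decoder rejects such sequences, which is the point.

```python
def encode_extended_utf8(cp: int) -> bytes:
    """Encode cp with the original 1-to-6 byte UTF-8 scheme (up to 31 bits)."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("value out of range for 6-byte UTF-8")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, lead-byte prefix, number of continuation bytes)
    for limit, prefix, ncont in (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ):
        if cp < limit:
            lead = prefix | (cp >> (6 * ncont))
            # Continuation bytes carry 6 bits each, most significant first.
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F)
                    for i in range(ncont - 1, -1, -1)]
            return bytes([lead] + tail)
    raise AssertionError("unreachable")


def is_well_formed_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For instance, encode_extended_utf8(0xD800) and encode_extended_utf8(0x110000)
both produce byte sequences that follow the UTF-8 bit pattern, yet
is_well_formed_utf8() returns False for each of them.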

> I wouldn't be so picky about 'no bytes 0-31' because \t, \n, \x1b(ANSI
> codes) are all quite usable...

\t is tab, \n a newline (blah blah blah \r).

As for \e (\x1b), that's higher-level markup. I do use it -- hey, you can
"apt/dnf install colorized-logs" for my tools -- but that's beyond plain
text.
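
If you want the plain text back out of such marked-up output, a rough sketch
is to drop the CSI sequences (ESC [ ... final byte). This is not how my tools
do it; the names are made up, and it deliberately ignores other escape
sequences such as OSC.

```python
import re

# CSI: ESC '[', parameter bytes 0x30-0x3F, intermediate bytes 0x20-0x2F,
# then one final byte 0x40-0x7E.
CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_csi(text: str) -> str:
    """Remove CSI escape sequences (colours, cursor movement) from text."""
    return CSI_RE.sub("", text)
```

For example, strip_csi("\x1b[1;31merror\x1b[0m: boom") leaves just
"error: boom".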

喵!

-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
⢿⡄⠘⠷⠚⠋⠀                                       -- <willmore> on #linux-sunxi
⠈⠳⣄⠀⠀⠀⠀

Received on Fri Mar 20 2020 - 09:41:39 CDT
