Re: ASCII control codes in sequences of multibyte character sets

From: Steffen <sdaoden_at_gmail.com>
Date: Mon, 02 Sep 2013 13:40:02 +0200

Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
 |There's also the legacy VISCII 8-bit encoding standard (for Vietnamese)
 |that uses some positions of ASCII control characters (though not CR, LF,
 |and TAB) for some recombined letters with diacritics.
 |
 |Note that EBCDIC code pages are easily mappable using a standard
 |permutation of codes, to be compatible with ISO 646 and MIME. For example
 |Punched cards have almost disappeared everywhere today

… good to know ;
I'll archive this.

--steffen

attached mail follows:


There's also the legacy VISCII 8-bit encoding standard (for Vietnamese),
which reuses some positions of ASCII control characters (though not CR, LF,
or TAB) for precomposed letters with diacritics.

Note that EBCDIC code pages are easily mappable, using a standard
permutation of codes, to be compatible with ISO 646 and MIME. For example,
there's a bijective permutation of ISO-8859-1 (Latin-1) to an
EBCDIC-compatible encoding. This permutation is in fact possible for most
EBCDIC code pages, and all ISO 8859-* encodings have an equivalent EBCDIC
version using such a simple permutation of codes.
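
As a rough illustration of the permutation described above, here is a small
Python sketch. It uses the `cp037` codec from Python's standard library as
one example of an EBCDIC code page; picking code page 037 is an assumption
made for this sketch, not something named in this message, and the helper
names are hypothetical. It checks that the 256 byte values decode to exactly
the 256 Latin-1 code points, and prints where LF and NEL end up.

```python
# Sketch: treat Python's built-in cp037 codec as an example EBCDIC code page.
EBCDIC = "cp037"  # assumption: chosen only for illustration

all_bytes = bytes(range(256))
decoded = all_bytes.decode(EBCDIC)


def is_latin1_permutation(text):
    """True if text contains each of U+0000..U+00FF exactly once."""
    return sorted(text) == [chr(i) for i in range(256)]


def latin1_to_ebcdic(data):
    """Re-encode Latin-1 bytes as EBCDIC bytes: one byte in, one byte out."""
    return data.decode("latin-1").encode(EBCDIC)


def ebcdic_to_latin1(data):
    """Inverse of latin1_to_ebcdic."""
    return data.decode(EBCDIC).encode("latin-1")


if __name__ == "__main__":
    print("bijective:", is_latin1_permutation(decoded))
    print("LF  ->", latin1_to_ebcdic(b"\n").hex())
    print("NEL ->", latin1_to_ebcdic(b"\x85").hex())
```

Because the mapping is a pure permutation, converting in one direction and
back again returns the original bytes unchanged.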

Notes:

There are also variants in how newlines are converted into sequences of
controls: the roles of LF and NEL may be swapped in the EBCDIC permutation.
NEL is the preferred newline for EBCDIC, which initially had neither CR nor
LF; LF is the preferred newline in POSIX. In C there is a standard way to
specify the correct one: use '\n', don't use the numeric value 10. The
standard I/O library will translate it to the appropriate sequence for the
storage stream (CR, or LF, or CR+LF, or NEL, according to the target
compilation platform). These libraries give you control over whether such
translation of newlines will occur, by making the distinction between text
files, which are translated, and binary files, which are not.

Text files may also have padding characters in their storage, invisible at
the higher application level when reading characters from the stream, or
added when writing newlines. This was used with punched cards of fixed
width, where all positions on the same card after the newline were
considered padding, and where DEL was used to erase manual punching errors.
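
The newline translation described above can be sketched in Python, whose
text streams behave much like C text-mode streams in this respect. This is
only an analogy: `stored_bytes` is a hypothetical helper written for this
example, and Python's `newline` parameter covers CR, LF and CR+LF but not
NEL.

```python
import io


def stored_bytes(text, newline):
    """Write text through a text layer and return the bytes actually stored.

    newline is the sequence the text layer substitutes for each "\n"
    written by the application ("\n" itself means no translation).
    """
    raw = io.BytesIO()
    layer = io.TextIOWrapper(raw, encoding="latin-1", newline=newline)
    layer.write(text)
    layer.flush()
    data = raw.getvalue()
    layer.detach()  # leave the underlying buffer open
    return data


if __name__ == "__main__":
    # The application always writes "\n"; only the stored form differs.
    for nl in ("\n", "\r", "\r\n"):
        print(repr(nl), stored_bytes("one\ntwo\n", nl))
```

The application code is identical in all three cases, which is the point of
writing '\n' rather than a hard-coded numeric value.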

Punched cards have almost disappeared everywhere today. I personally saw
that they were still used in 1989 by the Turkish army for transmissions
with NATO computing systems, because of interoperability issues with other
storage media, lack of trust in digital networks or modem communications,
and lack of reliability of magnetic tapes. I have absolutely no idea whether
they are still using them; digital networks with military-grade encryption
are usable from everywhere, are much faster, and can also be carried over
the normal Internet via secure VPNs.

But padding of file records with fixed length is still widely used today
(most often such padding now uses the NUL control in binary formats, or the
SPACE character in database fields).
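
A minimal Python sketch of the two padding conventions mentioned above. The
record length of 8 and the function names are assumptions chosen for this
example only.

```python
import struct

RECORD_LEN = 8  # hypothetical fixed field width


def pad_binary(field):
    """Pad a bytes field to RECORD_LEN with NUL bytes (binary formats)."""
    # The "Ns" struct format pads short values with NULs and truncates
    # values longer than N bytes.
    return struct.pack(f"{RECORD_LEN}s", field)


def pad_text(field):
    """Pad a text field to RECORD_LEN with spaces (database-style)."""
    return field.ljust(RECORD_LEN)


def unpad_binary(record):
    """Strip trailing NUL padding when reading the record back."""
    return record.rstrip(b"\x00")


def unpad_text(record):
    """Strip trailing SPACE padding when reading the record back."""
    return record.rstrip(" ")


if __name__ == "__main__":
    print(pad_binary(b"abc"))
    print(repr(pad_text("abc")))
```

Note that stripping the padding on read cannot tell padding apart from data
that genuinely ends in NUL or SPACE, which is a known limitation of this
kind of fixed-length record.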

2013/8/31 Doug Ewell <doug_at_ewellic.org>

> I assume you're not considering SCSU, which uses bytes in the ASCII
> control range (but not 0x0A or 0x0D) to control the encoding parameters
> in single-byte mode, and which can be switched into a so-called "Unicode
> mode" in which any byte value may appear.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
>
>
> -------- Original Message --------
> Subject: ASCII control codes in sequences of multibyte character sets
> From: Steffen "Daode" Nurpmeso <sdaoden_at_gmail.com>
> Date: Fri, August 30, 2013 2:23 pm
> To: unicode_at_unicode.org
>
> Hello character plus experts,
> I'm wondering whether there are any multibyte character sets known
> which use the numerical values of ASCII control characters that
> are vital to Unix/POSIX (plus) as part of multibyte sequences?
> In particular U+000A and U+000D?
>
>
>
>
Received on Mon Sep 02 2013 - 06:42:22 CDT