SECS & VSECS: Small European Character Sets

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Fri Aug 14 1998 - 15:19:34 EDT


Small European Character Sets
-----------------------------

I have recently spent quite some time working out a proposal for two
Unicode/ISO 10646 subsets that are so small that I hope they will become
widely implemented in Europe and America. Both are specifically designed
to be suitable for systems where characters are represented in
low-resolution fixed-width fonts. This includes, for instance, your xterm
and Emacs windows under Unix (or, more generally, VT100 emulators and
source code editors), but also applications such as portable LCD devices
(pagers, mobile phones), where it makes sense to implement only a small
subset of Unicode and where no single 8-bit set can cover a reasonable
number of languages. These subsets are not really intended for
applications such as the publishing industry, where these display
restrictions do not exist and larger Unicode subsets or even full
implementations might be adequate.

The two subsets are:

 - Very Simple European Character Set (VSECS)
   345 characters, basically a superset of Latin 1-4,9,10,15 and CP1252
   plus a few ISO 6937 characters

   Rows Positions (Cells)
   00 20-7E A0-FF
   01 00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
   02 C6-C7 D8-DD
   20 13-15 18-1A 1C-1E 20-22 26 30 39-3A AC
   21 22 26 5B-5E 90-93
   26 6A
   FF FD

 - Simple European Character Set (SECS)
   683 characters; in addition to VSECS it covers Cyrillic, Greek,
   MS-DOS block graphics, and a moderate set of mathematical characters
   that are likely to be used in academic email and source code comments.

   Rows Positions (Cells)
   00 20-7E A0-FF
   01 00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
   02 BC-BD C6-C7 D8-DD
   03 84-86 88-8A 8C 8E-A1 A3-CE D1 D5-D6 F1
   04 01-0C 0E-4F 51-5C 5E-5F 90-91
   20 13-15 17-1A 1C-1E 20-22 26 30 32-34 39-3A 70 7F-83 A7 AC
   21 02 15-16 1A 1D 22 24 26 5B-5E 90-95 A4-A7 D0-D5
   22 00-09 0B-0C 12-13 18-1A 1D-1E 24-2A 3C 43 45 48-49 58 5F-62 64-65
   22 6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
   23 00 08-0B 10 15 20-21 29-2A
   25 00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 B2
   25 BA BC C4 CB
   26 10-12 3A-3C 40 42 6A-6B 6D-6F
   27 13 17
   FF FD
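
The row/cell notation used in the two tables above can be expanded
mechanically. The following is a minimal Python sketch, not part of the
original tools: the function name parse_collection and the variable
names are made up for illustration. It assumes that each line gives a
row (the high byte of the code point, in hex) followed by single cells
or inclusive cell ranges (the low byte, in hex).

```python
def parse_collection(table):
    """Expand 'row cell[-cell] ...' lines into a set of code points."""
    code_points = set()
    for line in table.strip().splitlines():
        row, *cells = line.split()
        base = int(row, 16) << 8      # the row is the high byte
        for cell in cells:
            first, _, last = cell.partition("-")
            lo = int(first, 16)
            hi = int(last, 16) if last else lo   # single cell or range
            code_points.update(range(base + lo, base + hi + 1))
    return code_points


VSECS = parse_collection("""
00 20-7E A0-FF
01 00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
02 C6-C7 D8-DD
20 13-15 18-1A 1C-1E 20-22 26 30 39-3A AC
21 22 26 5B-5E 90-93
26 6A
FF FD
""")

SECS = parse_collection("""
00 20-7E A0-FF
01 00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
02 BC-BD C6-C7 D8-DD
03 84-86 88-8A 8C 8E-A1 A3-CE D1 D5-D6 F1
04 01-0C 0E-4F 51-5C 5E-5F 90-91
20 13-15 17-1A 1C-1E 20-22 26 30 32-34 39-3A 70 7F-83 A7 AC
21 02 15-16 1A 1D 22 24 26 5B-5E 90-95 A4-A7 D0-D5
22 00-09 0B-0C 12-13 18-1A 1D-1E 24-2A 3C 43 45 48-49 58 5F-62 64-65
22 6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
23 00 08-0B 10 15 20-21 29-2A
25 00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 B2
25 BA BC C4 CB
26 10-12 3A-3C 40 42 6A-6B 6D-6F
27 13 17
FF FD
""")

# Sizes of both sets, and whether VSECS is contained in SECS.
print(len(VSECS), len(SECS), VSECS <= SECS)   # 345 683 True
```

Expanded this way, the tables give 345 and 683 code points, which
matches the counts stated for VSECS and SECS, and every VSECS character
is also in SECS.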

VSECS is somewhat similar to ISO 6937 with some bugs fixed (e.g., the
Euro symbol is included, as are the directed quotation marks).

SECS is somewhat similar to Microsoft/Adobe WGL4. I think SECS is much
better than WGL4, because WGL4 contains many letters whose use I could
not trace (for at least three of them I am sure that they never
existed). SECS contains the following 91 characters that are not
part of WGL4:

  Rows Positions (Cells)
  02 BC-BD
  03 D1 D5-D6 F1
  20 34 70 80-83
  21 02 15 1A 1D 24 A4-A7 D0-D5
  22 00-01 03-05 07-09 0B-0C 13 18 1D 24-28 2A 3C 43 45 49 58 5F 62
  22 6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
  23 00 08-0B 15 29-2A
  26 10-12 6D-6F
  27 13 17
  FF FD

Almost all of these belong to a set of basic mathematical characters that
most high school students should be familiar with. They are very useful
to have available in academic email discussions and source code comments.
It would be nice if the authors of WGL4 seriously considered extending
their Unicode subset by those few dozen elementary math symbols. Then
SECS would become a subset of WGL4. VSECS is already a subset of WGL4
except for U+FFFD.
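
To see what the 91 characters missing from WGL4 actually are, one can
expand the list above and look each code point up in a Unicode character
database. The sketch below uses Python's standard unicodedata module
for that; the names EXTRA, expand, extra and tally are illustrative
only. The names and general categories it prints depend on the Unicode
version shipped with the Python interpreter, so no output is shown.

```python
import unicodedata
from collections import Counter

# The "Rows Positions (Cells)" list of SECS characters not in WGL4.
EXTRA = """
02 BC-BD
03 D1 D5-D6 F1
20 34 70 80-83
21 02 15 1A 1D 24 A4-A7 D0-D5
22 00-01 03-05 07-09 0B-0C 13 18 1D 24-28 2A 3C 43 45 49 58 5F 62
22 6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
23 00 08-0B 15 29-2A
26 10-12 6D-6F
27 13 17
FF FD
"""


def expand(table):
    """Yield every code point named by 'row cell[-cell] ...' lines."""
    for line in table.strip().splitlines():
        row, *cells = line.split()
        for cell in cells:
            first, _, last = cell.partition("-")
            lo = int(first, 16)
            hi = int(last, 16) if last else lo
            for low_byte in range(lo, hi + 1):
                yield (int(row, 16) << 8) | low_byte


extra = sorted(set(expand(EXTRA)))

# One line per character: code point and Unicode character name.
for cp in extra:
    print(f"U+{cp:04X}  {unicodedata.name(chr(cp), '<no name>')}")

# Tally by general category; "Sm" is the math-symbol category.
tally = Counter(unicodedata.category(chr(cp)) for cp in extra)
print(len(extra), dict(tally))
```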

I hope the mathematical symbols of SECS will give US developers who do
not specialize in i18n issues some motivation to get interested in
16-bit character sets, as these symbols are more relevant for their
personal use than the accented characters of crazy Europeans.

My dream is that something like SECS soon becomes the common
minimum repertoire in Unix X11 fonts and printer fonts. VSECS is
intended as an intermediate step for applications where the size of the
character set is critical and only Latin script support is required.

I do not think SECS contains any useless symbol. I know for each letter
and symbol why it is in there and in which languages or fields it is
used. Just ask.

Much more information on the two sets is available from

  http://www.cl.cam.ac.uk/~mgk25/ucs/vsecs.html
  http://www.cl.cam.ac.uk/~mgk25/ucs/secs.html

Even better than just looking at these web pages is to download the
database (Perl needed) that generated them from

  http://www.cl.cam.ac.uk/~mgk25/ucs/secs.tar.gz

Then you can play around with them and test the subset properties with
regard to other sets easily yourself.

If you want to see example glyphs in the HTML output of this script,
then you'll also need

  http://www.cl.cam.ac.uk/~mgk25/ucs/glyphs.zip

The uniset Perl script allows you to comfortably build up your own
database of character collections, to merge and subtract them and to
generate Unicode subsets and study their relations with other subsets.
The mapping files from the Unicode Consortium can be used directly as
input.
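
The kind of merge and subtract operations described here boil down to
ordinary set arithmetic. The following Python sketch only illustrates
the idea and does not reproduce the interface of the uniset script. It
assumes the usual layout of the Unicode Consortium mapping files (byte
value in hex, Unicode value in hex, then a "#" comment, with undefined
bytes lacking the second column). The two inline samples are tiny
hand-made excerpts in that layout, and read_mapping is a hypothetical
helper name.

```python
def read_mapping(text):
    """Collect the Unicode code points listed in a mapping table."""
    chars = set()
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()   # drop the comment part
        if len(fields) >= 2:                     # skip undefined bytes
            chars.add(int(fields[1], 16))        # second column: Unicode
    return chars


# Abbreviated samples in mapping-file layout (not the complete tables).
cp1252_sample = read_mapping("""\
0x41\t0x0041\t#LATIN CAPITAL LETTER A
0x80\t0x20AC\t#EURO SIGN
0x81\t      \t#UNDEFINED
0xE9\t0x00E9\t#LATIN SMALL LETTER E WITH ACUTE
""")

latin1_sample = read_mapping("""\
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0x80\t0x0080\t# <control>
0xE9\t0x00E9\t# LATIN SMALL LETTER E WITH ACUTE
""")

merged = cp1252_sample | latin1_sample           # merge two collections
only_cp1252 = cp1252_sample - latin1_sample      # subtract one from another
common = cp1252_sample & latin1_sample           # shared repertoire

print(sorted(f"U+{cp:04X}" for cp in only_cp1252))   # ['U+20AC']
print(common <= merged)                              # True
```

With complete mapping files read from disk instead of the inline
samples, the same three operators are enough to check whether one
repertoire is a subset of another.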

Please let me know what you think about SECS and VSECS and if this is
something you would like to see widely implemented.

Markus

-- 
Markus G. Kuhn, Security Group, Computer Lab, Cambridge University, UK
email: mkuhn at acm.org,  home page: <http://www.cl.cam.ac.uk/~mgk25/>


