From: Arcane Jill (firstname.lastname@example.org)
Date: Mon Jan 24 2005 - 02:41:50 CST
> -----Original Message-----
> From: email@example.com [mailto:firstname.lastname@example.org]On
> Behalf Of Marcin 'Qrczak' Kowalczyk
> Sent: 21 January 2005 22:49
> To: email@example.com
> Subject: Re: Subject: Re: 32'nd bit & UTF-8
> Let's assume that I design a programming language, specify that its
> source files should be encoded in UTF-8, don't mention anything about
> BOM, implement a compiler ...
It just so happens that I'm designing a programming language. (A toy, at
present - purely an intellectual exercise. But you never know. One day...?).
But I am, at least, implementing a compiler for it.
But why on Earth would I specify its encoding? Why on Earth would /anyone/
specify an encoding in a modern (indeed, future) programming language? My
language is specified in CHARACTERS.
My in-development parser goes through a decoding phase. The decoding phase
translates the bytes from the source code file into /characters/. These
characters are then fed into the lexing phase.
Currently the decoding phase can auto-detect all UTF encodings correctly,
either with or without a BOM. The lexing phase doesn't care. It's all
characters by then. Auto-detection is guaranteed to be possible (and correct)
in my case, because the grammar dictates that the very first thing in the
source file MUST be a comment. This guarantees that the first two characters of
the file WILL be non-NUL ASCII (possibly preceded by a BOM), and given this
restriction, auto-detection of UTFs is accurate given only the first four bytes
of the source file. The grammar for the rest of the file permits (for example)
non-ASCII operators, identifiers, etc.
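
A minimal sketch of that four-byte detection, in Python. This is not my actual
decoder: the function name detect_utf and the use of "//" as a comment opener
in the usage lines are assumptions made for illustration. The only thing it
relies on is the rule stated above, that the first two characters are non-NUL
ASCII, so the pattern of zero bytes identifies the encoding form.

```python
import codecs
from typing import Tuple

def detect_utf(head: bytes) -> Tuple[str, int]:
    """Return (codec name, BOM length) from the first four bytes of a file.

    Assumes the first two characters after any BOM are non-NUL ASCII.
    """
    # UTF-32 BOMs are checked before UTF-16 because FF FE 00 00 also starts
    # with the UTF-16LE BOM. It cannot be UTF-16LE here, since that would make
    # the first character NUL, which the grammar rules out.
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if head.startswith(bom):
            return name, len(bom)

    # No BOM: look at which of the first four bytes are zero.
    nul = [b == 0 for b in head[:4]]
    if nul == [False, True, True, True]:
        return "utf-32-le", 0
    if nul == [True, True, True, False]:
        return "utf-32-be", 0
    if nul == [False, True, False, True]:
        return "utf-16-le", 0
    if nul == [True, False, True, False]:
        return "utf-16-be", 0
    if nul[:2] == [False, False]:
        return "utf-8", 0
    raise ValueError("file does not start with two non-NUL ASCII characters")

# Usage, with a hypothetical "//" comment opener:
print(detect_utf("// hello".encode("utf-16-be")[:4]))
print(detect_utf(codecs.BOM_UTF8 + b"/"))
```

Without the leading-comment rule this would be guesswork: 2F 00 2F 00 could
just as well be UTF-8 text containing NUL characters.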
In the future, it should be possible to permit other (non-auto-detectable)
encodings, specified in some out-of-band (OOB) way, because, as I said, the
lexing phase doesn't care.
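
To illustrate why the lexer doesn't care, here is a rough Python sketch of the
split between the two phases. Everything in it is hypothetical: the names
decode_source and lex, the `declared` parameter standing in for an out-of-band
declaration, and the whitespace-splitting toy lexer. The fallback to
"utf-8-sig" is only a placeholder for real auto-detection.

```python
from typing import Iterable, Iterator, List, Optional

def decode_source(data: bytes, declared: Optional[str] = None) -> Iterator[str]:
    """Decoding phase: bytes in, characters out."""
    # An out-of-band declaration wins when one is given. Otherwise fall back
    # to UTF-8 with an optional BOM, as a stand-in for auto-detection.
    encoding = declared if declared is not None else "utf-8-sig"
    yield from data.decode(encoding)

def lex(chars: Iterable[str]) -> List[str]:
    """Toy lexing phase: sees only characters, never bytes."""
    tokens: List[str] = []
    current = ""
    for ch in chars:
        if ch.isspace():
            if current:
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

# The same source text in two encodings yields the same tokens.
source = "caf\u00e9 := 1"
from_utf8 = lex(decode_source(source.encode("utf-8")))
from_latin1 = lex(decode_source(source.encode("iso-8859-1"), declared="iso-8859-1"))
print(from_utf8 == from_latin1)
```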
Of course, this is all for my own amusement. I've been writing this for a month
or so now. I didn't bother with lex, flex, yacc, bison, etc., because they
weren't sufficiently Unicode-aware.
In other words, my response to your comment above: "Let's assume that I design
a programming language, specify that its source files should be encoded in
UTF-8..." is this. Let's not. It's silly.