Re: Subject: Re: 32'nd bit & UTF-8

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Mon Jan 24 2005 - 02:41:50 CST

Next message: Michael \(michka\) Kaplan: "Re: Subject: Re: 32'nd bit & UTF-8"

Previous message: Lars Kristan: "RE: I Heart Huckabees"
Maybe in reply to: Arcane Jill: "Subject: Re: 32'nd bit & UTF-8"
Next in thread: Martin Duerst: "RE: Subject: Re: 32'nd bit & UTF-8"

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Marcin 'Qrczak' Kowalczyk
> Sent: 21 January 2005 22:49
> To: unicode@unicode.org
> Subject: Re: Subject: Re: 32'nd bit & UTF-8
>
>
> Let's assume that I design a programming language, specify that its
> source files should be encoded in UTF-8, don't mention anything about
> BOM, implement a compiler ...

It just so happens that I'm designing a programming language. (A toy, at
present - purely an intellectual exercise. But you never know. One day...?).
But I am, at least, implementing a compiler for it.

But why on Earth would I specify its encoding? Why on Earth would /anyone/
specify an encoding in a modern (indeed, future) programming language? My
language is specified in CHARACTERS.

My in-development parser goes through a decoding phase. The decoding phase
translates the bytes from the source code file into /characters/. These
characters are then fed into the lexing phase.
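
A minimal sketch of that two-phase split, in Python purely for illustration (the
post does not say what the compiler is written in). The names decode_phase and
lex_phase, the token labels, and the toy token rules are all hypothetical; the
only point is that the lexer receives characters and never sees bytes.

```python
from typing import Iterator, Tuple


def decode_phase(raw: bytes, encoding: str) -> str:
    """Turn the bytes of a source file into characters (hypothetical helper)."""
    return raw.decode(encoding)


def lex_phase(chars: str) -> Iterator[Tuple[str, str]]:
    """Toy lexer over characters: identifiers and single-character symbols."""
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isspace():
            i += 1
        elif c.isalpha() or c == "_":
            # Identifiers may contain any Unicode letter or digit.
            j = i + 1
            while j < len(chars) and (chars[j].isalnum() or chars[j] == "_"):
                j += 1
            yield ("IDENT", chars[i:j])
            i = j
        else:
            # Anything else is treated as a one-character symbol.
            yield ("SYMBOL", c)
            i += 1
```

Feeding the same text through decode_phase from UTF-8 bytes or from UTF-16
bytes should give lex_phase identical input, so the token stream does not
depend on how the file was stored.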

Currently the decoding phase can auto-detect all UTF encodings correctly,
either with or without a BOM. The lexing phase doesn't care. It's all
characters by then. Auto-detection is guaranteed to be possible (and correct)
in my case, because the grammar dictates that the very first thing in the
source file MUST be a comment. This guarantees that the first two characters of the
file WILL be non-NUL ASCII (possibly preceded by a BOM), and given this
restriction, auto-detection of UTFs is accurate given only the first four bytes
of the source file. The grammar for the rest of the file permits (for example)
non-ASCII operators, identifiers, etc.
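
The four-byte check described above can be sketched roughly as follows. This is
an illustration in Python, not the code from the compiler discussed here; the
function name detect_utf and the choice to return Python codec names are
assumptions. It relies on the stated restriction that the first two characters
are non-NUL ASCII, so the position of the zero bytes identifies the encoding
form when no BOM is present.

```python
import codecs

# Zero-byte layout of the first four bytes when the file has no BOM and
# starts with two non-NUL ASCII characters (True means the byte is 0x00).
_NO_BOM_PATTERNS = {
    (True, True, True, False): "utf-32-be",   # 00 00 00 xx
    (False, True, True, True): "utf-32-le",   # xx 00 00 00
    (True, False, True, False): "utf-16-be",  # 00 xx 00 yy
    (False, True, False, True): "utf-16-le",  # xx 00 yy 00
}


def detect_utf(head: bytes) -> str:
    """Guess a Python codec name from the first (up to) four bytes of a file."""
    head = head[:4]
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Test UTF-32 BOMs first: the UTF-32LE BOM (FF FE 00 00) begins with
    # the UTF-16LE BOM (FF FE).
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    zeros = tuple(b == 0 for b in head)
    if zeros in _NO_BOM_PATTERNS:
        return _NO_BOM_PATTERNS[zeros]
    # Two non-zero leading bytes: two ASCII characters stored as UTF-8.
    if len(head) >= 2 and not zeros[0] and not zeros[1]:
        return "utf-8"
    raise ValueError("first bytes do not look like a UTF-encoded comment")
```

With data holding the raw bytes of the file, data.decode(detect_utf(data[:4]))
would then yield the characters. The "utf-8-sig", "utf-16" and "utf-32" codecs
consume the BOM themselves, which is why those names are returned in the BOM
cases.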

In the future, it should be possible to permit other (non-auto-detectable)
encodings, specified in some out-of-band (OOB) way, because, as I said,
the lexing phase doesn't care.

Of course, this is all for my own amusement. I've been writing this for a month
or so now. I didn't bother with lex, flex, yacc, bison, etc., because they
weren't sufficiently Unicode-aware.

In other words, my response to your comment above: "Let's assume that I design
a programming language, specify that its source files should be encoded in
UTF-8..." is this. Let's not. It's silly.

Jill

This archive was generated by hypermail 2.1.5 : Mon Jan 24 2005 - 11:38:03 CST